This endpoint creates a bidirectional WebSocket connection for real-time speech transcription.
The STT endpoint accepts a stream of audio bytes and returns transcription results as they become available.
Usage Pattern:
Send finalize as a text message to flush any remaining audio (receives a flush_done acknowledgment).
Send done as a text message to close the session cleanly (receives a done acknowledgment, then the connection closes).
Performance Recommendation: For best performance, resample audio before streaming and send audio chunks in pcm_s16le format at a 16 kHz sample rate.
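The message flow and audio format described above can be sketched roughly as follows. This is a minimal illustration, not the official client: `to_pcm_s16le` and `run_session` are hypothetical helper names, `ws` stands for any already-connected WebSocket object that exposes blocking `send()` and `recv()` methods, the audio is assumed to be at 16 kHz already, and server messages are assumed to be JSON text carrying a `type` field (for example `flush_done` and `done`). Check the response schema before relying on that last assumption.

```python
import json
import struct
from typing import Iterable, List


def to_pcm_s16le(samples: Iterable[float]) -> bytes:
    """Encode float samples in [-1.0, 1.0] as signed 16-bit little-endian PCM."""
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    return b"".join(struct.pack("<h", int(s * 32767)) for s in clipped)


def run_session(ws, audio_chunks: Iterable[bytes]) -> List[dict]:
    """Stream audio, flush, and close; return every message received.

    `ws` is a hypothetical, already-connected WebSocket object with
    blocking send() and recv() methods.
    """
    received: List[dict] = []

    # Audio is sent as binary messages.
    for chunk in audio_chunks:
        ws.send(chunk)

    # Control commands are plain text messages; wait for each acknowledgment.
    for command, ack in (("finalize", "flush_done"), ("done", "done")):
        ws.send(command)
        while True:
            # Assumption: each server message is JSON text with a "type" field.
            message = json.loads(ws.recv())
            received.append(message)
            if message.get("type") == ack:
                break

    return received
```

For simplicity this sketch reads server messages only after all audio has been sent. A real-time client would normally read transcription results concurrently while it is still streaming audio.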
Pricing: Speech-to-text streaming is priced at 1 credit per 1 second of audio streamed in.
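As a rough illustration of the pricing rule, the credit cost of a stream can be estimated from the number of audio bytes sent. The sketch below assumes mono pcm_s16le audio (2 bytes per sample) and does not round, since the rounding behavior is not stated here; `estimate_credits` is a hypothetical helper, not part of any SDK.

```python
def estimate_credits(num_bytes: int, sample_rate: int = 16000,
                     bytes_per_sample: int = 2, channels: int = 1) -> float:
    """Estimate credits for raw PCM audio at 1 credit per second streamed."""
    # Duration in seconds = total bytes / bytes consumed per second of audio.
    seconds = num_bytes / (sample_rate * bytes_per_sample * channels)
    credits_per_second = 1.0  # from the pricing statement above
    return seconds * credits_per_second
```

For example, one minute of 16 kHz mono pcm_s16le audio is 1,920,000 bytes, so `estimate_credits(1_920_000)` gives 60.0 credits under these assumptions.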
For WebSocket connection limits, see the concurrency limits and timeouts page.