Speech to Text (Streaming)
This endpoint creates a bidirectional WebSocket connection for real-time speech transcription.
Our STT endpoint accepts a stream of raw audio bytes encoded as 16-bit PCM at a 16 kHz sample rate, and returns transcription results as they become available.
Usage Pattern:
- Connect to the WebSocket with appropriate query parameters
- Send audio chunks as binary WebSocket messages in pcm_s16le format at a 16 kHz sample rate
- Receive transcription messages as JSON
- Send finalize as a text message to flush any remaining audio (receives a flush_done acknowledgment)
- Send done as a text message to close the session cleanly (receives a done acknowledgment and closes)
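The send side of the pattern above can be sketched as a small generator that yields what a client would write to the socket, in order. This is a minimal illustration, not the official client: the function name `outgoing_messages` is hypothetical, and the actual WebSocket transport is left out so the ordering logic can be read on its own.

```python
from typing import Iterable, Iterator, Union


def outgoing_messages(audio_chunks: Iterable[bytes]) -> Iterator[Union[bytes, str]]:
    """Yield messages in the order described by the usage pattern.

    bytes items stand for binary WebSocket messages (audio),
    str items stand for text WebSocket messages (commands).
    """
    for chunk in audio_chunks:
        if chunk:  # skip empty chunks; there is nothing to transcribe in them
            yield chunk
    yield "finalize"  # flush remaining audio; server replies with flush_done
    yield "done"      # close the session; server replies with done and closes


# Usage: iterate and hand each item to whatever WebSocket client you use.
messages = list(outgoing_messages([b"\x00\x01", b"", b"\x02\x03"]))
```

A real client would read incoming JSON messages concurrently while sending, since transcripts arrive as they become available.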
Handshake
Headers
Query parameters
ID of the model to use for transcription. Use ink-whisper for the latest Cartesia Whisper model.
The language of the input audio in ISO-639-1 format. Defaults to en.
The encoding format of the audio data. This determines how the server interprets the raw binary audio data you send.
Currently supported:
- pcm_s16le: 16-bit signed integer PCM, little-endian (default)
The sample rate of the audio in Hz. Only 16000 is supported (default). Must match the actual sample rate of your audio data.
You can pass your API key as this query parameter instead of in the X-API-Key header. This is particularly useful in the browser, where the WebSocket API does not allow setting custom headers.
You do not need to specify this if you are passing the header.
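As a rough sketch, the connection URL could be assembled from the query parameters described above. Several things here are assumptions: the base URL is a placeholder rather than the real endpoint, and the parameter names `model`, `language`, and `api_key` are guesses because the descriptions above do not show them (`encoding` and `sample_rate` are named elsewhere on this page). The helper `build_stt_url` is hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode

# Placeholder only; substitute the real WebSocket endpoint.
BASE_URL = "wss://stt.example.invalid/stream"


def build_stt_url(
    base_url: str = BASE_URL,
    model: str = "ink-whisper",
    language: str = "en",
    encoding: str = "pcm_s16le",
    sample_rate: int = 16000,
    api_key: Optional[str] = None,
) -> str:
    """Return a WebSocket URL with the query parameters appended."""
    # "model", "language" and "api_key" are assumed parameter names.
    params = {
        "model": model,
        "language": language,
        "encoding": encoding,
        "sample_rate": sample_rate,
    }
    if api_key is not None:
        # Only needed when the X-API-Key header cannot be set (e.g. browsers).
        params["api_key"] = api_key
    return f"{base_url}?{urlencode(params)}"


url = build_stt_url()
```

Keeping the API key out of the URL when the header is available avoids leaking it into logs that record request URLs.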
Send
In Practice:
- Send binary WebSocket messages containing raw audio data in the format specified by the encoding parameter
- Send text WebSocket messages with commands:
  - finalize: Flush any remaining audio and receive a flush_done acknowledgment
  - done: Flush remaining audio, close the session, and receive a done acknowledgment
Timeout Behavior:
- If no audio data is sent for 20 seconds, the WebSocket will automatically disconnect
- The timeout resets with each message (audio data or text command) sent to the server
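A client may want to track how close it is to the 20-second idle limit described above. The sketch below is one way to do that locally. The class name `IdleTracker` and its methods are hypothetical, and it only mirrors the documented rule on the client side; the server remains the authority on when it disconnects.

```python
import time
from typing import Callable

IDLE_TIMEOUT_S = 20.0  # idle limit stated in the Timeout Behavior section


class IdleTracker:
    """Tracks the time since the last message was sent to the server."""

    def __init__(
        self,
        clock: Callable[[], float] = time.monotonic,
        timeout_s: float = IDLE_TIMEOUT_S,
    ) -> None:
        self._clock = clock
        self._timeout_s = timeout_s
        self._last_sent = clock()

    def mark_sent(self) -> None:
        """Call after every audio chunk or text command is sent."""
        self._last_sent = self._clock()

    def seconds_left(self) -> float:
        """Seconds remaining before the idle limit, never below zero."""
        elapsed = self._clock() - self._last_sent
        return max(0.0, self._timeout_s - elapsed)

    def expired(self) -> bool:
        return self.seconds_left() == 0.0
```

The injectable `clock` argument exists so the tracker can be exercised in tests without waiting in real time.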
Audio Requirements:
- Send audio in small chunks (e.g., 100ms intervals) for optimal latency
- Audio format must match the encoding and sample_rate parameters
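To make the chunking advice concrete: at 16000 Hz with 2 bytes per sample, 100 ms of audio is 3200 bytes, assuming a single (mono) channel. The channel count is an assumption, since this page does not state it, and the helper names below are hypothetical.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16000   # only supported sample rate
BYTES_PER_SAMPLE = 2     # pcm_s16le is 16-bit
CHANNELS = 1             # assumption: mono audio


def chunk_size_bytes(chunk_ms: int = 100) -> int:
    """Number of bytes covering chunk_ms milliseconds of audio."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * chunk_ms // 1000


def iter_chunks(pcm: bytes, chunk_ms: int = 100) -> Iterator[bytes]:
    """Split a raw PCM buffer into chunks of roughly chunk_ms each."""
    size = chunk_size_bytes(chunk_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]  # the last chunk may be shorter


one_second_of_silence = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)
chunks = list(iter_chunks(one_second_of_silence))
```

When streaming from a live source such as a microphone, the same size can be used as the read buffer length so that each read maps to one binary message.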
Receive
The server will send transcription results as they become available. Messages can be of type transcript, flush_done, done, or error.
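A receiving client typically parses each JSON message and branches on its type. The sketch below assumes the message type is carried in a top-level `type` field, which this page does not confirm; the function name `classify_message` is hypothetical, and no other fields are assumed.

```python
import json
from typing import Any, Dict, Tuple

KNOWN_TYPES = {"transcript", "flush_done", "done", "error"}


def classify_message(raw: str) -> Tuple[str, Dict[str, Any]]:
    """Parse a text message and return (type, payload).

    Assumes the type lives in a top-level "type" field (unconfirmed).
    Raises ValueError for non-object payloads or unknown types.
    """
    msg = json.loads(raw)
    if not isinstance(msg, dict):
        raise ValueError("expected a JSON object")
    msg_type = msg.get("type")
    if msg_type not in KNOWN_TYPES:
        raise ValueError(f"unexpected message type: {msg_type!r}")
    return msg_type, msg


def should_stop(msg_type: str) -> bool:
    """The session is over after a done acknowledgment or an error."""
    return msg_type in {"done", "error"}
```

Treating error as terminal is a conservative design choice for this sketch; a real client may be able to continue after some errors.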