{}

{
  "type": "close"
}

{
  "type": "connected",
  "request_id": "2ff8af53-4d38-479d-8287-58940f01c701"
}

{
  "type": "turn.start",
  "request_id": "2ff8af53-4d38-479d-8287-58940f01c701"
}

{
  "type": "turn.update",
  "transcript": "Hey can you help",
  "request_id": "2ff8af53-4d38-479d-8287-58940f01c701"
}

{
  "type": "turn.eager_end",
  "transcript": "Hey can you help me with something?",
  "request_id": "2ff8af53-4d38-479d-8287-58940f01c701"
}

{
  "type": "turn.resume",
  "request_id": "2ff8af53-4d38-479d-8287-58940f01c701"
}

{
  "type": "turn.end",
  "transcript": "Hey can you help me with something?",
  "request_id": "2ff8af53-4d38-479d-8287-58940f01c701"
}

{
  "type": "error",
  "title": "Invalid model",
  "message": "The model is not valid, make sure it is a valid model ID.",
  "error_code": "model_not_found",
  "doc_url": "https://docs.cartesia.ai/build-with-cartesia/stt-models/latest",
  "status_code": 400,
  "request_id": "2ff8af53-4d38-479d-8287-58940f01c701"
}

Realtime Speech-to-Text (Auto)
Realtime speech transcription with built-in turn detection
This endpoint is English only right now.
We expect to add more languages in the coming months.
ID of the model to use for transcription, e.g. ink-2.
See Models for available models.
The encoding format of the audio data. This determines how the server interprets the raw binary audio data you send.
Supported encodings: pcm_s16le, pcm_s32le, pcm_f16le, pcm_f32le, pcm_mulaw, pcm_alaw.
For guidance on choosing an encoding, see Audio encodings.
The sample rate of the audio in Hz.
API version, e.g. 2026-03-01
API key passed in a header.
A short-lived access token passed in a query param to make API requests from a client. This is particularly useful in the browser, where WebSockets do not support headers. See Authenticate client apps to generate an access token.
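As a rough illustration of how the connection parameters might be assembled, the sketch below builds a query string with Python's standard library. Only the `encoding` and `sample_rate` parameter names come from this page; the base URL is a placeholder, `build_stt_url` is a hypothetical helper, and any further parameters (model, API version, access token) are passed through `extra` under whatever names the API reference specifies.

```python
from urllib.parse import urlencode

# Encodings listed on this page as supported.
SUPPORTED_ENCODINGS = {
    "pcm_s16le", "pcm_s32le", "pcm_f16le", "pcm_f32le", "pcm_mulaw", "pcm_alaw",
}


def build_stt_url(base_url: str, encoding: str, sample_rate: int, **extra: str) -> str:
    """Append connection parameters to a WebSocket base URL (hypothetical helper)."""
    if encoding not in SUPPORTED_ENCODINGS:
        raise ValueError(f"unsupported encoding: {encoding}")
    # `extra` carries parameters whose names are not given in this section.
    params = {"encoding": encoding, "sample_rate": str(sample_rate), **extra}
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint: substitute the real URL from the API reference.
url = build_stt_url("wss://stt.example.invalid/ws", "pcm_s16le", 16000)
```

Validating the encoding on the client is optional; it simply surfaces a typo before the connection is opened.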
Send WebSocket binary messages containing raw audio data as specified by the encoding and sample_rate query parameters.
Audio Requirements:
- Send audio in small chunks, e.g. 100 ms
- Audio format must match the encoding and sample_rate parameters
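To make the chunking requirement concrete, here is a minimal sketch that works out how many bytes correspond to roughly 100 ms of audio and slices a buffer accordingly. It assumes single-channel audio, which this page does not state, and the helper names `chunk_size_bytes` and `iter_chunks` are illustrative rather than part of any SDK.

```python
from typing import Iterator

# Bytes per sample for each encoding listed on this page.
BYTES_PER_SAMPLE = {
    "pcm_s16le": 2,
    "pcm_s32le": 4,
    "pcm_f16le": 2,
    "pcm_f32le": 4,
    "pcm_mulaw": 1,
    "pcm_alaw": 1,
}


def chunk_size_bytes(encoding: str, sample_rate: int, chunk_ms: int = 100,
                     channels: int = 1) -> int:
    """Number of bytes covering `chunk_ms` of audio (mono is an assumption)."""
    samples = sample_rate * chunk_ms // 1000
    return samples * BYTES_PER_SAMPLE[encoding] * channels


def iter_chunks(audio: bytes, encoding: str, sample_rate: int,
                chunk_ms: int = 100) -> Iterator[bytes]:
    """Yield consecutive slices of `audio`; the last one may be shorter."""
    size = chunk_size_bytes(encoding, sample_rate, chunk_ms)
    for start in range(0, len(audio), size):
        yield audio[start:start + size]
```

For example, 16 kHz `pcm_s16le` audio works out to 1600 samples, or 3200 bytes, per 100 ms chunk. Each yielded slice would be sent as one WebSocket binary message.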
Send a JSON-encoded close command as a WebSocket text message to close the session cleanly. All buffered audio will be processed by the model into events.
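The close command matches the `{"type": "close"}` example payload shown on this page. A small sketch of producing and sending it follows; `send_text` stands in for whatever text-send function your WebSocket client provides and is an assumption, not a specific library call.

```python
import json
from typing import Callable


def close_message() -> str:
    """Serialize the close command as a JSON text payload."""
    return json.dumps({"type": "close"})


def end_session(send_text: Callable[[str], None]) -> None:
    """Send the close command through a caller-supplied text-send function."""
    send_text(close_message())
```

After sending it, keep reading from the socket so that events produced from the remaining buffered audio are not lost.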
connected: Fires once when the WebSocket connection is established.
You do not need to wait for this event before sending audio.
turn.start: Marks the start of a user turn. Fires quickly after the user begins speaking.
This event can be used to interrupt your agent to avoid talking over the user.
turn.update: Fires repeatedly as the model transcribes the current user turn.
turn.eager_end: Fires when the model predicts that the user might be done speaking.
turn.resume: Fires after turn.eager_end if the user turn has not actually ended.
turn.end: Marks the end of a user turn.
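One way to consume the turn events is a small state tracker that dispatches on the `type` field of each message. The sketch below is an assumption about how an application might react: interrupting the agent on `turn.start` follows the description above, while drafting a reply on `turn.eager_end`, discarding it on `turn.resume`, and committing on `turn.end` is a common pattern rather than documented behavior. `TurnTracker` and its action strings are hypothetical names.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TurnTracker:
    speaking: bool = False
    partial: str = ""                      # latest turn.update transcript
    draft: Optional[str] = None            # transcript from a pending turn.eager_end
    finished: List[str] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)

    def handle(self, raw: str) -> None:
        """Update state from one JSON text message received on the socket."""
        event = json.loads(raw)
        kind = event.get("type")
        if kind == "turn.start":
            self.speaking = True
            self.actions.append("interrupt_agent")
        elif kind == "turn.update":
            self.partial = event.get("transcript", "")
        elif kind == "turn.eager_end":
            self.draft = event.get("transcript", "")
            self.actions.append("start_draft_reply")
        elif kind == "turn.resume":
            self.draft = None              # the user kept talking
            self.actions.append("discard_draft_reply")
        elif kind == "turn.end":
            self.speaking = False
            self.draft = None
            self.finished.append(event.get("transcript", ""))
            self.actions.append("commit_reply")
```

Recording actions as strings keeps the sketch testable; in a real agent each branch would call into your own playback or response-generation code.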
error: Error information for STT WebSocket connections.
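Error messages carry the fields shown in the example payload on this page (`title`, `message`, `error_code`, `doc_url`, `status_code`, `request_id`). A minimal sketch of turning such a message into a Python exception follows; `SttError` and `raise_for_error` are hypothetical names, and whether every error includes all of these fields is an assumption, so missing ones default to `None`.

```python
import json


class SttError(Exception):
    """Wraps an event whose type is "error" (hypothetical class)."""

    def __init__(self, event: dict) -> None:
        self.title = event.get("title")
        self.message = event.get("message")
        self.error_code = event.get("error_code")
        self.doc_url = event.get("doc_url")
        self.status_code = event.get("status_code")
        self.request_id = event.get("request_id")
        super().__init__(f"{self.error_code} ({self.status_code}): {self.message}")


def raise_for_error(raw: str) -> dict:
    """Parse one text message; raise SttError for error events, else return it."""
    event = json.loads(raw)
    if event.get("type") == "error":
        raise SttError(event)
    return event
```

Logging `request_id` alongside the error makes it easier to correlate a failure with a specific connection.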