Realtime Speech-to-Text (Auto)
Realtime speech transcription with built-in turn detection
This endpoint currently supports English only.
We expect to add more languages in the coming months.
{}

{
  "type": "close"
}

{
  "type": "config",
  "turn": {
    "start_threshold": 0.8,
    "eager_end_threshold": 0.4,
    "end_threshold": 0.2,
    "end_timeout_ms": 5600
  }
}

{
  "type": "connected",
  "request_id": "2ff8af53-4d38-479d-8287-58940f01c701"
}

{
  "type": "turn.start",
  "request_id": "2ff8af53-4d38-479d-8287-58940f01c701"
}

{
  "type": "turn.update",
  "transcript": "Hey can you help",
  "request_id": "2ff8af53-4d38-479d-8287-58940f01c701"
}

{
  "type": "turn.eager_end",
  "transcript": "Hey can you help me with something?",
  "request_id": "2ff8af53-4d38-479d-8287-58940f01c701"
}

{
  "type": "turn.resume",
  "request_id": "2ff8af53-4d38-479d-8287-58940f01c701"
}

{
  "type": "turn.end",
  "transcript": "Hey can you help me with something?",
  "request_id": "2ff8af53-4d38-479d-8287-58940f01c701"
}

{
  "type": "error",
  "title": "Invalid model",
  "message": "The model is not valid, make sure it is a valid model ID.",
  "error_code": "model_not_found",
  "doc_url": "https://docs.cartesia.ai/build-with-cartesia/stt/latest",
  "status_code": 400,
  "request_id": "2ff8af53-4d38-479d-8287-58940f01c701"
}

ID of the model to use for transcription, e.g. ink-2.
See Models for available models.
The encoding format of the audio data. This determines how the server interprets the raw binary audio data you send.
Supported encodings: pcm_s16le, pcm_s32le, pcm_f16le, pcm_f32le, pcm_mulaw, pcm_alaw.
For guidance on choosing an encoding, see Audio Input.
The sample rate of the audio in Hz.
Threshold above which to start the turn. Range: 0.5–0.9. Must stay above the eager end threshold.
See Configuring turn detection for details.
Threshold below which to eager end the turn. Range: 0.3–0.6. Must stay between the end and start thresholds.
See Configuring turn detection for details.
Threshold below which to end the turn. Range: 0.05–0.5. Must stay below the eager end threshold.
See Configuring turn detection for details.
Maximum amount of time in milliseconds that the model will wait after the user stops speaking before ending the turn. Range: 640–11200.
See Configuring turn detection for details.
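The ranges and ordering rules for the four turn detection settings can be checked on the client before they are sent. The following is a minimal sketch, not part of the API: `TurnConfig` and `validate_turn_config` are hypothetical names, the default values are simply copied from the example config message, and treating the range bounds as inclusive and the ordering as strict are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TurnConfig:
    # Values copied from the example config message; not necessarily server defaults.
    start_threshold: float = 0.8
    eager_end_threshold: float = 0.4
    end_threshold: float = 0.2
    end_timeout_ms: int = 5600


# Documented (min, max) ranges; bounds are assumed to be inclusive.
RANGES = {
    "start_threshold": (0.5, 0.9),
    "eager_end_threshold": (0.3, 0.6),
    "end_threshold": (0.05, 0.5),
    "end_timeout_ms": (640, 11200),
}


def validate_turn_config(cfg: TurnConfig) -> List[str]:
    """Return a list of problems; an empty list means the config looks valid."""
    problems = []
    for name, (low, high) in RANGES.items():
        value = getattr(cfg, name)
        if not low <= value <= high:
            problems.append(f"{name}={value} is outside {low}-{high}")
    # Ordering rule: end < eager_end < start (strict comparison is an assumption).
    if not cfg.end_threshold < cfg.eager_end_threshold:
        problems.append("end_threshold must be below eager_end_threshold")
    if not cfg.eager_end_threshold < cfg.start_threshold:
        problems.append("eager_end_threshold must be below start_threshold")
    return problems
```

For example, `validate_turn_config(TurnConfig(start_threshold=0.5, eager_end_threshold=0.55))` reports an ordering problem even though both values are inside their individual ranges.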
API version, e.g. 2026-03-01
API key passed in a header.
A short-lived access token passed as a query parameter, for making API requests from a client. This is particularly useful in the browser, where WebSockets do not support headers. See Authenticate client apps to generate an access token.
Send WebSocket binary messages containing raw audio data as specified by the encoding and sample_rate query parameters.
Audio Requirements:
- Send audio in small chunks, e.g. 100 ms
- Audio format must match the encoding and sample_rate parameters
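To illustrate the chunking requirement, the sketch below splits a raw audio buffer into pieces of roughly 100 ms each. The helper names `chunk_size_bytes` and `iter_chunks` are hypothetical, and the calculation assumes single-channel audio, which this page does not state. The bytes-per-sample values follow from the encoding names (16-bit, 32-bit, and 8-bit companded formats).

```python
from typing import Iterator

# Bytes per sample for each supported encoding (single channel assumed).
BYTES_PER_SAMPLE = {
    "pcm_s16le": 2,
    "pcm_s32le": 4,
    "pcm_f16le": 2,
    "pcm_f32le": 4,
    "pcm_mulaw": 1,
    "pcm_alaw": 1,
}


def chunk_size_bytes(encoding: str, sample_rate: int, chunk_ms: int = 100) -> int:
    """Number of bytes that hold chunk_ms milliseconds of audio."""
    samples = sample_rate * chunk_ms // 1000
    return samples * BYTES_PER_SAMPLE[encoding]


def iter_chunks(audio: bytes, encoding: str, sample_rate: int,
                chunk_ms: int = 100) -> Iterator[bytes]:
    """Yield consecutive slices of audio; the last slice may be shorter."""
    size = chunk_size_bytes(encoding, sample_rate, chunk_ms)
    if size <= 0:
        raise ValueError("chunk size must be positive")
    for offset in range(0, len(audio), size):
        yield audio[offset:offset + size]
```

Each yielded slice would then be sent as one WebSocket binary message. Because the slice size is a whole number of samples, no sample is split across two messages.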
Send a JSON-encoded close command as a WebSocket text message to close the session cleanly. All buffered audio is still processed by the model and emitted as events.
Send a JSON-encoded config command as a WebSocket text message to update model settings.
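The two commands can be built as plain JSON strings that mirror the example messages on this page. This is a rough sketch: `close_command` and `config_command` are hypothetical helper names, and whether the server accepts a config command containing only a subset of the turn settings is an assumption.

```python
import json

# Field names taken from the "turn" object in the example config message.
TURN_FIELDS = {
    "start_threshold",
    "eager_end_threshold",
    "end_threshold",
    "end_timeout_ms",
}


def close_command() -> str:
    """JSON text for the close command."""
    return json.dumps({"type": "close"})


def config_command(**turn_settings) -> str:
    """JSON text for a config command carrying the given turn settings.

    Sending only some of the fields is an assumption, not documented behavior.
    """
    unknown = set(turn_settings) - TURN_FIELDS
    if unknown:
        raise ValueError(f"unknown turn settings: {sorted(unknown)}")
    return json.dumps({"type": "config", "turn": turn_settings})
```

The returned strings are meant to be sent as WebSocket text messages, in contrast to audio, which goes out as binary messages.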
connected: Fires once when the WebSocket connection is established.
You do not need to wait for this event before sending audio.
turn.start: Marks the start of a user turn. Fires quickly after the user begins speaking.
This event can be used to interrupt your agent to avoid talking over the user.
turn.update: Fires repeatedly as the model transcribes the current user turn.
turn.eager_end: Fires when the model predicts that the user might be done speaking.
turn.resume: Fires after turn.eager_end if the user turn has not actually ended.
turn.end: Marks the end of a user turn.
error: Error information for STT WebSocket connections.
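One way a client might react to these events is sketched below. `TurnTracker` and the action labels it records are hypothetical, and starting a speculative reply on turn.eager_end (then cancelling it on turn.resume) is only one possible use of those events, not something this page prescribes. The sketch also assumes that each transcript field carries the full text of the turn so far, as the example messages suggest.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TurnTracker:
    in_turn: bool = False
    transcript: str = ""
    speculative: Optional[str] = None  # transcript from turn.eager_end, awaiting confirmation
    completed: List[str] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)  # what the app would do, recorded for inspection

    def handle(self, raw: str) -> None:
        """Update state from one JSON text message received on the WebSocket."""
        event = json.loads(raw)
        kind = event.get("type")
        if kind == "connected":
            self.actions.append("connected")
        elif kind == "turn.start":
            self.in_turn = True
            self.transcript = ""
            self.speculative = None
            self.actions.append("interrupt_agent")
        elif kind == "turn.update":
            self.transcript = event.get("transcript", "")
        elif kind == "turn.eager_end":
            self.speculative = event.get("transcript", "")
            self.actions.append("start_speculative_reply")
        elif kind == "turn.resume":
            self.speculative = None
            self.actions.append("cancel_speculative_reply")
        elif kind == "turn.end":
            self.in_turn = False
            self.transcript = event.get("transcript", "")
            self.completed.append(self.transcript)
            self.actions.append("commit_reply")
        elif kind == "error":
            raise RuntimeError(f"{event.get('error_code')}: {event.get('message')}")
        # Unknown event types are ignored so that new events do not break the client.
```

Feeding the tracker the example messages in order (connected, turn.start, turn.update, turn.eager_end, turn.resume, turn.end) leaves one completed transcript and a record of the interrupt, speculative start, cancel, and commit actions.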