High word error rate (WER)
Are you formatting the transcript?
This is the single most common mistake. The model emits exactly the spacing it intends: every leading space, trailing space, and punctuation mark is deliberate. Stripping whitespace or inserting your own corrupts the transcript and inflates your word error rate.

Join the model’s text verbatim. Never strip() it, normalize it, or add your own separators. How you join depends on the endpoint, because the two emit text differently.

On /stt/websocket, each transcript event carries a delta since the last final transcript, not the full transcript for the audio. Append the text from every event where is_final is true.

Avoid the following:
- Dropping final transcript events
- Trimming text
- Joining events with a space, which inserts spaces in the middle of words
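As a minimal sketch of the append rule for /stt/websocket, the helper below concatenates the text of every final transcript event with an empty separator. The event shape (a dict with type, is_final, and text keys) and the sample events are assumptions made for illustration, not a copy of the API schema.

```python
from typing import Iterable, Mapping


def join_final_deltas(events: Iterable[Mapping]) -> str:
    """Concatenate `text` from final transcript events exactly as received."""
    parts = []
    for event in events:
        # Assumed event shape: {"type": "transcript", "is_final": bool, "text": str}
        if event.get("type") == "transcript" and event.get("is_final"):
            parts.append(event["text"])  # no strip(), no normalization
    return "".join(parts)  # empty separator: the deltas carry their own spacing


# Made-up sample events; note that a delta can end in the middle of a word.
events = [
    {"type": "transcript", "is_final": False, "text": "Hel"},
    {"type": "transcript", "is_final": True, "text": "Hello, wor"},
    {"type": "transcript", "is_final": True, "text": "ld. How"},
    {"type": "transcript", "is_final": True, "text": " are you?"},
]

transcript = join_final_deltas(events)
```

With these sample events, joining on a space instead would split "world" in two and double the space before "are", which is exactly the kind of error that inflates WER.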
On /stt/turns/websocket, the transcript field is cumulative within a turn: each turn.update, turn.eager_end, and turn.end event already holds the full text of the turn so far.

If you only care about the final transcript, take the transcript from each turn.end, one per completed turn.

Concatenating turn.update and turn.eager_end events is a classic source of duplicated text: because each update is cumulative, joining them repeats parts of the transcript.
Treat turn.update and turn.eager_end as updates to the turn state, not transcript chunks. For the final transcript, read only turn.end.

Whatever the endpoint, join the model’s text verbatim: never strip(), normalize, or add separators.

| Endpoint | Event | Text semantics | How to combine |
|---|---|---|---|
| /stt/websocket | transcript (is_final: true) | Delta since the last final transcript | Append text exactly as received |
| /stt/turns/websocket | turn.update | Cumulative within the turn | Replace the turn’s transcript; don’t concatenate |
| /stt/turns/websocket | turn.eager_end | Cumulative within the turn | Replace the turn’s transcript; don’t concatenate |
| /stt/turns/websocket | turn.end | Cumulative within the turn | Replace the turn’s transcript; don’t concatenate |
| /stt/turns/websocket | n/a | Complete transcript across all turns | Concatenate transcript from all turn.end events exactly as received |
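The /stt/turns/websocket rows of the table can be sketched as a small state holder: cumulative events replace the in-progress turn, and only turn.end contributes to the full transcript. The class name, the type and transcript keys, and the sample events are assumptions for illustration.

```python
class TurnTranscripts:
    """Track cumulative turn events without duplicating text."""

    def __init__(self):
        self.current = ""    # latest cumulative text of the in-progress turn
        self.completed = []  # one entry per turn.end, stored verbatim

    def handle(self, event):
        kind = event.get("type")  # assumed key name for the event kind
        if kind in ("turn.update", "turn.eager_end"):
            # Cumulative within the turn: replace, never append.
            self.current = event["transcript"]
        elif kind == "turn.end":
            self.completed.append(event["transcript"])
            self.current = ""

    def full_transcript(self):
        # Concatenate turn.end transcripts exactly as received.
        return "".join(self.completed)


# Made-up sample events covering two turns.
tracker = TurnTranscripts()
for ev in [
    {"type": "turn.update", "transcript": "Hi"},
    {"type": "turn.update", "transcript": "Hi there"},
    {"type": "turn.eager_end", "transcript": "Hi there."},
    {"type": "turn.end", "transcript": "Hi there."},
    {"type": "turn.update", "transcript": " How"},
    {"type": "turn.end", "transcript": " How are you?"},
]:
    tracker.handle(ev)
```

Appending every event instead would yield text such as "HiHi thereHi there.Hi there.", which is the duplication described above.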
Did you drain all events?
Audio you’ve already sent is still being transcribed when you ask to close. Cut the read loop short and you lose the tail of the speech.

On /stt/websocket, after you send close, the server flushes any buffered audio, emits the remaining transcript events, then sends done immediately before closing the socket.

On /stt/turns/websocket, the close command is JSON. Send it, then keep reading messages until the socket closes. The API processes all buffered audio, emits the remaining events, then closes the socket for you.

Are you using a supported language?
Ink 2 only supports English right now. It has no concept of other languages and will try to transcribe everything as English.
Missing words
Are you using the right sample rate and encoding?
The model decodes your bytes using the encoding and sample_rate you declared in the connection. Declare values that don’t match the actual audio and the model reconstructs garbled samples: words drop out or never register.

Verify what you’re really sending. Save the raw PCM you stream and play it back with ffplay, using the same format you passed to the API. If the playback sounds distorted, the encoding or sample_rate doesn’t match the data. Correct it so the audio plays back cleanly, then send those same values to the API.
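As a complementary check that needs only the Python standard library, the sketch below computes how long a capture should last under the format you declared and wraps the raw bytes in a WAV header so any audio player can open them. It assumes 16-bit mono PCM by default; the function names are illustrative and not part of any SDK.

```python
import io
import wave


def pcm_duration_seconds(num_bytes, sample_rate, bytes_per_sample=2, channels=1):
    """Duration implied by the declared format for a raw PCM capture."""
    return num_bytes / (sample_rate * bytes_per_sample * channels)


def wrap_pcm_as_wav(pcm, sample_rate, bytes_per_sample=2, channels=1):
    """Return WAV bytes that label `pcm` with the declared format."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(bytes_per_sample)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


# Example: two seconds of 16 kHz, 16-bit mono silence is 64000 bytes.
pcm = b"\x00" * 64000
declared_ok = pcm_duration_seconds(len(pcm), 16000)   # 2.0 seconds
declared_bad = pcm_duration_seconds(len(pcm), 8000)   # 4.0 seconds
wav_bytes = wrap_pcm_as_wav(pcm, 16000)
```

If the computed duration disagrees with how long you actually recorded, the declared sample_rate or sample width is the likely culprit. A mismatch such as halving the rate doubles the implied duration and makes speech play back slow and low-pitched.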
High latency
Are you sending finalize?
On /stt/websocket, transcription is triggered by the finalize command. Send it the instant your user signals that they are done speaking or VAD detects that the user stopped speaking.

Send finalize to get a transcript mid-session as soon as the user expects a result.
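One way to wire this up is to fire finalize only on the transition from speech to silence, so it is never sent mid-speech or repeatedly during a pause. In this sketch send_command is a hypothetical callback that writes a command to the socket; the exact wire format of finalize is not shown in this section, so the string passed here is a placeholder.

```python
class FinalizeOnSpeechEnd:
    """Send finalize once each time the user stops speaking."""

    def __init__(self, send_command):
        self._send_command = send_command  # hypothetical socket writer
        self._speaking = False

    def on_vad(self, is_speech):
        # Fire only on the speech -> silence edge, not on every silent frame.
        if self._speaking and not is_speech:
            self._send_command("finalize")  # placeholder for the real command
        self._speaking = is_speech

    def on_user_done(self):
        # Explicit signal, for example the user released a push-to-talk button.
        self._send_command("finalize")
        self._speaking = False


# Usage with a list standing in for the socket.
sent = []
trigger = FinalizeOnSpeechEnd(sent.append)
for frame_is_speech in [False, True, True, False, False, True, False]:
    trigger.on_vad(frame_is_speech)
```

Two utterances appear in the sample VAD stream, so finalize is issued twice, once after each.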
You can send finalize as many times as necessary. Don’t confuse it with close, which tells the server that no more audio will be sent.

Only send finalize at sensible moments in the audio stream: finalizing mid-speech produces transcription errors.

Are you using the right endpoint?
If you don’t actually know when the user starts and stops speaking, don’t sit on /stt/websocket guessing when to finalize. Use /stt/turns/websocket: the model detects turn boundaries and emits final transcripts on its own, with no finalize required.

Already on /stt/turns/websocket and want to shave more latency? Start generating your reply on the turn.eager_end event instead of waiting for turn.end. See Turn Events for the pattern.
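A rough sketch of the eager-reply idea follows; it is not the pattern from the Turn Events page. It drafts a reply at turn.eager_end and keeps the draft only if turn.end arrives with the same transcript. The generate callback, the event keys, and the fall-back-and-regenerate policy are all assumptions. A real agent would run generate asynchronously; it is synchronous here to keep the logic visible.

```python
class EagerReply:
    """Draft a reply at turn.eager_end, confirm or redo it at turn.end."""

    def __init__(self, generate):
        self._generate = generate  # hypothetical: transcript -> reply
        self._draft_for = None     # transcript the draft was built from
        self._draft = None

    def handle(self, event):
        kind = event.get("type")   # assumed key name for the event kind
        text = event.get("transcript", "")
        if kind == "turn.eager_end":
            self._draft_for = text
            self._draft = self._generate(text)
        elif kind == "turn.end":
            if self._draft_for == text:
                reply = self._draft          # draft is still valid
            else:
                reply = self._generate(text)  # turn changed: regenerate
            self._draft_for = self._draft = None
            return reply
        return None


# Usage with a stand-in generator that records what it was asked for.
calls = []

def fake_generate(transcript):
    calls.append(transcript)
    return transcript.upper()

agent = EagerReply(fake_generate)
agent.handle({"type": "turn.eager_end", "transcript": "hi"})
first = agent.handle({"type": "turn.end", "transcript": "hi"})
agent.handle({"type": "turn.eager_end", "transcript": "so"})
second = agent.handle({"type": "turn.end", "transcript": "so what"})
```

In the first turn the draft is reused, so generation starts before turn.end arrives. In the second the user kept talking, so the draft is discarded and the reply is rebuilt from the final transcript.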
Turn detection isn't working
Are you using the right endpoint?
/stt/websocket has no turn detection: it never emits turn.start or turn.end. Sending finalize flushes a transcript without telling you whether the user is done. Detecting turns on top of it means reimplementing VAD yourself.

For native turn detection, use /stt/turns/websocket. The model signals when each user turn begins and ends, so your agent reacts to turn events rather than running its own VAD. Unsure which endpoint fits? See Compare STT Endpoints.
Internal server errors
Are you chunking audio in realtime?
The streaming endpoints expect audio to arrive at roughly the rate it’s spoken. Push a large batch of audio into the socket at once and you overload the server-side buffer, which surfaces as an internal server error.

Stream in small chunks (50–200 ms each) and pace them to realtime, averaging one second of audio sent per second of wall-clock time.

To transcribe a complete file in one shot, consider using the batch endpoint /stt/transcribe, which takes the whole file in a single request.

Where to go next
- Compare STT endpoints: Choose between turn detection and external VAD
- Understand turn detection: See how user turn events work in voice agents