Common STT Pitfalls - Cartesia Docs

High word error rate (WER)

Are you formatting the transcript?

This is the single most common mistake. The model emits exactly the spacing it intends: every leading space, trailing space, and bit of punctuation is deliberate. Stripping whitespace or inserting your own corrupts the transcript and inflates your word error rate.

Join the model’s text verbatim. Never strip() it, normalize it, or add your own separators. How you join depends on the endpoint, because the two emit text differently.

On /stt/websocket, each transcript event carries a delta since the last final transcript, not the full transcript for the audio. Append the text from every event where is_final is true:

import json

transcript = ""
async for message in websocket:
    event = json.loads(message)
    if event["type"] == "transcript" and event["is_final"]:
        transcript += event["text"]  # delta — append exactly as received

Do not:

Drop final transcript events
```
"This is a"
" single sentence."
```

Trim text

"Trimming may"
" join words."

"Trimming mayjoin words."

Join with a space (this inserts spaces in the middle of words)

"Insert"
"ing spaces is not safe"

"Insert ing spaces is not safe"

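The failure modes above can be reproduced offline with plain strings. The following is a minimal sketch: the delta strings are made up for illustration (not real API output), and the helper names are hypothetical. It compares verbatim concatenation against trimming and space-joining.

```python
# Hypothetical deltas, shaped like the `text` of consecutive final transcript events.
deltas = ["Insert", "ing spaces", " is not safe."]


def join_verbatim(parts: list[str]) -> str:
    """Concatenate deltas exactly as received."""
    return "".join(parts)


def join_trimmed(parts: list[str]) -> str:
    """WRONG: stripping removes the leading spaces that separate words."""
    return "".join(p.strip() for p in parts)


def join_with_spaces(parts: list[str]) -> str:
    """WRONG: a separator lands inside words that were split across deltas."""
    return " ".join(p.strip() for p in parts)


print(join_verbatim(deltas))     # Inserting spaces is not safe.
print(join_trimmed(deltas))      # Inserting spacesis not safe.
print(join_with_spaces(deltas))  # Insert ing spaces is not safe.
```

Only the verbatim join reproduces the intended sentence; each of the other two introduces at least one word error.
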
On /stt/turns/websocket, the transcript field is cumulative within a turn: each turn.update, turn.eager_end, and turn.end event already holds the full text of the turn so far.

If you only care about the final transcript, take the transcript from each turn.end, one per completed turn:

import json

full_audio_transcript = ""
turns: list[str] = []
async for message in websocket:
    event = json.loads(message)
    if event["type"] == "turn.end":
        # transcripts across turns are also concatenated without formatting!
        # the model includes spaces so you can simply concatenate all turn.end events together
        full_audio_transcript += event["transcript"]

        # final, per-turn transcript
        turns.append(event["transcript"])

Concatenating transcripts from turn.update and turn.eager_end events is a classic source of duplicated text: because each update is cumulative, joining them repeats parts of the transcript. Treat turn.update and turn.eager_end as updates to the turn state, not transcript chunks.

For the final transcript, read only turn.end.

Whatever the endpoint, join the model’s text verbatim. Never strip(), normalize, or add separators:

Endpoint	Event	Text semantics	How to combine
`/stt/websocket`	`transcript` (`is_final: true`)	Delta since the last final transcript	Append `text` exactly as received
`/stt/turns/websocket`	`turn.update`	Cumulative within the turn	Replace the turn’s `transcript`; don’t concatenate
`/stt/turns/websocket`	`turn.eager_end`	Cumulative within the turn	Replace the turn’s `transcript`; don’t concatenate
`/stt/turns/websocket`	`turn.end`	Cumulative within the turn	Replace the turn’s `transcript`; don’t concatenate
`/stt/turns/websocket`	—	Complete transcript across all turns	Concatenate `transcript` from all `turn.end` events exactly as received
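
The replace-versus-concatenate rules for /stt/turns/websocket in the table can be captured in a small state holder. This is a sketch under assumptions: the TurnTracker class is hypothetical, and the event payloads are invented for illustration, using only the type and transcript fields described above.

```python
from dataclasses import dataclass, field


@dataclass
class TurnTracker:
    """Tracks the text of the open turn and the list of completed turns."""

    current: str = ""  # latest cumulative text of the turn in progress
    turns: list[str] = field(default_factory=list)

    def handle(self, event: dict) -> None:
        kind = event.get("type")
        if kind in ("turn.update", "turn.eager_end"):
            # Cumulative within the turn: replace, never append.
            self.current = event["transcript"]
        elif kind == "turn.end":
            # Final text for this turn: store it verbatim and reset.
            self.turns.append(event["transcript"])
            self.current = ""

    @property
    def full_transcript(self) -> str:
        # Turn transcripts carry their own spacing, so join with no separator.
        return "".join(self.turns)


# Hypothetical event sequence covering two turns.
events = [
    {"type": "turn.update", "transcript": "Hello"},
    {"type": "turn.update", "transcript": "Hello there."},
    {"type": "turn.end", "transcript": "Hello there."},
    {"type": "turn.update", "transcript": " How are"},
    {"type": "turn.end", "transcript": " How are you?"},
]

tracker = TurnTracker()
for event in events:
    tracker.handle(event)

print(tracker.turns)
print(tracker.full_transcript)
```

Because updates overwrite `current` instead of extending it, the duplicated-text problem cannot occur, and only turn.end contributes to the complete transcript.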

Did you drain all events?

Audio you’ve already sent is still being transcribed when you ask to close. Cut the read loop short and you lose the tail of the speech.

On /stt/websocket, after you send close the server flushes any buffered audio, emits the remaining transcript events, then sends done immediately before closing the socket:

await websocket.send("close")
async for message in websocket:
    event = json.loads(message)
    if event["type"] == "transcript" and event["is_final"]:
        transcript += event["text"]
    elif event["type"] == "done":
        print("done! expect the server to close the connection soon with code=1000")
print("server closed the connection now")

On /stt/turns/websocket, the close command is JSON. Send it, then keep reading messages until the socket closes. The API processes all buffered audio, emits the remaining events, then closes the socket for you.

await websocket.send(json.dumps({"type": "close"}))
async for message in websocket:
    event = json.loads(message)
    if event["type"] == "turn.end":
        turns.append(event["transcript"])
print("server closed the connection")

Are you using a supported language?

Ink 2 only supports English right now. It has no concept of other languages and will try to transcribe everything as English.

Missing words

Are you using the right sample rate and encoding?

The model decodes your bytes using the encoding and sample_rate you declared in the connection. Declare values that don’t match the actual audio and the model reconstructs garbled samples: words drop out or never register.

Verify what you’re really sending. Save the raw PCM you stream and play it back with ffplay, using the same format you passed to the API:

# encoding=pcm_s16le
# sample_rate=16000
# 1 channel (our api only supports mono)
ffplay -f s16le -ar 16000 -ac 1 audio.raw

# format
ffplay -f <encoding_without_pcm_prefix> -ar <sample_rate> -ac <num_channels> <file_path>

If the playback sounds wrong (it should be quite obvious), then your encoding or sample_rate doesn’t match the data. Correct it so the audio plays back cleanly, then send those same values to the API.
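
If ffplay is not available, a rough programmatic check is to compare the duration implied by the byte count with how long the recording actually was. The sketch below assumes mono pcm_s16le (2 bytes per sample); the function names are hypothetical, and the sample data is synthetic silence. It also wraps the raw PCM in a WAV header with Python's standard wave module so that any audio player can open it.

```python
import io
import wave


def pcm_duration_seconds(num_bytes: int, sample_rate: int,
                         bytes_per_sample: int = 2, channels: int = 1) -> float:
    """Duration implied by a raw PCM byte count under the declared format."""
    return num_bytes / (sample_rate * bytes_per_sample * channels)


def wrap_s16le_as_wav(pcm: bytes, sample_rate: int) -> bytes:
    """Wrap mono 16-bit little-endian PCM in a WAV header for playback."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


# Synthetic example: 3 seconds of silence as 16 kHz mono pcm_s16le.
pcm = b"\x00\x00" * 16000 * 3

print(pcm_duration_seconds(len(pcm), 16000))  # 3.0, matches the recording
print(pcm_duration_seconds(len(pcm), 8000))   # 6.0, wrong rate doubles the duration
```

A computed duration that is off by a clean factor (2x, 3x, 0.5x) usually points at a wrong sample_rate or sample width rather than at missing audio.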

High latency

Are you sending `finalize`?

On /stt/websocket, transcription is triggered by the finalize command. Send it the instant your user signals that they are done speaking or VAD detects that the user stopped speaking:

await websocket.send("finalize")

Without it, the model falls back to silence-based auto-finalization. That’s slower by design: it waits out a pause to be sure the user is done.

Use finalize to get a transcript mid-session as soon as the user expects a result. You can send finalize as many times as necessary. Don’t confuse it with close, which tells the server that no more audio will be sent.

Note that you must only send finalize at sensible moments in the audio stream. Finalizing mid-speech will produce transcription errors.

Are you using the right endpoint?

If you don’t actually know when the user starts and stops speaking, don’t sit on /stt/websocket guessing when to finalize. Use /stt/turns/websocket: the model detects turn boundaries and emits final transcripts on its own, with no finalize required.

Already on /stt/turns/websocket and want to shave more latency? Start generating your reply on the turn.eager_end event instead of waiting for turn.end. See Turn Events for the pattern.

Turn detection isn't working

Are you using the right endpoint?

/stt/websocket has no turn detection: it never emits turn.start or turn.end. Sending finalize flushes a transcript without telling you whether the user is done. Detecting turns on top of it means reimplementing VAD yourself.

For native turn detection, use /stt/turns/websocket. The model signals when each user turn begins and ends, so your agent reacts to turn events rather than running its own VAD. Unsure which endpoint fits? See Compare STT Endpoints.

Internal server errors

Are you chunking audio in realtime?

The streaming endpoints expect audio to arrive at roughly the rate it’s spoken. Push a large batch of audio into the socket at once and you overload the server-side buffer, which surfaces as an internal server error.

Stream in small chunks (50–200ms each) and pace them to realtime, averaging one second of audio sent per second of wall-clock time.

To transcribe a complete file in one shot, consider using the batch endpoint /stt/transcribe, which takes the whole file in a single request.
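
For the streaming case, chunking and pacing might look like the sketch below. It assumes mono PCM with 2 bytes per sample; chunk_pcm and stream_realtime are hypothetical helper names, and send stands in for whatever coroutine your client uses to send binary audio.

```python
import asyncio
import time
from typing import Awaitable, Callable, Iterator


def chunk_pcm(pcm: bytes, sample_rate: int, chunk_ms: int = 100,
              bytes_per_sample: int = 2) -> Iterator[bytes]:
    """Yield consecutive chunks of mono PCM, each about chunk_ms of audio."""
    # Whole samples per chunk, so no chunk splits a sample in half.
    chunk_bytes = sample_rate * chunk_ms // 1000 * bytes_per_sample
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


async def stream_realtime(pcm: bytes,
                          send: Callable[[bytes], Awaitable[None]],
                          sample_rate: int, chunk_ms: int = 100) -> None:
    """Send chunks on a fixed schedule so the average rate stays at realtime."""
    start = time.monotonic()
    for i, chunk in enumerate(chunk_pcm(pcm, sample_rate, chunk_ms)):
        # Schedule each chunk relative to the start time, so a slow send
        # does not accumulate drift across the stream.
        delay = start + i * chunk_ms / 1000 - time.monotonic()
        if delay > 0:
            await asyncio.sleep(delay)
        await send(chunk)
```

With a connected client you would pass its send coroutine, for example `await stream_realtime(pcm, websocket.send, sample_rate=16000)`, assuming the client sends bytes as binary frames. Audio captured live from a microphone is already paced by the capture device, so this explicit pacing matters mainly when replaying files or buffers.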

Where to go next

Compare STT endpoints

Choose between turn detection and external VAD

Understand turn detection

See how user turn events work in voice agents
