# Troubleshooting

> Transcription errors, high latency, server errors

## Troubleshooting Realtime STT (Auto)

We refer to our `/stt/turns/websocket` endpoint as "Realtime STT (**Auto**)" since user turns are **automatically finalized** by our model.

### Realtime STT (Auto): Transcript errors

<AccordionGroup>
  <Accordion title="Are you joining transcripts correctly?">
    The `transcript` field is **cumulative within a turn** — each `turn.update`, `turn.eager_end`, and `turn.end` event already holds the full text of the turn so far.

    If you only care about the final transcript: take the `transcript` property from each `turn.end`, one per completed turn. **Join `transcript` verbatim. Never `strip()` it, normalize it, or add your own separators.**

    ```python theme={null}
    import json

    full_audio_transcript = ""
    turns: list[str] = []
    async for message in websocket:
        event = json.loads(message)
        if event["type"] == "turn.end":
            # transcripts across turns should be
            # concatenated without formatting!
            full_audio_transcript += event["transcript"]

            # per-turn transcript
            turns.append(event["transcript"])
    ```

    Concatenating transcripts from `turn.update` and `turn.eager_end` events is a classic source of duplicated text: because each update is cumulative, joining them repeats parts of the transcript.
    Consider `turn.update` and `turn.eager_end` as updates to the turn state, not transcript chunks.

    Read `turn.end` only for the final transcript.
  </Accordion>

  <Accordion title="Did you drain all events?">
    Once you are done sending all audio for a session, send `{"type": "close"}` to tell the model to flush any buffered audio and emit remaining events. The server will close the socket for you once the model is done.

    The server buffers some audio to improve transcription accuracy. If you skip the close command, or stop reading messages early, you will not receive the transcript for that buffered audio. This is acceptable only if you don't need roughly the last second of audio.

    ```python theme={null}
    import json

    # `turns` is the per-turn list you have been filling while streaming
    await websocket.send(json.dumps({"type": "close"}))
    async for message in websocket:
        event = json.loads(message)
        if event["type"] == "turn.end":
            turns.append(event["transcript"])
            # do not stop reading from the websocket!
    print("server closed the connection")
    ```
  </Accordion>

  <Accordion title="Are you using a supported language?">
    Ink 2 only supports English right now. It has no concept of other languages and will try to transcribe everything as English.
  </Accordion>

  <Accordion title="Are you using the right sample rate and encoding?">
    The model decodes your bytes using the `encoding` and `sample_rate` you declared in the connection. Our server **might not error** if these parameters are incorrect.

    You can validate your parameters by saving your audio data and playing it back with [ffplay](https://ffmpeg.org/ffplay.html):

    ```bash theme={null}
    # encoding=pcm_s16le
    # sample_rate=16000
    # 1 channel (the API expects mono)
    ffplay -f s16le -ar 16000 -ac 1 audio.raw

    # general format
    ffplay -f <encoding_without_pcm_prefix> -ar <sample_rate> -ac <num_channels_must_be_one> <file_path>
    ```

    If the playback sounds wrong (it should be quite obvious), then your `encoding` or `sample_rate` doesn't match the data. Correct it so your audio plays back cleanly, then send those same values to the API.

    See [STT Input Audio Encodings](/build-with-cartesia/capability-guides/stt-input-encodings) for help finding the right parameters.
  </Accordion>
</AccordionGroup>
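
The cumulative-transcript rule from "Are you joining transcripts correctly?" can be checked offline. This is a minimal sketch, not part of any SDK: `final_transcript` and `naive_transcript` are hypothetical helpers, and the sample events are made up for the demonstration. Only the event types and the `transcript` field come from this page.

```python
def final_transcript(events: list[dict]) -> str:
    """Concatenate the transcript of each completed turn, verbatim."""
    return "".join(e["transcript"] for e in events if e["type"] == "turn.end")


def naive_transcript(events: list[dict]) -> str:
    """Anti-pattern: treats every cumulative event as a new chunk."""
    return "".join(e["transcript"] for e in events)


# Made-up events for one turn; each transcript repeats the turn so far.
events = [
    {"type": "turn.update", "transcript": "Hello"},
    {"type": "turn.update", "transcript": "Hello there"},
    {"type": "turn.eager_end", "transcript": "Hello there."},
    {"type": "turn.end", "transcript": "Hello there."},
]

print(final_transcript(events))  # Hello there.
print(naive_transcript(events))  # HelloHello thereHello there.Hello there.
```

If your output resembles the second line, you are most likely appending `turn.update` or `turn.eager_end` transcripts instead of reading `turn.end` only.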

### Realtime STT (Auto): High latency

<AccordionGroup>
  <Accordion title="Did you stop sending audio?">
    Our API expects a continuous stream of audio.
    If you stop sending audio, the server will wait for more audio chunks to arrive rather than assuming that the user is silent.

    This is normally desired behavior to handle network lag, but it does mean that your client needs to send silence (all zeros) when your audio input is muted.
  </Accordion>

  <Accordion title="Are you using the right endpoint?">
    If you're building a push-to-talk style app (e.g. user holds a button to speak) or you would like to "flush" the transcript at predetermined points (e.g. certain evals),
    you can consider switching to [Realtime STT (Manual)](/api-reference/stt/websocket).

    Turn detection delays the final transcript carried by `turn.end` by roughly half a second.
    If your setup allows for it, using the manual endpoint and sending `"finalize"` when the user is done speaking can cut out the latency overhead from turn detection.
  </Accordion>
</AccordionGroup>
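
Sending silence while your input is muted can be sketched as follows. This assumes mono `pcm_s16le` audio, where silence is all-zero bytes as described above. `silence_chunk` is a hypothetical helper, and the 16 kHz rate and 100 ms duration are example values.

```python
def silence_chunk(sample_rate: int, duration_ms: int, bytes_per_sample: int = 2) -> bytes:
    """Return `duration_ms` of all-zero mono PCM (2 bytes per sample for pcm_s16le)."""
    num_samples = sample_rate * duration_ms // 1000
    return bytes(num_samples * bytes_per_sample)


chunk = silence_chunk(sample_rate=16000, duration_ms=100)
print(len(chunk))  # 3200 bytes = 1600 samples * 2 bytes
```

While muted, you would send a chunk like this at the same pace as real audio (for example, one `await websocket.send(chunk)` every 100 ms), so the server keeps receiving a continuous stream.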

### Realtime STT (Auto): Server errors

<Accordion title="Are you chunking audio?">
  Our realtime WebSocket endpoints expect audio to arrive at roughly the rate it's spoken.
  Pushing a large batch of audio into the socket at once can overload the server-side buffer,
  which may surface as an internal server error.

  Stream in small chunks (50–200ms each) and pace them to realtime, averaging one second of audio sent per second of wall-clock time. Here's a [JavaScript example](https://github.com/cartesia-ai/cartesia-js/blob/v3.2.0/examples/browser_examples.ts#L29-L66).

  To transcribe a complete file in one shot, consider using [Batch STT](/api-reference/stt/transcribe), which takes the whole file in a single request.
</Accordion>
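
The chunking and pacing advice above can be sketched in Python. This is an illustration under stated assumptions rather than an official client: `split_chunks` and `paced` are hypothetical helpers, the audio is assumed to be mono `pcm_s16le`, and 100 ms is just one value from the 50 to 200 ms range.

```python
import asyncio
from collections.abc import AsyncIterator


def split_chunks(audio: bytes, sample_rate: int, chunk_ms: int = 100, bytes_per_sample: int = 2) -> list[bytes]:
    """Split raw mono PCM into chunks of `chunk_ms` each (the last may be shorter)."""
    size = sample_rate * chunk_ms // 1000 * bytes_per_sample
    return [audio[i : i + size] for i in range(0, len(audio), size)]


async def paced(chunks: list[bytes], chunk_ms: int = 100) -> AsyncIterator[bytes]:
    """Yield chunks no faster than realtime, measured against a monotonic clock."""
    loop = asyncio.get_running_loop()
    start = loop.time()
    for i, chunk in enumerate(chunks):
        # Chunk i is due i * chunk_ms after the start of the stream.
        delay = start + i * chunk_ms / 1000 - loop.time()
        if delay > 0:
            await asyncio.sleep(delay)
        yield chunk


chunks = split_chunks(bytes(16000 * 2), sample_rate=16000)  # 1 s of 16 kHz pcm_s16le
print(len(chunks), len(chunks[0]))  # 10 3200
```

Inside your sender coroutine, `async for chunk in paced(chunks): await websocket.send(chunk)` would then average one second of audio per second of wall-clock time. Scheduling against the stream start, rather than sleeping a fixed interval after each send, keeps small delays from accumulating.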

## Troubleshooting Realtime STT (Manual)

We refer to our `/stt/websocket` endpoint as "Realtime STT (**Manual**)" since user turns are **manually finalized** by your client.

### Realtime STT (Manual): Transcript errors

<AccordionGroup>
  <Accordion title="Are you joining transcripts correctly?">
    Each `transcript` event carries a **delta** since the last final transcript, not the full transcript for the audio. Append the `text` from every event where `is_final` is `true`:

    ```python theme={null}
    import json

    transcript = ""
    async for message in websocket:
        event = json.loads(message)
        if event["type"] == "transcript" and event["is_final"]:
            # delta, appended exactly as received
            transcript += event["text"]
    ```

    Include every `transcript` event where `is_final` is `true`, and skip interim events where it is `false`:

    ```json lines theme={null}
    { "type": "transcript", "is_final": false, "text": "Ignore this" }
    { "type": "transcript", "is_final": true, "text": "This is a" }
    { "type": "transcript", "is_final": true, "text": " single sentence." }
    ```

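    Do not trim `text`. Leading and trailing whitespace is part of the delta, so trimming the two chunks below merges words, as the joined result shows: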

    ```json lines theme={null}
    "Trimming may"
    " join words."
    ```

    ```json theme={null}
    "Trimming mayjoin words."
    ```

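    Do not insert a space between `text` values. A delta may begin mid-word, so joining the two chunks below with a space splits a word, as the joined result shows: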

    ```json lines theme={null}
    "Insert"
    "ing spaces is not safe"
    ```

    ```json theme={null}
    "Insert ing spaces is not safe"
    ```
  </Accordion>

  <Accordion title="Did you drain all events?">
    Once you are done sending all audio for a session, send `"close"` to tell the model to flush any buffered audio and emit remaining `transcript` events. The server will send `{ "type": "done" }` after all audio has been processed, then close the socket for you.

    The server buffers some audio to improve transcription accuracy. If you skip the close command, or stop reading messages early, you will not receive the transcript for that buffered audio. This is acceptable only if you don't need roughly the last second of audio.

    ```python theme={null}
    import json

    # `transcript` is the string you have been accumulating while streaming
    await websocket.send("close")
    async for message in websocket:
        event = json.loads(message)
        if event["type"] == "transcript" and event["is_final"]:
            transcript += event["text"]
        elif event["type"] == "done":
            print("done! expect the server to close the connection soon with code=1000")
            # optional: stop reading messages and close the socket yourself
    print("server closed the connection now")
    ```
  </Accordion>

  <Accordion title="Did you specify the language?">
    Be sure to include `?language=xx` (replace `xx` with an ISO 639-1 language code) as a query param when establishing your WebSocket connection. This endpoint does not support language detection yet.

    See [Models](/build-with-cartesia/stt-models/latest) for supported languages.
  </Accordion>

  <Accordion title="Are you using the right sample rate and encoding?">
    The model decodes your bytes using the `encoding` and `sample_rate` you declared in the connection. Our server **might not error** if these parameters are incorrect.

    You can validate your parameters by saving your audio data and playing it back with [ffplay](https://ffmpeg.org/ffplay.html):

    ```bash theme={null}
    # encoding=pcm_s16le
    # sample_rate=16000
    # 1 channel (the API expects mono)
    ffplay -f s16le -ar 16000 -ac 1 audio.raw

    # general format
    ffplay -f <encoding_without_pcm_prefix> -ar <sample_rate> -ac <num_channels_must_be_one> <file_path>
    ```

    If the playback sounds wrong (it should be quite obvious), then your `encoding` or `sample_rate` doesn't match the data. Correct it so your audio plays back cleanly, then send those same values to the API.

    See [STT Input Audio Encodings](/build-with-cartesia/capability-guides/stt-input-encodings) for help finding the right parameters.
  </Accordion>

  <Accordion title="Are you finalizing too often?">
    Make sure you're only sending `finalize` after the user is finished speaking. Finalizing mid-speech will produce transcription errors.
  </Accordion>
</AccordionGroup>
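
As a complement to the `ffplay` check in "Are you using the right sample rate and encoding?", you can compare the duration implied by your declared parameters with how long the recording actually lasted. This is a sketch assuming mono `pcm_s16le` data: `implied_duration_s` and `write_wav` are hypothetical helpers built on Python's standard `wave` module, and the zero-filled buffer stands in for your captured audio.

```python
import wave


def implied_duration_s(num_bytes: int, sample_rate: int, bytes_per_sample: int = 2) -> float:
    """Duration implied by the declared parameters for mono PCM of this size."""
    return num_bytes / (sample_rate * bytes_per_sample)


def write_wav(raw: bytes, path: str, sample_rate: int, bytes_per_sample: int = 2) -> None:
    """Wrap raw mono PCM in a WAV header so any audio player can open it."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(bytes_per_sample)
        wav.setframerate(sample_rate)
        wav.writeframes(raw)


raw = bytes(48000 * 2 * 3)  # stand-in for 3 s captured at 48 kHz, pcm_s16le
print(implied_duration_s(len(raw), 48000))  # 3.0
print(implied_duration_s(len(raw), 16000))  # 9.0
```

A three-second recording that comes out as nine seconds under your declared `sample_rate` suggests the declared rate is too low by the same factor. `write_wav(raw, "check.wav", 48000)` produces a file you can open in any player if `ffplay` is not available.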

### Realtime STT (Manual): High latency

<AccordionGroup>
  <Accordion title="Are you sending the finalize command?">
    Transcription is triggered by the `finalize` command. Send it to finalize the turn as soon as your user signals that they are done speaking, or your voice activity detection (VAD) reports that they have stopped:

    ```python theme={null}
    await websocket.send("finalize")
    ```

    Without it, the model falls back to silence-based auto-finalization. That's slower by design: it waits out a pause to be sure the user is done.

    Send `finalize` as many times as necessary. Don't confuse it with `close`, which ends the session permanently.

    You must only send `finalize` at sensible moments in the audio stream. Finalizing mid-speech will produce transcription errors.
  </Accordion>

  <Accordion title="Are you using the right endpoint?">
    If you don't know when your user starts and stops speaking,
    try [Realtime STT (Auto)](/api-reference/stt/turns/websocket)
    to allow our model to detect turn boundaries and emit final transcripts as soon as your user is done speaking.

    Switching from "manual" to "auto" will improve final transcript latency out of the box,
    because the "manual" endpoint holds on to the last transcript chunk of user speech in the expectation that your client will send `finalize`.

    The "auto" endpoint does not expect your client to send anything besides audio and will send the final transcript in a `turn.end` event as soon as it's ready.
  </Accordion>

  <Accordion title="Did you stop sending audio?">
    Our API expects a continuous stream of audio.
    If you stop sending audio, the server will wait for more audio chunks to arrive rather than assuming that the user is silent.

    This is normally desired behavior to handle network lag, but it does mean that your client needs to send silence (all zeros) when your audio input is muted.
  </Accordion>
</AccordionGroup>
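
If you have no VAD yet, the idea of sending `finalize` once the user stops speaking can be prototyped with a simple energy threshold. Everything below is a hypothetical sketch, not a Cartesia API and not a substitute for a real VAD: `rms`, `EndOfSpeechDetector`, the threshold of 500, and the quiet-chunk count are all assumptions, and the audio is assumed to be mono `pcm_s16le`.

```python
import math
import struct


def rms(chunk: bytes) -> float:
    """Root-mean-square amplitude of a mono pcm_s16le chunk."""
    n = len(chunk) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", chunk[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


class EndOfSpeechDetector:
    """Returns True once, after speech is followed by enough consecutive quiet chunks."""

    def __init__(self, threshold: float = 500.0, quiet_chunks_needed: int = 5):
        self.threshold = threshold
        self.quiet_chunks_needed = quiet_chunks_needed
        self.heard_speech = False
        self.quiet_run = 0

    def feed(self, chunk: bytes) -> bool:
        if rms(chunk) >= self.threshold:
            self.heard_speech = True
            self.quiet_run = 0
            return False
        self.quiet_run += 1
        if self.heard_speech and self.quiet_run >= self.quiet_chunks_needed:
            self.heard_speech = False  # re-arm for the next turn
            return True
        return False


loud = struct.pack("<4h", 4000, -4000, 4000, -4000)
quiet = bytes(8)
detector = EndOfSpeechDetector(quiet_chunks_needed=2)
print([detector.feed(c) for c in [quiet, loud, quiet, quiet, quiet]])
# [False, False, False, True, False]
```

Feed each chunk to the detector as you send it, and call `await websocket.send("finalize")` when `feed` returns `True`. Requiring speech before silence avoids finalizing during leading silence, and a longer quiet window lowers the risk of finalizing mid-speech.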

### Realtime STT (Manual): Server errors

<Accordion title="Are you chunking audio?">
  Our realtime WebSocket endpoints expect audio to arrive at roughly the rate it's spoken.
  Pushing a large batch of audio into the socket at once can overload the server-side buffer,
  which may surface as an internal server error.

  Stream in small chunks (50–200ms each) and pace them to realtime, averaging one second of audio sent per second of wall-clock time. Here's a [JavaScript example](https://github.com/cartesia-ai/cartesia-js/blob/v3.2.0/examples/browser_examples.ts#L29-L66).

  To transcribe a complete file in one shot, consider using [Batch STT](/api-reference/stt/transcribe), which takes the whole file in a single request.
</Accordion>
