Realtime STT (Auto)

Tips for other endpoints can be found on the top-level troubleshooting page.

Transcript errors

Are you joining transcripts correctly?

The transcript field is cumulative within a turn: each turn.update, turn.eager_end, and turn.end event already holds the full text of the turn so far.

If you only care about the final transcript, take the transcript property from each turn.end event, one per completed turn. Join the transcripts verbatim. Never strip() them, normalize them, or add your own separators.

import json

full_audio_transcript = ""
turns: list[str] = []
async for message in websocket:
    event = json.loads(message)
    if event["type"] == "turn.end":
        # transcripts across turns should be
        # concatenated without formatting!
        full_audio_transcript += event["transcript"]

        # per-turn transcript
        turns.append(event["transcript"])

Concatenating transcripts from turn.update and turn.eager_end events is a classic source of duplicated text: because each update is cumulative, joining them repeats parts of the transcript. Treat turn.update and turn.eager_end as updates to the turn state, not as transcript chunks.

Read turn.end only for the final transcript.
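The difference is easiest to see on a small example. The sketch below assumes events have already been parsed into dicts with the type and transcript fields described above. The helper name final_transcript and the sample text (including where the whitespace falls) are made up for illustration and are not part of the API.

```python
def final_transcript(events: list[dict]) -> str:
    """Join the transcript of every turn.end event verbatim."""
    return "".join(e["transcript"] for e in events if e["type"] == "turn.end")


# Hypothetical event sequence covering two turns (text is invented).
events = [
    {"type": "turn.update", "transcript": "Hello"},
    {"type": "turn.update", "transcript": "Hello there."},
    {"type": "turn.end", "transcript": "Hello there."},
    {"type": "turn.update", "transcript": " How are"},
    {"type": "turn.end", "transcript": " How are you?"},
]

# Wrong: treats cumulative updates as chunks, so text is repeated.
naive = "".join(e["transcript"] for e in events)

# Right: exactly one transcript per completed turn, joined as-is.
correct = final_transcript(events)
```

Here naive contains "Hello" three times, while correct contains each turn's text once.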

Did you drain all events?

Once you are done sending all audio for a session, send {"type": "close"} to tell the model to flush any buffered audio and emit the remaining events. The server will close the socket for you once the model is done.

The server buffers some audio to improve transcription accuracy. If you don’t send the close command, or if you stop reading messages early, that buffered audio will not be processed. This is okay if you don’t care about the last second of audio.

await websocket.send(json.dumps({"type": "close"}))
async for message in websocket:
    event = json.loads(message)
    if event["type"] == "turn.end":
        turns.append(event["transcript"])
        # do not stop reading from the websocket!
print("server closed the connection")

Are you using a supported language?

Ink 2 only supports English right now. It has no concept of other languages and will try to transcribe everything as English.

Are you using the right sample rate and encoding?

The model decodes your bytes using the encoding and sample_rate you declared in the connection. Our server might not return an error if these parameters are incorrect.

You can validate your parameters by saving your audio data and playing it back with ffplay:

# encoding=pcm_s16le
# sample_rate=16000
# 1 channel (the API expects mono)
ffplay -f s16le -ar 16000 -ac 1 audio.raw

# general format
ffplay -f <encoding_without_pcm_prefix> -ar <sample_rate> -ac <num_channels_must_be_one> <file_path>

If the playback sounds wrong (it should be quite obvious), then your encoding or sample_rate doesn’t match the data. Correct it so your audio plays back cleanly, then send those same values to the API.

See Audio Input for help finding the right parameters.
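If your client is written in Python, you can derive the ffplay command from the same values you pass to the API, so the two cannot drift apart. This is a minimal sketch: ffplay_command is a hypothetical helper name, and it assumes that dropping the pcm_ prefix from the encoding yields a valid ffplay format name, as in the general format above.

```python
import shlex


def ffplay_command(encoding: str, sample_rate: int, path: str) -> list[str]:
    """Build an ffplay invocation for raw mono audio in the declared format."""
    fmt = encoding.removeprefix("pcm_")  # e.g. "pcm_s16le" -> "s16le"
    return ["ffplay", "-f", fmt, "-ar", str(sample_rate), "-ac", "1", path]


cmd = ffplay_command("pcm_s16le", 16000, "audio.raw")
print(shlex.join(cmd))
```

You can paste the printed command into a shell, or pass cmd to subprocess.run to start playback directly.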

High latency

Did you stop sending audio?

Our API expects a continuous stream of audio. If you stop sending audio, the server will wait for more audio chunks to arrive rather than assuming that the user is silent.

This is normally the desired behavior for handling network lag, but it does mean that your client needs to send silence (all zeros) when your audio input is muted.
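A silent chunk is just a zero-filled buffer of the right size. The sketch below assumes mono audio in a signed PCM encoding such as pcm_s16le (2 bytes per sample), where all-zero bytes decode as silence. The helper name silence_chunk is illustrative, not part of the API.

```python
def silence_chunk(sample_rate: int, duration_ms: int, bytes_per_sample: int = 2) -> bytes:
    """Return duration_ms of mono silence as zero-filled bytes."""
    num_samples = sample_rate * duration_ms // 1000
    return bytes(num_samples * bytes_per_sample)


# 100 ms of silence at 16 kHz, 16-bit mono: 1600 samples * 2 bytes.
chunk = silence_chunk(16000, 100)
```

While the input is muted, send a chunk like this at the same cadence as your normal audio chunks.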

Are you using the right endpoint?

If you’re building a push-to-talk style app (e.g. the user holds a button to speak) or you would like to “flush” the transcript at predetermined points (e.g. for certain evals), consider switching to Realtime STT (Manual).

Turn detection adds some delay to the final transcript carried by turn.end, on the order of half a second. If your setup allows for it, using the manual endpoint and sending "finalize" when the user is done speaking removes the latency overhead of turn detection.

Server errors

Are you chunking audio?

Our realtime WebSocket endpoints expect audio to arrive at roughly the rate it’s spoken. Pushing a large batch of audio into the socket at once can overload the server-side buffer, which may surface as an internal server error.

Stream in small chunks (~100 ms each) and pace them to realtime, averaging one second of audio sent per second of wall-clock time. Here’s a JavaScript example.

To transcribe a complete file in one shot, consider using Batch STT, which takes the whole file in a single request.
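For Python clients, the chunking and pacing can be sketched as follows. This assumes 16-bit mono PCM and an async send callable such as the send method of a websockets connection. The names iter_chunks and stream_paced are hypothetical helpers, not part of the API.

```python
import asyncio
import time
from collections.abc import Awaitable, Callable, Iterator


def iter_chunks(audio: bytes, sample_rate: int, chunk_ms: int = 100,
                bytes_per_sample: int = 2) -> Iterator[bytes]:
    """Split raw mono audio into chunks of roughly chunk_ms each."""
    chunk_bytes = sample_rate * chunk_ms // 1000 * bytes_per_sample
    for start in range(0, len(audio), chunk_bytes):
        yield audio[start:start + chunk_bytes]


async def stream_paced(audio: bytes, send: Callable[[bytes], Awaitable[None]],
                       sample_rate: int = 16000, chunk_ms: int = 100) -> None:
    """Send audio in chunks, averaging one second of audio per second."""
    start = time.monotonic()
    for i, chunk in enumerate(iter_chunks(audio, sample_rate, chunk_ms)):
        await send(chunk)
        # Sleep until this chunk's audio would have finished playing, measured
        # from the start, so that per-send overhead does not accumulate.
        deadline = start + (i + 1) * chunk_ms / 1000
        await asyncio.sleep(max(0.0, deadline - time.monotonic()))
```

A call might look like await stream_paced(pcm_bytes, websocket.send). Scheduling against a fixed start time, rather than sleeping a flat 100 ms after each send, keeps the average rate at realtime even when individual sends are slow.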
