Realtime STT (Manual)

Tips for other endpoints can be found on the top-level troubleshooting page.

Transcript errors

Are you joining transcripts correctly?

Each transcript event carries a delta since the last final transcript, not the full transcript for the audio. Append the text from every event where is_final is true:

import json

transcript = ""
async for message in websocket:
    event = json.loads(message)
    if event["type"] == "transcript" and event["is_final"]:
        # delta, appended exactly as received
        transcript += event["text"]

Be sure to include all transcript events where is_final is true.

{ "type": "transcript", "is_final": false, "text": "Ignore this" }
{ "type": "transcript", "is_final": true, "text": "This is a" }
{ "type": "transcript", "is_final": true, "text": " single sentence." }
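As a small offline check of the joining rule, the sketch below runs it over the three sample events above. The helper name `join_final_transcripts` is illustrative and not part of the API.

```python
def join_final_transcripts(events):
    """Concatenate the text of final transcript events exactly as received."""
    parts = []
    for event in events:
        if event.get("type") == "transcript" and event.get("is_final"):
            # No strip() and no added separator: the delta already
            # carries whatever whitespace it needs.
            parts.append(event["text"])
    return "".join(parts)


events = [
    {"type": "transcript", "is_final": False, "text": "Ignore this"},
    {"type": "transcript", "is_final": True, "text": "This is a"},
    {"type": "transcript", "is_final": True, "text": " single sentence."},
]
print(join_final_transcripts(events))  # prints: This is a single sentence.
```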

Do not trim text

"Trimming may"
" join words."

"Trimming mayjoin words."

Do not join text with a space in between

"Insert"
"ing spaces is not safe"

"Insert ing spaces is not safe"

Did you drain all events?

Once you are done sending all audio for a session, send "close" to tell the model to flush any buffered audio and emit the remaining transcript events. The server will send { "type": "done" } after all audio has been processed, then close the socket for you.

The server buffers some audio to improve transcription accuracy. If you don’t send the close command, or if you stop reading messages early, that buffered audio will not be processed. This is okay if you don’t care about the last second of audio.

await websocket.send("close")
async for message in websocket:
    event = json.loads(message)
    if event["type"] == "transcript" and event["is_final"]:
        transcript += event["text"]
    elif event["type"] == "done":
        print("done! expect the server to close the connection soon with code=1000")
        # optional: stop reading messages and close the socket yourself
print("server closed the connection now")

Did you specify the language?

Be sure to include ?language=xx (replace xx with an ISO 639-1 language code) as a query param when establishing your WebSocket connection. This endpoint does not support language detection yet.

See Models for supported languages.
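As a rough sketch of attaching the language parameter, the helper below builds a connection URL with the standard library. The base URL `wss://example.invalid/stt` and the name `build_stt_url` are placeholders, not values from this API. The check only confirms that the code has the two-letter shape of an ISO 639-1 code, not that the model supports it.

```python
from urllib.parse import urlencode


def build_stt_url(base_url, language, **extra_params):
    """Return base_url with ?language=xx (plus any extra query params)."""
    looks_like_iso_639_1 = (
        len(language) == 2
        and language.isascii()
        and language.isalpha()
        and language.islower()
    )
    if not looks_like_iso_639_1:
        raise ValueError(f"expected a two-letter ISO 639-1 code, got {language!r}")
    params = {"language": language, **extra_params}
    return f"{base_url}?{urlencode(params)}"


# Placeholder base URL; substitute the real endpoint for your account.
url = build_stt_url("wss://example.invalid/stt", "en")
```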

Are you using the right sample rate and encoding?

The model decodes your bytes using the encoding and sample_rate you declared in the connection. Our server might not error if these parameters are incorrect.

You can validate your parameters by saving your audio data and playing it back with ffplay:

# encoding=pcm_s16le
# sample_rate=16000
# 1 channel (the API expects mono)
ffplay -f s16le -ar 16000 -ac 1 audio.raw

# general format
ffplay -f <encoding_without_pcm_prefix> -ar <sample_rate> -ac <num_channels_must_be_one> <file_path>

If the playback sounds wrong (it should be quite obvious), then your encoding or sample_rate doesn’t match the data. Correct it so your audio plays back cleanly, then send those same values to the API.

See Audio Input for help finding the right parameters.
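A complementary check that needs no playback is to compare the duration implied by your byte count with how long you actually recorded. This is a minimal sketch, assuming raw mono pcm_s16le audio (2 bytes per sample); the function name is illustrative. If the result is off by a factor such as 2, the declared sample rate or sample width probably does not match the data.

```python
def expected_duration_seconds(num_bytes, sample_rate, bytes_per_sample=2, channels=1):
    """Duration implied by a raw PCM byte count under the declared parameters."""
    return num_bytes / (sample_rate * bytes_per_sample * channels)


# 32,000 bytes declared as 16 kHz, 16-bit, mono
print(expected_duration_seconds(32_000, 16_000))  # prints: 1.0
# The same bytes declared as 8 kHz would claim to be twice as long
print(expected_duration_seconds(32_000, 8_000))   # prints: 2.0
```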

Are you finalizing too often?

Make sure you’re only sending finalize after the user is finished speaking. Finalizing mid-speech will produce transcription errors.

High latency

Are you sending the finalize command?

Transcription is triggered by the finalize command. To “finalize the turn”, send it once your user signals that they are done speaking, or once voice activity detection (VAD) detects that they have stopped:

await websocket.send("finalize")

Without it, the model falls back to silence-based auto-finalization. That’s slower by design: it waits out a pause to be sure the user is done.

Send finalize as many times as necessary. Do not confuse it with close, which ends the session permanently.

Only send finalize at sensible moments in the audio stream. Finalizing mid-speech will produce transcription errors.
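One way to pick those moments from a client-side VAD is to wait for a short run of silent frames after speech, so that a brief pause inside a sentence does not trigger finalize. The sketch below is an illustration only: it assumes a hypothetical VAD that yields one boolean per audio frame, and `min_silence_frames=3` is an arbitrary value you would tune for your frame size.

```python
def finalize_points(speech_flags, min_silence_frames=3):
    """Return frame indices at which a client would send "finalize".

    A finalize is due once per turn: after at least one speech frame,
    followed by min_silence_frames consecutive silent frames.
    """
    points = []
    in_turn = False      # speech heard since the last finalize
    silence_run = 0      # consecutive silent frames inside the current turn
    for index, is_speech in enumerate(speech_flags):
        if is_speech:
            in_turn = True
            silence_run = 0
        elif in_turn:
            silence_run += 1
            if silence_run >= min_silence_frames:
                points.append(index)
                in_turn = False
                silence_run = 0
    return points


flags = [0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
print(finalize_points(flags))  # prints: [7, 11]
```

The single silent frame at index 3 is ignored because speech resumes before the threshold is reached.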

Are you using the right endpoint?

If you don’t know when your user starts and stops speaking, try Realtime STT (Auto) to allow our model to detect turn boundaries and emit final transcripts as soon as your user is done speaking.

Switching from “manual” to “auto” will improve final transcript latency out-of-the-box, since the “manual” endpoint will hang onto the last transcript chunk from user speech in expectation that your client will send finalize.

The “auto” endpoint does not expect your client to send anything besides audio and will send the final transcript in a turn.end event as soon as it’s ready.

Did you stop sending audio?

Our API expects a continuous stream of audio. If you stop sending audio, the server will wait for more audio chunks to arrive rather than assuming that the user is silent.

This is normally desired behavior to handle network lag, but it does mean that your client needs to send silence (all zeros) when your audio input is muted.
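A silence chunk for a muted input can be generated as below. This is a minimal sketch, assuming signed PCM such as pcm_s16le (2 bytes per sample, mono), where all-zero bytes decode to silence; the function name is illustrative.

```python
def silence_chunk(sample_rate, duration_ms, bytes_per_sample=2):
    """Return duration_ms of all-zero audio for signed PCM, mono."""
    num_samples = sample_rate * duration_ms // 1000
    return bytes(num_samples * bytes_per_sample)  # bytes(n) is n zero bytes


chunk = silence_chunk(16_000, 100)
print(len(chunk))  # prints: 3200
```

While muted, send one such chunk per chunk interval in place of microphone audio, so the stream stays continuous.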

Server errors

Are you chunking audio?

Our realtime WebSocket endpoints expect audio to arrive at roughly the rate it’s spoken. Pushing a large batch of audio into the socket at once can overload the server-side buffer, which may surface as an internal server error.

Stream in small chunks (~100 ms each) and pace them to realtime, averaging one second of audio sent per second of wall-clock time. Here’s a JavaScript example.

To transcribe a complete file in one shot, consider using Batch STT, which takes the whole file in a single request.
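The chunking and pacing advice above can be sketched in Python roughly as follows. The names `chunk_audio` and `stream_paced` are illustrative, `send` stands for any awaitable sender such as your WebSocket client's send method, and the defaults assume mono pcm_s16le. Pacing against a fixed start time, rather than sleeping a fixed amount per chunk, keeps the average rate at realtime even when individual sends are slow.

```python
import asyncio
import time


def chunk_audio(audio, sample_rate, chunk_ms=100, bytes_per_sample=2):
    """Yield consecutive slices of raw PCM, each about chunk_ms long."""
    chunk_bytes = sample_rate * chunk_ms // 1000 * bytes_per_sample
    for start in range(0, len(audio), chunk_bytes):
        yield audio[start:start + chunk_bytes]


async def stream_paced(send, audio, sample_rate, chunk_ms=100, bytes_per_sample=2):
    """Send audio in chunks, averaging one second of audio per second."""
    started = time.monotonic()
    chunks = chunk_audio(audio, sample_rate, chunk_ms, bytes_per_sample)
    for index, chunk in enumerate(chunks):
        await send(chunk)
        # Sleep until the wall-clock moment at which the next chunk is due.
        next_due = started + (index + 1) * chunk_ms / 1000
        await asyncio.sleep(max(0.0, next_due - time.monotonic()))
```

With a connected client, a call would look like `await stream_paced(websocket.send, audio_bytes, 16000)`.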
