Turn Detection

The Realtime Speech-to-Text (Auto) API detects user turns for you, so your voice agent doesn’t need its own voice activity detector (VAD). Traditional VAD solutions decide whether someone is speaking from audio energy alone. Our model detects turns semantically: it also considers the conversational context and whether the user’s sentence is linguistically complete. It can tell whether the user is hesitating, thinking mid-sentence, or genuinely done talking.

Turn lifecycle

Between user turns, the session is idle. When the user begins speaking, turn.start fires, followed by turn.update events as the transcript builds. The API emits the following events to describe the state of the conversation.

Event	Fires when	Carries transcript?
`turn.start`	The user begins speaking.	No
`turn.update`	Repeatedly, as the model transcribes the user’s speech.	Yes
`turn.eager_end`	The model predicts the user might be done speaking.	Yes
`turn.resume`	The user turn is continuing and the last `turn.eager_end` event should be ignored.	No
`turn.end`	The user turn is definitively complete.	Yes

The lifecycle comes with a few guarantees:

The first event in every turn is turn.start.
turn.eager_end is always followed by turn.end or turn.resume.
turn.resume only fires after a preceding turn.eager_end.
turn.end always closes a turn; the next turn begins with a new turn.start.
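
The guarantees above can be checked mechanically, which is handy in client-side tests. The sketch below is a hypothetical helper (check_lifecycle is our own name, not part of the API) that walks a list of event types and raises if the sequence breaks a rule. It reads "followed by" as "immediately followed by", which matches the example turn in this section but is an interpretation, and it assumes only the five event types in the table occur.

```python
def check_lifecycle(event_types):
    """Raise ValueError if a sequence of event types breaks the lifecycle guarantees."""
    in_turn = False        # True between turn.start and turn.end
    eager_pending = False  # True right after turn.eager_end
    for i, kind in enumerate(event_types):
        if kind == "turn.start":
            if in_turn:
                raise ValueError(f"event {i}: turn.start inside an open turn")
            in_turn = True
        elif not in_turn:
            raise ValueError(f"event {i}: {kind} before turn.start")
        elif kind == "turn.resume":
            if not eager_pending:
                raise ValueError(f"event {i}: turn.resume without turn.eager_end")
            eager_pending = False
        elif eager_pending and kind != "turn.end":
            raise ValueError(f"event {i}: {kind} directly after turn.eager_end")
        elif kind == "turn.eager_end":
            eager_pending = True
        elif kind == "turn.end":
            in_turn = False
            eager_pending = False
        elif kind != "turn.update":
            raise ValueError(f"event {i}: unknown event type {kind}")


# A valid turn passes silently.
check_lifecycle(["turn.start", "turn.update", "turn.eager_end", "turn.end"])
```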

Transcript behavior

The transcript property is cumulative within a turn: it contains the full text transcribed so far in this user turn, not a delta. You do not need to concatenate partial results across events. All emitted text is final: the model never revises text it has already sent. You can use the partial transcript from a turn.update the moment it arrives without worrying about it changing.
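
If you only want the newly added text, for example to append to a live caption, you can derive it from consecutive transcripts. This is a minimal sketch, assuming that "cumulative" plus "never revised" means each transcript begins with the previous one; new_text is a hypothetical helper, not an API call.

```python
def new_text(previous: str, current: str) -> str:
    """Return the text added since the previous cumulative transcript."""
    if not current.startswith(previous):
        # Should not happen if emitted text is never revised.
        raise ValueError("transcript is not an extension of the previous one")
    return current[len(previous):]


seen = ""
for transcript in ["Hi I", "Hi I need to", "Hi I need to cancel"]:
    print(repr(new_text(seen, transcript)))  # 'Hi I', then ' need to', then ' cancel'
    seen = transcript
```

Reset the remembered transcript to an empty string on every turn.start, since the transcript is cumulative only within a turn.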

Example: one turn

The user says “Hi I need to cancel my subscription please.”

{ "type": "turn.start" }
{ "type": "turn.update",    "transcript": "Hi I" }
{ "type": "turn.update",    "transcript": "Hi I need to" }
{ "type": "turn.eager_end", "transcript": "Hi I need to cancel" }
{ "type": "turn.resume" }
{ "type": "turn.update",    "transcript": "Hi I need to cancel my subscription" }
{ "type": "turn.eager_end", "transcript": "Hi I need to cancel my subscription please." }
{ "type": "turn.end",       "transcript": "Hi I need to cancel my subscription please." }

The first turn.eager_end fires early, after “cancel,” but the user keeps going, so turn.resume follows. The second turn.eager_end is correct, and turn.end confirms it.

Configuring turn detection

Turn detection is driven by a single signal: at each moment, the model estimates the likelihood that the user is in an active turn, a value between 0 and 1. The state machine compares this likelihood against three thresholds to decide when to fire turn.start, turn.eager_end, and turn.end. Turn detection ships with defaults that work well for most voice agents. Tune it to optimize your agent’s conversational flow for your use case, balancing latency against how accurately the model detects where each turn starts and ends.

Parameter	Description	Default	Range
`start_threshold`	Likelihood above which the model emits `turn.start`.	`0.8`	`0.5`–`0.9`
`eager_end_threshold`	Likelihood below which the model emits `turn.eager_end`, an early signal that the user may be done speaking.	`0.4`	`0.3`–`0.6`
`end_threshold`	Likelihood below which the model emits `turn.end`, signaling that the user turn is complete.	`0.2`	`0.05`–`0.5`
`end_timeout_ms`	Maximum time in milliseconds to wait after the user stops speaking before emitting `turn.end`, even if the likelihood does not fall below `end_threshold`.	`5600`	`640`–`11200`

Each parameter trades latency against accuracy. The table below shows what moving it in either direction does.

Parameter	Raise it to	Lower it to
`start_threshold`	Require stronger evidence that the user has started speaking, reducing false starts on background noise but reacting more slowly when the user really does start	Get faster, more responsive interruption handling, at the risk of false starts
`eager_end_threshold`	Trigger eager ends sooner, improving perceived latency but increasing the chance that the user continues speaking	Make eager ends more conservative, reducing false eager ends and cancellations when users pause mid-thought
`end_threshold`	End turns faster, improving latency but increasing the risk of ending too early	Wait for stronger evidence that the user is finished, reducing mid-thought turn endings for hesitant or slow speakers
`end_timeout_ms`	Give users more time to pause, think, or continue before the agent responds	Enforce a stricter latency bound when the model remains uncertain

The three thresholds are strictly ordered: start_threshold > eager_end_threshold > end_threshold. In addition to the ranges above, each value is constrained by its neighbors to preserve this ordering, so you cannot set an eager end threshold above the start threshold or an end threshold above the eager end threshold. Set these per connection with query parameters (turn_start_threshold, turn_eager_end_threshold, turn_end_threshold, turn_end_timeout_ms), or change them mid-session by sending a config command.
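
Checking the ranges and the ordering on the client catches a bad configuration before you connect. The sketch below validates the four values against the table above and encodes them with the query parameter names listed in this section. The function name turn_query is hypothetical, and treating the range endpoints as inclusive is an assumption.

```python
from urllib.parse import urlencode

# Allowed ranges from the parameter table (endpoints assumed inclusive).
RANGES = {
    "start_threshold": (0.5, 0.9),
    "eager_end_threshold": (0.3, 0.6),
    "end_threshold": (0.05, 0.5),
    "end_timeout_ms": (640, 11200),
}


def turn_query(start_threshold=0.8, eager_end_threshold=0.4,
               end_threshold=0.2, end_timeout_ms=5600) -> str:
    """Validate turn detection settings and return them as a URL query string."""
    values = {
        "start_threshold": start_threshold,
        "eager_end_threshold": eager_end_threshold,
        "end_threshold": end_threshold,
        "end_timeout_ms": end_timeout_ms,
    }
    for name, value in values.items():
        low, high = RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low}-{high}")
    if not start_threshold > eager_end_threshold > end_threshold:
        raise ValueError("thresholds must satisfy start > eager_end > end")
    return urlencode({f"turn_{name}": value for name, value in values.items()})


print(turn_query(end_timeout_ms=8000))
# turn_start_threshold=0.8&turn_eager_end_threshold=0.4&turn_end_threshold=0.2&turn_end_timeout_ms=8000
```

Append the resulting string to your connection URL after a "?". The URL itself is not specified here.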

Common configurations

Below are some example configurations that make useful starting points. Pick the one closest to your use case, then tune individual thresholds from there.

Profile	`start_threshold`	`eager_end_threshold`	`end_threshold`	`end_timeout_ms`
Balanced (default)	`0.8`	`0.4`	`0.2`	`5600`
Responsive	`0.7`	`0.5`	`0.4`	`4500`
Patient	`0.8`	`0.3`	`0.1`	`8000`

Balanced is a good place to start and works well for many voice agent conversations.
Responsive is best when latency is a high priority, such as fast conversational back-and-forth.
Patient is best when accuracy is a high priority, such as when users pause to think or look up information, or where cutting them off mid-turn is costly.
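
One way to follow the "pick a profile, then tune" advice in code is to keep the profiles in a dictionary and override individual values. This is only a sketch: PROFILES and settings_for are hypothetical names, and the numbers are copied from the table above.

```python
PROFILES = {
    "balanced": {"start_threshold": 0.8, "eager_end_threshold": 0.4,
                 "end_threshold": 0.2, "end_timeout_ms": 5600},
    "responsive": {"start_threshold": 0.7, "eager_end_threshold": 0.5,
                   "end_threshold": 0.4, "end_timeout_ms": 4500},
    "patient": {"start_threshold": 0.8, "eager_end_threshold": 0.3,
                "end_threshold": 0.1, "end_timeout_ms": 8000},
}


def settings_for(profile: str, **overrides) -> dict:
    """Copy a named profile and apply per-parameter overrides."""
    settings = dict(PROFILES[profile])
    unknown = set(overrides) - set(settings)
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    settings.update(overrides)
    return settings


# Start from Patient, but allow even longer pauses.
print(settings_for("patient", end_timeout_ms=10000))
```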

Example code

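Handle turn.start and turn.end to get a working agent: interrupt when the user starts speaking, and generate a reply when they finish. In the snippets below, websocket, tts, and llm are placeholders for your own connection, speech synthesis, and language model clients.

import json

async for message in websocket:
    event = json.loads(message)
    if event["type"] == "turn.start":
        # The user started speaking: stop any agent speech in progress.
        tts.interrupt()
    elif event["type"] == "turn.end":
        # The user is done: generate and speak a reply.
        reply = llm.generate(event["transcript"])
        tts.speak(reply)

To cut latency, also handle turn.eager_end: start generating a reply the moment it fires, then cancel that work if turn.resume arrives, or play it the instant turn.end confirms the user is done.

import json

pending_reply = None  # speculative reply started on turn.eager_end

async for message in websocket:
    event = json.loads(message)
    match event["type"]:
        case "turn.start":
            tts.interrupt()

        case "turn.eager_end":
            # Start generating early; the turn may still resume.
            pending_reply = llm.generate_async(event["transcript"])

        case "turn.resume":
            # The user kept talking: discard the speculative reply.
            if pending_reply is not None:
                pending_reply.cancel()
                pending_reply = None

        case "turn.end":
            if pending_reply is not None:
                # Assumes generate_async returns an awaitable, cancellable handle.
                reply = await pending_reply
                pending_reply = None
            else:
                reply = llm.generate(event["transcript"])
            tts.speak(reply)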
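
The snippets above rely on placeholder websocket, llm, and tts objects, so they cannot run on their own. The following self-contained sketch replays the example turn from this page through the same logic, using asyncio tasks for the speculative reply. fake_llm and run_agent are stand-ins invented for illustration, not part of any SDK.

```python
import asyncio


async def fake_llm(transcript: str) -> str:
    """Stand-in for a real LLM call; sleeps briefly to mimic latency."""
    await asyncio.sleep(0.01)
    return f"reply to: {transcript}"


async def run_agent(events) -> list[str]:
    """Process turn events and return the replies that would be spoken."""
    spoken = []
    pending = None  # asyncio.Task holding the speculative reply, if any
    for event in events:
        kind = event["type"]
        if kind == "turn.eager_end":
            pending = asyncio.create_task(fake_llm(event["transcript"]))
        elif kind == "turn.resume" and pending is not None:
            pending.cancel()  # the user kept talking; drop the speculative work
            pending = None
        elif kind == "turn.end":
            if pending is None:
                pending = asyncio.create_task(fake_llm(event["transcript"]))
            spoken.append(await pending)
            pending = None
        await asyncio.sleep(0)  # yield control, as reading from a socket would
    return spoken


EVENTS = [
    {"type": "turn.start"},
    {"type": "turn.update", "transcript": "Hi I need to"},
    {"type": "turn.eager_end", "transcript": "Hi I need to cancel"},
    {"type": "turn.resume"},
    {"type": "turn.eager_end", "transcript": "Hi I need to cancel my subscription please."},
    {"type": "turn.end", "transcript": "Hi I need to cancel my subscription please."},
]

print(asyncio.run(run_agent(EVENTS)))
# ['reply to: Hi I need to cancel my subscription please.']
```

The reply started at the first turn.eager_end is cancelled and never spoken; only the reply started at the second one reaches the output.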

Edge cases

No audio vs silence

Our API expects a continuous stream of audio. If you stop sending audio, the server waits for more audio chunks to arrive rather than assuming that the user is silent. This is normally the desired behavior, because it tolerates network lag, but it does mean that your client needs to send silence (all zeros) whenever your audio input is muted.
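
Generating silence is straightforward when the stream is signed linear PCM, where zero-valued samples are silent. The sketch below assumes 16-bit mono PCM at 16 kHz and 20 ms chunks; these are illustrative values, so substitute the audio format your session actually uses. The helper name silence is hypothetical.

```python
def silence(duration_ms: int, sample_rate: int = 16000, bytes_per_sample: int = 2) -> bytes:
    """Return zero-valued mono PCM audio of the given duration."""
    frames = sample_rate * duration_ms // 1000
    return bytes(frames * bytes_per_sample)  # bytes(n) is n zero bytes


chunk = silence(20)
print(len(chunk))  # 640
```

While the microphone is muted, send one such chunk per chunk interval in place of the captured audio.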

Draining events

The server buffers some audio to improve transcription accuracy. Once you are done sending all audio for a session, send {"type": "close"} to tell the model to flush any buffered audio and emit the remaining events. The server closes the socket for you once the model is done. If you don’t send the close command, or if you stop reading messages early, that buffered audio will not be processed. This is acceptable only if you don’t care about the last second of audio.

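import json

turns = []

# Ask the server to flush buffered audio and emit the remaining events.
await websocket.send(json.dumps({"type": "close"}))
# Keep reading until the server closes the socket.
async for message in websocket:
    event = json.loads(message)
    if event["type"] == "turn.end":
        turns.append(event["transcript"])
        # do not stop reading from the websocket!
print("server closed the connection")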

Joining transcripts

The transcript field is cumulative within a turn: each turn.update, turn.eager_end, and turn.end event already holds the full text of the turn so far. If you only care about the final transcript, take the transcript property from each turn.end, one per completed turn. Join the transcripts verbatim. Never strip() them, normalize them, or add your own separators.

import json

full_audio_transcript = ""
turns: list[str] = []
async for message in websocket:
    event = json.loads(message)
    if event["type"] == "turn.end":
        # transcripts across turns should be
        # concatenated without formatting!
        full_audio_transcript += event["transcript"]

        # per-turn transcript
        turns.append(event["transcript"])

Concatenating transcripts from turn.update and turn.eager_end events is a classic source of duplicated text: because each update is cumulative, joining them repeats parts of the transcript. Consider turn.update and turn.eager_end as updates to the turn state, not transcript chunks. Read turn.end only for the final transcript.
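
To see the duplication concretely, the short sketch below joins a hand-written list of events both ways. The events are made up for illustration and follow the shapes shown in the example turn on this page.

```python
events = [
    {"type": "turn.update", "transcript": "Hi I"},
    {"type": "turn.update", "transcript": "Hi I need to"},
    {"type": "turn.end", "transcript": "Hi I need to cancel my subscription please."},
]

# Wrong: every cumulative transcript is appended, so earlier text repeats.
wrong = "".join(e["transcript"] for e in events)

# Right: only the turn.end transcript contributes to the final text.
right = "".join(e["transcript"] for e in events if e["type"] == "turn.end")

print(wrong)  # Hi IHi I need toHi I need to cancel my subscription please.
print(right)  # Hi I need to cancel my subscription please.
```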

Where to go next

Try it out online

See turn detection in action with no sign-up or code required

Use the API

Start building with our Realtime STT API

Use the SDK

Take a look at some real code
