Turn Events

The Realtime Speech-to-Text API organizes transcription around user turns, not raw transcript segments. The model itself signals when a user turn begins and ends, so your voice agent reacts to events rather than running its own voice activity detection. The end result is a more human-like voice agent that:

Handles pauses and phone numbers
Pulls in context from the conversation for more accuracy

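import json

async def handle_events(websocket, llm, tts):
    # websocket, llm, and tts stand in for your own client objects.
    async for message in websocket:
        event = json.loads(message)
        if event["type"] == "turn.start":
            # The user started speaking: stop any agent speech in progress.
            tts.interrupt()
        elif event["type"] == "turn.end":
            # The user turn is complete: generate and speak a reply.
            reply = llm.generate(event["transcript"])
            tts.speak(reply)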

Events

The API emits these events to describe the state of the conversation.

Event	Fires when	Carries transcript?
`turn.start`	The user begins speaking.	No
`turn.update`	Repeatedly, as the model transcribes the user’s speech.	Yes
`turn.eager_end` [PREVIEW]	The model predicts the user might be done speaking.	Yes
`turn.resume` [PREVIEW]	The user turn is continuing and the last `turn.eager_end` event should be ignored.	No
`turn.end`	The user turn is definitively complete.	Yes

Every transcript field is cumulative within a turn: it contains the full text transcribed so far in this user turn, not a delta. You do not need to concatenate partial results across events.

All emitted text is final. Later events only append to the transcript; the model never revises text it has already sent. You can use the partial transcript from a turn.update the moment it arrives without worrying about it changing.

A separate connected event fires once when the WebSocket is established. You do not need to wait for it before sending audio.
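
Because transcripts are cumulative and append-only, a client that wants only the newly transcribed words (for example, to stream them into a UI) can slice them off the previous transcript. The helper below is a minimal sketch; the name `appended_text` is hypothetical and not part of the API.

```python
def appended_text(previous: str, current: str) -> str:
    """Return the text added since the previous transcript in the same turn.

    Relies on the documented behavior that later events only append to the
    transcript. Raises if that assumption does not hold for the given inputs.
    """
    if not current.startswith(previous):
        raise ValueError("current transcript does not extend the previous one")
    return current[len(previous):]


# Transcripts taken from consecutive events in one turn.
first_delta = appended_text("", "Hi I")
second_delta = appended_text("Hi I", "Hi I need to")
```

Reset the previous transcript to an empty string whenever a new turn.start arrives, since the transcript is cumulative only within a turn.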

State machine

Between user turns, the session is idle. When the user begins speaking, turn.start fires, followed by turn.update events as the transcript builds. The turn ends in one of two ways:

Confident end — turn.end fires directly. The user turn is over.
Eager end — turn.eager_end fires first, flagging that the user might be done. Then either turn.end confirms the user turn is over, or turn.resume fires and the user turn continues.
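
The flow described above can be sketched as a small transition table. This is one reading of the prose rather than an official specification: the state names (`IDLE`, `SPEAKING`, `EAGER_ENDED`) and the helpers `next_state` and `run` are hypothetical, and the `connected` event is not modeled, so filter it out before feeding events in.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"                # between user turns
    SPEAKING = "speaking"        # inside a user turn
    EAGER_ENDED = "eager_ended"  # turn.eager_end seen, awaiting end or resume


# (current state, event type) -> next state
TRANSITIONS = {
    (State.IDLE, "turn.start"): State.SPEAKING,
    (State.SPEAKING, "turn.update"): State.SPEAKING,
    (State.SPEAKING, "turn.eager_end"): State.EAGER_ENDED,
    (State.SPEAKING, "turn.end"): State.IDLE,
    (State.EAGER_ENDED, "turn.resume"): State.SPEAKING,
    (State.EAGER_ENDED, "turn.end"): State.IDLE,
}


def next_state(state: State, event_type: str) -> State:
    """Advance the state machine, rejecting transitions the table does not list."""
    try:
        return TRANSITIONS[(state, event_type)]
    except KeyError:
        raise ValueError(f"unexpected {event_type!r} in state {state.name}") from None


def run(event_types, state: State = State.IDLE) -> State:
    """Apply a sequence of event types and return the final state."""
    for event_type in event_types:
        state = next_state(state, event_type)
    return state


final_state = run([
    "turn.start", "turn.update", "turn.eager_end",
    "turn.resume", "turn.update", "turn.eager_end", "turn.end",
])
```

Tracking the state this way makes it easy to log or reject an out-of-order event during development instead of silently mishandling it.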

Using `turn.eager_end` to cut latency [PREVIEW]

turn.eager_end lets your agent start generating a reply before the model is confident that the user is done speaking. The moment it fires, send the transcript to your LLM so that a response is ready to play the instant turn.end arrives.

This optimization might not be necessary for your use case. We recommend building around turn.start and turn.end first, then adding the other events as your agent matures.

Two things can happen after turn.eager_end:

turn.resume: the user kept talking. Cancel any in-progress LLM and TTS generation and wait for the turn to end.
turn.end: the user really is done. Play the prepared response.

This is a preview feature as we’re still tuning how turn.eager_end and turn.resume work.

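import json

async def handle_events(websocket, llm, tts):
    # Assumes llm.generate_async returns an awaitable, cancellable task.
    pending_reply = None  # speculative LLM generation in flight, if any

    async for message in websocket:
        event = json.loads(message)
        match event["type"]:
            case "turn.eager_end":
                # The user might be done: start generating speculatively.
                pending_reply = llm.generate_async(event["transcript"])

            case "turn.resume":
                # The user kept talking: discard the speculative reply.
                if pending_reply:
                    pending_reply.cancel()
                pending_reply = None

            case "turn.end":
                if pending_reply:
                    # Wait for the speculative reply to finish, then speak it.
                    confident_reply = await pending_reply
                    pending_reply = None
                    tts.speak(confident_reply)
                else:
                    # No eager end preceded this turn.end: generate now.
                    tts.speak(llm.generate(event["transcript"]))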

Example: one turn

The user says “Hi I need to cancel my subscription please.”

{ "type": "turn.start" }
{ "type": "turn.update",    "transcript": "Hi I" }
{ "type": "turn.update",    "transcript": "Hi I need to" }
{ "type": "turn.eager_end", "transcript": "Hi I need to cancel" }
{ "type": "turn.resume" }
{ "type": "turn.update",    "transcript": "Hi I need to cancel my subscription" }
{ "type": "turn.eager_end", "transcript": "Hi I need to cancel my subscription please." }
{ "type": "turn.end",       "transcript": "Hi I need to cancel my subscription please." }

An agent listening to this stream would:

Start preparing a reply on the first turn.eager_end.
Cancel it on turn.resume.
Start preparing again on the second turn.eager_end.
Speak the prepared reply on turn.end.
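
To make these four steps concrete, the sketch below replays the example stream through a handler built on `asyncio` tasks and records what the agent does. `fake_llm` and `replay` are hypothetical stand-ins for illustration; a real agent would call its own LLM and TTS clients instead of appending to a log.

```python
import asyncio

# The event stream from the example, already parsed from JSON.
EVENTS = [
    {"type": "turn.start"},
    {"type": "turn.update", "transcript": "Hi I"},
    {"type": "turn.update", "transcript": "Hi I need to"},
    {"type": "turn.eager_end", "transcript": "Hi I need to cancel"},
    {"type": "turn.resume"},
    {"type": "turn.update", "transcript": "Hi I need to cancel my subscription"},
    {"type": "turn.eager_end", "transcript": "Hi I need to cancel my subscription please."},
    {"type": "turn.end", "transcript": "Hi I need to cancel my subscription please."},
]


async def fake_llm(transcript: str) -> str:
    """Hypothetical stand-in for an LLM call that takes a little time."""
    await asyncio.sleep(0.01)
    return f"reply to: {transcript}"


async def replay(events) -> list[str]:
    """Process events in order and return a log of the agent's actions."""
    log = []
    pending = None  # speculative generation task, if any
    for event in events:
        kind = event["type"]
        if kind == "turn.eager_end":
            pending = asyncio.create_task(fake_llm(event["transcript"]))
            log.append("prepare")
        elif kind == "turn.resume" and pending is not None:
            pending.cancel()
            pending = None
            log.append("cancel")
        elif kind == "turn.end":
            if pending is None:
                # No speculative reply exists, so generate one now.
                pending = asyncio.create_task(fake_llm(event["transcript"]))
            log.append("speak: " + await pending)
            pending = None
        await asyncio.sleep(0)  # let background tasks make progress
    return log


ACTIONS = asyncio.run(replay(EVENTS))
```

The first speculative task is cancelled before it finishes, so only the reply generated from the complete transcript is ever spoken.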

Guarantees

The first event in every turn is turn.start.
turn.eager_end will be followed by turn.end or turn.resume.
turn.resume only fires after a preceding turn.eager_end.
turn.resume will always fire if the transcript from turn.eager_end is not complete.
turn.end always closes a turn; the next turn begins with a new turn.start.
Events append to the turn’s transcript without modifying earlier text.

Edge cases

Client disconnects mid-turn

If the client stops sending audio while the user is still speaking, the session may not emit turn.eager_end or turn.end for that turn.
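
Since a turn may never receive its turn.end in this situation, client code that waits for one should not block indefinitely. One possible approach, sketched below, is a client-side timeout. This is an assumption about how you might structure your own code, not documented API behavior: the `wait_for_turn_end` helper, the `asyncio.Queue` of parsed events, and the timeout value are all hypothetical.

```python
import asyncio


async def wait_for_turn_end(events: asyncio.Queue, timeout_s: float):
    """Return the next turn.end event, or None if the stream goes quiet.

    The timeout applies to the gap between consecutive events, so a long
    turn that keeps producing updates is not treated as abandoned.
    """
    while True:
        try:
            event = await asyncio.wait_for(events.get(), timeout=timeout_s)
        except asyncio.TimeoutError:
            return None  # no turn.end arrived; treat the turn as abandoned
        if event["type"] == "turn.end":
            return event


async def demo():
    queue = asyncio.Queue()
    # A turn that starts but never ends.
    queue.put_nowait({"type": "turn.start"})
    queue.put_nowait({"type": "turn.update", "transcript": "Hi I"})
    abandoned = await wait_for_turn_end(queue, timeout_s=0.05)
    # A turn.end that does arrive is returned as usual.
    queue.put_nowait({"type": "turn.end", "transcript": "Hi I need help"})
    completed = await wait_for_turn_end(queue, timeout_s=0.05)
    return abandoned, completed


ABANDONED, COMPLETED = asyncio.run(demo())
```

When the helper returns None, the agent can discard any partial transcript and speculative reply for that turn and return to its idle state.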
