- Handles pauses and phone numbers
- Pulls in context from the conversation for more accuracy
State machine
Between user turns, the session is idle. When the user begins speaking, turn.start fires, followed by turn.update events as the transcript builds. The turn ends in one of two ways:
- Confident end: turn.end fires directly. The user turn is over.
- Eager end: turn.eager_end fires first, flagging that the user might be done. Then either turn.end confirms the user turn is over, or turn.resume fires and the user turn continues.
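The transitions described above can be written down as a small lookup table. This is a minimal sketch, not part of the API: the `TurnState` names and the `next_state` helper are hypothetical, and it assumes that no turn.update arrives between a turn.eager_end and the turn.end or turn.resume that resolves it.

```python
from enum import Enum


class TurnState(Enum):
    IDLE = "idle"            # between user turns
    SPEAKING = "speaking"    # turn.start seen, turn still open
    EAGER_END = "eager_end"  # turn.eager_end seen, awaiting turn.end or turn.resume


# Allowed transitions, keyed by (current state, event name).
TRANSITIONS = {
    (TurnState.IDLE, "turn.start"): TurnState.SPEAKING,
    (TurnState.SPEAKING, "turn.update"): TurnState.SPEAKING,
    (TurnState.SPEAKING, "turn.eager_end"): TurnState.EAGER_END,
    (TurnState.SPEAKING, "turn.end"): TurnState.IDLE,
    (TurnState.EAGER_END, "turn.resume"): TurnState.SPEAKING,
    (TurnState.EAGER_END, "turn.end"): TurnState.IDLE,
}


def next_state(state: TurnState, event_name: str) -> TurnState:
    """Return the state after event_name, or raise ValueError if it is unexpected."""
    try:
        return TRANSITIONS[(state, event_name)]
    except KeyError:
        raise ValueError(f"unexpected {event_name!r} in state {state.name}") from None
```

Feeding each incoming event name through `next_state` is one way to catch client-side ordering bugs early, for example an event handler that processes messages out of order.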
Events
The API emits these events to describe the state of the conversation.

| Event | Fires when | Carries transcript? |
|---|---|---|
| turn.start | The user begins speaking. | No |
| turn.update | Repeatedly, as the model transcribes the user’s speech. | Yes |
| turn.eager_end (preview) | The model predicts the user might be done speaking. | Yes |
| turn.resume (preview) | The user turn is continuing and the last turn.eager_end event should be ignored. | No |
| turn.end | The user turn is definitively complete. | Yes |

The transcript property is cumulative within a turn: it contains the full text transcribed so far in this user turn, not a delta. You do not need to concatenate partial results across events.
All emitted text is final—the model never revises text it has already sent. You can use the partial transcript from a turn.update the moment it arrives without worrying about it changing.
A separate connected event fires once when the WebSocket is established. You do not need to wait for it before sending audio.
Using turn.eager_end to cut latency (Preview)
turn.eager_end lets your agent start generating a reply before the model is confident that the user is done speaking. The moment it fires, send the transcript to your LLM — you’ll have a response ready to play the instant turn.end arrives.
This is an optimization that might not be necessary for your use case. We recommend focusing on turn.start and turn.end early in development, then incorporating more events as your agent matures.
Two things can happen after turn.eager_end:
- turn.resume: the user kept talking. Cancel any in-progress LLM and TTS generation and wait for the turn to end.
- turn.end: the user really is done. Play the prepared response.
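The handling of both outcomes can be sketched with asyncio tasks. This is a rough illustration under stated assumptions: events are taken to be dicts with "type" and "transcript" keys (the wire format is not shown in this section), and `generate_reply` and `EagerResponder` are hypothetical names standing in for your own LLM and TTS pipeline.

```python
import asyncio
from typing import Optional


async def generate_reply(transcript: str) -> str:
    """Hypothetical stand-in for your LLM (and TTS) call."""
    await asyncio.sleep(0.01)  # simulate generation latency
    return f"reply to: {transcript}"


class EagerResponder:
    """Start a reply on turn.eager_end, cancel it on turn.resume, return it on turn.end."""

    def __init__(self) -> None:
        self._task: Optional[asyncio.Task] = None

    async def handle(self, event: dict) -> Optional[str]:
        kind = event["type"]  # assumed field name
        if kind == "turn.eager_end":
            self._cancel()
            self._task = asyncio.create_task(generate_reply(event["transcript"]))
        elif kind == "turn.resume":
            self._cancel()
        elif kind == "turn.end":
            # Reuse the speculative reply if one is in flight; otherwise start now.
            task = self._task or asyncio.create_task(generate_reply(event["transcript"]))
            self._task = None
            return await task
        return None

    def _cancel(self) -> None:
        # Drop any speculative reply that is still being generated.
        if self._task is not None:
            self._task.cancel()
            self._task = None
```

Reusing the speculative reply on turn.end relies on the guarantee that turn.resume always fires when the turn.eager_end transcript is incomplete, so a turn.end that follows turn.eager_end directly carries the same text.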
The following example walks through how turn.eager_end and turn.resume work within a single turn.
Example: one turn
The user says “Hi I need to cancel my subscription please.”
- Start preparing a reply on the first turn.eager_end.
- Cancel it on turn.resume.
- Start preparing again on the second turn.eager_end.
- Speak the prepared reply on turn.end.
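To make the sequence above concrete, here is one event trace that would produce those four steps, mapped to the agent action for each event. The trace is illustrative only: the partial transcripts, the pause after "Hi", and the "type" and "transcript" field names are assumptions, not captured API output.

```python
# Illustrative trace: the user pauses after "Hi", then finishes the sentence.
EVENTS = [
    {"type": "turn.start"},
    {"type": "turn.update", "transcript": "Hi"},
    {"type": "turn.eager_end", "transcript": "Hi"},
    {"type": "turn.resume"},
    {"type": "turn.update", "transcript": "Hi I need to cancel my subscription please."},
    {"type": "turn.eager_end", "transcript": "Hi I need to cancel my subscription please."},
    {"type": "turn.end", "transcript": "Hi I need to cancel my subscription please."},
]

# Agent action for each event that requires one.
ACTIONS = {
    "turn.eager_end": "start preparing reply",
    "turn.resume": "cancel prepared reply",
    "turn.end": "speak prepared reply",
}


def plan(events: list) -> list:
    """Return (event name, action) pairs for the events the agent reacts to."""
    return [(e["type"], ACTIONS[e["type"]]) for e in events if e["type"] in ACTIONS]
```

Calling `plan(EVENTS)` yields the four steps in order: start, cancel, start again, speak.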
Guarantees
- The first event in every turn is turn.start.
- turn.eager_end will be followed by turn.end or turn.resume.
- turn.resume only fires after a preceding turn.eager_end.
- turn.resume will always fire if the transcript from turn.eager_end is not complete.
- turn.end always closes a turn; the next turn begins with a new turn.start.
- Events append to the turn’s transcript without modifying earlier text.
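These guarantees are convenient to assert against in client tests. The helper below is a sketch of that idea for the events of one completed turn; `check_turn` is a hypothetical name, and the "type" and "transcript" keys are assumed field names.

```python
def check_turn(events: list) -> None:
    """Raise AssertionError if one turn's events violate the guarantees above."""
    assert events and events[0]["type"] == "turn.start", "turn must open with turn.start"
    assert events[-1]["type"] == "turn.end", "turn must close with turn.end"

    pending_eager_end = False  # True between turn.eager_end and its resolution
    previous_text = ""
    for event in events[1:]:
        kind = event["type"]
        if kind == "turn.resume":
            assert pending_eager_end, "turn.resume without a preceding turn.eager_end"
            pending_eager_end = False
        elif kind == "turn.eager_end":
            pending_eager_end = True
        elif kind == "turn.end":
            pending_eager_end = False

        # Transcripts are cumulative and never revised, so each extends the last.
        text = event.get("transcript")
        if text is not None:
            assert text.startswith(previous_text), "earlier transcript text was modified"
            previous_text = text
```

The prefix check encodes the append-only guarantee: a later transcript in the same turn always starts with the earlier one.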
Edge cases
No audio vs silence
Our API expects a continuous stream of audio. If you stop sending audio, the server will wait for more audio chunks to arrive rather than assuming that the user is silent. This is normally desired behavior to handle network lag, but it does mean that your client needs to send silence (all zeros) when your audio input is muted.
Draining events
Once you are done sending all audio for a session, send {"type": "close"} to tell the model to flush any buffered audio and emit remaining events. The server will close the socket for you once the model is done.
The server buffers some audio to improve transcription accuracy. If you don’t send the close command, or if you stop reading messages early, that buffered audio will not be processed. This is okay if you don’t care about the last second of audio.
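Both edge cases come down to building the right payloads on the client. The sketch below shows one way to do that. It assumes 16-bit mono PCM at 16 kHz, which is a guess at the input format and should be changed to match your stream; `silence_chunk` and `close_message` are hypothetical helper names.

```python
import json

SAMPLE_RATE_HZ = 16_000  # assumed input sample rate
BYTES_PER_SAMPLE = 2     # assumed 16-bit PCM, mono


def silence_chunk(duration_ms: int) -> bytes:
    """Return duration_ms of all-zero PCM audio to send while the input is muted."""
    num_samples = SAMPLE_RATE_HZ * duration_ms // 1000
    return bytes(num_samples * BYTES_PER_SAMPLE)  # bytes(n) is n zero bytes


def close_message() -> str:
    """Return the JSON text that asks the server to flush buffered audio."""
    return json.dumps({"type": "close"})
```

After sending `close_message()`, keep reading from the socket until the server closes it, so that the events for the buffered audio are not lost.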
Joining transcripts
The transcript field is cumulative within a turn: each turn.update, turn.eager_end, and turn.end event already holds the full text of the turn so far.
If you only care about the final transcript: take the transcript property from each turn.end, one per completed turn. Join the transcripts verbatim. Never strip() them, normalize them, or add your own separators.
Concatenating the transcripts from turn.update and turn.eager_end events is a classic source of duplicated text: because each update is cumulative, joining them repeats parts of the transcript.
Consider turn.update and turn.eager_end as updates to the turn state, not transcript chunks.
Read turn.end only for the final transcript.
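As a minimal sketch of the rule above, the helper below builds a session transcript from turn.end events only and joins them without any separator. `final_transcript` is a hypothetical name, and the "type" and "transcript" keys are assumed field names.

```python
def final_transcript(events: list) -> str:
    """Join the transcript of every turn.end event verbatim, in arrival order."""
    return "".join(e["transcript"] for e in events if e["type"] == "turn.end")
```

Because turn.update and turn.eager_end are skipped entirely, their cumulative text is never counted twice.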
Where to go next
Try it out online
See turn detection in action with no sign-up or code required
Use the API
Start building with our Realtime STT API
Use the SDK
Take a look at some real code