# Turn Detection with Ink

The [Realtime Speech-to-Text (Auto) API](/api-reference/stt/turns/websocket) organizes transcription around **user turns**, not raw transcript segments. The model itself signals when a user turn begins and ends, so your voice agent reacts to events rather than running its own voice activity detection.

The end result is a more human-like voice agent that:

* Handles mid-turn pauses and dictated phone numbers without ending the turn early
* Uses context from the conversation to improve accuracy

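```python theme={null}
import json

# `websocket`, `tts`, and `llm` stand in for your own client objects.
async for message in websocket:
    event = json.loads(message)
    if event["type"] == "turn.start":
        # The user started talking: interrupt any speech in progress.
        tts.interrupt()
    elif event["type"] == "turn.end":
        # The user turn is over: reply to the full transcript.
        reply = llm.generate(event["transcript"])
        tts.speak(reply)
```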

## State machine

```mermaid theme={null}
stateDiagram-v2
    direction LR
    [*] --> Idle
    Idle --> Speaking: turn.start
    Speaking --> EagerEnded: turn.eager_end
    Speaking --> Idle: turn.end
    EagerEnded --> Speaking: turn.resume
    EagerEnded --> Idle: turn.end
```

Between user turns, the session is idle. When the user begins speaking, `turn.start` fires, followed by `turn.update` events as the transcript builds. The turn ends in one of two ways:

* **Confident end** — `turn.end` fires directly. The user turn is over.
* **Eager end** — `turn.eager_end` fires first, flagging that the user *might* be done. Then either `turn.end` confirms the user turn is over, or `turn.resume` fires and the user turn continues.
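
The transitions in the diagram can be captured in a small lookup table. The following is a minimal sketch, not part of the API: `TurnState`, `TRANSITIONS`, and `next_state` are hypothetical names, and treating `turn.update` as a no-op for state is an assumption, since the diagram does not show it.

```python
from enum import Enum


class TurnState(Enum):
    IDLE = "Idle"
    SPEAKING = "Speaking"
    EAGER_ENDED = "EagerEnded"


# (current state, event type) -> next state, mirroring the diagram.
TRANSITIONS = {
    (TurnState.IDLE, "turn.start"): TurnState.SPEAKING,
    (TurnState.SPEAKING, "turn.eager_end"): TurnState.EAGER_ENDED,
    (TurnState.SPEAKING, "turn.end"): TurnState.IDLE,
    (TurnState.EAGER_ENDED, "turn.resume"): TurnState.SPEAKING,
    (TurnState.EAGER_ENDED, "turn.end"): TurnState.IDLE,
}


def next_state(state: TurnState, event_type: str) -> TurnState:
    """Return the state after `event_type`, or raise on an unexpected event."""
    if event_type == "turn.update":
        # Assumption: updates carry transcript text but never change state.
        return state
    try:
        return TRANSITIONS[(state, event_type)]
    except KeyError:
        raise ValueError(
            f"unexpected {event_type!r} in state {state.value}"
        ) from None
```

Tracking the state this way can help surface client-side bugs, such as handling a `turn.resume` that was never preceded by a `turn.eager_end`.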

## Events

The API emits these events to describe the state of the conversation.

| Event                       | Fires when                                                                         | Carries transcript? |
| --------------------------- | ---------------------------------------------------------------------------------- | ------------------- |
| `turn.start`                | The user begins speaking.                                                          | No                  |
| `turn.update`               | Repeatedly, as the model transcribes the user's speech.                            | Yes                 |
| `turn.eager_end` \[PREVIEW] | The model predicts the user might be done speaking.                                | Yes                 |
| `turn.resume` \[PREVIEW]    | The user turn is continuing and the last `turn.eager_end` event should be ignored. | No                  |
| `turn.end`                  | The user turn is definitively complete.                                            | Yes                 |

The `transcript` property is **cumulative within a turn** — it contains the full text transcribed so far in this user turn, not a delta. You do not need to concatenate partial results across events.

All emitted text is **final**: the model never revises text it has already sent. You can use the partial transcript from a `turn.update` the moment it arrives without worrying that it will change.
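
Because transcripts are cumulative and already-sent text is never revised, the newly transcribed text in an event is the suffix beyond what you saw last. The helper below is a minimal sketch of that idea, useful if you want to stream only new words to a UI; `new_text` is a hypothetical name, not part of the API.

```python
def new_text(previous: str, current: str) -> str:
    """Return the part of `current` that was not in `previous`.

    Relies on the documented behavior that the transcript within a turn
    only grows, so `current` is expected to start with `previous`.
    """
    if not current.startswith(previous):
        raise ValueError("transcript did not extend the previous one")
    return current[len(previous):]


seen = ""  # reset to "" whenever a new turn starts
for transcript in ["Hi I", "Hi I need to", "Hi I need to cancel"]:
    print(repr(new_text(seen, transcript)))
    seen = transcript
# prints 'Hi I', then ' need to', then ' cancel'
```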

A separate `connected` event fires once when the WebSocket is established. You do not need to wait for it before sending audio.

## Using `turn.eager_end` to cut latency \[PREVIEW]

`turn.eager_end` lets your agent start generating a reply before the model is confident that the user is done speaking. The moment it fires, send the transcript to your LLM — you'll have a response ready to play the instant `turn.end` arrives.

This is an optimization that might not be necessary for your use case. We recommend starting with the `turn.start` and `turn.end` events early in development, then adding the other events as your agent matures.

Two things can happen after `turn.eager_end`:

* `turn.resume`: the user kept talking. **Cancel any in-progress LLM and TTS generation** and wait for the turn to end.
* `turn.end`: the user really is done. Play the prepared response.

This is a preview feature because we're still tuning how `turn.eager_end` and `turn.resume` behave.

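```python theme={null}
import json

pending_reply = None  # in-flight LLM call started on turn.eager_end

async for message in websocket:
    event = json.loads(message)
    match event["type"]:
        case "turn.eager_end":
            # Start generating early. `llm.generate_async` is assumed to
            # return a cancellable awaitable, such as an asyncio.Task.
            pending_reply = llm.generate_async(event["transcript"])

        case "turn.resume":
            # The user kept talking: discard the speculative reply.
            if pending_reply is not None:
                pending_reply.cancel()
                pending_reply = None

        case "turn.end":
            if pending_reply is not None:
                # Wait for the early reply to finish, then speak it.
                confident_reply = await pending_reply
                pending_reply = None
                tts.speak(confident_reply)
            else:
                tts.speak(llm.generate(event["transcript"]))
```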

## Example: one turn

The user says *"Hi I need to cancel my subscription please."*

```json lines theme={null}
{ "type": "turn.start" }
{ "type": "turn.update",    "transcript": "Hi I" }
{ "type": "turn.update",    "transcript": "Hi I need to" }
{ "type": "turn.eager_end", "transcript": "Hi I need to cancel" }
{ "type": "turn.resume" }
{ "type": "turn.update",    "transcript": "Hi I need to cancel my subscription" }
{ "type": "turn.eager_end", "transcript": "Hi I need to cancel my subscription please." }
{ "type": "turn.end",       "transcript": "Hi I need to cancel my subscription please." }
```

An agent listening to this stream would:

1. Start preparing a reply on the first `turn.eager_end`.
2. Cancel it on `turn.resume`.
3. Start preparing again on the second `turn.eager_end`.
4. Speak the prepared reply on `turn.end`.
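
To make the walkthrough concrete, the sketch below replays the event stream from this example and records what the agent would do at each step. `plan_actions` and the action labels are hypothetical and chosen for illustration; a real agent would call its LLM and TTS instead of appending strings.

```python
import json

# The event stream from the example, one JSON object per line.
STREAM = """\
{ "type": "turn.start" }
{ "type": "turn.update",    "transcript": "Hi I" }
{ "type": "turn.update",    "transcript": "Hi I need to" }
{ "type": "turn.eager_end", "transcript": "Hi I need to cancel" }
{ "type": "turn.resume" }
{ "type": "turn.update",    "transcript": "Hi I need to cancel my subscription" }
{ "type": "turn.eager_end", "transcript": "Hi I need to cancel my subscription please." }
{ "type": "turn.end",       "transcript": "Hi I need to cancel my subscription please." }
"""


def plan_actions(lines):
    """Map each event to the action the agent would take, if any."""
    actions = []
    preparing = False  # True while a speculative reply is in flight
    for line in lines:
        kind = json.loads(line)["type"]
        if kind == "turn.eager_end":
            preparing = True
            actions.append("prepare")
        elif kind == "turn.resume":
            preparing = False
            actions.append("cancel")
        elif kind == "turn.end":
            actions.append("speak prepared" if preparing else "generate and speak")
            preparing = False
    return actions


print(plan_actions(STREAM.splitlines()))
# ['prepare', 'cancel', 'prepare', 'speak prepared']
```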

## Guarantees

* The first event in every turn is `turn.start`.
* `turn.eager_end` will be followed by `turn.end` or `turn.resume`.
* `turn.resume` only fires after a preceding `turn.eager_end`.
* `turn.resume` will always fire if the transcript from `turn.eager_end` is not complete.
* `turn.end` always closes a turn; the next turn begins with a new `turn.start`.
* Events append to the turn's transcript without modifying earlier text.

## Edge cases

### No audio vs silence

Our API expects a continuous stream of audio.
If you stop sending audio, the server will wait for more audio chunks to arrive rather than assuming that the user is silent.

This is usually the desired behavior because it tolerates network lag, but it means your client must keep sending silence (all zeros) while your audio input is muted.
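
One way to keep the stream continuous while muted is to substitute zero-filled chunks of the same size as your normal audio chunks. The sketch below assumes 16-bit signed mono PCM at 16 kHz sent in 20 ms chunks. These values and the `silence_chunk` and `next_chunk` helpers are assumptions for illustration, so match them to the encoding and sample rate your session actually uses.

```python
SAMPLE_RATE_HZ = 16_000  # assumption: match your session's sample rate
BYTES_PER_SAMPLE = 2     # assumption: 16-bit signed PCM, mono
CHUNK_MS = 20            # assumption: your client's chunk duration


def silence_chunk(duration_ms: int = CHUNK_MS) -> bytes:
    """Return `duration_ms` of digital silence as zero-valued PCM bytes."""
    num_samples = SAMPLE_RATE_HZ * duration_ms // 1000
    return bytes(num_samples * BYTES_PER_SAMPLE)


def next_chunk(mic_chunk: bytes, muted: bool) -> bytes:
    """Pick what to send: real audio, or equal-length silence when muted."""
    return bytes(len(mic_chunk)) if muted else mic_chunk
```

Sending `next_chunk(...)` on every tick of your capture loop keeps the cadence of the stream unchanged whether or not the microphone is muted.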

### Draining events

Once you are done sending all audio for a session, send `{"type": "close"}` to tell the model to flush any buffered audio and emit remaining events. The server will close the socket for you once the model is done.

The server buffers some audio to improve transcription accuracy. If you don't send the close command, or if you stop reading messages early, that buffered audio will not be processed. This is okay if you don't care about the last second of audio.

```python theme={null}
await websocket.send(json.dumps({"type": "close"}))
async for message in websocket:
    event = json.loads(message)
    if event["type"] == "turn.end":
        turns.append(event["transcript"])
        # do not stop reading from the websocket!
print("server closed the connection")
```

### Joining transcripts

The `transcript` field is **cumulative within a turn** — each `turn.update`, `turn.eager_end`, and `turn.end` event already holds the full text of the turn so far.

If you only care about the final transcript, take the `transcript` property from each `turn.end` event, one per completed turn. **Join `transcript` verbatim. Never `strip()` it, normalize it, or add your own separators.**

```python theme={null}
import json

full_audio_transcript = ""
turns: list[str] = []
async for message in websocket:
    event = json.loads(message)
    if event["type"] == "turn.end":
        # transcripts across turns should be
        # concatenated without formatting!
        full_audio_transcript += event["transcript"]

        # per-turn transcript
        turns.append(event["transcript"])
```

Concatenating transcripts from `turn.update` and `turn.eager_end` events is a classic source of duplicated text: because each update is cumulative, joining them repeats parts of the transcript.
Consider `turn.update` and `turn.eager_end` as updates to the turn state, not transcript chunks.

For the final transcript, read only `turn.end` events.

## Where to go next

<CardGroup cols={3}>
  <Card title="Try it out online" icon="arrow-pointer" href="https://www.cartesia.ai/ink">
    See turn detection in action with no sign-up or code required
  </Card>

  <Card title="Use the API" icon="code" href="/api-reference/stt/turns/websocket">
    Start building with our Realtime STT API
  </Card>

  <Card title="Use the SDK" icon="brackets-curly" href="/examples/stt-auto-finalize-websocket">
    Take a look at some real code
  </Card>
</CardGroup>
