Feature comparison
| | Multiple generations per connection | Timestamps | Continuations |
|---|---|---|---|
| WebSocket | Yes | Yes | Yes |
| Bytes | No (one POST per generation) | No | No |
| SSE | No (one POST per generation) | Yes | No |
Continuations let a WebSocket session receive transcript text in pieces that share a `context_id`. See Stream inputs using continuations and contexts.
If you care about time-to-first-byte on every turn, remember that a new HTTPS request pays for TCP and TLS again; that overhead is often on the same order as TTFB for the audio itself. WebSocket amortizes that cost when you keep the socket open.
SSE is still supported for stacks that already consume Server-Sent Events or when you want timestamps while staying on HTTP. For audio only, bytes is usually the better HTTP choice (smaller encoding than JSON-wrapped chunks).
Pick an endpoint in one minute
| What you are building | Use this | Short label |
|---|---|---|
| Full transcript in one request; you want a streaming HTTP body (efficient; same pattern as the playground) | POST /tts/bytes | Stream speech (bytes) |
| Full transcript in one request; you need timestamps without WebSocket, or your stack already uses SSE | POST /tts/sse | Stream speech with timestamps (SSE) |
| Long-lived session, partial transcript (for example LLM tokens), lowest latency across many turns, timestamps, or several utterances on one socket | WebSocket /tts/websocket | Live session (WebSocket) |
Bytes (POST /tts/bytes)
Best for batch jobs, caching files, notifications, and anywhere one POST per generation is enough.
The response body streams while audio is generated. You can read progressively or buffer to the end. For many output formats this is leaner on the wire than SSE because you receive raw or file bytes instead of JSON-wrapped chunks.
Typical flow:
- One JSON payload with the full transcript, voice, model, and output format (WAV, MP3, raw PCM, and so on). POST to `/tts/bytes`.
- Read the body as data arrives, or consume it to completion.
- One generation per POST.
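As a sketch of this flow, using only the Python standard library. The base URL, header, and payload field names here are illustrative assumptions; check the bytes reference for the exact schema your API version expects.

```python
import json
import urllib.request

def build_bytes_payload(transcript, model_id, voice_id, container="wav"):
    # Illustrative field names; consult the bytes reference for the real schema.
    return {
        "transcript": transcript,
        "model_id": model_id,
        "voice": {"id": voice_id},
        "output_format": {"container": container},
    }

def synthesize_to_file(base_url, api_key, payload, path):
    req = urllib.request.Request(
        f"{base_url}/tts/bytes",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",  # header, not query string
            "Content-Type": "application/json",
        },
    )
    # The body streams while audio is generated: read progressively
    # rather than waiting for the whole response.
    with urllib.request.urlopen(req) as resp, open(path, "wb") as f:
        while chunk := resp.read(4096):
            f.write(chunk)
```

For playback-as-you-go, hand each chunk to your audio pipeline instead of writing to a file.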
See bytes reference.
SSE (POST /tts/sse)
Best when you need timestamps while staying on HTTP without WebSocket, or when your integration already uses SSE. If you only need audio and not SSE-shaped events, bytes is usually simpler. WebSocket is otherwise the full-featured option for real-time use and supports timestamps as well.
SSE remains available largely for backward compatibility and for teams that standardize on Server-Sent Events.
Typical flow:
- Same as bytes: one JSON body with the full transcript.
- POST to `/tts/sse`.
- Consume Server-Sent Events; each event carries the next chunk until completion.
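The event-framing half of that loop can be sketched with a minimal parser. This handles only `data:` lines and blank-line separators; full SSE also allows `event:`, `id:`, and comment lines, and the payload shape (typically JSON carrying base64 audio) is defined in the SSE reference.

```python
def parse_sse_events(lines):
    """Group `data:` lines into events; a blank line ends an event."""
    event_data = []
    for line in lines:
        if line.startswith("data:"):
            event_data.append(line[5:].strip())
        elif line == "" and event_data:
            yield "\n".join(event_data)
            event_data = []
    if event_data:  # flush a trailing event with no final blank line
        yield "\n".join(event_data)
```

Feed it the decoded lines of the streaming response body; each yielded string is one event payload to JSON-decode.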
| | Bytes | SSE |
|---|---|---|
| Shape | One streaming response body (raw or file bytes) | Many SSE events (often JSON plus base64 audio) |
| Timestamps | No | Yes (in the event payload) |
Neither endpoint accepts transcript text split across POSTs. An optional `context_id` is echoed for your logs; it does not merge multiple HTTP requests into one utterance. To send text in pieces over time, use WebSocket.
See SSE reference.
WebSocket (/tts/websocket)
Best for assistants, games, telephony-style stacks, or any case where the connection stays open and transcript text may arrive over time.
Why people choose WebSocket:
- Latency: you pay connect cost once; later generations avoid extra TCP/TLS round trips (often tens to low hundreds of ms per turn).
- Streaming input: send fragments as they arrive (for example from an LLM) and keep prosody across them. See continuations and contexts.
- Timestamps: word- or segment-level timing (model and language limits apply; see WebSocket docs).
- Multiplexing: multiple `context_id` values on one connection for overlapping utterances.
Typical flow:
- Open the WebSocket.
- Send JSON messages. When one utterance is split across messages, use `context_id` and `continue`: set `continue: true` on partials, and `continue: false` on the last part of that utterance (or use the empty-transcript pattern in contexts if you cannot know the final string yet).
- Read audio until the server finishes that context.
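The per-message bookkeeping above can be wrapped in small helpers. The `context_id`, `transcript`, and `continue` fields come from this page; the full message schema (model, voice, output format) lives in the WebSocket reference, so treat these shapes as illustrative.

```python
import json
import uuid

def new_context_id() -> str:
    # One stable context_id per utterance.
    return str(uuid.uuid4())

def partial_message(context_id: str, text: str) -> str:
    # continue: true — more text for this utterance is coming.
    return json.dumps(
        {"context_id": context_id, "transcript": text, "continue": True}
    )

def final_message(context_id: str, text: str = "") -> str:
    # continue: false — finalizes the utterance. An empty transcript
    # is the "empty-transcript" finalization pattern from contexts.
    return json.dumps(
        {"context_id": context_id, "transcript": text, "continue": False}
    )
```

Send each string over any WebSocket client, then read audio frames until the server signals that the context is done.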
Continuations
If you are not streaming text from a model, start with bytes or SSE, not continuations. When you do use WebSocket streaming, keep one stable `context_id` per utterance, `continue: true` on every partial, and `continue: false` on the final message for that utterance (see contexts).
Things that break text or prosody:
- Concatenation: chunks are joined verbatim. A missing space produces `"...world!How..."` instead of `"...world! How..."`.
- SSML and numbers: avoid splitting tokens that must stay together (for example decimals in SSML). See `max_buffer_delay_ms` in the continuations guide.
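A one-liner makes the concatenation pitfall concrete (illustrative only; the actual join happens server-side):

```python
def joined_transcript(chunks):
    # Chunks sharing a context_id are concatenated verbatim:
    # no whitespace is inserted between them.
    return "".join(chunks)

# A missing trailing space runs sentences together:
joined_transcript(["...world!", "How..."])   # '...world!How...'
# Include the space in the chunk you send:
joined_transcript(["...world! ", "How..."])  # '...world! How...'
```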
If you leave `continue: true` set longer than you meant to, contexts eventually expire on their own, and audio is still generated and flushed according to server rules. It is not a runaway failure mode. You should still send `continue: false` when you know the utterance is complete so your client state matches the server. Do not reuse old `context_id` values for unrelated utterances.
Why WebSocket uses context_id (and HTTP does not)
On POST /tts/bytes and POST /tts/sse, you send a complete transcript in one JSON body. There is no continuation protocol across requests.
`context_id` and `continue` matter on WebSocket when one utterance's text is split across multiple messages. The server concatenates chunks that share a `context_id`. `continue: true` means more text is coming; `continue: false` finalizes that utterance.
Mental model:
- Whole line of speech in one string: bytes or SSE. No context API.
- Text arrives in pieces: WebSocket, one `context_id` per utterance, with continuations.
API ergonomics (all endpoints)
- For server-side calls, prefer the API key in the `Authorization` header instead of query strings (headers are less likely to appear in access logs). WebSocket URLs in browsers may need different patterns for your platform.
- Model IDs, voices, and core generation parameters match across bytes, SSE, and WebSocket. What differs is wire format, how chunks are exposed, and whether input can be streamed with continuations.
Where to go next
Stream speech (bytes)
One POST, streaming response body
Stream speech with timestamps (SSE)
Timestamps and SSE-chunked audio
Live session (WebSocket)
Streaming input, multiplexing, lowest latency across turns