Text-to-Speech (SSE)
Stream audio with extra metadata from a complete transcript
Authorizations
A short-lived access token to make API requests from a client.
Headers
API version header.
2026-03-01
Body
The language that the given voice should speak the transcript in. This may depend on the model you're using. See Models for details.
en, fr, de, es, pt, zh, ja, hi, it, ko, nl, pl, ru, sv, tr, tl, bg, ro, ar, cs, el, fi, hr, ms, sk, da, ta, uk, hu, no, vi, bn, th, he, ka, id, te, gu, kn, ml, mr, pa
Whether to return word-level timestamps. If false (default), no word timestamps are produced. If true, the server returns timestamp events containing word-level timing information.
Whether to return phoneme-level timestamps. If false (default), no phoneme timestamps will be produced. If true, the server will return timestamp events containing phoneme-level timing information.
Whether to use normalized timestamps (true) or original timestamps (false).
The ID of a pronunciation dictionary to use for the generation. Pronunciation dictionaries are supported by sonic-3 models and newer.
Configure the various attributes of the generated speech. Available on sonic-3 and sonic-3.5; not available on earlier models.
See Volume, Speed, and Emotion for a guide on this option.
This property is deprecated and may not work for all voices. Use generation_config.speed instead.
Influences the speed of the generated speech.
slow, normal, fast
This can be any string value you find useful. The server will echo back the same context_id in events that it sends.
Contexts on the TTS (WebSocket) endpoint are used for continuations. The TTS (SSE) endpoint does not support continuations, so most users just ignore this property.
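Because the server echoes context_id back, a client can use it to tag a request and recognise the events that belong to it. The following is a minimal sketch, assuming events have already been decoded into dictionaries and that the echoed value sits under a context_id key. Both helper names are hypothetical and not part of the API.

```python
import uuid


def new_context_id(prefix: str = "req") -> str:
    """Build a unique, human-readable context ID (any string is accepted)."""
    return f"{prefix}-{uuid.uuid4().hex}"


def belongs_to(event: dict, context_id: str) -> bool:
    """Return True if a decoded event echoes the given context ID.

    Assumption: the echoed value is stored under the "context_id" key.
    Events without that key are treated as not matching.
    """
    return event.get("context_id") == context_id
```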
Response
Server-sent events stream. Each frame is data: <json>\n\n where the JSON payload matches TTSSSEEvent.
- TTSSSEChunkEvent
- TTSSSETimestampsEvent
- TTSSSEPhonemeTimestampsEvent
- TTSSSEDoneEvent
- TTSSSEErrorEvent
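The data: <json>\n\n framing described above can be read with a small line-based parser. This is a sketch, not an official client: the function name iter_sse_events is hypothetical, and it assumes the stream has already been decoded into text lines. It handles only data fields, which is all the frame format above requires.

```python
import json
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded JSON payload per server-sent event frame."""
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current frame.
            if data_parts:
                yield json.loads("\n".join(data_parts))
                data_parts = []
            continue
        if line.startswith("data:"):
            value = line[len("data:"):]
            # SSE drops a single leading space after the colon.
            if value.startswith(" "):
                value = value[1:]
            data_parts.append(value)
        # Other SSE fields and comment lines are ignored in this sketch.
    # Lenient choice: flush a final frame even if the stream ended
    # without the closing blank line.
    if data_parts:
        yield json.loads("\n".join(data_parts))
```

The function accepts any iterable of text lines, so it can be fed from a streamed HTTP response body or from a list of strings in a test.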
Audio data chunk.
Event type identifier.
chunk
Whether this is the final event for the request. Always false for chunk events.
false
Base64-encoded audio data.
Server-side processing time for this chunk in milliseconds.
HTTP-style status code.
The context ID echoed back from the request, if one was provided.
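Since chunk events carry Base64-encoded audio, a client has to decode each chunk and concatenate the results in arrival order. The sketch below assumes decoded events are dictionaries, that the event type identifier is stored under a type key, and that the audio payload is stored under a data key. Those key names are assumptions, as the field names are not shown in this section; only the chunk type value comes from the text above.

```python
import base64
from typing import Iterable


def collect_audio(events: Iterable[dict]) -> bytes:
    """Concatenate the decoded audio from all chunk events, in order."""
    audio = bytearray()
    for event in events:
        # Assumed key name "type"; non-chunk events carry no audio here.
        if event.get("type") != "chunk":
            continue
        # Assumed key name "data" for the Base64-encoded audio payload.
        audio.extend(base64.b64decode(event["data"]))
    return bytes(audio)
```

The returned bytes are in whatever audio format the request asked for; this sketch does not interpret or wrap them.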