POST /tts/sse
Text-to-Speech (SSE)
curl --request POST \
  --url https://api.cartesia.ai/tts/sse \
  --header 'Authorization: Bearer <token>' \
  --header 'Cartesia-Version: <cartesia-version>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "model_id": "sonic-3.5",
  "transcript": "<string>",
  "voice": {
    "id": "<string>"
  },
  "output_format": {},
  "add_timestamps": false,
  "add_phoneme_timestamps": false,
  "use_normalized_timestamps": true,
  "pronunciation_dict_id": "<string>",
  "generation_config": {
    "volume": 1,
    "speed": 1
  },
  "speed": "normal",
  "context_id": "<string>"
}
'

Example response event:
{
  "type": "chunk",
  "done": false,
  "status_code": 206,
  "step_time": 123,
  "context_id": "50dc3b5e-5841-4aa1-9f94-60cfb9aead79",
  "data": "aSDinaTvuI8gbWludGxpZnk="
}

Authorizations

Authorization
string
header
required

A short-lived access token to make API requests from a client.

Headers

Cartesia-Version
enum<string>
required

API version header.

Available options:
2026-03-01
Example:

"2026-03-01"

Body

application/json
model_id
enum<string>
required

The ID of the model to use for the generation. See Models for all options.

Available options:
sonic-3.5,
sonic-3,
sonic-latest
Example:

"sonic-3.5"

transcript
string
required
voice
TTSRequestVoiceSpecifier · object
required
output_format
SSEOutputFormat · object
required
language
enum<string>

The language that the given voice should speak the transcript in. This may depend on the model you're using. See Models for details.

Available options:
en,
fr,
de,
es,
pt,
zh,
ja,
hi,
it,
ko,
nl,
pl,
ru,
sv,
tr,
tl,
bg,
ro,
ar,
cs,
el,
fi,
hr,
ms,
sk,
da,
ta,
uk,
hu,
no,
vi,
bn,
th,
he,
ka,
id,
te,
gu,
kn,
ml,
mr,
pa
add_timestamps
boolean | null
default:false

Whether to return word-level timestamps. If false (default), no word timestamps will be produced at all. If true, the server will return timestamp events containing word-level timing information.

add_phoneme_timestamps
boolean | null
default:false

Whether to return phoneme-level timestamps. If false (default), no phoneme timestamps will be produced. If true, the server will return timestamp events containing phoneme-level timing information.

use_normalized_timestamps
boolean | null

Whether to use normalized timestamps (true) or original timestamps (false).

pronunciation_dict_id
string | null

The ID of a pronunciation dictionary to use for the generation. Pronunciation dictionaries are supported by sonic-3 models and newer.

generation_config
GenerationConfig · object

Configure the various attributes of the generated speech. Available on sonic-3 and sonic-3.5; not available on earlier models.

See Volume, Speed, and Emotion for a guide on this option.

speed
enum<string>
default:normal
deprecated

This property is deprecated and may not work for all voices. Use generation_config.speed instead. Influences the speed of the generated speech.

Available options:
slow,
normal,
fast
context_id
string | null

This can be any string value you find useful. The server will echo back the same context_id in events that it sends.

Contexts on the TTS (WebSocket) endpoint are used for continuations. The TTS (SSE) endpoint does not support continuations, so most users just ignore this property.
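As a rough sketch of how the required body fields and headers fit together, the Python below builds (but does not send) the POST request using only the standard library. The helper name `build_tts_request` is hypothetical, and because the fields of `output_format` are not described in this section, the caller passes that object through unchanged.

```python
import json
import urllib.request

API_URL = "https://api.cartesia.ai/tts/sse"  # endpoint from the curl example


def build_tts_request(token, transcript, voice_id, output_format,
                      model_id="sonic-3.5", version="2026-03-01"):
    """Build (but do not send) a POST request for the SSE endpoint.

    Hypothetical helper. `output_format` is passed through as given because
    its fields are not documented in this section.
    """
    body = {
        "model_id": model_id,
        "transcript": transcript,
        "voice": {"id": voice_id},
        "output_format": output_format,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Cartesia-Version": version,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would open the stream; optional fields such as `language` or `generation_config` could be added to `body` in the same way.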

Response

200 - text/event-stream

Server-sent events stream. Each frame is data: <json>\n\n where the JSON payload matches TTSSSEEvent.
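The framing described above can be read with a small parser. This is a minimal sketch, assuming the stream has already been decoded into text lines; the function name `iter_sse_events` is illustrative and not part of any SDK.

```python
import json


def iter_sse_events(lines):
    """Yield one decoded JSON payload per SSE frame.

    `lines` is any iterable of text lines, with or without trailing newlines.
    A blank line ends a frame; only `data:` fields are considered.
    """
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # Blank line: the current frame is complete.
            if data_parts:
                yield json.loads("\n".join(data_parts))
                data_parts = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format allows one optional space after the colon.
            data_parts.append(value[1:] if value.startswith(" ") else value)
    if data_parts:
        # Stream ended without a trailing blank line.
        yield json.loads("\n".join(data_parts))
```

Each yielded dictionary can then be inspected by its `type` and `done` fields.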

Audio data chunk.

type
enum<string>
required

Event type identifier.

Available options:
chunk
done
enum<boolean>
required

Whether this is the final event for the request. Always false for chunk events.

Available options:
false
data
string
required

Base64-encoded audio data.
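Since each chunk carries its audio as Base64 text, a client has to decode and concatenate the chunks before playback or saving. A minimal sketch, assuming the events have already been parsed into dictionaries; `collect_audio` is a hypothetical helper name.

```python
import base64


def collect_audio(events):
    """Concatenate the decoded audio bytes of all `chunk` events, in order."""
    audio = bytearray()
    for event in events:
        if event.get("type") == "chunk":
            audio.extend(base64.b64decode(event["data"]))
    return bytes(audio)
```

How the resulting bytes should be interpreted (container, encoding, sample rate) depends on the `output_format` sent in the request.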

step_time
number
required

Server-side processing time for this chunk in milliseconds.

status_code
integer
required

HTTP-style status code.

context_id
string | null

The context ID echoed back from the request, if one was provided.