POST /tts/sse
Text-to-Speech (SSE)
curl --request POST \
  --url https://api.cartesia.ai/tts/sse \
  --header 'Cartesia-Version: 2024-06-10' \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: <api-key>' \
  --data '
{
  "model_id": "sonic-3.5",
  "transcript": "<string>",
  "voice": {
    "id": "<string>",
    "__experimental_controls": {
      "speed": 123,
      "emotion": []
    }
  },
  "output_format": {
    "sample_rate": 123
  },
  "context_id": "<string>",
  "duration": 123,
  "add_timestamps": true,
  "add_phoneme_timestamps": true,
  "use_normalized_timestamps": true,
  "speed": "normal"
}
'
{
  "type": "chunk",
  "done": false,
  "status_code": 206,
  "step_time": 123,
  "context_id": "50dc3b5e-5841-4aa1-9f94-60cfb9aead79",
  "data": "aSDinaTvuI8gbWludGxpZnk="
}

Authorizations

X-API-Key
string
header
required

Headers

Cartesia-Version
enum<string>
required

API version header.

Available options:
2024-06-10
Example:

"2024-06-10"

Body

application/json
model_id
enum<string>
required

The ID of the model to use for the generation. See Models for all options.

Available options:
sonic-3.5,
sonic-3,
sonic-latest
Example:

"sonic-3.5"

transcript
string
required
voice
TTSRequestIdSpecifier · object
required
output_format
SSEOutputFormat · object
required
language
enum<string> | null

The language that the given voice should speak the transcript in.

Available options:
en,
fr,
de,
es,
pt,
zh,
ja,
hi,
it,
ko,
nl,
pl,
ru,
sv,
tr
context_id
string | null

This can be any string value you find useful. The server will echo back the same context_id in events that it sends.

Contexts are used for continuations on the TTS (WebSocket) endpoint. The TTS (SSE) endpoint does not support continuations, so most users can ignore this property.

duration
number<double> | null

The maximum duration of the audio in seconds. You do not usually need to specify this. If the duration is not appropriate for the length of the transcript, the output audio may be truncated.

add_timestamps
boolean | null

Whether to return word-level timestamps. If false (default), no word timestamps will be produced at all. If true, the server will return timestamp events containing word-level timing information.

add_phoneme_timestamps
boolean | null

Whether to return phoneme-level timestamps. If false (default), no phoneme timestamps are produced; if add_timestamps is true, word timestamps are produced instead. If true, the server will return timestamp events containing phoneme-level timing information.

use_normalized_timestamps
boolean | null

Whether to use normalized timestamps (true) or original timestamps (false).

speed
enum<string> | null
default:normal
deprecated

Influences the speed of the generated speech. Faster speeds may reduce hallucination rate.

This feature is experimental and may not work for all voices.

Available options:
slow,
normal,
fast
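
The body fields above can be assembled and checked on the client before sending. The following is a minimal sketch, not an official client: build_tts_body is a hypothetical helper, the option sets are copied from the lists above, and the voice ID and sample rate in the usage lines are placeholder values. The SSEOutputFormat object may require fields beyond sample_rate; only sample_rate appears in the sample request.

```python
# Option lists copied from the Body section of this page.
MODEL_IDS = {"sonic-3.5", "sonic-3", "sonic-latest"}
LANGUAGES = {
    "en", "fr", "de", "es", "pt", "zh", "ja", "hi",
    "it", "ko", "nl", "pl", "ru", "sv", "tr",
}


def build_tts_body(model_id, transcript, voice_id, sample_rate,
                   language=None, add_timestamps=None):
    """Hypothetical helper: assemble a request body from documented fields."""
    if model_id not in MODEL_IDS:
        raise ValueError(f"unknown model_id: {model_id!r}")
    if language is not None and language not in LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")

    # Required fields. output_format may need more keys than shown here.
    body = {
        "model_id": model_id,
        "transcript": transcript,
        "voice": {"id": voice_id},
        "output_format": {"sample_rate": sample_rate},
    }
    # Optional fields are omitted unless the caller sets them.
    if language is not None:
        body["language"] = language
    if add_timestamps is not None:
        body["add_timestamps"] = add_timestamps
    return body


# Usage with placeholder values (the voice ID and sample rate are assumptions).
body = build_tts_body("sonic-3.5", "Hello, world.", "my-voice-id", 44100,
                      language="en", add_timestamps=True)
```

Serializing the result with json.dumps(body) gives a payload that can be passed to the --data argument of the curl request above.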

Response

200 - text/event-stream

Server-sent events stream. Each frame is data: <json>\n\n where the JSON payload matches TTSSSEEvent.
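
The frame format described above can be parsed with a few lines of code. This is a minimal sketch that assumes the whole response body has already been read into a string; parse_sse_events is a hypothetical helper, and a production client would more likely read the stream incrementally.

```python
import json


def parse_sse_events(stream_text):
    """Split buffered SSE text into frames and decode each data payload."""
    events = []
    # Normalize CRLF so that frames are separated by a blank line ("\n\n").
    normalized = stream_text.replace("\r\n", "\n")
    for frame in normalized.split("\n\n"):
        data_lines = []
        for line in frame.split("\n"):
            if line.startswith("data:"):
                value = line[len("data:"):]
                # A single leading space after the colon is not part of the value.
                if value.startswith(" "):
                    value = value[1:]
                data_lines.append(value)
        if not data_lines:
            continue  # Skip empty frames and frames without a data field.
        events.append(json.loads("\n".join(data_lines)))
    return events


# Usage with two hand-written frames shaped like the sample event above.
sample = (
    'data: {"type": "chunk", "done": false, "status_code": 206}\n\n'
    'data: {"type": "chunk", "done": false, "status_code": 206}\n\n'
)
events = parse_sse_events(sample)
```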

Audio data chunk.

type
enum<string>
required

Event type identifier.

Available options:
chunk
done
enum<boolean>
required

Whether this is the final event for the request. Always false for chunk events.

Available options:
false
data
string
required

Base64-encoded audio data.

step_time
number
required

Server-side processing time for this chunk in milliseconds.

status_code
integer
required

HTTP-style status code.

context_id
string | null

The context ID echoed back from the request, if one was provided.
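
Because each chunk event carries its audio as a Base64 string in data, a client has to decode and concatenate the chunks to recover the audio bytes. The sketch below assumes the events have already been parsed into dictionaries; collect_audio is a hypothetical helper, and how the resulting bytes should be interpreted depends on the output_format sent in the request.

```python
import base64


def collect_audio(events):
    """Decode the Base64 payload of each chunk event and join the bytes."""
    audio = bytearray()
    for event in events:
        # Only chunk events carry audio; other event types are skipped.
        if event.get("type") != "chunk":
            continue
        audio.extend(base64.b64decode(event["data"]))
    return bytes(audio)


# Usage with hand-made events; the payloads are arbitrary bytes, not real audio.
example_events = [
    {"type": "chunk", "done": False,
     "data": base64.b64encode(b"\x00\x01").decode("ascii")},
    {"type": "chunk", "done": False,
     "data": base64.b64encode(b"\x02\x03").decode("ascii")},
]
audio_bytes = collect_audio(example_events)
```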