Text to Speech (SSE)

curl --request POST \
  --url https://api.cartesia.ai/tts/sse \
  --header 'Authorization: Bearer <token>' \
  --header 'Cartesia-Version: <cartesia-version>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "model_id": "<string>",
  "transcript": "<string>",
  "voice": {
    "mode": "id",
    "id": "<string>"
  },
  "output_format": {
    "container": "raw",
    "encoding": "pcm_f32le",
    "sample_rate": 8000
  },
  "generation_config": {
    "volume": 1,
    "speed": 1,
    "emotion": "neutral"
  },
  "language": "en",
  "speed": "normal",
  "add_timestamps": false,
  "add_phoneme_timestamps": false,
  "use_normalized_timestamps": true,
  "pronunciation_dict_id": "<string>",
  "context_id": "<string>"
}
'
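For readers calling the endpoint from Python rather than curl, the request above could be assembled roughly as follows. This is a minimal sketch using only the standard library; the helper name `build_tts_request` and its parameters are illustrative and not part of any Cartesia SDK. It builds the request without sending it.

```python
import json
import urllib.request

API_URL = "https://api.cartesia.ai/tts/sse"


def build_tts_request(token, version, model_id, transcript, voice_id, **optional):
    """Build (but do not send) a POST request mirroring the curl example.

    Hypothetical helper: the name and signature are assumptions made for
    illustration, not an official client API.
    """
    body = {
        "model_id": model_id,
        "transcript": transcript,
        "voice": {"mode": "id", "id": voice_id},
        "output_format": {
            "container": "raw",
            "encoding": "pcm_f32le",
            "sample_rate": 8000,
        },
    }
    # Optional body fields (language, add_timestamps, context_id, ...) are
    # included only when the caller passes a non-None value.
    body.update({k: v for k, v in optional.items() if v is not None})
    headers = {
        "Authorization": f"Bearer {token}",
        "Cartesia-Version": version,
        "Content-Type": "application/json",
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; the response can then be read line by line as an event stream.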

Authorizations

Authorization

string

header

required

An Access Token

Headers

Cartesia-Version

enum<string>

required

API version header. Must be set to the API version, e.g. '2024-06-10'.

Available options: 2024-06-10, 2024-11-13, 2025-04-16

Example:

"2025-04-16"

Body

application/json

model_id

string

required

The ID of the model to use for the generation. See Models for available models.

transcript

string

required

voice

TTSRequestVoiceSpecifier · object

required


output_format

SSEOutputFormat · object

required
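The example request asks for `raw` audio encoded as `pcm_f32le`, meaning 32-bit little-endian floating-point samples. Once the raw audio bytes are in hand, they could be decoded along these lines. This is a sketch only: how the bytes are carried inside each event is not described on this page, mono audio is assumed, and the function names are illustrative.

```python
import struct


def decode_pcm_f32le(raw: bytes) -> list[float]:
    """Decode little-endian 32-bit float PCM bytes into Python floats."""
    # Ignore a trailing partial sample (fewer than 4 bytes), which can
    # occur if a chunk boundary falls in the middle of a sample.
    usable = len(raw) - (len(raw) % 4)
    return [sample[0] for sample in struct.iter_unpack("<f", raw[:usable])]


def duration_seconds(sample_count: int, sample_rate: int = 8000) -> float:
    """Playback duration, assuming one channel (an assumption, not documented here)."""
    return sample_count / sample_rate
```

In a streaming client, any leftover partial sample would normally be carried over and prepended to the next chunk rather than dropped.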


generation_config

GenerationConfig · object

Configure the various attributes of the generated speech. These are only for sonic-3 and have no effect on earlier models.

See Volume, Speed, and Emotion in Sonic-3 for a guide on this option.


language

enum<string>

The language that the given voice should speak the transcript in. For valid options, see Models.

Available options: en, fr, de, es, pt, zh, ja, hi, it, ko, nl, pl, ru, sv, tr, tl, bg, ro, ar, cs, el, fi, hr, ms, sk, da, ta, uk, hu, no, vi, bn, th, he, ka, id, te, gu, kn, ml, mr, pa

speed

enum<string>

default:normal

deprecated

Deprecated: for sonic-3, use generation_config.speed instead. Sets the speed of the generated speech; defaults to normal. This feature is experimental and may not work for all voices. Faster speeds may reduce the hallucination rate.

Available options: slow, normal, fast
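Because speed can be set in two places (the deprecated top-level `speed` enum and `generation_config.speed` for sonic-3), a client may want to choose the field based on the model. The sketch below shows one way to do that. It assumes sonic-3 model IDs begin with the string `sonic-3`, which this page does not confirm, and the helper name is hypothetical. The numeric value is passed through without validation because its allowed range is not given here.

```python
LEGACY_SPEEDS = {"slow", "normal", "fast"}


def speed_fields(model_id, generation_speed=None, legacy_speed=None):
    """Return the body fragment that carries the speed setting.

    Assumption: model IDs starting with "sonic-3" are sonic-3 models.
    """
    if model_id.startswith("sonic-3"):
        if generation_speed is None:
            return {}
        # sonic-3: speed lives under generation_config.
        return {"generation_config": {"speed": generation_speed}}
    if legacy_speed is None:
        return {}
    if legacy_speed not in LEGACY_SPEEDS:
        raise ValueError(f"speed must be one of {sorted(LEGACY_SPEEDS)}")
    # Earlier models: only the deprecated top-level enum applies.
    return {"speed": legacy_speed}
```

The returned dictionary can be merged into the request body before it is serialized.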

add_timestamps

boolean | null

default:false

Whether to return word-level timestamps. If false (default), no word timestamps will be produced at all. If true, the server will return timestamp events containing word-level timing information.

add_phoneme_timestamps

boolean | null

default:false

Whether to return phoneme-level timestamps. If false (default), no phoneme timestamps will be produced. If true, the server will return timestamp events containing phoneme-level timing information.

use_normalized_timestamps

boolean | null

Whether to use normalized timestamps (true) or original timestamps (false).

pronunciation_dict_id

string | null

The ID of a pronunciation dictionary to use for the generation. Pronunciation dictionaries are supported by sonic-3 models and newer.

context_id

string | null

Optional context ID for this request.

Response

204 - undefined
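The response schema is not documented on this page, so the following sketch covers only the generic Server-Sent Events framing defined by the HTML standard: `event:` and `data:` fields, comment lines starting with a colon, and a blank line ending each event. The event names and payload structure this endpoint actually emits are not shown here and should be checked against the live stream. The function name `parse_sse` is illustrative.

```python
def parse_sse(lines):
    """Yield (event_name, data) tuples from an iterable of SSE text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
            continue
        if line.startswith(":"):
            # Comment line, often used as a keep-alive.
            continue
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # strip the single optional leading space
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)
        # Other fields (id, retry) are ignored in this sketch.
```

Each `data` string can then be handed to `json.loads` if the payload turns out to be JSON.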
