POST /tts/sse
Text to Speech (SSE)
curl --request POST \
  --url https://api.cartesia.ai/tts/sse \
  --header 'Cartesia-Version: <cartesia-version>' \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: <api-key>' \
  --data '{
  "model_id": "<string>",
  "transcript": "<string>",
  "voice": {
    "mode": "id",
    "id": "<string>",
    "__experimental_controls": {
      "speed": 123,
      "emotion": [
        "anger:lowest"
      ]
    }
  },
  "language": "en",
  "output_format": {
    "container": "raw",
    "encoding": "pcm_f32le",
    "sample_rate": 123
  },
  "duration": 123,
  "speed": "slow",
  "add_timestamps": true,
  "add_phoneme_timestamps": true,
  "use_normalized_timestamps": true,
  "context_id": "<string>"
}'
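
Since this endpoint responds with a Server-Sent Events stream, a client has to split the response into events before it can use the payloads. The following is a minimal sketch of a generic SSE line parser in Python. It follows the general SSE framing rules (`event:` and `data:` fields, blank line as event terminator, lines starting with `:` as comments). The event name `chunk` and the payload in the sample are hypothetical, since this page does not document the event schema.

```python
from typing import Dict, Iterable, Iterator, List, Optional


def iter_sse_events(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Group raw SSE lines into events with an 'event' name and a 'data' string."""
    event: Optional[str] = None
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event; skip events without data.
            if data:
                yield {"event": event or "message", "data": "\n".join(data)}
            event, data = None, []
            continue
        if line.startswith(":"):
            # Comment line (often used as a keep-alive); ignore it.
            continue
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space after the colon is dropped
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)


# Hypothetical sample stream; real event names and payloads may differ.
sample = ["event: chunk", 'data: {"x": 1}', "", ": keep-alive", "data: a", "data: b", ""]
events = list(iter_sse_events(sample))
```

With an HTTP client that exposes the response body line by line, the same generator can be fed directly from the stream, and each `data` string can then be decoded with `json.loads` if the payload is JSON.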

Authorizations

X-API-Key
string
header
required

Headers

Cartesia-Version
enum<string>
required

API version header. Must be set to one of the supported API versions listed below, e.g. '2024-06-10'.

Available options:
2024-06-10,
2024-11-13,
2025-04-16
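
As an illustration of the header requirements above, here is a small Python sketch that assembles the three request headers and rejects a version string that is not among the listed options. The helper name `build_headers` and the default version are assumptions made for this example, not part of the API.

```python
from typing import Dict

# Version strings taken from the "Available options" list above.
ALLOWED_VERSIONS = ("2024-06-10", "2024-11-13", "2025-04-16")


def build_headers(api_key: str, version: str = "2024-11-13") -> Dict[str, str]:
    """Return the headers used by the curl example, validating the version locally."""
    if not api_key:
        raise ValueError("api_key must be a non-empty string")
    if version not in ALLOWED_VERSIONS:
        raise ValueError(f"unsupported Cartesia-Version: {version!r}")
    return {
        "Cartesia-Version": version,
        "Content-Type": "application/json",
        "X-API-Key": api_key,
    }


headers = build_headers("example-key")  # "example-key" is a placeholder, not a real key
```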
Example:

"2024-11-13"

Body

application/json
model_id
string
required

The ID of the model to use for the generation. See Models for available models.

transcript
string
required
voice
object
required
  • TTSRequestIdSpecifier
  • TTSRequestEmbeddingSpecifier
output_format
object
required
language
enum<string>

The language that the given voice should speak the transcript in.

Options: English (en), French (fr), German (de), Spanish (es), Portuguese (pt), Chinese (zh), Japanese (ja), Hindi (hi), Italian (it), Korean (ko), Dutch (nl), Polish (pl), Russian (ru), Swedish (sv), Turkish (tr).

Available options:
en,
fr,
de,
es,
pt,
zh,
ja,
hi,
it,
ko,
nl,
pl,
ru,
sv,
tr
duration
number | null

The maximum duration of the audio in seconds. You do not usually need to specify this. If the duration is not appropriate for the length of the transcript, the output audio may be truncated.

speed
enum<string>

This feature is experimental and may not work for all voices.

Speed setting for the model. Defaults to normal.

Influences the speed of the generated speech. Faster speeds may reduce hallucination rate.

Available options:
slow,
normal,
fast
add_timestamps
boolean | null

Whether to return word-level timestamps. If false (default), no word timestamps will be produced at all. If true, the server will return timestamp events containing word-level timing information.

add_phoneme_timestamps
boolean | null

Whether to return phoneme-level timestamps. If false (default), no phoneme timestamps are produced; if add_timestamps is true, the timestamps produced are word timestamps instead. If true, the server returns timestamp events containing phoneme-level timing information.
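
The interaction between the two timestamp flags can be summarized in a few lines of Python. This sketch reads each flag independently, as the descriptions above do: word timestamps are tied to `add_timestamps` and phoneme timestamps to `add_phoneme_timestamps`. The function name is hypothetical, and treating `null` the same as `false` is an assumption based on the stated defaults.

```python
from typing import Optional, Set


def expected_timestamp_levels(
    add_timestamps: Optional[bool] = None,
    add_phoneme_timestamps: Optional[bool] = None,
) -> Set[str]:
    """Return which timestamp levels a request asks for, per the field descriptions."""
    levels: Set[str] = set()
    if add_timestamps:  # None (null) is treated like the default, false
        levels.add("word")
    if add_phoneme_timestamps:
        levels.add("phoneme")
    return levels


word_only = expected_timestamp_levels(add_timestamps=True)
neither = expected_timestamp_levels()
```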

use_normalized_timestamps
boolean | null

Whether to use normalized timestamps (true) or original timestamps (false).

context_id
string

Optional context ID for this request.
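
To tie the body fields together, the following Python sketch builds a request body with the four required fields and adds optional fields only when they are set, checking `language` and `speed` against the options listed above. The helper name, the model and voice IDs, and the sample rate of 44100 are placeholders chosen for illustration; only the field names and enum values come from this page.

```python
import json
from typing import Any, Dict, Optional

SUPPORTED_LANGUAGES = {
    "en", "fr", "de", "es", "pt", "zh", "ja", "hi",
    "it", "ko", "nl", "pl", "ru", "sv", "tr",
}
SUPPORTED_SPEEDS = {"slow", "normal", "fast"}


def build_tts_body(
    model_id: str,
    transcript: str,
    voice_id: str,
    output_format: Dict[str, Any],
    *,
    language: Optional[str] = None,
    speed: Optional[str] = None,
    duration: Optional[float] = None,
    context_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble a JSON-serializable body; optional fields left as None are omitted."""
    if language is not None and language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    if speed is not None and speed not in SUPPORTED_SPEEDS:
        raise ValueError(f"unsupported speed: {speed!r}")
    body: Dict[str, Any] = {
        "model_id": model_id,
        "transcript": transcript,
        "voice": {"mode": "id", "id": voice_id},  # ID-based voice specifier
        "output_format": output_format,
    }
    optional = {
        "language": language,
        "speed": speed,
        "duration": duration,
        "context_id": context_id,
    }
    body.update({key: value for key, value in optional.items() if value is not None})
    return body


# Placeholder values; substitute real model and voice IDs.
body = build_tts_body(
    "example-model",
    "Hello, world.",
    "example-voice",
    {"container": "raw", "encoding": "pcm_f32le", "sample_rate": 44100},
    language="en",
    speed="fast",
)
payload = json.dumps(body)
```

The resulting `payload` string corresponds to the `--data` argument in the curl example.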
