POST /tts/bytes
Text-to-Speech (Bytes)
curl --request POST \
  --url https://api.cartesia.ai/tts/bytes \
  --header 'Authorization: Bearer <token>' \
  --header 'Cartesia-Version: <cartesia-version>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "model_id": "sonic-3.5",
  "transcript": "<string>",
  "voice": {
    "mode": "id",
    "id": "<string>"
  },
  "output_format": {
    "container": "raw"
  },
  "pronunciation_dict_id": "<string>",
  "generation_config": {
    "volume": 1,
    "speed": 1
  },
  "speed": "normal"
}
'
"<string>"
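
The same request can be assembled in Python. The following is a minimal sketch using only the standard library. The helper name `build_tts_request` and its parameter names are illustrative assumptions, not part of the Cartesia API. The body mirrors the curl example above, so `output_format` carries only `container`; the full `RAWOutputFormat` object may require further fields not listed on this page. The function builds the request without sending it.

```python
import json
import urllib.request

API_URL = "https://api.cartesia.ai/tts/bytes"


def build_tts_request(token, transcript, voice_id,
                      model_id="sonic-3.5", version="2026-03-01"):
    """Build (but do not send) a POST request for /tts/bytes.

    Hypothetical helper: the name and signature are assumptions
    made for illustration only.
    """
    payload = {
        "model_id": model_id,
        "transcript": transcript,
        "voice": {"mode": "id", "id": voice_id},
        # Mirrors the curl example; other output_format fields may be needed.
        "output_format": {"container": "raw"},
    }
    headers = {
        "Authorization": f"Bearer {token}",
        "Cartesia-Version": version,
        "Content-Type": "application/json",
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

To send it, pass the returned object to `urllib.request.urlopen` and read the audio bytes from the response.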

Authorizations

Authorization
string
header
required

A short-lived access token to make API requests from a client.

Headers

Cartesia-Version
enum<string>
required

API version header.

Available options:
2026-03-01
Example:

"2026-03-01"

Body

application/json
model_id
enum<string>
required

The ID of the model to use for the generation. See Models for all options.

Available options:
sonic-3.5,
sonic-3,
sonic-latest
Example:

"sonic-3.5"

transcript
string
required
voice
TTSRequestVoiceSpecifier · object
required
output_format
RAWOutputFormat · object
required
language
enum<string> | null

The language in which the given voice should speak the transcript. The supported languages may depend on the model you're using. See Models for details.

Available options:
en,
fr,
de,
es,
pt,
zh,
ja,
hi,
it,
ko,
nl,
pl,
ru,
sv,
tr,
tl,
bg,
ro,
ar,
cs,
el,
fi,
hr,
ms,
sk,
da,
ta,
uk,
hu,
no,
vi,
bn,
th,
he,
ka,
id,
te,
gu,
kn,
ml,
mr,
pa
pronunciation_dict_id
string | null

The ID of a pronunciation dictionary to use for the generation. Pronunciation dictionaries are supported by sonic-3 models and newer.

generation_config
GenerationConfig · object

Configure the various attributes of the generated speech. Available on sonic-3 and sonic-3.5; not available on earlier models.

See Volume, Speed, and Emotion for a guide on this option.

speed
enum<string>
default:normal
deprecated

Influences the speed of the generated speech. This property is deprecated and may not work for all voices. Use generation_config.speed instead.

Available options:
slow,
normal,
fast
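
One way to move an existing request body off the deprecated top-level `speed` field is sketched below. The helper `migrate_speed` is hypothetical, and the numeric value to pass is left to the caller: this page only shows `1` as an example for `generation_config.speed` and does not state the accepted range or how `slow`, `normal`, and `fast` map onto it.

```python
def migrate_speed(payload, speed_value):
    """Return a copy of payload that sets generation_config.speed
    and omits the deprecated top-level "speed" field.

    Hypothetical helper; speed_value is supplied by the caller because
    the mapping from slow/normal/fast is not documented on this page.
    """
    # Copy everything except the deprecated top-level field.
    migrated = {k: v for k, v in payload.items() if k != "speed"}
    # Copy the nested config so the caller's dict is not mutated.
    config = dict(migrated.get("generation_config") or {})
    config["speed"] = speed_value
    migrated["generation_config"] = config
    return migrated
```

The input dictionary is left unchanged, so the old and new bodies can be compared side by side before switching over.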

Response

200 - audio/*

Audio bytes

The response is of type file.
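
Because the body is binary audio rather than JSON, a client would typically write it to disk in chunks instead of decoding it as text. The sketch below assumes a file-like response object with a `read(size)` method, such as the one returned by `urllib.request.urlopen`; the function name `save_audio` and the chunk size are illustrative choices.

```python
def save_audio(response, path, chunk_size=64 * 1024):
    """Copy a binary file-like response body to path.

    Returns the number of bytes written. Hypothetical helper,
    not part of any Cartesia SDK.
    """
    written = 0
    with open(path, "wb") as out:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:  # empty bytes signals end of stream
                break
            out.write(chunk)
            written += len(chunk)
    return written
```

With the `raw` container the saved file has no header, so a player needs to be told the audio parameters that were requested in `output_format`.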