Text to Speech (Bytes) - Cartesia Docs

curl --request POST \
  --url https://api.cartesia.ai/tts/bytes \
  --header 'Authorization: Bearer <token>' \
  --header 'Cartesia-Version: <cartesia-version>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "model_id": "sonic-3.5",
  "transcript": "<string>",
  "voice": {
    "id": "<string>"
  },
  "output_format": {},
  "pronunciation_dict_id": "<string>",
  "generation_config": {
    "volume": 1,
    "speed": 1
  },
  "speed": "normal"
}
'

Authorizations

Authorization

string

header

required

A short-lived access token to make API requests from a client.

Headers

Cartesia-Version

enum<string>

required

API version header.

Available options:

2026-03-01

Example:

"2026-03-01"
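
As a rough illustration of the two sections above, the required headers can be assembled in one place. This is a minimal sketch: the helper name `build_headers` and the `API_VERSION` constant are hypothetical, and the header names are taken from the curl example on this page.

```python
# Hypothetical helper; header names mirror the curl example on this page.
API_VERSION = "2026-03-01"  # the only version listed under "Available options"


def build_headers(token: str, version: str = API_VERSION) -> dict:
    """Return the Authorization, Cartesia-Version, and Content-Type headers."""
    if not token:
        raise ValueError("an access token is required")
    return {
        "Authorization": f"Bearer {token}",
        "Cartesia-Version": version,
        "Content-Type": "application/json",
    }
```

Keeping the version in a single constant makes it easier to update every request at once if a newer API version is adopted.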

Body

application/json

model_id

enum<string>

required

The ID of the model to use for the generation. See Models for all options.

Available options:

sonic-3.5,

sonic-3,

sonic-latest

Example:

"sonic-3.5"

transcript

string

required

voice

TTSRequestVoiceSpecifier · object

required

output_format

RAWOutputFormat · object

required

One of RAWOutputFormat, WAVOutputFormat, or MP3OutputFormat.
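
The four required body fields can be put together as in the sketch below. The function name `build_body` is hypothetical, and because this page does not list the child attributes of `output_format`, the sketch treats it as an opaque dictionary supplied by the caller.

```python
# Model IDs listed under "Available options" for model_id.
ALLOWED_MODELS = {"sonic-3.5", "sonic-3", "sonic-latest"}


def build_body(model_id: str, transcript: str, voice_id: str,
               output_format: dict) -> dict:
    """Assemble the required fields: model_id, transcript, voice, output_format."""
    if model_id not in ALLOWED_MODELS:
        raise ValueError(f"unknown model_id: {model_id!r}")
    if not transcript:
        raise ValueError("transcript must not be empty")
    return {
        "model_id": model_id,
        "transcript": transcript,
        "voice": {"id": voice_id},
        # Contents depend on the chosen variant (raw, WAV, or MP3); the
        # specific fields are not shown on this page, so none are assumed.
        "output_format": output_format,
    }
```

Rejecting an empty transcript is a client-side choice in this sketch, not a rule stated on this page.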

language

enum<string> | null

The language that the given voice should speak the transcript in. The languages available may depend on the model you're using. See Models for details.

Available options:

en,

fr,

de,

es,

pt,

zh,

ja,

hi,

it,

ko,

nl,

pl,

ru,

sv,

tr,

tl,

bg,

ro,

ar,

cs,

el,

fi,

hr,

ms,

sk,

da,

ta,

uk,

hu,

no,

vi,

bn,

th,

he,

ka,

id,

te,

gu,

kn,

ml,

mr,

pa
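
A client might check the language code before sending a request. The sketch below copies the codes from the list above; the helper name `with_language` is hypothetical, and since the field is nullable, `None` is treated here as "omit the field".

```python
from typing import Optional

# Language codes copied from the "Available options" list above.
SUPPORTED_LANGUAGES = frozenset(
    "en fr de es pt zh ja hi it ko nl pl ru sv tr tl bg ro ar cs el fi hr ms "
    "sk da ta uk hu no vi bn th he ka id te gu kn ml mr pa".split()
)


def with_language(body: dict, language: Optional[str]) -> dict:
    """Return a copy of body with `language` set, or an unchanged copy if None."""
    if language is None:
        return dict(body)
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    return {**body, "language": language}
```

Because support may vary by model, passing this check does not guarantee that a given model accepts the language.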

pronunciation_dict_id

string | null

The ID of a pronunciation dictionary to use for the generation. Pronunciation dictionaries are supported by sonic-3 models and newer.

generation_config

GenerationConfig · object

Configure the various attributes of the generated speech. Available on sonic-3 and sonic-3.5; not available on earlier models.

See Volume, Speed, and Emotion for a guide on this option.
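
One way to attach `generation_config` only where it is documented as available is sketched below. The helper name is hypothetical, the `volume` and `speed` keys and their default of 1 come from the curl example on this page, and rejecting `sonic-latest` is a conservative assumption because this page does not say which model it resolves to. No value ranges are checked, since none are given here.

```python
# Models this page names as supporting generation_config.
GENERATION_CONFIG_MODELS = {"sonic-3", "sonic-3.5"}


def with_generation_config(body: dict, volume: float = 1,
                           speed: float = 1) -> dict:
    """Return a copy of body with a generation_config object attached."""
    model_id = body.get("model_id")
    if model_id not in GENERATION_CONFIG_MODELS:
        raise ValueError(
            f"generation_config is not documented for model {model_id!r}"
        )
    return {**body, "generation_config": {"volume": volume, "speed": speed}}
```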

speed

enum<string>

default: normal

deprecated

Speed setting for the model, which influences the speed of the generated speech. Defaults to normal. For sonic-3, use generation_config.speed instead. This feature is experimental and may not work for all voices. Faster speeds may reduce the hallucination rate.

Available options:

slow,

normal,

fast
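
Since the top-level `speed` field is deprecated, a client could flag request bodies that still use it. The sketch below is a hypothetical lint-style check. Flagging `sonic-3.5` as well as `sonic-3` is an assumption, based on `generation_config` being available on both.

```python
LEGACY_SPEEDS = {"slow", "normal", "fast"}
# Assumption: both models that support generation_config should prefer it.
PREFER_GENERATION_CONFIG = {"sonic-3", "sonic-3.5"}


def check_legacy_speed(body: dict) -> list:
    """Return human-readable warnings about the deprecated `speed` field."""
    problems = []
    speed = body.get("speed")
    if speed is None:
        return problems  # field not used, nothing to report
    if speed not in LEGACY_SPEEDS:
        problems.append(f"speed must be one of {sorted(LEGACY_SPEEDS)}")
    if body.get("model_id") in PREFER_GENERATION_CONFIG:
        problems.append("speed is deprecated; use generation_config.speed")
    return problems
```

The check deliberately does not translate `slow` or `fast` into a numeric `generation_config.speed`, because this page gives no such mapping.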

Response

200 - audio/*

Audio bytes

The response is of type file.
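
Because the response is raw audio rather than JSON, a client typically streams it straight to disk. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and the URL and headers are those of the curl example on this page.

```python
import json
import shutil
import urllib.request

API_URL = "https://api.cartesia.ai/tts/bytes"


def build_request(token: str, body: dict,
                  version: str = "2026-03-01") -> urllib.request.Request:
    """Build the POST request without sending it."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Cartesia-Version": version,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(token: str, body: dict, path: str) -> None:
    """Send the request and copy the audio bytes to `path`."""
    request = build_request(token, body)
    # urlopen raises urllib.error.HTTPError for non-2xx responses.
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        shutil.copyfileobj(response, out)
```

Splitting request construction from sending keeps the network call in one small function, so the rest can be exercised without credentials.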