POST /tts/bytes
Text-to-Speech (Bytes)
curl --request POST \
  --url https://api.cartesia.ai/tts/bytes \
  --header 'Cartesia-Version: <cartesia-version>' \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: <api-key>' \
  --data '
{
  "model_id": "sonic-3.5",
  "transcript": "<string>",
  "voice": {
    "mode": "id",
    "id": "<string>",
    "__experimental_controls": {
      "speed": 123,
      "emotion": []
    }
  },
  "output_format": {
    "container": "raw",
    "sample_rate": 123,
    "bit_rate": 123
  },
  "duration": 123,
  "speed": "normal"
}
'
"<string>"
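
The generated request above uses placeholder values (`123`, `<string>`). As a rough sketch of the same call with concrete values, the script below builds the JSON body and sends it only when credentials are present. The helper names `build_payload` and `synthesize`, the `CARTESIA_API_KEY` and `VOICE_ID` environment variables, the output file `out.pcm`, the 44100 Hz sample rate, and the `encoding` field are assumptions for illustration and are not documented on this page.

```shell
# Build a /tts/bytes request body.
# $1 = transcript, $2 = voice ID. Values are interpolated verbatim, so they
# must not contain double quotes or backslashes in this simplified sketch.
build_payload() {
  cat <<EOF
{
  "model_id": "sonic-3.5",
  "transcript": "$1",
  "voice": { "mode": "id", "id": "$2" },
  "output_format": {
    "container": "raw",
    "encoding": "pcm_s16le",
    "sample_rate": 44100
  },
  "language": "en"
}
EOF
}

# Send the request and write the audio bytes to a file.
# $1 = transcript, $2 = voice ID, $3 = output path.
# --fail makes curl exit non-zero on HTTP errors instead of saving the error body.
synthesize() {
  curl --silent --show-error --fail \
    --request POST \
    --url https://api.cartesia.ai/tts/bytes \
    --header 'Cartesia-Version: 2024-11-13' \
    --header 'Content-Type: application/json' \
    --header "X-API-Key: ${CARTESIA_API_KEY}" \
    --data "$(build_payload "$1" "$2")" \
    --output "$3"
}

# Only call the API when both (hypothetical) environment variables are set.
if [ -n "${CARTESIA_API_KEY:-}" ] && [ -n "${VOICE_ID:-}" ]; then
  synthesize "Hello from the bytes endpoint." "$VOICE_ID" out.pcm
fi
```

Keeping the body in its own function makes it easy to inspect the JSON before sending it. For transcripts that may contain quotes, a JSON-aware tool is a safer way to build the body.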

Authorizations

X-API-Key
string
header
required

Headers

Cartesia-Version
enum<string>
required

API version header.

Available options: 2024-11-13

Example: "2024-11-13"

Body

application/json
model_id
enum<string>
required

The ID of the model to use for the generation. See Models for all options.

Available options: sonic-3.5, sonic-3, sonic-latest

Example: "sonic-3.5"

transcript
string
required
voice
TTSRequestIdSpecifier · object
required
output_format
RawOutputFormat · object
required
language
enum<string> | null

The language that the given voice should speak the transcript in.

Available options: en, fr, de, es, pt, zh, ja, hi, it, ko, nl, pl, ru, sv, tr
duration
number<double> | null

The maximum duration of the audio in seconds. You do not usually need to specify this. If the duration is not appropriate for the length of the transcript, the output audio may be truncated.

speed
enum<string> | null
default: normal
deprecated

Influences the speed of the generated speech. Faster speeds may reduce hallucination rate.

This feature is experimental and may not work for all voices.

Available options: slow, normal, fast

Response

200 - audio/*

Audio bytes

The response is of type file.
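
Because a successful response is a bare stream of audio bytes, one way to sanity-check a saved file (for example, to see whether a `duration` limit truncated it) is to estimate its length from its size. The sketch below assumes headerless, mono PCM; the helper name `pcm_duration_ms` and the file name `out.pcm` are hypothetical, and the bytes-per-sample value depends on an encoding that this page does not document.

```shell
# pcm_duration_ms FILE SAMPLE_RATE BYTES_PER_SAMPLE
# Prints the approximate duration in milliseconds of a headerless mono PCM file.
pcm_duration_ms() {
  # Arithmetic expansion strips any padding that wc adds on some platforms.
  bytes=$(( $(wc -c < "$1") ))
  echo $(( bytes * 1000 / ($2 * $3) ))
}

# Example: a file requested at 44100 Hz with 16-bit (2-byte) samples.
if [ -f out.pcm ]; then
  pcm_duration_ms out.pcm 44100 2
fi
```

If the reported length is noticeably shorter than expected for the transcript, the `duration` value in the request is worth checking first.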