Text to Speech (SSE)

Authorizations

X-API-Key

string

header

required

Headers

Cartesia-Version

enum<string>

required

API version header.

Available options:

2024-06-10

Example:

"2024-06-10"
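
As a minimal sketch, the two required headers above could be assembled like this. The helper name `build_headers` is hypothetical, and the `Content-Type` entry is an assumption inferred from the `application/json` body type listed below.

```python
def build_headers(api_key: str, version: str = "2024-06-10") -> dict:
    """Return the request headers listed in the Authorizations and Headers sections."""
    if not api_key:
        raise ValueError("an API key is required")
    return {
        "X-API-Key": api_key,                # required authorization header
        "Cartesia-Version": version,         # required; only documented option is 2024-06-10
        "Content-Type": "application/json",  # assumption: matches the documented body type
    }


headers = build_headers("example-api-key")  # placeholder key, not a real credential
```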

Body

application/json

model_id

string

required

The ID of the model to use for the generation. See Models for available models.

transcript

string

required

voice

TTSRequestIdSpecifier · object

required

One of: TTSRequestIdSpecifier, TTSRequestEmbeddingSpecifier

output_format

SSEOutputFormat · object

required

language

enum<string> | null

The language that the given voice should speak the transcript in.

Options: English (en), French (fr), German (de), Spanish (es), Portuguese (pt), Chinese (zh), Japanese (ja), Hindi (hi), Italian (it), Korean (ko), Dutch (nl), Polish (pl), Russian (ru), Swedish (sv), Turkish (tr).

Available options:

en, fr, de, es, pt, zh, ja, hi, it, ko, nl, pl, ru, sv, tr

context_id

string | null

Optional context ID for this request.

duration

number<double> | null

The maximum duration of the audio in seconds. You do not usually need to specify this. If the duration is not appropriate for the length of the transcript, the output audio may be truncated.

add_timestamps

boolean | null

Whether to return word-level timestamps. If false (default), no word timestamps will be produced at all. If true, the server will return timestamp events containing word-level timing information.

add_phoneme_timestamps

boolean | null

Whether to return phoneme-level timestamps. If false (default), no phoneme timestamps are produced; if add_timestamps is true, word timestamps are returned instead. If true, the server returns timestamp events containing phoneme-level timing information.

use_normalized_timestamps

boolean | null

Whether to use normalized timestamps (true) or original timestamps (false).

speed

enum<string> | null

deprecated

This feature is experimental and may not work for all voices.

Speed setting for the model. Defaults to normal.

Influences the speed of the generated speech. Faster speeds may reduce hallucination rate.

Available options:

slow, normal, fast
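
The body fields above can be put together roughly as follows. This is a sketch under stated assumptions: `build_tts_body` is a hypothetical helper, the model and voice IDs are placeholders, and the inner shapes of `voice` and `output_format` are assumed because their child attributes are not listed on this page.

```python
import json
from typing import Any, Optional

# Language codes copied from the "language" field's available options.
ALLOWED_LANGUAGES = {
    "en", "fr", "de", "es", "pt", "zh", "ja", "hi",
    "it", "ko", "nl", "pl", "ru", "sv", "tr",
}


def build_tts_body(
    model_id: str,
    transcript: str,
    voice: dict,
    output_format: dict,
    language: Optional[str] = None,
    context_id: Optional[str] = None,
    add_timestamps: Optional[bool] = None,
) -> dict:
    """Build the JSON body; optional fields left as None are omitted."""
    if language is not None and language not in ALLOWED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    body: dict = {
        "model_id": model_id,
        "transcript": transcript,
        "voice": voice,
        "output_format": output_format,
    }
    optional: dict = {
        "language": language,
        "context_id": context_id,
        "add_timestamps": add_timestamps,
    }
    body.update({k: v for k, v in optional.items() if v is not None})
    return body


example = build_tts_body(
    model_id="example-model-id",                      # placeholder
    transcript="Hello, world.",
    voice={"mode": "id", "id": "example-voice-id"},   # assumed shape
    output_format={                                   # assumed shape
        "container": "raw",
        "encoding": "pcm_s16le",
        "sample_rate": 44100,
    },
    language="en",
    add_timestamps=True,
)
print(json.dumps(example, indent=2))
```

Omitting unset optional fields keeps the payload small and leaves the server defaults (for example, `add_timestamps` defaulting to false) in effect.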

Response

200 - text/event-stream

Server-sent events stream. Each frame is data: <json>\n\n where the JSON payload matches TTSSSEEvent.

One of: TTSSSEChunkEvent, TTSSSETimestampsEvent, TTSSSEPhonemeTimestampsEvent, TTSSSEDoneEvent, TTSSSEErrorEvent. The fields below describe TTSSSEChunkEvent.

Audio data chunk.

type

enum<string>

required

Event type identifier.

Available options:

chunk

done

enum<boolean>

required

Whether this is the final event for the request. Always false for chunk events.

Available options:

false

data

string

required

Base64-encoded audio data.

step_time

number

required

Server-side processing time for this chunk in milliseconds.

status_code

integer

required

HTTP-style status code.

context_id

string | null

The context ID echoed back from the request, if one was provided.
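
To illustrate the `data: <json>\n\n` framing and the chunk fields above, here is a small sketch that parses frames from an iterable of text lines and concatenates the decoded audio. The function names are hypothetical, the sample frames are synthetic, and the shape of the final event (`"type": "done"`, `"done": true`) is an assumption, since only the chunk event is documented on this page. Timestamp and error events are ignored here for the same reason.

```python
import base64
import json
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON payload per SSE frame (frames end at a blank line)."""
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data_parts:  # a blank line closes the current frame
                yield json.loads("\n".join(data_parts))
                data_parts = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            if value.startswith(" "):
                value = value[1:]  # SSE drops one optional space after the colon
            data_parts.append(value)
    if data_parts:  # tolerate a stream that ends without a trailing blank line
        yield json.loads("\n".join(data_parts))


def collect_audio(events: Iterable[dict]) -> bytes:
    """Concatenate Base64-decoded chunk data until an event reports done."""
    audio = bytearray()
    for event in events:
        if event.get("type") == "chunk":
            audio.extend(base64.b64decode(event["data"]))
        if event.get("done"):
            break
    return bytes(audio)


# Synthetic frames for illustration only; real audio bytes would be PCM or similar.
frames = [
    'data: {"type": "chunk", "done": false, "data": "aGVs", "step_time": 1.0, "status_code": 200, "context_id": null}\n',
    "\n",
    'data: {"type": "chunk", "done": false, "data": "bG8=", "step_time": 1.0, "status_code": 200, "context_id": null}\n',
    "\n",
    'data: {"type": "done", "done": true, "status_code": 200, "context_id": null}\n',
    "\n",
]
audio = collect_audio(iter_sse_events(frames))
```

With a real HTTP client, the same generator could be fed from whatever line iterator the client exposes for a streaming response.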
