Text to Speech (SSE)

Request

This endpoint expects an object.

model_id string Required

The ID of the model to use for the generation. See Models for available models.

transcript string Required

voice object Required

output_format object Required

language enum Optional

The language that the given voice should speak the transcript in.

Options: English (en), French (fr), German (de), Spanish (es), Portuguese (pt), Chinese (zh), Japanese (ja), Hindi (hi), Italian (it), Korean (ko), Dutch (nl), Polish (pl), Russian (ru), Swedish (sv), Turkish (tr).

duration double Optional

The maximum duration of the audio in seconds. You do not usually need to specify this. If the duration is not appropriate for the length of the transcript, the output audio may be truncated.

speed enum Optional

This feature is experimental and may not work for all voices.

Speed setting for the model. Defaults to normal.

Influences the speed of the generated speech. Faster speeds may reduce hallucination rate.

Allowed values:

add_timestamps boolean Optional

Whether to return word-level timestamps. If false (default), no word timestamps will be produced at all. If true, the server will return timestamp events containing word-level timing information.

add_phoneme_timestamps boolean Optional

Whether to return phoneme-level timestamps. If false (default), no phoneme timestamps will be produced. If true, the server will return timestamp events containing phoneme-level timing information.

use_normalized_timestamps boolean Optional

Whether to use normalized timestamps (true) or original timestamps (false).

pronunciation_dict_ids list of strings Optional

A list of pronunciation dictionary IDs to use for the generation. These are applied in addition to the pinned pronunciation dictionary, which is treated as the first element of the list. If entries conflict, the dictionary that appears latest in the list takes precedence.
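The precedence rule can be pictured as a left-to-right merge. The sketch below is purely illustrative: it models each dictionary as a plain Python mapping from a word to its replacement, which is an assumption about shape and not the service's actual data model, and `merge_pronunciation_dicts` is a hypothetical helper, not part of any SDK.

```python
def merge_pronunciation_dicts(pinned, extra_dicts):
    """Merge dictionaries left to right; later entries override earlier ones.

    `pinned` stands in for the pinned dictionary (treated as the first
    element) and `extra_dicts` for the dictionaries named in
    `pronunciation_dict_ids`, in order. Both shapes are assumptions.
    """
    merged = {}
    for entries in [pinned, *extra_dicts]:
        merged.update(entries)  # a later dictionary wins on conflicting keys
    return merged


# Hypothetical dictionary contents, for illustration only.
pinned = {"SQL": "sequel", "GIF": "jif"}
first = {"GIF": "gif"}
second = {"SQL": "ess-cue-ell"}

result = merge_pronunciation_dicts(pinned, [first, second])
print(result)
```

Here `first` overrides the pinned entry for "GIF" and `second` overrides the pinned entry for "SQL", because each appears later in the list than the pinned dictionary.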

context_idstringOptional

Optional context ID for this request.
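The request fields above can be assembled with a small helper that always sets the four required fields and includes optional ones only when a value is supplied. This is a minimal sketch, not an official client: `build_tts_payload` is a hypothetical name, and the default model and output format are simply the values used in the sample request on this page.

```python
# Optional top-level fields listed in the request reference above.
OPTIONAL_FIELDS = {
    "language",
    "duration",
    "speed",
    "add_timestamps",
    "add_phoneme_timestamps",
    "use_normalized_timestamps",
    "pronunciation_dict_ids",
    "context_id",
}


def build_tts_payload(transcript, voice_id, model_id="sonic-2", **optional):
    """Assemble a request body, omitting optional fields left as None."""
    payload = {
        "model_id": model_id,
        "transcript": transcript,
        "voice": {"mode": "id", "id": voice_id},
        "output_format": {
            "container": "raw",
            "encoding": "pcm_f32le",
            "sample_rate": 44100,
        },
    }
    for key, value in optional.items():
        if key not in OPTIONAL_FIELDS:
            raise ValueError(f"unknown request field: {key}")
        if value is not None:
            payload[key] = value
    return payload


payload = build_tts_payload(
    "Hello, world!",
    "694f9389-aac1-45b6-b726-9d9369183238",
    language="en",
    add_timestamps=True,
    duration=None,  # left unset, so it is not sent
)
print(sorted(payload))
```

Rejecting unknown keys locally is a design choice of this sketch; it catches misspelled field names before a request is sent.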

Response

This endpoint returns a stream of objects.

chunk object

flush_done object

done object

timestamps object

error object

phoneme_timestamps object
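Because the response is a server-sent event stream, a client has to split it into individual events before handling each object type. The following is a minimal sketch of a generic SSE line parser that follows the standard `event:` / `data:` framing. Whether the server sets the `event:` field to names such as `chunk` or `done`, and the JSON fields in the sample payloads, are assumptions made for illustration; `iter_sse_events` is a hypothetical helper.

```python
import json


def iter_sse_events(lines):
    """Yield (event_name, data_string) pairs from an iterable of text lines."""
    event, data = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield (event or "message", "\n".join(data))
            event, data = None, []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # drop the single optional space after the colon
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)
    if data:
        # Lenient choice: emit a final event even without a trailing blank line.
        yield (event or "message", "\n".join(data))


# Hypothetical stream contents; the payload fields are invented for the demo.
sample = [
    "event: chunk",
    'data: {"type": "chunk", "done": false}',
    "",
    ": keep-alive",
    "event: done",
    'data: {"type": "done", "done": true}',
    "",
]

events = [(name, json.loads(data)) for name, data in iter_sse_events(sample)]
for name, body in events:
    print(name, body)
```

With the `requests` library, the input lines could come from `(b.decode("utf-8") for b in response.iter_lines())` on a response opened with `stream=True`.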

import requests

url = "https://api.cartesia.ai/tts/sse"

payload = {
    "model_id": "sonic-2",
    "transcript": "Hello, world!",
    "voice": {
        "mode": "id",
        "id": "694f9389-aac1-45b6-b726-9d9369183238",
    },
    "output_format": {
        "container": "raw",
        "encoding": "pcm_f32le",
        "sample_rate": 44100,
    },
    "language": "en",
}
headers = {
    "Cartesia-Version": "2025-04-16",
    "Authorization": "Bearer <token>",
    "Content-Type": "application/json",
}

# The endpoint streams server-sent events, so read the body incrementally
# instead of parsing it as a single JSON document.
with requests.post(url, json=payload, headers=headers, stream=True) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        if line:  # skip the blank lines that separate events
            print(line.decode("utf-8"))
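The sample request asks for raw `pcm_f32le` audio at 44100 Hz, that is, 32-bit little-endian floats with four bytes per sample. The sketch below shows one way to turn an audio payload into samples and estimate its duration. It assumes the audio arrives base64-encoded inside each chunk event and that the stream is mono; neither detail is stated on this page, and `decode_pcm_f32le` and `duration_seconds` are hypothetical helpers.

```python
import base64
import struct


def decode_pcm_f32le(b64_audio):
    """Decode base64 text into a list of float32 samples (little-endian)."""
    raw = base64.b64decode(b64_audio)
    usable = len(raw) - len(raw) % 4  # ignore a trailing partial sample
    count = usable // 4
    return list(struct.unpack(f"<{count}f", raw[:usable]))


def duration_seconds(num_samples, sample_rate=44100, channels=1):
    """Playback time for a sample count; mono is an assumption."""
    return num_samples / (sample_rate * channels)


# Round-trip demo with synthetic samples standing in for a real chunk.
samples = [0.0, 0.5, -1.0, 0.25]
encoded = base64.b64encode(struct.pack("<4f", *samples)).decode("ascii")

decoded = decode_pcm_f32le(encoded)
print(decoded)
print(duration_seconds(44100))
```

A real client would also need to carry leftover bytes across chunks if a chunk boundary can split a sample; this sketch simply drops them.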
