Stream Speech (Server-Sent Events)

POST
Generate audio from a transcript using a given voice and model. The audio is streamed out as Server-Sent Events.

Headers

Auth
X-API-Key (string, required)
Cartesia-Version (string, required)
The version of the Cartesia API to use.
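
As a minimal sketch, the two required headers could be assembled like this. The helper name `build_headers` is hypothetical, and the `Content-Type` and `Accept` values are assumptions based on the endpoint taking a JSON object and replying with Server-Sent Events; this page only confirms `X-API-Key` and `Cartesia-Version`.

```python
def build_headers(api_key: str, api_version: str) -> dict[str, str]:
    """Return the HTTP headers for a streaming speech request.

    `api_key` and `api_version` are supplied by the caller; this page does
    not list valid version strings, so none is hard-coded here.
    """
    if not api_key or not api_version:
        raise ValueError("both X-API-Key and Cartesia-Version are required")
    return {
        "X-API-Key": api_key,
        "Cartesia-Version": api_version,
        # Assumption: the request body is sent as JSON.
        "Content-Type": "application/json",
        # Assumption: standard media type for Server-Sent Events.
        "Accept": "text/event-stream",
    }
```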

Request

This endpoint expects an object.
model_id (string, required)
transcript (string, required)
A transcript for the generation. Should not be empty and should not consist only of punctuation.
voice (object, required)

The voice to use for the speech. Can be either an ID or an embedding, specified by the mode field.

output_format (object, required)
duration (integer, optional)
The maximum duration of the audio in seconds.
language (enum, optional)

Language of the generation. Options are: en (English), de (German), es (Spanish), fr (French), ja (Japanese), pt (Portuguese), zh (Chinese), hi (Hindi), it (Italian), ko (Korean), nl (Dutch), pl (Polish), ru (Russian), sv (Swedish), tr (Turkish).

add_timestamps (boolean, optional)

Whether to add timestamps to the audio. This is only supported on tts/sse and WebSocket endpoints.
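
The request fields above can be sketched as a small builder that applies the documented constraints before anything is sent. The function name `build_request` is hypothetical. The keys inside `voice` other than `mode`, and every key inside `output_format`, are not specified on this page, so the caller passes both objects through unchanged.

```python
# Language codes listed for the `language` field.
SUPPORTED_LANGUAGES = {
    "en", "de", "es", "fr", "ja", "pt", "zh", "hi",
    "it", "ko", "nl", "pl", "ru", "sv", "tr",
}


def build_request(model_id, transcript, voice, output_format, *,
                  duration=None, language=None, add_timestamps=None):
    """Build the JSON body for a streaming speech request."""
    # The transcript should not be empty or only punctuation, so require
    # at least one letter or digit (this also covers non-Latin scripts).
    if not any(ch.isalnum() for ch in transcript):
        raise ValueError("transcript must contain more than punctuation")
    if language is not None and language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")

    body = {
        "model_id": model_id,
        "transcript": transcript,
        "voice": voice,
        "output_format": output_format,
    }
    # Optional fields are included only when the caller sets them.
    optional = {
        "duration": duration,
        "language": language,
        "add_timestamps": add_timestamps,
    }
    body.update({k: v for k, v in optional.items() if v is not None})
    return body
```

A call might look like `build_request("<model-id>", "Hello, world!", {"mode": "id", "id": "<voice-id>"}, output_format, language="en")`, where the `id` key and the contents of `output_format` are placeholders to be checked against the full schema.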

Response

This endpoint returns a stream of objects. Each object is one of the following:
JSON Chunk Response (object)
OR
JSON Done Response (object)
OR
JSON Timestamp Response (object)
OR
JSON Error Response (object)
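
One way to consume the stream is to parse the Server-Sent Events framing first and then branch on the kind of object received. The sketch below follows the generic SSE rules (`data:` lines accumulate, a blank line ends an event, lines starting with `:` are comments). The response field names are assumptions, since this page does not list them: a `type` field with values `chunk`, `done`, `timestamps`, and `error`, and base64-encoded audio in `data`. Both function names are hypothetical.

```python
import base64
import json
from typing import Iterable, Iterator


def iter_sse_json(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of each complete SSE event."""
    data_parts: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_parts:
                yield json.loads("\n".join(data_parts))
                data_parts = []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is not part of the value
        if field == "data":
            data_parts.append(value)
    # An event not followed by a blank line is incomplete and is dropped.


def collect_audio(events: Iterable[dict]) -> tuple[bytes, list[dict]]:
    """Concatenate audio chunks and gather timestamp events until done."""
    audio = bytearray()
    timestamps = []
    for event in events:
        kind = event.get("type")  # assumed discriminator field
        if kind == "chunk":
            audio.extend(base64.b64decode(event["data"]))  # assumed encoding
        elif kind == "timestamps":
            timestamps.append(event)
        elif kind == "error":
            raise RuntimeError(f"stream reported an error: {event}")
        elif kind == "done":
            break
    return bytes(audio), timestamps
```

The `lines` argument can be any iterable of text lines, for example the line iterator of an HTTP client's streaming response, so the parsing logic can be exercised without a network connection.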