Stream Speech (Bytes)

POST
Generate audio from a transcript using a given voice and model. The audio is streamed out as raw bytes.

Headers

Auth
X-API-Key string Required
Cartesia-Version string Required
The version of the Cartesia API to use.
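
As a rough illustration of the two required headers, the sketch below assembles them into a dictionary. The helper name `build_headers` is hypothetical, and the `Content-Type: application/json` entry is an assumption based on the endpoint expecting an object; the source does not state it. The version string is left to the caller because its format is not given here.

```python
def build_headers(api_key: str, api_version: str) -> dict:
    """Return the headers this page lists as required.

    Hypothetical helper: only the two header names come from the reference.
    """
    if not api_key:
        raise ValueError("X-API-Key is required")
    if not api_version:
        raise ValueError("Cartesia-Version is required")
    return {
        "X-API-Key": api_key,
        "Cartesia-Version": api_version,
        # Assumption: the request object is sent as JSON.
        "Content-Type": "application/json",
    }
```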

Request

This endpoint expects an object.
model_id string Required
transcript string Required
A transcript for the generation. It should not be empty or consist only of punctuation.
voice object Required

The voice to use for the speech. Can be either an ID or an embedding, specified by the mode field.

output_format object Required
duration integer Optional
The maximum duration of the audio in seconds.
language enum Optional

Language of the generation. Options are: en (English), de (German), es (Spanish), fr (French), ja (Japanese), pt (Portuguese), zh (Chinese), hi (Hindi), it (Italian), ko (Korean), nl (Dutch), pl (Polish), ru (Russian), sv (Swedish), tr (Turkish).
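
The fields above could be assembled and checked client-side along the following lines. This is a minimal sketch, not the official client: `build_body` and `has_speakable_text` are hypothetical names, the contents of `voice` and `output_format` are passed through untouched because this page does not list their inner fields, and the punctuation check (at least one letter or digit) is only one possible reading of the transcript rule.

```python
from typing import Optional

# Language codes copied from the list above.
SUPPORTED_LANGUAGES = {
    "en", "de", "es", "fr", "ja", "pt", "zh", "hi",
    "it", "ko", "nl", "pl", "ru", "sv", "tr",
}


def has_speakable_text(transcript: str) -> bool:
    """True if the transcript contains at least one letter or digit."""
    return any(ch.isalnum() for ch in transcript)


def build_body(
    model_id: str,
    transcript: str,
    voice: dict,
    output_format: dict,
    duration: Optional[int] = None,
    language: Optional[str] = None,
) -> dict:
    """Build the request object; optional fields are omitted when unset."""
    if not has_speakable_text(transcript):
        raise ValueError("transcript is empty or only punctuation")
    if language is not None and language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    body = {
        "model_id": model_id,
        "transcript": transcript,
        "voice": voice,                  # inner fields not specified on this page
        "output_format": output_format,  # inner fields not specified on this page
    }
    if duration is not None:
        body["duration"] = duration
    if language is not None:
        body["language"] = language
    return body
```

Validating before sending avoids a round trip for requests that would be rejected anyway; the server remains the authority on what is accepted.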

Response

This endpoint returns a file.
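
Because the audio arrives as a stream of raw bytes, a client would typically write it out in chunks instead of buffering the whole response. The sketch below uses only the Python standard library. The function names are hypothetical, and the endpoint URL is taken as a parameter because this page does not give the path; the headers and body are assumed to be built as described in the sections above.

```python
import json
import urllib.request
from typing import BinaryIO


def copy_stream(source: BinaryIO, sink: BinaryIO, chunk_size: int = 4096) -> int:
    """Copy bytes from source to sink in chunks; return the byte count."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # empty bytes means the stream has ended
            break
        sink.write(chunk)
        total += len(chunk)
    return total


def stream_speech_to_file(url: str, headers: dict, body: dict, out_path: str) -> int:
    """POST the request object and stream the audio bytes to out_path.

    `url` is supplied by the caller; it is not specified on this page.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out_file:
        return copy_stream(response, out_file)
```

The bytes written are raw audio in whatever encoding `output_format` requested, so the resulting file may need a container or explicit format settings before a player can open it.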