Text to Speech (SSE)

Headers

X-API-Key string Required
Cartesia-Version "2024-11-13" Required
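
As a minimal sketch, the two required headers can be assembled in Python as follows. The helper name `build_headers` is hypothetical and not part of any SDK; only the header names and the version string come from this page.

```python
from typing import Dict


def build_headers(api_key: str, version: str = "2024-11-13") -> Dict[str, str]:
    """Return the two required headers listed above.

    `build_headers` is a hypothetical helper for illustration only.
    """
    if not api_key:
        raise ValueError("X-API-Key is required")
    return {
        "X-API-Key": api_key,
        "Cartesia-Version": version,
    }
```

The resulting dictionary can be passed to whichever HTTP client you use.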

Request

This endpoint expects an object.
model_id string Required

The ID of the model to use for the generation. See Models for available models.

transcript string Required
voice object Required
output_format object Required
language enum Optional

The language that the given voice should speak the transcript in.

Options: English (en), French (fr), German (de), Spanish (es), Portuguese (pt), Chinese (zh), Japanese (ja), Hindi (hi), Italian (it), Korean (ko), Dutch (nl), Polish (pl), Russian (ru), Swedish (sv), Turkish (tr).

duration double Optional

The maximum duration of the audio in seconds. You do not usually need to specify this. If the duration is not appropriate for the length of the transcript, the output audio may be truncated.

text_cfg double Optional

The text classifier-free guidance value for the request.

Higher values cause the model to attend more closely to the text but speed up the generation. Lower values reduce the speaking rate but can increase the risk of hallucinations. The default value is 3.0. For a slower speaking rate, we recommend values between 2.0 and 3.0. Supported values range from 1.5 to 3.0.

This parameter is only supported for sonic-2 models.
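
The request fields above can be combined into a JSON body along the following lines. This is a sketch under stated assumptions: `build_tts_request` is a hypothetical helper, and the contents of `voice` and `output_format` are not described in this section, so they are passed through unchanged as opaque objects. The language codes and the `text_cfg` range are taken from the descriptions above.

```python
from typing import Any, Dict, Optional

# Language codes listed in the `language` description above.
SUPPORTED_LANGUAGES = {
    "en", "fr", "de", "es", "pt", "zh", "ja", "hi",
    "it", "ko", "nl", "pl", "ru", "sv", "tr",
}

# Supported range for `text_cfg` as stated above.
TEXT_CFG_MIN, TEXT_CFG_MAX = 1.5, 3.0


def build_tts_request(
    model_id: str,
    transcript: str,
    voice: Dict[str, Any],
    output_format: Dict[str, Any],
    language: Optional[str] = None,
    duration: Optional[float] = None,
    text_cfg: Optional[float] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble the request object for this endpoint."""
    body: Dict[str, Any] = {
        "model_id": model_id,
        "transcript": transcript,
        "voice": voice,                  # structure not covered in this section
        "output_format": output_format,  # structure not covered in this section
    }
    if language is not None:
        if language not in SUPPORTED_LANGUAGES:
            raise ValueError(f"unsupported language: {language!r}")
        body["language"] = language
    if duration is not None:
        body["duration"] = duration
    if text_cfg is not None:
        # Per the note above, text_cfg only applies to sonic-2 models;
        # this sketch checks the numeric range but not the model.
        if not TEXT_CFG_MIN <= text_cfg <= TEXT_CFG_MAX:
            raise ValueError("text_cfg must be between 1.5 and 3.0")
        body["text_cfg"] = text_cfg
    return body
```

Optional fields are left out of the body when not set, so the documented defaults (such as a `text_cfg` of 3.0) apply.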

Response

This endpoint returns a stream of objects, each of which is one of the following:
chunk object
OR
flush_done object
OR
done object
OR
timestamps object
OR
error object
OR
phoneme_timestamps object
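
A client has to split the stream into events and tell the object kinds apart. The sketch below parses generic Server-Sent Events framing (`data:` lines terminated by a blank line) and decodes each payload as JSON. It assumes, without confirmation from this page, that each payload is a JSON object with a `type` field naming one of the kinds listed above. The function names and the sample payload fields are hypothetical.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List

# Object kinds listed in the Response section above.
KNOWN_TYPES = {
    "chunk", "flush_done", "done",
    "timestamps", "error", "phoneme_timestamps",
}


def iter_sse_events(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield the JSON payload of each SSE event found in `lines`."""
    data_lines: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if data_lines:
                yield json.loads("\n".join(data_lines))
                data_lines = []
            continue
        if line.startswith(":"):
            continue  # SSE comment line, e.g. a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # SSE strips one leading space from the value
        if field == "data":
            data_lines.append(value)
    # An event not terminated by a blank line is discarded, as in the SSE spec.


def classify(event: Dict[str, Any]) -> str:
    """Return the event kind, assuming a `type` field (an assumption)."""
    kind = event.get("type")
    return kind if kind in KNOWN_TYPES else "unknown"


# Usage with made-up sample lines; real payload fields may differ.
sample = [
    'data: {"type": "chunk", "data": "AAAA"}\n',
    "\n",
    ": keep-alive\n",
    'data: {"type": "done"}\n',
    "\n",
]
kinds = [classify(event) for event in iter_sse_events(sample)]
```

In practice `lines` would come from the decoded lines of a streaming HTTP response. A client would typically stop reading on `done` and surface `error` objects to the caller.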