Generate audio from a transcript using a given voice and model. The audio is streamed out as Server-Sent Events.
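Because the response is a Server-Sent Events stream, a client has to split it into events before it can use the payloads. The following is a minimal sketch of a generic SSE line parser based on the standard event-stream format. The function name `parse_sse` is hypothetical, and nothing here assumes particular event names or payload fields for this endpoint.

```python
from typing import Dict, Iterable, Iterator, List


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per Server-Sent Event from an iterable of text lines.

    Hypothetical helper: it handles only the `event` and `data` fields of
    the event-stream format and ignores `id` and `retry`.
    """
    event_type = "message"  # default event type in the SSE format
    data_parts: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_parts:
                yield {"event": event_type, "data": "\n".join(data_parts)}
            event_type, data_parts = "message", []
            continue
        if line.startswith(":"):
            # Comment line, often used as a keep-alive.
            continue
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is stripped
        if field == "event":
            event_type = value
        elif field == "data":
            data_parts.append(value)
```

The parser accepts any iterable of decoded text lines, so it can be fed from whichever HTTP client is used to read the streamed response.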
The version of the Cartesia API to use.
A transcript for the generation. Should not be empty and should not be only punctuation.
The voice to use for the speech. Can be either an ID or an embedding, specified by the mode field.
The maximum duration of the audio in seconds.
Language of the generation. Options are: en (English), de (German), es (Spanish), fr (French), ja (Japanese), pt (Portuguese), zh (Chinese), hi (Hindi), it (Italian), ko (Korean), nl (Dutch), pl (Polish), ru (Russian), sv (Swedish), tr (Turkish).
Whether to add timestamps to the audio. This is only supported on tts/sse and WebSocket endpoints.
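The transcript and language constraints described above can be checked on the client before a request is sent. The sketch below is one possible interpretation: the helper name `validate_request` is hypothetical, and "only punctuation" is taken to mean that every non-whitespace character falls in a Unicode punctuation category, which may differ from the check the service actually performs.

```python
import unicodedata

# Language codes listed in the parameter description above.
SUPPORTED_LANGUAGES = {
    "en", "de", "es", "fr", "ja", "pt", "zh", "hi",
    "it", "ko", "nl", "pl", "ru", "sv", "tr",
}


def validate_request(transcript: str, language: str) -> None:
    """Raise ValueError if the transcript or language looks invalid.

    Hypothetical client-side check; the service may apply different rules.
    """
    # The transcript needs at least one character that is neither
    # whitespace nor Unicode punctuation (categories starting with "P").
    has_content = any(
        not ch.isspace() and not unicodedata.category(ch).startswith("P")
        for ch in transcript
    )
    if not has_content:
        raise ValueError("transcript is empty or contains only punctuation")
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
```

Calling `validate_request("Hello, world!", "en")` returns without error, while a transcript such as `"..."` or a language code outside the listed set raises `ValueError`.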