Stream Speech (WebSocket)
Connect to Cartesia over a WebSocket and generate speech from a transcript using a given voice and model. The audio is streamed out as Base64-encoded raw bytes.
You can try out WebSockets using wscat. If you have Node installed, just run `npm install -g wscat`.
`GET /tts/websocket?api_key=<YOUR_API_KEY>&cartesia_version=<API_VERSION>`
Initiate a bidirectional WebSocket connection. The connection supports multiplexing, so you can send multiple requests and receive the corresponding responses in parallel. The connection times out 5 minutes after the last message you send.
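As a sketch, the connection URL from the `GET` line above can be assembled like this (the `api.cartesia.ai` host is an assumption; substitute the host for your account if it differs):

```python
from urllib.parse import urlencode

def tts_ws_url(api_key: str, cartesia_version: str) -> str:
    """Build the WebSocket URL, passing auth as query parameters per the
    GET line above. The api.cartesia.ai host is an assumption."""
    query = urlencode({"api_key": api_key, "cartesia_version": cartesia_version})
    return f"wss://api.cartesia.ai/tts/websocket?{query}"

# A client library such as `websockets` could then open the connection:
#   async with websockets.connect(tts_ws_url(key, version)) as ws:
#       ...
```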
WebSocket Request
Send a JSON-encoded message on the WebSocket. The schema of the message should be identical to the Server-Sent Events request body, except that you must additionally specify a `context_id` field containing a unique identifier for the request. (You can use a UUIDv4 or a human-readable ID.)
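A minimal sketch of one request frame. The `context_id` field is required as described above; the other field names are illustrative assumptions, so follow the Server-Sent Events request schema for the exact body:

```python
import json
import uuid

# One request frame: the SSE request body plus the required context_id.
# Field names other than context_id are illustrative assumptions.
request = {
    "context_id": str(uuid.uuid4()),  # or any unique human-readable ID
    "model_id": "<MODEL_ID>",
    "transcript": "Hello from the WebSocket endpoint!",
    "voice": {"mode": "id", "id": "<VOICE_ID>"},
    "output_format": {
        "container": "raw",
        "encoding": "pcm_f32le",
        "sample_rate": 44100,
    },
}
frame = json.dumps(request)  # send this as a text frame on the socket
```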
You may also cancel outgoing requests through the WebSocket. Cancellation only halts requests that have not yet begun generating a response.
WebSocket Responses
After you send a message body on the WebSocket, the API will respond with a series of JSON chunks with the same schema as the data in Server-Sent Events responses.
If `add_timestamps` is set to `true`, we will also return messages of the following form in addition to the audio chunks and the done message:
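A sketch of what such a timestamps message might look like. The field names here (`type`, `word_timestamps`, `start`, `end`) are assumptions, not confirmed by this page; the three arrays are aligned index-for-index:

```python
# Assumed shape of a timestamps message: per-word start/end times in
# seconds, aligned index-for-index with the words list.
timestamps_message = {
    "type": "timestamps",
    "context_id": "my-context",
    "done": False,
    "word_timestamps": {
        "words": ["Hello", "world"],
        "start": [0.0, 0.42],
        "end": [0.38, 0.81],
    },
}
```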
Input Streaming with Contexts
In many real-time use cases, you don't have your transcripts available upfront, like when you're generating them with an LLM. For these cases, Sonic supports input streaming.
The context IDs you pass to the Cartesia API identify speech contexts. Contexts maintain prosody between their inputs—so you can send a transcript in multiple parts and receive seamless speech in return.
To stream in inputs on a context, just pass a `continue` flag (set to `true`) for every input that you expect will be followed by more inputs. (By default, this flag is set to `false`.)
To finish a context, set `continue` to `false`. If you do not know the last transcript in advance, you can send an input with an empty transcript and `continue` set to `false`.
`continue` (boolean): whether this input may be followed by more inputs.
Input Format
- Inputs on the same context must keep all fields except `transcript`, `continue`, and `duration` the same.
- Transcripts are concatenated verbatim, so make sure they form a valid transcript when joined together. Include any spaces between words or punctuation as necessary. For example, in languages with spaces, include a space at the end of the preceding transcript: transcript 1 is `Thanks for coming, ` (with a trailing space) and transcript 2 is `it was great to see you.`
- For best performance, buffer the first transcript on a context to at least 3 or 4 words.
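A quick local sanity check for the concatenation rule above (the joining itself happens server-side; this is just a sketch):

```python
# Transcripts on one context are concatenated verbatim by the server,
# so include trailing spaces yourself in the earlier parts.
parts = ["Thanks for coming, ", "it was great to see you."]
joined = "".join(parts)  # exactly the text the server would synthesize
# -> "Thanks for coming, it was great to see you."
```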
Example
Let’s say you’re trying to generate speech for “Hello, Sonic! I’m streaming inputs.” You should stream in the following inputs (repeated fields omitted for brevity). Note: all other fields (e.g. `model_id`, `language`) are required and should be passed unchanged between requests with input streaming.
If you don’t know the last transcript in advance, you can send an input with an empty transcript and `continue` set to `false`:
Output
You will only receive `done: true` after outputs for the entire context have been returned.
Outputs for a given context will always be in order of the inputs you streamed in. (That is, if you send input A and then input B on a context, you will first receive the chunks corresponding to input A, and then the chunks corresponding to input B.)
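A sketch of a client-side receive loop that relies on this ordering guarantee. The response field names used here (`data`, `done`) are assumptions modeled on the Server-Sent Events response schema:

```python
import json

def collect_audio(raw_messages, context_id):
    """Gather Base64 audio chunks for one context until done: true."""
    chunks = []
    for raw in raw_messages:
        msg = json.loads(raw)
        if msg.get("context_id") != context_id:
            continue  # multiplexing: frames for other contexts share the socket
        if msg.get("done"):
            break  # the whole context has been returned; stop reading for it
        if "data" in msg:
            chunks.append(msg["data"])  # Base64-encoded raw audio bytes
    return chunks
```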
Cancelling Requests
You may also cancel outgoing requests through the WebSocket.
To cancel a request, send a JSON message with the following structure:
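A sketch of such a message; the `cancel` flag name is an assumption based on this description:

```python
import json

# Assumed cancel message shape: the context_id of the request to cancel
# plus a cancel flag. Send it as a text frame on the same socket.
cancel_message = {"context_id": "my-context-id", "cancel": True}
cancel_frame = json.dumps(cancel_message)
```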
When you send a cancel request:
- It will only halt requests that have not begun generating a response yet.
- Any currently generating request will continue sending responses until completion.
The `context_id` in the cancel request should match the `context_id` of the request you want to cancel.