Stream Speech (WebSocket)

Connect to Cartesia over a WebSocket and generate speech from a transcript using a given voice and model. The audio is streamed out as Base64-encoded raw bytes.

You can try out WebSockets using wscat. If you have Node installed, just run:

In Your Shell
npx wscat -c "wss://api.cartesia.ai/tts/websocket?api_key=<YOUR_API_KEY>&cartesia_version=2024-06-10"

GET /tts/websocket?api_key=<YOUR_API_KEY>&cartesia_version=<API_VERSION>

Initiate a bidirectional WebSocket connection. The connection supports multiplexing, so you can send multiple requests and receive the corresponding responses in parallel. The connection times out 5 minutes after the last message you send.
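
As a minimal sketch of how the connection URL above is assembled, the following Python helper builds it from an API key and a version string. The helper name build_ws_url and the BASE_URL constant are illustrative, not part of any Cartesia SDK; the default version is simply the one used in the wscat example.

```python
from urllib.parse import urlencode

# Endpoint taken from the wscat example above.
BASE_URL = "wss://api.cartesia.ai/tts/websocket"


def build_ws_url(api_key: str, cartesia_version: str = "2024-06-10") -> str:
    """Return the WebSocket URL with api_key and cartesia_version as query parameters.

    Hypothetical helper for illustration only.
    """
    query = urlencode({"api_key": api_key, "cartesia_version": cartesia_version})
    return f"{BASE_URL}?{query}"
```

The resulting string can be passed to any WebSocket client. urlencode also percent-encodes any characters in the key that are not URL-safe.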

WebSocket Request

Send a JSON-encoded message on the WebSocket. The message schema is identical to the Server-Sent Events request body, except that you must also specify a context_id field containing a unique identifier for the request. (You can use a UUIDv4 or a human ID.)

WebSocket Request
{
  "context_id": "happy-monkeys-fly",
  "model_id": "sonic-english",
  "transcript": "Hello, world! I'm generating audio on Cartesia.",
  "duration": 180,
  "voice": {
    "mode": "id",
    "id": "a0e99841-438c-4a64-b679-ae501e7d6091",
    "__experimental_controls": {
      "speed": "normal",
      "emotion": ["positivity:highest", "curiosity"]
    }
  },
  "output_format": {
    "container": "raw",
    "encoding": "pcm_s16le",
    "sample_rate": 8000
  },
  "language": "en",
  "add_timestamps": false
}
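
A request like the one above could be built programmatically along these lines. This is a sketch, not an official client: build_request is a hypothetical helper, it copies only a subset of the fields from the example (duration and __experimental_controls are left out on the assumption that they are optional), and it generates a UUIDv4 when no context_id is supplied.

```python
import json
import uuid
from typing import Optional


def build_request(transcript: str, voice_id: str, context_id: Optional[str] = None) -> str:
    """Serialize one TTS request; hypothetical helper mirroring the example above."""
    message = {
        # Every WebSocket request needs a unique context_id.
        "context_id": context_id or str(uuid.uuid4()),
        "model_id": "sonic-english",
        "transcript": transcript,
        "voice": {"mode": "id", "id": voice_id},
        "output_format": {
            "container": "raw",
            "encoding": "pcm_s16le",
            "sample_rate": 8000,
        },
        "language": "en",
        "add_timestamps": False,
    }
    return json.dumps(message)
```

The returned string is what you would send as a single text frame on the WebSocket.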

You may also cancel outgoing requests through the WebSocket. This only halts requests that have not yet begun generating a response.

WebSocket Request
{
  "context_id": "happy-monkeys-fly",
  "cancel": true
}

WebSocket Responses

After you send a message body on the WebSocket, the API will respond with a series of JSON chunks with the same schema as the data in Server-Sent Events responses.

WebSocket Response
{
  "status_code": 206,
  "done": false,
  "type": "chunk",
  "data": "aSDinaTvuI8gbWludGxpZnk=",
  "step_time": 123,
  "context_id": "happy-monkeys-fly"
}
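
Because the connection is multiplexed, chunks for different contexts can arrive interleaved. The sketch below shows one way to decode the Base64 data field and group the audio by context_id. AudioCollector is a hypothetical class, and it assumes that the final message for a context carries "done": true together with its context_id; the exact shape of that message is not shown in this section.

```python
import base64
import json
from collections import defaultdict


class AudioCollector:
    """Accumulate decoded audio bytes per context_id (illustrative only)."""

    def __init__(self) -> None:
        self.audio = defaultdict(bytearray)  # context_id -> raw audio bytes
        self.finished = set()                # context_ids that reported done

    def feed(self, raw_message: str) -> None:
        msg = json.loads(raw_message)
        context_id = msg.get("context_id")
        if msg.get("type") == "chunk":
            # "data" holds Base64-encoded raw audio in the requested output_format.
            self.audio[context_id] += base64.b64decode(msg["data"])
        if msg.get("done"):
            # Assumption: the done message includes the context_id.
            self.finished.add(context_id)
```

Call feed once for every text frame received; when a context ID appears in finished, its entry in audio holds the complete audio for that context.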

If add_timestamps is set to true, the API also returns messages of the following form, in addition to the audio chunks and the done message:

WebSocket Response
{
  "status_code": 206,
  "done": false,
  "context_id": "happy-monkeys-fly",
  "type": "timestamps",
  "word_timestamps": {
    "words": ["Hello"],
    "start": [0.0],
    "end": [1.0]
  }
}
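
The words, start, and end arrays in a timestamps message are parallel, so they can be zipped into per-word tuples. A minimal sketch, where word_timings is a hypothetical helper and the unit of start and end (seconds, judging by the example values) is an assumption:

```python
import json
from typing import List, Tuple


def word_timings(raw_message: str) -> List[Tuple[str, float, float]]:
    """Return (word, start, end) tuples from a timestamps message, else an empty list."""
    msg = json.loads(raw_message)
    if msg.get("type") != "timestamps":
        return []
    ts = msg["word_timestamps"]
    # The three arrays are index-aligned: words[i] spans start[i]..end[i].
    return list(zip(ts["words"], ts["start"], ts["end"]))
```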

Input Streaming with Contexts

In many real-time use cases, you don’t have your transcripts available upfront, for example when you’re generating them using an LLM. For these cases, Sonic supports input streaming.

The context IDs you pass to the Cartesia API identify speech contexts. Contexts maintain prosody between their inputs, so you can send a transcript in multiple parts and receive seamless speech in return.

To stream in inputs on a context, just pass a continue flag (set to true) for every input that you expect will be followed by more inputs. (By default, this flag is set to false.)

To finish a context, just set continue to false. If you do not know the last transcript in advance, you can send an input with an empty transcript and continue set to false.

Contexts automatically expire 5 seconds after the last input was streamed in. Sending another input on an expired context ID implicitly creates a new context.

continue (boolean)

Whether this input may be followed by more inputs. Defaults to false.

Input Format

  1. Inputs on the same context must keep all fields except transcript, continue, and duration the same.
  2. Transcripts are concatenated verbatim, so make sure they form a valid transcript when joined together. Include any spaces or punctuation needed between words. For example, in languages with spaces, you should include a space at the end of the preceding transcript, e.g. transcript 1 is "Thanks for coming, " and transcript 2 is "it was great to see you."
  3. For best performance, buffer the first request's transcript until it contains at least 3 or 4 words.
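
The buffering advice in the last rule can be sketched as a small generator that merges incoming text pieces (for example, LLM tokens) until the first input has enough words, then passes the rest through unchanged. buffer_first_input and its min_words parameter are hypothetical names, and counting words by splitting on whitespace is a simplifying assumption that suits languages with spaces.

```python
from typing import Iterable, Iterator


def buffer_first_input(pieces: Iterable[str], min_words: int = 3) -> Iterator[str]:
    """Yield transcripts whose first element has at least min_words words."""
    buffer = ""
    first_sent = False
    for piece in pieces:
        if first_sent:
            # After the first input, forward pieces verbatim.
            yield piece
            continue
        buffer += piece  # Concatenate verbatim so spacing is preserved.
        if len(buffer.split()) >= min_words:
            yield buffer
            first_sent = True
    if not first_sent and buffer:
        # The stream ended before reaching min_words; send what we have.
        yield buffer
```

Because pieces are only ever concatenated, joining the yielded transcripts reproduces the original text exactly, which keeps the output consistent with the concatenation rule.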

Example

Let’s say you’re trying to generate speech for “Hello, Sonic! I’m streaming inputs.” You should stream in the following inputs (repeated fields are omitted here for brevity). Note: all other fields (e.g. model_id, language) are still required in every input and must be passed unchanged between requests on the same context.

Input Streaming
{"transcript": "Hello, Sonic!", "continue": true, "context_id": "happy-monkeys-fly"}
{"transcript": " I'm streaming ", "continue": true, "context_id": "happy-monkeys-fly"}
{"transcript": "inputs.", "continue": false, "context_id": "happy-monkeys-fly"}

If you don’t know the last transcript in advance, you can send an input with an empty transcript and continue set to false:

Input Streaming
{"transcript": "Hello, Sonic!", "continue": true, "context_id": "happy-monkeys-fly"}
{"transcript": " I'm streaming ", "continue": true, "context_id": "happy-monkeys-fly"}
{"transcript": "inputs.", "continue": true, "context_id": "happy-monkeys-fly"}
{"transcript": "", "continue": false, "context_id": "happy-monkeys-fly"}
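
When the number of transcript pieces is not known ahead of time, the pattern above can be generated mechanically: every piece is sent with continue set to true, and an empty transcript with continue set to false closes the context. The sketch below assumes a hypothetical streaming_inputs helper; shared_fields stands in for the fields (model_id, voice, output_format, and so on) that must stay identical across inputs.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def streaming_inputs(
    pieces: Iterable[str], context_id: str, shared_fields: Dict[str, Any]
) -> Iterator[str]:
    """Yield JSON messages for one context, ending with an empty closing input."""
    for piece in pieces:
        yield json.dumps(
            {**shared_fields, "context_id": context_id, "transcript": piece, "continue": True}
        )
    # Empty transcript with continue=false finishes the context.
    yield json.dumps(
        {**shared_fields, "context_id": context_id, "transcript": "", "continue": False}
    )
```

Each yielded string would be sent as its own WebSocket message, in order.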

Output

You will only receive done: true after outputs for the entire context have been returned.

Outputs for a given context will always be in order of the inputs you streamed in. (That is, if you send input A and then input B on a context, you will first receive the chunks corresponding to input A, and then the chunks corresponding to input B.)

Cancelling Requests

You may also cancel outgoing requests through the WebSocket.

To cancel a request, send a JSON message with the following structure:

WebSocket Request
{
  "context_id": "happy-monkeys-fly",
  "cancel": true
}

When you send a cancel request:

  1. It will only halt requests that have not begun generating a response yet.
  2. Any currently generating request will continue sending responses until completion.

The context_id in the cancel request should match the context_id of the request you want to cancel.