Text to Speech (WebSocket)

This endpoint creates a bidirectional WebSocket connection. The connection supports multiplexing, so you can send multiple requests and receive the corresponding responses in parallel.

The WebSocket API is built around contexts:

  • When you send a generation request, you pass a context_id. Further inputs on the same context_id will continue the generation, maintaining prosody.
  • Responses for a context contain the context_id you passed in so that you can match requests and responses.

Read the guide on working with contexts to learn more.
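
Because responses from several contexts arrive interleaved on one socket, a client typically groups them by context_id. The following is a minimal sketch of that idea, not part of any official SDK: `ContextRouter` is a hypothetical helper, and the sample messages are hand-written dictionaries that assume each response carries `context_id` and `type` fields.

```python
from collections import defaultdict
from typing import Any


class ContextRouter:
    """Hypothetical helper: groups incoming messages by their context_id."""

    def __init__(self) -> None:
        self._by_context: dict[str, list[dict[str, Any]]] = defaultdict(list)

    def route(self, message: dict[str, Any]) -> str:
        # Each response echoes the context_id of the request it answers.
        context_id = message["context_id"]
        self._by_context[context_id].append(message)
        return context_id

    def messages_for(self, context_id: str) -> list[dict[str, Any]]:
        # Return a copy so callers cannot mutate the router's state.
        return list(self._by_context.get(context_id, []))


router = ContextRouter()
# Interleaved responses from two concurrent generations on one socket.
# The message shapes here are assumptions made for illustration.
for msg in [
    {"context_id": "ctx-a", "type": "chunk"},
    {"context_id": "ctx-b", "type": "chunk"},
    {"context_id": "ctx-a", "type": "done"},
]:
    router.route(msg)
```

In a real client, `route` would be called from the socket's receive loop with each decoded message.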

For the best performance, we recommend the following usage pattern:

  1. Run many generations over a single WebSocket, using a separate context for each generation. The WebSocket scales up to dozens of concurrent generations.
  2. Set up the WebSocket before the first generation. This ensures you don’t incur latency when you start generating speech.
  3. Include necessary spaces and punctuation: This allows Sonic to generate speech more accurately and with better prosody.
  4. Use max_buffer_delay_ms to let the model intelligently manage buffering up to the specified maximum delay.

For conversational agent use cases, we recommend the following usage pattern:

  1. Each turn in a conversation should correspond to a context: For example, if you are using Sonic to power a voice agent, each turn in the conversation should be a new context.
  2. Start a new context for interruptions: If the user interrupts the agent, start a new context for the agent’s response.
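
The turn-per-context pattern above can be sketched as a small piece of client-side bookkeeping. `TurnTracker` and the `turn-` ID prefix are hypothetical names chosen for illustration; the only thing taken from this page is that every turn, including the response after an interruption, gets a fresh context_id.

```python
import uuid
from typing import Optional


class TurnTracker:
    """Hypothetical helper: remembers which context is the agent's current turn."""

    def __init__(self) -> None:
        self.active_context_id: Optional[str] = None

    def start_turn(self) -> str:
        # A fresh context for each turn; also called after an interruption.
        self.active_context_id = f"turn-{uuid.uuid4().hex}"
        return self.active_context_id

    def is_current(self, message: dict) -> bool:
        # Messages from a superseded context should not be played back.
        return message.get("context_id") == self.active_context_id


tracker = TurnTracker()
first = tracker.start_turn()   # the agent begins speaking
second = tracker.start_turn()  # the user interrupts; the agent responds in a new context
```

Messages that fail `is_current` can simply be dropped. The superseded context can also be stopped on the server side with a Cancel Context Request.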

Handshake

GET
wss://api.cartesia.ai/tts/websocket

Headers

Cartesia-Version "2025-04-16" Required

Query parameters

cartesia_version string Required

You can specify this instead of the Cartesia-Version header. This is particularly useful in the browser, where the WebSocket API does not support custom headers.

You do not need to specify this if you are passing the header.

api_key string Required

You can specify this instead of the X-API-Key header. This is particularly useful in the browser, where the WebSocket API does not support custom headers.

You do not need to specify this if you are passing the header.
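
As an illustration of the query-parameter form of the handshake, the sketch below builds the connection URL from the endpoint and the two parameters documented above. `build_handshake_url` is a hypothetical helper name, and `YOUR_API_KEY` is a placeholder; opening the socket itself is left to whichever WebSocket client you use.

```python
from urllib.parse import urlencode

WS_ENDPOINT = "wss://api.cartesia.ai/tts/websocket"


def build_handshake_url(api_key: str, cartesia_version: str = "2025-04-16") -> str:
    """Return the WebSocket URL with version and key passed as query parameters."""
    # urlencode percent-escapes any characters that are unsafe in a query string.
    query = urlencode({"cartesia_version": cartesia_version, "api_key": api_key})
    return f"{WS_ENDPOINT}?{query}"


url = build_handshake_url("YOUR_API_KEY")  # placeholder key for illustration
```

Keep in mind that a key placed in a URL can end up in logs and browser history, so this form is best reserved for cases where headers are unavailable.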

Send

Generation Request object Required

Use this to generate speech for a transcript.

OR
Cancel Context Request object Required

Use this to cancel a context, so that no more messages are generated for that context.

Receive

Receive object Required

The server will send you back a stream of messages with the same context_id as your request. The messages can be of type chunk, timestamps, phoneme_timestamps, error, or done.
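
One way to consume this stream for a single context is to fold each message into a small state object keyed on its type. This is a sketch under assumptions: `ContextState` and `apply_message` are hypothetical names, the type is assumed to arrive in a `type` field, and message payloads are stored whole because their fields are not described in this section.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ContextState:
    """Hypothetical accumulator for the messages of one context."""
    chunks: list = field(default_factory=list)
    timestamps: list = field(default_factory=list)
    phoneme_timestamps: list = field(default_factory=list)
    error: Optional[dict] = None
    done: bool = False


def apply_message(state: ContextState, message: dict[str, Any]) -> ContextState:
    """Update state according to the message type (assumed to be in 'type')."""
    kind = message.get("type")
    if kind == "chunk":
        state.chunks.append(message)
    elif kind == "timestamps":
        state.timestamps.append(message)
    elif kind == "phoneme_timestamps":
        state.phoneme_timestamps.append(message)
    elif kind == "error":
        state.error = message
    elif kind == "done":
        state.done = True
    else:
        raise ValueError(f"unexpected message type: {kind!r}")
    return state


state = ContextState()
for msg in [{"type": "chunk"}, {"type": "timestamps"}, {"type": "chunk"}, {"type": "done"}]:
    apply_message(state, msg)
```

Raising on an unknown type makes protocol changes visible early; a more lenient client could log and skip such messages instead.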