Text to Speech (WebSocket)
This endpoint creates a bidirectional WebSocket connection. The connection supports multiplexing, so you can send multiple requests and receive the corresponding responses in parallel.
The WebSocket API is built around contexts:
- When you send a generation request, you pass a context_id. Further inputs on the same context_id will continue the generation, maintaining prosody.
- Responses for a context contain the context_id you passed in so that you can match requests and responses.
Read the guide on working with contexts to learn more.
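Because responses for several contexts can arrive interleaved on one connection, a client usually needs to route each message by its context_id. A minimal sketch, assuming each response is a parsed JSON object; the "type" field name and the "turn-1" / "turn-2" identifiers are illustrative assumptions, not taken from this reference:

```python
from collections import defaultdict
from typing import Any


def group_by_context(messages: list[dict[str, Any]]) -> dict[str, list[dict[str, Any]]]:
    """Group interleaved responses by the context_id each one carries."""
    grouped: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for message in messages:
        grouped[message["context_id"]].append(message)
    return dict(grouped)


# Hypothetical responses for two concurrent generations on one connection.
interleaved = [
    {"context_id": "turn-1", "type": "chunk"},
    {"context_id": "turn-2", "type": "chunk"},
    {"context_id": "turn-1", "type": "done"},
    {"context_id": "turn-2", "type": "done"},
]
by_context = group_by_context(interleaved)
```

In a live client the same routing would typically feed a per-context queue rather than a list, so each generation can be played back as its messages arrive.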
For the best performance, we recommend the following usage pattern:
- Do many generations over a single WebSocket. Just use a separate context for each generation. The WebSocket scales up to dozens of concurrent generations.
- Set up the WebSocket before the first generation. This ensures you don’t incur latency when you start generating speech.
- Buffer the first input on a context until it contains at least 3 or 4 words. This optimizes both latency and prosody.
- Split inputs into sentences: Sending inputs in sentences allows Sonic to generate speech more accurately and with better prosody. Include necessary spaces and punctuation.
For conversational agent use cases, we recommend the following usage pattern:
- Each turn in a conversation should correspond to a context: For example, if you are using Sonic to power a voice agent, each turn in the conversation should be a new context.
- Start a new context for interruptions: If the user interrupts the agent, start a new context for the agent’s response.
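The input-shaping recommendations above can be sketched as a small preprocessing step. This is an illustration only: the helper names are hypothetical, the sentence splitter is deliberately naive (it breaks on every ".", "!" or "?", so "3.5" would be split), and the "turn-" prefix for context IDs is an arbitrary choice.

```python
import re
import uuid


def split_sentences(text: str) -> list[str]:
    """Split after ., ! or ?, keeping the punctuation and the trailing spaces."""
    return re.findall(r"[^.!?]+[.!?]+\s*|[^.!?]+$", text)


def buffer_first_input(sentences: list[str], min_words: int = 3) -> list[str]:
    """Merge leading sentences until the first input has at least min_words words."""
    merged = list(sentences)
    while len(merged) > 1 and len(merged[0].split()) < min_words:
        merged[0:2] = [merged[0] + merged[1]]
    return merged


def new_context_id() -> str:
    """Return a fresh identifier for a new turn or for a reply after an interruption."""
    return f"turn-{uuid.uuid4().hex}"


reply = "Hi. How are you doing today? I can help with that."
inputs = buffer_first_input(split_sentences(reply))
context_id = new_context_id()
```

Here the one-word first sentence is merged with the next one, so the first input sent on the context is "Hi. How are you doing today? " and the second is "I can help with that.".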
Handshake
Headers
Query parameters
You can specify this instead of the Cartesia-Version header. This is particularly useful in the browser, where the WebSocket API does not let you set custom headers.
You do not need to specify this if you are passing the header.
You can specify this instead of the X-API-Key header. This is particularly useful in the browser, where the WebSocket API does not let you set custom headers.
You do not need to specify this if you are passing the header.
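As a rough sketch of passing both values in the query string instead of as headers: the parameter names cartesia_version and api_key, the example.invalid URL, and the placeholder values are all assumptions made for illustration, since this section does not name the parameters. Check them against the parameter list before use.

```python
from urllib.parse import urlencode

# Assumed query parameter names; this section describes them but does not name them.
VERSION_PARAM = "cartesia_version"
API_KEY_PARAM = "api_key"


def build_handshake_url(base_url: str, version: str, api_key: str) -> str:
    """Append the version and API key as URL-encoded query parameters."""
    query = urlencode({VERSION_PARAM: version, API_KEY_PARAM: api_key})
    return f"{base_url}?{query}"


# Placeholder endpoint and values, not real ones.
url = build_handshake_url("wss://example.invalid/tts/websocket", "YYYY-MM-DD", "my-key")
```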
Send
Use this to generate speech for a transcript.
Use this to cancel a context, so that no more messages are generated for that context.
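The two message kinds might be serialized along these lines. This is a sketch under stated assumptions: the field names transcript, continue, and cancel are not given in this section, and a real generation request would carry further fields (such as model and voice settings) that are omitted here.

```python
import json


def generation_request(context_id: str, transcript: str, continue_: bool = False) -> str:
    """Serialize a generation request; field names other than context_id are assumed."""
    return json.dumps(
        {"context_id": context_id, "transcript": transcript, "continue": continue_}
    )


def cancel_request(context_id: str) -> str:
    """Serialize a request to stop generating for one context (assumed shape)."""
    return json.dumps({"context_id": context_id, "cancel": True})


first = generation_request("turn-1", "Hello there, friend. ", continue_=True)
stop = cancel_request("turn-1")
```

Both helpers return JSON text, so the result can be handed directly to whichever WebSocket client is in use.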
Receive
The server will send you back a stream of messages with the same context_id as your request.
The messages can be of type chunk, timestamp, error, or done.
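A receive loop for one context could look roughly like the following. It assumes each message is a parsed JSON object with a "type" field holding one of the four values above; that field name and the sample messages are assumptions, and payload fields are left untouched because this section does not describe them.

```python
from typing import Any, Iterable


def collect_context(messages: Iterable[dict[str, Any]], context_id: str) -> dict[str, list]:
    """Gather chunk and timestamp messages for one context until done arrives."""
    result: dict[str, list] = {"chunks": [], "timestamps": []}
    for message in messages:
        if message.get("context_id") != context_id:
            continue  # belongs to another generation on the same connection
        kind = message.get("type")
        if kind == "chunk":
            result["chunks"].append(message)
        elif kind == "timestamp":
            result["timestamps"].append(message)
        elif kind == "error":
            raise RuntimeError(f"generation failed for context {context_id}: {message}")
        elif kind == "done":
            break  # no further messages are expected for this context
    return result


# Hypothetical stream mixing two contexts.
stream = [
    {"context_id": "turn-1", "type": "chunk"},
    {"context_id": "turn-2", "type": "chunk"},
    {"context_id": "turn-1", "type": "timestamp"},
    {"context_id": "turn-1", "type": "done"},
    {"context_id": "turn-1", "type": "chunk"},
]
received = collect_context(stream, "turn-1")
```

Stopping at done means the trailing chunk in the sample stream is ignored, which matches the idea that done closes out a context.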