Developer Quickstart
Getting started with the API
Generate your first words and learn API conventions.
Optimizing latency for realtime performance
We recommend using the WebSockets API for low-latency applications. The WebSockets endpoint supports a persistent, bidirectional connection that allows you to send and receive messages in realtime.
Using WebSockets is ideal for:
- Ultra Low-latency: Applications that require low-latency, such as conversational agents and interactive voice systems.
- Streaming audio to other services: Streaming audio in realtime, such as for live broadcasts or telephony applications.
- Input streaming: When you don’t have the transcript available upfront like when streaming from a language model.
API Reference for the WebSockets endpoint.
Input streaming for conversational applications
In many real-time use cases, you don’t have your transcripts available upfront—like when you’re generating them using a language model. For these cases, Sonic supports input streaming using WebSocket contexts. This allows you to stream a transcript in multiple chunks to Sonic and receive seamless speech in return.
API Reference for Input Streaming and working with Contexts.
Read up the Input Streaming section in the WebSocket API Reference before you start building your conversational applications.
For the best performance, we recommend the following usage pattern:
- Set up a WebSocket at start of the conversation and maintain it throughout the conversation. This incurs a one-time latency cost and optimizes latency for subsequent turns.
- One turn should correspond to one context: Use one context for each turn in the conversation. Contexts maintain prosody between their inputs, so you can send a transcript in multiple parts and receive seamless speech in return.
- Buffer the first request transcript to at least 3 or 4 words for optimizing both latency and prosody.
- Split inputs into sentences: Sending inputs in sentences allows Sonic to generate speech more accurately and with better prosody. Include necessary spaces and punctuation.
- Start a new context for interruptions: If the user interrupts the conversation, start a new context for the agent’s response.
- Finish a context with an empty transcript: If you don’t know the last transcript in advance, you can send an input with an empty transcript to end the context.
Check out this Pipecat plugin for an excellent example of how to build conversational agents with Sonic.
Telephony with Sonic
Sonic is optimized for telephony use cases, such as call centers, voice broadcasting, and IVRs. You can integrate Sonic with Twilio to build telephony applications that generate speech in real-time.
Learn how to use Sonic with Twilio for telephony applications.
Concurrency and rate limiting
We measure concurrency in terms of the number of unique context_id
s active at a given time. Note that we persist context_id
s for 5 seconds after the last generation if input streaming is used for a given context ID. For conversational use cases, you can typically support more users and connections than the concurrency limit on your subscription plan.
If you exceed your concurrency limit, you will receive a 429 Too Many Requests
error. You can check your concurrency limit and upgrade it at play.cartesia.ai.
Improving speech & cloning quality
Best practices to improve the quality of your generated speech.
Learn how to clone voices and generate speech with a specific voice.
Controlling style and pronunciation
Learn how to control the speed and emotion of your generated speech.
Learn how to customize pronunciation for your transcripts.