Developer Quickstart

Getting started with the API

Using the API

Generate your first words and learn API conventions.

Optimizing latency for realtime performance

We recommend using the WebSockets API for low-latency applications. The WebSockets endpoint supports a persistent, bidirectional connection that allows you to send and receive messages in realtime.

WebSockets are a good fit for interactive use cases such as conversational agents and telephony.

Using WebSockets

API Reference for the WebSockets endpoint.
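To illustrate, here is a minimal sketch of opening a connection and requesting a single generation with Python and the `websockets` library. The endpoint URL, query parameters, and message fields are assumptions based on the WebSockets API Reference; treat that reference as authoritative.

```python
# A minimal sketch of a WebSocket TTS session using the `websockets` library.
# The URL, query parameters, and message fields below are assumptions drawn
# from the WebSockets API Reference -- check it for the exact schema.
import asyncio
import base64
import json
import os
import uuid

import websockets

API_KEY = os.environ["CARTESIA_API_KEY"]
WS_URL = f"wss://api.cartesia.ai/tts/websocket?api_key={API_KEY}&cartesia_version=2024-06-10"

async def speak_once(transcript: str) -> bytes:
    """Open a connection, request one generation, and collect the audio."""
    audio = bytearray()
    async with websockets.connect(WS_URL) as ws:
        await ws.send(json.dumps({
            "context_id": str(uuid.uuid4()),          # one context per generation
            "model_id": "sonic-english",              # assumed model ID
            "transcript": transcript,
            "voice": {"mode": "id", "id": "YOUR_VOICE_ID"},
            "output_format": {"container": "raw", "encoding": "pcm_f32le", "sample_rate": 44100},
        }))
        async for raw in ws:
            msg = json.loads(raw)
            if msg.get("data"):                       # audio chunks arrive base64-encoded
                audio.extend(base64.b64decode(msg["data"]))
            if msg.get("done"):                       # final message signals the end of the generation
                break
    return bytes(audio)

if __name__ == "__main__":
    pcm = asyncio.run(speak_once("Hello from Sonic over WebSockets."))
    print(f"Received {len(pcm)} bytes of PCM audio")
```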

Input streaming for conversational applications

In many real-time use cases, you don’t have your transcripts available upfront—like when you’re generating them using a language model. For these cases, Sonic supports input streaming using WebSocket contexts. This allows you to stream a transcript in multiple chunks to Sonic and receive seamless speech in return.

Input Streaming with WebSockets

API Reference for Input Streaming and working with Contexts.

Read the Input Streaming section of the WebSocket API Reference before you start building conversational applications.

For the best performance, we recommend the following usage pattern (sketched in code below the list):

  1. Set up a WebSocket at the start of the conversation and maintain it throughout. This incurs a one-time connection cost and keeps latency low for every subsequent turn.
  2. One turn, one context: Use a separate context for each turn in the conversation. Contexts maintain prosody between their inputs, so you can send a transcript in multiple parts and receive seamless speech in return.
  3. Buffer the first transcript of each context to at least three or four words to optimize both latency and prosody.
  4. Split inputs into sentences: Sending inputs in sentences allows Sonic to generate speech more accurately and with better prosody. Include necessary spaces and punctuation.
  5. Start a new context for interruptions: If the user interrupts the conversation, start a new context for the agent’s response.
  6. Finish a context with an empty transcript: If you don’t know the last transcript in advance, you can send an input with an empty transcript to end the context.
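The sketch below applies these practices to a single turn, reusing a connection opened once at the start of the conversation (practice 1). The `continue` flag, the empty-transcript close, and the other field names are assumptions taken from the Input Streaming reference; confirm them there.

```python
# A sketch of the recommended usage pattern, assuming the same `websockets`
# connection and message schema as the earlier example.
import json
import uuid

async def stream_turn(ws, sentences):
    """Stream one conversational turn as one context (practices 2-6)."""
    context_id = str(uuid.uuid4())                    # practice 2: one context per turn
    base = {
        "context_id": context_id,
        "model_id": "sonic-english",                  # assumed model ID
        "voice": {"mode": "id", "id": "YOUR_VOICE_ID"},
        "output_format": {"container": "raw", "encoding": "pcm_f32le", "sample_rate": 44100},
    }
    for sentence in sentences:                        # practice 4: one sentence per input,
        await ws.send(json.dumps({                    # with trailing spaces and punctuation
            **base, "transcript": sentence + " ", "continue": True,
        }))
    # Practice 6: an empty transcript tells Sonic the context is finished.
    await ws.send(json.dumps({**base, "transcript": "", "continue": False}))
```

Open the connection once and call a helper like `stream_turn` for each turn. Buffer the first chunk from your language model until it contains at least three or four words before sending it (practice 3), and start a fresh context if the user interrupts (practice 5).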

Check out this Pipecat plugin for an excellent example of how to build conversational agents with Sonic.

Telephony with Sonic

Sonic is optimized for telephony use cases, such as call centers, voice broadcasting, and IVRs. You can integrate Sonic with Twilio to build telephony applications that generate speech in real-time.

Cartesia <> Twilio

Learn how to use Sonic with Twilio for telephony applications.
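As a sketch of the audio hand-off: Twilio Media Streams expects 8 kHz mu-law audio delivered as base64 payloads inside `media` messages. The `output_format` values below, and the assumption that Sonic's base64 audio chunks can be forwarded directly, should be verified against the Twilio guide above.

```python
# A sketch of bridging Sonic audio into a Twilio Media Stream. The output
# format values are assumptions for 8 kHz mu-law telephony audio.
import json

TELEPHONY_OUTPUT_FORMAT = {
    "container": "raw",
    "encoding": "pcm_mulaw",   # assumed encoding name for mu-law output
    "sample_rate": 8000,       # Twilio Media Streams uses 8 kHz audio
}

async def forward_audio_to_twilio(twilio_ws, stream_sid: str, b64_audio_chunk: str):
    """Wrap a base64 audio chunk from Sonic in a Twilio media message."""
    await twilio_ws.send(json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": b64_audio_chunk},
    }))
```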

Concurrency and rate limiting

We measure concurrency as the number of unique context_ids active at a given time. Note that when input streaming is used, a context_id stays active for 5 seconds after its last generation. For conversational use cases, you can typically support more users and connections than the concurrency limit on your subscription plan, since not every conversation is generating speech at the same moment.

If you exceed your concurrency limit, you will receive a 429 Too Many Requests error. You can check your concurrency limit and upgrade it at play.cartesia.ai.
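A simple way to stay resilient is to back off and retry when a request returns 429. The sketch below is generic Python using the `requests` library; the URL and headers are placeholders, and the retry loop is the point.

```python
# A minimal sketch of retrying on 429 Too Many Requests with exponential backoff.
import time

import requests

def post_with_backoff(url: str, headers: dict, payload: dict, max_retries: int = 5):
    """POST, retrying with exponential backoff whenever the concurrency limit is hit."""
    delay = 0.5
    for _ in range(max_retries):
        resp = requests.post(url, headers=headers, json=payload)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp
        time.sleep(delay)      # concurrency limit hit; wait before retrying
        delay *= 2
    raise RuntimeError("Still rate limited after retries")
```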

Improving speech & cloning quality

Improving Speech Quality

Best practices to improve the quality of your generated speech.

Voice Cloning

Learn how to clone voices and generate speech with a specific voice.

Controlling style and pronunciation

Controlling Speed & Emotion

Learn how to control the speed and emotion of your generated speech.

Custom Pronunciations

Learn how to customize pronunciation for your transcripts.