# Delete Agent
Source: https://docs.cartesia.ai/api-reference/agents/agents/delete
/latest.yml DELETE /agents/{agent_id}
# Get Agent
Source: https://docs.cartesia.ai/api-reference/agents/agents/get
/latest.yml GET /agents/{agent_id}
Returns the details of a specific agent. To create an agent, use the CLI or the Playground for the best experience and GitHub integration.
# List Agents
Source: https://docs.cartesia.ai/api-reference/agents/agents/list
/latest.yml GET /agents
Lists all agents associated with your account.
# List Phone Numbers
Source: https://docs.cartesia.ai/api-reference/agents/agents/phone-numbers
/latest.yml GET /agents/{agent_id}/phone-numbers
List the phone numbers associated with an agent. Currently, you can only have one phone number per agent and these are provisioned by Cartesia.
# List Templates
Source: https://docs.cartesia.ai/api-reference/agents/agents/templates
/latest.yml GET /agents/templates
List of public, Cartesia-provided agent templates to help you get started.
# Update Agent
Source: https://docs.cartesia.ai/api-reference/agents/agents/update
/latest.yml PATCH /agents/{agent_id}
# Download Call Audio
Source: https://docs.cartesia.ai/api-reference/agents/calls/download-call-audio
/latest.yml GET /agents/calls/{call_id}/audio
Streams the call audio file (WAV format) to the client.
# Get Call
Source: https://docs.cartesia.ai/api-reference/agents/calls/get-call
/latest.yml GET /agents/calls/{call_id}
# Get Call Runtime Logs
Source: https://docs.cartesia.ai/api-reference/agents/calls/get-call-logs
/latest.yml GET /agents/calls/{call_id}/logs
Returns the runtime logs for a specific call. These are the logs produced by your agent's code during the call. Logs may not be available if the call is still in progress or if they have been removed due to data retention settings.
# List Calls
Source: https://docs.cartesia.ai/api-reference/agents/calls/list-calls
/latest.yml GET /agents/calls
Lists calls for a specific agent, sorted by start time in descending order. `agent_id` is required. To include `transcript` in the response, add `expand=transcript` to the request. This endpoint is paginated.
# Get Deployment
Source: https://docs.cartesia.ai/api-reference/agents/deployments/get-deployment
/latest.yml GET /agents/deployments/{deployment_id}
Get a deployment by its ID.
# List Deployments
Source: https://docs.cartesia.ai/api-reference/agents/deployments/list-deployments
/latest.yml GET /agents/{agent_id}/deployments
List of all deployments associated with an agent.
# Add Metric to Agent
Source: https://docs.cartesia.ai/api-reference/agents/metrics/add-metric-to-agent
/latest.yml POST /agents/{agent_id}/metrics/{metric_id}
Add a metric to an agent. Once added, the metric runs automatically on all subsequent calls to the agent.
# Create Metric
Source: https://docs.cartesia.ai/api-reference/agents/metrics/create-metric
/latest.yml POST /agents/metrics
Create a new metric.
# Export Metric Results as CSV
Source: https://docs.cartesia.ai/api-reference/agents/metrics/export-metric-results
/latest.yml GET /agents/metrics/results/export
Export metric results to a CSV file. This endpoint streams at most 100k results as the CSV file directly to the client. Use the optional filters to narrow down the results to export.
# Get Metric
Source: https://docs.cartesia.ai/api-reference/agents/metrics/get-metric
/latest.yml GET /agents/metrics/{metric_id}
Get a metric by its ID.
# List Metric Results
Source: https://docs.cartesia.ai/api-reference/agents/metrics/list-metric-results
/latest.yml GET /agents/metrics/results
Paginated list of metric results. Filter results using the query parameters.
# List Metrics
Source: https://docs.cartesia.ai/api-reference/agents/metrics/list-metrics
/latest.yml GET /agents/metrics
List of all LLM-as-a-Judge metrics owned by your account.
# Remove Metric from Agent
Source: https://docs.cartesia.ai/api-reference/agents/metrics/remove-metric-from-agent
/latest.yml DELETE /agents/{agent_id}/metrics/{metric_id}
Remove a metric from an agent. Once removed, the metric no longer runs automatically on calls to the agent. Existing metric results will remain.
# API Status and Version
Source: https://docs.cartesia.ai/api-reference/api-status/get
/latest.yml GET /
# Speech-to-Text (Streaming)
Source: https://docs.cartesia.ai/api-reference/stt/stt
This endpoint creates a bidirectional WebSocket connection for real-time speech transcription.
Our STT endpoint accepts a stream of audio bytes and returns transcription results as they become available.
**Usage Pattern**:
1. Connect to the WebSocket with appropriate query parameters
2. Send audio chunks as binary WebSocket messages in the specified encoding format
3. Receive transcription messages as JSON with word-level timestamps
4. Send `finalize` as a text message to flush any remaining audio (receives `flush_done` acknowledgment)
5. Send `done` as a text message to close the session cleanly (receives `done` acknowledgment and closes)
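A minimal sketch of this pattern using the `websockets` Python library is below. The WebSocket URL, query parameter names, and authentication mechanism shown here are assumptions based on the parameters described above; consult the endpoint reference for the exact values.

```python theme={null}
# Sketch of the streaming STT pattern above, using the `websockets` library.
# The URL, query parameters, and auth mechanism are assumptions.
import asyncio
import json

import websockets

STT_WS_URL = (  # assumed URL and query parameters
    "wss://api.cartesia.ai/stt/websocket"
    "?model=ink-whisper&encoding=pcm_s16le&sample_rate=16000&api_key=YOUR_API_KEY"
)

async def transcribe(chunks):
    async with websockets.connect(STT_WS_URL) as ws:
        # 2. Send audio chunks as binary messages (pcm_s16le at 16 kHz).
        for chunk in chunks:
            await ws.send(chunk)
        # 4. Flush any buffered audio, then 5. close the session cleanly.
        await ws.send("finalize")
        await ws.send("done")
        # 3. Receive transcription messages as JSON until the server closes.
        async for message in ws:
            if isinstance(message, str):
                print(json.loads(message))

# asyncio.run(transcribe(my_pcm_chunks))
```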
**Performance Recommendation**: For best performance, it is recommended to resample audio before streaming and send audio chunks in `pcm_s16le` format at 16kHz sample rate.
**Pricing**: Speech-to-text streaming is priced at **1 credit per 1 second** of audio streamed in.
For WebSocket connection limits, see the [concurrency limits and timeouts](/use-the-api/concurrency-limits-and-timeouts) page.
# Speech-to-Text (Batch)
Source: https://docs.cartesia.ai/api-reference/stt/transcribe
/latest.yml POST /stt
Transcribes audio files into text using Cartesia's Speech-to-Text API.
Upload an audio file and receive a complete transcription response. Supports arbitrarily long audio files with automatic intelligent chunking for longer audio.
**Supported audio formats:** flac, m4a, mp3, mp4, mpeg, mpga, oga, ogg, wav, webm
**Response format:** Returns JSON with transcribed text, duration, and language. Include `timestamp_granularities: ["word"]` to get word-level timestamps.
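As an illustration, a batch transcription request might look like the following sketch. The multipart field names (`file`, `model`) and the response keys are assumptions inferred from the description above; check the endpoint reference for the exact schema.

```python theme={null}
# Sketch: upload an audio file for batch transcription.
# Field names and response keys are assumptions; see the endpoint reference.
import os
import requests

with open("meeting.mp3", "rb") as f:
    res = requests.post(
        "https://api.cartesia.ai/stt",
        headers={
            "Cartesia-Version": "2025-04-16",
            "Authorization": f"Bearer {os.environ['CARTESIA_API_KEY']}",
        },
        files={"file": f},
        data={"model": "ink-whisper"},  # add language / timestamp options per the reference
    )
res.raise_for_status()
result = res.json()
print(result["text"], result["duration"], result["language"])
```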
**Pricing:** Batch transcription is priced at **1 credit per 2 seconds** of audio processed.
For migrating from the OpenAI SDK, see our [OpenAI Whisper to Cartesia Ink Migration Guide](/use-the-api/migrate-from-open-ai).
# Text to Speech (Bytes)
Source: https://docs.cartesia.ai/api-reference/tts/bytes
/latest.yml POST /tts/bytes
# Text to Speech (SSE)
Source: https://docs.cartesia.ai/api-reference/tts/sse
/latest.yml POST /tts/sse
# Text to Speech (WebSocket)
Source: https://docs.cartesia.ai/api-reference/tts/websocket
This endpoint creates a bidirectional WebSocket connection. The connection supports multiplexing, so you can send multiple requests and receive the corresponding responses in parallel.
The WebSocket API is built around contexts:
- When you send a generation request, you pass a `context_id`. Further inputs on the same `context_id` will [continue the generation](/build-with-cartesia/capability-guides/stream-inputs-using-continuations), maintaining prosody.
- Responses for a context contain the `context_id` you passed in so that you can match requests and responses.
Read the guide [on working with contexts](/use-the-api/tts-websocket/contexts) to learn more.
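As an illustration, here is a sketch of multiplexing two independent generations over one connection by giving each its own `context_id`. The field names follow the WebSocket examples elsewhere in these docs; the voice ID and `output_format` values are placeholders.

```python theme={null}
# Sketch: two independent generations multiplexed over one TTS WebSocket.
# Field names follow the examples in these docs; values are placeholders.
def make_request(context_id: str, transcript: str) -> dict:
    return {
        "model_id": "sonic-3",
        "transcript": transcript,
        "voice": {"mode": "id", "id": "a0e99841-438c-4a64-b679-ae501e7d6091"},
        "output_format": {"container": "raw", "encoding": "pcm_s16le", "sample_rate": 44100},
        "context_id": context_id,
        "continue": False,  # each request here is a complete, standalone input
    }

requests_to_send = [
    make_request("context-a", "First caller hears this."),
    make_request("context-b", "Second caller hears this."),
]
# Send both over the same WebSocket; each response includes its context_id,
# so you can route the returned audio chunks back to the right caller.
```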
For the best performance, we recommend the following usage pattern:
1. **Do many generations over a single WebSocket**. Just use a separate context for each generation. The WebSocket scales up to dozens of concurrent generations.
2. **Set up the WebSocket before the first generation**. This ensures you don’t incur latency when you start generating speech.
3. **Include necessary spaces and punctuation**: This allows Sonic to generate speech more accurately and with better prosody.
For conversational agent use cases, we recommend the following usage pattern:
1. **Each turn in a conversation should correspond to a context**: For example, if you are using Sonic to power a voice agent, each turn in the conversation should be a new context.
2. **Start a new context for interruptions**: If the user interrupts the agent, start a new context for the agent’s response.
To learn more about managing concurrent generations and WebSocket connection limits, see the [concurrency limits and timeouts](/use-the-api/concurrency-limits-and-timeouts) page.
# Clone Voice
Source: https://docs.cartesia.ai/api-reference/voices/clone
/latest.yml POST /voices/clone
Clone a high-similarity voice from an audio clip. Clones are more similar to the source clip, but may reproduce background noise. For best results, use an audio clip about 5 seconds long.
# Delete Voice
Source: https://docs.cartesia.ai/api-reference/voices/delete
/latest.yml DELETE /voices/{id}
# Get Voice
Source: https://docs.cartesia.ai/api-reference/voices/get
/latest.yml GET /voices/{id}
# List Voices
Source: https://docs.cartesia.ai/api-reference/voices/list
/latest.yml GET /voices
# Localize Voice
Source: https://docs.cartesia.ai/api-reference/voices/localize
/latest.yml POST /voices/localize
Create a new voice from an existing voice localized to a new language and dialect.
# Update Voice
Source: https://docs.cartesia.ai/api-reference/voices/update
/latest.yml PATCH /voices/{id}
Update the name, description, and gender of a voice. To set the gender back to the default, set the gender to `null`. If gender is not specified, the gender will not be updated.
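For illustration, an update request might look like the following sketch. The JSON body fields mirror the description above; check the endpoint reference for the exact schema.

```python theme={null}
# Sketch: rename a voice and reset its gender to the default.
# Body field names mirror the description above and are assumptions.
import os
import requests

voice_id = "YOUR_VOICE_ID"
res = requests.patch(
    f"https://api.cartesia.ai/voices/{voice_id}",
    headers={
        "Cartesia-Version": "2025-04-16",
        "Authorization": f"Bearer {os.environ['CARTESIA_API_KEY']}",
    },
    json={"name": "Narrator v2", "gender": None},  # omitting description leaves it unchanged
)
res.raise_for_status()
```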
# Audio encodings
Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/audio-encodings
Pick the encoding that matches your downstream pipeline.
## TTS output encodings
Used in the `output_format.encoding` field when generating audio.
| Encoding | Bit depth | Best for | Pair with sample rate |
| ----------- | ---------------- | --------------------------------------------------------------- | --------------------------------- |
| `pcm_s16le` | 16-bit int | General-purpose playback, browsers, audio players, most devices | 44100 (CD quality) or 16000–48000 |
| `pcm_f32le` | 32-bit float | ML post-processing, high-fidelity recording, audio analysis | 48000 |
| `pcm_mulaw` | 8-bit compressed | North American / Japanese telephony (G.711μ), Twilio | 8000 |
| `pcm_alaw` | 8-bit compressed | European / international telephony (G.711A) | 8000 |
### `pcm_s16le`
16-bit signed integer PCM, little-endian. Matches the standard audio CD format and is the most widely supported encoding across audio players, browsers, and hardware. Use this as your default unless you have a specific reason to choose another format.
```json theme={null}
{
"container": "raw",
"encoding": "pcm_s16le",
"sample_rate": 44100
}
```
### `pcm_f32le`
32-bit floating point PCM, little-endian. Provides the highest precision and dynamic range. Use when your pipeline handles float audio end-to-end—for example, feeding generated audio into an ML model, performing signal processing with NumPy/SciPy, or recording to a lossless format for later mastering.
```json theme={null}
{
"container": "raw",
"encoding": "pcm_f32le",
"sample_rate": 48000
}
```
### `pcm_mulaw`
8-bit μ-law compressed PCM. The standard encoding for North American and Japanese telephone networks (G.711μ). Use this when sending audio to Twilio or any telephony provider that expects μ-law. Always pair with an 8000 Hz sample rate to match the telephony standard.
```json theme={null}
{
"container": "raw",
"encoding": "pcm_mulaw",
"sample_rate": 8000
}
```
### `pcm_alaw`
8-bit A-law compressed PCM. The standard encoding for European and international telephone networks (G.711A). Use when your telephony infrastructure expects A-law rather than μ-law. Always pair with an 8000 Hz sample rate.
```json theme={null}
{
"container": "raw",
"encoding": "pcm_alaw",
"sample_rate": 8000
}
```
## STT input encodings
Used in the `encoding` parameter when sending audio for transcription. Must match the actual encoding of your audio source.
| Encoding | Bit depth | Common sources |
| ----------- | ---------------- | ------------------------------------------------------------------- |
| `pcm_s16le` | 16-bit int | Microphones, browsers (Web Audio API), most audio capture libraries |
| `pcm_s32le` | 32-bit int | Professional audio interfaces |
| `pcm_f16le` | 16-bit float | Half-precision ML pipelines |
| `pcm_f32le` | 32-bit float | ML models, Web Audio API `AudioWorklet` nodes, NumPy/SciPy |
| `pcm_mulaw` | 8-bit compressed | North American telephony, Twilio streams |
| `pcm_alaw` | 8-bit compressed | European telephony systems |
For best STT performance, resample your audio to `pcm_s16le` at 16000 Hz before sending.
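For example, here is a sketch of that resampling step using `soundfile` and `scipy` (assumed dependencies; any resampler works):

```python theme={null}
# Sketch: convert an audio file to pcm_s16le mono at 16 kHz before streaming to STT.
import numpy as np
import soundfile as sf
from scipy.signal import resample_poly

TARGET_RATE = 16000

audio, source_rate = sf.read("input.wav", dtype="float32")
if audio.ndim > 1:
    audio = audio.mean(axis=1)                 # downmix to mono
if source_rate != TARGET_RATE:
    audio = resample_poly(audio, TARGET_RATE, source_rate)
pcm_s16le = (np.clip(audio, -1.0, 1.0) * 32767).astype("<i2").tobytes()
# `pcm_s16le` is now raw little-endian 16-bit PCM at 16 kHz, ready to stream.
```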
# Choosing a Voice
Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/choosing-a-voice
How to pick the best voice for your Voice Agents
When designing a voice agent experience, the voice that your agents will speak in is a critical choice that will influence your customers' experience.
Cartesia offers 500+ voices out of the box, as well as the ability to clone your own voices.
### Featured Voices
We feature a set of Voices that we've found work well for our customers and pass our internal quality checks. These voices are a great starting point to find the best Voice for your voice agent.
Featured Voices are displayed with a check mark icon next to their names on [play.cartesia.ai](https://play.cartesia.ai/).
### Stable voices (best for voice agents)
For voice agents in production, we've found that more stable, realistic voices perform better than studio-quality, emotive voices. From our testing, we think these are the top performing English Voices for voice agents in Sonic 3:
* **Male**: Ronald, Carson
* **Female**: Katie, Jacqueline, Brooke
### Emotive voices (best for AI characters)
Our latest model, Sonic 3, is very expressive. Some voices, like Tessa and Maya, are labeled as Emotive in the playground and respond well to [emotion instructions](/build-with-cartesia/sonic-3/volume-speed-emotion).
If your use case requires more expressive speech (e.g. companion apps, game characters), then we suggest trying:
* **Male**: Kyle, Cory
* **Female**: Tessa, Ariana
We tag such voices as Emotive in our playground and you can see a full list [here](https://play.cartesia.ai/voices?tags=Emotive).
# Choosing TTS parameters
Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/choosing-tts-parameters
Our Text-to-Speech API includes many parameters that can be bewildering to developers who have not
worked with audio before.
In general, you should pick the highest precision and sample rate supported by every stage of your audio
pipeline, including telephony and device outputs.
A typical digital audio setup will perform well with these settings, which match the standard audio CD format:
```
output_format: {
container: "raw",
encoding: "pcm_s16le",
sample_rate: 44100,
}
```
If you know your pipeline supports a higher encoding and sample rate end to end, the highest quality settings are:
```
output_format: {
container: "raw",
encoding: "pcm_f32le",
sample_rate: 48000,
}
```
## Reference
The container format (if any) for the audio output.
Available options: `RAW`, `WAV`, `MP3`. Only the Bytes endpoint supports all container formats;
our streaming endpoints (SSE, WebSocket) only support `RAW`.
The encoding of the output audio. Available options: `pcm_f32le`, `pcm_s16le`, `pcm_mulaw`, `pcm_alaw`.
For detailed guidance on when to use each encoding, see [Audio encodings](/build-with-cartesia/capability-guides/audio-encodings).
The sample rate of the output audio. Remember that to represent a given signal, the sample rate
must be at least twice the highest frequency component of the signal (Nyquist theorem).
Available options: `8000`, `16000`, `22050`, `24000`, `44100`, `48000`.
## Examples
### Audio CD quality
Standard audio CDs are encoded as `pcm_s16le` at 44.1kHz sample rate:
```
output_format: {
container: "raw",
encoding: "pcm_s16le",
sample_rate: 44100,
}
```
This performs well for consumer digital audio setups.
### Telephony
Many customers send their audio output over Twilio. Since all audio sent over Twilio is
transcoded to μ-law encoding with an 8kHz sample rate (to match the telephony standard), you should
specify the following `output_format`:
```
output_format: {
container: "raw",
encoding: "pcm_mulaw",
sample_rate: 8000,
}
```
### Bluetooth headsets
If you happen to know that the user is using a Bluetooth headset (such as AirPods) to multiplex
both microphone input and headphone output, the user will be on the Bluetooth Hands-Free Profile
(HFP), limiting the sample rate to 16kHz. (In practice, it's difficult to programmatically determine the
end-user's microphone/speaker devices, so this example is a bit contrived.)
```
output_format: {
container: "raw"
encoding: "pcm_s16le",
sample_rate: 16000,
}
```
# Clone Voices
Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/clone-voices
Learn how to get the best voice clones from your audio clips.
Voice cloning is available through the [playground](https://play.cartesia.ai) and the [API](/2024-11-13/api-reference/voices/clone). With current API versions, instant cloning uses **high-similarity** mode: clones sound more like the source clip, but may reproduce background noise. For the legacy **stability** workflow, pin API version `2024-11-13` and see [Older TTS models](/build-with-cartesia/tts-models/older-models).
For the best voice clones, we recommend following these best practices:
## General best practices for voice cloning
1. **Choose an appropriate script to speak.** You want your recording to align as closely as possible with the voice you want to generate. For example, don't read a colorless transcript in a monotone voice unless you're aiming for a monotonous clone. Instead, prepare a script that is suited to your use case and has the right energy.
2. **Speak as clearly as possible and avoid background noise.** For example, when recording yourself, try to use a high-quality microphone and be in a quiet space.
3. **Avoid long pauses.** Pauses in the recording will be mimicked by the cloned voice, such as between sentences. Ensure your recording matches the pacing you want your voice to follow.
4. **Trim your recording.** The audio you provide should roughly contain speech from start to finish. Make sure the speaker is not cut off and that there's no excessive silence at the beginning or end. You can use a tool like Audacity or our playground to make the perfect clip from your recording.
5. **Speak in the target language.** For instance, if you want the cloned voice to speak Spanish, speak Spanish in the recording. If this is not possible, you can use Cartesia's localization feature—available in the playground and in the API—to convert your clone to a different language.
## Best practices for high-similarity clones
1. **Limit your recording to ten seconds.** This is the sweet spot for high-similarity clones. A longer clip will not result in a better clone.
2. **Set `enhance` to `false` when cloning.** Unless your source clip has substantial background noise, any postprocessing will reduce the similarity of the clone to the source clip.
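For instance, a clone call following the recommendations above might look like the following sketch using the Python SDK. The `clip`, `language`, `name`, and `mode` arguments follow the SDK example elsewhere in these docs; passing `enhance` here is an assumption based on the parameter named above, so verify it against the Clone Voice reference.

```python theme={null}
# Sketch: instant high-similarity clone from a ~10 second clip, with
# enhancement disabled as recommended above. `enhance` is assumed here.
from cartesia import Cartesia

client = Cartesia(api_key="YOUR_API_KEY")

with open("my_voice_10s.wav", "rb") as clip:
    voice = client.voices.clone(
        clip=clip,
        name="My high-similarity clone",
        language="en",     # must match the language spoken in the clip
        mode="similarity",
        enhance=False,     # keep the clone close to the source clip
    )
print(voice.id)
```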
# End-to-end Pro Voice Cloning (Python)
Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/clone-voices-pro/api
Use Cartesia's REST API to create a Pro Voice Clone.
> **Prerequisites**
>
> 1. You have a **Cartesia API key** (export it as `CARTESIA_API_KEY`).
> 2. You have at least 1M credits on your account.
> 3. You have a folder called `samples/` with one or more `.wav` files.
```python lines theme={null}
"""
End-to-end Pro Voice Cloning example.
Steps
-----
1. Create a dataset.
2. Upload audio files from samples/ to the dataset.
3. Kick off a fine-tune from that dataset.
4. Poll until fine-tune is completed.
5. Get the voices produced by the fine-tune.
"""
import os
import time
from pathlib import Path
import requests
API_BASE = "https://api.cartesia.ai"
API_HEADERS = {
"Cartesia-Version": "2025-04-16",
"Authorization": f"Bearer {os.environ['CARTESIA_API_KEY']}",
}
def create_dataset(name: str, description: str) -> str:
"""POST /datasets → dataset id."""
res = requests.post(
f"{API_BASE}/datasets",
headers=API_HEADERS,
json={"name": name, "description": description},
)
res.raise_for_status()
return res.json()["id"]
def upload_file_to_dataset(dataset_id: str, path: Path) -> None:
"""POST /datasets/{dataset_id}/files (multipart/form-data)."""
with path.open("rb") as fp:
res = requests.post(
f"{API_BASE}/datasets/{dataset_id}/files",
headers=API_HEADERS,
files={"file": fp, "purpose": (None, "fine_tune")},
)
res.raise_for_status()
def create_fine_tune(dataset_id: str, *, name: str, language: str, model_id: str) -> str:
"""POST /fine-tunes → fine-tune id."""
body = {
"name": name,
"description": "Pro Voice Clone demo",
"language": language,
"model_id": model_id,
"dataset": dataset_id,
}
res = requests.post(f"{API_BASE}/fine-tunes", headers=API_HEADERS, json=body, timeout=60)
res.raise_for_status()
return res.json()["id"]
def wait_for_fine_tune(ft_id: str, every: float = 10.0) -> None:
"""Poll GET /fine-tunes/{id} until status == completed."""
start = time.monotonic()
while True:
res = requests.get(f"{API_BASE}/fine-tunes/{ft_id}", headers=API_HEADERS)
res.raise_for_status()
status = res.json()["status"]
print(f"fine-tune {ft_id} -> {status}. Elapsed: {time.monotonic() - start:.0f}s")
if status == "completed":
return
if status == "failed":
raise RuntimeError(f"fine-tune ended with status={status}")
time.sleep(every)
def list_voices(ft_id: str) -> list[dict]:
"""GET /fine-tunes/{id}/voices → list of voices."""
res = requests.get(f"{API_BASE}/fine-tunes/{ft_id}/voices", headers=API_HEADERS)
res.raise_for_status()
return res.json()["data"]
if __name__ == "__main__":
# Create the dataset
DATASET_ID = create_dataset("PVC demo", "Samples for a Pro Voice Clone")
print("Created dataset:", DATASET_ID)
# Upload .wav files to the dataset
for wav_path in Path("samples").glob("*.wav"):
upload_file_to_dataset(DATASET_ID, wav_path)
print(f"Uploaded {wav_path.name} to dataset {DATASET_ID}")
# Ask for confirmation before kicking off the fine-tune
confirmation = input(
"Are you sure you want to start the fine-tune? It will cost 1M credits upon successful completion (yes/no): "
)
if confirmation.lower() != "yes":
print("Fine-tuning cancelled by user.")
exit()
# Kick off the fine-tune
FINE_TUNE_ID = create_fine_tune(
DATASET_ID,
name="PVC demo",
language="en",
model_id="sonic-2",
)
print(f"Started fine-tune: {FINE_TUNE_ID}")
# Wait for training to finish
wait_for_fine_tune(FINE_TUNE_ID)
print("Fine-tune completed!")
# Fetch the voices created by the fine-tune
voices = list_voices(FINE_TUNE_ID)
print("Voices IDs:")
for voice in voices:
print(voice["id"])
```
# Pro Voice Cloning
Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/clone-voices-pro/playground
## Why use Pro Voice Cloning?
A Professional Voice Clone (PVC) is a voice that uses a fine-tune of our TTS model on your data, which allows it to create an almost exact replica of the voice it hears, including accent, speaking style, and audio quality.
Compared to [Instant Voice Cloning](/build-with-cartesia/capability-guides/clone-voices), Pro Voice Cloning can capture the exact nuances of your hours of studio-quality audio voice data.
## Overview
Pro Voice Cloning is available in the [Playground](https://play.cartesia.ai/pro-voice-cloning) for anyone with a Cartesia subscription of Startup or higher. It allows you to create highly accurate voice clones by leveraging a larger amount of data compared to instant cloning.
| Feature | Required audio data | Pricing: cost to create | Pricing: cost to use for TTS |
| ------------------- | ------------------- | ----------------------- | ---------------------------- |
| Instant Voice Clone | 10 seconds | Free | 1 credit per character |
| Pro Voice Clone | 3 hours | 1M credits on success | 1.5 credits per character |
When you create a Pro Voice Clone, Cartesia first fine-tunes a model on your data, then creates Voices from selected clips of your data. These Voices are tied to the fine-tuned model, which is automatically used when you generate text-to-speech with them.
## Get started
Visit the Pro Voice Clone tab to get started on your first PVC. On the home page, you can see all your fine-tuned models and their statuses (Draft, Failed, Training, or Completed).
Fill out the form to create a Pro Voice Clone.
Then, upload all of the audio files you want to use for training. You can upload multiple
files at once. Files must be one of the following audio formats:
* .wav
* .mp3
* .flac
* .ogg
* .oga
* .ogx
* .aac
* .wma
* .m4a
* .opus
* .ac3
* .webm
Pro Voice Clones require a minimum of 30 minutes of audio, but we recommend 2 hours of audio for optimal balance of quality and effort. The Pro Voice Clone will closely match your uploaded data, so make sure it sounds the way you like in terms of background noise, loudness, and speech quality.
Generally, it's better to upload audio containing only the speaker you wish to clone. Multi-speaker audio can interfere with cloning quality.
If you want to reuse data from past Pro Voice Clones, switch to the **Select dataset** tab to view previous datasets. These datasets can be edited separately from your PVCs and are helpful for managing your audio files.
Training should take 3 hours to complete. You'll only be charged if the training is successful. If training fails, you can click the `Re-attempt Training` button to try again or contact [support](mailto:support@cartesia.ai) if the failures persist.
Once training is complete, we'll automatically create four Voices based on different source audio clips from your dataset. These Voices are internally linked to your fine-tuned model, which will be used when you specify the model ID of the fine-tuned model in your requests.
The Voices are also available in the Voice Library under My Voices and can be used through the API.
**Note about base model updates:**
We've fine-tuned the latest base model available in production, which is reflected in the displayed model ID. This means that the fine-tuned model is fixed to this particular model ID and will not be activated if you use a different `model-id`. PVCs will not automatically be updated for future base models, and will need to be retrained on each new base model.
Retraining a new fine-tuned model with new data or the latest base model will again cost 1M credits.
# Localize voices
Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/localize-voices
Learn how to localize voices for your brand or product.
The localization feature accepts a voice to localize, the gender of the voice, and the target language and accent to localize to, and produces a Voice that you can use to generate speech (or save as a new voice).
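As an illustration, a localization request might look like the following sketch. The body field names (`voice_id`, `language`, `original_speaker_gender`, `dialect`) are assumptions based on the description above; check the Localize Voice reference for the exact schema.

```python theme={null}
# Sketch: localize an existing voice to another language/accent.
# Body field names are assumptions; see the Localize Voice reference.
import os
import requests

res = requests.post(
    "https://api.cartesia.ai/voices/localize",
    headers={
        "Cartesia-Version": "2025-04-16",
        "Authorization": f"Bearer {os.environ['CARTESIA_API_KEY']}",
    },
    json={
        "voice_id": "a0e99841-438c-4a64-b679-ae501e7d6091",
        "language": "fr",
        "original_speaker_gender": "female",
        "dialect": "fr",  # target accent/dialect
    },
)
res.raise_for_status()
print(res.json())  # the localized Voice you can use or save
```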
# Stream Inputs using Continuations
Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/stream-inputs-using-continuations
Learn how to stream input text to Sonic TTS.
In many real-time use cases, you don't have input text available upfront—like when you're generating it on the fly using a language model. For these cases, we support input streaming through a feature we call *continuations*.
This guide will cover how input streaming works from the perspective of the TTS model. If you just want to implement input streaming, see [the WebSocket API reference](/api-reference/tts/tts), which implements continuations using *contexts*.
## Continuations
Continuations are generations that extend already generated speech. They're called continuations because you're continuing the generation from where the last one left off, maintaining the *prosody* of the previous generation.
If you don't use continuations, you get sudden changes in prosody that create seams in the audio.
Prosody refers to the rhythm, intonation, and stress in speech. It's what makes speech flow naturally and sound human-like.
Let's say we're using an LLM and it generates a transcript in three parts, with a one second delay between each part:
1. `Hello, my name is Sonic.`
2. ` It's very nice`
3. ` to meet you.`
To generate speech for the whole transcript, we might think to generate speech for each part independently and stitch the audios together:
Unfortunately, we end up with speech that has sudden changes in prosody and strange pacing.
Now, let's try the same transcripts, but using continuations: each part continues the previous generation on the same context. The resulting output sounds seamless and natural.
You can scale up continuations to any number of inputs. There is no limit.
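Here is a sketch of the continuation setup over the WebSocket. The field names follow the WebSocket example later in this guide; the voice ID and `output_format` values are placeholders.

```python theme={null}
# Sketch: send the three LLM chunks as continuations on one context.
# Field names follow the WebSocket example later in this guide.
chunks = ["Hello, my name is Sonic.", " It's very nice", " to meet you."]

def continuation_message(i: int, text: str) -> dict:
    return {
        "model_id": "sonic-3",
        "transcript": text,
        "voice": {"mode": "id", "id": "a0e99841-438c-4a64-b679-ae501e7d6091"},
        "output_format": {"container": "raw", "encoding": "pcm_s16le", "sample_rate": 44100},
        "context_id": "llm-turn-1",       # same context for every chunk
        "continue": i < len(chunks) - 1,  # False on the last chunk
    }

messages = [continuation_message(i, text) for i, text in enumerate(chunks)]
# Send each message over the WebSocket as soon as its chunk arrives from the LLM;
# the audio continues seamlessly because prosody is carried across the context.
```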
## Caveat: Streamed inputs should form a valid transcript when joined
This means that `"Hello, world!"` can be followed by `" How are you?"` (note the leading space) but not `"How are you?"`, since when joined they form the invalid transcript `"Hello, world!How are you?"`.
In practice, this means you should maintain spacing and punctuation in your streamed inputs.
**End complete sentences with closing punctuation** (for example `.`, `?`, or `!`).
If a streamed chunk does not end with sentence-ending punctuation, the model often treats it as an incomplete sentence. That can cause:
* **Extra latency:** Text may stay in the automatic input buffer until the model sees a clearer boundary or until `max_buffer_delay_ms` elapses (**3000ms by default**), so audio starts later than you expect.
* **Audio artifacts:** The model expects natural sentence endings; without closing punctuation, the generated audio sometimes ends with odd or distorted sounds.
When a user-facing utterance is finished, put terminal punctuation on the final segment (and signal that no more text is coming on the context when appropriate, for example `no_more_inputs()` in the SDK or `continue: false` over the WebSocket).
## Automatic buffering with `max_buffer_delay_ms`
When streaming inputs from LLMs word-by-word or token-by-token, we buffer text until we reach the optimal transcript length for our model. The default buffer is 3000ms; if you wish to modify it, use the `max_buffer_delay_ms` parameter, though we *do not recommend making this change*.
If you plan on using `speed` or `volume` [SSML tags](/build-with-cartesia/sonic-3/ssml-tags) with buffering, make sure decimal values are not split up.
Submitting `1.0` as `1`, `.`, `0` will result in unintended failure modes.
### How it works
When set, the model will buffer incoming text chunks until it's confident it has enough context to generate high-quality speech, or the buffer delay elapses, whichever comes first.
Without this buffer, the model would immediately start generating with each input, which could result in choppy audio or unnatural prosody if inputs are very small (like single words or tokens).
### Configuration
* **Range**: Values between 0-5000ms are supported
* **Default**: 3000ms
Use this *only* if
* you have custom buffering client side, in which case you can set this to 0
* you have choppiness even at 3000ms, in which case you can try a higher value
```js lines theme={null}
// Example WebSocket request with `max_buffer_delay_ms`
{
"model_id": "sonic-3",
"transcript": "Hello", // First word/token
"voice": {
"mode": "id",
"id": "a0e99841-438c-4a64-b679-ae501e7d6091"
},
"context_id": "my-conversation-123",
"continue": true,
"max_buffer_delay_ms": 3000 // Buffer up to 3000ms
}
```
Let's try the following transcripts with continuations and the default `max_buffer_delay_ms=3000`: `['Hello', 'my name', 'is Sonic.', "It's ", 'very ', 'nice ', 'to ', 'meet ', 'you.']`
# Custom Pronunciations
Source: https://docs.cartesia.ai/build-with-cartesia/sonic-3/custom-pronunciations
Learn how to specify custom pronunciations for words that are hard to get right, like proper nouns or domain-specific terms.
All models in the Sonic TTS family support custom pronunciations in your transcripts. Try out the pronunciation tool on our [demo](https://play.cartesia.ai/demos/pronunciation) page.
`sonic-3` supports custom pronunciation dictionaries, which allow specifying how to pronounce a specific word or words more easily and sustainably.
At its core, a dictionary is a simple search and replace, which directs the model to use another string in lieu of the text for the transcript. The pronunciation can either be an [IPA pronunciation](/build-with-cartesia/sonic-3/phonemes), or a "sounds-like" guidance:
```json lines theme={null}
[
{
"text": "bayou",
"pronunciation": "<<ˈ|b|ɑ|ˈ|j|u>>"
},
{
"text": "jambalaya",
"pronunciation": "<<ˈ|dʒ|ə|m|ˈ|b|ə|ˈ|l|aɪ|ˈ|ə>>"
},
{
"text": "tchoupitoulas",
"pronunciation": "chop-uh-TOO-liss"
}
]
```
These JSONs can then be saved as pronunciation dictionaries [through our API](https://docs.cartesia.ai/api-reference/pronunciation-dicts/create) or through our [playground](https://play.cartesia.ai/pronunciation), which also provides UI affordances for creating and editing dictionaries directly.
Once the dictionaries are created, they can be used in any of the TTS APIs by specifying the dictionary's ID in `pronunciation_dict_id`.
With the above dictionary, the string `I ate some jambalaya on tchoupitoulas street` would become `I ate some <<ˈ|dʒ|ə|m|ˈ|b|ə|ˈ|l|aɪ|ˈ|ə>> on chop-uh-TOO-liss street` before being handed off to the model, which in turn would pronounce it properly.
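For illustration, a TTS request that applies a saved dictionary might look like the following sketch. Only the `pronunciation_dict_id` field name comes from the text above; its placement in the request body alongside the other TTS fields is an assumption, so check the TTS reference.

```python theme={null}
# Sketch: a TTS request that applies a saved pronunciation dictionary.
# Placement of `pronunciation_dict_id` in the body is an assumption.
tts_request = {
    "model_id": "sonic-3",
    "transcript": "I ate some jambalaya on tchoupitoulas street.",
    "voice": {"mode": "id", "id": "a0e99841-438c-4a64-b679-ae501e7d6091"},
    "output_format": {"container": "raw", "encoding": "pcm_s16le", "sample_rate": 44100},
    "pronunciation_dict_id": "YOUR_DICT_ID",
}
# The replacements happen before synthesis, so the model sees the IPA or
# sounds-like strings from the dictionary instead of the raw words.
```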
## Case Sensitivity
Dictionary matching is **case-sensitive**, with one exception: a lowercase entry also matches its sentence-start capitalized form. For example, `cat` matches both `cat` and `Cat`, but not `CAT`. An entry for `CAT` only matches `CAT`.
This applies to multi-word entries too. An entry for `green valley` matches `green valley` and `Green valley`, but not `Green Valley`.
**Use lowercase entries for common words.** These match the word both mid-sentence (`cat`) and at the start of a sentence (`Cat`), covering the two most common positions.
**Use exact capitalization for proper nouns.** A term like "LaTeX" should be entered as `LaTeX` so it doesn't collide with a different pronunciation for the common word `latex`. For multi-word proper nouns, enter the exact casing as it appears in your transcripts — for example, `Green Valley` if the transcript capitalizes both words.
> For the best controllability around pronunciation, we recommend using `sonic-3`.
`sonic-2` and `sonic-turbo` use MFA-style IPA for all languages. Among the older models, `sonic-2` offers the best pronunciation controllability.
You can also get custom pronunciations with older Sonic models:
* The `sonic`, `sonic-2024-12-12`, and `sonic-2024-10-19` models use Sonic-flavored IPA phonemes for English.
* The `sonic` and `sonic-2024-12-12` models use MFA-style IPA for languages other than English, and the Sonic Preview model uses MFA-style IPA for all languages.
* `sonic-2024-10-19` does not support custom pronunciations for languages other than English.
We will soon be updating all models to use MFA-style IPA.
Custom words should be wrapped in double angle brackets `<<` `>>`, with pipe characters `|` between phonemes and no whitespace.
For example:
* `Can I get <> on that?` (MFA-style IPA)
* `Can I get <> on that?` (Sonic-flavored IPA)
Each individual word should be wrapped in its own set of angle brackets.
# MFA-style IPA
## Constructing Pronunciations
We use the IPA phoneset as defined by the [Montreal Forced Aligner](https://montreal-forced-aligner.readthedocs.io/en/latest/). Because of the size and complexity of this phoneset, you may find it easier to construct your custom pronunciation starting from existing words with known phonemizations. We suggest the following workflow for constructing a custom pronunciation for a word:
1. Go to the [MFA pronunciation dictionary index](https://mfa-models.readthedocs.io/en/latest/dictionary/index.html) and find the page corresponding to your language. Make sure the phoneset is MFA, and try to download the latest version (for most languages, v3.0 or v3.1).
1. This page will give you the full range of acceptable phones for your language under the “phones” section.
2. Scroll down to the `Installation` section and click on the `Download from the release page` link.
3. Scroll to the bottom of the release page and download the .dict file; this is a text file mapping words to their constituent phonemes.
1. The first column in the file contains words, and the last column contains space delimited phonemes. Ignore the other columns.
4. Look up your word or words that sound similar to your intended pronunciation in the dictionary. Use these pronunciations as a starting point to construct your custom pronunciation.
Automatic pronunciation suggestions based on audio samples will be added in a future update. Note that MFA-style IPA does not support stress markers.
## Example
Suppose I want to generate the text “This is a generation from Cartesia” and the model is not pronouncing “Cartesia” correctly. I would do the following:
1. Go to the [MFA pronunciation dictionary index](https://mfa-models.readthedocs.io/en/latest/dictionary/index.html) and look for English pronunciation dictionaries. I see that for US English, the most recent version is v3.1.
1. I note that the page says that the acceptable phones for US english are `aj aw b bʲ c cʰ cʷ d dʒ dʲ d̪ ej f fʲ h i iː j k kʰ kʷ l m mʲ m̩ n n̩ ow p pʰ pʲ pʷ s t tʃ tʰ tʲ tʷ t̪ v vʲ w z æ ç ð ŋ ɐ ɑ ɑː ɒ ɒː ɔj ə ɚ ɛ ɝ ɟ ɟʷ ɡ ɡʷ ɪ ɫ ɫ̩ ɱ ɲ ɹ ɾ ɾʲ ɾ̃ ʃ ʉ ʉː ʊ ʎ ʒ ʔ θ`
2. Download the .dict file from the bottom of the [release page](https://github.com/MontrealCorpusTools/mfa-models/releases/tag/dictionary-english_us_mfa-v3.1.0).
3. Find a word in this dictionary that sounds similar to how I want “Cartesia” to be pronounced. I see this entry in the dictionary:
`cartesian 0.99 0.14 1.0 1.0 kʰ ɑ ɹ tʲ i ʒ ə n`
4. Ignore the middle four numeric columns. I want to cut off the part of the pronunciation that corresponds to “-an” and replace it with an “uh” sound. I know that the MFA phoneme for “uh” is `ɐ` (if I didn’t know that, I could also look up “uh” in the dictionary). So the pronunciation I want is `kʰ ɑ ɹ tʲ i ʒ ɐ`.
5. Format the phonemes in angle brackets with pipe characters between phonemes and no whitespace. So my transcript is `This is a generation from <<kʰ|ɑ|ɹ|tʲ|i|ʒ|ɐ>>`.
# (Deprecated) Sonic-flavored IPA
Sonic-flavored IPA is only for `sonic`; users of our latest models (`sonic-2` and `sonic-turbo`) should use MFA-style IPA.
Here is a pronunciation guide for Sonic-flavored IPA.
It follows the [English phonology article on Wikipedia](https://en.wikipedia.org/wiki/English_phonology) for most phonemes,
but in spots where our model requires different notation than you may expect, we've included a blue `<=` in the margins.
You can copy/paste some of these uncommon symbols from the original [charts here](https://docs.google.com/spreadsheets/d/1OJbiKtxLyodpNPqVfOu43X2HloLsAixTtFppEuQ_4pI/edit?usp=sharing).
## Stresses and vowel length markers
Sonic English requires stress markers for first (`ˈ`) and second (`ˌ`) stressed syllables, which go directly before the vowel. We also use annotations for vowel length (`ː`). The model can also operate without them, but you will have noticeably better robustness and control when using them.
# Prompting tips
Source: https://docs.cartesia.ai/build-with-cartesia/sonic-3/prompting-tips
1. **Use appropriate punctuation.** Add punctuation where appropriate and at the end of each transcript whenever possible.
2. **Use dates in MM/DD/YYYY form.** For example, 04/20/2023.
3. **Add spaces between time and AM/PM.** For example, `7:00 PM`, `7 PM`, `7:00 P.M.`
4. **Insert pauses.** To insert pauses, insert "-" or use [break tags](/build-with-cartesia/formatting-text-for-sonic-2/inserting-breaks-pauses) where you need the pause. These tags are considered 1 character and do not need to be separated from adjacent text by a space -- to save credits you can remove spaces around break tags.
5. **Match the voice to the language.** Each voice has a language that it works best with. You can use the playground to quickly understand which voices are most appropriate for a language.
6. **Stream in inputs for contiguous audio.** Use [continuations](/build-with-cartesia/capability-guides/stream-inputs-using-continuations) if generating audio that should sound contiguous in separate chunks.
7. **Specify [custom pronunciations](/build-with-cartesia/sonic-3/custom-pronunciations) for domain-specific or ambiguous words.** You may want to do this for proper nouns and trademarks, as well as for words that are spelled the same but pronounced differently, like the city of Nice and the adjective "nice."
8. **Force [spelling out numbers and letters](/build-with-cartesia/sonic-3/ssml-tags#spelling-out-numbers-and-letters).** You may want to do this for IDs, email addresses, or numeric values.
For sonic-2, see [Formatting Text for Sonic-2](/build-with-cartesia/formatting-text-for-sonic-2/best-practices).
# SSML Tags
Source: https://docs.cartesia.ai/build-with-cartesia/sonic-3/ssml-tags
Tags for volume, speed, and emotion are in beta and subject to change in the
future.
Sonic-3 supports SSML-like (Speech Synthesis Markup Language) tags to control generated speech.
## Speed
Note that if you're streaming token by token, you'll need to buffer the whole value of the speed or volume tags.
Passing in `1`, `.`, `0` as separate inputs, for example, will result in reading out the tags.
You can guide the speed of a TTS generation with a `speed` tag, which takes a scalar between `0.6` and `1.5`.
This value is roughly a multiplier on the default speed. For example, `1.5` will generate audio at roughly 1.5x the
default speed.
```xml theme={null}
I like to speak quickly because it makes me sound smart.
```
## Volume
You can guide the volume of a TTS generation with a `volume` tag, which is a number between `0.5`
and `2.0`. The default volume is `1`.
```xml theme={null}
I will speak softly.
```
## Emotion (Beta)
Emotion control is highly experimental, particularly when emotion shifts occur
mid-generation. If you need to change the emotion in a transcript, we recommend
using separate generation contexts for each emotion. For best results, use [Voices
tagged as "Emotive"](https://play.cartesia.ai/voices?tags=Emotive), as emotions may not work reliably with other Voices.
```xml theme={null}
I will not allow you to continue this! I was hoping for a peaceful resolution.
```
## Pauses and breaks
To insert breaks (or pauses) in generated speech, use a `break` tag with one attribute, `time`. For
example, `<break time="1s" />`. You can specify the time in seconds (`s`) or milliseconds (`ms`).
For accounting purposes, these tags are considered 1 character and do not need to be separated from adjacent text by a
space -- to save credits you can remove spaces around break tags.
```xml theme={null}
Hello, my name is Sonic.<break time="1s" />Nice to meet you.
```
## Spelling out numbers and letters
To spell out input text, you can wrap it in `<spell></spell>` tags.
This is particularly useful for pronouncing long numbers or identifiers, such as credit card numbers, phone numbers, or unique IDs.
```xml theme={null}
My name is Bob, spelled <spell>Bob</spell>, my account number is <spell>ABC-123</spell>, my phone number is <spell>(123) 456-7890</spell>, and my credit card is <spell>1234-5678-9012-3456</spell>.
```
If you want to spell out numbers or identifiers and have planned breaks between the generations (e.g. taking a break between the area code of a phone number and the rest of that number), you can combine `<spell>` and `<break>` tags. These tags are considered 1 character and do not need to be separated from adjacent text by a space -- to save credits you can remove spaces around break and spell tags.
```xml theme={null}
My phone number is (123)4712177 and my credit card number is 1234567863474537.
```
# Volume, Speed, and Emotion
Source: https://docs.cartesia.ai/build-with-cartesia/sonic-3/volume-speed-emotion
Sonic-3 provides rich controls for the speed, volume, and emotion of generated speech. These controls are available on play.cartesia.ai using the UI controls, or by passing in a `generation_config` parameter, or by using SSML tags within the transcript itself.
**Sonic-3 interprets these parameters as guidance** instead of as strict adjustments to ensure natural speech, so we recommend testing them against your content to ensure the output matches your expectations.
## Speed and Volume Controls
You can guide the speed and volume of a TTS generation with the `generation_config.speed` and `generation_config.volume` parameters. These values are roughly a multiplier on the default speed and volume; e.g., `1.5` will generate audio at 1.5x the default speed.
The speed of the generation, ranging from `0.6` to `1.5`.
The volume of the generation, ranging from `0.5` to `2.0`.
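For example, a request body carrying this guidance might look like the following sketch. The surrounding TTS fields follow the other examples in these docs; the exact placement of `generation_config` in the request body should be checked against the TTS reference.

```python theme={null}
# Sketch: guiding speed and volume via generation_config.
# Placement of generation_config in the body is assumed; verify against the TTS reference.
tts_request = {
    "model_id": "sonic-3",
    "transcript": "I like to speak quickly, and I can be loud, too!",
    "voice": {"mode": "id", "id": "a0e99841-438c-4a64-b679-ae501e7d6091"},
    "output_format": {"container": "raw", "encoding": "pcm_s16le", "sample_rate": 44100},
    "generation_config": {
        "speed": 1.3,   # 0.6 - 1.5, roughly a multiplier on the default speed
        "volume": 1.5,  # 0.5 - 2.0, roughly a multiplier on the default volume
    },
}
```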
You can also specify these inside the transcript itself, using [SSML](/build-with-cartesia/sonic-3/ssml-tags), for example:
```xml lines theme={null}
I like to speak quickly because it makes me sound smart.
And I can be loud, too!
```
## Emotion Controls (Beta)
By default, the model attempts to interpret the emotional subtext present in the provided transcript. You can guide the emotion of a TTS generation, like a director providing guidance to an actor, using the `generation_config.emotion` parameter.
Emotion tags are good for pushing the model to be more emotive, but they only work when the emotion is consistent with the transcript. For instance, the mismatch below is unlikely to work well:
```xml theme={null}
I'm so excited!
```
The emotional guidance for a generation, one of the emotions below.
The primary emotions, for which we have the most data and produce the best results, are: `neutral`, `angry`, `excited`, `content`, `sad`, and `scared`.
The complete list of available emotions is: `happy`, `excited`, `enthusiastic`, `elated`, `euphoric`, `triumphant`, `amazed`, `surprised`, `flirtatious`, `joking/comedic`, `curious`, `content`, `peaceful`, `serene`, `calm`, `grateful`, `affectionate`, `trust`, `sympathetic`, `anticipation`, `mysterious`, `angry`, `mad`, `outraged`, `frustrated`, `agitated`, `threatened`, `disgusted`, `contempt`, `envious`, `sarcastic`, `ironic`, `sad`, `dejected`, `melancholic`, `disappointed`, `hurt`, `guilty`, `bored`, `tired`, `rejected`, `nostalgic`, `wistful`, `apologetic`, `hesitant`, `insecure`, `confused`, `resigned`, `anxious`, `panicked`, `alarmed`, `scared`, `neutral`, `proud`, `confident`, `distant`, `skeptical`, `contemplative`, `determined`.
The Voices with the best emotional response are:
* [Leo](https://play.cartesia.ai/voices/0834f3df-e650-4766-a20c-5a93a43aa6e3) (id: `0834f3df-e650-4766-a20c-5a93a43aa6e3`)
* [Jace](https://play.cartesia.ai/voices/6776173b-fd72-460d-89b3-d85812ee518d) (id: `6776173b-fd72-460d-89b3-d85812ee518d`)
* [Kyle](https://play.cartesia.ai/voices/c961b81c-a935-4c17-bfb3-ba2239de8c2f) (id: `c961b81c-a935-4c17-bfb3-ba2239de8c2f`)
* [Gavin](https://play.cartesia.ai/voices/f4a3a8e4-694c-4c45-9ca0-27caf97901b5) (id: `f4a3a8e4-694c-4c45-9ca0-27caf97901b5`)
* [Maya](https://play.cartesia.ai/voices/cbaf8084-f009-4838-a096-07ee2e6612b1) (id: `cbaf8084-f009-4838-a096-07ee2e6612b1`)
* [Tessa](https://play.cartesia.ai/voices/6ccbfb76-1fc6-48f7-b71d-91ac6298247b) (id: `6ccbfb76-1fc6-48f7-b71d-91ac6298247b`)
* [Dana](https://play.cartesia.ai/voices/cc00e582-ed66-4004-8336-0175b85c85f6) (id: `cc00e582-ed66-4004-8336-0175b85c85f6`)
* [Marian](https://play.cartesia.ai/voices/26403c37-80c1-4a1a-8692-540551ca2ae5) (id: `26403c37-80c1-4a1a-8692-540551ca2ae5`)
View the full list of emotive Voices on our [Voice Library with voices tagged 'Emotive'](https://play.cartesia.ai/voices?tags=Emotive).
You can also use [SSML](/build-with-cartesia/sonic-3/ssml-tags) tags for emotions, for example:
```xml theme={null}
How dare you speak to me like I'm just a robot!
```
## Nonverbalisms
Insert `[laughter]` in your transcript to make the model laugh. In the future, we plan to add more non-speech verbalisms like sighs and coughs.
# STT Models
Source: https://docs.cartesia.ai/build-with-cartesia/stt-models
Ink is a new family of streaming speech-to-text (STT) models for developers building real-time voice applications.
Each base model name (e.g. `ink-whisper`) points to the latest **stable** snapshot of the model; to use the stable version, we recommend specifying the base model name.
In many cases the stable and preview snapshots are the same, but in some cases the preview snapshot may have additional features or improvements.
## `ink-whisper`
Ink Whisper is the fastest, most affordable speech-to-text model — engineered for enterprise deployment in production-grade voice agents. It delivers higher accuracy than baseline Whisper and is optimized for real-time performance in a wide variety of real-world conditions.
Additional Capabilities:
* Handles variable-length audio chunks and interruptions gracefully using dynamic chunking.
* Reliably transcribes speech with background noise.
* Accurately transcribes audio with telephony artifacts, accents, and disfluencies.
* Excels at transcribing proper nouns and domain-specific terminology.
| Snapshot | Release Date | Languages | Status |
| ------------------------------------ | ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------ |
| `ink-whisper` | June 10, 2025 | `en`, `zh`, `de`, `es`, `ru`, `ko`, `fr`, `ja`, `pt`, `tr`, `pl`, `ca`, `nl`, `ar`, `sv`, `it`, `id`, `hi`, `fi`, `vi`, `he`, `uk`, `el`, `ms`, `cs`, `ro`, `da`, `hu`, `ta`, `no`, `th`, `ur`, `hr`, `bg`, `lt`, `la`, `mi`, `ml`, `cy`, `sk`, `te`, `fa`, `lv`, `bn`, `sr`, `az`, `sl`, `kn`, `et`, `mk`, `br`, `eu`, `is`, `hy`, `ne`, `mn`, `bs`, `kk`, `sq`, `sw`, `gl`, `mr`, `pa`, `si`, `km`, `sn`, `yo`, `so`, `af`, `oc`, `ka`, `be`, `tg`, `sd`, `gu`, `am`, `yi`, `lo`, `uz`, `fo`, `ht`, `ps`, `tk`, `nn`, `mt`, `sa`, `lb`, `my`, `bo`, `tl`, `mg`, `as`, `tt`, `haw`, `ln`, `ha`, `ba`, `jw`, `su`, `yue` | Stable |
| `ink-whisper-2025-06-04` | June 4, 2025 | `en`, `zh`, `de`, `es`, `ru`, `ko`, `fr`, `ja`, `pt`, `tr`, `pl`, `ca`, `nl`, `ar`, `sv`, `it`, `id`, `hi`, `fi`, `vi`, `he`, `uk`, `el`, `ms`, `cs`, `ro`, `da`, `hu`, `ta`, `no`, `th`, `ur`, `hr`, `bg`, `lt`, `la`, `mi`, `ml`, `cy`, `sk`, `te`, `fa`, `lv`, `bn`, `sr`, `az`, `sl`, `kn`, `et`, `mk`, `br`, `eu`, `is`, `hy`, `ne`, `mn`, `bs`, `kk`, `sq`, `sw`, `gl`, `mr`, `pa`, `si`, `km`, `sn`, `yo`, `so`, `af`, `oc`, `ka`, `be`, `tg`, `sd`, `gu`, `am`, `yi`, `lo`, `uz`, `fo`, `ht`, `ps`, `tk`, `nn`, `mt`, `sa`, `lb`, `my`, `bo`, `tl`, `mg`, `as`, `tt`, `haw`, `ln`, `ha`, `ba`, `jw`, `su`, `yue` | Stable |
To learn how to use the Ink STT family, see [the Speech-to-Text API Reference](/api-reference/stt/stt). For a detailed mapping of codes to languages, see the [STT supported languages](/api-reference/stt/stt#request.query.language) list.
## Selecting a Model
When making API calls, you can specify either:
```python lines theme={null}
# Use the base model (automatically routes to the latest snapshot)
model = "ink-whisper"

# Or specify a particular snapshot for consistency
model = "ink-whisper-2025-06-04"
```
### Continuous updates
All models have a base model name (e.g. `ink-whisper`).
We recommend using these for prototyping and development, then switching to a date-versioned model for production use cases to ensure stability.
## Future Updates
New snapshots are released periodically with improvements in performance, additional language support, and new capabilities. Check back regularly for updates.
# API Changes
Source: https://docs.cartesia.ai/build-with-cartesia/tts-models/api-changes
Starting June 1, 2026, we are discontinuing several models, snapshots, and languages, and removing voice embeddings from our voice API. Migrate to `sonic-3` for improved naturalness, 42-language support, and fine-grained controls.
## Deprecated models and languages
You can check if you're making requests to deprecated models on [play.cartesia.ai/deprecation/traffic](https://play.cartesia.ai/deprecation/traffic).
### Fully deprecated models
These models will stop serving requests on June 1, 2026.
| Model | Snapshots affected | Deprecated languages |
| -------------------- | ------------------------ | -------------------- |
| `sonic` | All | All |
| `sonic-english` | — | All |
| `sonic-multilingual` | — | All |
| `sonic-2` | `sonic-2-2025-03-07` | All |
| `sonic-turbo` | `sonic-turbo-2025-03-07` | All |
### Partially deprecated models
These models will continue to serve a reduced set of languages. The languages listed below will be discontinued on June 1, 2026.
| Model | Snapshots affected | Deprecated languages |
| ------------- | ---------------------------------------------------------------- | -------------------------- |
| `sonic-2` | `sonic-2-2025-04-16`, `sonic-2-2025-05-08`, `sonic-2-2025-06-11` | it, nl, pl, ru, sv, tr, hi |
| `sonic-turbo` | `sonic-turbo-2025-06-04` | it, nl, pl, ru, sv, tr |
## Stable offerings
The following will remain available beyond June 1.
| Model | Snapshots | Supported Languages |
| ------------- | ---------------------------------------------------------------- | ----------------------------------------------------------------------------------- |
| `sonic-3` | All | 42 languages — [full list](/build-with-cartesia/tts-models/latest#language-support) |
| `sonic-2` | `sonic-2-2025-04-16`, `sonic-2-2025-05-08`, `sonic-2-2025-06-11` | en, de, es, fr, ja, ko, pt, zh |
| `sonic-turbo` | `sonic-turbo-2025-06-04` | en, de, es, fr, ja, ko, pt, zh, hi |
## API changes
These endpoints will be discontinued on June 1, 2026.
| Discontinued Endpoint | Replacement |
| ------------------------------------------ | ------------------------------------------ |
| Voice Embedding: `POST /voices/clone/clip` | [Clone Voice](/api-reference/voices/clone) |
| Mix Voices: `POST /voices/mix` | — |
| Create Voice: `POST /voices` | [Clone Voice](/api-reference/voices/clone) |
These endpoints will stop accepting voice embeddings on June 1, 2026.
| Endpoint with a breaking change | Replacement |
| ------------------------------------- | ------------------------------------------------------ |
| TTS (bytes): `POST /tts/bytes` | [Voice IDs](/build-with-cartesia/tts-models/voice-ids) |
| TTS (SSE): `POST /tts/sse` | [Voice IDs](/build-with-cartesia/tts-models/voice-ids) |
| TTS (WebSocket): `WSS /tts/websocket` | [Voice IDs](/build-with-cartesia/tts-models/voice-ids) |
You can test these API changes by setting your [Cartesia Version](/use-the-api/api-conventions#always-send-a-cartesia-version-header) to `2026-03-01`. We recommend upgrading your Cartesia Version on production traffic before June 1 to make sure nothing breaks.
### Moving off of deprecated endpoints
1. Change how you create voices — see [Migrating Voices](/build-with-cartesia/tts-models/migrating-voices).
2. Switch from voice embeddings to IDs — see [Voice IDs](/build-with-cartesia/tts-models/voice-ids).
## Full Checklist
1. Move off of [deprecated models / snapshots / languages](/build-with-cartesia/tts-models/api-changes#deprecated-models-and-languages) onto `sonic-3` or another stable model
2. Move off of [deprecated endpoints](/build-with-cartesia/tts-models/api-changes#api-changes) when creating voices
3. Use [Voice IDs](/build-with-cartesia/tts-models/voice-ids)
4. Check your deprecated model traffic on [play.cartesia.ai/deprecation/traffic](https://play.cartesia.ai/deprecation/traffic)
5. Make sure your voices are migrated on [play.cartesia.ai/deprecation/voices](https://play.cartesia.ai/deprecation/voices)
6. (Optional) Upgrade your [Cartesia Version](/use-the-api/api-conventions#always-send-a-cartesia-version-header) to `2026-03-01`
## Why are we doing this?
Since the launch of Sonic 3, we've made improvements across pacing, prosody, and naturalness; the vast majority of our customers have migrated to these models with great success. In order to increase our capacity, availability, and serving performance, we have to discontinue our oldest models.
Additionally, our newer models have evolved beyond voice embeddings in order to sound more natural. The parts of our API that accept voice embeddings cannot be made forward-compatible. Migrating to voice IDs will allow us to continue to improve both our models and your voices in tandem.
If you have questions, reach out to [support@cartesia.ai](mailto:support@cartesia.ai).
# Migrating Voices
Source: https://docs.cartesia.ai/build-with-cartesia/tts-models/migrating-voices
On June 1, 2026, we are discontinuing our voice embedding (aka stability) TTS models.
Voices listed on [play.cartesia.ai/deprecation/voices](https://play.cartesia.ai/deprecation/voices) will stop working. Simply click "Auto Migrate" to make these voices compatible with the latest Sonic 3, 2, and Turbo snapshots.
If you use voice embeddings rather than voice IDs, see [Voice IDs](/build-with-cartesia/tts-models/voice-ids).
For an overview of all changes, see [API Changes](/build-with-cartesia/tts-models/api-changes).
## Where do these voices come from?
Voices created by these endpoints rely on our voice embedding models:
* [POST /voices](/2024-06-10/api-reference/voices/create)
* [POST /voices/mix](/2024-06-10/api-reference/voices/mix)
* `POST /voices/clone/clip`
## Creating voices
You can move to our [Clone Voice API](/api-reference/voices/clone) or use our [web UI](https://play.cartesia.ai/voices/create/clone) to create voices from 3–10 seconds of source audio.
You can test these API changes by setting your [Cartesia Version](/use-the-api/api-conventions#always-send-a-cartesia-version-header) to `2026-03-01`. We recommend upgrading your Cartesia Version on production traffic before June 1 to make sure nothing breaks.
Here is an example using the Cartesia SDK:
```python theme={null}
from cartesia import Cartesia

your_api_key: str = ""
client = Cartesia(api_key=your_api_key)

print("Cloning a voice")
with open("3 to 10 seconds of source audio.wav", mode="rb") as f:
    voice = client.voices.clone(
        clip=f,
        # this must match the source audio
        language="en",
        name="My Voice",
        mode="similarity",
    )
print(f"Cloned voice {voice.id}")

print("Generating audio...")
generated_audio = client.tts.bytes(
    # voice embeddings will not work after June 1, 2026!
    voice={"mode": "id", "id": voice.id},
    model_id="sonic-3",
    transcript="Hello from Cartesia!",
    language="en",
    output_format={
        "container": "wav",
        "encoding": "pcm_f32le",
        "sample_rate": 44100
    },
)

# client.tts.bytes returns an iterator of audio chunks; write them to a file
with open("hello-from-cartesia.wav", "wb") as f:
    for chunk in generated_audio:
        f.write(chunk)
```
# Older TTS Models
Source: https://docs.cartesia.ai/build-with-cartesia/tts-models/older-models
We recommend using [Sonic 3](/build-with-cartesia/tts-models/latest) for best
results, most languages, and controllability. We continue to serve these older
models for compatibility.
Some models and snapshots are being discontinued on June 1, 2026 — see [API Changes](/build-with-cartesia/tts-models/api-changes) for details.
In the tables below, a **Stable** status marks the latest **stable** snapshots of a model, and **EOL June 1, 2026** marks snapshots and languages to be discontinued on June 1, 2026.
All models have a base model name (e.g. `sonic-2`, `sonic-turbo`) and date-versioned model names
(e.g. `sonic-2-2025-06-11`).
We recommend using base model names for prototyping and development, then switching to a date-versioned model for production use cases to ensure stability.
When making API calls, you can specify either:
```python lines theme={null}
# Use the base model
# (automatically routes to the latest stable snapshot)
model_id = "sonic-3"
# Or specify a particular snapshot for consistency
model_id = "sonic-3-2026-01-12"
```
## `sonic-2`
Sonic-2 provides ultra-realistic speech with accurate transcript following, minimal hallucinations, and excellent voice cloning. It is latency-optimized and achieves 90ms model latency.
Additional Capabilities:
* Higher fidelity voice cloning
* Timestamps for all 15 languages
* [Infill](/2024-11-13/api-reference/infill/bytes) support
| Snapshot | Release Date | Languages | Status |
| ------------------------------------------- | -------------- | ---------------------------------------------------------- | ---------------- |
| `sonic-2-2025-06-11` | June 11, 2025 | en, fr, de, es, pt, zh, ja, ko | Stable |
| `sonic-2-2025-06-11` | June 11, 2025 | hi, it, nl, pl, ru, sv, tr | EOL June 1, 2026 |
| `sonic-2-2025-05-08` | May 8, 2025 | en, fr, de, es, pt, zh, ja, ko | Stable |
| `sonic-2-2025-05-08` | May 8, 2025 | hi, it, nl, pl, ru, sv, tr | EOL June 1, 2026 |
| `sonic-2-2025-04-16` | April 16, 2025 | en, fr, de, es, pt, zh, ja, ko | Stable |
| `sonic-2-2025-04-16` | April 16, 2025 | hi, it, nl, pl, ru, sv, tr | EOL June 1, 2026 |
| `sonic-2-2025-03-07` | March 7, 2025 | en, fr, de, es, pt, zh, ja, hi, it, ko, nl, pl, ru, sv, tr | EOL June 1, 2026 |
Read these pages to learn more about how to use Sonic-2:
* [Best practices](/build-with-cartesia/formatting-text-for-sonic-2/best-practices)
* [Inserting breaks](/build-with-cartesia/formatting-text-for-sonic-2/inserting-breaks-pauses)
* [Spelling text](/build-with-cartesia/formatting-text-for-sonic-2/spelling-out-input-text)
## `sonic-turbo`
All the power of Sonic, with half the latency (as low as 40ms).
| Snapshot | Release Date | Languages | Status |
| ----------------------------------------------- | ------------- | ---------------------------------------------------------- | ---------------- |
| `sonic-turbo-2025-06-04` | June 4, 2025 | en, fr, de, es, pt, zh, ja, hi, ko | Stable |
| `sonic-turbo-2025-06-04` | June 4, 2025 | it, nl, pl, ru, sv, tr | EOL June 1, 2026 |
| `sonic-turbo-2025-03-07` | March 7, 2025 | en, fr, de, es, pt, zh, ja, hi, it, ko, nl, pl, ru, sv, tr | EOL June 1, 2026 |
## `sonic`
The first version of our flagship text-to-speech model. It produces high-accuracy, expressive speech, and is optimized for efficiency to achieve low latency.
| Snapshot | Release Date | Languages | Status |
| ----------------------------------------- | ----------------- | ---------------------------------------------------------- | ---------------- |
| `sonic-2024-12-12` | December 12, 2024 | en, fr, de, es, pt, zh, ja, hi, it, ko, nl, pl, ru, sv, tr | EOL June 1, 2026 |
| `sonic-2024-10-19` | October 19, 2024 | en, fr, de, es, pt, zh, ja, hi, it, ko, nl, pl, ru, sv, tr | EOL June 1, 2026 |
## Deprecated and Preview Model Aliases
The following model aliases are now deprecated. Please use the recommended model names instead:
| Deprecated Alias | Use Instead |
| ------------------------------------------- | ----------------------------------------- |
| `sonic-3-preview` | `sonic-3` |
| `sonic-preview` | `sonic-2` |
| `sonic-english` | `sonic-2024-10-19` |
| `sonic-multilingual` | `sonic-2024-10-19` |
# Sonic 3
Source: https://docs.cartesia.ai/build-with-cartesia/tts-models/sonic-3
`sonic-3` is our streaming TTS model, with high naturalness, accurate transcript following, and industry-leading latency. It provides fine-grained control on volume, speed, and emotion.
Key Features:
* **42 languages** supported
* **Volume, speed, and emotion** controls, supported through API parameters and SSML tags
* **Laughter** through `[laughter]` tags
For more information, see [Volume, Speed, and Emotion](/build-with-cartesia/sonic-3/volume-speed-emotion).
### Voice selection
Choosing voices that work best for your use case is key to getting the best performance out of Sonic 3.
* **For voice agents**: We've found stable, realistic voices work better for voice agents than studio, emotive voices. Example American English voices include Katie (ID: `f786b574-daa5-4673-aa0c-cbe3e8534c02`) and Kiefer (ID: `228fca29-3a0a-435c-8728-5cb483251068`).
* **For expressive characters**: We've tagged our most expressive and emotive voices with the `Emotive` tag. Example American English voices include Tessa (ID: `6ccbfb76-1fc6-48f7-b71d-91ac6298247b`) and Kyle (ID: `c961b81c-a935-4c17-bfb3-ba2239de8c2f`).
For more information and recommendations, see [Choosing a Voice](/build-with-cartesia/capability-guides/choosing-a-voice).
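As a quick, illustrative sketch of how these recommendations fit together, the request below uses Katie's voice ID from the list above and includes a `[laughter]` tag in the transcript (the transcript and output file name are placeholders):

```python theme={null}
import os

from cartesia import Cartesia

client = Cartesia(api_key=os.getenv("CARTESIA_API_KEY"))

# Katie: one of the stable voices suggested above for voice agents.
audio_chunks = client.tts.bytes(
    model_id="sonic-3",
    voice={"mode": "id", "id": "f786b574-daa5-4673-aa0c-cbe3e8534c02"},
    transcript="That's a great question. [laughter] Let me check for you.",
    language="en",
    output_format={
        "container": "wav",
        "encoding": "pcm_f32le",
        "sample_rate": 44100,
    },
)

with open("sonic-3-sample.wav", "wb") as f:
    for chunk in audio_chunks:
        f.write(chunk)
```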
### Language support
Sonic-3 supports the following languages:
English (`en`), French (`fr`), German (`de`), Spanish (`es`), Portuguese (`pt`), Chinese (`zh`), Japanese (`ja`), Hindi (`hi`), Italian (`it`), Korean (`ko`), Dutch (`nl`), Polish (`pl`), Russian (`ru`), Swedish (`sv`), Turkish (`tr`), Tagalog (`tl`), Bulgarian (`bg`), Romanian (`ro`), Arabic (`ar`), Czech (`cs`), Greek (`el`), Finnish (`fi`), Croatian (`hr`), Malay (`ms`), Slovak (`sk`), Danish (`da`), Tamil (`ta`), Ukrainian (`uk`), Hungarian (`hu`), Norwegian (`no`), Vietnamese (`vi`), Bengali (`bn`), Thai (`th`), Hebrew (`he`), Georgian (`ka`), Indonesian (`id`), Telugu (`te`), Gujarati (`gu`), Kannada (`kn`), Malayalam (`ml`), Marathi (`mr`), and Punjabi (`pa`).
## Selecting a Model
| Snapshot | Release Date | Languages | Status |
| ------------------------------------------- | ---------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------ |
| `sonic-3-2026-01-12` | January 12, 2026 | en, de, es, fr, ja, pt, zh, hi, ko, it, nl, pl, ru, sv, tr, tl, bg, ro, ar, cs, el, fi, hr, ms, sk, da, ta, uk, hu, no, vi, bn, th, he, ka, id, te, gu, kn, ml, mr, pa | Stable |
| `sonic-3-2025-10-27` | October 27, 2025 | en, de, es, fr, ja, pt, zh, hi, ko, it, nl, pl, ru, sv, tr, tl, bg, ro, ar, cs, el, fi, hr, ms, sk, da, ta, uk, hu, no, vi, bn, th, he, ka, id, te, gu, kn, ml, mr, pa | Stable |
A **Stable** status indicates the latest **stable** snapshots of the model.
When making API calls, you can specify either:
```python lines theme={null}
# Use the base model
# (automatically routes to the latest stable snapshot)
model_id = "sonic-3"
# Or specify a particular snapshot for consistency
model_id = "sonic-3-2026-01-12"
# Try the latest (beta) model (can be 'hot swapped')
model_id = "sonic-3-latest"
```
### Continuous updates and model snapshots
All models have a base model name (e.g. `sonic-3`) and a dated snapshot (e.g. `sonic-3-2025-10-27`). Using the base model will automatically keep you up to date with the most recent stable snapshot of that model. If pinning a specific version is important for your use case, we recommend using the dated version.
For testing our latest capabilities, we recommend using `sonic-3-latest`, which is a non-snapshotted version. `sonic-3-latest` can be updated with no notice and is not recommended for production.
To summarize:
| **Model ID** | Model update behavior | Recommended for |
| -------------------- | :---------------------------------------------------------- | ------------------------------------------------------------------------------------------ |
| `sonic-3-YYYY-MM-DD` | Snapshotted, will never change | Customers who want to run internal evals before any updates |
| `sonic-3` | Will be updated to point to the most recent stable snapshot | Customers who want stable releases, but want to be up-to-date with the recent capabilities |
| `sonic-3-latest` | Will always be updated to our latest beta releases | Testing purposes |
## Older Models
For information on `sonic-2`, `sonic-turbo`, `sonic-multilingual`, and `sonic`, see our page on [Older Models](/build-with-cartesia/tts-models/older-models).
# Voice IDs
Source: https://docs.cartesia.ai/build-with-cartesia/tts-models/voice-ids
On June 1, 2026, we are discontinuing our voice embedding (aka stability) TTS models.
If you are currently making generation requests with voice embeddings like this:
```json theme={null}
{
"voice": {
"mode": "embedding",
"embedding": [1, 2, ..., 3, 4]
},
"model_id": "sonic-2",
// ...
}
```
You will need to switch to using voice IDs like this:
```json theme={null}
{
"voice": {
"mode": "id",
"id": "e07c00bc-4134-4eae-9ea4-1a55fb45746b"
},
"model_id": "sonic-2",
// ...
}
```
If you already use voice IDs, see [Migrating Voices](/build-with-cartesia/tts-models/migrating-voices) to make sure your voices will continue to work after the change.
For an overview of all changes, see [API Changes](/build-with-cartesia/tts-models/api-changes).
## Get a voice ID
Choose one of the following options.
### Check out the voice library
Our featured voices have all gone through rigorous evaluations and are ready to use in production.
Check them out at [play.cartesia.ai/voices](https://play.cartesia.ai/voices) and copy the ID of any voice you'd like to use.
### Clone a voice
If you have source audio, create a cloned voice via the [playground](https://play.cartesia.ai/voices/create/clone) or the [API](/api-reference/voices/clone). Cloning returns a voice ID you can use immediately.
### Generate source audio from your existing embedding
If you no longer have the original audio clip used to create your embedding, generate a short sample with `sonic` or `sonic-2` and then clone a new voice.
You can do this on our playground:
1. [play.cartesia.ai/text-to-speech](https://play.cartesia.ai/text-to-speech)
2. [play.cartesia.ai/voices/create/clone](https://play.cartesia.ai/voices/create/clone)
Or with our API:
1. [Text to Speech (Bytes)](/2024-11-13/api-reference/tts/bytes)
2. [Clone Voice](/api-reference/voices/clone)
Here is an example using our SDK:
```python theme={null}
from cartesia import Cartesia
# inputs
your_api_key: str = ""
your_voice_embedding: list[float] = []
language = "en"
transcript = """
It's nice to meet you.
Hope you're having a great day!
Could we reschedule our meeting tomorrow?
Please call me back as soon as possible.
"""
source_tts_model_id = "sonic"
client = Cartesia(api_key=your_api_key)
# Step 1: generate an audio sample
print(f"Generating audio sample {source_tts_model_id=}")
source_audio_iterator = client.tts.bytes(
voice={"mode": "embedding", "embedding": your_voice_embedding},
model_id=source_tts_model_id,
transcript=transcript,
language=language,
output_format={
"container": "wav",
"encoding": "pcm_f32le",
"sample_rate": 44100
},
)
# Step 2: clone a voice
print("Cloning a voice")
voice = client.voices.clone(
name="My Voice",
language=language,
clip=b"".join(source_audio_iterator),
mode="similarity",
)
print(f"Cloned voice {voice.id}")
# you can now use the voice like this
migrate_to_model = "sonic-3"
generated_sample_file_name = f"{migrate_to_model}_{voice.id}.wav"
cloned_audio_iterator = client.tts.bytes(
voice={"mode": "id", "id": voice.id},
model_id=migrate_to_model,
transcript=transcript,
language=language,
output_format={
"container": "wav",
"encoding": "pcm_f32le",
"sample_rate": 44100
},
)
with open(generated_sample_file_name, "wb") as f:
for chunk in cloned_audio_iterator:
f.write(chunk)
print(f"Listen to your new voice: {generated_sample_file_name}")
try:
import subprocess
subprocess.run(
[
"ffplay",
"-loglevel",
"quiet",
"-autoexit",
"-nodisp",
generated_sample_file_name,
]
)
except FileNotFoundError:
pass
```
## Using Voice IDs
See [TTS (Bytes)](/api-reference/tts/bytes), [TTS (SSE)](/api-reference/tts/sse), and [TTS (WebSocket)](/api-reference/tts/websocket) for full API documentation.
You can test these API changes by setting your [Cartesia Version](/use-the-api/api-conventions#always-send-a-cartesia-version-header) to `2026-03-01`. We recommend upgrading your Cartesia Version on production traffic before June 1 to make sure nothing breaks.
# Set up an organization
Source: https://docs.cartesia.ai/enterprise/set-up-an-organization
Organization workspaces enable seamless collaboration between multiple team members. All users in an organization share the same view of resources, including voices, API keys, and datasets. The only exceptions are playground generation history and starred voices, which remain private to each individual user.
By default, your Cartesia account initializes as an organization workspace on the Free subscription plan with a limit of one member.
To invite team members, you must first upgrade your organization to the
Startup tier or higher. After upgrading, you can invite unlimited users at no
additional cost.
## Manage your organization
Organizations must be upgraded to the Startup tier or above before team members can be invited. Each workspace has its own billing and credit limits, so make sure you are on the intended organization before proceeding to upgrade your subscription.
Once you've upgraded your organization, you can use the "Manage" button in the workspace switcher to manage it. This pops up a modal where you can change your profile and invite your team.
There are two membership types in an organization:
1. Admin: has the ability to manage the organization profile, invitations, and members.
2. Member: can use all functionality included in the subscription, but cannot alter organization settings.
You can invite unlimited team members in an organization once it is on Startup tier or higher.
Once your organization is upgraded, voices, Line agents, API keys, and other resources will be available to all users in the organization.
## Create additional organizations
If you want separate workspaces on different subscriptions, you can create another organization by going to the playground at [https://play.cartesia.ai](https://play.cartesia.ai), selecting the workspace switcher, and clicking **Create organization**.
This will bring up a dialog where you can name your organization and upload a logo.
Please reach out to us at [support@cartesia.ai](mailto:support@cartesia.ai) if you run into any trouble with your organization.
# Set up SSO
Source: https://docs.cartesia.ai/enterprise/set-up-sso
We support Single-Sign On (SSO) for customers on the Enterprise plan via SAML. This integration is processed through our identity provider, [Clerk](https://clerk.com).
## Set up SSO with Okta
1. Send us your SSO domain.
2. We will send you a service provider configuration, which consists of a single-sign on URL and an audience URI (SP entity ID).
3. Follow steps 2, 3, 4, and 5 in [the Clerk SSO guide](https://clerk.com/docs/authentication/enterprise-connections/saml/okta), and send us the metadata URL you get from step 6.1.
After you are done, we will complete the remaining SSO setup and send you a confirmation that SSO is enabled for your organization.
# Authenticate your client applications
Source: https://docs.cartesia.ai/get-started/authenticate-your-client-applications
Secure client access to Cartesia APIs using Access Tokens
You may want to make Cartesia API requests directly from your client application, eg, a web app. However, shipping your API key to the app is not secure, as a malicious user could extract your API key and issue API requests billed to your account.
Access Tokens provide a secure way to authenticate client-side requests to Cartesia's APIs without
exposing your API key.
Access Tokens are used in contexts like web apps which should not be bundled with an API key. For
trusted contexts like server applications, local scripts, or iPython notebooks, you should simply
use API keys.
## Prerequisites
Before implementing Access Tokens:
1. Configure your server with a Cartesia API key
2. Implement user authentication in your application
3. Establish secure client-server communication
### Available Grants
Access Tokens support granular permissions through grants. Both TTS and STT grants are optional:
**TTS Grant**: With `grants: { tts: true }`, clients have access to:
* `/tts/bytes` - Synchronous TTS generation streamed with chunked encoding
* `/tts/sse` - Server-sent events for streaming
* `/tts/websocket` - WebSocket-based streaming
**STT Grant**: With `grants: { stt: true }`, clients have access to:
* `/stt/websocket` - WebSocket-based speech-to-text streaming
* `/stt` - Batch speech-to-text processing
* `/audio/transcriptions` - OpenAI-compatible transcription endpoint
**Agents Grant**: With `grants: { agent: true }`, clients have access to:
* the Agents websocket calling endpoint
You can request multiple grants in a single token:
```json theme={null}
{ "grants": { "tts": true, "stt": true, "agent": false } }
```
## Implementation Guide
### 1. Token Generation (Server-side)
Make a request to generate a new access token:
```bash cURL lines theme={null}
# TTS and STT access
curl --location 'https://api.cartesia.ai/access-token' \
-H 'Cartesia-Version: 2025-04-16' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk_car_...' \
-d '{ "grants": {"tts": true, "stt": true}, "expires_in": 60}'
# TTS-only access
curl --location 'https://api.cartesia.ai/access-token' \
-H 'Cartesia-Version: 2025-04-16' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk_car_...' \
-d '{ "grants": {"tts": true}, "expires_in": 60}'
```
```javascript JavaScript lines theme={null}
import { CartesiaClient } from "@cartesia/cartesia-js";
const client = new CartesiaClient({ apiKey: "YOUR_API_KEY" });
// TTS and STT access
await client.auth.accessToken({
grants: {
tts: true,
stt: true
},
expires_in: 60
});
// TTS-only access
await client.auth.accessToken({
grants: {
tts: true
},
expires_in: 60
});
```
```python Python lines theme={null}
from cartesia import Cartesia
client = Cartesia(
token="YOUR_API_KEY"
)
# TTS and STT access
response = client.auth.access_token(
grants={"tts": True, "stt": True}, # Grant both permissions
expires_in=60 # Token expires in 60 seconds
)
# TTS-only access
response = client.auth.access_token(
grants={"tts": True}, # Grant TTS permissions only
expires_in=60 # Token expires in 60 seconds
)
# The response will contain the access token
print(f"Access Token: {response.token}")
```
#### Example Implementation
```typescript lines theme={null}
// TTS and STT access
const response = await fetch("https://api.cartesia.ai/access-token", {
method: "POST",
headers: {
"Content-Type": "application/json",
"Cartesia-Version": "2025-04-16",
Authorization: "Bearer ",
},
body: JSON.stringify({
grants: { tts: true, stt: true },
expires_in: 60, // 1 minute
}),
});
// TTS-only access
const responseTTS = await fetch("https://api.cartesia.ai/access-token", {
method: "POST",
headers: {
"Content-Type": "application/json",
"Cartesia-Version": "2025-04-16",
Authorization: "Bearer ",
},
body: JSON.stringify({
grants: { tts: true },
expires_in: 60, // 1 minute
}),
});
const { token } = await response.json();
```
For detailed API specifications, see the [Token API Reference](/api-reference/auth/access-token).
### 2. Token Storage (Client-side)
Store the token securely, such as by setting an HTTP-only cookie with a matching token expiration. The cookie should be `httpOnly`, `secure`, and `sameSite: "strict"`.
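One possible pattern, sketched here with Flask (an arbitrary framework choice; the route and cookie names are illustrative): your backend mints a short-lived token and hands it to the browser as an HTTP-only cookie whose lifetime matches the token's expiration, so client-side JavaScript never touches the raw token.

```python theme={null}
import os

from cartesia import Cartesia
from flask import Flask, jsonify, make_response

app = Flask(__name__)
client = Cartesia(api_key=os.environ["CARTESIA_API_KEY"])

TOKEN_TTL_SECONDS = 60  # keep token lifetimes short


@app.post("/cartesia-token")
def issue_token():
    # Generate the access token server-side; the API key never leaves the server.
    token = client.auth.access_token(grants={"tts": True}, expires_in=TOKEN_TTL_SECONDS)

    resp = make_response(jsonify({"expires_in": TOKEN_TTL_SECONDS}))
    # httpOnly + secure + SameSite=Strict, expiring with the token itself.
    resp.set_cookie(
        "cartesia_access_token",
        token.token,
        max_age=TOKEN_TTL_SECONDS,
        httponly=True,
        secure=True,
        samesite="Strict",
    )
    return resp
```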
### 3. Making Authenticated Requests
```typescript lines theme={null}
// Using TTS with access token
const ttsResponse = await fetch("https://api.cartesia.ai/tts/bytes", {
headers: {
Authorization: `Bearer ${accessToken}`,
"Content-Type": "application/json",
},
// ... request configuration
});
// Using STT with access token
const sttResponse = await fetch("https://api.cartesia.ai/stt", {
method: "POST",
headers: {
Authorization: `Bearer ${accessToken}`,
},
body: formData, // multipart/form-data with audio file
});
```
### 4. Token Refresh Strategy
Proactively refresh tokens in your app before they expire.
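As one illustration of that pattern (a sketch for a non-browser Python client; the backend URL and response shape are assumptions, and in a browser app the same logic would live in your frontend code), cache the token and request a fresh one from your backend shortly before it expires:

```python theme={null}
import time

import requests

# Hypothetical backend route that mints Cartesia Access Tokens and returns
# {"token": "...", "expires_in": 60}; adjust to match your own server.
TOKEN_ENDPOINT = "https://your-backend.example.com/cartesia-token"
REFRESH_MARGIN_SECONDS = 10  # refresh well before expiry

_cached_token: str | None = None
_expires_at: float = 0.0


def get_access_token() -> str:
    """Return a valid access token, refreshing it shortly before it expires."""
    global _cached_token, _expires_at
    now = time.monotonic()
    if _cached_token is None or now >= _expires_at - REFRESH_MARGIN_SECONDS:
        resp = requests.post(TOKEN_ENDPOINT, timeout=5)
        resp.raise_for_status()
        payload = resp.json()
        _cached_token = payload["token"]
        _expires_at = now + payload["expires_in"]
    return _cached_token
```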
## Security Best Practices
### Essential Guidelines
* ✅ Generate tokens server-side only
* ✅ Use short token lifetimes (minutes)
* ✅ Implement automatic token refresh
* ✅ Store tokens in HTTP-only cookies
* ✅ Enable secure and SameSite cookie flags
### Security Don'ts
* ❌ Never store tokens in localStorage/sessionStorage
* ❌ Never log tokens or display them in the UI
* ❌ Never transmit tokens over non-HTTPS connections
### Token Lifecycle Management
1. Generate new token upon user authentication
2. Implement automatic refresh before expiration
3. Handle token expiration gracefully
## Additional Resources
* [API Reference](/api-reference/auth/access-token) - Access Token generation endpoint documentation
# Welcome to Cartesia
Source: https://docs.cartesia.ai/get-started/overview
Our API enables developers to build real-time, multimodal AI experiences that feel natural and responsive.
The Cartesia API is the fastest, most emotive, ultra-realistic voice AI platform. Purpose-built for developers, it serves state-of-the-art models for both text-to-speech and speech-to-text, enabling seamless conversational AI experiences.
## Sonic Models for Text-to-Speech
Sonic models take text input and stream back ultra-realistic speech in response. They can also clone voices, with full control over pronunciation and accent.
**Sonic 3 is the world's fastest, most emotive, ultra-realistic text-to-speech model.** It can stream out the first byte of audio in just 90ms, making it perfect for real-time and conversational experiences as well as dubbing, narration, AI avatars, and more. (To put things into perspective, 90ms is about twice as fast as the blink of an eye.)
**If real-time performance is your top priority,** Sonic Turbo offers even better performance, streaming out the first byte of audio in just 40ms.
Learn more about available Sonic model variants and their capabilities in the [TTS Models](../build-with-cartesia/tts-models/latest) section.
## Ink Models for Speech-to-Text
Ink models provide streaming speech-to-text transcription optimized for real-time voice applications.
**Ink-Whisper**, our debut model, is specifically engineered for conversational AI—handling telephony artifacts, background noise, accents, and proper nouns that typically challenge standard STT systems.
Ink-Whisper uses advanced dynamic chunking to process variable-length audio segments, reducing errors and hallucinations during pauses or audio gaps. At just \$0.13/hour, it's the most affordable streaming STT model available.
Learn more about the Ink model and its capabilities in the [STT Models](../build-with-cartesia/stt-models) section.
## Support
Join our Discord server to chat with the Cartesia team, engage with the community, and get help with your projects.
Email us at [support@cartesia.ai](mailto:support@cartesia.ai) to get help with integrating Cartesia, your account, or billing.
# Realtime Text to Speech Quickstart
Source: https://docs.cartesia.ai/get-started/realtime-text-to-speech-quickstart
Stream text to Cartesia over a WebSocket and receive audio in real time.
Using the Cartesia WebSocket API allows you to simultaneously stream text input and audio output: send text in chunks and receive audio chunks back in real time. This is ideal for realtime use cases such as voice agents, where text is generated incrementally, for example from an LLM.
## Prerequisites
* A Cartesia API key. [Create one here](https://play.cartesia.ai/keys), then add it to your `.bashrc` or `.zshrc`:
```sh theme={null}
export CARTESIA_API_KEY=
```
* `ffplay` (part of FFmpeg), used to play audio output:
```sh theme={null}
brew install ffmpeg
```
```sh theme={null}
sudo apt install ffmpeg
```
## Stream text and play audio
```sh theme={null}
pip install 'cartesia[websockets]'
```
```python realtime-tts.py theme={null}
from cartesia import Cartesia
import subprocess
import os
client = Cartesia(api_key=os.getenv("CARTESIA_API_KEY"))
print("Starting ffplay to play streaming audio output...")
player = subprocess.Popen(
["ffplay", "-f", "f32le", "-ar", "44100", "-probesize", "32", "-analyzeduration", "0", "-nodisp", "-autoexit", "-loglevel", "quiet", "-"],
stdin=subprocess.PIPE,
bufsize=0,
)
print("Connecting to Cartesia via websockets...")
with client.tts.websocket_connect() as connection:
ctx = connection.context(
model_id="sonic-3",
voice={"mode": "id", "id": "f786b574-daa5-4673-aa0c-cbe3e8534c02"},
output_format={
"container": "raw",
"encoding": "pcm_f32le",
"sample_rate": 44100,
},
)
print("Sending chunked text input...")
for part in ["Hi there! ", "Welcome to ", "Cartesia Sonic."]:
ctx.push(part)
ctx.no_more_inputs()
for response in ctx.receive():
if response.type == "chunk" and response.audio:
print(f"Received audio chunk ({len(response.audio)} bytes)")
# Here we pipe audio to ffplay. In a production app you might play audio in
# a client, or forward it to another service, eg, a telephony provider.
player.stdin.write(response.audio)
elif response.type == "done":
break
player.stdin.close()
player.wait()
```
```sh theme={null}
python3 realtime-tts.py
```
This will stream text inputs to Cartesia, and play the streaming audio output using `ffplay`. (Make sure your device volume is turned on!)
```sh theme={null}
npm install @cartesia/cartesia-js ws
```
In the browser, you don't need the `ws` package — the native WebSocket API is used instead. However, you will need to use ephemeral access tokens for authentication. See [Authenticate Your Client Applications](/get-started/authenticate-your-client-applications).
Create a file named `realtime-tts.js` with the following code:
```js realtime-tts.js theme={null}
import Cartesia from "@cartesia/cartesia-js";
import { spawn } from "child_process";
const client = new Cartesia({ apiKey: process.env["CARTESIA_API_KEY"] });
console.log("Starting ffplay to play streaming audio output...");
const player = spawn("ffplay", ["-f", "f32le", "-ar", "44100", "-probesize", "32", "-analyzeduration", "0", "-nodisp", "-autoexit", "-loglevel", "quiet", "-"], {
stdio: ["pipe", "ignore", "ignore"],
});
console.log("Connecting to Cartesia via websockets...");
const ws = await client.tts.websocket();
const ctx = ws.context({
model_id: "sonic-3",
voice: { mode: "id", id: "f786b574-daa5-4673-aa0c-cbe3e8534c02" },
output_format: { container: "raw", encoding: "pcm_f32le", sample_rate: 44100 },
});
console.log("Sending chunked text input...");
const transcriptChunks = ["Hi there! ", "Welcome to ", "Cartesia Sonic."]
for (const part of transcriptChunks) {
await ctx.push({ transcript: part });
}
await ctx.no_more_inputs();
for await (const event of ctx.receive()) {
if (event.type === "chunk" && event.audio) {
console.log("Received audio chunk (%d bytes)", event.audio.length);
// Here we pipe audio to ffplay. In a production app you might play audio in
// a client, or forward it to another service, eg, a telephony provider.
player.stdin.write(event.audio);
} else if (event.type === "done") {
break;
}
}
player.stdin.end();
ws.close();
```
```sh theme={null}
node realtime-tts.js
```
This will stream text inputs to Cartesia, and play the streaming audio output using `ffplay`. (Make sure your device volume is turned on!)
## How it works
The WebSocket connection manages multiple *contexts*, each representing an independent, continuous stream of speech. A Cartesia context works much like an LLM context: on our servers, we store the previously generated speech so that new speech matches it in tone.
To summarize, here's what our code does after establishing a WebSocket connection:
1. **Create a context** with `context()`.
2. **Push text** incrementally with `push()`. Each chunk continues seamlessly from the previous one using [continuations](/build-with-cartesia/capability-guides/stream-inputs-using-continuations).
3. **Signal completion** with `no_more_inputs()` to tell the model no more text is coming.
4. **Receive audio** chunks as they are generated.
This maps directly to LLM token streaming — push each token or sentence fragment as it arrives, and audio begins streaming back even if the full text is not yet available.
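Here's a compact sketch of that pattern using the same Python SDK calls as above; `stream_llm_tokens()` is a stand-in for your LLM client's streaming output:

```python theme={null}
import os

from cartesia import Cartesia


def stream_llm_tokens():
    # Stand-in for your LLM client's streaming output.
    yield from ["Sure, ", "here is the ", "summary you ", "asked for."]


client = Cartesia(api_key=os.getenv("CARTESIA_API_KEY"))

with client.tts.websocket_connect() as connection:
    ctx = connection.context(
        model_id="sonic-3",
        voice={"mode": "id", "id": "f786b574-daa5-4673-aa0c-cbe3e8534c02"},
        output_format={
            "container": "raw",
            "encoding": "pcm_f32le",
            "sample_rate": 44100,
        },
    )

    # Push each fragment as soon as the LLM produces it; audio starts
    # streaming back before the full text exists.
    for fragment in stream_llm_tokens():
        ctx.push(fragment)
    ctx.no_more_inputs()

    audio = bytearray()
    for response in ctx.receive():
        if response.type == "chunk" and response.audio:
            audio.extend(response.audio)
        elif response.type == "done":
            break
```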
## What's next
* Deep dive into context management and buffering.
* Browse voices and learn how to pick the right one for your use case.
* Pick the right output format, sample rate, and encoding for your use case.
# LiveKit
Source: https://docs.cartesia.ai/integrations/live-kit
**LiveKit** is a WebRTC-first platform for realtime **video, voice, and data** in your product. **LiveKit Agents** is its framework for conversational agents.
**Cartesia** integrates in two ways: through **LiveKit Inference**, which hosts **cartesia/sonic-3** and related model IDs in the agent runtime (keys and pricing are handled through **LiveKit**; see [LiveKit's Cartesia TTS guide](https://docs.livekit.io/agents/models/tts/inference/cartesia)), and through the open-source **[livekit-plugins-cartesia](https://github.com/livekit/agents/tree/main/livekit-plugins/livekit-plugins-cartesia)** Python package, which provides **TTS and STT** using your own **Cartesia** credentials from the worker.
# Demo
Here's a demo of a voice assistant built with LiveKit Agents and Cartesia:
Try out the LiveKit Cartesia demo.
The source code for this demo is available [here](https://github.com/livekit-examples/cartesia-voice-agent).
# Overview
Source: https://docs.cartesia.ai/integrations/overview
Partner integrations for Cartesia TTS and STT in your own app—not Cartesia-hosted agents.
Cartesia provides first-party speech APIs and SDKs, and integrates with many other products and developer frameworks. The pages in this section describe each path at a high level; detailed setup usually lives in partner documentation and repositories.
## Prerequisites
You’ll need these for almost every integration below. Individual pages also list extras (partner accounts, runtimes, SDK installs).
* **[Cartesia API key](https://play.cartesia.ai/keys)** — create and manage keys in the Playground.
* **A voice** — pick one in the Playground or API; see [Choosing a voice](/build-with-cartesia/capability-guides/choosing-a-voice) and [Voice IDs](/build-with-cartesia/tts-models/voice-ids).
## Integrations
* **LiveKit**: realtime rooms and agents; Cartesia via LiveKit Inference or the Cartesia plugin.
* **Pipecat**: Python voice and multimodal agents with official Cartesia TTS/STT services.
* **Twilio**: Programmable Voice and Media Streams with Cartesia TTS (Node walkthrough).
* **Tencent RTC**: TRTC realtime media with Cartesia for conversational AI workloads.
* **Thoughtly**: no-code phone agents; Cartesia is the default voice stack for new agents.
* **Rasa**: Rasa Pro voice assistants with Cartesia as the TTS backend.
* **Vision Agents by Stream**: Stream's Vision Agents framework with a Cartesia TTS plugin.
* **MCP**: `cartesia-mcp` for Cursor, Claude Desktop, and other MCP clients.
# Pipecat
Source: https://docs.cartesia.ai/integrations/pipecat
## Overview
[**Pipecat**](https://www.pipecat.ai/) is an open-source Python framework for realtime **voice** agents.
Building voice agents requires the creation and orchestration of pipelines, media and communication transports (such as Daily or LiveKit), and pluggable AI models.
**Cartesia** is available as a first-party provider plugin for **[TTS and STT services](https://github.com/pipecat-ai/pipecat/tree/main/src/pipecat/services/cartesia)** in the Pipecat repo.
## Prerequisites
Pipecat’s examples require a recent Python installation (see the Pipecat repo's [root-level README](https://github.com/pipecat-ai/pipecat/tree/main#prerequisites) for current prerequisites).
Install the **`pipecat-ai`** Python package with the **`cartesia`** extra for TTS/STT (bracket syntax):
```
pip install "pipecat-ai[cartesia,...]"
# or
uv add "pipecat-ai[cartesia,...]"
```
You'll also need to install the **transport** extras your sample needs; match whatever the upstream README lists for that example.
## Getting Started - TTS (Websockets)
Pipecat's getting-started examples provide a small, copy-friendly path to wire Cartesia TTS into a Pipecat pipeline via the [TTS WebSocket API](https://docs.cartesia.ai/api-reference/tts/websocket):
Getting-started examples in the Pipecat repo.
## Getting Started - TTS and STT (Websockets & HTTP)
For smaller voice-focused samples using Cartesia STT and TTS, you can choose between two transports, WebSockets or HTTP:
Voice bot using Cartesia STT & TTS over WebSocket.
Same flow using Cartesia STT & TTS over HTTP.
## Orchestrated Conversational AI
For a fuller example app that shows an end-to-end voice agent experience (VAD -> STT -> LLM -> TTS) orchestrated with Pipecat, see StudyPal:
StudyPal example in the pipecat-examples repo.
# Rasa
Source: https://docs.cartesia.ai/integrations/rasa
**Rasa** is an open dialogue stack; **voice streaming with Cartesia** is documented for **Rasa Pro** (commercial) assistants. Configure a voice channel in **`credentials.yml`** with `tts: name: cartesia` and **`CARTESIA_API_KEY`** per Rasa’s speech-integrations reference. Start with their walkthrough, then use the reference for parameters (`model_id`, `voice`, multilingual `language_map`, etc.):
Full tutorial for building a voice agent with Rasa and Cartesia.
For implementation details, see their documentation:
Rasa reference for Cartesia TTS in voice channels.
# Tencent RTC
Source: https://docs.cartesia.ai/integrations/tencent-rtc
**Tencent Real-Time Communication (TRTC)** is Tencent Cloud’s stack for realtime audio and video—calls, live streaming, and conferencing.
**TRTC Conversational AI** is Tencent’s packaged stack for realtime voice agents. Tencent and Cartesia have a **public partnership** to combine TRTC networking with Cartesia **Sonic** TTS and **Ink-Whisper** STT for low-latency conversational AI (see Tencent’s [TRTC × Cartesia solution overview](https://trtc.tencentcloud.com/solutions/trtc-cartesia)). Integration steps and SDK details live in **Tencent’s** console and docs.
# Demo
Experience the TRTC × Cartesia voice assistant here:
[TRTC x Cartesia Demo](https://trtc.io/demo/homepage/#/cartesia)
# Thoughtly
Source: https://docs.cartesia.ai/integrations/thoughtly
**Thoughtly** is a no-code platform for **inbound and outbound AI phone agents** (sales, support, routing): visual flows, CRM and calendar integrations, analytics, and telephony. Following the [Thoughtly × Cartesia partnership](https://www.thoughtly.com/blog/thoughtly-upgrades-its-voice-library-through-partnership-with-cartesia/), **new agents default to Cartesia voices** (low-latency TTS, expanded library, cloning from a short sample in-product); Thoughtly notes existing agents can keep prior voices during migration.
# Demo
See a demo of Cartesia on Thoughtly.
# Integrate with Twilio
Source: https://docs.cartesia.ai/integrations/twilio
How to integrate Twilio with Cartesia to generate audio from text and send it as a voice call.
Use **Twilio Programmable Voice** with **Media Streams** so a phone call receives audio generated by **Cartesia TTS** over WebSockets. This walkthrough uses **Node.js**: a small server bridges Twilio’s stream to Cartesia and plays TTS audio on the callee’s line.
## Prerequisites
Before you begin, make sure you have the following:
1. [Node.js](https://nodejs.org/en/download) installed.
2. A [Twilio account](https://www.twilio.com/en-us/try-twilio). You will need your Account SID and Auth Token.
3. A [Cartesia API key](https://play.cartesia.ai/keys).
4. A phone number that you want to call.
5. A Twilio phone number to call from.
6. An [ngrok authtoken](https://dashboard.ngrok.com/get-started/your-authtoken) (a free account works).
## Get Started
1. Create a new directory for your project and navigate to it in your terminal.
2. Initialize a new Node.js project:
```bash lines theme={null}
npm init -y
```
3. Install the required dependencies:
```bash lines theme={null}
npm install twilio ws @ngrok/ngrok dotenv
```
Create a `.env` file in your project root and add the following:
```sh lines theme={null}
TWILIO_ACCOUNT_SID="your_twilio_account_sid"
TWILIO_AUTH_TOKEN="your_twilio_auth_token"
CARTESIA_API_KEY="your_cartesia_api_key"
NGROK_AUTHTOKEN="your_ngrok_authtoken"
```
Replace the placeholder values with your actual credentials.
Create a file named `app.js` (or any name you prefer) and add the following code:
```javascript lines theme={null}
const twilio = require('twilio');
const WebSocket = require('ws');
const http = require('http');
const ngrok = require('@ngrok/ngrok');
const dotenv = require('dotenv');
const crypto = require('crypto');
// Load environment variables
dotenv.config({ override: true });
// Function to get a value from environment variable or command line argument
function getConfig(key, defaultValue = undefined) {
return process.env[key] || process.argv.find(arg => arg.startsWith(`${key}=`))?.split('=')[1] || defaultValue;
}
// Configuration
const config = {
TWILIO_ACCOUNT_SID: getConfig('TWILIO_ACCOUNT_SID'),
TWILIO_AUTH_TOKEN: getConfig('TWILIO_AUTH_TOKEN'),
CARTESIA_API_KEY: getConfig('CARTESIA_API_KEY'),
NGROK_AUTHTOKEN: getConfig('NGROK_AUTHTOKEN'),
};
// Validate required configuration
const requiredConfig = ['TWILIO_ACCOUNT_SID', 'TWILIO_AUTH_TOKEN', 'CARTESIA_API_KEY', 'NGROK_AUTHTOKEN'];
for (const key of requiredConfig) {
if (!config[key]) {
console.error(`Missing required configuration: ${key}`);
process.exit(1);
}
}
const client = twilio(config.TWILIO_ACCOUNT_SID, config.TWILIO_AUTH_TOKEN);
```
In the script, you'll find a configuration section for Cartesia TTS. Make sure to set the following variables according to your needs:
```javascript lines theme={null}
const TTS_WEBSOCKET_URL = `wss://api.cartesia.ai/tts/websocket?cartesia_version=2025-03-01`;
const modelId = 'sonic-3';
const voice = {
'mode': 'id',
// You can check available voices using the Cartesia API or at https://play.cartesia.ai
'id': "e07c00bc-4134-4eae-9ea4-1a55fb45746b"
};
const partialResponse = 'Hi there, my name is Cartesia. I hope you\'re having a great day!';
```
Configure your Twilio outbound and inbound numbers:
```javascript lines theme={null}
const outbound = "+1234567890"; // Replace with the number you want to call
const inbound = "+1234567890"; // Replace with your Twilio number
```
The `main()` function orchestrates the entire process:
1. Connects to the Cartesia TTS WebSocket
2. Tests the TTS WebSocket
3. Sets up a Twilio WebSocket server
4. Creates an ngrok tunnel for the Twilio WebSocket
5. Initiates the call using Twilio
```javascript expandable lines theme={null}
let ttsWebSocket;
let callSid;
let messageComplete = false;
let audioChunksReceived = 0;
function log(message) {
console.log(`[${new Date().toISOString()}] ${message}`);
}
function connectToTTSWebSocket() {
return new Promise((resolve, reject) => {
log('Attempting to connect to TTS WebSocket');
ttsWebSocket = new WebSocket(TTS_WEBSOCKET_URL, {
headers: { 'X-Api-Key': config.CARTESIA_API_KEY }
});
ttsWebSocket.on('open', () => {
log('Connected to TTS WebSocket');
resolve(ttsWebSocket);
});
ttsWebSocket.on('error', (error) => {
log(`TTS WebSocket error: ${error.message}`);
reject(error);
});
ttsWebSocket.on('close', (code, reason) => {
log(`TTS WebSocket closed. Code: ${code}, Reason: ${reason}`);
reject(new Error('TTS WebSocket closed unexpectedly'));
});
});
}
function sendTTSMessage(message) {
const textMessage = {
'model_id': modelId,
'transcript': message,
'voice': voice,
'output_format': {
'container': 'raw',
'encoding': 'pcm_mulaw',
'sample_rate': 8000
},
// create a new context for each message since each is a complete transcript
'context_id': crypto.randomUUID()
};
log(`Sending message to TTS WebSocket: ${message}`);
ttsWebSocket.send(JSON.stringify(textMessage));
}
function testTTSWebSocket() {
return new Promise((resolve, reject) => {
const testMessage = 'This is a test message';
let receivedAudio = false;
sendTTSMessage(testMessage);
const timeout = setTimeout(() => {
if (!receivedAudio) {
reject(new Error('Timeout: No audio received from TTS WebSocket'));
}
}, 10000); // 10 second timeout
ttsWebSocket.on('message', (audioChunk) => {
if (!receivedAudio) {
log(audioChunk);
log('Received audio chunk from TTS for test message');
receivedAudio = true;
clearTimeout(timeout);
resolve();
}
});
});
}
async function startCall(twilioWebsocketUrl) {
try {
log(`Initiating call with WebSocket URL: ${twilioWebsocketUrl}`);
const call = await client.calls.create({
// Connect the call's audio to our WebSocket server via a Media Stream
twiml: `<Response><Connect><Stream url="${twilioWebsocketUrl}" /></Connect></Response>`,
to: outbound, // Replace with the phone number you want to call
from: inbound // Replace with your Twilio phone number
});
callSid = call.sid;
log(`Call initiated. SID: ${callSid}`);
} catch (error) {
log(`Error initiating call: ${error.message}`);
throw error;
}
}
async function hangupCall() {
try {
log(`Attempting to hang up call: ${callSid}`);
await client.calls(callSid).update({status: 'completed'});
log('Call hung up successfully');
} catch (error) {
log(`Error hanging up call: ${error.message}`);
}
}
function setupTwilioWebSocket() {
return new Promise((resolve, reject) => {
const server = http.createServer((req, res) => {
log(`Received HTTP request: ${req.method} ${req.url}`);
res.writeHead(200);
res.end('WebSocket server is running');
});
const wss = new WebSocket.Server({ server });
log('WebSocket server created');
wss.on('connection', (twilioWs, request) => {
log(`Twilio WebSocket connection attempt from ${request.socket.remoteAddress}`);
let streamSid = null;
twilioWs.on('message', (message) => {
try {
const msg = JSON.parse(message);
log(`Received message from Twilio: ${JSON.stringify(msg)}`);
if (msg.event === 'start') {
log('Media stream started');
streamSid = msg.start.streamSid;
log(`Stream SID: ${streamSid}`);
sendTTSMessage(partialResponse);
} else if (msg.event === 'media' && !messageComplete) {
log('Received media event');
} else if (msg.event === 'stop') {
log('Media stream stopped');
hangupCall();
}
} catch (error) {
log(`Error processing Twilio message: ${error.message}`);
}
});
twilioWs.on('close', (code, reason) => {
log(`Twilio WebSocket disconnected. Code: ${code}, Reason: ${reason}`);
});
twilioWs.on('error', (error) => {
log(`Twilio WebSocket error: ${error.message}`);
});
// Handle incoming audio chunks from TTS WebSocket
ttsWebSocket.on('message', (audioChunk) => {
log('Received audio chunk from TTS');
try {
if (streamSid) {
twilioWs.send(JSON.stringify({
event: 'media',
streamSid: streamSid,
media: {
payload: JSON.parse(audioChunk)['data']
}
}));
audioChunksReceived++;
log(`Audio chunks received: ${audioChunksReceived}`);
if (audioChunksReceived >= 50) {
messageComplete = true;
log('Message complete, preparing to hang up');
setTimeout(hangupCall, 2000);
}
} else {
log('Warning: Received audio chunk but streamSid is not set');
}
} catch (error) {
log(`Error sending audio chunk to Twilio: ${error.message}`);
}
});
log('Twilio WebSocket connected and handlers set up');
});
wss.on('error', (error) => {
log(`WebSocket server error: ${error.message}`);
});
server.listen(0, () => {
const port = server.address().port;
log(`Twilio WebSocket server is running on port ${port}`);
resolve(port);
});
server.on('error', (error) => {
log(`HTTP server error: ${error.message}`);
reject(error);
});
});
}
async function setupNgrokTunnel(port) {
try {
const listener = await ngrok.forward({
addr: port,
authtoken: config.NGROK_AUTHTOKEN,
});
const wssUrl = listener.url().replace('https://', 'wss://');
log(`ngrok tunnel established: ${wssUrl}`);
return wssUrl;
} catch (error) {
log(`Error setting up ngrok tunnel: ${error.message}`);
throw error;
}
}
async function main() {
try {
log('Starting application');
await connectToTTSWebSocket();
log('TTS WebSocket connected successfully');
await testTTSWebSocket();
log('TTS WebSocket test passed successfully');
const twilioWebsocketPort = await setupTwilioWebSocket();
log(`Twilio WebSocket server set up on port ${twilioWebsocketPort}`);
const twilioWebsocketUrl = await setupNgrokTunnel(twilioWebsocketPort);
await startCall(twilioWebsocketUrl);
} catch (error) {
log(`Error in main function: ${error.message}`);
}
}
// Run the script
main();
```
To run the application, use the following command:
```bash lines theme={null}
node app.js
```
## How It Works
1. The script establishes a connection to Cartesia's TTS WebSocket.
2. It sets up a local WebSocket server to communicate with Twilio.
3. An ngrok tunnel is created to expose the local WebSocket server to the internet.
4. A call is initiated using Twilio, connecting to the ngrok tunnel.
5. When the call connects, the script sends the predefined message to Cartesia's TTS.
6. Cartesia converts the text to speech and sends audio chunks back.
7. The script forwards these audio chunks to Twilio, which plays them on the call.
## Customization
* To change the spoken message, modify the `partialResponse` variable.
* Adjust the voice parameters in the `voice` object to change the TTS voice characteristics.
* Modify the `audioChunksReceived` threshold to control when the call should end.
## Troubleshooting
* If you encounter any issues, check the console logs for detailed error messages.
* Ensure all required environment variables are correctly set.
* If you see `invalid tunnel configuration`, make sure you're using the better supported `@ngrok/ngrok` package and not `ngrok`.
# Vision Agents by Stream
Source: https://docs.cartesia.ai/integrations/vision-agents-by-stream
[Stream](https://getstream.io/) maintains **[Vision Agents](https://visionagents.ai)**—an open-source Python framework for voice- and vision-driven agents with realtime media over **Stream**’s WebRTC edge. Cartesia is supported as the **TTS** provider; install steps, environment variables, and parameters are in Stream’s **[Cartesia integration](https://visionagents.ai/integrations/cartesia)**.
You need a **Stream** developer account for realtime transport and a **Cartesia API key** for speech.
The ["Simple Agent"](https://github.com/GetStream/Vision-Agents/tree/main/examples/01_simple_agent_example) example in GitHub and the [voice](https://visionagents.ai/introduction/voice-agents) / [video](https://visionagents.ai/introduction/video-agents) intros are good starting points.
# Demo
Try out the Simple Agent Cartesia demo.
# CLI documentation
Source: https://docs.cartesia.ai/line/cli
Create, deploy, and manage voice agents from the command line.
## Installation
By running the quick install commands, you are accepting Cartesia's [Terms of Service (TOS)](https://cartesia.ai/legal/terms.html). Please make sure to review the full TOS before proceeding.
Install and download from our servers:
```zsh lines theme={null}
curl -fsSL https://cartesia.sh | sh
```
Update to the latest version:
```zsh lines theme={null}
cartesia update
```
## Quick Start
Authenticate with your Cartesia API key.
To make an API key, go to [play.cartesia.ai/keys](https://play.cartesia.ai/keys) and select your organization.
```zsh lines theme={null}
cartesia auth login # paste your API key when prompted
```
Clone an example agent from the Line repository.
```zsh lines theme={null}
cartesia create my-agent
# Choose any example you like.
cd my-agent
```
Give your agent a name and link it to your organization.
```zsh lines theme={null}
cartesia init
```
Deploy your agent to make it available in the playground.
```zsh lines theme={null}
cartesia deploy
```
## Features
### Initialize a Project
Link any directory to a new or existing Cartesia agent:
```zsh lines theme={null}
cartesia init
```
Create a project from an example:
```zsh lines theme={null}
cartesia create
```
Inside a project directory, the CLI auto-detects the agent. Run `cartesia status` to see the current agent ID.
### Chat with Your Agent
Test your agent's text reasoning locally.
Terminal 1. Run your agent's text-logic FastAPI server:
```zsh lines theme={null}
PORT=8000 uv run python main.py
```
Terminal 2. Run the CLI to chat with your agent:
```zsh lines theme={null}
cartesia chat 8000
```
## Commands
### Authentication
To get an API key, go to [play.cartesia.ai/keys](https://play.cartesia.ai/keys), select your organization, and generate a new key.
```zsh lines theme={null}
cartesia auth login
```
To validate the existing API key:
```zsh lines theme={null}
cartesia auth status
```
To logout (clears cached credentials):
```zsh lines theme={null}
cartesia auth logout
```
### Voice Agents
Deploy your agent to Cartesia cloud.
```zsh lines theme={null}
cartesia deploy
```
List out all the agents in your organization:
```zsh lines theme={null}
cartesia agents ls
```
#### Managed Deployments
Versions of your agent running on Cartesia's cloud. Each deployment rebuilds the environment, instantiates your project, and runs a health check.
To see all of your deployments:
```zsh lines theme={null}
cartesia deployments ls
```
Check the status of a deployment:
```zsh lines theme={null}
cartesia status [<agent-id> or <deployment-id>]
```
#### Self-Hosted Agent Code
While Cartesia's managed deployments are the simplest way to deploy low-latency voice agents, if you'd like to manage your own deployments of your agent code, you can pass us a URL for your agent to connect to during calls.
Connect an existing agent to your self-hosted code:
```zsh lines theme={null}
cartesia connect --agent-id <agent-id> --url https://my-agent.example.com
```
Or run without `--agent-id` to interactively select an existing agent or create a new one:
```zsh lines theme={null}
cartesia connect --url https://my-agent.example.com
```
Disconnect an agent from your self-hosted code:
```zsh lines theme={null}
cartesia disconnect --agent-id <agent-id>
```
### Environment Variables
Create, list, and remove environment variables for your agent.
Set environment variables for your agent:
```zsh lines theme={null}
cartesia env set API_KEY=FOOBAR MY_CONFIG=FOOBAZ
```
Environment variables are encrypted for storage and can only be accessed by your code.
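Inside your agent code, read them like any other environment variables (a trivial sketch; the variable names match the example above):

```python theme={null}
import os

# Values set with `cartesia env set` are available to your running agent
# as ordinary environment variables.
api_key = os.environ["API_KEY"]
my_config = os.getenv("MY_CONFIG", "default-value")
```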
Port environment variables from a `.env` file:
```zsh lines theme={null}
cartesia env set --from .env
```
```text .env theme={null}
API_KEY=FOOBAR
MY_CONFIG=FOOBAZ
```
Remove an environment variable:
```zsh lines theme={null}
cartesia env rm <KEY>
```
### Help Menu
For more details on any command:
```zsh lines theme={null}
cartesia --help
```
# Release Notes
Source: https://docs.cartesia.ai/line/developer-tools/release-notes
Updates to the Line SDK and platform.
## March 2026
Platform-wide API, PVC, and client library updates for this month are in [Changelog 2026](/changelog/2026) (March 2026).
***
## February 4, 2026
### AgentUpdateCall Output Event
Added `AgentUpdateCall` event for dynamically updating call configuration during a conversation:
```python theme={null}
from line.events import AgentUpdateCall
# In an agent's process method:
yield AgentUpdateCall(voice_id="5ee9feff-1265-424a-9d7f-8e4d431a12c7")
yield AgentUpdateCall(pronunciation_dict_id="dict-123")
```
| Field | Description |
| ----------------------- | ------------------------------------ |
| `voice_id` | Updates the agent's voice |
| `pronunciation_dict_id` | Updates the pronunciation dictionary |
All fields are optional—only set fields are updated. See [Events](/line/sdk/events#dynamic-configuration) for details.
***
## February 1, 2026
### Line SDK v0.2 — Major Release
We're releasing **Line SDK v0.2**, a complete redesign of the voice agent framework focused on simplicity, streaming performance, and seamless LLM integration. This release introduces a new async iterable architecture that replaces the previous event bus system.
**Breaking Changes**: v0.2 is not backwards compatible with v0.1.x. See the [Migration Guide](#migration-guide-from-v0-1-x-to-v0-2) below for detailed upgrade instructions.
**What's changing?** Line SDK v0.2 makes it much simpler to build voice agents. Instead of manually wiring together multiple components (systems, bridges, nodes), you now write a single function that returns your agent. The SDK handles audio, interruptions, and conversation flow automatically.
**Why upgrade?**
* **Faster development** — Build agents in hours instead of days with less boilerplate code
* **Easier maintenance** — Fewer moving parts means fewer bugs and simpler debugging
* **Better reliability** — Built-in error handling, retries, and fallback models
* **More flexibility** — Switch between 100+ AI providers (OpenAI, Anthropic, Google, etc.) without code changes
* **Powerful tools** — Add capabilities like web search, call transfers, and multi-agent handoffs with one line of code
***
## What's New in v0.2
### Simplified Agent Architecture
The new architecture replaces the `VoiceAgentSystem`, `Bus`, `Bridge`, and `ReasoningNode` pattern with a single async iterable function:
```python theme={null}
import os
from line import CallRequest
from line.llm_agent import LlmAgent, LlmConfig, end_call
from line.voice_agent_app import AgentEnv, VoiceAgentApp
async def get_agent(env: AgentEnv, call_request: CallRequest):
return LlmAgent(
model="anthropic/claude-haiku-4-5-20251001",
api_key=os.getenv("ANTHROPIC_API_KEY"),
tools=[end_call],
config=LlmConfig(
system_prompt="You are a helpful assistant.",
introduction="Hello! How can I help you today?",
),
)
app = VoiceAgentApp(get_agent=get_agent)
```
**Benefits:**
* Less boilerplate code
* No manual event routing or bridge configuration
* Automatic conversation history management
* Built-in interruption handling
* Quick and easy tool definition
### Built-in LLM Support via LiteLLM
`LlmAgent` provides unified access to 100+ LLM providers through [LiteLLM](https://github.com/BerriAI/litellm):
```python theme={null}
# OpenAI
LlmAgent(model="gpt-5-nano", api_key=os.getenv("OPENAI_API_KEY"), ...)
# Anthropic
LlmAgent(model="anthropic/claude-haiku-4-5-20251001", api_key=os.getenv("ANTHROPIC_API_KEY"), ...)
# Google Gemini
LlmAgent(model="gemini/gemini-2.5-flash-preview-09-2025", api_key=os.getenv("GEMINI_API_KEY"), ...)
# With fallbacks
LlmAgent(
model="gpt-5-nano",
config=LlmConfig(fallbacks=["anthropic/claude-haiku-4-5-20251001", "gemini/gemini-2.5-flash-preview-09-2025"]),
...
)
```
### Declarative Tool System
Define agent capabilities using simple decorators. Three tool types cover all common scenarios:
| Tool Type | Decorator | What It Does | Example Use Case |
| --------------- | ------------------- | --------------------------------------------------------------- | ------------------------------------------------- |
| **Loopback** | `@loopback_tool` | Fetches information, then the agent speaks the answer naturally | Looking up order status, checking account balance |
| **Passthrough** | `@passthrough_tool` | Takes an immediate action without additional AI processing | Ending a call, transferring to a phone number |
| **Handoff** | `@handoff_tool` | Transfers the conversation to a different specialized agent | Routing to Spanish support, escalating to billing |
```python theme={null}
from typing import Annotated
from line.llm_agent import loopback_tool, passthrough_tool, handoff_tool
from line.events import AgentEndCall
@loopback_tool
async def get_weather(ctx, city: Annotated[str, "City name"]) -> str:
"""Get current weather for a city."""
return f"72°F and sunny in {city}"
@passthrough_tool
async def end_call(ctx):
"""End the call."""
yield AgentEndCall()
@handoff_tool
async def transfer_to_support(ctx, event):
"""Transfer to support agent."""
async for output in support_agent.process(ctx.turn_env, event):
yield output
```
### Background Tool Execution
Long-running tools can execute in the background without blocking the LLM:
```python theme={null}
from typing import Annotated
from line.llm_agent import loopback_tool
@loopback_tool(is_background=True)
async def check_bank_balance(ctx, account_id: Annotated[str, "Account ID"]):
"""Check account balance (may take a few seconds)."""
yield "Checking your balance..." # Immediate acknowledgment
balance = await api.get_balance(account_id) # Long operation
yield f"Your balance is ${balance:.2f}" # Triggers new LLM completion
```
### Built-in Tools
Common operations available out of the box:
```python theme={null}
from line.llm_agent import end_call, send_dtmf, transfer_call, web_search, agent_as_handoff
agent = LlmAgent(
tools=[
end_call, # End the call
send_dtmf, # Send DTMF tones
transfer_call, # Transfer to phone number
web_search, # Real-time web search
agent_as_handoff(other_agent, name="transfer_to_billing"),
],
...
)
```
### Multi-Agent Workflows
Create sophisticated agent routing with `agent_as_handoff`:
```python theme={null}
spanish_agent = LlmAgent(
model="gpt-5-nano",
config=LlmConfig(system_prompt="Speak only in Spanish.", ...),
...
)
main_agent = LlmAgent(
tools=[
agent_as_handoff(
spanish_agent,
handoff_message="Transferring to Spanish support...",
name="transfer_to_spanish",
description="Transfer when user requests Spanish.",
),
],
...
)
```
### Structured Event System
Events are how your agent communicates with the outside world. **Output events** are actions your agent takes (speaking, ending calls). **Input events** are things that happen during a call (user speaks, call starts).
**Output Events** (agent → harness):
* `AgentSendText` — Send text to be spoken
* `AgentEndCall` — End the call
* `AgentTransferCall` — Transfer to another number
* `AgentSendDtmf` — Send DTMF tone
* `AgentToolCalled` / `AgentToolReturned` — Tool execution tracking
* `LogMetric` / `LogMessage` — Observability
**Input Events** (harness → agent):
* `CallStarted` / `CallEnded` — Call lifecycle
* `UserTurnStarted` / `UserTurnEnded` — User speaking
* `UserTextSent` / `UserDtmfSent` — User content
* `AgentHandedOff` — Handoff notification
All input events include a `history` field with the complete conversation context.
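For example, a minimal agent might react to input events by yielding output events. The sketch below uses the event classes listed above; the `EchoAgent` name and the turn threshold are illustrative:
```python theme={null}
from line.events import AgentEndCall, AgentSendText, CallStarted, UserTurnEnded

class EchoAgent:
    async def process(self, env, event):
        if isinstance(event, CallStarted):
            yield AgentSendText(text="Hi! Say something and I'll repeat it.")
        elif isinstance(event, UserTurnEnded):
            user_text = event.content[0].content if event.content else ""
            # Every input event carries the conversation so far in `history`
            if len(event.history or []) > 20:
                yield AgentSendText(text="We've been chatting a while. Goodbye!")
                yield AgentEndCall()
            else:
                yield AgentSendText(text=f"You said: {user_text}")
```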
### Enhanced Configuration
Fine-tune how your agent thinks and responds. `LlmConfig` lets you control the AI's personality, response length, creativity, and reliability:
```python theme={null}
LlmConfig(
system_prompt="You are a helpful assistant.",
introduction="Hello! How can I help?",
# Sampling parameters
temperature=0.7,
max_tokens=1024,
top_p=0.95,
# Resilience
num_retries=2,
fallbacks=["gpt-5-nano"],
timeout=30.0,
# Provider-specific options
extra={"reasoning_effort": "high"},
)
```
***
## Migration Guide from v0.1.x to v0.2
This guide walks you through upgrading your existing v0.1.x agents to v0.2. The migration involves updating imports, simplifying your agent setup, and adopting the new tool system. Most agents can be migrated in under an hour.
### Overview of Changes
| v0.1.x | v0.2 |
| ------------------------------------- | ----------------------------------------- |
| `VoiceAgentSystem` + `Bus` + `Bridge` | `VoiceAgentApp` with `get_agent` callback |
| `ReasoningNode` subclasses | `LlmAgent` or custom `Agent` protocol |
| `call_handler(system, request)` | `get_agent(env, request) -> Agent` |
| Manual event routing | Automatic event dispatch with filters |
| `process_context()` method | `process(env, event)` async iterable |
### Step 1: Update Imports
```python theme={null}
# v0.1.x
from line.voice_agent_app import VoiceAgentApp
from line.voice_agent_system import VoiceAgentSystem
from line.bridge import Bridge
from line.nodes import ReasoningNode
from line.events import (
AgentSpeechSent,
UserTranscriptionReceived,
EndCall,
TransferCall,
)
# v0.2
from line.voice_agent_app import VoiceAgentApp, AgentEnv
from line.llm_agent import LlmAgent, LlmConfig
from line.llm_agent import end_call, transfer_call, loopback_tool, passthrough_tool
from line.events import (
AgentSendText,
AgentEndCall,
AgentTransferCall,
UserTurnEnded,
CallStarted,
)
```
### Step 2: Replace VoiceAgentSystem with get\_agent
In v0.1.x, event routing was configured manually via `bridge.on()`. In v0.2, event dispatch is automatic with customizable **run** and **cancel filters**.
```python v0.1.x theme={null}
from line.voice_agent_app import VoiceAgentApp
from line.voice_agent_system import VoiceAgentSystem
from line.bridge import Bridge
from line.nodes import ReasoningNode
from line.events import (
UserTranscriptionReceived,
UserStoppedSpeaking,
DTMFInputEvent,
)
class MyReasoningNode(ReasoningNode):
async def process_context(self, context):
# Your LLM logic here
response = await call_llm(context.messages)
yield AgentResponse(content=response)
async def call_handler(system: VoiceAgentSystem, call_request):
node = MyReasoningNode(system_prompt="You are helpful.")
bridge = Bridge(node)
system.with_speaking_node(node, bridge)
# Manual event routing with bridge.on()
bridge.on(UserTranscriptionReceived).map(node.add_event)
bridge.on(UserStoppedSpeaking).stream(node.generate).broadcast()
# DTMF events required explicit routing
bridge.on(DTMFInputEvent).map(node.handle_dtmf)
await system.start()
await system.send_initial_message("Hello!")
await system.wait_for_shutdown()
app = VoiceAgentApp(call_handler=call_handler)
```
```python v0.2 theme={null}
import os
from line import CallRequest
from line.voice_agent_app import VoiceAgentApp, AgentEnv
from line.llm_agent import LlmAgent, LlmConfig, end_call
from line.events import (
CallStarted,
UserTurnEnded,
UserDtmfSent,
UserTurnStarted,
CallEnded,
)
async def get_agent(env: AgentEnv, call_request: CallRequest):
agent = LlmAgent(
model="gpt-5-nano",
api_key=os.getenv("OPENAI_API_KEY"),
tools=[end_call],
config=LlmConfig(
system_prompt="You are helpful.",
introduction="Hello!",
),
)
# Default: just return the agent (uses default filters)
return agent
async def get_agent_with_dtmf(env: AgentEnv, call_request: CallRequest):
"""Alternative: include DTMF events in processing."""
agent = LlmAgent(...)
# Return an AgentSpec tuple to customize filters
run_filter = [CallStarted, UserTurnEnded, UserDtmfSent, CallEnded]
cancel_filter = [UserTurnStarted]
return (agent, run_filter, cancel_filter)
app = VoiceAgentApp(get_agent=get_agent)
```
#### Run and Cancel Filters
Filters control your agent's behavior during a call:
* **Run filters** determine what triggers your agent to respond (e.g., when the user finishes speaking)
* **Cancel filters** determine what interrupts your agent (e.g., when the user starts talking over the agent)
You can customize these by returning a tuple instead of just the agent:
```python theme={null}
from typing import Union, Tuple
AgentSpec = Union[Agent, Tuple[Agent, run_filter, cancel_filter]]
```
| Filter | Purpose | Default |
| ------------------ | ------------------------------------------ | ----------------------------------------- |
| **run\_filter** | Events that trigger agent processing | `[CallStarted, UserTurnEnded, CallEnded]` |
| **cancel\_filter** | Events that cancel in-progress agent tasks | `[UserTurnStarted]` |
**Example: Agent that responds to DTMF input**
```python theme={null}
from line.events import (
CallStarted, CallEnded, UserTurnEnded, UserTurnStarted, UserDtmfSent
)
async def get_agent(env: AgentEnv, call_request: CallRequest):
agent = LlmAgent(...)
# Include UserDtmfSent in run_filter to process DTMF
run_filter = [CallStarted, UserTurnEnded, UserDtmfSent, CallEnded]
cancel_filter = [UserTurnStarted]
return (agent, run_filter, cancel_filter)
```
**Example: Agent that doesn't get interrupted**
```python theme={null}
async def get_agent(env: AgentEnv, call_request: CallRequest):
agent = LlmAgent(...)
# Empty cancel_filter = agent won't be interrupted
run_filter = [CallStarted, UserTurnEnded, CallEnded]
cancel_filter = []
return (agent, run_filter, cancel_filter)
```
**Example: Custom filter function**
```python theme={null}
def my_run_filter(event: InputEvent) -> bool:
"""Only process events during business hours."""
if isinstance(event, CallStarted):
return is_business_hours()
return isinstance(event, (UserTurnEnded, CallEnded))
async def get_agent(env: AgentEnv, call_request: CallRequest):
agent = LlmAgent(...)
return (agent, my_run_filter, [UserTurnStarted])
```
### Step 3: Migrate Event Handling
```python v0.1.x theme={null}
# Event names
AgentSpeechSent # Agent spoke
UserTranscriptionReceived # User spoke
EndCall # End call
TransferCall # Transfer call
# Manual event handling in ReasoningNode
class MyNode(ReasoningNode):
async def process_context(self, context):
for event in context.events:
if isinstance(event, UserTranscriptionReceived):
user_message = event.transcription
```
```python v0.2 theme={null}
# Event names
AgentSendText # Output: send text to speak
AgentTextSent # Input: confirmation text was spoken
UserTurnEnded # Input: user finished speaking
AgentEndCall # Output: end call
AgentTransferCall # Output: transfer call
# Events include history automatically
async def process(self, env, event):
if isinstance(event, UserTurnEnded):
# Access user's message
user_message = event.content[0].content
# Access full conversation history
for past_event in event.history:
if isinstance(past_event, UserTextSent):
print(f"User previously said: {past_event.content}")
```
### Step 4: Migrate Custom Tools
```python v0.1.x theme={null}
# Manual tool handling in ReasoningNode
class MyNode(ReasoningNode):
async def process_context(self, context):
# Parse tool calls from LLM response
if tool_call := extract_tool_call(response):
result = await self.execute_tool(tool_call)
# Manually add to context and call LLM again
context.add_tool_result(result)
response = await call_llm(context)
```
```python v0.2 theme={null}
from typing import Annotated
from line.llm_agent import loopback_tool, passthrough_tool
from line.events import AgentSendText, AgentEndCall
# Declarative tool definitions
@loopback_tool
async def get_account_balance(ctx, account_id: Annotated[str, "Account ID"]):
"""Look up account balance."""
balance = await api.get_balance(account_id)
return f"${balance:.2f}"
@passthrough_tool
async def end_call_with_message(ctx, message: Annotated[str, "Goodbye message"]):
"""End call with a custom message."""
yield AgentSendText(text=message)
yield AgentEndCall()
# Tools are passed to LlmAgent
agent = LlmAgent(
tools=[get_account_balance, end_call_with_message],
...
)
```
### Step 5: Migrate Multi-Agent Patterns
```python v0.1.x theme={null}
# Manual agent switching
class MainNode(ReasoningNode):
def __init__(self, spanish_node):
self.spanish_node = spanish_node
self.use_spanish = False
async def process_context(self, context):
if self.should_switch_to_spanish(context):
self.use_spanish = True
# Complex manual state management
```
```python v0.2 theme={null}
from line.llm_agent import agent_as_handoff
spanish_agent = LlmAgent(
model="gpt-5-nano",
config=LlmConfig(system_prompt="Speak only in Spanish."),
...
)
main_agent = LlmAgent(
tools=[
agent_as_handoff(
spanish_agent,
handoff_message="Transferring...",
name="transfer_to_spanish",
description="Use when user requests Spanish.",
),
],
...
)
```
### Removed APIs
The following APIs from v0.1.x have been removed with no direct replacement:
| Removed | Alternative |
| --------------------- | -------------------------------------------- |
| `VoiceAgentSystem` | Use `VoiceAgentApp` with `get_agent` |
| `Bus` | Events are dispatched automatically |
| `Bridge` | Use run/cancel filters on `AgentSpec` |
| `ReasoningNode` | Use `LlmAgent` or implement `Agent` protocol |
| `ConversationHarness` | Handled internally by `ConversationRunner` |
| `EventsRegistry` | Use typed event classes directly |
### Custom Agent Protocol
If you need custom logic beyond `LlmAgent`, implement the `Agent` protocol:
```python theme={null}
from typing import AsyncIterable
from line.events import (
InputEvent,
OutputEvent,
AgentSendText,
CallStarted,
UserTurnEnded,
)
class CustomAgent:
"""Custom agent implementing the Agent protocol."""
async def process(self, env, event: InputEvent) -> AsyncIterable[OutputEvent]:
if isinstance(event, CallStarted):
yield AgentSendText(text="Hello from custom agent!")
elif isinstance(event, UserTurnEnded):
# Your custom logic here
user_message = event.content[0].content
response = await your_custom_logic(user_message, event.history)
yield AgentSendText(text=response)
```
***
## Breaking Changes Summary
This section provides a quick reference for all breaking changes. Use this as a checklist when migrating your code.
### Event Renames
| v0.1.x | v0.2 |
| --------------------------- | -------------------------------------------------- |
| `AgentSpeechSent` | `AgentSendText` (output) / `AgentTextSent` (input) |
| `UserTranscriptionReceived` | `UserTextSent` / `UserTurnEnded` |
| `UserStartedSpeaking` | `UserTurnStarted` |
| `UserStoppedSpeaking` | `UserTurnEnded` |
| `AgentStartedSpeaking` | `AgentTurnStarted` |
| `AgentStoppedSpeaking` | `AgentTurnEnded` |
| `EndCall` | `AgentEndCall` |
| `TransferCall` | `AgentTransferCall` |
| `DTMFInputEvent` | `UserDtmfSent` |
| `DTMFOutputEvent` | `AgentSendDtmf` |
**Output vs. Input events**: `AgentSendText` is an output event you **yield** to make the agent speak. `AgentTextSent` is an input event you **receive** confirming what was spoken (appears in history).
### Structural Changes
* **History in events**: All input events now include an optional `history` field with complete conversation context. When `history` is `None`, the event is inside a history list; when it contains a list, the event has full context attached.
* **Tool events**: `ToolCall`/`ToolResult` replaced with structured `AgentToolCalled`/`AgentToolReturned`
* **Event IDs**: All events now have stable `event_id` fields for tracking
### Configuration Changes
| v0.1.x | v0.2 |
| --------------------------------- | ------------------------------------- |
| `CallRequest.agent.system_prompt` | `LlmConfig.system_prompt` |
| `CallRequest.agent.introduction` | `LlmConfig.introduction` |
| Manual LLM parameters | `LlmConfig` with full LiteLLM support |
Use `LlmConfig.from_call_request(call_request, fallback_system_prompt="...", fallback_introduction="...")` to automatically inherit configuration from the Cartesia Playground while providing sensible defaults. See [Agents documentation](/line/sdk/agents#accessing-call-metadata-in-your-agent-logic) for details.
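For example, inside your `get_agent` function (a sketch; the model and fallback strings are placeholders):
```python theme={null}
import os

from line.llm_agent import LlmAgent, LlmConfig

config = LlmConfig.from_call_request(
    call_request,
    fallback_system_prompt="You are a helpful assistant.",
    fallback_introduction="Hello! How can I help you today?",
)
agent = LlmAgent(model="gpt-5-nano", api_key=os.getenv("OPENAI_API_KEY"), config=config)
```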
***
## New Dependencies
v0.2 introduces the following dependencies:
```
litellm # Multi-provider LLM support
pydantic # Type validation for events
phonenumbers >= 9.0 # Phone number validation for transfer_call
```
Optional dependencies for examples:
```
exa-py # Exa web search integration
duckduckgo-search # Fallback web search
```
***
## Getting Help
* **Documentation**: [Line SDK Overview](/line/sdk/overview)
* **Examples**: [github.com/cartesia-ai/line/examples](https://github.com/cartesia-ai/line/tree/main/examples)
* **Support**: [support@cartesia.ai](mailto:support@cartesia.ai)
# Metrics
Source: https://docs.cartesia.ai/line/evaluations/metrics
The Line platform includes a suite of tools for evaluating how your Agent is performing, both during development and in production.
You have full control over how the metrics used to evaluate your agent are defined.
## System Metrics
By default, all calls made by a Line Agent have a set of system metrics automatically calculated to help evaluate performance.
| System Metric | Description |
| ------------------------------ | ------------------------------------------------------------------------------------------------------------ |
| system\_call\_success          | A boolean indicating whether the call completed successfully, i.e. did not disconnect unexpectedly (for example, due to the reasoning code crashing) |
| system\_text\_to\_speech\_ttfb | The time to first byte of audio generated by the TTS model on the first turn of the conversation |
### LLM as a Judge
An LLM-as-a-Judge metric is created in the playground by setting a name and specifying a prompt. You can try out different prompts in
the playground against existing call transcripts by copying a call id into the metric creation field and clicking evaluate
to generate a sample output.
Write your LLM-as-a-Judge metrics to return a single value and a description field.
A metric name can only include lowercase letters, digits, and `-`, `_`, or `.` characters so that you can manage it from the CLI. Metric names must also be unique within your organization.
## Assigning Metrics
Once a metric is created, it can be assigned to an Agent via the playground from the Agent page. All subsequent calls made
to or from that Agent will have metric results calculated and available to view in the console and API. Note
that when you assign a metric to an existing Agent, it won’t be automatically run on previous calls.
# Metrics Results
Source: https://docs.cartesia.ai/line/evaluations/results
View the results from metrics run against all calls handled by your agent.
Metric results are accessible via both the API and the playground.
Each metric result contains relevant information to help you analyze your calls. Some fields include:
```
- metric_id
- metric_name
- agent_id
- call_id
- summary
- transcript
- deployment_id
- value
- status
```
To view the full schema, visit the API [List Metric Results](/api-reference/agents/metrics/list-metric-results).
## API
To get metrics via the API, you can specify a few filter parameters including `call_id`, `agent_id` and more. You can retrieve these metric results or export them into a CSV. [List Metric Results](/api-reference/agents/metrics/list-metric-results) and [Export Metric Results](/api-reference/agents/metrics/export-metric-results) have the same query parameters available and differ only in the response format.
#### Example Request for CSV Results
```zsh cURL lines theme={null}
curl --location 'https://api.cartesia.ai/agents/metrics/export?metric_id={metric_id}&limit=100&starting_after={previous_next_page_metric_id}' \
--header 'Cartesia-Version: 2025-04-16' \
--header 'Authorization: Bearer {YOUR_API_KEY}'
```
```python Python lines theme={null}
import requests
url = "https://api.cartesia.ai/agents/metrics/export"
params = {
"metric_id": "{metric_id}",
"limit": 100,
"starting_after": "{previous_next_page_metric_id}"
}
headers = {
"Content-Type": "application/json",
"Cartesia-Version": "2025-04-16",
"Authorization": "Bearer "
}
response = requests.get(url, headers=headers, params=params)
if response.status_code == 200:
# Save CSV to file
with open("metrics.csv", "w", encoding="utf-8") as f:
f.write(response.text)
print("CSV file saved as metrics.csv")
else:
print(f"Error {response.status_code}: {response.text}")
```
```typescript Javascript lines theme={null}
const response = await fetch(
"https://api.cartesia.ai/agents/metrics/export?metric_id={metric_id}&limit=100&starting_after={previous_next_page_metric_id}",
{
method: "GET",
headers: {
"Content-Type": "application/json",
"Cartesia-Version": "2025-04-16",
Authorization: "Bearer ",
},
}
);
```
## Console
Metrics are visible in the playground for a specific call record.
# Deployments
Source: https://docs.cartesia.ai/line/infrastructure/deployments
Deployments are instances of your agent running on Cartesia's servers.
## State
Only deployments in the `ready` state can handle inbound or outbound calls. At any time, only one deployment is active.
Deployments that fail health checks will not receive traffic.
## Creating a deployment
Use `cartesia deploy` or push to a linked GitHub repository to create a deployment.
Cartesia servers:
1. Build the virtual environment
2. Load `main.py` and instantiate a FastAPI app
3. Run a health check
4. Set the deployment to `ready` and start receiving traffic
Line supports Python 3.9–3.13 (specify in `pyproject.toml`). FastAPI servers only; more frameworks coming soon.
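As a rough sketch, a deployable `main.py` can reuse the v0.2 `VoiceAgentApp` pattern shown elsewhere in these docs (the model, prompt, and environment variable are placeholders):
```python theme={null}
# main.py
import os

from line import CallRequest
from line.llm_agent import LlmAgent, LlmConfig, end_call
from line.voice_agent_app import AgentEnv, VoiceAgentApp

async def get_agent(env: AgentEnv, call_request: CallRequest):
    return LlmAgent(
        model="gpt-5-nano",
        api_key=os.getenv("OPENAI_API_KEY"),
        tools=[end_call],
        config=LlmConfig(system_prompt="You are helpful.", introduction="Hello!"),
    )

app = VoiceAgentApp(get_agent=get_agent)

if __name__ == "__main__":
    app.run()
```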
**Pre-Call Initialization**
Inbound calls will ring for five seconds to allow your application logic to warm up any required resources and establish
connections.
# Observability
Source: https://docs.cartesia.ai/line/infrastructure/observability
Get full visibility into how your Agent is performing.
Monitor every deployment and call.
## Deployment
Each deployment generates a unique ID. View logs in the console.
## Call Logs
You can click into a call and view any logging statements generated by your reasoning code.
## Transcripts
Each call has a transcript that separates the user's transcribed audio from the text generated by the agent. When you export these
transcripts with the API or CLI, they include more granular turn-level timestamps.
## Loggable Events
Record events without tying them to tool calls.
### SDK
In the SDK, yield `LogMessage` events from your agent or tools to record custom events:
```python theme={null}
from line.events import LogMessage
@loopback_tool
async def process_order(ctx, order_id: Annotated[str, "Order ID"]):
"""Process a customer order."""
result = await api.process_order(order_id)
# Log a custom event
yield LogMessage(
name="order_processed",
level="info",
message=f"Processed order {order_id}",
metadata={"status": result.status, "order_id": order_id}
)
return f"Order {order_id} processed: {result.status}"
```
Events are automatically sent to the platform when yielded.
### Websocket
If you're not using the SDK and instead just relying on the bare websocket, logging events will look like this:
```json theme={null}
{
"type": "log_event",
"event": "event_name",
"metadata": {
"key": "value"
}
}
```
### Playground
You can view these events in the Playground under the `Transcript` tab of the call.
## Loggable Metrics
Record metrics at any point in your workflow.
### SDK
In the context of the SDK, we can log a metric by broadcasting the `LogMetric` event.
Here's a snippet from the form filling template that exhibits this:
```python theme={null}
# Record the answer in form manager
success = self.form_manager.record_answer(answer)
if success:
# Log metric for the answered question
if current_question:
metric_name = current_question["id"]
yield LogMetric(name=metric_name, value=answer)
logger.info(f"📊 Logged metric: {metric_name}={answer}")
```
The user bridge is subscribed to the `LogMetric` event by default and logs it over the websocket whenever it sees that a `LogMetric` has been broadcast.
### Websocket
If you're not using the SDK and instead just relying on the bare websocket, logging metrics will look like this:
```json theme={null}
{
"type": "log_metric",
"name": "metric_name",
"value": "metric_value"
}
```
### Playground
You can view these events in the Playground under the `Transcript` tab of the call.
## Call Recordings
Call recordings can be downloaded from the playground.
## Webhooks
Cartesia sends webhook events to your **HTTPS** endpoint throughout the call lifecycle. Expose **`POST`** + **`application/json`** and verify the **`x-webhook-secret`** header matches your stored secret.
### Verify the webhook secret
```python theme={null}
if request.headers.get("x-webhook-secret") != os.environ["LINE_WEBHOOK_SECRET"]:
return jsonify({"error": "unauthorized"}), 401
```
```typescript theme={null}
if (req.headers["x-webhook-secret"] !== process.env.LINE_WEBHOOK_SECRET)
return res.status(401).json({ error: "unauthorized" });
```
### Event types
| Event | When | Typed field |
| -------------------- | ------------------------------ | ----------- |
| `call_started` | Call session begins | `call` |
| `call_completed` | Call ends normally | `call` |
| `call_failed` | Call ends with error | `call` |
| `call_turn` | Each conversational turn | `turn` |
| `post_call_analysis` | After async analysis completes | `analysis` |
### Envelope fields
Every webhook event includes these top-level fields:
| Field | Description |
| ------------ | ----------------------------- |
| `type` | Event type (see table above). |
| `call_id` | Call identifier. |
| `agent_id` | Agent that handled the call. |
| `webhook_id` | Webhook config id. |
| `timestamp` | RFC 3339 event time. |
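A minimal receiver might verify the secret and branch on `type` like this. This is a sketch assuming Flask, to match the verification snippet above; adapt it to your framework:
```python theme={null}
import os

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.post("/webhooks/cartesia")
def cartesia_webhook():
    if request.headers.get("x-webhook-secret") != os.environ["LINE_WEBHOOK_SECRET"]:
        return jsonify({"error": "unauthorized"}), 401
    event = request.get_json()
    if event["type"] in ("call_completed", "call_failed"):
        call = event["call"]
        print(f"Call {event['call_id']} ended: {call.get('end_reason')}")
    elif event["type"] == "post_call_analysis":
        print(f"Summary for {event['call_id']}: {event['analysis']['summary']}")
    return jsonify({"ok": True}), 200
```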
### `call`
Present on `call_started`, `call_completed`, and `call_failed` events. Matches the [GET /agents/calls/\{call\_id}](/api-reference/agents/calls/get-call) response. Some events (e.g. `call_started`) may omit fields like `end_time` that do not yet have a valid value.
| Field | Description |
| ------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
| `id` | Call identifier. |
| `agent_id` / `agent_name` | Agent details. |
| `status` | `started`, `completed`, or `failed`. |
| `start_time` / `end_time` | RFC 3339 timestamps. |
| `end_reason` | Why the call ended (e.g. `client_hangup`, `agent_hangup`, `inactivity`). See [EndReason](/api-reference/agents/calls/get-call) for all values. |
| `transcript` | Array of turns (see `turn` below). |
| `telephony_params` | `from`, `to`, `direction`, `call_sid`, `connection_type`. |
| `error_message` | Error detail (failed calls only). |
| `metadata` | User-supplied metadata passed at call start. |
| `summary` | Call summary (if available at event time). |
### `turn`
Present on `call_turn` events. One turn per agent or user utterance.
| Field | Description |
| ----------------------------------- | ----------------------------------------------------- |
| `role` | `assistant` or `user`. |
| `text` | Turn text. |
| `start_timestamp` / `end_timestamp` | Seconds from call start. |
| `tts_ttfb` | Agent TTS time-to-first-byte (seconds), when present. |
| `tool_calls` | Tool calls made during this turn, when present. |
### `analysis`
Present on `post_call_analysis` events. Sent after async analysis completes (currently summary generation; evaluations and metrics will be added here in the future).
| Field | Description |
| --------- | -------------------------- |
| `summary` | 1-2 sentence call summary. |
### Example: `call_completed`
```json theme={null}
{
"type": "call_completed",
"call_id": "ac_sid_gqkgRWUz2u64qFUjA1mZyr",
"agent_id": "agent_rwh4HGMgyhK7rM5ucVqbiC",
"webhook_id": "agent_webhook_P3MgdLf1cpaucZJ7xWehCC",
"end_reason": "client_hangup",
"timestamp": "2026-04-16T01:08:50.061907836Z",
"call": {
"id": "ac_sid_gqkgRWUz2u64qFUjA1mZyr",
"agent_id": "agent_rwh4HGMgyhK7rM5ucVqbiC",
"agent_name": "My Agent",
"status": "completed",
"start_time": "2026-04-16T01:08:37.413659Z",
"end_time": "2026-04-16T01:08:50.036327Z",
"end_reason": "client_hangup",
"telephony_params": {
"from": "websocket",
"to": "agent_rwh4HGMgyhK7rM5ucVqbiC",
"connection_type": "websocket"
},
"transcript": [
{
"role": "assistant",
"text": "Hi there! How can I help you today?",
"start_timestamp": 0.41,
"end_timestamp": 3.2,
"tts_ttfb": 0.065
},
{
"role": "user",
"text": "I want to schedule an appointment.",
"start_timestamp": 3.5,
"end_timestamp": 5.8
}
]
}
}
```
### Example: `post_call_analysis`
```json theme={null}
{
"type": "post_call_analysis",
"call_id": "ac_sid_gqkgRWUz2u64qFUjA1mZyr",
"agent_id": "agent_rwh4HGMgyhK7rM5ucVqbiC",
"webhook_id": "agent_webhook_P3MgdLf1cpaucZJ7xWehCC",
"timestamp": "2026-04-16T01:08:50.955058787Z",
"analysis": {
"summary": "The caller requested to schedule an appointment. The agent confirmed availability and booked a slot."
}
}
```
### Test your endpoint
```bash theme={null}
curl -sS -X POST "https://your-server.example/webhooks/cartesia" \
-H "Content-Type: application/json" \
-H "x-webhook-secret: YOUR_WEBHOOK_SECRET" \
-d '{
"type": "call_completed",
"call_id": "ac_test_123",
"agent_id": "agent_demo",
"webhook_id": "agent_webhook_test",
"timestamp": "2026-01-01T00:00:00.000000000Z",
"call": {
"id": "ac_test_123",
"agent_id": "agent_demo",
"agent_name": "Test Agent",
"status": "completed",
"end_reason": "client_hangup",
"transcript": []
}
}'
```
For backwards compatibility, `call_completed` and `call_failed` events also include `body` (transcript array) and a top-level `end_reason`. These are deprecated — use `call.transcript` and `call.end_reason` instead.
# Scaling
Source: https://docs.cartesia.ai/line/infrastructure/scaling
## Compute Resources
Each call has access to 1GB memory and 0.5 vCPU. Contact support to increase limits.
## Concurrency
Concurrent call limits by subscription tier:
| Subscription Tier | Concurrency Limit |
| ----------------- | ----------------- |
| Free | 8 |
| Pro | 12 |
| Startup | 20 |
| Scale | 60 |
**Outbound Concurrency**
When triggering outbound calls, you are limited to one call per second, and the overall concurrency limits still apply.
# Calls API
Source: https://docs.cartesia.ai/line/integrations/calls-api
Stream audio between your application and your voice agent via WebSocket. Use this for web apps, mobile apps, or to bridge your own telephony provider.
## Quick start
```javascript theme={null}
const ws = new WebSocket(
`wss://api.cartesia.ai/agents/stream/${agentId}`,
{
headers: {
Authorization: `Bearer ${accessToken}`,
"Cartesia-Version": "2025-04-16",
},
}
);
// Initialize the stream
ws.onopen = () => {
ws.send(JSON.stringify({
event: "start",
config: { input_format: "pcm_44100" },
}));
};
// Handle agent audio
ws.onmessage = (msg) => {
const data = JSON.parse(msg.data);
if (data.event === "media_output") {
playAudio(atob(data.media.payload));
}
};
// Send user audio
function sendAudio(audioData) {
ws.send(JSON.stringify({
event: "media_input",
stream_id: streamId,
media: { payload: btoa(audioData) },
}));
}
```
Get an access token from the `/access-token` [endpoint](/api-reference/auth/access-token#body-grants-agent). See [Authenticating Client Apps](/get-started/authenticate-your-client-applications) for details.
***
## Connection
Connect to the WebSocket endpoint:
```
wss://api.cartesia.ai/agents/stream/{agent_id}
```
**Headers:**
| Header | Value |
| ------------------ | ---------------- |
| `Authorization` | `Bearer {token}` |
| `Cartesia-Version` | `2025-04-16` |
## Protocol Overview
The WebSocket connection uses JSON messages for control events and base64-encoded audio for media.
The client sends a `start` event, the server responds with `ack`, then both sides exchange audio and control events until the connection closes.
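In Python, that exchange looks roughly like the sketch below. It assumes `ws` is an already-connected WebSocket client; the helper name and single round-trip are illustrative:
```python theme={null}
import base64
import json

async def exchange_one_chunk(ws, mic_chunk: bytes):
    """Send the start event and one audio chunk, then return any agent audio received."""
    await ws.send(json.dumps({"event": "start", "config": {"input_format": "pcm_44100"}}))
    ack = json.loads(await ws.recv())  # server confirms with the `ack` event
    stream_id = ack["stream_id"]

    await ws.send(json.dumps({
        "event": "media_input",
        "stream_id": stream_id,
        "media": {"payload": base64.b64encode(mic_chunk).decode()},
    }))

    msg = json.loads(await ws.recv())
    if msg["event"] == "media_output":
        return base64.b64decode(msg["media"]["payload"])  # agent audio to play back
    return None
```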
## Client events
### Start Event
Initializes the audio stream configuration.
* `config` overrides your agent's default input audio settings
* `stream_id` is optional. If not provided, the server generates one and returns it in the `ack` event
**This must be the first message sent.**
```json theme={null}
{
"event": "start",
"stream_id": "unique_id",
"config": {
"input_format": "pcm_44100",
"voice_id": "a0e99841-438c-4a64-b679-ae501e7d6091"
},
"agent": {
"introduction": "Hello, I'm an AI assistant",
"system_prompt": "### Your Role \n You are a helpful assistant"
},
"metadata": {
"to": "user@example.com",
"from": "+1234567890"
}
}
```
**Fields:**
* `stream_id` (optional): Stream identifier. If not provided, server generates one
* `config.input_format`: Audio format for client audio input (`mulaw_8000`, `pcm_16000`, `pcm_24000`, `pcm_44100`)
* `config.voice_id` (optional): Override the agent's default TTS voice
* `agent` (optional): Allows configuring individual agent calls via API and previewing changes in introduction or prompt without publishing to production
* `metadata` (optional): Custom metadata object. These will be passed through to the agent code, but there are some special fields you can use as well:
* `to` (optional): Destination identifier for call routing (defaults to agent ID)
* `from` (optional): Source identifier for the call (defaults to "websocket")
### Media Input Event
Audio data sent from the client to the server. `payload` audio data should be base64 encoded.
```json theme={null}
{
"event": "media_input",
"stream_id": "unique_id",
"media": {
"payload": "base64_encoded_audio_data"
}
}
```
**Fields:**
* `stream_id`: Unique identifier for the Stream from the ack response
* `media.payload`: Base64-encoded audio data in the format specified in the start event
### DTMF Event
Sends DTMF (dual-tone multi-frequency) tones.
```json theme={null}
{
"event": "dtmf",
"stream_id": "example_id",
"dtmf": "1"
}
```
**Fields:**
* `stream_id`: Stream identifier
* `dtmf`: DTMF digit (0-9, \*, #)
### Custom Event
Sends custom metadata to the agent.
```json theme={null}
{
"event": "custom",
"stream_id": "example_id",
"metadata": {
"user_id": "user123",
"session_info": "custom_data"
}
}
```
**Fields:**
* `stream_id`: Stream identifier
* `metadata`: Object containing key-value pairs of custom data
## Server events
### Ack Event
Confirms stream configuration. Returns the server-generated `stream_id` if one wasn't provided in the `start` event.
```json theme={null}
{
"event": "ack",
"stream_id": "example_id",
"config": {
"input_format": "pcm_44100",
"voice_id": "a0e99841-438c-4a64-b679-ae501e7d6091"
},
"agent": {
"system_prompt": "### Your Role \n You are a helpful assistant",
"introduction": "Hello, I'm an AI assistant"
}
}
```
### Media Output Event
The server sends the agent's audio response. `payload` is base64-encoded audio data.
```json theme={null}
{
"event": "media_output",
"stream_id": "example_id",
"media": {
"payload": "base64_encoded_audio_data"
}
}
```
### Clear Event
Indicates the agent wants to clear/interrupt the current audio stream.
```json theme={null}
{
"event": "clear",
"stream_id": "example_id"
}
```
### Transfer Call Event
Indicates the agent wants to transfer the call to a phone number. The client is responsible for initiating the transfer on its telephony side.
```json theme={null}
{
"event": "transfer_call",
"stream_id": "example_id",
"transfer": {
"target_phone_number": "+1234567890"
}
}
```
**Fields:**
* `stream_id`: Stream identifier
* `transfer.target_phone_number`: E.164 phone number to transfer the call to
## Connection Management
### Inactivity Timeout
The server closes idle connections after **180 seconds**. Any client message resets the timer:
* Application messages (media\_input, dtmf, custom events)
* Standard WebSocket ping frames
* Any other valid WebSocket message
When the timeout occurs, the connection is closed with:
* **Code:** 1000 (Normal Closure)
* **Reason:** `"connection idle timeout"`
### Ping/Pong Keepalive
To prevent inactivity timeouts during periods of silence, use standard WebSocket ping frames for periodic keepalive:
```python theme={null}
# Client sends ping to reset inactivity timer
pong_waiter = await websocket.ping()
latency = await pong_waiter
```
```javascript theme={null}
// Requires the Node.js `ws` library — the browser WebSocket API does not expose ping()
setInterval(() => {
if (websocket.readyState === WebSocket.OPEN) {
websocket.ping();
}
}, 60000); // Send ping every 60 seconds
```
The server automatically responds to ping frames with pong frames and resets the inactivity timer upon receiving any message.
### Connection Close
The connection can be closed by either the client or server using WebSocket close frames.
**Client-initiated close:**
```python theme={null}
await websocket.close(code=1000, reason="session completed")
```
**Server-initiated close:**
When the agent ends the call, the server closes the connection with:
* **Code:** 1000 (Normal Closure)
* **Reason:** `"call ended by agent"` or `"call ended by agent, reason: {specific_reason}"` if additional context is available
## Best Practices
1. **Send `start` first** — The connection closes if any other event is sent before `start`.
2. **Choose the right audio format** — Match the format to your source: `mulaw_8000` for telephony, `pcm_44100` for web clients.
3. **Handle closes cleanly** — Always capture close codes and reasons for debugging and recovery.
4. **Keep the connection alive** — Send WebSocket ping frames every 60–90 seconds to avoid the 180-second inactivity timeout.
5. **Manage stream IDs** — Provide your own `stream_id` values to improve observability across systems.
6. **Recover from idle timeouts** — On `1000 / connection idle timeout`, reconnect and resend a `start` event.
# Overview
Source: https://docs.cartesia.ai/line/integrations/overview
Your Line agent needs audio input to work. Choose based on your use case.
## Telephony
Use [Cartesia Telephony](/line/integrations/telephony/phone-numbers) for managed phone numbers. Cartesia provisions numbers and handles the telephony infrastructure for inbound and outbound use cases.
You can also use your own telephony stack by connecting to the [Calls API](/line/integrations/calls-api).
Bringing your own phone numbers or CCaaS provider is on the roadmap.
## Web and Mobile Apps
Use the [Calls API](/line/integrations/calls-api) to stream audio between your application and the agent via WebSocket.
```javascript theme={null}
const ws = new WebSocket(`wss://api.cartesia.ai/agents/stream/${agentId}`);
```
This option works great for:
* Web applications with browser microphone access
* Mobile apps with native audio capture
## Pricing
| Feature | Price per Minute | Notes |
| ------------------------ | ---------------- | ------------------------------------- |
| Agent Calling | \$0.06 | Base rate for all voice agent calls |
| Telephony (add-on) | +\$0.014 | Additional when using managed numbers |
| **Total with Telephony** | **\$0.074** | Combined cost for phone-based calls |
View your usage and remaining Voice Agent credits on the [Subscription](https://play.cartesia.ai/subscription) page.
# Outbound
Source: https://docs.cartesia.ai/line/integrations/telephony/outbound-dialing
Agents can make outbound dials with an API request. Simply specify a set of target phone numbers and your agent ID to place the call.
**Compliance**
You are solely responsible for remaining compliant with relevant local regulations for dialing, including the Telephone Consumer Protection Act (TCPA).
See Cartesia's [Acceptable Use Policy](https://cartesia.ai/legal/acceptable-use.html) for more detail.
```bash Bash lines theme={null}
curl -X POST "https://api.cartesia.ai/twilio/call/outbound" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $CARTESIA_API_KEY" \
-H "Cartesia-Version: 2025-04-16" \
-d '{
"target_numbers": ["YOUR_PHONE_NUMBER"],
"agent_id": "YOUR_AGENT_ID",
"metadata": {
"customer_id": "cust_123",
"custom_prompt": "Be extra friendly"
}
}'
```
```python Python lines theme={null}
import requests
url = "https://api.cartesia.ai/twilio/call/outbound"
headers = {
"Content-Type": "application/json",
"Authorization": "Bearer YOUR_CARTESIA_API_KEY",
"Cartesia-Version": "2025-04-16"
}
payload = {
"target_numbers": ["YOUR_PHONE_NUMBER"],
"agent_id": "YOUR_AGENT_ID",
"metadata": {
"customer_id": "cust_123",
"custom_prompt": "Be extra friendly"
}
}
response = requests.post(url, headers=headers, json=payload)
print("Status Code:", response.status_code)
print("Response:", response.json())
```
```bash CLI theme={null}
# Trigger an outbound call from a deployed agent to a specific number
cartesia call
```
The `metadata` field accepts any JSON object up to 1MB. This data is passed to your agent code deployment and can be accessed to customize agent behavior per call.
You can access the metadata in your agent code via the `call_request.metadata` object in your `get_agent` function.
```python theme={null}
async def get_agent(env, call_request):
if call_request.metadata:
logger.info(f"Received metadata: {call_request.metadata}")
# Use metadata to customize agent behavior
return LlmAgent(...)
```
You are limited to one outbound dial per second; any requests faster than that will be queued.
# Phone Numbers
Source: https://docs.cartesia.ai/line/integrations/telephony/phone-numbers
Cartesia Telephony provides managed phone numbers so your agent can receive and make real phone calls without setting up your own telephony infrastructure.
## Provisioning
The platform automatically provisions a phone number for each agent when you promote to production. When an agent is deleted, the assigned phone number is released and cannot be re-assigned to another agent.
Bringing your own phone numbers or CCaaS provider is on the roadmap.
## Finding Your Phone Number
When viewing your Line agents from the Playground, you can see the provisioned phone number on the card on the Agents page, or in the header once you navigate to the agent's page.
You can also retrieve your phone number using the [CLI](/line/cli).
List all agents to see their phone numbers:
```bash theme={null}
cartesia agents ls
```
Or get detailed information for a specific agent:
```bash theme={null}
cartesia status
```
This returns agent information including name, deployments, and phone numbers.
# Introduction
Source: https://docs.cartesia.ai/line/introduction
Build intelligent, low-latency voice agents with Line.
## What is Line?
Line brings voice to your text agents with Cartesia's state-of-the-art speech models. We handle audio orchestration, deployment, and observability so you can focus on your agent's reasoning.
## Get Started
* Build, deploy, and call your first agent
* Prototype and iterate on agents without code
* Write your custom reasoning logic in code
## Audio Orchestration
Line deploys your code in seconds in our managed runtime with auto-scaling and blazing fast audio processing, using [Ink](https://cartesia.ai/ink) for speech-to-text and [Sonic](https://cartesia.ai/sonic) for text-to-speech.
## What You Can Build
Line gives you full control over your agent's behavior through code: connect any LLM, call external APIs, query databases, and handle interruptions and turn-taking.
## Developer Tools
* **[CLI](/line/cli)** – Deploy and test agents from your terminal
* **[Call logs](/line/infrastructure/observability#call-logs)** – Debug conversations and monitor performance
* **[Evaluations](/line/evaluations/metrics)** – Measure agent quality with custom metrics
* **[Deployments](/line/infrastructure/observability#deployment)** – Track versions and roll back changes
# Agents
Source: https://docs.cartesia.ai/line/sdk/agents
Agents process input events and yield output events to control the conversation.
## What is an Agent?
An Agent controls the input/output event loop. The `process` method receives events (user speech, call start, etc.) and yields responses.
An Agent can be:
1. A **class** with a `process` method
2. A **function** with the same signature `(env, event) -> AsyncIterable[OutputEvent]`
```python theme={null}
from line.events import CallStarted, UserTurnEnded, AgentSendText
class HelloAgent:
async def process(self, env, event):
if isinstance(event, CallStarted):
yield AgentSendText(text="Hello!")
elif isinstance(event, UserTurnEnded):
yield AgentSendText(text="I heard you!")
```
**How an Agent works:**
* Events arrive (user speaks, call starts, button pressed)
* SDK calls `agent.process(env, event)`
* Agent yields output events (speech, tool calls, handoffs)
* SDK handles audio, LLM calls, and state management
***
## LlmAgent
Use the built-in `LlmAgent`, which wraps 100+ LLM providers via LiteLLM:
```python theme={null}
from line.llm_agent import LlmAgent, LlmConfig
agent = LlmAgent(
model="anthropic/claude-haiku-4-5-20251001", # Or "gpt-5.2", "gemini/gemini-2.5-flash", etc.
api_key="your-api-key",
tools=[...], # Optional list of tools
config=LlmConfig(
system_prompt="You are a helpful assistant...",
introduction="Hello! How can I help you today?",
),
)
```
### Prompting
Use `system_prompt` to define your agent's personality and `introduction` for the greeting:
```python theme={null}
import os
from line import CallRequest
from line.llm_agent import LlmAgent, LlmConfig, end_call
from line.voice_agent_app import AgentEnv, VoiceAgentApp
SYSTEM_PROMPT = """You are a friendly customer service agent.
Rules:
- Be polite and empathetic
- Confirm understanding before taking action
- end_call to gracefully end conversations
"""
async def get_agent(env: AgentEnv, call_request: CallRequest):
return LlmAgent(
model="anthropic/claude-haiku-4-5-20251001",
api_key=os.getenv("ANTHROPIC_API_KEY"),
tools=[end_call],
config=LlmConfig(
system_prompt=SYSTEM_PROMPT,
introduction="Hello! How can I help you today?",
),
)
app = VoiceAgentApp(get_agent=get_agent)
if __name__ == "__main__":
app.run()
```
### Supported Models
| Provider | Model Examples |
| ------------------------------------------------------------------- | ---------------------------------------------------------------------- |
| Anthropic | `anthropic/claude-haiku-4-5-20251001`, `anthropic/claude-sonnet-4-5` |
| OpenAI | `gpt-5.4`, `gpt-5.2` |
| Google | `gemini/gemini-2.5-flash-preview-09-2025`, `gemini/gemini-3.0-preview` |
| And 100+ more via [LiteLLM](https://docs.litellm.ai/docs/providers) | |
### LlmConfig Options
| Option | Type | Description |
| ------------------- | --------------------- | ---------------------------------------------------------- |
| `system_prompt` | `str` | The system prompt defining agent behavior |
| `introduction`      | `Optional[str]`       | Message sent on call start. Set to `None` or `""` to wait for the user to speak first |
| `temperature` | `Optional[float]` | Sampling temperature |
| `max_tokens` | `Optional[int]` | Maximum tokens per response |
| `top_p` | `Optional[float]` | Nucleus sampling threshold |
| `stop` | `Optional[List[str]]` | Stop sequences |
| `seed` | `Optional[int]` | Random seed for reproducibility |
| `presence_penalty` | `Optional[float]` | Presence penalty for token generation |
| `frequency_penalty` | `Optional[float]` | Frequency penalty for token generation |
| `num_retries` | `int` | Number of retries on failure (default: 2) |
| `fallbacks` | `Optional[List[str]]` | Fallback models if primary fails |
| `timeout` | `Optional[float]` | Request timeout in seconds |
| `reasoning_effort` | `Optional[str]` | `none`, `low`, `medium`, or `high`. Dependent on provider. |
| `extra` | `Dict[str, Any]` | Provider-specific options passed through to LiteLLM |
### History Management
`LlmAgent` exposes a `history` attribute for structured control over the conversation history the LLM sees.
**Adding entries:**
```python theme={null}
# Append a user note (role="user" is the default)
agent.history.add_entry("The user prefers formal language.")
# Insert before a specific event
agent.history.add_entry("Context about the caller.", before=some_event)
```
**Replacing history segments:**
```python theme={null}
# Replace the entire history
agent.history.update(new_events)
# Replace everything from `start` onward
agent.history.update(new_events, start=some_event)
# Replace a specific segment
agent.history.update(new_events, start=start_event, end=end_event)
```
### Per-Turn Overrides
`process()` accepts keyword arguments that apply to just that turn without mutating the agent:
```python theme={null}
# Higher temperature for just this turn
await agent.process(env, event, config=LlmConfig(temperature=0.9))
# Swap a specific tool for one turn
await agent.process(env, event, tools=[custom_lookup_tool])
# Inject ephemeral context
await agent.process(env, event, context="The user is a VIP customer.")
# Completely override history for one turn
await agent.process(env, event, history=custom_history_list)
```
Only explicitly set `LlmConfig` fields take effect — unset fields fall through to the agent's stored config.
To change tools permanently (e.g., enabling `end_call` after a certain point), modify `agent.tools` directly instead of using per-turn overrides.
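For example (a sketch; `form_complete` is an illustrative flag from your own logic):
```python theme={null}
from line.llm_agent import end_call

# Permanently enable end_call once the form has been completed
if form_complete:
    agent.tools = [*agent.tools, end_call]
```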
***
## Controlling the Conversational Loop
Use **event filters** to control when your agent’s `process` method runs, and which events can interrupt it.
### Default Behavior
```python theme={null}
# Agent processes these events:
run_filter = [CallStarted, UserTurnEnded, CallEnded]
# These events interrupt the agent:
cancel_filter = [UserTurnStarted]
```
This means the agent greets on call start, responds when the user finishes speaking, and can be interrupted.
### Customizing Filters
Return a tuple from `get_agent` to override defaults:
```python theme={null}
from line.events import CallStarted, UserTurnEnded, UserTurnStarted, CallEnded
async def get_agent(env, call_request):
agent = LlmAgent(...)
# Customize behavior
run_filter = [CallStarted, UserTurnEnded, CallEnded]
cancel_filter = [UserTurnStarted]
return (agent, run_filter, cancel_filter)
```
### Common Customizations
**More responsive (process partial transcriptions):**
```python theme={null}
from line.events import CallStarted, UserTurnEnded, UserTextSent, CallEnded
run_filter = [CallStarted, UserTurnEnded, UserTextSent, CallEnded]
cancel_filter = [UserTurnStarted]
```
This makes your agent start processing before the user finishes speaking, creating a more responsive experience.
**Uninterruptible turns:**
If you want a single message to complete without being interrupted by the user, mark the output as `interruptible=False` when sending it with `AgentSendText`.
```python theme={null}
from line.events import AgentSendText
yield AgentSendText(
text="Before we continue, I need to share a quick disclaimer.",
interruptible=False,
)
```
**Custom logic with functions:**
```python theme={null}
from datetime import datetime
from line.events import CallStarted, CallEnded, UserTurnEnded, UserTurnStarted

def business_hours_only(event):
    hour = datetime.now().hour
    if isinstance(event, (CallStarted, CallEnded)):
        return True
    return isinstance(event, UserTurnEnded) and 9 <= hour < 17

async def get_agent(env, call_request):
    agent = LlmAgent(...)
    return (agent, business_hours_only, [UserTurnStarted])
```
For advanced patterns like guardrails, routing, and agent wrappers, see [Advanced Patterns](./patterns#agent-wrappers).
***
## Handling Incoming Calls
When a call arrives, you can inspect caller information and configure how your agent responds before it starts.
1. A call arrives from a web client or telephony provider
2. Your `pre_call_handler` receives a `CallRequest` with caller details
3. You return configuration (voice, language) or reject the call
4. Your `get_agent` function creates an agent using the enriched request
### Parsing the CallRequest
Contains information about the incoming call:
| Field | Type | Description |
| --------------- | ---------------- | ----------------------------------------------- |
| `call_id` | `str` | Unique identifier for the call |
| `from_` | `str` | Caller identifier (phone number or client ID) |
| `to` | `str` | Called number or agent ID |
| `agent_call_id` | `str` | Agent call ID for logging/correlation |
| `metadata` | `Optional[dict]` | Custom data passed from your client application |
| `agent` | `AgentConfig` | Prompts configured in Playground or via API |
The `agent` field contains an `AgentConfig` with:
| Field | Type | Description |
| --------------- | --------------- | ------------------------------------------------------------------ |
| `system_prompt` | `Optional[str]` | System prompt configured in Playground or via the Calls API |
| `introduction` | `Optional[str]` | Introduction message configured in Playground or via the Calls API |
### Returning a PreCallResult
Use `pre_call_handler` to set voice, language, or reject calls before your agent starts:
```python theme={null}
from line.voice_agent_app import CallRequest, PreCallResult, VoiceAgentApp
async def pre_call_handler(call_request: CallRequest):
return PreCallResult(
metadata={"tier": "premium"}, # Merged into call_request.metadata
config={
"tts": {
"voice_id": "a0e99841-438c-4a64-b679-ae501e7d6091",
"model": "sonic-3",
"language": "en",
}
}
)
app = VoiceAgentApp(get_agent=get_agent, pre_call_handler=pre_call_handler)
```
Your client application can pass metadata (user ID, language preference, account tier) in the call request. Your `pre_call_handler` reads this and configures TTS/STT accordingly.
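For example, a handler might pick the voice language from client-supplied metadata. This is a sketch: it assumes a `language` metadata key and that STT options live under an `stt` key alongside `tts`:
```python theme={null}
async def pre_call_handler(call_request: CallRequest):
    lang = (call_request.metadata or {}).get("language", "en")
    return PreCallResult(
        config={
            "tts": {"model": "sonic-3", "language": lang},
            "stt": {"language": lang},
        }
    )
```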
#### Configuration Options
**TTS Options:**
| Option | Type | Description |
| ----------------------- | ------ | ---------------------------------------------------------------------------------------- |
| `voice_id` | string | Voice identifier (UUID) |
| `model` | string | TTS model (`sonic-3`, `sonic-turbo`) |
| `language` | string | Language code (`en`, `es`, `hi`, etc.) |
| `pronunciation_dict_id` | string | [Custom pronunciation dictionary](/build-with-cartesia/sonic-3/custom-pronunciations) ID |
**STT Options:**
| Option | Type | Description |
| ---------- | ------ | ------------------------------------ |
| `language` | string | Language code for speech recognition |
#### Rejecting Calls
Return `None` to reject a call with a 403 status:
```python theme={null}
async def pre_call_handler(call_request: CallRequest):
if is_blocked(call_request.from_):
return None # Rejects with 403
return PreCallResult()
```
#### Custom Pronunciations
Use a [pronunciation dictionary](/build-with-cartesia/sonic-3/custom-pronunciations) to control how specific words are spoken:
```python theme={null}
async def pre_call_handler(call_request: CallRequest):
return PreCallResult(
config={
"tts": {
"voice_id": "a0e99841-438c-4a64-b679-ae501e7d6091",
"model": "sonic-3",
"pronunciation_dict_id": "your-dict-id",
}
}
)
```
### Accessing call metadata in your Agent logic
The `CallRequest` is available in `get_agent`:
```python theme={null}
async def get_agent(env, call_request):
# Log call information
logger.info(f"Call {call_request.call_id} from {call_request.from_}")
# Access metadata passed from your application (or added in pre_call_handler)
customer_id = call_request.metadata.get("customer_id") if call_request.metadata else None
customer_name = call_request.metadata.get("customer_name") if call_request.metadata else None
# Build a personalized system prompt using metadata
base_prompt = call_request.agent.system_prompt or "You are a helpful customer service agent."
if customer_id:
base_prompt += f"\n\nCurrent customer ID: {customer_id}"
if customer_name:
base_prompt += f"\nCustomer name: {customer_name}"
return LlmAgent(
model="gpt-5-nano",
api_key=os.getenv("OPENAI_API_KEY"),
config=LlmConfig(
system_prompt=base_prompt,
introduction=call_request.agent.introduction,
),
)
```
`LlmConfig.from_call_request()` handles the priority chain automatically:
1. `CallRequest.agent.system_prompt` value (if set)
2. Your fallback value (if provided)
3. SDK default
```python theme={null}
async def get_agent(env, call_request):
return LlmAgent(
model="anthropic/claude-haiku-4-5-20251001",
api_key=os.getenv("ANTHROPIC_API_KEY"),
tools=[end_call],
config=LlmConfig.from_call_request(
call_request,
fallback_system_prompt="You are a sales assistant.",
fallback_introduction="Hi! How can I help with your purchase?",
temperature=0.7, # Additional LlmConfig options
),
)
```
Using `CallRequest` lets you iterate on system prompts from the Playground instantly, while code handles the technical configuration and fallback defaults.
### Letting The User Speak First
Set `introduction` to an empty string to wait for the user to speak first:
```python theme={null}
config=LlmConfig.from_call_request(
call_request,
fallback_system_prompt=SYSTEM_PROMPT,
fallback_introduction="",
)
```
***
## Custom Agent Function
For advanced use cases, you can build agents from scratch as functions:
```python theme={null}
from line.events import UserTurnEnded, AgentSendText, CallStarted
async def my_agent(env, event):
if isinstance(event, CallStarted):
yield AgentSendText(text="Hello! How can I help?")
elif isinstance(event, UserTurnEnded):
user_text = event.content[0].content if event.content else ""
yield AgentSendText(text=f"You said: {user_text}")
```
## Custom Agent Class
Or as classes with state:
```python theme={null}
class GreetingAgent:
def __init__(self, greeting: str):
self.greeting = greeting
self.greeted = False
async def process(self, env, event):
if isinstance(event, CallStarted) and not self.greeted:
yield AgentSendText(text=self.greeting)
self.greeted = True
```
Most developers can use `LlmAgent` with tools rather than building custom agents from scratch! Custom agents are powerful when you need full control over the event processing logic without LLM reasoning.
# Events
Source: https://docs.cartesia.ai/line/sdk/events
Events are typed Python objects for communication between your agent and the Cartesia platform. Your agent receives **input events** from the harness and yields **output events** to control the conversation.
To learn which events trigger your agent and how to customize this behavior (e.g., responding to DTMF, preventing interruptions), see [Controlling the Conversational Loop](/line/sdk/agents#controlling-the-conversational-loop).
## Input Events
Input events are received by your agent from the Cartesia harness. All input events include an optional `history` field containing the complete conversation history. When `history` is `None`, the event is being used within a history list; when `history` contains a list, the event has the full conversation context attached.
### Call Lifecycle
| Event | Description |
| ------------- | ---------------------- |
| `CallStarted` | The call has connected |
| `CallEnded` | The call has ended |
```python theme={null}
from line.events import CallStarted, CallEnded
async def process(self, env, event):
if isinstance(event, CallStarted):
yield AgentSendText(text="Hello! How can I help?")
elif isinstance(event, CallEnded):
# Perform cleanup
pass
```
### User Turn Events
| Event | Description |
| ----------------- | --------------------------------------------------------------- |
| `UserTurnStarted` | The user started speaking (triggers interruption by default) |
| `UserTurnEnded` | The user finished speaking (triggers new agent turn by default) |
| `UserTextSent` | User text content (within `UserTurnEnded.content`) |
| `UserDtmfSent` | User pressed a DTMF button |
```python theme={null}
from line.events import UserTurnEnded, UserTextSent
if isinstance(event, UserTurnEnded):
for content in event.content:
if isinstance(content, UserTextSent):
user_message = content.content
```
### Agent Turn Events (in history)
| Event | Description |
| ------------------ | -------------------------- |
| `AgentTurnStarted` | Agent started its turn |
| `AgentTurnEnded` | Agent finished its turn |
| `AgentTextSent` | Agent text that was spoken |
| `AgentDtmfSent` | DTMF tone sent by agent |
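These events appear in the conversation history attached to input events. As a minimal sketch, a custom agent could scan that history for the most recent `AgentTextSent` to recall what it last said:
```python theme={null}
from typing import Optional

from line.events import AgentTextSent

def last_agent_utterance(event) -> Optional[str]:
    # Walk the attached history from newest to oldest and return the most
    # recent text the agent spoke, if any.
    for past_event in reversed(event.history or []):
        if isinstance(past_event, AgentTextSent):
            return past_event.content
    return None
```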
### Handoff Event
| Event | Description |
| ---------------- | ------------------------------------- |
| `AgentHandedOff` | Control transferred to a handoff tool |
### Custom Event
| Event | Description |
| ---------------- | ------------------------------------------------------------------------------------------------------------------ |
| `UserCustomSent` | Custom metadata sent from the client via the WebSocket [`custom` event](/line/integrations/calls-api#custom-event) |
Received when your client application sends a `custom` WebSocket event to the call stream. The event carries a `metadata` dict with whatever key-value pairs the client included:
```python theme={null}
from line.events import UserCustomSent
async def process(self, env, event):
if isinstance(event, UserCustomSent):
action = event.metadata.get("action")
# React to client-side triggers (e.g., button clicks, form submissions)
```
***
## Output Events
Output events are yielded by your agent to control the conversation.
### Speech
You can choose to send messages with `AgentSendText`.
```python theme={null}
from line.events import AgentSendText
yield AgentSendText(text="Hello! How can I help you today?")
```
By default, users can interrupt the agent. If you have a disclaimer or another important message that should be uninterruptible, set the `interruptible` flag to `False`.
```python theme={null}
from line.events import AgentSendText
yield AgentSendText(
text="Before we continue, I need to share a quick disclaimer.",
interruptible=False,
)
```
### Call Control
```python theme={null}
from line.events import AgentEndCall, AgentTransferCall, AgentSendDtmf
# End the call
yield AgentEndCall()
# Transfer to another number
yield AgentTransferCall(target_phone_number="+14155551234")
# Send DTMF tone
yield AgentSendDtmf(button="1")
```
### Dynamic Configuration
Update call settings (voice, pronunciation, language) mid-conversation using `AgentUpdateCall`:
```python theme={null}
from line.events import AgentUpdateCall
# Change voice
yield AgentUpdateCall(voice_id="5ee9feff-1265-424a-9d7f-8e4d431a12c7")
# Change pronunciation dictionary
yield AgentUpdateCall(pronunciation_dict_id="dict-123")
# Change language
yield AgentUpdateCall(language="es")
# Update multiple settings at once
yield AgentUpdateCall(
voice_id="spanish-voice-id",
pronunciation_dict_id="spanish-dict-id",
language="es"
)
```
**AgentUpdateCall Parameters:**
| Field | Type | Description |
| ----------------------- | ------------------------ | --------------------------------------------------------------------------------- |
| `type` | `Literal["update_call"]` | Event type identifier (automatically set) |
| `voice_id` | `Optional[str]` | Updates the agent's voice |
| `pronunciation_dict_id` | `Optional[str]` | Updates the pronunciation dictionary |
| `language` | `Optional[str]` | Updates the language used on speech-to-text (STT) and text-to-speech (TTS) models |
All fields are optional—only set fields are updated.
### Tool Events
These are emitted by `LlmAgent` to track tool execution:
```python theme={null}
from line.events import AgentToolCalled, AgentToolReturned
# Emitted when LLM calls a tool
yield AgentToolCalled(
tool_call_id="call_123",
tool_name="get_weather",
tool_args={"city": "San Francisco"}
)
# Emitted when tool returns
yield AgentToolReturned(
tool_call_id="call_123",
tool_name="get_weather",
tool_args={"city": "San Francisco"},
result="72°F and sunny"
)
```
### Logging
```python theme={null}
from line.events import LogMetric, LogMessage
# Log a metric
yield LogMetric(name="response_time_ms", value=150)
# Log a message
yield LogMessage(
name="order_lookup",
level="info",
message="Found order #12345",
metadata={"order_id": "12345"}
)
```
### Custom Events
Send arbitrary metadata from your agent to the harness:
```python theme={null}
from line.events import AgentSendCustom
yield AgentSendCustom(metadata={"action": "show_form", "form_id": "checkout"})
```
Pair with [`UserCustomSent`](#custom-event) for bidirectional metadata exchange.
### Voice & Language Control
Change voice or speech recognition language mid-call:
```python theme={null}
from line.events import AgentUpdateCall
# Switch to Spanish voice and speech recognition
yield AgentUpdateCall(voice_id="spanish-voice-id", language="es")
# Enable multilingual auto-detect mode
yield AgentUpdateCall(language="multilingual")
```
The `language` field sets the ASR (speech recognition) language. Pass any language code supported by [Ink STT](/build-with-cartesia/stt-models), or `"multilingual"` for automatic language detection.
***
## Event History
All input events include an optional `history` field containing the conversation history. When `history` is `None`, the event is inside a history list; when it contains a list, full conversation context is attached. `LlmAgent` handles this automatically—you only need to understand history if building custom agents.
### Accessing History
```python theme={null}
from line.events import UserTextSent, AgentTextSent
async def process(self, env, event):
for past_event in event.history:
if isinstance(past_event, UserTextSent):
print(f"User said: {past_event.content}")
elif isinstance(past_event, AgentTextSent):
print(f"Agent said: {past_event.content}")
```
Events in the history list have `history=None` to avoid redundant nesting. The event types are the same as regular input events:
| Event Type | Description |
| ------------------ | ------------------------- |
| `CallStarted` | Call began |
| `UserTurnStarted` | User started speaking |
| `UserTextSent` | User's transcribed speech |
| `UserDtmfSent` | User's DTMF button press |
| `UserTurnEnded` | User finished speaking |
| `AgentTurnStarted` | Agent started responding |
| `AgentTextSent` | Agent's spoken text |
| `AgentDtmfSent` | Agent's DTMF tone |
| `AgentTurnEnded` | Agent finished responding |
| `CallEnded` | Call ended |
`LlmAgent` automatically converts the event history to LLM messages:
* **User messages**: From `UserTextSent` events
* **Assistant messages**: From `AgentTextSent` events
* **Tool calls**: From `AgentToolCalled` and `AgentToolReturned` events
This means the LLM sees full context including previous tool calls and results, enabling it to reference that information without making redundant API calls.
If building a custom agent (not using `LlmAgent`), you can use history for context, summarization, or pattern detection:
```python theme={null}
class CustomAgent:
async def process(self, env, event):
user_turns = sum(
1 for e in event.history
if isinstance(e, UserTurnEnded)
)
if user_turns > 5:
yield AgentSendText(text="We've been chatting for a while! Is there anything else I can help with?")
```
# SDK Overview
Source: https://docs.cartesia.ai/line/sdk/overview
The [Line SDK](https://github.com/cartesia-ai/line/) is a Python framework for building voice agents. It handles audio infrastructure, speech recognition, and conversation flow.
```bash theme={null}
uv add cartesia-line
```
New to Line? Start with the [Quickstart](/line/start-building/quickstart) to build and deploy your first agent.
## Core Concepts
| Component | Purpose |
| --------------------------------------------------- | ----------------------------------------------------------------------- |
| [`Agent`](./agents) | Controls the input/output event loop via a `process` method |
| [`LlmAgent`](./agents#llmagent) | Built-in agent that wraps 100+ LLM providers via LiteLLM |
| [`Tools`](./tools) | Functions your agent can call—database lookups, handoffs, web search |
| [`VoiceAgentApp`](./agents#handling-incoming-calls) | HTTP server that connects your agent to Cartesia's audio infrastructure |
```python theme={null}
import os
from line.llm_agent import LlmAgent, LlmConfig, end_call
from line.voice_agent_app import VoiceAgentApp
async def get_agent(env, call_request):
return LlmAgent(
model="anthropic/claude-haiku-4-5-20251001",
api_key=os.getenv("ANTHROPIC_API_KEY"),
tools=[end_call],
config=LlmConfig(
system_prompt="You are a helpful assistant.",
introduction="Hello! How can I help you today?",
),
)
app = VoiceAgentApp(get_agent=get_agent)
```
The agent speaks the `introduction` when a call starts, then responds to whatever the user says using the LLM.
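As in the [Quickstart](/line/start-building/quickstart), you can run this server locally by adding the standard entrypoint to the same file:
```python theme={null}
if __name__ == "__main__":
    app.run()  # serves the agent over HTTP; the Quickstart uses the PORT environment variable to pick the port
```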
## Features
* **Real-time interruption support** — Handles audio interruptions and turn-taking out-of-the-box.
* **Tool calling** — Connect to databases, APIs, and external services
* **Multi-agent handoffs** — Route conversations between specialized agents
* **Web search** — Built-in tool for real-time information lookup
## Add Capabilities
### Look up information
```python theme={null}
from typing import Annotated
from line.llm_agent import loopback_tool
@loopback_tool
async def get_order_status(ctx, order_id: Annotated[str, "The order ID"]):
"""Look up an order's current status."""
order = await db.get_order(order_id)
return f"Order {order_id} is {order.status}"
```
### Handoff to another agent
```python theme={null}
from line.llm_agent import LlmAgent, LlmConfig, agent_as_handoff, end_call
spanish_agent = LlmAgent(
model="gpt-5-nano",
api_key=os.getenv("OPENAI_API_KEY"),
tools=[end_call],
config=LlmConfig(
system_prompt="You speak only in Spanish.",
introduction="¡Hola! ¿Cómo puedo ayudarte?",
),
)
main_agent = LlmAgent(
model="anthropic/claude-haiku-4-5-20251001",
api_key=os.getenv("ANTHROPIC_API_KEY"),
tools=[
end_call,
agent_as_handoff(
spanish_agent,
name="transfer_to_spanish",
description="Transfer when user requests Spanish.",
),
],
config=LlmConfig(...),
)
```
### Search the web
```python theme={null}
from line.llm_agent import end_call, web_search
agent = LlmAgent(
tools=[end_call, web_search], # Add built-in web search
...
)
```
See [Tools](./tools) for the full guide.
## Code Examples
| Example | Description |
| ----------------------------------------------------------------------------------------- | -------------------------------------------------- |
| [Basic Chat](https://github.com/cartesia-ai/line/tree/main/examples/basic_chat) | Simple conversational agent |
| [Chat Supervisor](https://github.com/cartesia-ai/line/tree/main/examples/chat_supervisor) | Fast chat model with powerful reasoning escalation |
| [Form Filler](https://github.com/cartesia-ai/line/tree/main/examples/form_filler) | Collect structured data via conversation |
| [Multi-Agent](https://github.com/cartesia-ai/line/tree/main/examples/transfer_agent) | Hand off between specialized agents |
### Integrations
| Integration | Description |
| --------------------------------------------------------------------------------------------- | ------------------------ |
| [Exa Web Research](https://github.com/cartesia-ai/line/tree/main/example_integrations/exa) | Real-time web search |
| [Browserbase](https://github.com/cartesia-ai/line/tree/main/example_integrations/browserbase) | Fill web forms via voice |
## Next Steps
Configure prompts, LLMs, and conversation flow
Add custom tools and multi-agent handoffs
# Advanced Patterns
Source: https://docs.cartesia.ai/line/sdk/patterns
Patterns for production voice agents: observability, tool design, multi-agent systems, and guardrails.
## Complete Example: Multi-Agent Customer Service
This example combines prompting, all three tool types, and multi-agent handoffs:
```python theme={null}
import os
from typing import Annotated
from line import CallRequest
from line.llm_agent import (
LlmAgent, LlmConfig, loopback_tool, passthrough_tool,
agent_as_handoff, end_call
)
from line.events import AgentSendText, AgentTransferCall
from line.voice_agent_app import AgentEnv, VoiceAgentApp
# Loopback tool: Fetch order info for LLM to contextualize
@loopback_tool
async def get_order_status(ctx, order_id: Annotated[str, "The order ID"]):
"""Look up order status by ID."""
order = await db.get_order(order_id)
return f"Order {order_id}: {order.status}, delivers {order.delivery_date}"
# Passthrough tool: Deterministic transfer action
@passthrough_tool
async def transfer_to_human(ctx):
"""Transfer to a human agent."""
yield AgentSendText(text="Let me connect you with a team member who can help further.")
yield AgentTransferCall(target_phone_number="+18005551234")
SYSTEM_PROMPT = """You are a friendly customer service agent for Acme Corp.
You can:
- Look up order status using get_order_status
- Transfer to a human agent using transfer_to_human
- Transfer to Spanish support using transfer_to_spanish
- End calls politely using end_call
Rules:
- Always confirm the order ID before looking it up
- Offer to transfer to a human if you can't resolve the issue
- Transfer to Spanish support if the user speaks Spanish or requests it
- Be empathetic and professional
"""
async def get_agent(env: AgentEnv, call_request: CallRequest):
# Spanish-speaking specialist agent
spanish_agent = LlmAgent(
model="gpt-5-nano",
api_key=os.getenv("OPENAI_API_KEY"),
tools=[get_order_status, transfer_to_human, end_call],
config=LlmConfig(
system_prompt="Eres un agente de servicio al cliente amigable para Acme Corp. Habla solo en español.",
introduction="¡Hola! Gracias por llamar a Acme Corp. ¿Cómo puedo ayudarte hoy?",
),
)
# Main English-speaking agent with handoff capability
return LlmAgent(
model="anthropic/claude-haiku-4-5-20251001",
api_key=os.getenv("ANTHROPIC_API_KEY"),
tools=[
get_order_status,
transfer_to_human,
agent_as_handoff(
spanish_agent,
handoff_message="Transferring you to our Spanish-speaking team...",
name="transfer_to_spanish",
description="Transfer to Spanish support when user speaks Spanish or requests it.",
),
end_call,
],
config=LlmConfig(
system_prompt=SYSTEM_PROMPT,
introduction="Hi! Thanks for calling Acme Corp. How can I help you today?",
),
)
app = VoiceAgentApp(get_agent=get_agent)
if __name__ == "__main__":
app.run()
```
***
## Observability
### Log Metrics
Track performance and business metrics:
```python theme={null}
from line.events import LogMetric, LogMessage
@loopback_tool
async def process_order(ctx, order_id: Annotated[str, "Order ID"]):
"""Process a customer order."""
import time
start = time.time()
result = await api.process_order(order_id)
# Log timing metric
yield LogMetric(name="order_processing_ms", value=(time.time() - start) * 1000)
# Log business event
yield LogMessage(
name="order_processed",
level="info",
message=f"Processed order {order_id}",
metadata={"status": result.status}
)
return f"Order {order_id} processed: {result.status}"
```
### Built-in LLM Agent Metrics
`LlmAgent` automatically emits three timing metrics on every turn — no code needed:
| Metric | Description |
| -------------------- | -------------------------------------------------------------------------------------- |
| `llm_first_chunk_ms` | Time from start of response generation to first chunk (text or tool call) from the LLM |
| `llm_first_text_ms` | Time from start of response generation to first text chunk |
| `agent_turn_ms` | Total agent processing time for the turn |
***
## Tool Patterns
### Validation in Tools
Validate inputs before processing:
```python theme={null}
@loopback_tool
async def book_appointment(
ctx,
date: Annotated[str, "Date in YYYY-MM-DD format"],
time: Annotated[str, "Time in HH:MM format"]
):
"""Book an appointment."""
from datetime import datetime
try:
dt = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M")
except ValueError:
return "Invalid date or time format. Please use YYYY-MM-DD and HH:MM."
if dt < datetime.now():
return "Cannot book appointments in the past."
# Proceed with booking
return f"Appointment booked for {dt.strftime('%B %d at %I:%M %p')}"
```
### Async Operations in Tools
Handle long-running operations with proper timeout handling:
```python theme={null}
import asyncio
@loopback_tool
async def search_inventory(ctx, query: Annotated[str, "Search query"]):
"""Search inventory with timeout protection."""
try:
result = await asyncio.wait_for(
inventory_api.search(query),
timeout=5.0
)
return f"Found {len(result.items)} items matching '{query}'"
except asyncio.TimeoutError:
return "Search is taking longer than expected. Please try a more specific query."
```
### Error Handling
Handle errors gracefully in tools:
```python theme={null}
@loopback_tool
async def get_account_info(ctx, account_id: Annotated[str, "Account ID"]):
"""Look up account information."""
try:
account = await api.get_account(account_id)
return f"Account {account_id}: Balance ${account.balance:.2f}"
except AccountNotFoundError:
return f"Account {account_id} not found."
except Exception as e:
logger.error(f"Error fetching account: {e}")
return "Sorry, I couldn't retrieve that account information right now."
```
***
## Agent Wrappers
Agent wrappers add cross-cutting behavior (logging, validation, routing) without modifying the underlying agent.
### Guardrails: Safety and Content Filtering
Wrappers are ideal for implementing guardrails that filter unsafe content in both directions:
```python theme={null}
class GuardrailsAgent:
def __init__(self, inner_agent, safety_api):
self.inner = inner_agent
self.safety_api = safety_api
async def process(self, env, event):
# Pre-processing: Check user input for unsafe content
if isinstance(event, UserTurnEnded):
user_text = event.content[0].content if event.content else ""
if await self.safety_api.is_unsafe(user_text):
yield AgentSendText(text="I'm here to help with appropriate requests. Let's keep our conversation respectful.")
return
# Post-processing: Check agent output for safety issues
async for output in self.inner.process(env, event):
if isinstance(output, AgentSendText):
if await self.safety_api.is_unsafe(output.text):
yield LogMessage(
name="safety_violation",
level="warning",
message=f"Blocked unsafe output: {output.text[:100]}..."
)
yield AgentSendText(text="I apologize, but I can't provide that information.")
continue
yield output
```
Common guardrail patterns:
* Content safety filtering (toxicity, hate speech, PII)
* Rate limiting and abuse prevention
* Compliance checks (HIPAA, financial regulations)
* Brand safety (off-brand responses)
### Routing Between Multiple Agents
Dynamically switch between specialized agents based on conversation context:
```python theme={null}
class RouterAgent:
def __init__(self, default_agent, specialists: dict):
self.default = default_agent
self.specialists = specialists
self.current = default_agent
async def process(self, env, event):
# Switch agent based on user input
if isinstance(event, UserTurnEnded):
user_text = event.content[0].content if event.content else ""
if "billing" in user_text.lower():
self.current = self.specialists.get("billing", self.default)
elif "technical" in user_text.lower():
self.current = self.specialists.get("technical", self.default)
async for output in self.current.process(env, event):
yield output
```
Use with `LlmAgent`:
```python theme={null}
async def get_agent(env, call_request):
return RouterAgent(
default_agent=LlmAgent(
model="gpt-5-nano",
api_key=os.getenv("OPENAI_API_KEY"),
config=LlmConfig(system_prompt="You are a helpful assistant..."),
),
specialists={
"billing": LlmAgent(
model="gpt-5-nano",
api_key=os.getenv("OPENAI_API_KEY"),
config=LlmConfig(system_prompt="You are a billing specialist..."),
),
"technical": LlmAgent(
model="anthropic/claude-haiku-4-5-20251001",
api_key=os.getenv("ANTHROPIC_API_KEY"),
config=LlmConfig(system_prompt="You are a technical support specialist..."),
),
}
)
```
### Best Practices
Keep wrappers focused on a single responsibility. Use `async for` and `yield` to preserve streaming. Stack simple wrappers rather than building one complex one.
```python theme={null}
# Composable wrappers
agent = LoggingWrapper(
ValidationWrapper(
LlmAgent(...)
)
)
```
***
## Example Implementations
Full working examples demonstrating these patterns:
| Example | Pattern | Description |
| --------------------------------------------------------------------------------------------- | ------------------- | ------------------------------------------------------ |
| [Form Filler](https://github.com/cartesia-ai/line/tree/main/examples/form_filler) | Stateful tools | Walk users through a YAML-defined form with validation |
| [Multi-Agent Transfer](https://github.com/cartesia-ai/line/tree/main/examples/transfer_agent) | `agent_as_handoff` | English/Spanish agent handoff |
| [Chat Supervisor](https://github.com/cartesia-ai/line/tree/main/examples/chat_supervisor) | Background research | Separate agents for talking and longer-thinking |
# Tools
Source: https://docs.cartesia.ai/line/sdk/tools
Tools let your agent perform actions and retrieve information. The SDK supports three tool paradigms that differ in how they affect conversation flow.
## Defining Tools
Any properly annotated function can be a tool. The SDK uses the function's docstring as the description and type annotations for parameters:
```python theme={null}
from typing import Annotated
async def get_weather(
ctx,
city: Annotated[str, "The city to check weather for"],
units: Annotated[str, "celsius or fahrenheit"] = "fahrenheit"
):
"""
Look up the current weather in a given city.
"""
return f"72°F and sunny in {city}"
```
The first parameter of every tool must be `ctx` (the tool context). This provides access to conversation state and is required for forward compatibility. Your tool parameters follow after `ctx`.
***
## Tool Types
Plain functions passed to `tools` are automatically wrapped as loopback tools. Use decorators (`@loopback_tool`, `@passthrough_tool`, `@handoff_tool`) for explicit control.
### Loopback Tools (`@loopback_tool`)
The default behavior. The tool's result is sent back to the LLM, which can then continue generating a response.
```python theme={null}
from line.llm_agent import loopback_tool
@loopback_tool
async def get_account_balance(ctx, account_id: Annotated[str, "The account ID"]):
"""Look up the balance for a customer account."""
balance = await api.get_balance(account_id)
return f"${balance:.2f}"
```
**Use for:** Information retrieval, calculations, API queries.
### Passthrough Tools (`@passthrough_tool`)
Output events go directly to the user, bypassing the LLM. Use for deterministic actions.
```python theme={null}
from line.llm_agent import passthrough_tool
from line.events import AgentSendText, AgentEndCall
@passthrough_tool
async def end_call_with_message(ctx, message: Annotated[str, "Goodbye message"]):
"""End the call with a custom goodbye message."""
yield AgentSendText(text=message)
yield AgentEndCall()
```
**Use for:** Call control (`EndCall`, `TransferCall`, `SendDtmf`), deterministic responses.
### Handoff Tools (`@handoff_tool`)
Transfers control to another handler. All future events are routed to the handoff target instead of the original agent.
```python theme={null}
from typing import Annotated
from line.llm_agent import handoff_tool
from line.events import AgentHandedOff, AgentSendText, UserTurnEnded, AgentEndCall
@handoff_tool
async def run_satisfaction_survey(
ctx,
customer_name: Annotated[str, "The customer's name"],
event
):
"""Hand off to a customer satisfaction survey at the end of the call."""
if isinstance(event, AgentHandedOff):
# First call - send introduction
yield AgentSendText(
text=f"Thank you for your call, {customer_name}. "
"Please stay on the line for a brief satisfaction survey. "
"On a scale of 1 to 5, how would you rate your experience today?"
)
return
# Subsequent calls - handle survey responses
if isinstance(event, UserTurnEnded):
user_response = event.content[0].content if event.content else ""
yield AgentSendText(text=f"You rated us {user_response}. Thank you for your feedback!")
yield AgentEndCall()
```
**Use for:** Custom multi-step flows, specialized handlers with their own logic.
When using a handoff tool, the `event` parameter receives different values depending on timing:
* **First call**: `event` is `AgentHandedOff` — use this to send a transition message
* **Subsequent calls**: `event` is the actual `InputEvent` (`UserTurnEnded`, etc.)
Once a handoff occurs, the original agent no longer receives events. The handoff tool function handles all future conversation turns.
To hand off to another `LlmAgent`, use the [`agent_as_handoff`](#agent_as_handoff) helper instead of writing a raw `@handoff_tool`. It handles the delegation automatically.
***
## Built-in Tools
```python theme={null}
from line.llm_agent import end_call, send_dtmf, transfer_call, web_search
agent = LlmAgent(
model="anthropic/claude-haiku-4-5-20251001",
api_key=os.getenv("ANTHROPIC_API_KEY"),
tools=[end_call, send_dtmf, transfer_call, web_search],
config=LlmConfig(...),
)
```
| Tool | Description | When to Use |
| --------------- | ------------------------------------------ | ------------------------------------------------------------- |
| `end_call`      | Ends the call                              | User says "goodbye" or the agent's objective has been met     |
| `send_dtmf`     | Sends a DTMF tone                          | Navigating IVR phone menus or entering extensions              |
| `transfer_call` | Transfers to another number (E.164 format) | Escalating to human agents, routing to departments            |
| `web_search`    | Searches the web for real-time info        | Current events, live prices, recent news the LLM doesn't know |
**Examples:**
```python theme={null}
# End call: Let the LLM decide when conversation is complete
tools=[end_call] # LLM calls this when user says "thanks, bye!"
# Transfer: Route to human support
tools=[transfer_call] # LLM calls transfer_call(target_phone_number="+18005551234")
# Web search with custom context size
tools=[web_search(search_context_size="high")] # More context for complex queries
```
### `end_call`
Ends the current call and disconnects. The actual hangup occurs after the agent's final speech completes, so the user hears the full goodbye message before disconnection.
```python theme={null}
from line.llm_agent import LlmAgent, LlmConfig, end_call
agent = LlmAgent(
model="anthropic/claude-haiku-4-5-20251001",
api_key=os.getenv("ANTHROPIC_API_KEY"),
tools=[end_call],
config=LlmConfig(...),
)
```
By default, `end_call` uses a conservative policy that only ends the call when:
* The user's objective is fully complete
* The user explicitly says goodbye
* The agent has said a natural goodbye
#### Custom Description
We recommend providing a custom description tailored to your use case. The description **fully replaces** the default—it is not appended—so include complete instructions with explicit do/don't guidance.
```python theme={null}
from line.llm_agent import end_call
# Restaurant reservation agent
tools=[end_call(description="""Ends the call and disconnects.
Call when ALL of the following are true:
- The reservation is confirmed with date, time, party size, and name.
- You have repeated the reservation details back to the guest.
- The guest confirms the details are correct or says goodbye.
Do not call when:
- The guest asks to modify the reservation.
- Details are missing or unconfirmed.
- The guest says 'okay' or 'thanks' without an explicit goodbye.
If unsure, ask: 'Is there anything else I can help you with for your reservation?'
""")]
# Order confirmation agent
tools=[end_call(description="""Ends the call and disconnects.
Call when ALL of the following are true:
- The order is placed and confirmed.
- You have provided the order number and estimated delivery time.
- The customer acknowledges with a goodbye phrase.
Do not call when:
- The customer has questions about their order.
- Payment has not been confirmed.
- The customer says 'got it' without saying goodbye.
""")]
```
| Parameter | Type | Description |
| ------------- | --------------- | ----------------------------------------------------------------------------------------------------------- |
| `description` | `Optional[str]` | Fully replaces the default description. Include complete instructions for when the LLM should end the call. |
### `agent_as_handoff`
Creates a handoff tool from another `Agent`—the easiest way to implement multi-agent workflows.
```python theme={null}
from line.llm_agent import LlmAgent, LlmConfig, agent_as_handoff, end_call, UpdateCallConfig
spanish_agent = LlmAgent(
model="gpt-5-nano",
api_key=os.getenv("OPENAI_API_KEY"),
tools=[end_call],
config=LlmConfig(
system_prompt="You speak only in Spanish.",
introduction="¡Hola! ¿Cómo puedo ayudarte?",
),
)
main_agent = LlmAgent(
model="anthropic/claude-haiku-4-5-20251001",
api_key=os.getenv("ANTHROPIC_API_KEY"),
tools=[
end_call,
agent_as_handoff(
spanish_agent,
handoff_message="Transferring to Spanish support...",
update_call=UpdateCallConfig(
voice_id="spanish-voice-id",
pronunciation_dict_id="spanish-pronunciation-dict-id"
),
name="transfer_to_spanish",
description="Use when user requests Spanish.",
),
],
config=LlmConfig(...),
)
```
| Parameter | Type | Description |
| ----------------- | ---------------------------- | --------------------------------------------------------------------------------------- |
| `agent` | `Agent` | The agent to hand off to |
| `handoff_message` | `Optional[str]` | Message spoken before the handoff |
| `update_call` | `Optional[UpdateCallConfig]` | Optional config to update call settings (voice, pronunciation, language) before handoff |
| `name` | `Optional[str]` | Tool name for the LLM |
| `description` | `Optional[str]` | When the LLM should use this tool |
When called, `agent_as_handoff` automatically sends the handoff message, updates the call settings if specified, triggers the new agent's introduction, and routes all future events to it.
See [Advanced Patterns](/line/sdk/patterns) for a complete multi-agent example with loopback, passthrough, and handoff tools.
***
## Long-Running Tools
By default, tool calls are terminated when the agent is interrupted (though any reasoning and tool call response values already produced are preserved for use in the next generation).
For tools that are expected to take a long time to complete, set `is_background=True`. The tool will continue running in the background until completion regardless of interruptions, then loop back to the LLM to produce a response.
```python theme={null}
from typing import Annotated
from line.llm_agent import loopback_tool
@loopback_tool(is_background=True)
async def search_database(ctx, query: Annotated[str, "Search query"]) -> str:
"""Search the database - may take several seconds."""
results = await slow_database_search(query)
return format_results(results)
@loopback_tool(is_background=True)
async def generate_report(ctx, report_type: Annotated[str, "Type of report"]) -> str:
"""Generate a detailed report - runs in background."""
report = await compile_report(report_type)
return report
```
Background tools are useful when:
* The operation may take longer than typical user patience (e.g., complex searches, report generation)
* You want the user to be able to speak while the operation completes
* The result should be delivered even if the user interrupts with another question
# Agent Builder
Source: https://docs.cartesia.ai/line/start-building/agent-builder
Prototype voice agents in the Playground. Test prompts, configure voices, and deploy in seconds.
## Create your agent
Go to [play.cartesia.ai/agents](https://play.cartesia.ai/agents) and select **Start in Playground**.
Customize your agent's behavior, voice, and greeting.
**System Prompt** — Define your agent's role and guidelines. You can also provide a natural language description of your agent and the platform will generate a structured system prompt.
**Voice** — Choose from Cartesia's voice library. Preview voices before selecting.
**Initial Message** — Set the greeting your agent speaks when calls start. Check **Skip agent introduction** to have the agent wait for the user to speak first.
**Background Sound** — Add ambient audio for call center atmospheres or office environments.
**Preview** changes before publishing.
## Continue building in code
Connect your Playground agent to GitHub to customize with code.
On your agent page, click **Connect to GitHub**. Authorize Cartesia to create a repository.
```bash theme={null}
git clone https://github.com/your-org/your-agent.git
cd your-agent
```
```bash theme={null}
uv pip install .
```
Open `main.py` to add tools, custom logic, or modify the prompt.
Push to deploy your changes.
```bash theme={null}
git push
```
## Next steps
Build agents with the SDK
Prompts, voices, and pre-call configuration
# Quickstart
Source: https://docs.cartesia.ai/line/start-building/quickstart
Build an agent, deploy it, and make your first call within minutes.
## Prerequisites
* A free Cartesia account ([sign up here](https://play.cartesia.ai))
* Python 3.9+
* An LLM API key (Anthropic, OpenAI, Google, etc.)
* [uv](https://docs.astral.sh/uv/) (Python package and project manager)
## Install the CLI
```bash theme={null}
curl -fsSL https://cartesia.sh | sh
cartesia auth login
```
## Install uv
Install [uv](https://docs.astral.sh/uv/), a fast Python package manager to manage dependencies and virtual environments.
```bash theme={null}
curl -LsSf https://astral.sh/uv/install.sh | sh
```
## Create your agent
Create a new project and install dependencies. uv will automatically set up a virtual environment and manage your packages.
```bash theme={null}
uv init my-voice-agent && cd my-voice-agent
uv add cartesia-line
```
Create `main.py`:
```python theme={null}
import os
from line.llm_agent import LlmAgent, LlmConfig, end_call
from line.voice_agent_app import VoiceAgentApp
async def get_agent(env, call_request):
return LlmAgent(
model="anthropic/claude-haiku-4-5-20251001", # Or "gpt-5-nano", "gemini/gemini-2.5-flash", etc.
api_key=os.getenv("ANTHROPIC_API_KEY"),
tools=[end_call],
config=LlmConfig(
system_prompt="You are a helpful assistant.",
introduction="Hello! How can I help you today?",
),
)
app = VoiceAgentApp(get_agent=get_agent)
if __name__ == "__main__":
app.run()
```
## Test locally
Start your agent server.
```bash theme={null}
ANTHROPIC_API_KEY=your-api-key PORT=8000 uv run python main.py
```
In a separate terminal, chat with your agent by running:
```bash theme={null}
cartesia chat 8000
```
This lets you test your agent's reasoning before deploying.
## Deploy
Link your project and deploy.
```bash theme={null}
cartesia init # Choose "Create new" and name your agent
cartesia deploy
```
Your agent deploys in under 30 seconds on Cartesia's managed runtime.
## Set environment variables
Configure your API key for the deployed agent.
```bash theme={null}
cartesia env set ANTHROPIC_API_KEY=your-api-key
```
Or import from a `.env` file:
```bash theme={null}
cartesia env set --from .env
```
## Make a call
Call your agent from your phone.
```bash theme={null}
cartesia call +1XXXXXXXXXX
```
Or visit the [Playground](https://play.cartesia.ai/agents) to call from the web.
## Next steps
Connect databases, APIs, and external services
Customize system prompts and conversation flow
Connect web clients via WebSocket
Build agents visually in the Playground
# LLMs documentation files
Source: https://docs.cartesia.ai/tools/ai/llms-txt
Machine-readable index files for assistants and tooling that ingest Cartesia documentation.
Plain-text, machine-readable exports of the documentation.
Designed for systems that fetch and parse docs over HTTP, such as agents, MCP servers, and crawlers.
## Endpoints
Both endpoints are public over HTTPS and require no API key. Fetch directly inside a tool or pipeline: agents with web fetch or read URL tools, MCP servers, or custom crawlers.
**[llms.txt](https://docs.cartesia.ai/llms.txt)** (default)\
Condensed index aligned with the `llms.txt` convention ([llmstxt.org](https://llmstxt.org/)). Use it when your system can fetch specific docs over HTTP.
* Smaller context
* Faster processing
* Better for retrieval workflows
**[llms-full.txt](https://docs.cartesia.ai/llms-full.txt)**\
Fuller coverage of the docs site. Use it when your system needs broader upfront content in one fetch. Consumes more context and tokens when fed to your LLM.
* More URLs and text
* Higher recall
* Better for indexing and batch jobs
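For example, a pipeline can fetch either file directly over HTTPS; a minimal sketch using the `requests` library:
```python theme={null}
import requests

# Both files are public over HTTPS and need no API key.
index = requests.get("https://docs.cartesia.ai/llms.txt", timeout=30)
index.raise_for_status()
print(index.text[:500])  # first part of the condensed index
```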
# MCP
Source: https://docs.cartesia.ai/tools/ai/mcp
The **`cartesia-mcp`** package exposes Cartesia through the **Model Context Protocol (MCP)** so MCP-capable clients—**Cursor**, **Claude Desktop**, **OpenAI Agents**, and similar—can list voices, run **TTS**, and use other Cartesia-backed tools via the protocol instead of custom scripts.
You need a [Cartesia API key](https://play.cartesia.ai/keys). The [PyPI package](https://pypi.org/project/cartesia-mcp/) currently requires **Python 3.13 or newer**; confirm the supported version on PyPI before you install.
**Installation**, the **uvx** shortcut, and **MCP client configuration** (executable path, environment variables, Claude Desktop or Cursor) are documented in the **[cartesia-mcp](https://github.com/cartesia-ai/cartesia-mcp)** README so setup stays in sync with releases.
The official Cartesia MCP Server
# JavaScript/TypeScript
Source: https://docs.cartesia.ai/tools/client-libraries/javascript-typescript
The library that powers the Cartesia Playground.
The official TS/JS client for the Cartesia API.
# Python
Source: https://docs.cartesia.ai/tools/client-libraries/python
The official Python library for the Cartesia API.
# API Conventions
Source: https://docs.cartesia.ai/use-the-api/api-conventions
All endpoints use HTTPS. HTTP is not supported. API keys used to call the API over HTTP may be subject to automatic rotation.
All API requests use the following base URL: `https://api.cartesia.ai`. (For WebSockets the corresponding protocol is `wss://`.)
### Always send a `Cartesia-Version` header
Each request you send our API should have a `Cartesia-Version` header containing the date (`YYYY-MM-DD`) when you tested your integration. For WebSockets, you can alternately use the `?cartesia_version` query parameter, which will take precedence.
This will help us provide you with timely deprecation notices and enable us to provide automatic backwards compatibility where possible.
For a given `Cartesia-Version`, we will preserve existing input and output fields, but we may make non-breaking changes, such as:
1. Add optional request fields.
2. Add additional response fields.
3. Change conditions for specific error types
4. Add variants to enum-like output values.
Our versioning scheme is inspired by the [Anthropic API](https://docs.anthropic.com/en/api/versioning).
### Use API keys when making requests from a server
Create a new API key at [play.cartesia.ai/keys](https://play.cartesia.ai/keys). Include your API key as a bearer token in the `Authorization` header of your requests.
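A minimal sketch of a server-side request with both the version and authorization headers set, using the `requests` library; the endpoint path is a placeholder and `CARTESIA_API_KEY` is an assumed environment variable name:
```python theme={null}
import os
import requests

BASE_URL = "https://api.cartesia.ai"

headers = {
    # Date (YYYY-MM-DD) when you last tested your integration
    "Cartesia-Version": "2026-03-01",
    # Server-side calls authenticate with an API key as a bearer token
    "Authorization": f"Bearer {os.environ['CARTESIA_API_KEY']}",
}

# "/your-endpoint" is a placeholder; substitute the endpoint you are calling.
response = requests.get(f"{BASE_URL}/your-endpoint", headers=headers)
response.raise_for_status()
```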
### Use access tokens when making requests from a client app
Never use API keys in client apps; they grant full account access and can be extracted from browser or mobile code.
Instead, your server can generate a short-lived access token and send it to the client. See the [Access Token API Reference](/api-reference/auth/access-token) for how to generate one.
* For HTTP requests, include the access token as a bearer token in the `Authorization` header.
* For WebSocket connections, pass the token as the `?access_token=` query parameter since browsers can't set headers on WebSocket handshakes.
### Check response codes
Our API uses standard HTTP response codes; refer to [httpstatuses.io](https://httpstatuses.io).
### Parse structured error responses
For `Cartesia-Version` values on or after `2026-03-01`, Cartesia returns structured JSON errors.
For the full error reference (all current error codes, schemas, and field nullability), see [API Errors](/use-the-api/api-errors).
```json HTTP error response (Cartesia-Version 2026-03-01 and newer) theme={null}
{
"error_code": "concurrency_limited",
"title": "Too many concurrent requests",
"message": "You have exceeded your plan's concurrency limit.",
"request_id": "550e8400-e29b-41d4-a716-446655440000"
}
```
Field meanings:
1. `error_code`: machine-readable identifier for client logic; can be `null`.
2. `title`: short human-readable summary.
3. `message`: detailed human-readable explanation.
4. `request_id`: request identifier for support/debugging.
5. `doc_url`: optional link to docs for the specific error (when available).
Common `error_code` values today include `quota_exceeded`, `concurrency_limited`, `voice_model_mismatch`, `voice_not_found`, `model_not_found`, `language_not_supported`, `file_too_large`, `unsupported_audio_format`, and `plan_upgrade_required`.
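A minimal sketch of consuming these fields on the client, assuming `Cartesia-Version: 2026-03-01` or newer and a `requests.Response` from a failed HTTP call; the retry policy shown is illustrative:
```python theme={null}
import requests

RETRYABLE = {"concurrency_limited"}  # illustrative retry policy

def handle_error(response: requests.Response) -> None:
    if response.ok:
        return
    error = response.json()
    code = error.get("error_code")        # machine-readable; may be None
    request_id = error.get("request_id")  # quote this when contacting support
    if code in RETRYABLE:
        ...  # back off and retry the request
    else:
        # Treat unrecognized codes gracefully; new codes may be added over time.
        raise RuntimeError(f"{error['title']}: {error['message']} (request {request_id})")
```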
WebSocket and SSE error events include the same error fields plus transport context:
```json WebSocket/SSE error event (Cartesia-Version 2026-03-01 and newer) theme={null}
{
"type": "error",
"done": true,
"status_code": 429,
"error_code": "concurrency_limited",
"title": "Too many concurrent requests",
"message": "You have exceeded your plan's concurrency limit.",
"request_id": "550e8400-e29b-41d4-a716-446655440000:happy-monkeys-fly:8a0f5f3a-3b2f-4f28-b73e-8c5f27e2f8bb",
"context_id": "happy-monkeys-fly"
}
```
Notes:
1. `context_id` appears for TTS WebSocket errors when available.
2. `status_code` is included in WebSocket/SSE error payloads; for HTTP, status remains in the HTTP response status line.
3. `request_id` is always a string; HTTP and SSE request IDs are UUIDs, while WebSocket request IDs may include additional context.
For `Cartesia-Version` values before `2026-03-01` (and invalid versions), legacy error formats are returned instead:
1. HTTP errors are plain text in `Title: Message` format.
2. WebSocket/SSE errors use legacy envelopes with string-only error messages.
### Pass data according to the method
All GET requests use query parameters to pass data. All POST requests use a JSON body or `multipart/form-data`.
# API Errors
Source: https://docs.cartesia.ai/use-the-api/api-errors
For `Cartesia-Version: 2026-03-01` and newer, Cartesia returns structured JSON error objects.
For older API versions, errors may be plain text (for example `Title: Message`).
## HTTP Error Object
```json HTTP error response theme={null}
{
"error_code": "concurrency_limited",
"title": "Too many concurrent requests",
"message": "You have exceeded your plan's concurrency limit.",
"request_id": "550e8400-e29b-41d4-a716-446655440000"
}
```
| Field | Type | Required | Nullable | Notes |
| ------------ | --------------- | -------- | -------- | ----------------------------------------------------------------- |
| `error_code` | `string` | Yes | Yes | Machine-readable code. Can be `null` if no specific code applies. |
| `title` | `string` | Yes | No | Short human-readable error summary. |
| `message` | `string` | Yes | No | Detailed human-readable error explanation. |
| `request_id` | `string` (UUID) | Yes | No | Request identifier for support/debugging. |
| `doc_url` | `string` | No | No | Optional docs link for the error. Omitted when not available. |
## WebSocket Error Event Object
```json WebSocket error event theme={null}
{
"type": "error",
"done": true,
"status_code": 429,
"error_code": "concurrency_limited",
"title": "Too many concurrent requests",
"message": "You have exceeded your plan's concurrency limit.",
"request_id": "550e8400-e29b-41d4-a716-446655440000:happy-monkeys-fly:8a0f5f3a-3b2f-4f28-b73e-8c5f27e2f8bb",
"context_id": "happy-monkeys-fly"
}
```
| Field | Type | Required | Nullable | Notes |
| ------------- | --------- | -------- | -------- | --------------------------------------------------------------------------------------------------------------- |
| `type` | `string` | Yes | No | Always `"error"`. |
| `done` | `boolean` | Yes | No | Currently always `true` for error events. |
| `status_code` | `integer` | Yes | No | HTTP-like status code for the error. |
| `error_code` | `string` | Yes | Yes | Machine-readable code. Can be `null`. |
| `title` | `string` | Yes | No | Short human-readable error summary. |
| `message` | `string` | Yes | No | Detailed human-readable error explanation. |
| `request_id` | `string` | Yes | No | Request identifier for support/debugging. For WebSocket, this may be a UUID or a derived per-message ID string. |
| `doc_url` | `string` | No | No | Optional docs link for the error. Omitted when not available. |
| `context_id` | `string` | No | No | TTS context identifier. Present when available. |
## SSE Error Event Object
SSE errors are sent with `event: error` and JSON in the `data:` line.
```text SSE error event theme={null}
event: error
data: {"type":"error","done":true,"status_code":500,"error_code":null,"title":"Unexpected error","message":"An unexpected error occurred, please contact support@cartesia.ai if the problem persists.","request_id":"550e8400-e29b-41d4-a716-446655440000"}
```
| Field | Type | Required | Nullable | Notes |
| ------------- | --------------- | -------- | -------- | ------------------------------------------------------------- |
| `type` | `string` | Yes | No | Always `"error"`. |
| `done` | `boolean` | Yes | No | Currently always `true` for error events. |
| `status_code` | `integer` | Yes | No | HTTP-like status code for the error. |
| `error_code` | `string` | Yes | Yes | Machine-readable code. Can be `null`. |
| `title` | `string` | Yes | No | Short human-readable error summary. |
| `message` | `string` | Yes | No | Detailed human-readable error explanation. |
| `request_id` | `string` (UUID) | Yes | No | Request identifier for support/debugging. |
| `doc_url` | `string` | No | No | Optional docs link for the error. Omitted when not available. |
## Current Error Codes
More error codes may be added in the future. Integrations should handle unknown
`error_code` values gracefully.
| `error_code` | Meaning |
| -------------------------- | ------------------------------------------------------------------------- |
| `quota_exceeded` | The account has exceeded quota (for example credits or agents usage). |
| `concurrency_limited` | The account has exceeded the plan's concurrency limit. |
| `voice_model_mismatch` | The requested voice is incompatible with the requested model. |
| `voice_not_found` | The requested voice does not exist. |
| `model_not_found` | The requested model does not exist. |
| `language_not_supported` | The requested language is not supported for the requested model or voice. |
| `file_too_large` | The uploaded file is too large. |
| `unsupported_audio_format` | The provided audio format is not supported. |
| `plan_upgrade_required` | The feature requires a higher plan tier. |
# Compare TTS Endpoints
Source: https://docs.cartesia.ai/use-the-api/compare-tts-endpoints
How bytes, SSE, and WebSocket differ for text-to-speech, and when to use each.
Cartesia exposes three ways to turn text into speech. The same models, voices, and core parameters apply everywhere. What changes is how you connect, how audio is framed on the wire, and whether you get timestamps, continuations (streaming model output into one spoken line), or many generations on one connection.
All three endpoints stream audio as it is produced. The bytes endpoint delivers that stream as a single HTTP response body (the same pattern the playground uses). SSE and WebSocket stream too; they chunk audio into multiple events or messages, which is how per-chunk metadata such as timestamps is carried.
## Feature comparison
| | Multiple generations per connection | Timestamps | Continuations |
| --------- | ----------------------------------- | ---------- | ------------- |
| WebSocket | Yes | Yes | Yes |
| Bytes | No (one `POST` per generation) | No | No |
| SSE | No (one `POST` per generation) | Yes | No |
An **utterance** is one stretch of speech you want pronounced as a single unit (usually a sentence or a line of UI copy). **Continuations** let you send that utterance as several WebSocket messages that share a `context_id`. See [Stream inputs using continuations](/build-with-cartesia/capability-guides/stream-inputs-using-continuations) and [contexts](/use-the-api/tts-websocket/contexts).
```mermaid theme={null}
flowchart TD
Q1{"Are you streaming text from an LLM or other partial input?"}
Q2{"Do you need timestamps on HTTP without WebSocket?"}
Q3{"Will you speak often enough that repeated connect/TLS cost hurts?"}
WS["WebSocket"]
SSE["SSE"]
Bytes["Bytes"]
Q1 -- "Yes" --> WS
Q1 -- "No" --> Q2
Q2 -- "Yes" --> SSE
Q2 -- "No" --> Q3
Q3 -- "Yes" --> WS
Q3 -- "No" --> Bytes
```
If you care about time-to-first-byte on every turn, remember that a new HTTPS request pays for TCP and TLS again; that overhead is often on the same order as TTFB for the audio itself. WebSocket amortizes that cost when you keep the socket open.
SSE is still supported for stacks that already consume Server-Sent Events or when you want timestamps while staying on HTTP. For audio only, bytes is usually the better HTTP choice (smaller encoding than JSON-wrapped chunks).
## Pick an endpoint in one minute
| What you are building | Use this | Short label |
| -------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------- | ----------------------------------- |
| Full transcript in one request; you want a streaming HTTP body (efficient; same pattern as the playground) | [`POST /tts/bytes`](/api-reference/tts/bytes) | Stream speech (bytes) |
| Full transcript in one request; you need timestamps without WebSocket, or your stack already uses SSE | [`POST /tts/sse`](/api-reference/tts/sse) | Stream speech with timestamps (SSE) |
| Long-lived session, partial transcript (for example LLM tokens), lowest latency across many turns, timestamps, or several utterances on one socket | [WebSocket `/tts/websocket`](/api-reference/tts/websocket) | Live session (WebSocket) |
If the full transcript is not known up front, use WebSocket with contexts, not bytes or SSE.
***
## Bytes (`POST /tts/bytes`)
Best for batch jobs, caching files, notifications, and anywhere one `POST` per generation is enough.
The response body streams while audio is generated. You can read progressively or buffer to the end. For many output formats this is leaner on the wire than SSE because you receive raw or file bytes instead of JSON-wrapped chunks.
Typical flow:
1. One JSON payload with the full `transcript`, voice, model, and output format (WAV, MP3, raw PCM, and so on).
2. `POST` to `/tts/bytes`.
3. Read the body as data arrives, or consume it to completion.
One request is one generation. For another line of speech, send another `POST`.
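A minimal sketch of that flow with the `requests` library; the payload is left as a placeholder since the exact fields are in the bytes reference, and the filename assumes you requested WAV output:
```python theme={null}
import os
import requests

headers = {
    "Cartesia-Version": "2026-03-01",
    "Authorization": f"Bearer {os.environ['CARTESIA_API_KEY']}",
}
payload = {}  # transcript, model, voice, and output format per the bytes reference

with requests.post("https://api.cartesia.ai/tts/bytes", headers=headers, json=payload, stream=True) as resp:
    resp.raise_for_status()
    with open("speech.wav", "wb") as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)  # audio streams while generation is still in progress
```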
See [bytes reference](/api-reference/tts/bytes).
***
## SSE (`POST /tts/sse`)
Best when you need timestamps while staying on HTTP without WebSocket, or when your integration already uses SSE. If you only need audio and not SSE-shaped events, bytes is usually simpler. WebSocket is otherwise the full-featured option for real-time use and supports timestamps as well.
SSE remains available largely for backward compatibility and for teams that standardize on Server-Sent Events.
Typical flow:
1. Same as bytes: one JSON body with the full transcript.
2. `POST` to `/tts/sse`.
3. Consume Server-Sent Events; each event carries the next chunk until completion.
Bytes vs SSE:
| | Bytes | SSE |
| ---------- | ----------------------------------------------- | ---------------------------------------------- |
| Shape | One streaming response body (raw or file bytes) | Many SSE events (often JSON plus base64 audio) |
| Timestamps | No | Yes (in the event payload) |
You still send one full transcript per request: SSE does not support WebSocket-style continuations across multiple `POST`s. An optional `context_id` is echoed for your logs; it does not merge multiple HTTP requests into one utterance. To send text in pieces over time, use WebSocket.
See [SSE reference](/api-reference/tts/sse).
***
## WebSocket (`/tts/websocket`)
Best for assistants, games, telephony-style stacks, or any case where the connection stays open and transcript text may arrive over time.
Why people choose WebSocket:
1. Latency: you pay connect cost once; later generations avoid extra TCP/TLS round trips (often tens to low hundreds of ms per turn).
2. Streaming input: send fragments as they arrive (for example from an LLM) and keep prosody across them. See [continuations](/build-with-cartesia/capability-guides/stream-inputs-using-continuations) and [contexts](/use-the-api/tts-websocket/contexts).
3. Timestamps: word- or segment-level timing (model and language limits apply; see WebSocket docs).
4. Multiplexing: multiple `context_id` values on one connection for overlapping utterances.
Typical flow:
1. Open the WebSocket.
2. Send JSON messages. When one utterance is split across messages, use `context_id` and `continue`: set `continue: true` on partials, and `continue: false` on the last part of that utterance (or use the empty-transcript pattern in [contexts](/use-the-api/tts-websocket/contexts) if you cannot know the final string yet).
3. Read audio until the server finishes that context.
See [WebSocket reference](/api-reference/tts/websocket).
***
## Continuations
If you are not streaming text from a model, start with bytes or SSE, not continuations.
When you do use WebSocket streaming, keep one stable `context_id` per utterance, `continue: true` on every partial, and `continue: false` on the final message for that utterance (see [contexts](/use-the-api/tts-websocket/contexts)).
Things that break text or prosody:
* Concatenation: chunks are joined verbatim. A missing space produces `"...world!How..."` instead of `"...world! How..."`.
* SSML and numbers: avoid splitting tokens that must stay together (for example decimals in SSML). See `max_buffer_delay_ms` in the [continuations guide](/build-with-cartesia/capability-guides/stream-inputs-using-continuations).
If you leave `continue: true` longer than you meant, contexts eventually expire on their own and audio is still generated and flushed according to server rules. It is not a runaway failure mode. You should still send `continue: false` when you know the utterance is complete so your client state matches the server. Do not reuse old `context_id` values for unrelated utterances.
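As a sketch of the message shapes, showing only the continuation-related fields named above (see the WebSocket reference for the full schema), one utterance split across three sends might look like:
```python theme={null}
context_id = "greeting-1"  # one stable ID per utterance

messages = [
    # Note the trailing spaces: chunks are concatenated verbatim.
    {"context_id": context_id, "transcript": "Hello there! ", "continue": True},
    {"context_id": context_id, "transcript": "How can I help ", "continue": True},
    {"context_id": context_id, "transcript": "you today?", "continue": False},  # finalizes the utterance
]
```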
***
## Why WebSocket uses `context_id` (and HTTP does not)
On `POST /tts/bytes` and `POST /tts/sse`, you send a complete transcript in one JSON body. There is no continuation protocol across requests.
`context_id` and `continue` matter on WebSocket when one utterance's text is split across multiple messages. The server concatenates chunks that share a `context_id`. `continue: true` means more text is coming; `continue: false` finalizes that utterance.
Mental model:
* Whole line of speech in one string: bytes or SSE. No context API.
* Text arrives in pieces: WebSocket, one `context_id` per utterance, with continuations.
***
## API ergonomics (all endpoints)
* For server-side calls, prefer the API key in the `Authorization` header instead of query strings (headers are less likely to appear in access logs). In browsers, WebSocket handshakes cannot set headers, so pass a short-lived access token via the `?access_token=` query parameter instead.
* Model IDs, voices, and core generation parameters match across bytes, SSE, and WebSocket. What differs is wire format, how chunks are exposed, and whether input can be streamed with continuations.
***
## Where to go next
* Bytes (`/tts/bytes`): one POST, streaming response body
* SSE (`/tts/sse`): timestamps and SSE-chunked audio
* WebSocket (`/tts/websocket`): streaming input, multiplexing, lowest latency across turns
# Concurrency and WebSocket Limits
Source: https://docs.cartesia.ai/use-the-api/concurrency-limits-and-timeouts
Learn about concurrency limits and timeouts with the Cartesia API.
Your account is subject to two types of rate limits: WebSocket limits and generation concurrency limits.
## Concurrency limits by subscription plan
Your subscription plan determines how many requests can be processed simultaneously. Sonic Text-to-Speech (TTS) and Ink Speech-to-Text (STT) each have their own concurrency limit, shown per plan below.
| Plan | TTS Concurrent Requests | STT Concurrent Requests |
| ---------- | ----------------------- | ----------------------- |
| Free | 2 | 8 |
| Pro | 3 | 12 |
| Startup | 5 | 20 |
| Scale | 15 | 60 |
| Enterprise | Custom | Custom |
Sonic (Text-to-Speech) and Ink (Speech-to-Text) services have separate concurrent request limits. For example, if you're on the Scale plan, you can have up to 15 concurrent TTS requests AND 60 concurrent STT requests running simultaneously.
## Text-to-Speech (TTS) Concurrency
We measure TTS generation concurrency in terms of the number of unique contexts active at a given time.
* For HTTP endpoints, each request is treated as a separate context and counts toward your concurrency limit.
* For WebSockets, a unique `context_id` defines a context. Sending additional requests with the same `context_id` does not increase your concurrency usage, because requests to the same context are processed sequentially.
* STT **does not** count towards your TTS concurrency limit
If you exceed your TTS concurrency limit, you will receive a `429 Too Many Requests` error. You can check your concurrency limit and upgrade it on the playground at [play.cartesia.ai](https://play.cartesia.ai).
### Interpreting TTS concurrency limits
How you interpret your TTS concurrency limit depends on how you're using the Sonic model family.
For real-time conversational use cases, such as powering voice agents, we've found that the number of parallel conversations you can support is effectively 4X your concurrency limit. This is just a rule of thumb, and depends on the types of conversations you're supporting. You can reach out to us to discuss your specific use case.
For example, if you have a TTS concurrency limit of 15, you can typically support 60 parallel conversations.
For non-conversational use cases, such as generating speech in batch jobs, there is a more direct relationship between your concurrency limit and the number of parallel generations you can support.
For example, if you have a TTS concurrency limit of 15, you can typically support 15 parallel TTS generations. You can use a connection pool to ensure you don't exceed your concurrency limit.
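For batch workloads, one simple way to enforce that cap is a semaphore sized to your concurrency limit. A minimal sketch, assuming an asyncio-based client; `generate_tts` is a placeholder for your own request helper:
```python theme={null}
import asyncio

TTS_CONCURRENCY_LIMIT = 15  # e.g. the Scale plan
semaphore = asyncio.Semaphore(TTS_CONCURRENCY_LIMIT)


async def generate_tts(transcript: str) -> bytes:
    """Placeholder: replace with your actual TTS request (HTTP or WebSocket)."""
    await asyncio.sleep(0.1)  # simulate a generation
    return b""


async def generate_with_limit(transcript: str) -> bytes:
    # At most TTS_CONCURRENCY_LIMIT generations are in flight at once,
    # which keeps you under the limit and avoids 429 responses.
    async with semaphore:
        return await generate_tts(transcript)


async def run_batch(transcripts: list[str]) -> list[bytes]:
    return await asyncio.gather(*(generate_with_limit(t) for t in transcripts))


# asyncio.run(run_batch(["First line.", "Second line."]))
```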
### TTS WebSocket limits
We limit the number of parallel TTS WebSocket connections to 10X your concurrency limit. For example, if you have a concurrency limit of 15, you can have up to 150 parallel TTS WebSocket connections.
If you exceed your WebSocket limit, you will receive a `429 Too Many Requests` error on trying to open a new WebSocket connection.
Usually, when users run into TTS WebSocket limits (even at scale), it's because they're not properly closing idle connections. Beyond closing idle connections, you can also create a connection pool to ensure you don't exceed your WebSocket limit.
### TTS WebSocket timeouts
We close idle TTS WebSocket connections after 5 minutes. We recommend closing connections that stay idle for long periods and opening a new WebSocket connection when you next need one.
## Speech-to-Text (STT) Concurrency
Each active transcription stream counts as one concurrent request, regardless of whether you're using HTTP or WebSocket connections.
* Each concurrent HTTP or WebSocket connection counts toward your STT concurrency limit
* Idle STT WebSockets still count towards your STT concurrency limit
* TTS **does not** count towards your STT concurrency limit
If you exceed your STT concurrency limit, you will receive a `429 Too Many Requests` error.
### STT WebSocket timeouts
We close idle STT WebSocket connections after 3 minutes. We recommend closing connections that stay idle for long periods and opening a new WebSocket connection when you next need one.
# Migrating From OpenAI Whisper to Cartesia Ink
Source: https://docs.cartesia.ai/use-the-api/migrate-from-open-ai
Use Cartesia's Batch Speech-to-Text API with OpenAI's client libraries
Batch Speech-to-Text: This documentation covers OpenAI SDK compatibility for Cartesia Ink's batched transcription endpoint.
For real-time transcription, use our [Streaming STT endpoint](/api-reference/stt/stt).
Cartesia's Batch Speech-to-Text API is compatible with OpenAI's client libraries, enabling seamless migration from OpenAI Whisper.
## Endpoints
**Cartesia Native:** `/stt` - Full feature support\
**OpenAI Compatible:** `/audio/transcriptions` - Drop-in replacement for Whisper on the OpenAI SDK
## Migration Guide for OpenAI SDK
Replace your OpenAI base URL with `https://api.cartesia.ai` to use Cartesia's compatibility layer, as shown in the examples below.
### Parameter Support
**Supported Parameters**:
* `file` - The audio file to transcribe
* `model` - Use `ink-whisper` for Cartesia's latest model
* `language` - Input audio language (ISO-639-1 format)
* `timestamp_granularities` - Include `["word"]` to get word-level timestamps
**Response Format**: Always returns JSON with transcribed text, duration, language, and optionally word timestamps.
For the complete parameter reference, see our [Batch STT API documentation](/api-reference/stt/transcribe).
### Python Example
```python theme={null}
from openai import OpenAI

client = OpenAI(
    api_key="your-cartesia-api-key",
    base_url="https://api.cartesia.ai"
)

with open("audio.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        file=audio_file,
        model="ink-whisper",
        language="en",
        timestamp_granularities=["word"]
    )

print(transcript.text)
```
### Node.js Example
```typescript theme={null}
import OpenAI from 'openai';
import fs from 'fs';
const client = new OpenAI({
apiKey: 'your-cartesia-api-key',
baseURL: 'https://api.cartesia.ai'
});
const transcription = await client.audio.transcriptions.create({
file: fs.createReadStream('audio.wav'),
model: 'ink-whisper',
language: 'en',
timestamp_granularities: ['word']
});
console.log(transcription.text);
```
## Direct API Usage
Both endpoints accept identical parameters and return the same JSON response format:
### Cartesia Native Endpoint
```bash theme={null}
curl -X POST https://api.cartesia.ai/stt \
-H "X-API-Key: your-cartesia-api-key" \
-F "file=@audio.wav" \
-F "model=ink-whisper" \
-F "language=en" \
-F "timestamp_granularities[]=word"
```
### OpenAI-Compatible Endpoint
```bash theme={null}
curl -X POST https://api.cartesia.ai/audio/transcriptions \
-H "X-API-Key: your-cartesia-api-key" \
-F "file=@audio.wav" \
-F "model=ink-whisper" \
-F "language=en" \
-F "timestamp_granularities[]=word"
```
## Migration from OpenAI
To migrate from OpenAI's Whisper API to Cartesia:
1. **Update the base URL**: Change from `https://api.openai.com/v1` to `https://api.cartesia.ai`
2. **Update authentication**: Replace your OpenAI API key with your Cartesia API key
3. **Update model names**: Use `ink-whisper` instead of OpenAI's model names
4. **Keep the same endpoint**: Continue using `/audio/transcriptions`
5. **Avoid unsupported parameters**: Remove `prompt`, `temperature`, and `response_format` parameters
6. **Use `timestamp_granularities` (optional)**: Add `timestamp_granularities: ["word"]` to get word-level timestamps
The core functionality remains the same, with JSON responses containing transcribed text and optional word timestamps.
# Buffering
Source: https://docs.cartesia.ai/use-the-api/tts-websocket/buffering
Control how text is buffered before speech generation to balance prosody and latency.
Cartesia supports two buffering modes for streaming TTS: **managed buffering** and **custom buffering**. The right choice depends on how much control you need over the prosody-latency tradeoff.
**Start with managed buffering.** It produces natural-sounding speech with minimal integration effort. Switch to custom buffering only if you need fine-grained control.
## Managed buffering
Stream LLM tokens directly to Cartesia and let the API decide when to start generating speech. This is the same approach used in Cartesia's managed voice agents platform.
Set `max_buffer_delay_ms` to a value greater than 0 (the default is 3000ms) and stream text token by token.
```json theme={null}
{
"model_id": "sonic-3",
"transcript": "Hello",
"voice": {
"mode": "id",
"id": "a0e99841-438c-4a64-b679-ae501e7d6091"
},
"context_id": "my-context",
"continue": true,
"max_buffer_delay_ms": 3000
}
```
The API buffers incoming text until it has enough context to produce high-quality speech, or until `max_buffer_delay_ms` elapses—whichever comes first. This produces results similar to sentence-level aggregation while still optimizing for latency.
**When to use managed buffering:**
* You're streaming LLM output token by token
* You want natural-sounding speech without building buffering logic
* You want a simple integration with good defaults
## Custom buffering
Handle buffering yourself and send complete phrases or sentences to Cartesia. Set `max_buffer_delay_ms` to `0` so the API generates speech immediately from whatever you provide.
```json theme={null}
{
"model_id": "sonic-3",
"transcript": "Hello, my name is Sonic.",
"voice": {
"mode": "id",
"id": "a0e99841-438c-4a64-b679-ae501e7d6091"
},
"context_id": "my-context",
"continue": true,
"max_buffer_delay_ms": 0
}
```
With custom buffering, you control the prosody-latency tradeoff directly:
* **Full sentences** produce the best prosody but add latency while you wait for the sentence to complete.
* **Partial sentences** reduce latency but may result in less natural speech at chunk boundaries.
**When to use custom buffering:**
* You need precise control over when speech generation starts
* You have your own sentence detection or text aggregation logic
* You're optimizing for a specific latency target
## Avoid the middle ground
A common mistake is to aggregate text client-side into sentences or phrases *and* use the default `max_buffer_delay_ms` of 3000ms. This can cause unnecessary latency—after receiving a complete sentence, the API may wait up to 3000ms for additional input before generating speech.
Pick one approach:
* **Managed buffering:** Stream tokens with `max_buffer_delay_ms > 0` and let Cartesia handle aggregation.
* **Custom buffering:** Aggregate text yourself and set `max_buffer_delay_ms = 0`.
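For the custom-buffering path, here is a minimal sketch of client-side sentence aggregation. `send_to_cartesia` is a placeholder for your own WebSocket send, and the sentence splitting is deliberately naive:
```python theme={null}
SENTENCE_END = (".", "?", "!")


def send_to_cartesia(message: dict) -> None:
    """Placeholder: replace with your WebSocket send (see the JSON examples above)."""
    print(message)


def aggregate_sentences(tokens):
    """Yield complete sentences from a stream of LLM tokens."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer
            buffer = ""
    if buffer.strip():
        yield buffer  # flush any trailing partial text


def speak(tokens, context_id: str) -> None:
    for sentence in aggregate_sentences(tokens):
        send_to_cartesia({
            "transcript": sentence,
            "context_id": context_id,
            "continue": True,
            "max_buffer_delay_ms": 0,  # generate immediately from each complete sentence
        })
    # Finalize the context so the model doesn't wait for more text.
    send_to_cartesia({"transcript": "", "context_id": context_id, "continue": False})
```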
## Configuration reference
`max_buffer_delay_ms`: Maximum time in milliseconds the API waits for additional input before generating speech from buffered text.
* **Range:** 0–5000ms
* **Default:** 3000ms
* Set to `0` for custom buffering (no server-side buffering)
* Set to `> 0` for managed buffering
If you use `speed` or `volume` [SSML tags](/build-with-cartesia/sonic-3/ssml-tags) with managed buffering, make sure decimal values are not split across tokens. Submitting `1.0` as `1`, `.`, `0` will cause parsing errors.
## Tips for best results
* **End sentences with punctuation.** Without closing punctuation (`.`, `?`, `!`), the model may treat text as incomplete and wait for the buffer delay to elapse before generating. See [streaming inputs with continuations](/build-with-cartesia/capability-guides/stream-inputs-using-continuations) for more details.
* **Signal when input is done.** When a turn is complete, use `continue: false` (WebSocket) or `no_more_inputs()` (SDK) so the model doesn't wait for more text.
* **Test with realistic input patterns.** Buffering behavior depends on how text arrives—test with actual LLM output rather than pre-written text.
# Context Flushing and Flush IDs
Source: https://docs.cartesia.ai/use-the-api/tts-websocket/context-flushing-and-flush-i-ds
Learn about managing multiple transcript generations with context flushing.
## Overview
When using [context IDs with the WebSocket API](/use-the-api/tts-websocket/contexts), all audio chunks for transcripts submitted to a single context share the same context ID. This makes it difficult to determine which audio chunks correspond to specific transcript submissions.
While this behavior works well for streaming audio, some implementations require the ability to map audio chunks back to their originating transcripts.
## Manual Flushing
Manual flushing creates clear boundaries between transcript submissions within the same context.
### How It Works
Each time you trigger a manual flush, the system increments a `flush_id` counter. This ID is included in corresponding response audio chunk payloads, allowing you to track which transcript generated specific audio chunks.
### Implementation
To trigger a manual flush:
1. Send a request with these parameters:
* `continue=True` (indicates you're continuing with the same context)
* `flush=True` (triggering the flush operation)
* Empty transcript
* Same context ID as your previous request
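Put together, such a flush request might look like the sketch below; the other generation fields required on the context (model ID, voice, output format) are omitted for brevity but must stay consistent with the rest of the context.
```python theme={null}
import json

# A flush request on an open context: empty transcript, continue=true, flush=true.
flush_message = {
    "context_id": "happy-monkeys-fly",
    "transcript": "",
    "continue": True,
    "flush": True,
}
# await ws.send(json.dumps(flush_message))
print(json.dumps(flush_message))
```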
### Example Flow
```
1. Submit transcript 1 on context 1
2. Flush context 1
3. Submit transcript 2 on context 1
```
In this flow:
* All audio chunks from transcript 1 will have `flush_id=1`
* The manual flush increments the ID
* All audio chunks from transcript 2 will have `flush_id=2`
## Payload Structure
Each audio chunk payload includes a `flush_id` field that serves as a transcript identifier. This ID increments with each manual flush operation, creating a clear boundary between transcript submissions.
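For illustration, a small sketch of demultiplexing chunks by `flush_id`; the `data` field name for the base64 audio payload is an assumption here, so check the WebSocket reference for the exact payload schema.
```python theme={null}
import base64
from collections import defaultdict

# Accumulate audio per transcript submission, keyed by flush_id.
audio_by_flush_id: dict[int, bytes] = defaultdict(bytes)


def handle_chunk(message: dict) -> None:
    # `data` (base64 audio) is an assumed field name for illustration.
    if message.get("data") is not None:
        audio_by_flush_id[message["flush_id"]] += base64.b64decode(message["data"])
```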
## When to Use Manual Flushing
Consider using manual flushing when:
* You need to associate audio chunks with their originating transcripts
* Your application architecture expects a one-to-one relationship between transcripts and response streams
* You're integrating with frameworks that assume each transcript has a corresponding generator
This feature is particularly helpful when using multiple providers, as it aligns the Cartesia API with systems that expect discrete generator responses per transcript.
# Contexts
Source: https://docs.cartesia.ai/use-the-api/tts-websocket/contexts
This is a hands-on guide to input streaming using WebSocket contexts. For a conceptual overview of how input streaming works in Sonic, see the [input streaming guide](/build-with-cartesia/capability-guides/stream-inputs-using-continuations).
> In many real time use cases, you don't have your transcripts available upfront—like when you're generating them using an LLM. For these cases, Sonic supports input streaming.
The context IDs you pass to the Cartesia API identify speech contexts. Contexts maintain prosody between their inputs—so you can send a transcript in multiple parts and receive seamless speech in return.
To stream in inputs on a context, just pass a `continue` flag (set to `true`) for every input that you expect will be followed by more inputs. (By default, this flag is set to `false`.)
To finish a context, just set `continue` to `false`. If you do not know the last transcript in advance, you can send an input with an empty transcript and `continue` set to `false`.
Contexts automatically expire 1 second after the last audio output is streamed out. Attempting to send another input on the same context ID after expiry is not supported.
`continue`: Whether this input may be followed by more inputs. Defaults to `false`.
### Input Format
1. Inputs on the same context must keep all fields except `transcript`, `continue`, and `duration` the same.
2. Transcripts are concatenated verbatim, so make sure they form a valid transcript when joined together. Include any spaces between words and punctuation as necessary. For example, in languages with spaces, include a space at the end of the preceding transcript, e.g. transcript 1 is `Thanks for coming, ` and transcript 2 is `it was great to see you.`
### Example
Let's say you're trying to generate speech for "Hello, Sonic! I'm streaming inputs." You should stream in the following inputs (repeated fields omitted for brevity). Note: all other fields (e.g. `model_id`, `language`) are required and should be passed unchanged between requests with input streaming.
```json Input Streaming theme={null}
{"transcript": "Hello, Sonic!", "continue": true, "context_id": "happy-monkeys-fly"}
{"transcript": " I'm streaming ", "continue": true, "context_id": "happy-monkeys-fly"}
{"transcript": "inputs.", "continue": false, "context_id": "happy-monkeys-fly"}
```
If [streaming in input tokens](/build-with-cartesia/capability-guides/stream-inputs-using-continuations), we recommend using `max_buffer_delay_ms`, which sets the maximum time the model will buffer text before starting generation.
If you set this option to `0`, the model will start generating immediately on each request, giving you full control over buffering of inputs.
If you don't know the last transcript in advance, you can send an input with an empty transcript and `continue` set to `false`:
```json Input Streaming theme={null}
{"transcript": "Hello, Sonic!", "continue": true, "context_id": "happy-monkeys-fly"}
{"transcript": " I'm streaming ", "continue": true, "context_id": "happy-monkeys-fly"}
{"transcript": "inputs.", "continue": true, "context_id": "happy-monkeys-fly"}
{"transcript": "", "continue": false, "context_id": "happy-monkeys-fly"}
```
### Output
You will only receive `done: true` after outputs for the entire context have been returned.
Outputs for a given context will always be in order of the inputs you streamed in. (That is, if you send input A and then input B on a context, you will first receive the chunks corresponding to input A, and then the chunks corresponding to input B.)
## Cancelling Requests
You can also cancel pending requests through the WebSocket.
To cancel a request, send a JSON message with the following structure:
```json WebSocket Request theme={null}
{
"context_id": "happy-monkeys-fly",
"cancel": true
}
```
When you send a cancel request:
1. It will only halt requests that have not begun generating a response yet.
2. Any currently generating request will continue sending responses until completion.
The `context_id` in the cancel request should match the `context_id` of the request you want to cancel.
# Get API Key
Source: https://docs.cartesia.ai/api-reference/api-keys/get
/latest.yml GET /api-keys/{id}
Returns metadata for a single API key.
# List API Keys
Source: https://docs.cartesia.ai/api-reference/api-keys/list
/latest.yml GET /api-keys
Returns a paginated list of standard API keys owned by the authenticating organization. Only metadata is returned, not the keys themselves. Admin API keys are not included.
# Generate a New Access Token
Source: https://docs.cartesia.ai/api-reference/auth/access-token
/latest.yml POST /access-token
Generates a new Access Token for the client. These tokens are short-lived and should be used to make requests to the API from authenticated clients.
# Create
Source: https://docs.cartesia.ai/api-reference/datasets/create
/latest.yml POST /datasets/
Create a new dataset
# Delete
Source: https://docs.cartesia.ai/api-reference/datasets/delete
/latest.yml DELETE /datasets/{id}
Delete a dataset
# Delete file
Source: https://docs.cartesia.ai/api-reference/datasets/delete-file
/latest.yml DELETE /datasets/{id}/files/{fileID}
Remove a file from a dataset
# Get
Source: https://docs.cartesia.ai/api-reference/datasets/get
/latest.yml GET /datasets/{id}
Retrieve a specific dataset by ID
# List
Source: https://docs.cartesia.ai/api-reference/datasets/list
/latest.yml GET /datasets/
Paginated list of datasets
# List files
Source: https://docs.cartesia.ai/api-reference/datasets/list-files
/latest.yml GET /datasets/{id}/files
Paginated list of files in a dataset
# Update
Source: https://docs.cartesia.ai/api-reference/datasets/update
/latest.yml PATCH /datasets/{id}
Update an existing dataset
# Upload file
Source: https://docs.cartesia.ai/api-reference/datasets/upload-file
/latest.yml POST /datasets/{id}/files
Upload a new file to a dataset
# Create
Source: https://docs.cartesia.ai/api-reference/fine-tunes/create
/latest.yml POST /fine-tunes/
Create a new fine-tune
# Delete
Source: https://docs.cartesia.ai/api-reference/fine-tunes/delete
/latest.yml DELETE /fine-tunes/{id}
Delete a fine-tune
# Get
Source: https://docs.cartesia.ai/api-reference/fine-tunes/get
/latest.yml GET /fine-tunes/{id}
Retrieve a specific fine-tune by ID
# List
Source: https://docs.cartesia.ai/api-reference/fine-tunes/list
/latest.yml GET /fine-tunes/
Paginated list of all fine-tunes for the authenticated user
# List Voices
Source: https://docs.cartesia.ai/api-reference/fine-tunes/list-voices
/latest.yml GET /fine-tunes/{id}/voices
List all voices created from a fine-tune
# Infill (Bytes)
Source: https://docs.cartesia.ai/api-reference/infill/bytes
/latest.yml POST /infill/bytes
Generate audio that smoothly connects two existing audio segments. This is useful for inserting new speech between existing speech segments while maintaining natural transitions.
**The cost is 1 credit per character of the infill text plus a fixed cost of 300 credits.**
At least one of `left_audio` or `right_audio` must be provided.
As with all generative models, there's some inherent variability, but here are some tips we recommend to get the best results from infill:
- Use longer infill transcripts
- This gives the model more flexibility to adapt to the rest of the audio
- Target natural pauses in the audio when deciding where to clip
- This means you don't need word-level timestamps to be as precise
- Clip right up to the start and end of the audio segment you want infilled, keeping as much silence in the left/right audio segments as possible
- This helps the model generate more natural transitions
# Create
Source: https://docs.cartesia.ai/api-reference/pronunciation-dicts/create
/latest.yml POST /pronunciation-dicts/
Create a new pronunciation dictionary
# Delete
Source: https://docs.cartesia.ai/api-reference/pronunciation-dicts/delete
/latest.yml DELETE /pronunciation-dicts/{id}
Delete a pronunciation dictionary
# Get
Source: https://docs.cartesia.ai/api-reference/pronunciation-dicts/get
/latest.yml GET /pronunciation-dicts/{id}
Retrieve a specific pronunciation dictionary by ID
# List
Source: https://docs.cartesia.ai/api-reference/pronunciation-dicts/list
/latest.yml GET /pronunciation-dicts/
List all pronunciation dictionaries for the authenticated user
# Update
Source: https://docs.cartesia.ai/api-reference/pronunciation-dicts/update
/latest.yml PATCH /pronunciation-dicts/{id}
Update a pronunciation dictionary
# Get Agent Usage
Source: https://docs.cartesia.ai/api-reference/usage/agents
/latest.yml GET /usage/agents
Returns your agent usage over time, bucketed by the requested interval.
# Get Credit Usage
Source: https://docs.cartesia.ai/api-reference/usage/credits
/latest.yml GET /usage/credits
Returns your credit usage over time, bucketed by the requested interval.
# Voice Changer (Bytes)
Source: https://docs.cartesia.ai/api-reference/voice-changer/bytes
/latest.yml POST /voice-changer/bytes
Takes an audio file of speech, and returns an audio file of speech spoken with the same intonation, but with a different voice.
This endpoint is priced at 15 characters per second of input audio.
# Voice Changer (SSE)
Source: https://docs.cartesia.ai/api-reference/voice-changer/sse
/latest.yml POST /voice-changer/sse
# Audio encodings
Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/audio-encodings
Pick the encoding that matches your downstream pipeline.
## TTS output encodings
Used in the `output_format.encoding` field when generating audio.
| Encoding | Bit depth | Best for | Pair with sample rate |
| ----------- | ---------------- | --------------------------------------------------------------- | --------------------------------- |
| `pcm_s16le` | 16-bit int | General-purpose playback, browsers, audio players, most devices | 44100 (CD quality) or 16000–48000 |
| `pcm_f32le` | 32-bit float | ML post-processing, high-fidelity recording, audio analysis | 48000 |
| `pcm_mulaw` | 8-bit compressed | North American / Japanese telephony (G.711μ), Twilio | 8000 |
| `pcm_alaw` | 8-bit compressed | European / international telephony (G.711A) | 8000 |
### `pcm_s16le`
16-bit signed integer PCM, little-endian. Matches the standard audio CD format and is the most widely supported encoding across audio players, browsers, and hardware. Use this as your default unless you have a specific reason to choose another format.
```json theme={null}
{
"container": "raw",
"encoding": "pcm_s16le",
"sample_rate": 44100
}
```
### `pcm_f32le`
32-bit floating point PCM, little-endian. Provides the highest precision and dynamic range. Use when your pipeline handles float audio end-to-end—for example, feeding generated audio into an ML model, performing signal processing with NumPy/SciPy, or recording to a lossless format for later mastering.
```json theme={null}
{
"container": "raw",
"encoding": "pcm_f32le",
"sample_rate": 48000
}
```
### `pcm_mulaw`
8-bit μ-law compressed PCM. The standard encoding for North American and Japanese telephone networks (G.711μ). Use this when sending audio to Twilio or any telephony provider that expects μ-law. Always pair with an 8000 Hz sample rate to match the telephony standard.
```json theme={null}
{
"container": "raw",
"encoding": "pcm_mulaw",
"sample_rate": 8000
}
```
### `pcm_alaw`
8-bit A-law compressed PCM. The standard encoding for European and international telephone networks (G.711A). Use when your telephony infrastructure expects A-law rather than μ-law. Always pair with an 8000 Hz sample rate.
```json theme={null}
{
"container": "raw",
"encoding": "pcm_alaw",
"sample_rate": 8000
}
```
## STT input encodings
Used in the `encoding` parameter when sending audio for transcription. Must match the actual encoding of your audio source.
| Encoding | Bit depth | Common sources |
| ----------- | ---------------- | ------------------------------------------------------------------- |
| `pcm_s16le` | 16-bit int | Microphones, browsers (Web Audio API), most audio capture libraries |
| `pcm_s32le` | 32-bit int | Professional audio interfaces |
| `pcm_f16le` | 16-bit float | Half-precision ML pipelines |
| `pcm_f32le` | 32-bit float | ML models, Web Audio API `AudioWorklet` nodes, NumPy/SciPy |
| `pcm_mulaw` | 8-bit compressed | North American telephony, Twilio streams |
| `pcm_alaw` | 8-bit compressed | European telephony systems |
For best STT performance, resample your audio to `pcm_s16le` at 16000 Hz before sending.
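For example, here is a minimal resampling sketch assuming the `soundfile` and `scipy` packages; swap in whatever audio tooling your pipeline already uses.
```python theme={null}
import soundfile as sf
from scipy.signal import resample_poly


def to_stt_input(src_path: str, dst_path: str, target_rate: int = 16000) -> None:
    data, rate = sf.read(src_path, dtype="float32")
    if data.ndim > 1:
        data = data.mean(axis=1)  # downmix to mono
    if rate != target_rate:
        data = resample_poly(data, target_rate, rate)
    sf.write(dst_path, data, target_rate, subtype="PCM_16")  # 16-bit signed PCM


# to_stt_input("meeting.wav", "meeting_16k.wav")
```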
# Choosing a Voice
Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/choosing-a-voice
How to pick the best voice for your Voice Agents
When designing a voice agent experience, the voice that your agents will speak in is a critical choice that will influence your customers' experience.
Cartesia offers 500+ voices out of the box, as well as the ability to clone your own voices.
### Featured Voices
We feature a set of Voices that we've found work well for our customers and pass our internal quality checks. These voices are a great starting point to find the best Voice for your voice agent.
Featured Voices are displayed with a check mark icon next to their names on [play.cartesia.ai](https://play.cartesia.ai/).
### Stable voices (best for voice agents)
For voice agents in production, we've found that more stable, realistic voices perform better than studio quality, emotive voices. From our testing, we think these are the top performing English Voices for voice agents in Sonic 3:
* **Male**: Ronald, Carson
* **Female**: Katie, Jacqueline, Brooke
### Emotive voices (best for AI characters)
Our latest model, Sonic 3, is very expressive. Some voices, like Tessa and Maya, are labeled as emotive in the playground and respond well to [emotion instructions](/build-with-cartesia/sonic-3/volume-speed-emotion).
If your use case requires more expressive speech (e.g. companion apps, game characters), then we suggest trying:
* **Male**: Kyle, Cory
* **Female**: Tessa, Ariana
We tag such voices as Emotive in our playground and you can see a full list [here](https://play.cartesia.ai/voices?tags=Emotive).
# Choosing TTS parameters
Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/choosing-tts-parameters
Our Text-to-Speech API includes many parameters that can be bewildering to developers who have not
worked with audio before.
In general, you should pick the highest precision and sample rate supported by every stage of your audio
pipeline, including telephony and device outputs.
A typical digital audio setup will perform well with these settings, which match the standard audio CD format:
```
output_format: {
container: "raw",
encoding: "pcm_s16le",
sample_rate: 44100,
}
```
If you know your pipeline supports a higher encoding and sample rate end to end, the highest quality settings are:
```
output_format: {
container: "raw",
encoding: "pcm_f32le",
sample_rate: 48000,
}
```
## Reference
`container`: The container format (if any) for the audio output.
Available options: `RAW`, `WAV`, `MP3`. Only the Bytes endpoint supports all container formats;
our streaming endpoints (SSE, WebSockets) only support `RAW`.
`encoding`: The encoding of the output audio. Available options: `pcm_f32le`, `pcm_s16le`, `pcm_mulaw`, `pcm_alaw`.
For detailed guidance on when to use each encoding, see [Audio encodings](/build-with-cartesia/capability-guides/audio-encodings).
`sample_rate`: The sample rate of the output audio. Remember that to represent a given signal, the sample rate
must be at least twice the highest frequency component of the signal (Nyquist theorem).
Available options: `8000`, `16000`, `22050`, `24000`, `44100`, `48000`.
## Examples
### Audio CD quality
Standard audio CDs are encoded as `pcm_s16le` at 44.1kHz sample rate:
```
output_format: {
container: "raw",
encoding: "pcm_s16le",
sample_rate: 44100,
}
```
This performs well for consumer digital audio setups.
### Telephony
Many customers send their audio output over Twilio. Since all audio sent over Twilio is
transcoded to µlaw encoding with 8kHz sample rate (to match the telephony standard), you should
specify the following `output_format`:
```
output_format: {
container: "raw",
encoding: "pcm_mulaw",
sample_rate: 8000,
}
```
### Bluetooth headsets
If you happen to know that the user is using a Bluetooth headset (such as AirPods) to multiplex
both microphone input and headphone output, the user will be on the Bluetooth Hands-Free Profile
(HFP), limiting sample rate to 16kHz. (In practice, it's difficult to programmatically determine the
end-user's microphone/speaker devices, so this example is a bit contrived.)
```
output_format: {
container: "raw"
encoding: "pcm_s16le",
sample_rate: 16000,
}
```
# Clone Voices
Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/clone-voices
Learn how to get the best voice clones from your audio clips.
Voice cloning is available through the [playground](https://play.cartesia.ai) and the [API](/2024-11-13/api-reference/voices/clone). With current API versions, instant cloning uses **high-similarity** mode: clones sound more like the source clip, but may reproduce background noise. For the legacy **stability** workflow, pin API version `2024-11-13` and see [Older TTS models](/build-with-cartesia/tts-models/older-models).
For the best voice clones, we recommend following these best practices:
## General best practices for voice cloning
1. **Choose an appropriate script to speak.** You want your recording to align as closely as possible with the voice you want to generate. For example, don't read a colorless transcript in a monotone voice unless you're aiming for a monotonous clone. Instead, prepare a script that is suited to your use case and has the right energy.
2. **Speak as clearly as possible and avoid background noise.** For example, when recording yourself, try to use a high-quality microphone and be in a quiet space.
3. **Avoid long pauses.** Pauses in the recording, such as between sentences, will be mimicked by the cloned voice. Ensure your recording matches the pacing you want your voice to follow.
4. **Trim your recording.** The audio you provide should contain speech roughly from start to finish. Make sure the speaker is not cut off and that there's no excessive silence at the beginning or end. You can use a tool like Audacity or our playground to make the perfect clip from your recording.
5. **Speak in the target language.** For instance, if you want the cloned voice to speak Spanish, speak Spanish in the recording. If this is not possible, you can use Cartesia's localization feature—available in the playground and in the API—to convert your clone to a different language.
## Best practices for high-similarity clones
1. **Limit your recording to ten seconds.** This is the sweet spot for high-similarity clones. A longer clip will not result in a better clone.
2. **Set `enhance` to `false` when cloning.** Unless your source clip has substantial background noise, any postprocessing will reduce the similarity of the clone to the source clip.
# End-to-end Pro Voice Cloning (Python)
Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/clone-voices-pro/api
Use Cartesia's REST API to create a Pro Voice Clone.
> **Prerequisites**
>
> 1. You have a **Cartesia API key** (export it as `CARTESIA_API_KEY`, which the script below reads).
> 2. You have at least 1M credits on your account.
> 3. You have a folder called `samples/` with one or more `.wav` files.
```python lines theme={null}
"""
End-to-end Pro Voice Cloning example.
Steps
-----
1. Create a dataset.
2. Upload audio files from samples/ to the dataset.
3. Kick off a fine-tune from that dataset.
4. Poll until fine-tune is completed.
5. Get the voices produced by the fine-tune.
"""
import os
import time
from pathlib import Path
import requests
API_BASE = "https://api.cartesia.ai"
API_HEADERS = {
"Cartesia-Version": "2025-04-16",
"Authorization": f"Bearer {os.environ['CARTESIA_API_KEY']}",
}
def create_dataset(name: str, description: str) -> str:
"""POST /datasets → dataset id."""
res = requests.post(
f"{API_BASE}/datasets",
headers=API_HEADERS,
json={"name": name, "description": description},
)
res.raise_for_status()
return res.json()["id"]
def upload_file_to_dataset(dataset_id: str, path: Path) -> None:
"""POST /datasets/{dataset_id}/files (multipart/form-data)."""
with path.open("rb") as fp:
res = requests.post(
f"{API_BASE}/datasets/{dataset_id}/files",
headers=API_HEADERS,
files={"file": fp, "purpose": (None, "fine_tune")},
)
res.raise_for_status()
def create_fine_tune(dataset_id: str, *, name: str, language: str, model_id: str) -> str:
"""POST /fine-tunes → fine-tune id."""
body = {
"name": name,
"description": "Pro Voice Clone demo",
"language": language,
"model_id": model_id,
"dataset": dataset_id,
}
res = requests.post(f"{API_BASE}/fine-tunes", headers=API_HEADERS, json=body, timeout=60)
res.raise_for_status()
return res.json()["id"]
def wait_for_fine_tune(ft_id: str, every: float = 10.0) -> None:
"""Poll GET /fine-tunes/{id} until status == completed."""
start = time.monotonic()
while True:
res = requests.get(f"{API_BASE}/fine-tunes/{ft_id}", headers=API_HEADERS)
res.raise_for_status()
status = res.json()["status"]
print(f"fine-tune {ft_id} -> {status}. Elapsed: {time.monotonic() - start:.0f}s")
if status == "completed":
return
if status == "failed":
raise RuntimeError(f"fine-tune ended with status={status}")
time.sleep(every)
def list_voices(ft_id: str) -> list[dict]:
"""GET /fine-tunes/{id}/voices → list of voices."""
res = requests.get(f"{API_BASE}/fine-tunes/{ft_id}/voices", headers=API_HEADERS)
res.raise_for_status()
return res.json()["data"]
if __name__ == "__main__":
# Create the dataset
DATASET_ID = create_dataset("PVC demo", "Samples for a Pro Voice Clone")
print("Created dataset:", DATASET_ID)
# Upload .wav files to the dataset
for wav_path in Path("samples").glob("*.wav"):
upload_file_to_dataset(DATASET_ID, wav_path)
print(f"Uploaded {wav_path.name} to dataset {DATASET_ID}")
# Ask for confirmation before kicking off the fine-tune
confirmation = input(
"Are you sure you want to start the fine-tune? It will cost 1M credits upon successful completion (yes/no): "
)
if confirmation.lower() != "yes":
print("Fine-tuning cancelled by user.")
exit()
# Kick off the fine-tune
FINE_TUNE_ID = create_fine_tune(
DATASET_ID,
name="PVC demo",
language="en",
model_id="sonic-2",
)
print(f"Started fine-tune: {FINE_TUNE_ID}")
# Wait for training to finish
wait_for_fine_tune(FINE_TUNE_ID)
print("Fine-tune completed!")
# Fetch the voices created by the fine-tune
voices = list_voices(FINE_TUNE_ID)
print("Voices IDs:")
for voice in voices:
print(voice["id"])
```
# Pro Voice Cloning
Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/clone-voices-pro/playground
## Why use Pro Voice Cloning?
A Professional Voice Clone (PVC) is a voice that uses a fine-tune of our TTS model on your data, which allows it to create an almost exact replica of the voice it hears including accent, speaking style, and audio quality.
Compared to [Instant Voice Cloning](/build-with-cartesia/capability-guides/clone-voices), Pro Voice Cloning can capture the exact nuances of your hours of studio-quality audio voice data.
## Overview
Pro Voice Cloning is available in the [Playground](https://play.cartesia.ai/pro-voice-cloning) for anyone with a Cartesia subscription of Startup or higher. It allows you to create highly accurate voice clones by leveraging a larger amount of data compared to instant cloning.
| Feature | Required audio data | Pricing: cost to create | Pricing: cost to use for TTS |
| ------------------- | ------------------- | ----------------------- | ---------------------------- |
| Instant Voice Clone | 10 seconds | Free | 1 credit per character |
| Pro Voice Clone | 3 hours | 1M credits on success | 1.5 credits per character |
When you create a Pro Voice Clone, Cartesia first fine-tunes a model on your data, then creates Voices from selected clips of your data. These Voices are tied to the fine-tuned model, which is used automatically when you generate text-to-speech with them.
## Get started
Visit the Pro Voice Clone tab to get started on your first PVC. On the home page, you can see all your fine-tuned models and their statuses (Draft, Failed, Training, or Completed).
Fill out the form to create a Pro Voice Clone.
Then, upload all of the audio files you want to use for training. You can upload multiple
files at once. Files must be one of the following audio formats:
* .wav
* .mp3
* .flac
* .ogg
* .oga
* .ogx
* .aac
* .wma
* .m4a
* .opus
* .ac3
* .webm
Pro Voice Clones require a minimum of 30 minutes of audio, but we recommend 2 hours of audio for optimal balance of quality and effort. The Pro Voice Clone will closely match your uploaded data, so make sure it sounds the way you like in terms of background noise, loudness, and speech quality.
Generally, it's better to upload audio with only the speaker you wish to clone. Multi-speaker audio can interfere with cloning quality.
You can also reuse data from past Pro Voice Clones: switch to the **Select dataset** tab to view previous datasets. These datasets can be edited separately from your PVCs and are helpful for managing your audio files.
Training should take 3 hours to complete. You'll only be charged if the training is successful. If training fails, you can click the `Re-attempt Training` button to try again or contact [support](mailto:support@cartesia.ai) if the failures persist.
Once training is complete, we'll automatically create four Voices based on different source audio clips from your dataset. These Voices are internally linked to your fine-tuned model, which will be used when you specify the model ID of the fine-tuned model in your requests.
The Voices are also available in the Voice Library under My Voices and can be used through the API.
**Note about base model updates:**
We've fine-tuned the latest base model available in production, which is reflected in the displayed model ID. This means that the fine-tuned model is fixed to this particular model ID and will not be activated if you use a different `model_id`. PVCs will not automatically be updated for future base models, and will need to be retrained on each new base model.
Retraining a new fine-tuned model with new data or the latest base model will again cost 1M credits.
# Localize voices
Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/localize-voices
Learn how to localize voices for your brand or product.
The localization feature accepts a voice to localize, the gender of the voice, and the target language and accent to localize to, and produces a Voice that you can use to generate speech (or save as a new voice).
# Stream Inputs using Continuations
Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/stream-inputs-using-continuations
Learn how to stream input text to Sonic TTS.
In many real-time use cases, you don't have input text available upfront—like when you're generating it on the fly using a language model. For these cases, we support input streaming through a feature we call *continuations*.
This guide will cover how input streaming works from the perspective of the TTS model. If you just want to implement input streaming, see [the WebSocket API reference](/api-reference/tts/tts), which implements continuations using *contexts*.
## Continuations
Continuations are generations that extend already generated speech. They're called continuations because you're continuing the generation from where the last one left off, maintaining the *prosody* of the previous generation.
If you don't use continuations, you get sudden changes in prosody that create seams in the audio.
Prosody refers to the rhythm, intonation, and stress in speech. It's what makes speech flow naturally and sound human-like.
Let's say we're using an LLM and it generates a transcript in three parts, with a one second delay between each part:
1. `Hello, my name is Sonic.`
2. ` It's very nice`
3. ` to meet you.`
To generate speech for the whole transcript, we might think to generate speech for each part independently and stitch the audios together:
Unfortunately, we end up with speech that has sudden changes in prosody and strange pacing.
Now, let's try the same transcripts, but using continuations. The setup looks like this:
Here's what we get:
As you can hear, this output sounds seamless and natural.
You can scale up continuations to any number of inputs. There is no limit.
## Caveat: Streamed inputs should form a valid transcript when joined
This means that `"Hello, world!"` can be followed by `" How are you?"` (note the leading space) but not `"How are you?"`, since when joined they form the invalid transcript `"Hello, world!How are you?"`.
In practice, this means you should maintain spacing and punctuation in your streamed inputs.
**End complete sentences with closing punctuation** (for example `.`, `?`, or `!`).
If a streamed chunk does not end with sentence-ending punctuation, the model often treats it as an incomplete sentence. That can cause:
* **Extra latency:** Text may stay in the automatic input buffer until the model sees a clearer boundary or until `max_buffer_delay_ms` elapses (**3000ms by default**), so audio starts later than you expect.
* **Audio artifacts:** The model expects natural sentence endings; without closing punctuation, the generated audio sometimes ends with odd or distorted sounds.
When a user-facing utterance is finished, put terminal punctuation on the final segment (and signal that no more text is coming on the context when appropriate, for example `no_more_inputs()` in the SDK or `continue: false` over the WebSocket).
## Automatic buffering with `max_buffer_delay_ms`
When streaming inputs from LLMs word by word or token by token, we buffer text until it reaches the optimal transcript length for our model. The default buffer delay is 3000ms; if you wish to modify this, you can use the `max_buffer_delay_ms` parameter, though we *do not recommend making this change*.
If you plan on using `speed` or `volume` [SSML tags](/build-with-cartesia/sonic-3/ssml-tags) with buffering, make sure decimal values are not split up.
Submitting `1.0` as `1`, `.`, `0` will result in unintended failure modes.
### How it works
When set, the model will buffer incoming text chunks until it's confident it has enough context to generate high-quality speech, or the buffer delay elapses, whichever comes first.
Without this buffer, the model would immediately start generating with each input, which could result in choppy audio or unnatural prosody if inputs are very small (like single words or tokens).
### Configuration
* **Range**: Values between 0-5000ms are supported
* **Default**: 3000ms
Use this *only* if
* you have custom buffering client side, in which case you can set this to 0
* you have choppiness even at 3000ms, in which case you can try a higher value
```js lines theme={null}
// Example WebSocket request with `max_buffer_delay_ms`
{
"model_id": "sonic-3",
"transcript": "Hello", // First word/token
"voice": {
"mode": "id",
"id": "a0e99841-438c-4a64-b679-ae501e7d6091"
},
"context_id": "my-conversation-123",
"continue": true,
"max_buffer_delay_ms": 3000 // Buffer up to 3000ms
}
```
Let's try the following transcripts with continuations and the default `max_buffer_delay_ms=3000`: `['Hello', 'my name', 'is Sonic.', "It's ", 'very ', 'nice ', 'to ', 'meet ', 'you.']`
# Custom Pronunciations
Source: https://docs.cartesia.ai/build-with-cartesia/sonic-3/custom-pronunciations
Learn how to specify custom pronunciations for words that are hard to get right, like proper nouns or domain-specific terms.
All models in the Sonic TTS family support custom pronunciations in your transcripts. Try out the pronunciation tool on our [demo](https://play.cartesia.ai/demos/pronunciation) page.
`sonic-3` supports custom pronunciation dictionaries, which allow specifying how to pronounce a specific word or words more easily and sustainably.
At its core, a dictionary is a simple search and replace that directs the model to substitute another string for the matched text in the transcript. The pronunciation can either be an [IPA pronunciation](/build-with-cartesia/sonic-3/phonemes) or "sounds-like" guidance:
```json lines theme={null}
[
{
"text": "bayou",
"pronunciation": "<<ˈ|b|ɑ|ˈ|j|u>>"
},
{
"text": "jambalaya",
"pronunciation": "<<ˈ|dʒ|ə|m|ˈ|b|ə|ˈ|l|aɪ|ˈ|ə>>"
},
{
"text": "tchoupitoulas",
"pronunciation": "chop-uh-TOO-liss"
}
]
```
These JSONs can then be saved as pronunciation dictionaries [through our API](https://docs.cartesia.ai/api-reference/pronunciation-dicts/create) or through our [playground](https://play.cartesia.ai/pronunciation), which also provides UI affordances for creating and editing dictionaries.
Once a dictionary is created, it can be used in any of the TTS APIs by specifying its ID in `pronunciation_dict_id`.
With the above dictionary, the string `I ate some jambalaya on tchoupitoulas street` would become `I ate some <<ˈ|dʒ|ə|m|ˈ|b|ə|ˈ|l|aɪ|ˈ|ə>> on chop-uh-TOO-liss street` before being handed off to the model, which in turn does a better job of pronouncing it properly.
## Case Sensitivity
Dictionary matching is **case-sensitive**, with one exception: a lowercase entry also matches its sentence-start capitalized form. For example, `cat` matches both `cat` and `Cat`, but not `CAT`. An entry for `CAT` only matches `CAT`.
This applies to multi-word entries too. An entry for `green valley` matches `green valley` and `Green valley`, but not `Green Valley`.
**Use lowercase entries for common words.** These match the word both mid-sentence (`cat`) and at the start of a sentence (`Cat`), covering the two most common positions.
**Use exact capitalization for proper nouns.** A term like "LaTeX" should be entered as `LaTeX` so it doesn't collide with a different pronunciation for the common word `latex`. For multi-word proper nouns, enter the exact casing as it appears in your transcripts — for example, `Green Valley` if the transcript capitalizes both words.
> For the best controllability around pronunciation, we recommend using `sonic-3`.
`sonic-2` and `sonic-turbo` use MFA-style IPA for all languages. Of these two, `sonic-2` offers the best controllability around pronunciation.
You can also get custom pronunciations with older Sonic models.
The `sonic`, `sonic-2024-12-12`, and `sonic-2024-10-19` models use Sonic-flavored IPA phonemes for English.
The `sonic` and `sonic-2024-12-12` models use MFA-style IPA for languages other than English, and the Sonic Preview model uses MFA-style IPA for all languages.
Note that `sonic-2024-10-19` does not support custom pronunciations for languages other than English.
We will soon be updating all models to use MFA-style IPA.
Custom words should be wrapped in double angle brackets `<<` `>>`, with pipe characters `|` between phonemes and no whitespace.
For example:
* `Can I get <> on that?` (MFA-style IPA)
* `Can I get <> on that?` (Sonic-flavored IPA)
Each individual word should be wrapped in its own set of angle brackets.
# MFA-style IPA
## Constructing Pronunciations
We use the IPA phoneset as defined by the [Montreal Forced Aligner](https://montreal-forced-aligner.readthedocs.io/en/latest/). Because of the size and complexity of this phoneset, you may find it easier to construct your custom pronunciation starting from existing words with known phonemizations. We suggest the following workflow for constructing a custom pronunciation for a word:
1. Go to the [MFA pronunciation dictionary index](https://mfa-models.readthedocs.io/en/latest/dictionary/index.html) and find the page corresponding to your language. Make sure the phoneset is MFA, and try to download the latest version (for most languages, v3.0 or v3.1).
1. This page will give you the full range of acceptable phones for your language under the “phones” section.
2. Scroll down to the `Installation` section and click on the `Download from the release page` link.
3. Scroll to the bottom of the release page and download the .dict file; this is a text file mapping words to their constituent phonemes.
1. The first column in the file contains words, and the last column contains space delimited phonemes. Ignore the other columns.
4. Look up your word or words that sound similar to your intended pronunciation in the dictionary. Use these pronunciations as a starting point to construct your custom pronunciation.
Automatic pronunciation suggestions based on audio samples will be added in a future update. Note that MFA-style IPA does not support stress markers.
## Example
Suppose I want to generate the text “This is a generation from Cartesia” and the model is not pronouncing “Cartesia” correctly. I would do the following:
1. Go to the [MFA pronunciation dictionary index](https://mfa-models.readthedocs.io/en/latest/dictionary/index.html) and look for English pronunciation dictionaries. I see that for US English, the most recent version is v3.1.
1. I note that the page says that the acceptable phones for US english are `aj aw b bʲ c cʰ cʷ d dʒ dʲ d̪ ej f fʲ h i iː j k kʰ kʷ l m mʲ m̩ n n̩ ow p pʰ pʲ pʷ s t tʃ tʰ tʲ tʷ t̪ v vʲ w z æ ç ð ŋ ɐ ɑ ɑː ɒ ɒː ɔj ə ɚ ɛ ɝ ɟ ɟʷ ɡ ɡʷ ɪ ɫ ɫ̩ ɱ ɲ ɹ ɾ ɾʲ ɾ̃ ʃ ʉ ʉː ʊ ʎ ʒ ʔ θ`
2. Download the .dict file from the bottom of the [release page](https://github.com/MontrealCorpusTools/mfa-models/releases/tag/dictionary-english_us_mfa-v3.1.0).
3. Find a word in this dictionary that sounds similar to how I want “Cartesia” to be pronounced. I see this entry in the dictionary:
`cartesian 0.99 0.14 1.0 1.0 kʰ ɑ ɹ tʲ i ʒ ə n`
4. Ignore the middle four numeric columns. I want to cut off the part of the pronunciation that corresponds to “-an” and replace it with an “uh” sound. I know that the MFA phoneme for “uh” is `ɐ` (if I didn’t know that, I could also look up “uh” in the dictionary). So the pronunciation I want is `kʰ ɑ ɹ tʲ i ʒ ɐ`.
5. Format the phonemes in angle brackets with pipe characters between phonemes and no whitespace. So my transcript is `This is a generation from <<kʰ|ɑ|ɹ|tʲ|i|ʒ|ɐ>>`.
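If you do this often, it may help to script the lookup. Here is a minimal sketch that pulls a word's phonemes out of a downloaded `.dict` file and formats them for a transcript, based on the column layout described above; the file name is illustrative.
```python theme={null}
def _is_number(token: str) -> bool:
    try:
        float(token)
        return True
    except ValueError:
        return False


def mfa_pronunciation(dict_path: str, word: str) -> str | None:
    """Look up `word` in an MFA .dict file and format it as <<p1|p2|...>>."""
    with open(dict_path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if parts and parts[0] == word:
                # Skip the numeric probability columns; keep only the phonemes.
                phonemes = [p for p in parts[1:] if not _is_number(p)]
                return "<<" + "|".join(phonemes) + ">>"
    return None


# mfa_pronunciation("english_us_mfa.dict", "cartesian")
# -> "<<kʰ|ɑ|ɹ|tʲ|i|ʒ|ə|n>>"
```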
# (Deprecated) Sonic-flavored IPA
Sonic-flavored IPA is only for `sonic`; users of our latest models (`sonic-2` and `sonic-turbo`) should use MFA-style IPA.
Here is a pronunciation guide for Sonic-flavored IPA.
It follows the [English phonology article on Wikipedia](https://en.wikipedia.org/wiki/English_phonology) for most phonemes,
but in spots where our model requires different notation than you may expect, we've included a blue `<=` in the margins.
You can copy/paste some of these uncommon symbols from the original [charts here](https://docs.google.com/spreadsheets/d/1OJbiKtxLyodpNPqVfOu43X2HloLsAixTtFppEuQ_4pI/edit?usp=sharing).
## Stresses and vowel length markers
Sonic English requires stress markers for first (`ˈ`) and second (`ˌ`) stressed syllables, which go directly before the vowel. We also use annotations for vowel length (`ː`). The model can also operate without them, but you will have noticeably better robustness and control when using them.
# Prompting tips
Source: https://docs.cartesia.ai/build-with-cartesia/sonic-3/prompting-tips
1. **Use appropriate punctuation.** Add punctuation where appropriate and at the end of each transcript whenever possible.
2. **Use dates in MM/DD/YYYY form.** For example, 04/20/2023.
3. **Add spaces between time and AM/PM.** For example, `7:00 PM`, `7 PM`, `7:00 P.M`.
4. **Insert pauses.** To insert pauses, insert "-" or use [break tags](/build-with-cartesia/formatting-text-for-sonic-2/inserting-breaks-pauses) where you need the pause. These tags are counted as 1 character and do not need to be separated from adjacent text with a space, so to save credits you can remove spaces around break tags.
5. **Match the voice to the language.** Each voice has a language that it works best with. You can use the playground to quickly understand which voices are most appropriate for a language.
6. **Stream in inputs for contiguous audio.** Use [continuations](/build-with-cartesia/capability-guides/stream-inputs-using-continuations) if generating audio that should sound contiguous in separate chunks.
7. **Specify [custom pronunciations](/build-with-cartesia/sonic-3/custom-pronunciations) for domain-specific or ambiguous words.** You may want to do this for proper nouns and trademarks, as well as for words that are spelled the same but pronounced differently, like the city of Nice and the adjective "nice."
8. **Force [spelling out numbers and letters](/build-with-cartesia/sonic-3/ssml-tags#spelling-out-numbers-and-letters).** You may want to do this for IDs, email addresses, or numeric values.
For sonic-2, see [Formatting Text for Sonic-2](/build-with-cartesia/formatting-text-for-sonic-2/best-practices).
# SSML Tags
Source: https://docs.cartesia.ai/build-with-cartesia/sonic-3/ssml-tags
Tags for volume, speed, and emotions are in beta and subject to change in the future.
Sonic-3 supports SSML-like (Speech Synthesis Markup Language) tags to control generated speech.
## Speed
Note that if you're streaming token by token, you'll need to buffer the whole value of the speed or volume tags.
Passing in `1`, `.`, `0` as separate inputs, for example, will result in reading out the tags.
You can guide the speed of a TTS generation with a `speed` tag, which takes a scalar between `0.6` and `1.5`.
This value is roughly a multiplier on the default speed. For example, `1.5` will generate audio at roughly 1.5x the
default speed.
```xml theme={null}
I like to speak quickly because it makes me sound smart.
```
## Volume
You can guide the volume of a TTS generation with a `volume` tag, which takes a value between `0.5` and `2.0`. The default volume is `1`.
```xml theme={null}
I will speak softly.
```
## Emotion Beta
Emotion control is highly experimental, particularly when emotion shifts occur
mid-generation. If you need to change the emotion in a transcript, we recommend
using separate generation contexts for each emotion. For best results, use [Voices
tagged as "Emotive"](https://play.cartesia.ai/voices?tags=Emotive), as emotions may not work reliably with other Voices.
```xml theme={null}
I will not allow you to continue this! I was hoping for a peaceful resolution.
```
## Pauses and breaks
To insert breaks (or pauses) in generated speech, use a `break` tag with one attribute, `time`. For example, `<break time="1s" />`. You can specify the time in seconds (`s`) or milliseconds (`ms`).
For accounting purposes, break tags count as 1 character and do not need a space separating them from adjacent text, so you can remove spaces around them to save credits.
```xml theme={null}
Hello, my name is Sonic.<break time="1s" />Nice to meet you.
```
## Spelling out numbers and letters
To spell out input text, you can wrap it in `<spell>` tags.
This is particularly useful for pronouncing long numbers or identifiers, such as credit card numbers, phone numbers, or unique IDs.
```xml theme={null}
My name is Bob, spelled <spell>Bob</spell>, my account number is <spell>ABC-123</spell>, my phone number is <spell>(123) 456-7890</spell>, and my credit card is <spell>1234-5678-9012-3456</spell>.
```
If you want to spell out numbers or identifiers and have planned breaks between the generations (e.g. taking a break between the area code of a phone number and the rest of that number), you can combine `<spell>` and `<break>` tags. These tags each count as 1 character and do not need a space separating them from adjacent text, so you can remove spaces around break and spell tags to save credits.
```xml theme={null}
My phone number is <spell>(123)</spell><break time="0.5s" /><spell>4712177</spell> and my credit card number is <spell>1234</spell><break time="0.5s" /><spell>5678</spell><break time="0.5s" /><spell>6347</spell><break time="0.5s" /><spell>4537</spell>.
```
# Volume, Speed, and Emotion
Source: https://docs.cartesia.ai/build-with-cartesia/sonic-3/volume-speed-emotion
Sonic-3 provides rich controls for the speed, volume, and emotion of generated speech. These controls are available on play.cartesia.ai using the UI controls, or by passing in a `generation_config` parameter, or by using SSML tags within the transcript itself.
**Sonic-3 interprets these parameters as guidance** instead of as strict adjustments to ensure natural speech, so we recommend testing them against your content to ensure the output matches your expectations.
## Speed and Volume Controls
You can guide the speed and volume of a TTS generation with the `generation_config.speed` and `generation_config.volume` parameters. These values are roughly a multiplier on the default speed and volume, e.g., `1.5` will generate audio at 1.5x the default speed.
* `generation_config.speed`: the speed of the generation, ranging from `0.6` to `1.5`.
* `generation_config.volume`: the volume of the generation, ranging from `0.5` to `2.0`.
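As a minimal sketch of the API-parameter route, assuming the Python SDK forwards a `generation_config` object with the `speed` and `volume` fields named exactly as above (check the TTS API reference for the precise request shape); `your-api-key` and `your-voice-id` are placeholders:

```python theme={null}
from cartesia import Cartesia

client = Cartesia(api_key="your-api-key")

# Guidance, not a strict adjustment: roughly 1.2x speed at slightly reduced volume.
audio = client.tts.bytes(
    model_id="sonic-3",
    transcript="Thanks for calling! Let me pull up your account.",
    voice={"mode": "id", "id": "your-voice-id"},
    language="en",
    output_format={"container": "wav", "encoding": "pcm_f32le", "sample_rate": 44100},
    generation_config={"speed": 1.2, "volume": 0.8},  # assumed kwarg; mirrors generation_config.speed/.volume
)
with open("guided.wav", "wb") as f:
    for chunk in audio:
        f.write(chunk)
```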
You can also specify these inside the transcript itself, using [SSML](/build-with-cartesia/sonic-3/ssml-tags), for example:
```xml lines theme={null}
<speed value="1.5">I like to speak quickly because it makes me sound smart.</speed>
<volume value="1.5">And I can be loud, too!</volume>
```
## Emotion Controls Beta
By default, the model attempts to interpret the emotional subtext present in the provided transcript. You can guide the emotion of a TTS generation, like a director providing guidance to an actor, using the `generation_config.emotion` parameter.
Emotion tags are good for pushing the model to be more emotive, but they only work when the emotion is consistent with the transcript. For instance, the mismatch below is unlikely to work well:
```xml theme={null}
<emotion value="sad">I'm so excited!</emotion>
```
* `generation_config.emotion`: the emotional guidance for a generation, one of the emotions below.
The primary emotions, for which we have the most data and produce the best results, are: `neutral`, `angry`, `excited`, `content`, `sad`, and `scared`.
The complete list of available emotions is: `happy`, `excited`, `enthusiastic`, `elated`, `euphoric`, `triumphant`, `amazed`, `surprised`, `flirtatious`, `joking/comedic`, `curious`, `content`, `peaceful`, `serene`, `calm`, `grateful`, `affectionate`, `trust`, `sympathetic`, `anticipation`, `mysterious`, `angry`, `mad`, `outraged`, `frustrated`, `agitated`, `threatened`, `disgusted`, `contempt`, `envious`, `sarcastic`, `ironic`, `sad`, `dejected`, `melancholic`, `disappointed`, `hurt`, `guilty`, `bored`, `tired`, `rejected`, `nostalgic`, `wistful`, `apologetic`, `hesitant`, `insecure`, `confused`, `resigned`, `anxious`, `panicked`, `alarmed`, `scared`, `neutral`, `proud`, `confident`, `distant`, `skeptical`, `contemplative`, `determined`.
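Under the same assumptions as the speed/volume sketch above (the `generation_config` kwarg and the placeholder key and voice ID are illustrative), emotion guidance that matches the transcript's subtext might look like:

```python theme={null}
from cartesia import Cartesia

client = Cartesia(api_key="your-api-key")

audio = client.tts.bytes(
    model_id="sonic-3",
    transcript="We did it! I can't believe we actually pulled it off!",
    voice={"mode": "id", "id": "your-voice-id"},  # an Emotive-tagged voice works best
    language="en",
    output_format={"container": "wav", "encoding": "pcm_f32le", "sample_rate": 44100},
    generation_config={"emotion": "excited"},  # one of the emotions listed above
)
```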
The Voices with the best emotional response are:
* [Leo](https://play.cartesia.ai/voices/0834f3df-e650-4766-a20c-5a93a43aa6e3) (id: `0834f3df-e650-4766-a20c-5a93a43aa6e3`)
* [Jace](https://play.cartesia.ai/voices/6776173b-fd72-460d-89b3-d85812ee518d) (id: `6776173b-fd72-460d-89b3-d85812ee518d`)
* [Kyle](https://play.cartesia.ai/voices/c961b81c-a935-4c17-bfb3-ba2239de8c2f) (id: `c961b81c-a935-4c17-bfb3-ba2239de8c2f`)
* [Gavin](https://play.cartesia.ai/voices/f4a3a8e4-694c-4c45-9ca0-27caf97901b5) (id: `f4a3a8e4-694c-4c45-9ca0-27caf97901b5`)
* [Maya](https://play.cartesia.ai/voices/cbaf8084-f009-4838-a096-07ee2e6612b1) (id: `cbaf8084-f009-4838-a096-07ee2e6612b1`)
* [Tessa](https://play.cartesia.ai/voices/6ccbfb76-1fc6-48f7-b71d-91ac6298247b) (id: `6ccbfb76-1fc6-48f7-b71d-91ac6298247b`)
* [Dana](https://play.cartesia.ai/voices/cc00e582-ed66-4004-8336-0175b85c85f6) (id: `cc00e582-ed66-4004-8336-0175b85c85f6`)
* [Marian](https://play.cartesia.ai/voices/26403c37-80c1-4a1a-8692-540551ca2ae5) (id: `26403c37-80c1-4a1a-8692-540551ca2ae5`)
View the full list of emotive Voices on our [Voice Library with voices tagged 'Emotive'](https://play.cartesia.ai/voices?tags=Emotive).
You can also use [SSML](/build-with-cartesia/sonic-3/ssml-tags) tags for emotions, for example:
```xml theme={null}
<emotion value="angry">How dare you speak to me like I'm just a robot!</emotion>
```
## Nonverbalisms
Insert `[laughter]` in your transcript to make the model laugh. In the future, we plan to add more non-speech verbalisms like sighs and coughs.
# STT Models
Source: https://docs.cartesia.ai/build-with-cartesia/stt-models
Ink is a new family of streaming speech-to-text (STT) models for developers building real-time voice applications.
Each base model name (e.g. `ink-whisper`) points to the latest **stable** snapshot of the model, so we recommend using the base model name to stay on the stable version. In many cases the stable and preview snapshots are the same, but the preview snapshot may include additional features or improvements.
## `ink-whisper`
Ink Whisper is the fastest, most affordable speech-to-text model — engineered for enterprise deployment in production-grade voice agents. It delivers higher accuracy than baseline Whisper and is optimized for real-time performance in a wide variety of real-world conditions.
Additional Capabilities:
* Handles variable-length audio chunks and interruptions gracefully using dynamic chunking.
* Reliably transcribes speech with background noise.
* Accurately transcribes audio with telephony artifacts, accents, and disfluencies.
* Excels at transcribing proper nouns and domain-specific terminology.
| Snapshot | Release Date | Languages | Status |
| ------------------------------------ | ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------ |
| `ink-whisper` | June 10, 2025 | `en`, `zh`, `de`, `es`, `ru`, `ko`, `fr`, `ja`, `pt`, `tr`, `pl`, `ca`, `nl`, `ar`, `sv`, `it`, `id`, `hi`, `fi`, `vi`, `he`, `uk`, `el`, `ms`, `cs`, `ro`, `da`, `hu`, `ta`, `no`, `th`, `ur`, `hr`, `bg`, `lt`, `la`, `mi`, `ml`, `cy`, `sk`, `te`, `fa`, `lv`, `bn`, `sr`, `az`, `sl`, `kn`, `et`, `mk`, `br`, `eu`, `is`, `hy`, `ne`, `mn`, `bs`, `kk`, `sq`, `sw`, `gl`, `mr`, `pa`, `si`, `km`, `sn`, `yo`, `so`, `af`, `oc`, `ka`, `be`, `tg`, `sd`, `gu`, `am`, `yi`, `lo`, `uz`, `fo`, `ht`, `ps`, `tk`, `nn`, `mt`, `sa`, `lb`, `my`, `bo`, `tl`, `mg`, `as`, `tt`, `haw`, `ln`, `ha`, `ba`, `jw`, `su`, `yue` | Stable |
| `ink-whisper-2025-06-04` | June 4, 2025 | `en`, `zh`, `de`, `es`, `ru`, `ko`, `fr`, `ja`, `pt`, `tr`, `pl`, `ca`, `nl`, `ar`, `sv`, `it`, `id`, `hi`, `fi`, `vi`, `he`, `uk`, `el`, `ms`, `cs`, `ro`, `da`, `hu`, `ta`, `no`, `th`, `ur`, `hr`, `bg`, `lt`, `la`, `mi`, `ml`, `cy`, `sk`, `te`, `fa`, `lv`, `bn`, `sr`, `az`, `sl`, `kn`, `et`, `mk`, `br`, `eu`, `is`, `hy`, `ne`, `mn`, `bs`, `kk`, `sq`, `sw`, `gl`, `mr`, `pa`, `si`, `km`, `sn`, `yo`, `so`, `af`, `oc`, `ka`, `be`, `tg`, `sd`, `gu`, `am`, `yi`, `lo`, `uz`, `fo`, `ht`, `ps`, `tk`, `nn`, `mt`, `sa`, `lb`, `my`, `bo`, `tl`, `mg`, `as`, `tt`, `haw`, `ln`, `ha`, `ba`, `jw`, `su`, `yue` | Stable |
To learn how to use the Ink STT family, see [the Speech-to-Text API Reference](/api-reference/stt/stt). For a detailed mapping of codes to languages, see the [STT supported languages](/api-reference/stt/stt#request.query.language) list.
## Selecting a Model
When making API calls, you can specify either:
```python lines theme={null}
# Use the base model (automatically routes to the latest snapshot)
model = "ink-whisper"

# Or specify a particular snapshot for consistency
model = "ink-whisper-2025-06-04"
```
### Continuous updates
All models have a base model name (e.g. `ink-whisper`) and date-versioned names (e.g. `ink-whisper-2025-06-04`).
We recommend using base model names for prototyping and development, then switching to a date-versioned model in production to ensure stability.
## Future Updates
New snapshots are released periodically with improvements in performance, additional language support, and new capabilities. Check back regularly for updates.
# API Changes
Source: https://docs.cartesia.ai/build-with-cartesia/tts-models/api-changes
Starting June 1, 2026, we are discontinuing several models, snapshots, and languages, and removing voice embeddings from our voice API. Migrate to `sonic-3` for improved naturalness, 42-language support, and fine-grained controls.
## Deprecated models and languages
You can check if you're making requests to deprecated models on [play.cartesia.ai/deprecation/traffic](https://play.cartesia.ai/deprecation/traffic).
### Fully deprecated models
These models will stop serving requests on June 1, 2026.
| Model | Snapshots affected | Deprecated languages |
| -------------------- | ------------------------ | -------------------- |
| `sonic` | All | All |
| `sonic-english` | — | All |
| `sonic-multilingual` | — | All |
| `sonic-2` | `sonic-2-2025-03-07` | All |
| `sonic-turbo` | `sonic-turbo-2025-03-07` | All |
### Partially deprecated models
These models will continue to serve a reduced set of languages. The languages listed below will be discontinued on June 1, 2026.
| Model | Snapshots affected | Deprecated languages |
| ------------- | ---------------------------------------------------------------- | -------------------------- |
| `sonic-2` | `sonic-2-2025-04-16`, `sonic-2-2025-05-08`, `sonic-2-2025-06-11` | it, nl, pl, ru, sv, tr, hi |
| `sonic-turbo` | `sonic-turbo-2025-06-04` | it, nl, pl, ru, sv, tr |
## Stable offerings
The following will remain available beyond June 1.
| Model | Snapshots | Supported Languages |
| ------------- | ---------------------------------------------------------------- | ----------------------------------------------------------------------------------- |
| `sonic-3` | All | 42 languages — [full list](/build-with-cartesia/tts-models/latest#language-support) |
| `sonic-2` | `sonic-2-2025-04-16`, `sonic-2-2025-05-08`, `sonic-2-2025-06-11` | en, de, es, fr, ja, ko, pt, zh |
| `sonic-turbo` | `sonic-turbo-2025-06-04` | en, de, es, fr, ja, ko, pt, zh, hi |
## API changes
These endpoints will be discontinued on June 1, 2026.
| Discontinued Endpoint | Replacement |
| ------------------------------------------ | ------------------------------------------ |
| Voice Embedding: `POST /voices/clone/clip` | [Clone Voice](/api-reference/voices/clone) |
| Mix Voices: `POST /voices/mix` | — |
| Create Voice: `POST /voices` | [Clone Voice](/api-reference/voices/clone) |
These endpoints will stop accepting voice embeddings on June 1, 2026.
| Endpoint with a breaking change | Replacement |
| ------------------------------------- | ------------------------------------------------------ |
| TTS (bytes): `POST /tts/bytes` | [Voice IDs](/build-with-cartesia/tts-models/voice-ids) |
| TTS (SSE): `POST /tts/sse` | [Voice IDs](/build-with-cartesia/tts-models/voice-ids) |
| TTS (WebSocket): `WSS /tts/websocket` | [Voice IDs](/build-with-cartesia/tts-models/voice-ids) |
You can test these API changes by setting your [Cartesia Version](/use-the-api/api-conventions#always-send-a-cartesia-version-header) to `2026-03-01`. We recommend upgrading your Cartesia Version on production traffic before June 1 to make sure nothing breaks.
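For example, a hedged sketch of sending the version header on a raw request; the `X-API-Key` header and the `/tts/bytes` payload shape follow the other examples in these docs, and `your-api-key` and `your-voice-id` are placeholders:

```python theme={null}
import requests

resp = requests.post(
    "https://api.cartesia.ai/tts/bytes",
    headers={
        "Cartesia-Version": "2026-03-01",  # opt in to the new behavior before June 1, 2026
        "X-API-Key": "your-api-key",
        "Content-Type": "application/json",
    },
    json={
        "model_id": "sonic-3",
        "transcript": "Testing against the 2026-03-01 API version.",
        "voice": {"mode": "id", "id": "your-voice-id"},
        "language": "en",
        "output_format": {"container": "wav", "encoding": "pcm_f32le", "sample_rate": 44100},
    },
)
resp.raise_for_status()
with open("version-test.wav", "wb") as f:
    f.write(resp.content)
```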
### Moving off of deprecated endpoints
1. Change how you create voices — see [Migrating Voices](/build-with-cartesia/tts-models/migrating-voices).
2. Switch from voice embeddings to IDs — see [Voice IDs](/build-with-cartesia/tts-models/voice-ids).
## Full Checklist
1. Move off of [deprecated models / snapshots / languages](/build-with-cartesia/tts-models/api-changes#deprecated-models-and-languages) onto `sonic-3` or another stable model
2. Move off of [deprecated endpoints](/build-with-cartesia/tts-models/api-changes#api-changes) when creating voices
3. Use [Voice IDs](/build-with-cartesia/tts-models/voice-ids)
4. Check your deprecated model traffic on [play.cartesia.ai/deprecation/traffic](https://play.cartesia.ai/deprecation/traffic)
5. Make sure your voices are migrated on [play.cartesia.ai/deprecation/voices](https://play.cartesia.ai/deprecation/voices)
6. (Optional) Upgrade your [Cartesia Version](/use-the-api/api-conventions#always-send-a-cartesia-version-header) to `2026-03-01`
## Why are we doing this?
Since the launch of Sonic 3, we've made improvements across pacing, prosody, and naturalness; the vast majority of our customers have migrated to these models with great success. In order to increase our capacity, availability, and serving performance, we have to discontinue our oldest models.
Additionally, our newer models have evolved beyond voice embeddings in order to sound more natural. The parts of our API that accept voice embeddings cannot be made forward-compatible. Migrating to voice IDs will allow us to continue to improve both our models and your voices in tandem.
If you have questions, reach out to [support@cartesia.ai](mailto:support@cartesia.ai).
# Migrating Voices
Source: https://docs.cartesia.ai/build-with-cartesia/tts-models/migrating-voices
On June 1, 2026, we are discontinuing our voice embedding (aka stability) TTS models.
Voices listed on [play.cartesia.ai/deprecation/voices](https://play.cartesia.ai/deprecation/voices) will stop working. Simply click "Auto Migrate" to make these voices compatible with the latest Sonic 3, 2, and Turbo snapshots.
If you use voice embeddings rather than voice IDs, see [Voice IDs](/build-with-cartesia/tts-models/voice-ids).
For an overview of all changes, see [API Changes](/build-with-cartesia/tts-models/api-changes).
## Where do these voices come from?
Voices created by these endpoints rely on our voice embedding models:
* [POST /voices](/2024-06-10/api-reference/voices/create)
* [POST /voices/mix](/2024-06-10/api-reference/voices/mix)
* `POST /voices/clone/clip`
## Creating voices
You can move to our [Clone Voice API](/api-reference/voices/clone) or use our [web UI](https://play.cartesia.ai/voices/create/clone) to create voices from 3–10 seconds of source audio.
You can test these API changes by setting your [Cartesia Version](/use-the-api/api-conventions#always-send-a-cartesia-version-header) to `2026-03-01`. We recommend upgrading your Cartesia Version on production traffic before June 1 to make sure nothing breaks.
Here is an example using the Cartesia SDK:
```python theme={null}
from cartesia import Cartesia

your_api_key: str = ""
client = Cartesia(api_key=your_api_key)
print("Cloning a voice")
with open("3 to 10 seconds of source audio.wav", mode="rb") as f:
voice = client.voices.clone(
clip=f,
# this must match the source audio
language="en",
name="My Voice",
mode="similarity",
)
print(f"Cloned voice {voice.id}")
print("Generating audio...")
generated_audio = client.tts.bytes(
# voice embeddings will not work after June 1, 2026!
voice={"mode": "id", "id": voice.id},
model_id="sonic-3",
transcript="Hello from Cartesia!",
language="en",
output_format={
"container": "wav",
"encoding": "pcm_f32le",
"sample_rate": 44100
},
)
# Save the generated audio so you can listen to the cloned voice.
with open("cloned_voice_sample.wav", "wb") as out:
    for chunk in generated_audio:
        out.write(chunk)
```
# Older TTS Models
Source: https://docs.cartesia.ai/build-with-cartesia/tts-models/older-models
We recommend using [Sonic 3](/build-with-cartesia/tts-models/latest) for best
results, most languages, and controllability. We continue to serve these older
models for compatibility.
Some models and snapshots are being discontinued on June 1, 2026 — see [API Changes](/build-with-cartesia/tts-models/api-changes) for details.
The base model name always points to the latest **stable** snapshot of a model. In the tables below, a Status of **EOL June 1, 2026** marks snapshots and languages to be discontinued on that date.
All models have a base model name (e.g. `sonic-2`, `sonic-turbo`) and date-versioned model names
(e.g. `sonic-2-2025-06-11`).
We recommend using base model names for prototyping and development, then switching to a date-versioned model for production use cases to ensure stability.
When making API calls, you can specify either:
```python lines theme={null}
# Use the base model
# (automatically routes to the latest stable snapshot)
model_id = "sonic-3"
# Or specify a particular snapshot for consistency
model_id = "sonic-3-2026-01-12"
```
## `sonic-2`
Sonic-2 provides ultra-realistic speech with accurate transcript following, minimal hallucinations, and excellent voice cloning. It's latency optimized and achieves 90ms model latency.
Additional Capabilities:
* Higher fidelity voice cloning
* Timestamps for all 15 languages
* [Infill](/2024-11-13/api-reference/infill/bytes) support
| Snapshot | Release Date | Languages | Status |
| ------------------------------------------- | -------------- | ---------------------------------------------------------- | ---------------- |
| `sonic-2-2025-06-11` | June 11, 2025 | en, fr, de, es, pt, zh, ja, ko | Stable |
| `sonic-2-2025-06-11` | June 11, 2025 | hi, it, nl, pl, ru, sv, tr | EOL June 1, 2026 |
| `sonic-2-2025-05-08` | May 8, 2025 | en, fr, de, es, pt, zh, ja, ko | Stable |
| `sonic-2-2025-05-08` | May 8, 2025 | hi, it, nl, pl, ru, sv, tr | EOL June 1, 2026 |
| `sonic-2-2025-04-16` | April 16, 2025 | en, fr, de, es, pt, zh, ja, ko | Stable |
| `sonic-2-2025-04-16` | April 16, 2025 | hi, it, nl, pl, ru, sv, tr | EOL June 1, 2026 |
| `sonic-2-2025-03-07` | March 7, 2025 | en, fr, de, es, pt, zh, ja, hi, it, ko, nl, pl, ru, sv, tr | EOL June 1, 2026 |
Read these pages to learn more about how to use Sonic-2:
* [Best practices](/build-with-cartesia/formatting-text-for-sonic-2/best-practices)
* [Inserting breaks](/build-with-cartesia/formatting-text-for-sonic-2/inserting-breaks-pauses)
* [Spelling text](/build-with-cartesia/formatting-text-for-sonic-2/spelling-out-input-text)
## `sonic-turbo`
All the power of Sonic, with half the latency (as low as 40ms).
| Snapshot | Release Date | Languages | Status |
| ----------------------------------------------- | ------------- | ---------------------------------------------------------- | ---------------- |
| `sonic-turbo-2025-06-04` | June 6, 2025 | en, fr, de, es, pt, zh, ja, hi, ko | Stable |
| `sonic-turbo-2025-06-04` | June 6, 2025 | it, nl, pl, ru, sv, tr | EOL June 1, 2026 |
| `sonic-turbo-2025-03-07` | March 7, 2025 | en, fr, de, es, pt, zh, ja, hi, it, ko, nl, pl, ru, sv, tr | EOL June 1, 2026 |
## `sonic`
The first version of our flagship text-to-speech model. It produces high-accuracy, expressive speech, and is optimized for efficiency to achieve low latency.
| Snapshot | Release Date | Languages | Status |
| ----------------------------------------- | ----------------- | ---------------------------------------------------------- | ---------------- |
| `sonic-2024-12-12` | December 12, 2024 | en, fr, de, es, pt, zh, ja, hi, it, ko, nl, pl, ru, sv, tr | EOL June 1, 2026 |
| `sonic-2024-10-19` | October 19, 2024 | en, fr, de, es, pt, zh, ja, hi, it, ko, nl, pl, ru, sv, tr | EOL June 1, 2026 |
## Deprecated and Preview Model Aliases
The following model aliases are now deprecated. Please use the recommended model names instead:
| Deprecated Alias | Use Instead |
| ------------------------------------------- | ----------------------------------------- |
| `sonic-3-preview` | `sonic-3` |
| `sonic-preview` | `sonic-2` |
| `sonic-english` | `sonic-2024-10-19` |
| `sonic-multilingual` | `sonic-2024-10-19` |
# Sonic 3
Source: https://docs.cartesia.ai/build-with-cartesia/tts-models/sonic-3
`sonic-3` is our streaming TTS model, with high naturalness, accurate transcript following, and industry-leading latency. It provides fine-grained control on volume, speed, and emotion.
Key Features:
* **42 languages** supported
* **Volume, speed, and emotion** controls, supported through API parameters and SSML tags
* **Laughter** through `[laughter]` tags
For more information, see [Volume, Speed, and Emotion](/build-with-cartesia/sonic-3/volume-speed-emotion).
### Voice selection
Choosing voices that work best for your use case is key to getting the best performance out of Sonic 3.
* **For voice agents**: We've found stable, realistic voices work better for voice agents than studio, emotive voices. Example American English voices include Katie (ID: `f786b574-daa5-4673-aa0c-cbe3e8534c02`) and Kiefer (ID: `228fca29-3a0a-435c-8728-5cb483251068`).
* **For expressive characters**: We've tagged our most expressive and emotive voices with the `Emotive` tag. Example American English voices include Tessa (ID: `6ccbfb76-1fc6-48f7-b71d-91ac6298247b`) and Kyle (ID: `c961b81c-a935-4c17-bfb3-ba2239de8c2f`).
For more information and recommendations, see [Choosing a Voice](/build-with-cartesia/capability-guides/choosing-a-voice).
### Language support
Sonic-3 supports the following 42 languages: English (`en`), French (`fr`), German (`de`), Spanish (`es`), Portuguese (`pt`), Chinese (`zh`), Japanese (`ja`), Hindi (`hi`), Italian (`it`), Korean (`ko`), Dutch (`nl`), Polish (`pl`), Russian (`ru`), Swedish (`sv`), Turkish (`tr`), Tagalog (`tl`), Bulgarian (`bg`), Romanian (`ro`), Arabic (`ar`), Czech (`cs`), Greek (`el`), Finnish (`fi`), Croatian (`hr`), Malay (`ms`), Slovak (`sk`), Danish (`da`), Tamil (`ta`), Ukrainian (`uk`), Hungarian (`hu`), Norwegian (`no`), Vietnamese (`vi`), Bengali (`bn`), Thai (`th`), Hebrew (`he`), Georgian (`ka`), Indonesian (`id`), Telugu (`te`), Gujarati (`gu`), Kannada (`kn`), Malayalam (`ml`), Marathi (`mr`), and Punjabi (`pa`).
## Selecting a Model
| Snapshot | Release Date | Languages | Status |
| ------------------------------------------- | ---------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------ |
| `sonic-3-2026-01-12` | January 12, 2026 | en, de, es, fr, ja, pt, zh, hi, ko, it, nl, pl, ru, sv, tr, tl, bg, ro, ar, cs, el, fi, hr, ms, sk, da, ta, uk, hu, no, vi, bn, th, he, ka, id, te, gu, kn, ml, mr, pa | Stable |
| `sonic-3-2025-10-27` | October 27, 2025 | en, de, es, fr, ja, pt, zh, hi, ko, it, nl, pl, ru, sv, tr, tl, bg, ro, ar, cs, el, fi, hr, ms, sk, da, ta, uk, hu, no, vi, bn, th, he, ka, id, te, gu, kn, ml, mr, pa | Stable |
The base model name `sonic-3` always points to the latest **stable** snapshot of the model.
When making API calls, you can specify either:
```python lines theme={null}
# Use the base model
# (automatically routes to the latest stable snapshot)
model_id = "sonic-3"
# Or specify a particular snapshot for consistency
model_id = "sonic-3-2026-01-12"
# Try the latest (beta) model (can be 'hot swapped')
model_id = "sonic-3-latest"
```
### Continuous updates and model snapshots
All models have a base model name (e.g. `sonic-3`) and a dated snapshot (e.g. `sonic-3-2025-10-27`). Using the base model will automatically keep you up to date with the most recent stable snapshot of that model. If pinning a specific version is important for your use case, we recommend using the dated version.
For testing our latest capabilities, we recommend using `sonic-3-latest`, which is a non-snapshotted version. `sonic-3-latest` can be updated without notice and is not recommended for production.
To summarize:
| **Model ID** | Model update behavior | Recommended for |
| -------------------- | :---------------------------------------------------------- | ------------------------------------------------------------------------------------------ |
| `sonic-3-YYYY-MM-DD` | Snapshotted, will never change | Customers who want to run internal evals before any updates |
| `sonic-3` | Will be updated to point to the most recent stable snapshot | Customers who want stable releases, but want to be up-to-date with the recent capabilities |
| `sonic-3-latest` | Will always be updated to our latest beta releases | Testing purposes |
## Older Models
For information on `sonic-2`, `sonic-turbo`, `sonic-multilingual`, and `sonic`, see our page on [Older Models](/build-with-cartesia/tts-models/older-models).
# Voice IDs
Source: https://docs.cartesia.ai/build-with-cartesia/tts-models/voice-ids
On June 1, 2026, we are discontinuing our voice embedding (aka stability) TTS models.
If you are currently making generation requests with voice embeddings like this:
```json theme={null}
{
"voice": {
"mode": "embedding",
"embedding": [1, 2, ..., 3, 4]
},
"model_id": "sonic-2",
// ...
}
```
You will need to switch to using voice IDs like this:
```json theme={null}
{
"voice": {
"mode": "id",
"id": "e07c00bc-4134-4eae-9ea4-1a55fb45746b"
},
"model_id": "sonic-2",
// ...
}
```
If you already use voice IDs, see [Migrating Voices](/build-with-cartesia/tts-models/migrating-voices) to make sure your voices will continue to work after the change.
For an overview of all changes, see [API Changes](/build-with-cartesia/tts-models/api-changes).
## Get a voice ID
Choose one of the following options.
### Check out the voice library
Our featured voices have all gone through rigorous evaluations and are ready to use in production.
Check them out at [play.cartesia.ai/voices](https://play.cartesia.ai/voices) and copy the ID of any voice you'd like to use.
### Clone a voice
If you have source audio, create a cloned voice via the [playground](https://play.cartesia.ai/voices/create/clone) or the [API](/api-reference/voices/clone). Cloning returns a voice ID you can use immediately.
### Generate source audio from your existing embedding
If you no longer have the original audio clip used to create your embedding, generate a short sample with `sonic` or `sonic-2` and then clone a new voice.
You can do this on our playground:
1. [play.cartesia.ai/text-to-speech](https://play.cartesia.ai/text-to-speech)
2. [play.cartesia.ai/voices/create/clone](https://play.cartesia.ai/voices/create/clone)
Or with our API:
1. [Text to Speech (Bytes)](/2024-11-13/api-reference/tts/bytes)
2. [Clone Voice](/api-reference/voices/clone)
Here is an example using our SDK:
```python theme={null}
from cartesia import Cartesia
# inputs
your_api_key: str = ""
your_voice_embedding: list[float] = []
language = "en"
transcript = """
It's nice to meet you.
Hope you're having a great day!
Could we reschedule our meeting tomorrow?
Please call me back as soon as possible.
"""
source_tts_model_id = "sonic"
client = Cartesia(api_key=your_api_key)
# Step 1: generate an audio sample
print(f"Generating audio sample {source_tts_model_id=}")
source_audio_iterator = client.tts.bytes(
voice={"mode": "embedding", "embedding": your_voice_embedding},
model_id=source_tts_model_id,
transcript=transcript,
language=language,
output_format={
"container": "wav",
"encoding": "pcm_f32le",
"sample_rate": 44100
},
)
# Step 2: clone a voice
print("Cloning a voice")
voice = client.voices.clone(
name="My Voice",
language=language,
clip=b"".join(source_audio_iterator),
mode="similarity",
)
print(f"Cloned voice {voice.id}")
# you can now use the voice like this
migrate_to_model = "sonic-3"
generated_sample_file_name = f"{migrate_to_model}_{voice.id}.wav"
cloned_audio_iterator = client.tts.bytes(
voice={"mode": "id", "id": voice.id},
model_id=migrate_to_model,
transcript=transcript,
language=language,
output_format={
"container": "wav",
"encoding": "pcm_f32le",
"sample_rate": 44100
},
)
with open(generated_sample_file_name, "wb") as f:
for chunk in cloned_audio_iterator:
f.write(chunk)
print(f"Listen to your new voice: {generated_sample_file_name}")
try:
import subprocess
subprocess.run(
[
"ffplay",
"-loglevel",
"quiet",
"-autoexit",
"-nodisp",
generated_sample_file_name,
]
)
except FileNotFoundError:
pass
```
## Using Voice IDs
See [TTS (Bytes)](/api-reference/tts/bytes), [TTS (SSE)](/api-reference/tts/sse), and [TTS (WebSocket)](/api-reference/tts/websocket) for full API documentation.
You can test these API changes by setting your [Cartesia Version](/use-the-api/api-conventions#always-send-a-cartesia-version-header) to `2026-03-01`. We recommend upgrading your Cartesia Version on production traffic before June 1 to make sure nothing breaks.
# Changelog 2024
Source: https://docs.cartesia.ai/changelog/2024
Product, API, and platform changes for 2024
### API
* Pricing updates; character usage columns migrated to bigint; presign URLs for Pro Voice Clone; **`voices/[id]/conditioning`** endpoint; file to dataset in presign; userID-level endpoint restrictions; Stripe Customer ID on checkout.
* EU deployment and Hindi HC fixes.
### Playground
* New model on Playground highlighting **transcript following** improvements (demo, not GA).
* Blog and play.cartesia.ai live.
### Models / Voices
* Model aliasing updated for **`sonic`** and **`sonic-preview`**; twilight-morning in API and enterprise; conditioning entries for voice clone and multilingual.
* Embedding search for LoRA voice selection.
### Other
* Infrastructure and scaling updates.
* State of Voice blog and map.
### API
* **Cartesia-Version 2024-11-13** — Upgrade to new API version; **unified clone voice endpoint**; datasets support; files endpoint pagination; FineTuneRequest status; fine-tunes API in Playground; presign URLs for Pro Voice Clone; **Flush Done** event for manual WebSocket flushing; **``** tag for continuations.
* GCP Enterprise.
### Playground
* Changes for new API; replay suite; GCP Enterprise.
### Models / Voices
* **Flush Done** event for manual flushing in WebSocket; **``** tag for continuations within a single transcript; spelling fixes; manual flush and flush ID.
* Empty encoding field allowed for mp3.
### Docs
* API version **2024-11-13**: Sonic 2, capability guides (clone, pronunciations, speed/emotion, continuations, localize), formatting for Sonic 2.
* Integrations: LiveKit, Pipecat, Rasa, Thoughtly, Twilio, MCP. Enterprise: SSO, organizations. See [API Conventions](/use-the-api/api-conventions).
### API
* Cartesia JS bytes endpoint; gen blocks removed from character counting; health checks and middleware; **user-level queueing** with queue length cap and timeout; 10× queue size rejection; Slang (continuations) and ConditioningData; voice changer JS SDK.
* Remove max limit from Playground.
### Playground
* GCP: API and ingress for GCP US Central. Queueing: user-level queueing in API gateway; queue length cap and `queuedRequest` timeout.
* Voice Changer: Playground UI polish; ConditioningData as part of ResolvedVoice; Slang rollout; flush on start/end of spell tags.
* LoRA release UI; onboarding data upsert fix; welcome page submit loading state; enterprise contact links.
### Other
* Canonical linking and sitemap.
* Blog and navigation (Blog, Careers) updates.
### API
* User-level queueing; queue size and websocket queueing rejection; **`api_status`** field for voice API usability; LoRA pricing and UX cleanup; **flush all audio on DONE token** (including CB); user option to obfuscate transcripts in logs.
* LoRA and load balancer improvements.
### Playground
* **Function calling**; agent creation, tests, and dev setup; voice agent infrastructure enabled.
* LoRA: HiFi cloning endpoint and Playground page; 8 new voices on Playground; Indian accent.
* **Voice Changer** Playground UI; JS SDK for voice changer. Language added to TTS request from `voices/[id]`; flush all audio on DONE token; user option to obfuscate transcripts in logs.
### Docs
* Blog and sitemap updates.
### API
* Reject invalid transcripts (docs and API gateway); `no_more_inputs` in WebSockets can use `voice_embedding` instead of `voice_id`.
* Improved bad model id handling.
### Playground
* **Localization** page in Playground and JS client; dialects and future-compatibility. Switch Playground to voice ID; allow both `id` and embedding for `TTSRequest`; archive voices (kept accessible via API).
* Replay button; feedback form; fix multilingual recommended voices when switching back to English; better error messaging.
### Models / Voices
* **LoRA** support (multiple voices per LoRA, new cache key, easy-brook-lora, vc-flowing-dream).
### Other
* On-device homepage launch; proper links for "Request a demo" button.
* **LoRA**: multiple voices per LoRA.
### API
* **Voice Conversion endpoint** — New API endpoint. **Timestamps** on WebSocket endpoint; **per-generation voice controls** (speed, emotion) in API; polar-tree deployed (`sonic-multilingual`); continuous batching support; VocalWave (English) and long-generation support; `sonic-english` → vocal-wave, `sonic-multilingual` → ancient-voice aliasing.
* **`buffer`** and **`mp3`** params on `/bytes`; MP3 streaming and WAV encoding fixes; request cancellation; empty transcript allowed when `continue=false`; Stripe webhook cache clear; subscription cancellation/reactivation; Redis cache for overage; keys endpoints.
* Clerk-based auth in API.
### Playground
* Optional **`enhance`** flag for voice cloning in JS client, Python client, and Playground; voice update endpoint and docs; gate voice cloning for free users.
* Prevent playing audio while playback in progress; download button disabled until generation finished; API key deletion clearer with copy button; character usage indicator; subscription and checkout fixes; gating clone form for free users.
### Docs
* Voice cloning docs; timestamps and continuations; user guides for voice control and Twilio; emotion control and timestamps; "phonemes" terminology.
* Voice cloning from file.
### Other
* Python client: continuations support, custom `base_url`, fallback for websockets; JS client v1.0.1: `onError` prop on useTTS.
* Voice controls (speed, emotion) in Python client and docs.
### API
* **Continuations** — Support for streaming input via SSE and Bytes; **`NoMoreInputs`** signal. **Cartesia Version** enforced via header; Playground and checkout/subscription endpoints send it.
* 48 kHz added to valid sample rates; `.wav` byte streaming; HTTP streaming endpoint for raw bytes; API standardization (backwards-compatible); new voices endpoints; mulaw and alaw backwards compatibility; Python client v1.0.0 (overhaul, `output_format`); JS client: `pcm_s16le`, `pcm_alaw`, `pcm_mulaw` and improved typing; caching for voices; **`context_id`** in WebSocket response and docs.
* Stripe webhooks for renewals and expiration; OpenAPI spec update.
### Playground
* Multilingual: `language` parameter on voices API and in API; Playground language selection; multilingual copy on homepage; default `sonic-english` → feasible-haze.
* Mobile layout improvements; multilingual UI papercuts; voice cloning and empty transcript styling fixes; filtering moved from `voices/[id]` to Speak page.
### Models / Voices
* **`sonic-multilingual`** and **`sonic-english`** aliasing; `language` column on voices.
* Recommended voices.
### Docs
* Version **2024-06-10**: get-started, API conventions, integrations (LiveKit, Pipecat, Rasa, Thoughtly, Twilio, MCP), clone voices, embeddings/voice mixing. See [API Conventions](/use-the-api/api-conventions).
### Other
* ToS changes; revised pricing tiers; legal notices on sign-in and sign-up; overage toggle in Playground.
* Character usage limit blocks WebSocket when exceeded.
### API
* **Cartesia Version** header; HTTP streaming for raw bytes; new voices endpoints; mulaw/alaw backwards compatibility; API standardization (backwards-compatible); Python client v1.0.0; JS client structure overhaul.
* Clone voice upload fix.
### Playground
* Redesign and Sonic launch copy; subscriptions page; favoriting voices; **emotion and speed sliders**; User vs Default voices; **tags** (Age, Accent) in DB and Playground; **`sample_text`** field (API Gateway and Playground); buffer streamed audio before playback; character usage indicator; API key auto-created on user creation; custom sign-in/sign-up and 404 on sign-out fix; disable generation button while audio playing; human-readable model names and skilled-cherry.
* Character limit increase.
### Models / Voices
* Human-readable model names; skilled-cherry; polar-tree (`sonic-multilingual`); continuations and output format; Python client numpy array support.
* Voice cloning disclaimer.
### Docs
* Mintlify docs added.
### Other
* Stripe webhooks for subscriptions; subscription cancellation and reactivation; character usage checks on generation routes; free subscription by default; Scale plan limit (8M chars/month); checkout and receipts.
* Custom sign-in/sign-up pages.
### API
* **`model_id`** added as parameter to generate; minimum transcript length enforced; `voice` moved to `AudioGenerationRequest`; experimental router removed; speed controls and voice edit page; video generation endpoint.
* WhisperX removed from dependencies.
### API
* WebSocket interrupt support; get voice embedding route; Redis cache for API keys; streaming switched from Octet to JSON; new model `genial-planet-1346`; `voice` param required on requests; formatting support.
* WhisperX for transcription (later removed).
### Playground
* Voice cloning in the UI; connection info in JS client; audio downloadable; transcript length validation (max 400 chars, empty rejected); improved UX and crash handling when API key missing; welcome message and icons.
* API key creation on sign-up via Clerk webhooks.
### Other
* Voice cloning and connection info in JS client.
# Changelog 2025
Source: https://docs.cartesia.ai/changelog/2025
Product, API, and platform changes for 2025
### API
* **sonic-3-latest** (preview) and dated **sonic-3-YYYY-MM-DD** snapshots.
* **sonic-3-latest** added to Playground TTS with banner when selected. See [Changelog 2026](/changelog/2026).
### Voice changes
* **Voice Library** — December: 25 new voices across 6 languages (12 English, 6 Hindi, 4 Arabic, 1 Spanish, 1 Japanese); 14 featured.
* Voice library changes; featured voice badge on voice page; **`/voices/recent`** endpoint.
### Playground
* **Report generation** (report button, alert when user reports).
* **Voice move**; **archive and publish** voices.
* **PVC**: custom PVC voices UI, multiple user errors surfaced to UI, feature flag for custom model during creation.
* **Pronunciation dicts**: new backend APIs, generator on create/edit, case sensitivity badge.
* **Agents**: new text-to-agent UI, create agent from **Github repo tarball**, system prompt generator for UI agent.
* **Narrations sunset** notice; TTS History pagination; auth strategy for access-tokens.
* **sonic-3-latest** banner and naming.
### Other
* PVC, STT, and agent improvements.
* Error handling and error codes.
### API
* Improved error handling and public error responses; cache invalidation by voice ID.
* IPVC train API (remove **`markAsReady`**); dataset files overfetch fix; default voice logic fix.
### Playground
* Pronunciation dicts migrate to new backend APIs; persist visual theme to DB; PVC pipeline error and recommendations.
* Call logs conversation view default; TTS textarea height fix; Sonic-3 model for partners shown.
* Billing overage "blood bar" and alert fixes; PVC gate for Startup plan.
* Pronunciation dict generator on create/edit; API version in dialog; featured voice toggle; narrations model selection.
### Line / Agents
* No user audio warning (250ms); Pipecat DeepgramNovaVADFilter.
* Call recording and artifact storage fixes.
### Models / Voices
* Sonic 3 PVC and normalizer updates; LoRA and PVC error handling; expand option for dataset file count.
* **`preview_file_url`**; **`tags_operator`** on GET /voices; restrict delete to non-public voices; **`owner_id`** check for fine tune voices; **`user_errors`** for PVCs.
* New Arabic accents; African French and Canadian French.
### Model changes
* **Sonic 3 launch (Oct 27)** — **sonic-3-2025-10-27** stable snapshot released; 42 languages; volume, speed, and emotion controls.
* Real-time conversation with emotion and laughter; \~190ms median latency. See [Sonic 3](/build-with-cartesia/tts-models/latest) and [Volume, Speed, and Emotion](/build-with-cartesia/sonic-3/volume-speed-emotion).
### Other
* Continued PVC, STT, and agent improvements; error handling and public errors; manifone voices; Sonic 3 PVC and normalizer updates.
* Transcript buffer multilingual and Thai pronunciation dictionary fix; TTFA buffering and reporting; Voice Conversion operator reload; audio norm operator.
### API
* **`user_id`** to **`owner_id`** in API (model aliasing / ownership).
* Improved error handling and version/limit checks.
### Line / Agents
* Warning if no user audio for 250ms+; Pipecat **DeepgramNovaVADFilter** for spurious `on_speech_started`.
* Call recording and artifact storage fixes.
### Models / Voices
* STT: Migrate STT providers to Deepgram where appropriate; Deepgram for non-English or language-detect agents; word-level user text chunks.
* Sonic 3 / PVC: Sonic 3 PVC updates; Hindi Sonic 3 normalizer revert; LoRA data processing and expand option for dataset file count; PVC errors to webhook.
* Manifone new voice; African French and Canadian French accents; partner agents can configure TTS models.
### Other
* LoRA bugfixes.
### API
* Production-facing agent WebSocket; **cancel endpoint** for ending live calls.
* Improved error handling and public error codes; cache invalidation by voice ID.
### Playground
* Telephony: stop billing for customer-managed numbers; Cartesia vs Twilio param separation.
* Outbound number management columns.
### Line / Agents
* **Deepgram Nova VAD** (`utterance_end_ms` configurable via **`vad_stop_secs`**).
### Models / Voices
* New endpoint for **`
### API
* **`deploy_error`** status fix.
### Playground
* **LangChain** launched voice agents with Cartesia Sonic TTS.
* Billing: Stripe customer for enterprise if needed; call runtime logs in call logs side panel; Call Logs UI nits (from June work).
### Line / Agents
* Partner pipeline parity with User Agent; **concurrency fix** (negative concurrency); agent metric LLM credit usage for evals; AgentEvaluations functionality.
* User Code Connector WS handlers fix; agent end turn handling; summarization system prompt; **`user_prompt`** in API; transcript removed from agent metric result; deadlock fix in WS timeout.
### Other
* Flushing and concurrency fixes.
### API
* **UserCodeAgent** deployment URL; **cancel endpoint** for force-ending live calls via API; Agent EoUD metric; cartesia agent speed-up; user prompt stored separately in agent metrics; **`agent_evaluations`** table; async flush for aggregator; User Code Connector WS and last bot turn handling; deployment URL delay on pickup.
* Concurrency and WS timeout fixes; improved goroutine handling; agent workers **`/chats`** timeout increase.
### Playground
* **Call Logs** page for agents with data table and side panel; **Agents demo** with Twilio web dialer, visualizer, and like/dislike feedback; deployment detail page and list; **Twilio number provisioning** (Parts 1 & 2); GitConnector redeploy on commit; deployment logs; zip upload for deployment; feature flag by organization; agents gated behind feature flag; **Deepgram as default STT** for agents; orgs v2 (frontend and backend); 20K credits for organizations; enterprise free trial days and email invoice options.
* **Credit usage**: separate TTS & STT concurrency panels; STT and Infill charts; voice page copyable fields; call runtime logs in call logs panel.
### Models / Voices
* STT: Whisper large v3; serve multiple models in STT pipeline; word-level user text chunks.
* FinetunedSTTContext fixes.
### API
* Voice conversion in enterprise.
### Playground
* Post–April: Following [April 2025](/changelog/2025) API changes (embeddings removed; use [Voice IDs](/build-with-cartesia/tts-models/voice-ids) and [Clone Voice](/api-reference/voices/clone)).
### Line / Agents
* User code deployments from DB; **`agent_deployments`** table; STT cartesia-streaming and Pipecat streaming Whisper; Bedrock proxy for OpenAI-compatible; timestamp bug fixes and default to original timestamps.
* Partner `/chat` and `/config` updates; DTMF support in UserCodeConnector; endpointing architecture.
### Models / Voices
* STT: Batch engine utilization; Pipecat streaming Whisper.
* Deepgram STT client `url`/`base_url` fix.
### Other
* Voice clone uploads fix.
### Breaking
* **sonic-2-2025-04-16** — Starting with **`sonic-2-2025-04-16`**, we're removing support for: Embeddings; **`stability`** cloning mode; Experimental controls for speed and emotion. The **`similarity`** cloning mode is dramatically better. To control speed and emotion today, use Instant Voice Cloning (e.g. FFMPEG, Voice Changer, or instant clones from **`sonic-2-2025-03-07`** embeddings). Users who need embeddings or experimental controls can use API version **`2024-11-13`** with model **`sonic-2-2025-03-07`** (both still available). See [Older models](/build-with-cartesia/tts-models/older-models).
### API
* listVoices by ID for single voice; warm-monkey PVC; **access tokens** (JWT); Cartesia-Version 2024-11-13; phoneme/original timestamps language check; TTS History source; LoRA from fine-tune checkpoints; context expiry replaced by input stream delay.
* **`sonic-2`** and **`sonic-2-2025-04-16`** ignore experimental controls on TTS generations; voice cloning supports only **`similarity`** clones.
* Removed embeddings from all endpoints; voices may only be specified by Voice ID; **`/tts`** cannot be called with voice embeddings.
* Deprecated **`/voices/create`** and **`/voices/mix`**.
### Breaking
* **sonic-2-2025-03-07** is the last Sonic 2 snapshot supporting voice embeddings and experimental controls. Use with API version **`2024-11-13`** for legacy behavior.
* sonic-preview → JollyTotem, RoseLion deprecated; sonic-2 alias to jolly-totem for speaker switching. See [Older models](/build-with-cartesia/tts-models/older-models).
### API
* **Cartesia-Version** updated to **2024-11-13**; model latency via header on bytes endpoint; new Sonic PVC model warm-monkey; listVoices by ID (single voice); **access tokens** (JWT signing, validation); API-level check for languages supporting phoneme and original timestamps.
* Organizations and billing; **free credits** 10k → 20k; overages product; subscription cache invalidation webhook; TTS History **source** column (api, playground, narrations); LoRA voices from base VoiceVariation and checkpoint for fine tunes.
### Playground
* **sonic-2** and **sonic-turbo** aliases launched; Sonic 2 / Sonic Turbo messaging (Turbo = 40ms latency).
* cartesia.ai/sonic and playground updates.
### Line / Agents
* Agent ID in websocket URL; telephony info on partner calls; Pipecat version upgrade; partner demo tool calls; warm-monkey PVC model; prespeak and function call flow updates.
* Twilio voice routes support agent IDs; Keypad DTMF on agent; half-duplex STT and LLM context; original timestamps support in API.
### Other
* **sonic-pvc** alias and DryVoice as sonic-pvc model. **Python SDK** announced.
### API
* **listVoices** by ID; localize endpoint voice name fix; 400s for bad body params; text forcing max transcript length; **OpenAI-compatible STT server**; agent with local STT; voice tags; on-device transcripts in evals; jolly-totem as default sonic-preview.
* S2S and Agents foundational blocks.
### Playground
* Instant cloning enabled for free users; voice tags; localize refactored to use conditioning; listVoices can query by ID for single voice; Sarah (Similarity) and Southern Woman migrations; on-device transcripts.
* Narrations settings (JSONB).
### Line / Agents
* Agent with local STT; foundational S2S + Agents blocks; design and pipeline work.
### Models / Voices
* STT: cartesia-streaming and Pipecat streaming Whisper; on-device transcripts.
### API
* **sonic-lite** added to API; EU deployment for prod API; save option for TTS bytes handler; CORS header for **Cartesia-File-ID**; Stripe credits default to `char_limit` in checkout; Redis cache for overage settings; polar-mountain and VC in EU; ListFiles paginator fix.
* Eval break/spell tags and replacement/normalization mode.
### Models / Voices
* sonic-preview routed to MisunderstoodFrog; polar-mountain added and staged; visionary-yogurt timestamp requests for any language.
* jolly-totem as default sonic-preview.
# Changelog 2026
Source: https://docs.cartesia.ai/changelog/2026
Product, API, and platform changes for 2026
### Sonic 3.5
*Sonic 3.5 is now available on `sonic-3-latest`. We'd love for you to try it and tell us what you think.*
#### Why you should try it
* **More natural speech, pacing, and emotional expression**, especially noticeable on expressive, conversational, and support-style transcripts.
* **Cleaner audio quality** across all languages and voices.
* **Better alphanumeric read-out** — confirmation codes, order numbers, phone numbers, IDs, and emails sound meaningfully more natural, in all supported languages.
* **Step-change multilingual performance**, particularly Hebrew, Japanese, Spanish, Hindi, German, Korean, and French.
* **English heteronyms** — tricky English heteronyms like "read," "bass," and "bow" now pronounce correctly in context.
#### How to try it
1. Point your API call or Playground request to the model ID `sonic-3-latest`.
2. Keep your existing voice IDs, request shape, and prompting — no code changes required for most customers.
3. Send us feedback on any voice or transcript that behaves differently than you expect.
As with any `-latest` alias, `sonic-3-latest` can be updated without notice and is not recommended for production. For production traffic, pin to the stable alias (`sonic-3`) or a dated snapshot (e.g. `sonic-3-2026-01-12`).
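As a minimal sketch of step 1 with the Python SDK (only the model ID changes; the key and voice ID below are placeholders):

```python theme={null}
from cartesia import Cartesia

client = Cartesia(api_key="your-api-key")

audio = client.tts.bytes(
    model_id="sonic-3-latest",  # the only change: point at the preview alias
    transcript="Trying out Sonic 3.5 on the latest alias.",
    voice={"mode": "id", "id": "your-existing-voice-id"},
    language="en",
    output_format={"container": "wav", "encoding": "pcm_f32le", "sample_rate": 44100},
)
```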
#### What to know to be successful
* **Spell tags still work the same way.** If you already wrap alphanumerics in `<spell>...</spell>`, you don't need to change anything — you'll just get better-sounding output. See [here](/build-with-cartesia/sonic-3-5/prompting-tips#controlling-pacing-and-spelling) for more details.
* **If you use custom delimiters** (commas/periods between characters or groups) to control pacing, our recommended format has changed. Use **spaces between characters** and **commas between groups**, e.g. `A B C, 1 2 3` instead of `A, B, C. 1, 2, 3.`. See [Prompting tips for Sonic 3.5](/build-with-cartesia/sonic-3-5/prompting-tips) for more details.
* **Speed and volume controls are temporarily disabled** on `sonic-3-latest`. If you rely on speed or volume augmentation (including via SSML), stay on `sonic-3` for now. We believe that Sonic 3.5 has more natural pacing and you may find that you don't need to use speed control as much when using this model.
* **Timestamps behave slightly differently.** If you use end-of-word timestamps for interruption handling, you should not see a meaningful change. If you depend on beginning-of-word timestamps, please test carefully and reach out if you see regressions for your use case.
* **Existing Professional Voice Clones (PVCs) do not carry over to `sonic-3-latest`.** Professional Voice Clones are pinned to the base model they were trained on (e.g. `sonic-3`) and will function as a standard voice clone for this model. For more information, see [Clone Voices (Pro)](/build-with-cartesia/capability-guides/clone-voices-pro/playground).
* **Providing proper context to the model improves naturalness.** Please see our buffering guide [here](/use-the-api/tts-websocket/buffering) for more details.
#### Where to look for help
* [Sonic 3.5 model overview](/build-with-cartesia/tts-models/sonic-3-5)
* [Prompting tips for Sonic 3.5](/build-with-cartesia/sonic-3-5/prompting-tips)
* [Model aliases and snapshots](/build-with-cartesia/tts-models/latest#continuous-updates-and-model-snapshots)
### Breaking
* **Text-to-Agent (T2A) API** — Text-to-Agent workflow for Line is **deprecated**.
### API
* **Error responses** — For `Cartesia-Version: 2026-03-01`, we now return structured JSON. See [API Errors](/use-the-api/api-errors).
* API versions before `2026-03-01` continue to return legacy error formats (for example HTTP `Title: Message`).
* **Voices** — `PATCH /voices/{id}`: voice owners can now update accent and gender. Voice creation validates language. Invalid voice UUIDs and pronunciation-dictionary IDs return 404 instead of ambiguous errors.
* **PVC model routing** — PVC voices require a dated model ID (e.g. **`sonic-3-2026-01-12`**) instead of **`sonic-3`**. See [Clone Voices (Pro)](/build-with-cartesia/capability-guides/clone-voices-pro/api).
* **Voice search** — Name and metadata search is **diacritics-insensitive**.
### Playground
* **Pro voice clones**
* Clearer **language mismatch** messaging
* **Background noise removal** is now a simple on/off control
* **Fine-tuning model support**:
* Removed support for older models
* Now only **sonic-3-2026-01-12** is supported
* **Multilingual agents** — Multilingual agent configuration is now supported in the Playground.
* **Agents UI** — Search by **call ID** and **agent ID**.
### Billing
* **Concurrency** — Organizations can receive **notifications** when concurrency nears configured **limits**.
### Model / voice
* **Professional Voice Clones** — Backend updates improve stability of the professional voice cloning workflow.
* **Accents & filters** — Additional **accent** options (e.g. **Irish**, **New Zealand**, **South African**, **Belgian**) and **locale aliases** for accent filtering in APIs and Playground.
* **Voice Library** — **94** new voices across **17** locales (including Arabic, German, English variants, Spanish, Finnish, French, Hebrew, Hindi, Japanese, Korean, Polish, Portuguese, Swedish, Telugu, Thai, and more).
### Self-hosted
* **On-premises** — API for managing voices on self-hosted deployments.
### Cartesia SDK
* **cartesia-js v3.0.0** (Mar 2) — Major updates:
* New features: `flush_id` included in chunk and voice changer binary responses; `output_format` and infill support; inline WebSocket response types; byte endpoint returns **ArrayBuffer**; improved **WebPlayer** and client export.
* Fixes: memory leak and timing issues with abort signals/listeners, handling of empty `Content-Length`, and **TimeoutError** now includes a message.
See [cartesia-js releases](https://github.com/cartesia-ai/cartesia-js/releases) for full details.
### Line
* **[History Management API](/line/sdk/agents#history-management)**: You can add or replace the history provided to your agent, for example, to summarize a long conversation.
* **[Custom User Events](/line/sdk/events#custom-event)**: You can send bidirectional custom events between your client and the agent. You could use this, for example, if you have a web application with UI interactions.
* **[Uninterruptible Messages](/line/sdk/events#speech)**: You can set messages as uninterruptible. A common use case is a legal disclaimer at the beginning of a call.
* **End Tool Call Improvements**: The default end call tool call is more conservative to prevent calls from ending prematurely.
### API
* Increased reliability of API connections
### Cartesia SDK
* **cartesia-python v3.0.0** (Feb 9). See full details in [cartesia-python releases](https://github.com/cartesia-ai/cartesia-python/releases).
### Playground
* Shipped a new TTS page
* Shipped a new Voice Creation page
* Shipped a new Agents page
### Model changes
* **Improved pronunciation of real-world text patterns across languages**
* Enhanced support for structured and formatted speech patterns: numbers, dates, times, currency, phone numbers, IDs, percentages, and amounts/measurements.
* Support for various date formats (YYYY-MM-DD, YYYY/MM/DD, 年月日).
* Support for measurement units (meters, kg, tablespoon, gigabytes, etc.) with locale awareness.
* Support for domestic and international phone number formats with locale-specific chunking for French, Italian, German, Portuguese, Korean, and more.
* Improved alphanumeric ID handling with katakana/hiragana readings and Latin acronym transliteration to katakana for Japanese.
* Improves all languages except English, Hindi & other Indic languages, Arabic, Hebrew, Chinese, Swedish, Georgian, Bulgarian, and Tagalog (targeted for future updates).
* **Support for regional and locale-specific pronunciation within languages**
* Regional voices use region-specific terms in addition to accent (e.g. Belgian and Swiss French "nonante" vs. Canadian and French "quatre-vingt-dix").
* Region-specific number terminology, currency symbols, date formats, and measurement units.
* Locale-aware date and time formatting (e.g. Russian year suffixes, French/Spanish time conventions).
* Locale-aware currency symbol handling (e.g. \$ as "dollars" in en\_US and "pesos" in es\_MX).
* Locale pronunciation falls back to the primary country for that language (e.g. US for English, Brazil for Portuguese). We will continue to expand locale-aware support.
* Improves all languages except English, Hindi & other Indic languages, Arabic, Hebrew, Chinese, Swedish, Georgian, Bulgarian, and Tagalog (targeted for future updates). Existing regional pronunciation for English voices (e.g. British) is unaffected. A short usage sketch follows this list.
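No request changes are needed to benefit from these improvements: structured text can be passed in the transcript as-is, with the locale of the selected voice driving how dates, phone numbers, and currency are read. Below is a minimal sketch, reusing the Python SDK's `client.tts.generate` call shown in the Error Handling example later in this document; the API key and French-locale voice ID are placeholders.

```python theme={null}
from cartesia import Cartesia

client = Cartesia(api_key="YOUR_API_KEY")

# Dates, phone numbers, and currency can be left in their written form;
# the locale of the selected voice determines how they are read aloud.
_response = client.tts.generate(
    model_id="sonic-3",
    transcript=(
        "Votre rendez-vous est le 2026-03-14 à 14h30. "
        "Appelez le 01 23 45 67 89. Le total est de 45,50 €."
    ),
    voice={"mode": "id", "id": "YOUR_FRENCH_VOICE_ID"},  # placeholder: any French-locale voice
    output_format={"container": "wav", "encoding": "pcm_f32le", "sample_rate": 44100},
)
```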
### Voice changes
* **Voice Library**: 39 new voices across 21 locales
### Breaking changes effective June 1, 2026
The following model snapshots and languages are discontinued effective June 1, 2026:
| Model | Snapshots | Languages |
| -------------------- | ---------------------------------------------------------------- | -------------------------- |
| `sonic` | All | All |
| `sonic-english` | — | All |
| `sonic-multilingual` | — | All |
| `sonic-2` | `sonic-2-2025-04-16`, `sonic-2-2025-05-08`, `sonic-2-2025-06-11` | it, nl, pl, ru, sv, tr, hi |
| | `sonic-2-2025-03-07` | All |
| `sonic-turbo` | `sonic-turbo-2025-06-04` | it, nl, pl, ru, sv, tr |
| | `sonic-turbo-2025-03-07` | All |
The following endpoints are discontinued effective June 1, 2026:
| Discontinued Endpoint | Replacement |
| ------------------------------------------ | ------------------------------------------ |
| Voice Embedding: `POST /voices/clone/clip` | [Clone Voice](/api-reference/voices/clone) |
| Mix Voices: `POST /voices/mix` | — |
| Create Voice: `POST /voices` | [Clone Voice](/api-reference/voices/clone) |
The following endpoints stop accepting voice embeddings effective June 1, 2026:
| Endpoint with a breaking change | Replacement |
| ------------------------------------- | ----------- |
| TTS (bytes): `POST /tts/bytes` | Voice ID |
| TTS (SSE): `POST /tts/sse` | Voice ID |
| TTS (WebSocket): `WSS /tts/websocket` | Voice ID |
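For requests that currently pass a voice embedding, the migration is to reference the voice by its ID instead. Here is a minimal sketch, assuming the Python SDK's `client.tts.generate` call shown in the Error Handling example later in this document; the API key and voice ID are placeholders.

```python theme={null}
from cartesia import Cartesia

client = Cartesia(api_key="YOUR_API_KEY")

# After June 1, 2026 the TTS endpoints no longer accept raw voice embeddings;
# reference the voice by its ID instead.
_response = client.tts.generate(
    model_id="sonic-3",
    transcript="Hello, world!",
    voice={"mode": "id", "id": "YOUR_VOICE_ID"},  # placeholder voice ID
    output_format={"container": "wav", "encoding": "pcm_f32le", "sample_rate": 44100},
)
```

Voices created with the [Clone Voice](/api-reference/voices/clone) endpoint have IDs that can be used this way.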
### API
* **Regionalization** — API calls are now routed to US, EU, or APAC regions based on their origin.
* **Parameterized outbound calls** — [Docs](/line/integrations/telephony/outbound-dialing)
* **Pronunciation dictionaries** — [Docs](/line/sdk/agents#custom-pronunciations)
### Model changes
* **Sonic-3 model versioning scheme introduced**
* New preview track: **`sonic-3-latest`** (continuous updates for early access and feedback).
* Stable track: **`sonic-3`** always points to the most recent stable release.
* Immutable dated snapshots: **`sonic-3-YYYY-MM-DD`** never change.
* Details: [Continuous updates and model snapshots](/build-with-cartesia/tts-models/latest#continuous-updates-and-model-snapshots) — a short selection sketch follows this list.
* **Promotion to stable checkpoint:** **`sonic-3-2026-01-12`**
* Included improvements: consistent speed & volume, custom IPA pronunciations with stronger adherence, Hindi prosody improvements, Korean prosody/intonation improvements.
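As a rough illustration of choosing between the three tracks, the sketch below reuses the Python `client.tts.generate` call from the Error Handling example later in this document; the API key and voice ID are placeholders.

```python theme={null}
from cartesia import Cartesia

client = Cartesia(api_key="YOUR_API_KEY")

# model_id selects the release track:
#   "sonic-3-latest"     -> preview track, continuously updated
#   "sonic-3"            -> stable track, most recent stable release
#   "sonic-3-2026-01-12" -> immutable dated snapshot, never changes
_response = client.tts.generate(
    model_id="sonic-3-2026-01-12",  # pin to a snapshot for reproducible behavior
    transcript="Hello, world!",
    voice={"mode": "id", "id": "YOUR_VOICE_ID"},  # placeholder voice ID
    output_format={"container": "wav", "encoding": "pcm_f32le", "sample_rate": 44100},
)
```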
### Voice changes
* **Featured Voices launched** — Curated set of 30+ best-performing voices (e.g. [Cathy](https://play.cartesia.ai/voices/e8e5fffb-252c-436d-b842-8879b84445b6), [Henry](https://play.cartesia.ai/voices/87286a8d-7ea7-4235-a41a-dd9fa6630feb)).
* **Voice Library** — December: 25 new voices across 6 languages.
* **Voice Library** — January: 9 Spanish voices (Mexican, Colombian, Castilian).
### Playground
* Voice library usability improvements (test with your own scripts, call an agent per voice).
* One-click **Report Issue** on TTS Playground.
* Mini voice picker (recently used + saved) on TTS page.
* Professional Voice Clone (PVC) UI and reliability improvements (loading skeletons, error messages, better behavior with large datasets and silence).
### Line
* **Line SDK v0.2** — [Repo](https://github.com/cartesia-ai/line). Improved DX, long-running tool-call handling, **committed turns**, better turn-taking and transcription.
# Error Handling
Source: https://docs.cartesia.ai/examples/error-handling
Example of error handling with SDK exceptions.
```python theme={null}
def error_handling_example(client: Cartesia) -> None:
"""Example of error handling with SDK exceptions."""
try:
_response = client.tts.generate(
model_id="sonic-3",
transcript="Hello, world!",
voice={"mode": "id", "id": "6ccbfb76-1fc6-48f7-b71d-91ac6298247b"},
output_format={"container": "wav", "encoding": "pcm_f32le", "sample_rate": 44100},
)
except BadRequestError as e:
print(f"Bad request: {e}")
except AuthenticationError as e:
print(f"Auth failed: {e}")
except NotFoundError as e:
print(f"Not found: {e}")
except RateLimitError as e:
print(f"Rate limited: {e}")
except APIError as e:
print(f"API error: {e}")
```
From [cartesia-python/examples/examples.py:545](https://github.com/cartesia-ai/cartesia-python/blob/main/examples/examples.py#L545)
```typescript theme={null}
async function errorHandling(client: Cartesia): Promise<void> {
/** Example of error handling with SDK exceptions. */
try {
await client.tts.generate({
model_id: 'sonic-3',
transcript: 'Hello, world!',
voice: { mode: 'id', id: '6ccbfb76-1fc6-48f7-b71d-91ac6298247b' },
output_format: { container: 'wav', encoding: 'pcm_f32le', sample_rate: 44100 },
});
} catch (e) {
if (e instanceof BadRequestError) {
console.log(`Bad request: ${e.message}`);
} else if (e instanceof AuthenticationError) {
console.log(`Auth failed: ${e.message}`);
} else if (e instanceof NotFoundError) {
console.log(`Not found: ${e.message}`);
} else if (e instanceof RateLimitError) {
console.log(`Rate limited: ${e.message}`);
} else if (e instanceof APIError) {
console.log(`API error: ${e.message}`);
} else {
throw e;
}
}
}
```
From [cartesia-js/examples/node\_examples.ts:398](https://github.com/cartesia-ai/cartesia-js/blob/main/examples/node_examples.ts#L398)
## Run this example
```sh theme={null}
cd cartesia-python
CARTESIA_API_KEY=YOUR_KEY python3 examples/examples.py error_handling_example
```
```sh theme={null}
cd cartesia-js
CARTESIA_API_KEY=YOUR_KEY npx ts-node examples/node_examples.ts errorHandling
```
# Create Infill Audio
Source: https://docs.cartesia.ai/examples/infill-create
Create infill audio between two clips.
```python theme={null}
def infill_create(client: Cartesia) -> None:
"""Create infill audio between two clips."""
from pathlib import Path
# Can pass file paths directly (as Path objects)
response = client.tts.infill(
model_id="sonic-3",
language="en",
transcript="Infill text",
left_audio=Path("left.wav"),
right_audio=Path("right.wav"),
voice_id="6ccbfb76-1fc6-48f7-b71d-91ac6298247b",
output_format={"container": "wav", "encoding": "pcm_f32le", "sample_rate": 44100},
)
response.write_to_file("infill_output.wav")
print(f"Saved audio to infill_output.wav")
print(f"Play with: ffplay -f wav infill_output.wav")
```
From [cartesia-python/examples/examples.py:504](https://github.com/cartesia-ai/cartesia-python/blob/main/examples/examples.py#L504)
```python theme={null}
async def infill_create_async(client: AsyncCartesia) -> None:
"""Create infill audio between two clips."""
from pathlib import Path
response = await client.tts.infill(
model_id="sonic-3",
language="en",
transcript="Infill text",
left_audio=Path("left.wav"),
right_audio=Path("right.wav"),
voice_id="6ccbfb76-1fc6-48f7-b71d-91ac6298247b",
output_format={"container": "raw", "encoding": "pcm_f32le", "sample_rate": 44100},
)
await response.write_to_file("infill_output_async.wav")
print("Saved audio to infill_output_async.wav")
print("Play with: ffplay -f wav infill_output_async.wav")
```
From [cartesia-python/examples/async\_examples.py:341](https://github.com/cartesia-ai/cartesia-python/blob/main/examples/async_examples.py#L341)
## Run this example
```sh theme={null}
cd cartesia-python
CARTESIA_API_KEY=YOUR_KEY python3 examples/examples.py infill_create
```
```sh theme={null}
cd cartesia-python
CARTESIA_API_KEY=YOUR_KEY python3 examples/async_examples.py infill_create_async
```
# Next.js Full Example
Source: https://docs.cartesia.ai/examples/nextjs
A complete Next.js application with batch TTS, HTTP streaming, and WebSocket streaming.
A full Next.js app demonstrating three approaches to Cartesia TTS in the browser:
batch generation, HTTP streaming, and WebSocket streaming. Includes a server-side
token endpoint so API keys are never exposed to the client.
## Token Endpoint
```typescript app/api/token/route.ts theme={null}
import Cartesia from "@cartesia/cartesia-js";
const client = new Cartesia({ apiKey: process.env.CARTESIA_API_KEY });
export async function POST() {
const { token } = await client.accessToken.create({
grants: { tts: true },
expires_in: 300,
});
return Response.json({ token });
}
```
## Batch and HTTP Streaming
```tsx app/page.tsx theme={null}
"use client";
import { useRef, useState } from "react";
import Cartesia from "@cartesia/cartesia-js";
const SAMPLE_RATE = 44100;
const BYTES_PER_SAMPLE = 4; // f32le
async function getToken(): Promise<string> {
const res = await fetch("/api/token", { method: "POST" });
const { token } = await res.json();
return token;
}
// =============================================================================
// Batch: waits for the full response, then plays via