# Delete Agent
Source: https://docs.cartesia.ai/api-reference/agents/agents/delete
/latest.yml DELETE /agents/{agent_id}
# Get Agent
Source: https://docs.cartesia.ai/api-reference/agents/agents/get
/latest.yml GET /agents/{agent_id}
Returns the details of a specific agent. To create an agent, use the CLI or the Playground for the best experience and GitHub integration.
# List Agents
Source: https://docs.cartesia.ai/api-reference/agents/agents/list
/latest.yml GET /agents
Lists all agents associated with your account.
# List Phone Numbers
Source: https://docs.cartesia.ai/api-reference/agents/agents/phone-numbers
/latest.yml GET /agents/{agent_id}/phone-numbers
List the phone numbers associated with an agent. Currently, you can only have one phone number per agent and these are provisioned by Cartesia.
# List Templates
Source: https://docs.cartesia.ai/api-reference/agents/agents/templates
/latest.yml GET /agents/templates
List of public, Cartesia-provided agent templates to help you get started.
# Update Agent
Source: https://docs.cartesia.ai/api-reference/agents/agents/update
/latest.yml PATCH /agents/{agent_id}
# Download Call Audio
Source: https://docs.cartesia.ai/api-reference/agents/calls/download-call-audio
/latest.yml GET /agents/calls/{call_id}/audio
Streams the call audio file (WAV format) to the client.
# Get Call
Source: https://docs.cartesia.ai/api-reference/agents/calls/get-call
/latest.yml GET /agents/calls/{call_id}
# Get Call Runtime Logs
Source: https://docs.cartesia.ai/api-reference/agents/calls/get-call-logs
/latest.yml GET /agents/calls/{call_id}/logs
Returns the runtime logs for a specific call. These are the logs produced by your agent's code during the call. Logs may not be available if the call is still in progress or if they have been removed due to data retention settings.
# List Calls
Source: https://docs.cartesia.ai/api-reference/agents/calls/list-calls
/latest.yml GET /agents/calls
Lists calls for a specific agent, sorted by start time in descending order. `agent_id` is required. To include `transcript` in the response, add `expand=transcript` to the request. This endpoint is paginated.
# Get Deployment
Source: https://docs.cartesia.ai/api-reference/agents/deployments/get-deployment
/latest.yml GET /agents/deployments/{deployment_id}
Get a deployment by its ID.
# List Deployments
Source: https://docs.cartesia.ai/api-reference/agents/deployments/list-deployments
/latest.yml GET /agents/{agent_id}/deployments
List of all deployments associated with an agent.
# Add Metric to Agent
Source: https://docs.cartesia.ai/api-reference/agents/metrics/add-metric-to-agent
/latest.yml POST /agents/{agent_id}/metrics/{metric_id}
Add a metric to an agent. Once added, the metric runs automatically on all subsequent calls to the agent.
# Create Metric
Source: https://docs.cartesia.ai/api-reference/agents/metrics/create-metric
/latest.yml POST /agents/metrics
Create a new metric.
# Export Metric Results as CSV
Source: https://docs.cartesia.ai/api-reference/agents/metrics/export-metric-results
/latest.yml GET /agents/metrics/results/export
Export metric results to a CSV file. This endpoint streams at most 100k results as the CSV file directly to the client. Use the optional filters to narrow down the results to export.
# Get Metric
Source: https://docs.cartesia.ai/api-reference/agents/metrics/get-metric
/latest.yml GET /agents/metrics/{metric_id}
Get a metric by its ID.
# List Metric Results
Source: https://docs.cartesia.ai/api-reference/agents/metrics/list-metric-results
/latest.yml GET /agents/metrics/results
Paginated list of metric results. Filter results using the query parameters.
# List Metrics
Source: https://docs.cartesia.ai/api-reference/agents/metrics/list-metrics
/latest.yml GET /agents/metrics
List of all LLM-as-a-Judge metrics owned by your account.
# Remove Metric from Agent
Source: https://docs.cartesia.ai/api-reference/agents/metrics/remove-metric-from-agent
/latest.yml DELETE /agents/{agent_id}/metrics/{metric_id}
Remove a metric from an agent. Once removed, the metric no longer runs automatically on calls to the agent. Existing metric results will remain.
# API Status and Version
Source: https://docs.cartesia.ai/api-reference/api-status/get
/latest.yml GET /
# Speech-to-Text (Streaming)
Source: https://docs.cartesia.ai/api-reference/stt/stt
This endpoint creates a bidirectional WebSocket connection for real-time speech transcription.
Our STT endpoint accepts a stream of audio bytes and returns transcription results as they become available.
**Usage Pattern**:
1. Connect to the WebSocket with appropriate query parameters
2. Send audio chunks as binary WebSocket messages in the specified encoding format
3. Receive transcription messages as JSON with word-level timestamps
4. Send `finalize` as a text message to flush any remaining audio (receives `flush_done` acknowledgment)
5. Send `done` as a text message to close the session cleanly (receives `done` acknowledgment and closes)
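A minimal sketch of this pattern using the `websockets` Python library is below. The WebSocket URL, query parameter names, and authentication mechanism shown here are assumptions based on the parameters described above; consult the endpoint reference for the exact values.

```python theme={null}
# Sketch of the streaming STT pattern above, using the `websockets` library.
# The URL, query parameters, and auth mechanism are assumptions.
import asyncio
import json

import websockets

STT_WS_URL = (  # assumed URL and query parameters
    "wss://api.cartesia.ai/stt/websocket"
    "?model=ink-whisper&encoding=pcm_s16le&sample_rate=16000&api_key=YOUR_API_KEY"
)

async def transcribe(chunks):
    async with websockets.connect(STT_WS_URL) as ws:
        # 2. Send audio chunks as binary messages (pcm_s16le at 16 kHz).
        for chunk in chunks:
            await ws.send(chunk)
        # 4. Flush any buffered audio, then 5. close the session cleanly.
        await ws.send("finalize")
        await ws.send("done")
        # 3. Receive transcription messages as JSON until the server closes.
        async for message in ws:
            if isinstance(message, str):
                print(json.loads(message))

# asyncio.run(transcribe(my_pcm_chunks))
```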
**Performance Recommendation**: For best performance, it is recommended to resample audio before streaming and send audio chunks in `pcm_s16le` format at 16kHz sample rate.
**Pricing**: Speech-to-text streaming is priced at **1 credit per 1 second** of audio streamed in.
For WebSocket connection limits, see the [concurrency limits and timeouts](/use-the-api/concurrency-limits-and-timeouts) page.
# Speech-to-Text (Batch)
Source: https://docs.cartesia.ai/api-reference/stt/transcribe
/latest.yml POST /stt
Transcribes audio files into text using Cartesia's Speech-to-Text API.
Upload an audio file and receive a complete transcription response. Supports arbitrarily long audio files with automatic intelligent chunking for longer audio.
**Supported audio formats:** flac, m4a, mp3, mp4, mpeg, mpga, oga, ogg, wav, webm
**Response format:** Returns JSON with transcribed text, duration, and language. Include `timestamp_granularities: ["word"]` to get word-level timestamps.
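As an illustration, a batch transcription request might look like the following sketch. The multipart field names (`file`, `model`) and the response keys are assumptions inferred from the description above; check the endpoint reference for the exact schema.

```python theme={null}
# Sketch: upload an audio file for batch transcription.
# Field names and response keys are assumptions; see the endpoint reference.
import os
import requests

with open("meeting.mp3", "rb") as f:
    res = requests.post(
        "https://api.cartesia.ai/stt",
        headers={
            "Cartesia-Version": "2025-04-16",
            "Authorization": f"Bearer {os.environ['CARTESIA_API_KEY']}",
        },
        files={"file": f},
        data={"model": "ink-whisper"},  # add language / timestamp options per the reference
    )
res.raise_for_status()
result = res.json()
print(result["text"], result["duration"], result["language"])
```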
**Pricing:** Batch transcription is priced at **1 credit per 2 seconds** of audio processed.
For migrating from the OpenAI SDK, see our [OpenAI Whisper to Cartesia Ink Migration Guide](/use-the-api/migrate-from-open-ai).
# Text to Speech (Bytes)
Source: https://docs.cartesia.ai/api-reference/tts/bytes
/latest.yml POST /tts/bytes
# Text to Speech (SSE)
Source: https://docs.cartesia.ai/api-reference/tts/sse
/latest.yml POST /tts/sse
# Text to Speech (WebSocket)
Source: https://docs.cartesia.ai/api-reference/tts/websocket
This endpoint creates a bidirectional WebSocket connection. The connection supports multiplexing, so you can send multiple requests and receive the corresponding responses in parallel.
The WebSocket API is built around contexts:
- When you send a generation request, you pass a `context_id`. Further inputs on the same `context_id` will [continue the generation](/build-with-cartesia/capability-guides/stream-inputs-using-continuations), maintaining prosody.
- Responses for a context contain the `context_id` you passed in so that you can match requests and responses.
Read the guide [on working with contexts](/use-the-api/tts-websocket/contexts) to learn more.
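As an illustration, here is a sketch of multiplexing two independent generations over one connection by giving each its own `context_id`. The field names follow the WebSocket examples elsewhere in these docs; the voice ID and `output_format` values are placeholders.

```python theme={null}
# Sketch: two independent generations multiplexed over one TTS WebSocket.
# Field names follow the examples in these docs; values are placeholders.
def make_request(context_id: str, transcript: str) -> dict:
    return {
        "model_id": "sonic-3",
        "transcript": transcript,
        "voice": {"mode": "id", "id": "a0e99841-438c-4a64-b679-ae501e7d6091"},
        "output_format": {"container": "raw", "encoding": "pcm_s16le", "sample_rate": 44100},
        "context_id": context_id,
        "continue": False,  # each request here is a complete, standalone input
    }

requests_to_send = [
    make_request("context-a", "First caller hears this."),
    make_request("context-b", "Second caller hears this."),
]
# Send both over the same WebSocket; each response includes its context_id,
# so you can route the returned audio chunks back to the right caller.
```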
For the best performance, we recommend the following usage pattern:
1. **Do many generations over a single WebSocket**. Just use a separate context for each generation. The WebSocket scales up to dozens of concurrent generations.
2. **Set up the WebSocket before the first generation**. This ensures you don’t incur latency when you start generating speech.
3. **Include necessary spaces and punctuation**: This allows Sonic to generate speech more accurately and with better prosody.
For conversational agent use cases, we recommend the following usage pattern:
1. **Each turn in a conversation should correspond to a context**: For example, if you are using Sonic to power a voice agent, each turn in the conversation should be a new context.
2. **Start a new context for interruptions**: If the user interrupts the agent, start a new context for the agent’s response.
To learn more about managing concurrent generations and WebSocket connection limits, see the [concurrency limits and timeouts](/use-the-api/concurrency-limits-and-timeouts) page.
# Clone Voice
Source: https://docs.cartesia.ai/api-reference/voices/clone
/latest.yml POST /voices/clone
Clone a high-similarity voice from an audio clip. Clones are more similar to the source clip, but may reproduce background noise. For best results, use an audio clip about 5 seconds long.
# Delete Voice
Source: https://docs.cartesia.ai/api-reference/voices/delete
/latest.yml DELETE /voices/{id}
# Get Voice
Source: https://docs.cartesia.ai/api-reference/voices/get
/latest.yml GET /voices/{id}
# List Voices
Source: https://docs.cartesia.ai/api-reference/voices/list
/latest.yml GET /voices
# Localize Voice
Source: https://docs.cartesia.ai/api-reference/voices/localize
/latest.yml POST /voices/localize
Create a new voice from an existing voice localized to a new language and dialect.
# Update Voice
Source: https://docs.cartesia.ai/api-reference/voices/update
/latest.yml PATCH /voices/{id}
Update the name, description, and gender of a voice. To set the gender back to the default, set the gender to `null`. If gender is not specified, the gender will not be updated.
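For illustration, an update request might look like the following sketch. The JSON body fields mirror the description above; check the endpoint reference for the exact schema.

```python theme={null}
# Sketch: rename a voice and reset its gender to the default.
# Body field names mirror the description above and are assumptions.
import os
import requests

voice_id = "YOUR_VOICE_ID"
res = requests.patch(
    f"https://api.cartesia.ai/voices/{voice_id}",
    headers={
        "Cartesia-Version": "2025-04-16",
        "Authorization": f"Bearer {os.environ['CARTESIA_API_KEY']}",
    },
    json={"name": "Narrator v2", "gender": None},  # omitting description leaves it unchanged
)
res.raise_for_status()
```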
# Audio encodings
Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/audio-encodings
Pick the encoding that matches your downstream pipeline.
## TTS output encodings
Used in the `output_format.encoding` field when generating audio.
| Encoding | Bit depth | Best for | Pair with sample rate |
| ----------- | ---------------- | --------------------------------------------------------------- | --------------------------------- |
| `pcm_s16le` | 16-bit int | General-purpose playback, browsers, audio players, most devices | 44100 (CD quality) or 16000–48000 |
| `pcm_f32le` | 32-bit float | ML post-processing, high-fidelity recording, audio analysis | 48000 |
| `pcm_mulaw` | 8-bit compressed | North American / Japanese telephony (G.711μ), Twilio | 8000 |
| `pcm_alaw` | 8-bit compressed | European / international telephony (G.711A) | 8000 |
### `pcm_s16le`
16-bit signed integer PCM, little-endian. Matches the standard audio CD format and is the most widely supported encoding across audio players, browsers, and hardware. Use this as your default unless you have a specific reason to choose another format.
```json theme={null}
{
"container": "raw",
"encoding": "pcm_s16le",
"sample_rate": 44100
}
```
### `pcm_f32le`
32-bit floating point PCM, little-endian. Provides the highest precision and dynamic range. Use when your pipeline handles float audio end-to-end—for example, feeding generated audio into an ML model, performing signal processing with NumPy/SciPy, or recording to a lossless format for later mastering.
```json theme={null}
{
"container": "raw",
"encoding": "pcm_f32le",
"sample_rate": 48000
}
```
### `pcm_mulaw`
8-bit μ-law compressed PCM. The standard encoding for North American and Japanese telephone networks (G.711μ). Use this when sending audio to Twilio or any telephony provider that expects μ-law. Always pair with an 8000 Hz sample rate to match the telephony standard.
```json theme={null}
{
"container": "raw",
"encoding": "pcm_mulaw",
"sample_rate": 8000
}
```
### `pcm_alaw`
8-bit A-law compressed PCM. The standard encoding for European and international telephone networks (G.711A). Use when your telephony infrastructure expects A-law rather than μ-law. Always pair with an 8000 Hz sample rate.
```json theme={null}
{
"container": "raw",
"encoding": "pcm_alaw",
"sample_rate": 8000
}
```
## STT input encodings
Used in the `encoding` parameter when sending audio for transcription. Must match the actual encoding of your audio source.
| Encoding | Bit depth | Common sources |
| ----------- | ---------------- | ------------------------------------------------------------------- |
| `pcm_s16le` | 16-bit int | Microphones, browsers (Web Audio API), most audio capture libraries |
| `pcm_s32le` | 32-bit int | Professional audio interfaces |
| `pcm_f16le` | 16-bit float | Half-precision ML pipelines |
| `pcm_f32le` | 32-bit float | ML models, Web Audio API `AudioWorklet` nodes, NumPy/SciPy |
| `pcm_mulaw` | 8-bit compressed | North American telephony, Twilio streams |
| `pcm_alaw` | 8-bit compressed | European telephony systems |
For best STT performance, resample your audio to `pcm_s16le` at 16000 Hz before sending.
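For example, here is a sketch of that resampling step using `soundfile` and `scipy` (assumed dependencies; any resampler works):

```python theme={null}
# Sketch: convert an audio file to pcm_s16le mono at 16 kHz before streaming to STT.
import numpy as np
import soundfile as sf
from scipy.signal import resample_poly

TARGET_RATE = 16000

audio, source_rate = sf.read("input.wav", dtype="float32")
if audio.ndim > 1:
    audio = audio.mean(axis=1)                 # downmix to mono
if source_rate != TARGET_RATE:
    audio = resample_poly(audio, TARGET_RATE, source_rate)
pcm_s16le = (np.clip(audio, -1.0, 1.0) * 32767).astype("<i2").tobytes()
# `pcm_s16le` is now raw little-endian 16-bit PCM at 16 kHz, ready to stream.
```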
# Choosing a Voice
Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/choosing-a-voice
How to pick the best voice for your Voice Agents
When designing a voice agent experience, the voice that your agents will speak in is a critical choice that will influence your customers' experience.
Cartesia offers 500+ voices out of the box, as well as the ability to clone your own voices.
### Featured Voices
We feature a set of Voices that we've found work well for our customers and pass our internal quality checks. These voices are a great starting point to find the best Voice for your voice agent.
Featured Voices are displayed with a check mark icon next to their names on [play.cartesia.ai](https://play.cartesia.ai/).
### Stable voices (best for voice agents)
For voice agents in production, we've found that more stable, realistic voices perform better than studio-quality, emotive voices. From our testing, we think these are the top performing English Voices for voice agents in Sonic 3:
* **Male**: Ronald, Carson
* **Female**: Katie, Jacqueline, Brooke
### Emotive voices (best for AI characters)
Our latest model, Sonic 3, is very expressive. Some voices, like Tessa and Maya, are labeled as Emotive in the playground and respond well to [emotion instructions](/build-with-cartesia/sonic-3/volume-speed-emotion).
If your use case requires more expressive speech (e.g. companion apps, game characters), then we suggest trying:
* **Male**: Kyle, Cory
* **Female**: Tessa, Ariana
We tag such voices as Emotive in our playground and you can see a full list [here](https://play.cartesia.ai/voices?tags=Emotive).
# Choosing TTS parameters
Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/choosing-tts-parameters
Our Text-to-Speech API includes many parameters that can be bewildering to developers who have not
worked with audio before.
In general, you should pick the highest precision and sample rate supported by every stage of your audio
pipeline, including telephony and device outputs.
A typical digital audio setup will perform well with these settings, which match the standard audio CD format:
```
output_format: {
container: "raw",
encoding: "pcm_s16le",
sample_rate: 44100,
}
```
If you know your pipeline supports a higher encoding and sample rate end to end, the highest quality settings are:
```
output_format: {
container: "raw",
encoding: "pcm_f32le",
sample_rate: 48000,
}
```
## Reference
The container format (if any) for the audio output.
Available options: `RAW`, `WAV`, `MP3`. Only the Bytes endpoint supports all container formats;
our streaming endpoints (SSE, WebSocket) only support `RAW`.
The encoding of the output audio. Available options: `pcm_f32le`, `pcm_s16le`, `pcm_mulaw`, `pcm_alaw`.
For detailed guidance on when to use each encoding, see [Audio encodings](/build-with-cartesia/capability-guides/audio-encodings).
The sample rate of the output audio. Remember that to represent a given signal, the sample rate
must be at least twice the highest frequency component of the signal (Nyquist theorem).
Available options: `8000`, `16000`, `22050`, `24000`, `44100`, `48000`.
## Examples
### Audio CD quality
Standard audio CDs are encoded as `pcm_s16le` at 44.1kHz sample rate:
```
output_format: {
container: "raw",
encoding: "pcm_s16le",
sample_rate: 44100,
}
```
This performs well for consumer digital audio setups.
### Telephony
Many customers send their audio output over Twilio. Since all audio sent over Twilio is
transcoded to μ-law encoding with an 8kHz sample rate (to match the telephony standard), you should
specify the following `output_format`:
```
output_format: {
container: "raw",
encoding: "pcm_mulaw",
sample_rate: 8000,
}
```
### Bluetooth headsets
If you happen to know that the user is using a Bluetooth headset (such as AirPods) to multiplex
both microphone input and headphone output, the user will be on the Bluetooth Hands-Free Profile
(HFP), limiting the sample rate to 16kHz. (In practice, it's difficult to programmatically determine the
end-user's microphone/speaker devices, so this example is a bit contrived.)
```
output_format: {
container: "raw"
encoding: "pcm_s16le",
sample_rate: 16000,
}
```
# Clone Voices
Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/clone-voices
Learn how to get the best voice clones from your audio clips.
Voice cloning is available through the [playground](https://play.cartesia.ai) and the [API](/2024-11-13/api-reference/voices/clone). With current API versions, instant cloning uses **high-similarity** mode: clones sound more like the source clip, but may reproduce background noise. For the legacy **stability** workflow, pin API version `2024-11-13` and see [Older TTS models](/build-with-cartesia/tts-models/older-models).
For the best voice clones, we recommend following these best practices:
## General best practices for voice cloning
1. **Choose an appropriate script to speak.** You want your recording to align as closely as possible with the voice you want to generate. For example, don't read a colorless transcript in a monotone voice unless you're aiming for a monotonous clone. Instead, prepare a script that is suited to your use case and has the right energy.
2. **Speak as clearly as possible and avoid background noise.** For example, when recording yourself, try to use a high-quality microphone and be in a quiet space.
3. **Avoid long pauses.** Pauses in the recording will be mimicked by the cloned voice, such as between sentences. Ensure your recording matches the pacing you want your voice to follow.
4. **Trim your recording.** The audio you provide should roughly contain speech from start to finish. Make sure the speaker is not cut off and that there's no excessive silence at the beginning or end. You can use a tool like Audacity or our playground to make the perfect clip from your recording.
5. **Speak in the target language.** For instance, if you want the cloned voice to speak Spanish, speak Spanish in the recording. If this is not possible, you can use Cartesia's localization feature—available in the playground and in the API—to convert your clone to a different language.
## Best practices for high-similarity clones
1. **Limit your recording to ten seconds.** This is the sweet spot for high-similarity clones. A longer clip will not result in a better clone.
2. **Set `enhance` to `false` when cloning.** Unless your source clip has substantial background noise, any postprocessing will reduce the similarity of the clone to the source clip.
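For instance, a clone call following the recommendations above might look like the following sketch using the Python SDK. The `clip`, `language`, `name`, and `mode` arguments follow the SDK example elsewhere in these docs; passing `enhance` here is an assumption based on the parameter named above, so verify it against the Clone Voice reference.

```python theme={null}
# Sketch: instant high-similarity clone from a ~10 second clip, with
# enhancement disabled as recommended above. `enhance` is assumed here.
from cartesia import Cartesia

client = Cartesia(api_key="YOUR_API_KEY")

with open("my_voice_10s.wav", "rb") as clip:
    voice = client.voices.clone(
        clip=clip,
        name="My high-similarity clone",
        language="en",     # must match the language spoken in the clip
        mode="similarity",
        enhance=False,     # keep the clone close to the source clip
    )
print(voice.id)
```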
# End-to-end Pro Voice Cloning (Python)
Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/clone-voices-pro/api
Use Cartesia's REST API to create a Pro Voice Clone.
> **Prerequisites**
>
> 1. You have a **Cartesia API key** (export it as `CARTESIA_API_KEY`).
> 2. You have at least 1M credits on your account.
> 3. You have a folder called `samples/` with one or more `.wav` files.
```python lines theme={null}
"""
End-to-end Pro Voice Cloning example.
Steps
-----
1. Create a dataset.
2. Upload audio files from samples/ to the dataset.
3. Kick off a fine-tune from that dataset.
4. Poll until fine-tune is completed.
5. Get the voices produced by the fine-tune.
"""
import os
import time
from pathlib import Path
import requests
API_BASE = "https://api.cartesia.ai"
API_HEADERS = {
"Cartesia-Version": "2025-04-16",
"Authorization": f"Bearer {os.environ['CARTESIA_API_KEY']}",
}
def create_dataset(name: str, description: str) -> str:
"""POST /datasets → dataset id."""
res = requests.post(
f"{API_BASE}/datasets",
headers=API_HEADERS,
json={"name": name, "description": description},
)
res.raise_for_status()
return res.json()["id"]
def upload_file_to_dataset(dataset_id: str, path: Path) -> None:
"""POST /datasets/{dataset_id}/files (multipart/form-data)."""
with path.open("rb") as fp:
res = requests.post(
f"{API_BASE}/datasets/{dataset_id}/files",
headers=API_HEADERS,
files={"file": fp, "purpose": (None, "fine_tune")},
)
res.raise_for_status()
def create_fine_tune(dataset_id: str, *, name: str, language: str, model_id: str) -> str:
"""POST /fine-tunes → fine-tune id."""
body = {
"name": name,
"description": "Pro Voice Clone demo",
"language": language,
"model_id": model_id,
"dataset": dataset_id,
}
res = requests.post(f"{API_BASE}/fine-tunes", headers=API_HEADERS, json=body, timeout=60)
res.raise_for_status()
return res.json()["id"]
def wait_for_fine_tune(ft_id: str, every: float = 10.0) -> None:
"""Poll GET /fine-tunes/{id} until status == completed."""
start = time.monotonic()
while True:
res = requests.get(f"{API_BASE}/fine-tunes/{ft_id}", headers=API_HEADERS)
res.raise_for_status()
status = res.json()["status"]
print(f"fine-tune {ft_id} -> {status}. Elapsed: {time.monotonic() - start:.0f}s")
if status == "completed":
return
if status == "failed":
raise RuntimeError(f"fine-tune ended with status={status}")
time.sleep(every)
def list_voices(ft_id: str) -> list[dict]:
"""GET /fine-tunes/{id}/voices → list of voices."""
res = requests.get(f"{API_BASE}/fine-tunes/{ft_id}/voices", headers=API_HEADERS)
res.raise_for_status()
return res.json()["data"]
if __name__ == "__main__":
# Create the dataset
DATASET_ID = create_dataset("PVC demo", "Samples for a Pro Voice Clone")
print("Created dataset:", DATASET_ID)
# Upload .wav files to the dataset
for wav_path in Path("samples").glob("*.wav"):
upload_file_to_dataset(DATASET_ID, wav_path)
print(f"Uploaded {wav_path.name} to dataset {DATASET_ID}")
# Ask for confirmation before kicking off the fine-tune
confirmation = input(
"Are you sure you want to start the fine-tune? It will cost 1M credits upon successful completion (yes/no): "
)
if confirmation.lower() != "yes":
print("Fine-tuning cancelled by user.")
exit()
# Kick off the fine-tune
FINE_TUNE_ID = create_fine_tune(
DATASET_ID,
name="PVC demo",
language="en",
model_id="sonic-2",
)
print(f"Started fine-tune: {FINE_TUNE_ID}")
# Wait for training to finish
wait_for_fine_tune(FINE_TUNE_ID)
print("Fine-tune completed!")
# Fetch the voices created by the fine-tune
voices = list_voices(FINE_TUNE_ID)
print("Voices IDs:")
for voice in voices:
print(voice["id"])
```
# Pro Voice Cloning
Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/clone-voices-pro/playground
## Why use Pro Voice Cloning?
A Professional Voice Clone (PVC) is a voice that uses a fine-tune of our TTS model on your data, which allows it to create an almost exact replica of the voice it hears, including accent, speaking style, and audio quality.
Compared to [Instant Voice Cloning](/build-with-cartesia/capability-guides/clone-voices), Pro Voice Cloning can capture the exact nuances of your hours of studio-quality audio voice data.
## Overview
Pro Voice Cloning is available in the [Playground](https://play.cartesia.ai/pro-voice-cloning) for anyone with a Cartesia subscription of Startup or higher. It allows you to create highly accurate voice clones by leveraging a larger amount of data compared to instant cloning.
| Feature | Required audio data | Pricing: cost to create | Pricing: cost to use for TTS |
| ------------------- | ------------------- | ----------------------- | ---------------------------- |
| Instant Voice Clone | 10 seconds | Free | 1 credit per character |
| Pro Voice Clone | 3 hours | 1M credits on success | 1.5 credits per character |
When you create a Pro Voice Clone, Cartesia first fine-tunes a model on your data, then creates Voices from selected clips of your data. These Voices are tied to the fine-tuned model, which is automatically used when you generate text-to-speech with them.
## Get started
Visit the Pro Voice Clone tab to get started on your first PVC. On the home page, you can see all your fine-tuned models and their statuses (Draft, Failed, Training, or Completed).
Fill out the form to create a Pro Voice Clone.
Then, upload all of the audio files you want to use for training. You can upload multiple
files at once. Files must be one of the following audio formats:
* .wav
* .mp3
* .flac
* .ogg
* .oga
* .ogx
* .aac
* .wma
* .m4a
* .opus
* .ac3
* .webm
Pro Voice Clones require a minimum of 30 minutes of audio, but we recommend 2 hours of audio for optimal balance of quality and effort. The Pro Voice Clone will closely match your uploaded data, so make sure it sounds the way you like in terms of background noise, loudness, and speech quality.
Generally, it's better to upload audio containing only the speaker you wish to clone. Multi-speaker audio can interfere with cloning quality.
If you want to reuse data from past Pro Voice Clones, switch to the **Select dataset** tab to view previous datasets. These datasets can be edited separately from your PVCs and are helpful for managing your audio files.
Training should take 3 hours to complete. You'll only be charged if the training is successful. If training fails, you can click the `Re-attempt Training` button to try again or contact [support](mailto:support@cartesia.ai) if the failures persist.
Once training is complete, we'll automatically create four Voices based on different source audio clips from your dataset. These Voices are internally linked to your fine-tuned model, which will be used when you specify the model ID of the fine-tuned model in your requests.
The Voices are also available in the Voice Library under My Voices and can be used through the API.
**Note about base model updates:**
We've fine-tuned the latest base model available in production, which is reflected in the displayed model ID. This means that the fine-tuned model is fixed to this particular model ID and will not be activated if you use a different `model-id`. PVCs will not automatically be updated for future base models, and will need to be retrained on each new base model.
Retraining a new fine-tuned model with new data or the latest base model will again cost 1M credits.
# Localize voices
Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/localize-voices
Learn how to localize voices for your brand or product.
The localization feature accepts a voice to localize, the gender of the voice, and the target language and accent to localize to, and produces a Voice that you can use to generate speech (or save as a new voice).
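As an illustration, a localization request might look like the following sketch. The body field names (`voice_id`, `language`, `original_speaker_gender`, `dialect`) are assumptions based on the description above; check the Localize Voice reference for the exact schema.

```python theme={null}
# Sketch: localize an existing voice to another language/accent.
# Body field names are assumptions; see the Localize Voice reference.
import os
import requests

res = requests.post(
    "https://api.cartesia.ai/voices/localize",
    headers={
        "Cartesia-Version": "2025-04-16",
        "Authorization": f"Bearer {os.environ['CARTESIA_API_KEY']}",
    },
    json={
        "voice_id": "a0e99841-438c-4a64-b679-ae501e7d6091",
        "language": "fr",
        "original_speaker_gender": "female",
        "dialect": "fr",  # target accent/dialect
    },
)
res.raise_for_status()
print(res.json())  # the localized Voice you can use or save
```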
# Stream Inputs using Continuations
Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/stream-inputs-using-continuations
Learn how to stream input text to Sonic TTS.
In many real-time use cases, you don't have input text available upfront—like when you're generating it on the fly using a language model. For these cases, we support input streaming through a feature we call *continuations*.
This guide will cover how input streaming works from the perspective of the TTS model. If you just want to implement input streaming, see [the WebSocket API reference](/api-reference/tts/tts), which implements continuations using *contexts*.
## Continuations
Continuations are generations that extend already generated speech. They're called continuations because you're continuing the generation from where the last one left off, maintaining the *prosody* of the previous generation.
If you don't use continuations, you get sudden changes in prosody that create seams in the audio.
Prosody refers to the rhythm, intonation, and stress in speech. It's what makes speech flow naturally and sound human-like.
Let's say we're using an LLM and it generates a transcript in three parts, with a one second delay between each part:
1. `Hello, my name is Sonic.`
2. ` It's very nice`
3. ` to meet you.`
To generate speech for the whole transcript, we might think to generate speech for each part independently and stitch the audios together:
Unfortunately, we end up with speech that has sudden changes in prosody and strange pacing.
Now, let's try the same transcripts, but using continuations: each part continues the previous generation on the same context. The resulting output sounds seamless and natural.
You can scale up continuations to any number of inputs. There is no limit.
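Here is a sketch of the continuation setup over the WebSocket. The field names follow the WebSocket example later in this guide; the voice ID and `output_format` values are placeholders.

```python theme={null}
# Sketch: send the three LLM chunks as continuations on one context.
# Field names follow the WebSocket example later in this guide.
chunks = ["Hello, my name is Sonic.", " It's very nice", " to meet you."]

def continuation_message(i: int, text: str) -> dict:
    return {
        "model_id": "sonic-3",
        "transcript": text,
        "voice": {"mode": "id", "id": "a0e99841-438c-4a64-b679-ae501e7d6091"},
        "output_format": {"container": "raw", "encoding": "pcm_s16le", "sample_rate": 44100},
        "context_id": "llm-turn-1",       # same context for every chunk
        "continue": i < len(chunks) - 1,  # False on the last chunk
    }

messages = [continuation_message(i, text) for i, text in enumerate(chunks)]
# Send each message over the WebSocket as soon as its chunk arrives from the LLM;
# the audio continues seamlessly because prosody is carried across the context.
```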
## Caveat: Streamed inputs should form a valid transcript when joined
This means that `"Hello, world!"` can be followed by `" How are you?"` (note the leading space) but not `"How are you?"`, since when joined they form the invalid transcript `"Hello, world!How are you?"`.
In practice, this means you should maintain spacing and punctuation in your streamed inputs.
**End complete sentences with closing punctuation** (for example `.`, `?`, or `!`).
If a streamed chunk does not end with sentence-ending punctuation, the model often treats it as an incomplete sentence. That can cause:
* **Extra latency:** Text may stay in the automatic input buffer until the model sees a clearer boundary or until `max_buffer_delay_ms` elapses (**3000ms by default**), so audio starts later than you expect.
* **Audio artifacts:** The model expects natural sentence endings; without closing punctuation, the generated audio sometimes ends with odd or distorted sounds.
When a user-facing utterance is finished, put terminal punctuation on the final segment (and signal that no more text is coming on the context when appropriate, for example `no_more_inputs()` in the SDK or `continue: false` over the WebSocket).
## Automatic buffering with `max_buffer_delay_ms`
When streaming inputs from LLMs word-by-word or token-by-token, we buffer text until we reach the optimal transcript length for our model. The default buffer is 3000ms; if you wish to modify it, use the `max_buffer_delay_ms` parameter, though we *do not recommend making this change*.
If you plan on using `speed` or `volume` [SSML tags](/build-with-cartesia/sonic-3/ssml-tags) with buffering, make sure decimal values are not split up.
Submitting `1.0` as `1`, `.`, `0` will result in unintended failure modes.
### How it works
When set, the model will buffer incoming text chunks until it's confident it has enough context to generate high-quality speech, or the buffer delay elapses, whichever comes first.
Without this buffer, the model would immediately start generating with each input, which could result in choppy audio or unnatural prosody if inputs are very small (like single words or tokens).
### Configuration
* **Range**: Values between 0-5000ms are supported
* **Default**: 3000ms
Use this *only* if
* you have custom buffering client side, in which case you can set this to 0
* you have choppiness even at 3000ms, in which case you can try a higher value
```js lines theme={null}
// Example WebSocket request with `max_buffer_delay_ms`
{
"model_id": "sonic-3",
"transcript": "Hello", // First word/token
"voice": {
"mode": "id",
"id": "a0e99841-438c-4a64-b679-ae501e7d6091"
},
"context_id": "my-conversation-123",
"continue": true,
"max_buffer_delay_ms": 3000 // Buffer up to 3000ms
}
```
Let's try the following transcripts with continuations and the default `max_buffer_delay_ms=3000`: `['Hello', 'my name', 'is Sonic.', "It's ", 'very ', 'nice ', 'to ', 'meet ', 'you.']`
# Custom Pronunciations
Source: https://docs.cartesia.ai/build-with-cartesia/sonic-3/custom-pronunciations
Learn how to specify custom pronunciations for words that are hard to get right, like proper nouns or domain-specific terms.
All models in the Sonic TTS family support custom pronunciations in your transcripts. Try out the pronunciation tool on our [demo](https://play.cartesia.ai/demos/pronunciation) page.
`sonic-3` supports custom pronunciation dictionaries, which allow specifying how to pronounce a specific word or words more easily and sustainably.
At its core, a dictionary is a simple search and replace, which directs the model to use another string in lieu of the text for the transcript. The pronunciation can either be an [IPA pronunciation](/build-with-cartesia/sonic-3/phonemes), or a "sounds-like" guidance:
```json lines theme={null}
[
{
"text": "bayou",
"pronunciation": "<<ˈ|b|ɑ|ˈ|j|u>>"
},
{
"text": "jambalaya",
"pronunciation": "<<ˈ|dʒ|ə|m|ˈ|b|ə|ˈ|l|aɪ|ˈ|ə>>"
},
{
"text": "tchoupitoulas",
"pronunciation": "chop-uh-TOO-liss"
}
]
```
These JSONs can then be saved as pronunciation dictionaries [through our API](https://docs.cartesia.ai/api-reference/pronunciation-dicts/create) or through our [playground](https://play.cartesia.ai/pronunciation), which also provides UI affordances for creating and editing dictionaries directly.
Once the dictionaries are created, they can be used in any of the TTS APIs by specifying the dictionary's ID in `pronunciation_dict_id`.
With the above dictionary, the string `I ate some jambalaya on tchoupitoulas street` would become `I ate some <<ˈ|dʒ|ə|m|ˈ|b|ə|ˈ|l|aɪ|ˈ|ə>> on chop-uh-TOO-liss street` before being handed off to the model, which in turn would pronounce it properly.
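For illustration, a TTS request that applies a saved dictionary might look like the following sketch. Only the `pronunciation_dict_id` field name comes from the text above; its placement in the request body alongside the other TTS fields is an assumption, so check the TTS reference.

```python theme={null}
# Sketch: a TTS request that applies a saved pronunciation dictionary.
# Placement of `pronunciation_dict_id` in the body is an assumption.
tts_request = {
    "model_id": "sonic-3",
    "transcript": "I ate some jambalaya on tchoupitoulas street.",
    "voice": {"mode": "id", "id": "a0e99841-438c-4a64-b679-ae501e7d6091"},
    "output_format": {"container": "raw", "encoding": "pcm_s16le", "sample_rate": 44100},
    "pronunciation_dict_id": "YOUR_DICT_ID",
}
# The replacements happen before synthesis, so the model sees the IPA or
# sounds-like strings from the dictionary instead of the raw words.
```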
## Case Sensitivity
Dictionary matching is **case-sensitive**, with one exception: a lowercase entry also matches its sentence-start capitalized form. For example, `cat` matches both `cat` and `Cat`, but not `CAT`. An entry for `CAT` only matches `CAT`.
This applies to multi-word entries too. An entry for `green valley` matches `green valley` and `Green valley`, but not `Green Valley`.
**Use lowercase entries for common words.** These match the word both mid-sentence (`cat`) and at the start of a sentence (`Cat`), covering the two most common positions.
**Use exact capitalization for proper nouns.** A term like "LaTeX" should be entered as `LaTeX` so it doesn't collide with a different pronunciation for the common word `latex`. For multi-word proper nouns, enter the exact casing as it appears in your transcripts — for example, `Green Valley` if the transcript capitalizes both words.
> For the best controllability around pronunciation, we recommend using `sonic-3`.
`sonic-2` and `sonic-turbo` use MFA-style IPA for all languages. Among the older models, `sonic-2` offers the best pronunciation controllability.
You can also get custom pronunciations with older Sonic models:
* The `sonic`, `sonic-2024-12-12`, and `sonic-2024-10-19` models use Sonic-flavored IPA phonemes for English.
* The `sonic` and `sonic-2024-12-12` models use MFA-style IPA for languages other than English, and the Sonic Preview model uses MFA-style IPA for all languages.
* `sonic-2024-10-19` does not support custom pronunciations for languages other than English.
We will soon be updating all models to use MFA-style IPA.
Custom words should be wrapped in double angle brackets `<<` `>>`, with pipe characters `|` between phonemes and no whitespace.
For example:
* `Can I get <> on that?` (MFA-style IPA)
* `Can I get <> on that?` (Sonic-flavored IPA)
Each individual word should be wrapped in its own set of angle brackets.
# MFA-style IPA
## Constructing Pronunciations
We use the IPA phoneset as defined by the [Montreal Forced Aligner](https://montreal-forced-aligner.readthedocs.io/en/latest/). Because of the size and complexity of this phoneset, you may find it easier to construct your custom pronunciation starting from existing words with known phonemizations. We suggest the following workflow for constructing a custom pronunciation for a word:
1. Go to the [MFA pronunciation dictionary index](https://mfa-models.readthedocs.io/en/latest/dictionary/index.html) and find the page corresponding to your language. Make sure the phoneset is MFA, and try to download the latest version (for most languages, v3.0 or v3.1).
1. This page will give you the full range of acceptable phones for your language under the “phones” section.
2. Scroll down to the `Installation` section and click on the `Download from the release page` link.
3. Scroll to the bottom of the release page and download the .dict file; this is a text file mapping words to their constituent phonemes.
1. The first column in the file contains words, and the last column contains space delimited phonemes. Ignore the other columns.
4. Look up your word or words that sound similar to your intended pronunciation in the dictionary. Use these pronunciations as a starting point to construct your custom pronunciation.
Automatic pronunciation suggestions based on audio samples will be added in a future update. Note that MFA-style IPA does not support stress markers.
## Example
Suppose I want to generate the text “This is a generation from Cartesia” and the model is not pronouncing “Cartesia” correctly. I would do the following:
1. Go to the [MFA pronunciation dictionary index](https://mfa-models.readthedocs.io/en/latest/dictionary/index.html) and look for English pronunciation dictionaries. I see that for US English, the most recent version is v3.1.
1. I note that the page says that the acceptable phones for US english are `aj aw b bʲ c cʰ cʷ d dʒ dʲ d̪ ej f fʲ h i iː j k kʰ kʷ l m mʲ m̩ n n̩ ow p pʰ pʲ pʷ s t tʃ tʰ tʲ tʷ t̪ v vʲ w z æ ç ð ŋ ɐ ɑ ɑː ɒ ɒː ɔj ə ɚ ɛ ɝ ɟ ɟʷ ɡ ɡʷ ɪ ɫ ɫ̩ ɱ ɲ ɹ ɾ ɾʲ ɾ̃ ʃ ʉ ʉː ʊ ʎ ʒ ʔ θ`
2. Download the .dict file from the bottom of the [release page](https://github.com/MontrealCorpusTools/mfa-models/releases/tag/dictionary-english_us_mfa-v3.1.0).
3. Find a word in this dictionary that sounds similar to how I want “Cartesia” to be pronounced. I see this entry in the dictionary:
`cartesian 0.99 0.14 1.0 1.0 kʰ ɑ ɹ tʲ i ʒ ə n`
4. Ignore the middle four numeric columns. I want to cut off the part of the pronunciation that corresponds to “-an” and replace it with an “uh” sound. I know that the MFA phoneme for “uh” is `ɐ` (if I didn’t know that, I could also look up “uh” in the dictionary). So the pronunciation I want is `kʰ ɑ ɹ tʲ i ʒ ɐ`.
5. Format the phonemes in angle brackets with pipe characters between phonemes and no whitespace. So my transcript is `This is a generation from <<kʰ|ɑ|ɹ|tʲ|i|ʒ|ɐ>>`.
# (Deprecated) Sonic-flavored IPA
Sonic-flavored IPA is only for `sonic`; users of our latest models (`sonic-2` and `sonic-turbo`) should use MFA-style IPA.
Here is a pronunciation guide for Sonic-flavored IPA.
It follows the [English phonology article on Wikipedia](https://en.wikipedia.org/wiki/English_phonology) for most phonemes,
but in spots where our model requires different notation than you may expect, we've included a blue `<=` in the margins.
You can copy/paste some of these uncommon symbols from the original [charts here](https://docs.google.com/spreadsheets/d/1OJbiKtxLyodpNPqVfOu43X2HloLsAixTtFppEuQ_4pI/edit?usp=sharing).
## Stresses and vowel length markers
Sonic English requires stress markers for first (`ˈ`) and second (`ˌ`) stressed syllables, which go directly before the vowel. We also use annotations for vowel length (`ː`). The model can also operate without them, but you will have noticeably better robustness and control when using them.
# Prompting tips
Source: https://docs.cartesia.ai/build-with-cartesia/sonic-3/prompting-tips
1. **Use appropriate punctuation.** Add punctuation where appropriate and at the end of each transcript whenever possible.
2. **Use dates in MM/DD/YYYY form.** For example, 04/20/2023.
3. **Add spaces between time and AM/PM.** For example, `7:00 PM`, `7 PM`, `7:00 P.M.`
4. **Insert pauses.** To insert pauses, insert "-" or use [break tags](/build-with-cartesia/formatting-text-for-sonic-2/inserting-breaks-pauses) where you need the pause. These tags are considered 1 character and do not need to be separated from adjacent text by a space -- to save credits you can remove spaces around break tags.
5. **Match the voice to the language.** Each voice has a language that it works best with. You can use the playground to quickly understand which voices are most appropriate for a language.
6. **Stream in inputs for contiguous audio.** Use [continuations](/build-with-cartesia/capability-guides/stream-inputs-using-continuations) if generating audio that should sound contiguous in separate chunks.
7. **Specify [custom pronunciations](/build-with-cartesia/sonic-3/custom-pronunciations) for domain-specific or ambiguous words.** You may want to do this for proper nouns and trademarks, as well as for words that are spelled the same but pronounced differently, like the city of Nice and the adjective "nice."
8. **Force [spelling out numbers and letters](/build-with-cartesia/sonic-3/ssml-tags#spelling-out-numbers-and-letters).** You may want to do this for IDs, email addresses, or numeric values.
For sonic-2, see [Formatting Text for Sonic-2](/build-with-cartesia/formatting-text-for-sonic-2/best-practices).
# SSML Tags
Source: https://docs.cartesia.ai/build-with-cartesia/sonic-3/ssml-tags
Tags for volume, speed, and emotion are in beta and subject to change in the
future.
Sonic-3 supports SSML-like (Speech Synthesis Markup Language) tags to control generated speech.
## Speed
Note that if you're streaming token by token, you'll need to buffer the whole value of the speed or volume tags.
Passing in `1`, `.`, `0` as separate inputs, for example, will result in reading out the tags.
You can guide the speed of a TTS generation with a `speed` tag, which takes a scalar between `0.6` and `1.5`.
This value is roughly a multiplier on the default speed. For example, `1.5` will generate audio at roughly 1.5x the
default speed.
```xml theme={null}
I like to speak quickly because it makes me sound smart.
```
## Volume
You can guide the volume of a TTS generation with a `volume` tag, which is a number between `0.5`
and `2.0`. The default volume is `1`.
```xml theme={null}
I will speak softly.
```
## Emotion (Beta)
Emotion control is highly experimental, particularly when emotion shifts occur
mid-generation. If you need to change the emotion in a transcript, we recommend
using separate generation contexts for each emotion. For best results, use [Voices
tagged as "Emotive"](https://play.cartesia.ai/voices?tags=Emotive), as emotions may not work reliably with other Voices.
```xml theme={null}
I will not allow you to continue this! I was hoping for a peaceful resolution.
```
## Pauses and breaks
To insert breaks (or pauses) in generated speech, use a `break` tag with one attribute, `time`. For
example, `<break time="1s" />`. You can specify the time in seconds (`s`) or milliseconds (`ms`).
For accounting purposes, these tags are considered 1 character and do not need to be separated from adjacent text by a
space -- to save credits you can remove spaces around break tags.
```xml theme={null}
Hello, my name is Sonic.<break time="1s" />Nice to meet you.
```
## Spelling out numbers and letters
To spell out input text, you can wrap it in `<spell></spell>` tags.
This is particularly useful for pronouncing long numbers or identifiers, such as credit card numbers, phone numbers, or unique IDs.
```xml theme={null}
My name is Bob, spelled <spell>Bob</spell>, my account number is <spell>ABC-123</spell>, my phone number is <spell>(123) 456-7890</spell>, and my credit card is <spell>1234-5678-9012-3456</spell>.
```
If you want to spell out numbers or identifiers and have planned breaks between the generations (e.g. taking a break between the area code of a phone number and the rest of that number), you can combine `<spell>` and `<break>` tags. These tags are considered 1 character and do not need to be separated from adjacent text by a space -- to save credits you can remove spaces around break and spell tags.
```xml theme={null}
My phone number is (123)4712177 and my credit card number is 1234567863474537.
```
# Volume, Speed, and Emotion
Source: https://docs.cartesia.ai/build-with-cartesia/sonic-3/volume-speed-emotion
Sonic-3 provides rich controls for the speed, volume, and emotion of generated speech. These controls are available on play.cartesia.ai using the UI controls, or by passing in a `generation_config` parameter, or by using SSML tags within the transcript itself.
**Sonic-3 interprets these parameters as guidance** instead of as strict adjustments to ensure natural speech, so we recommend testing them against your content to ensure the output matches your expectations.
## Speed and Volume Controls
You can guide the speed and volume of a TTS generation with the `generation_config.speed` and `generation_config.volume` parameters. These values are roughly a multiplier on the default speed and volume; e.g., `1.5` will generate audio at 1.5x the default speed.
The speed of the generation, ranging from `0.6` to `1.5`.
The volume of the generation, ranging from `0.5` to `2.0`.
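For example, a request body carrying this guidance might look like the following sketch. The surrounding TTS fields follow the other examples in these docs; the exact placement of `generation_config` in the request body should be checked against the TTS reference.

```python theme={null}
# Sketch: guiding speed and volume via generation_config.
# Placement of generation_config in the body is assumed; verify against the TTS reference.
tts_request = {
    "model_id": "sonic-3",
    "transcript": "I like to speak quickly, and I can be loud, too!",
    "voice": {"mode": "id", "id": "a0e99841-438c-4a64-b679-ae501e7d6091"},
    "output_format": {"container": "raw", "encoding": "pcm_s16le", "sample_rate": 44100},
    "generation_config": {
        "speed": 1.3,   # 0.6 - 1.5, roughly a multiplier on the default speed
        "volume": 1.5,  # 0.5 - 2.0, roughly a multiplier on the default volume
    },
}
```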
You can also specify these inside the transcript itself, using [SSML](/build-with-cartesia/sonic-3/ssml-tags), for example:
```xml lines theme={null}
I like to speak quickly because it makes me sound smart.
And I can be loud, too!
```
## Emotion Controls (Beta)
By default, the model attempts to interpret the emotional subtext present in the provided transcript. You can guide the emotion of a TTS generation, like a director providing guidance to an actor, using the `generation_config.emotion` parameter.
Emotion tags are good for pushing the model to be more emotive, but they only work when the emotion is consistent with the transcript. For instance, the mismatch below is unlikely to work well:
```xml theme={null}
I'm so excited!
```
The emotional guidance for a generation, one of the emotions below.
The primary emotions, for which we have the most data and produce the best results, are: `neutral`, `angry`, `excited`, `content`, `sad`, and `scared`.
The complete list of available emotions is: `happy`, `excited`, `enthusiastic`, `elated`, `euphoric`, `triumphant`, `amazed`, `surprised`, `flirtatious`, `joking/comedic`, `curious`, `content`, `peaceful`, `serene`, `calm`, `grateful`, `affectionate`, `trust`, `sympathetic`, `anticipation`, `mysterious`, `angry`, `mad`, `outraged`, `frustrated`, `agitated`, `threatened`, `disgusted`, `contempt`, `envious`, `sarcastic`, `ironic`, `sad`, `dejected`, `melancholic`, `disappointed`, `hurt`, `guilty`, `bored`, `tired`, `rejected`, `nostalgic`, `wistful`, `apologetic`, `hesitant`, `insecure`, `confused`, `resigned`, `anxious`, `panicked`, `alarmed`, `scared`, `neutral`, `proud`, `confident`, `distant`, `skeptical`, `contemplative`, `determined`.
The Voices with the best emotional response are:
* [Leo](https://play.cartesia.ai/voices/0834f3df-e650-4766-a20c-5a93a43aa6e3) (id: `0834f3df-e650-4766-a20c-5a93a43aa6e3`)
* [Jace](https://play.cartesia.ai/voices/6776173b-fd72-460d-89b3-d85812ee518d) (id: `6776173b-fd72-460d-89b3-d85812ee518d`)
* [Kyle](https://play.cartesia.ai/voices/c961b81c-a935-4c17-bfb3-ba2239de8c2f) (id: `c961b81c-a935-4c17-bfb3-ba2239de8c2f`)
* [Gavin](https://play.cartesia.ai/voices/f4a3a8e4-694c-4c45-9ca0-27caf97901b5) (id: `f4a3a8e4-694c-4c45-9ca0-27caf97901b5`)
* [Maya](https://play.cartesia.ai/voices/cbaf8084-f009-4838-a096-07ee2e6612b1) (id: `cbaf8084-f009-4838-a096-07ee2e6612b1`)
* [Tessa](https://play.cartesia.ai/voices/6ccbfb76-1fc6-48f7-b71d-91ac6298247b) (id: `6ccbfb76-1fc6-48f7-b71d-91ac6298247b`)
* [Dana](https://play.cartesia.ai/voices/cc00e582-ed66-4004-8336-0175b85c85f6) (id: `cc00e582-ed66-4004-8336-0175b85c85f6`)
* [Marian](https://play.cartesia.ai/voices/26403c37-80c1-4a1a-8692-540551ca2ae5) (id: `26403c37-80c1-4a1a-8692-540551ca2ae5`)
View the full list of emotive Voices on our [Voice Library with voices tagged 'Emotive'](https://play.cartesia.ai/voices?tags=Emotive).
You can also use [SSML](/build-with-cartesia/sonic-3/ssml-tags) tags for emotions, for example:
```xml theme={null}
How dare you speak to me like I'm just a robot!
```
## Nonverbalisms
Insert `[laughter]` in your transcript to make the model laugh. In the future, we plan to add more non-speech verbalisms like sighs and coughs.
# STT Models
Source: https://docs.cartesia.ai/build-with-cartesia/stt-models
Ink is a new family of streaming speech-to-text (STT) models for developers building real-time voice applications.
Each base model name (e.g. `ink-whisper`) points to the latest **stable** snapshot of the model; to use the stable version, we recommend specifying the base model name.
In many cases the stable and preview snapshots are the same, but in some cases the preview snapshot may have additional features or improvements.
## `ink-whisper`
Ink Whisper is the fastest, most affordable speech-to-text model — engineered for enterprise deployment in production-grade voice agents. It delivers higher accuracy than baseline Whisper and is optimized for real-time performance in a wide variety of real-world conditions.
Additional Capabilities:
* Handles variable-length audio chunks and interruptions gracefully using dynamic chunking.
* Reliably transcribes speech with background noise.
* Accurately transcribes audio with telephony artifacts, accents, and disfluencies.
* Excels at transcribing proper nouns and domain-specific terminology.
| Snapshot | Release Date | Languages | Status |
| ------------------------------------ | ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------ |
| `ink-whisper` | June 10, 2025 | `en`, `zh`, `de`, `es`, `ru`, `ko`, `fr`, `ja`, `pt`, `tr`, `pl`, `ca`, `nl`, `ar`, `sv`, `it`, `id`, `hi`, `fi`, `vi`, `he`, `uk`, `el`, `ms`, `cs`, `ro`, `da`, `hu`, `ta`, `no`, `th`, `ur`, `hr`, `bg`, `lt`, `la`, `mi`, `ml`, `cy`, `sk`, `te`, `fa`, `lv`, `bn`, `sr`, `az`, `sl`, `kn`, `et`, `mk`, `br`, `eu`, `is`, `hy`, `ne`, `mn`, `bs`, `kk`, `sq`, `sw`, `gl`, `mr`, `pa`, `si`, `km`, `sn`, `yo`, `so`, `af`, `oc`, `ka`, `be`, `tg`, `sd`, `gu`, `am`, `yi`, `lo`, `uz`, `fo`, `ht`, `ps`, `tk`, `nn`, `mt`, `sa`, `lb`, `my`, `bo`, `tl`, `mg`, `as`, `tt`, `haw`, `ln`, `ha`, `ba`, `jw`, `su`, `yue` | Stable |
| `ink-whisper-2025-06-04` | June 4, 2025 | `en`, `zh`, `de`, `es`, `ru`, `ko`, `fr`, `ja`, `pt`, `tr`, `pl`, `ca`, `nl`, `ar`, `sv`, `it`, `id`, `hi`, `fi`, `vi`, `he`, `uk`, `el`, `ms`, `cs`, `ro`, `da`, `hu`, `ta`, `no`, `th`, `ur`, `hr`, `bg`, `lt`, `la`, `mi`, `ml`, `cy`, `sk`, `te`, `fa`, `lv`, `bn`, `sr`, `az`, `sl`, `kn`, `et`, `mk`, `br`, `eu`, `is`, `hy`, `ne`, `mn`, `bs`, `kk`, `sq`, `sw`, `gl`, `mr`, `pa`, `si`, `km`, `sn`, `yo`, `so`, `af`, `oc`, `ka`, `be`, `tg`, `sd`, `gu`, `am`, `yi`, `lo`, `uz`, `fo`, `ht`, `ps`, `tk`, `nn`, `mt`, `sa`, `lb`, `my`, `bo`, `tl`, `mg`, `as`, `tt`, `haw`, `ln`, `ha`, `ba`, `jw`, `su`, `yue` | Stable |
To learn how to use the Ink STT family, see [the Speech-to-Text API Reference](/api-reference/stt/stt). For a detailed mapping of codes to languages, see the [STT supported languages](/api-reference/stt/stt#request.query.language) list.
## Selecting a Model
When making API calls, you can specify either:
```python lines theme={null}
# Use the base model (automatically routes to the latest snapshot)
model = "ink-whisper"

# Or specify a particular snapshot for consistency
model = "ink-whisper-2025-06-04"
```
### Continuous updates
All models have a base model name (e.g. `ink-whisper`).
We recommend using these for prototyping and development, then switching to a date-versioned model for production use cases to ensure stability.
## Future Updates
New snapshots are released periodically with improvements in performance, additional language support, and new capabilities. Check back regularly for updates.
# API Changes
Source: https://docs.cartesia.ai/build-with-cartesia/tts-models/api-changes
Starting June 1, 2026, we are discontinuing several models, snapshots, and languages, and removing voice embeddings from our voice API. Migrate to `sonic-3` for improved naturalness, 42-language support, and fine-grained controls.
## Deprecated models and languages
You can check if you're making requests to deprecated models on [play.cartesia.ai/deprecation/traffic](https://play.cartesia.ai/deprecation/traffic).
### Fully deprecated models
These models will stop serving requests on June 1, 2026.
| Model | Snapshots affected | Deprecated languages |
| -------------------- | ------------------------ | -------------------- |
| `sonic` | All | All |
| `sonic-english` | — | All |
| `sonic-multilingual` | — | All |
| `sonic-2` | `sonic-2-2025-03-07` | All |
| `sonic-turbo` | `sonic-turbo-2025-03-07` | All |
### Partially deprecated models
These models will continue to serve a reduced set of languages. The languages listed below will be discontinued on June 1, 2026.
| Model | Snapshots affected | Deprecated languages |
| ------------- | ---------------------------------------------------------------- | -------------------------- |
| `sonic-2` | `sonic-2-2025-04-16`, `sonic-2-2025-05-08`, `sonic-2-2025-06-11` | it, nl, pl, ru, sv, tr, hi |
| `sonic-turbo` | `sonic-turbo-2025-06-04` | it, nl, pl, ru, sv, tr |
## Stable offerings
The following will remain available beyond June 1.
| Model | Snapshots | Supported Languages |
| ------------- | ---------------------------------------------------------------- | ----------------------------------------------------------------------------------- |
| `sonic-3` | All | 42 languages — [full list](/build-with-cartesia/tts-models/latest#language-support) |
| `sonic-2` | `sonic-2-2025-04-16`, `sonic-2-2025-05-08`, `sonic-2-2025-06-11` | en, de, es, fr, ja, ko, pt, zh |
| `sonic-turbo` | `sonic-turbo-2025-06-04` | en, de, es, fr, ja, ko, pt, zh, hi |
## API changes
These endpoints will be discontinued on June 1, 2026.
| Discontinued Endpoint | Replacement |
| ------------------------------------------ | ------------------------------------------ |
| Voice Embedding: `POST /voices/clone/clip` | [Clone Voice](/api-reference/voices/clone) |
| Mix Voices: `POST /voices/mix` | — |
| Create Voice: `POST /voices` | [Clone Voice](/api-reference/voices/clone) |
These endpoints will stop accepting voice embeddings on June 1, 2026.
| Endpoint with a breaking change | Replacement |
| ------------------------------------- | ------------------------------------------------------ |
| TTS (bytes): `POST /tts/bytes` | [Voice IDs](/build-with-cartesia/tts-models/voice-ids) |
| TTS (SSE): `POST /tts/sse` | [Voice IDs](/build-with-cartesia/tts-models/voice-ids) |
| TTS (WebSocket): `WSS /tts/websocket` | [Voice IDs](/build-with-cartesia/tts-models/voice-ids) |
You can test these API changes by setting your [Cartesia Version](/use-the-api/api-conventions#always-send-a-cartesia-version-header) to `2026-03-01`. We recommend upgrading your Cartesia Version on production traffic before June 1 to make sure nothing breaks.
### Moving off of deprecated endpoints
1. Change how you create voices — see [Migrating Voices](/build-with-cartesia/tts-models/migrating-voices).
2. Switch from voice embeddings to IDs — see [Voice IDs](/build-with-cartesia/tts-models/voice-ids).
## Full Checklist
1. Move off of [deprecated models / snapshots / languages](/build-with-cartesia/tts-models/api-changes#deprecated-models-and-languages) onto `sonic-3` or another stable model
2. Move off of [deprecated endpoints](/build-with-cartesia/tts-models/api-changes#api-changes) when creating voices
3. Use [Voice IDs](/build-with-cartesia/tts-models/voice-ids)
4. Check your deprecated model traffic on [play.cartesia.ai/deprecation/traffic](https://play.cartesia.ai/deprecation/traffic)
5. Make sure your voices are migrated on [play.cartesia.ai/deprecation/voices](https://play.cartesia.ai/deprecation/voices)
6. (Optional) Upgrade your [Cartesia Version](/use-the-api/api-conventions#always-send-a-cartesia-version-header) to `2026-03-01`
## Why are we doing this?
Since the launch of Sonic 3, we've made improvements across pacing, prosody, and naturalness; the vast majority of our customers have migrated to these models with great success. In order to increase our capacity, availability, and serving performance, we have to discontinue our oldest models.
Additionally, our newer models have evolved beyond voice embeddings in order to sound more natural. The parts of our API that accept voice embeddings cannot be made forward-compatible. Migrating to voice IDs will allow us to continue to improve both our models and your voices in tandem.
If you have questions, reach out to [support@cartesia.ai](mailto:support@cartesia.ai).
# Migrating Voices
Source: https://docs.cartesia.ai/build-with-cartesia/tts-models/migrating-voices
On June 1, 2026, we are discontinuing our voice embedding (aka stability) TTS models.
Voices listed on [play.cartesia.ai/deprecation/voices](https://play.cartesia.ai/deprecation/voices) will stop working. Simply click "Auto Migrate" to make these voices compatible with the latest Sonic 3, 2, and Turbo snapshots.
If you use voice embeddings rather than voice IDs, see [Voice IDs](/build-with-cartesia/tts-models/voice-ids).
For an overview of all changes, see [API Changes](/build-with-cartesia/tts-models/api-changes).
## Where do these voices come from?
Voices created by these endpoints rely on our voice embedding models:
* [POST /voices](/2024-06-10/api-reference/voices/create)
* [POST /voices/mix](/2024-06-10/api-reference/voices/mix)
* `POST /voices/clone/clip`
## Creating voices
You can move to our [Clone Voice API](/api-reference/voices/clone) or use our [web UI](https://play.cartesia.ai/voices/create/clone) to create voices from 3–10 seconds of source audio.
You can test these API changes by setting your [Cartesia Version](/use-the-api/api-conventions#always-send-a-cartesia-version-header) to `2026-03-01`. We recommend upgrading your Cartesia Version on production traffic before June 1 to make sure nothing breaks.
Here is an example using the Cartesia SDK:
```python theme={null}
from cartesia import Cartesia

your_api_key: str = ""
client = Cartesia(api_key=your_api_key)

print("Cloning a voice")
with open("3 to 10 seconds of source audio.wav", mode="rb") as f:
    voice = client.voices.clone(
        clip=f,
        # this must match the source audio
        language="en",
        name="My Voice",
        mode="similarity",
    )
print(f"Cloned voice {voice.id}")

print("Generating audio...")
generated_audio = client.tts.bytes(
    # voice embeddings will not work after June 1, 2026!
    voice={"mode": "id", "id": voice.id},
    model_id="sonic-3",
    transcript="Hello from Cartesia!",
    language="en",
    output_format={
        "container": "wav",
        "encoding": "pcm_f32le",
        "sample_rate": 44100
    },
)

# client.tts.bytes returns an iterator of audio chunks; write them to a file
with open("hello-from-cartesia.wav", "wb") as f:
    for chunk in generated_audio:
        f.write(chunk)
```
# Older TTS Models
Source: https://docs.cartesia.ai/build-with-cartesia/tts-models/older-models
We recommend using [Sonic 3](/build-with-cartesia/tts-models/latest) for best
results, most languages, and controllability. We continue to serve these older
models for compatibility.
Some models and snapshots are being discontinued on June 1, 2026 — see [API Changes](/build-with-cartesia/tts-models/api-changes) for details.
In the tables below, a **Stable** status marks the latest **stable** snapshots of a model, and **EOL June 1, 2026** marks snapshots and languages to be discontinued on June 1, 2026.
All models have a base model name (e.g. `sonic-2`, `sonic-turbo`) and date-versioned model names
(e.g. `sonic-2-2025-06-11`).
We recommend using base model names for prototyping and development, then switching to a date-versioned model for production use cases to ensure stability.
When making API calls, you can specify either:
```python lines theme={null}
# Use the base model
# (automatically routes to the latest stable snapshot)
model_id = "sonic-3"
# Or specify a particular snapshot for consistency
model_id = "sonic-3-2026-01-12"
```
## `sonic-2`
Sonic-2 provides ultra-realistic speech with accurate transcript following, minimal hallucinations, and excellent voice cloning. It is latency-optimized and achieves 90ms model latency.
Additional Capabilities:
* Higher fidelity voice cloning
* Timestamps for all 15 languages
* [Infill](/2024-11-13/api-reference/infill/bytes) support
| Snapshot | Release Date | Languages | Status |
| ------------------------------------------- | -------------- | ---------------------------------------------------------- | ---------------- |
| `sonic-2-2025-06-11` | June 11, 2025 | en, fr, de, es, pt, zh, ja, ko | Stable |
| `sonic-2-2025-06-11` | June 11, 2025 | hi, it, nl, pl, ru, sv, tr | EOL June 1, 2026 |
| `sonic-2-2025-05-08` | May 8, 2025 | en, fr, de, es, pt, zh, ja, ko | Stable |
| `sonic-2-2025-05-08` | May 8, 2025 | hi, it, nl, pl, ru, sv, tr | EOL June 1, 2026 |
| `sonic-2-2025-04-16` | April 16, 2025 | en, fr, de, es, pt, zh, ja, ko | Stable |
| `sonic-2-2025-04-16` | April 16, 2025 | hi, it, nl, pl, ru, sv, tr | EOL June 1, 2026 |
| `sonic-2-2025-03-07` | March 7, 2025 | en, fr, de, es, pt, zh, ja, hi, it, ko, nl, pl, ru, sv, tr | EOL June 1, 2026 |
Read these pages to learn more about how to use Sonic-2:
* [Best practices](/build-with-cartesia/formatting-text-for-sonic-2/best-practices)
* [Inserting breaks](/build-with-cartesia/formatting-text-for-sonic-2/inserting-breaks-pauses)
* [Spelling text](/build-with-cartesia/formatting-text-for-sonic-2/spelling-out-input-text)
## `sonic-turbo`
All the power of Sonic, with half the latency (as low as 40ms).
| Snapshot | Release Date | Languages | Status |
| ----------------------------------------------- | ------------- | ---------------------------------------------------------- | ---------------- |
| `sonic-turbo-2025-06-04` | June 4, 2025 | en, fr, de, es, pt, zh, ja, hi, ko | Stable |
| `sonic-turbo-2025-06-04` | June 4, 2025 | it, nl, pl, ru, sv, tr | EOL June 1, 2026 |
| `sonic-turbo-2025-03-07` | March 7, 2025 | en, fr, de, es, pt, zh, ja, hi, it, ko, nl, pl, ru, sv, tr | EOL June 1, 2026 |
## `sonic`
The first version of our flagship text-to-speech model. It produces high-accuracy, expressive speech, and is optimized for efficiency to achieve low latency.
| Snapshot | Release Date | Languages | Status |
| ----------------------------------------- | ----------------- | ---------------------------------------------------------- | ---------------- |
| `sonic-2024-12-12` | December 12, 2024 | en, fr, de, es, pt, zh, ja, hi, it, ko, nl, pl, ru, sv, tr | EOL June 1, 2026 |
| `sonic-2024-10-19` | October 19, 2024 | en, fr, de, es, pt, zh, ja, hi, it, ko, nl, pl, ru, sv, tr | EOL June 1, 2026 |
## Deprecated and Preview Model Aliases
The following model aliases are now deprecated. Please use the recommended model names instead:
| Deprecated Alias | Use Instead |
| ------------------------------------------- | ----------------------------------------- |
| `sonic-3-preview` | `sonic-3` |
| `sonic-preview` | `sonic-2` |
| `sonic-english` | `sonic-2024-10-19` |
| `sonic-multilingual` | `sonic-2024-10-19` |
# Sonic 3
Source: https://docs.cartesia.ai/build-with-cartesia/tts-models/sonic-3
`sonic-3` is our streaming TTS model, with high naturalness, accurate transcript following, and industry-leading latency. It provides fine-grained control on volume, speed, and emotion.
Key Features:
* **42 languages** supported
* **Volume, speed, and emotion** controls, supported through API parameters and SSML tags
* **Laughter** through `[laughter]` tags
For more information, see [Volume, Speed, and Emotion](/build-with-cartesia/sonic-3/volume-speed-emotion).
### Voice selection
Choosing voices that work best for your use case is key to getting the best performance out of Sonic 3.
* **For voice agents**: We've found stable, realistic voices work better for voice agents than studio, emotive voices. Example American English voices include Katie (ID: `f786b574-daa5-4673-aa0c-cbe3e8534c02`) and Kiefer (ID: `228fca29-3a0a-435c-8728-5cb483251068`).
* **For expressive characters**: We've tagged our most expressive and emotive voices with the `Emotive` tag. Example American English voices include Tessa (ID: `6ccbfb76-1fc6-48f7-b71d-91ac6298247b`) and Kyle (ID: `c961b81c-a935-4c17-bfb3-ba2239de8c2f`).
For more information and recommendations, see [Choosing a Voice](/build-with-cartesia/capability-guides/choosing-a-voice).
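As a quick, illustrative sketch of how these recommendations fit together, the request below uses Katie's voice ID from the list above and includes a `[laughter]` tag in the transcript (the transcript and output file name are placeholders):

```python theme={null}
import os

from cartesia import Cartesia

client = Cartesia(api_key=os.getenv("CARTESIA_API_KEY"))

# Katie: one of the stable voices suggested above for voice agents.
audio_chunks = client.tts.bytes(
    model_id="sonic-3",
    voice={"mode": "id", "id": "f786b574-daa5-4673-aa0c-cbe3e8534c02"},
    transcript="That's a great question. [laughter] Let me check for you.",
    language="en",
    output_format={
        "container": "wav",
        "encoding": "pcm_f32le",
        "sample_rate": 44100,
    },
)

with open("sonic-3-sample.wav", "wb") as f:
    for chunk in audio_chunks:
        f.write(chunk)
```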
### Language support
Sonic-3 supports the following languages:
English (`en`), French (`fr`), German (`de`), Spanish (`es`), Portuguese (`pt`), Chinese (`zh`), Japanese (`ja`), Hindi (`hi`), Italian (`it`), Korean (`ko`), Dutch (`nl`), Polish (`pl`), Russian (`ru`), Swedish (`sv`), Turkish (`tr`), Tagalog (`tl`), Bulgarian (`bg`), Romanian (`ro`), Arabic (`ar`), Czech (`cs`), Greek (`el`), Finnish (`fi`), Croatian (`hr`), Malay (`ms`), Slovak (`sk`), Danish (`da`), Tamil (`ta`), Ukrainian (`uk`), Hungarian (`hu`), Norwegian (`no`), Vietnamese (`vi`), Bengali (`bn`), Thai (`th`), Hebrew (`he`), Georgian (`ka`), Indonesian (`id`), Telugu (`te`), Gujarati (`gu`), Kannada (`kn`), Malayalam (`ml`), Marathi (`mr`), and Punjabi (`pa`).
## Selecting a Model
| Snapshot | Release Date | Languages | Status |
| ------------------------------------------- | ---------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------ |
| `sonic-3-2026-01-12` | January 12, 2026 | en, de, es, fr, ja, pt, zh, hi, ko, it, nl, pl, ru, sv, tr, tl, bg, ro, ar, cs, el, fi, hr, ms, sk, da, ta, uk, hu, no, vi, bn, th, he, ka, id, te, gu, kn, ml, mr, pa | Stable |
| `sonic-3-2025-10-27` | October 27, 2025 | en, de, es, fr, ja, pt, zh, hi, ko, it, nl, pl, ru, sv, tr, tl, bg, ro, ar, cs, el, fi, hr, ms, sk, da, ta, uk, hu, no, vi, bn, th, he, ka, id, te, gu, kn, ml, mr, pa | Stable |
A **Stable** status indicates the latest **stable** snapshots of the model.
When making API calls, you can specify either:
```python lines theme={null}
# Use the base model
# (automatically routes to the latest stable snapshot)
model_id = "sonic-3"
# Or specify a particular snapshot for consistency
model_id = "sonic-3-2026-01-12"
# Try the latest (beta) model (can be 'hot swapped')
model_id = "sonic-3-latest"
```
### Continuous updates and model snapshots
All models have a base model name (e.g. `sonic-3`) and a dated snapshot (e.g. `sonic-3-2025-10-27`). Using the base model will automatically keep you up to date with the most recent stable snapshot of that model. If pinning a specific version is important for your use case, we recommend using the dated version.
For testing our latest capabilities, we recommend using `sonic-3-latest`, which is a non-snapshotted version. `sonic-3-latest` can be updated with no notice and is not recommended for production.
To summarize:
| **Model ID** | Model update behavior | Recommended for |
| -------------------- | :---------------------------------------------------------- | ------------------------------------------------------------------------------------------ |
| `sonic-3-YYYY-MM-DD` | Snapshotted, will never change | Customers who want to run internal evals before any updates |
| `sonic-3` | Will be updated to point to the most recent stable snapshot | Customers who want stable releases, but want to be up-to-date with the recent capabilities |
| `sonic-3-latest` | Will always be updated to our latest beta releases | Testing purposes |
## Older Models
For information on `sonic-2`, `sonic-turbo`, `sonic-multilingual`, and `sonic`, see our page on [Older Models](/build-with-cartesia/tts-models/older-models).
# Voice IDs
Source: https://docs.cartesia.ai/build-with-cartesia/tts-models/voice-ids
On June 1, 2026, we are discontinuing our voice embedding (aka stability) TTS models.
If you are currently making generation requests with voice embeddings like this:
```json theme={null}
{
"voice": {
"mode": "embedding",
"embedding": [1, 2, ..., 3, 4]
},
"model_id": "sonic-2",
// ...
}
```
You will need to switch to using voice IDs like this:
```json theme={null}
{
"voice": {
"mode": "id",
"id": "e07c00bc-4134-4eae-9ea4-1a55fb45746b"
},
"model_id": "sonic-2",
// ...
}
```
If you already use voice IDs, see [Migrating Voices](/build-with-cartesia/tts-models/migrating-voices) to make sure your voices will continue to work after the change.
For an overview of all changes, see [API Changes](/build-with-cartesia/tts-models/api-changes).
## Get a voice ID
Choose one of the following options.
### Check out the voice library
Our featured voices have all gone through rigorous evaluations and are ready to use in production.
Check them out at [play.cartesia.ai/voices](https://play.cartesia.ai/voices) and copy the ID of any voice you'd like to use.
### Clone a voice
If you have source audio, create a cloned voice via the [playground](https://play.cartesia.ai/voices/create/clone) or the [API](/api-reference/voices/clone). Cloning returns a voice ID you can use immediately.
### Generate source audio from your existing embedding
If you no longer have the original audio clip used to create your embedding, generate a short sample with `sonic` or `sonic-2` and then clone a new voice.
You can do this on our playground:
1. [play.cartesia.ai/text-to-speech](https://play.cartesia.ai/text-to-speech)
2. [play.cartesia.ai/voices/create/clone](https://play.cartesia.ai/voices/create/clone)
Or with our API:
1. [Text to Speech (Bytes)](/2024-11-13/api-reference/tts/bytes)
2. [Clone Voice](/api-reference/voices/clone)
Here is an example using our SDK:
```python theme={null}
from cartesia import Cartesia
# inputs
your_api_key: str = ""
your_voice_embedding: list[float] = []
language = "en"
transcript = """
It's nice to meet you.
Hope you're having a great day!
Could we reschedule our meeting tomorrow?
Please call me back as soon as possible.
"""
source_tts_model_id = "sonic"
client = Cartesia(api_key=your_api_key)
# Step 1: generate an audio sample
print(f"Generating audio sample {source_tts_model_id=}")
source_audio_iterator = client.tts.bytes(
voice={"mode": "embedding", "embedding": your_voice_embedding},
model_id=source_tts_model_id,
transcript=transcript,
language=language,
output_format={
"container": "wav",
"encoding": "pcm_f32le",
"sample_rate": 44100
},
)
# Step 2: clone a voice
print("Cloning a voice")
voice = client.voices.clone(
name="My Voice",
language=language,
clip=b"".join(source_audio_iterator),
mode="similarity",
)
print(f"Cloned voice {voice.id}")
# you can now use the voice like this
migrate_to_model = "sonic-3"
generated_sample_file_name = f"{migrate_to_model}_{voice.id}.wav"
cloned_audio_iterator = client.tts.bytes(
voice={"mode": "id", "id": voice.id},
model_id=migrate_to_model,
transcript=transcript,
language=language,
output_format={
"container": "wav",
"encoding": "pcm_f32le",
"sample_rate": 44100
},
)
with open(generated_sample_file_name, "wb") as f:
for chunk in cloned_audio_iterator:
f.write(chunk)
print(f"Listen to your new voice: {generated_sample_file_name}")
try:
import subprocess
subprocess.run(
[
"ffplay",
"-loglevel",
"quiet",
"-autoexit",
"-nodisp",
generated_sample_file_name,
]
)
except FileNotFoundError:
pass
```
## Using Voice IDs
See [TTS (Bytes)](/api-reference/tts/bytes), [TTS (SSE)](/api-reference/tts/sse), and [TTS (WebSocket)](/api-reference/tts/websocket) for full API documentation.
You can test these API changes by setting your [Cartesia Version](/use-the-api/api-conventions#always-send-a-cartesia-version-header) to `2026-03-01`. We recommend upgrading your Cartesia Version on production traffic before June 1 to make sure nothing breaks.
# Set up an organization
Source: https://docs.cartesia.ai/enterprise/set-up-an-organization
Organization workspaces enable seamless collaboration between multiple team members. All users in an organization share the same view of resources, including voices, API keys, and datasets. The only exceptions are playground generation history and starred voices, which remain private to each individual user.
By default, your Cartesia account initializes as an organization workspace on the Free subscription plan with a limit of one member.
To invite team members, you must first upgrade your organization to the
Startup tier or higher. After upgrading, you can invite unlimited users at no
additional cost.
## Manage your organization
Organizations must be upgraded to the Startup tier or above before team members can be invited. Each workspace has its own billing and credit limits, so make sure you are on the intended organization before proceeding to upgrade your subscription.
Once you've upgraded your organization, you can use the "Manage" button in the workspace switcher to manage it. This pops up a modal where you can change your profile and invite your team.
There are two membership types in an organization:
1. Admin: has the ability to manage the organization profile, invitations, and members.
2. Member: can use all functionality included in the subscription, but cannot alter organization settings.
You can invite unlimited team members in an organization once it is on Startup tier or higher.
Once your organization is upgraded, voices, Line agents, API keys, and other resources will be available to all users in the organization.
## Create additional organizations
If you want separate workspaces on different subscriptions, you can create another organization by going to the playground at [https://play.cartesia.ai](https://play.cartesia.ai), selecting the workspace switcher, and clicking **Create organization**.
This will bring up a dialog where you can name your organization and upload a logo.
Please reach out to us at [support@cartesia.ai](mailto:support@cartesia.ai) if you run into any trouble with your organization.
# Set up SSO
Source: https://docs.cartesia.ai/enterprise/set-up-sso
We support Single-Sign On (SSO) for customers on the Enterprise plan via SAML. This integration is processed through our identity provider, [Clerk](https://clerk.com).
## Set up SSO with Okta
1. Send us your SSO domain.
2. We will send you a service provider configuration, which consists of a single-sign on URL and an audience URI (SP entity ID).
3. Follow steps 2, 3, 4, and 5 in [the Clerk SSO guide](https://clerk.com/docs/authentication/enterprise-connections/saml/okta), and send us the metadata URL you get from step 6.1.
After you are done, we will complete the remaining SSO setup and send you a confirmation that SSO is enabled for your organization.
# Authenticate your client applications
Source: https://docs.cartesia.ai/get-started/authenticate-your-client-applications
Secure client access to Cartesia APIs using Access Tokens
You may want to make Cartesia API requests directly from your client application, eg, a web app. However, shipping your API key to the app is not secure, as a malicious user could extract your API key and issue API requests billed to your account.
Access Tokens provide a secure way to authenticate client-side requests to Cartesia's APIs without
exposing your API key.
Access Tokens are used in contexts like web apps which should not be bundled with an API key. For
trusted contexts like server applications, local scripts, or iPython notebooks, you should simply
use API keys.
## Prerequisites
Before implementing Access Tokens:
1. Configure your server with a Cartesia API key
2. Implement user authentication in your application
3. Establish secure client-server communication
### Available Grants
Access Tokens support granular permissions through grants. Both TTS and STT grants are optional:
**TTS Grant**: With `grants: { tts: true }`, clients have access to:
* `/tts/bytes` - Synchronous TTS generation streamed with chunked encoding
* `/tts/sse` - Server-sent events for streaming
* `/tts/websocket` - WebSocket-based streaming
**STT Grant**: With `grants: { stt: true }`, clients have access to:
* `/stt/websocket` - WebSocket-based speech-to-text streaming
* `/stt` - Batch speech-to-text processing
* `/audio/transcriptions` - OpenAI-compatible transcription endpoint
**Agents Grant**: With `grants: { agent: true }`, clients have access to:
* the Agents websocket calling endpoint
You can request multiple grants in a single token:
```json theme={null}
{ "grants": { "tts": true, "stt": true, "agent": false } }
```
## Implementation Guide
### 1. Token Generation (Server-side)
Make a request to generate a new access token:
```bash cURL lines theme={null}
# TTS and STT access
curl --location 'https://api.cartesia.ai/access-token' \
-H 'Cartesia-Version: 2025-04-16' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk_car_...' \
-d '{ "grants": {"tts": true, "stt": true}, "expires_in": 60}'
# TTS-only access
curl --location 'https://api.cartesia.ai/access-token' \
-H 'Cartesia-Version: 2025-04-16' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk_car_...' \
-d '{ "grants": {"tts": true}, "expires_in": 60}'
```
```javascript JavaScript lines theme={null}
import { CartesiaClient } from "@cartesia/cartesia-js";
const client = new CartesiaClient({ apiKey: "YOUR_API_KEY" });
// TTS and STT access
await client.auth.accessToken({
grants: {
tts: true,
stt: true
},
expires_in: 60
});
// TTS-only access
await client.auth.accessToken({
grants: {
tts: true
},
expires_in: 60
});
```
```python Python lines theme={null}
from cartesia import Cartesia
client = Cartesia(
token="YOUR_API_KEY"
)
# TTS and STT access
response = client.auth.access_token(
grants={"tts": True, "stt": True}, # Grant both permissions
expires_in=60 # Token expires in 60 seconds
)
# TTS-only access
response = client.auth.access_token(
grants={"tts": True}, # Grant TTS permissions only
expires_in=60 # Token expires in 60 seconds
)
# The response will contain the access token
print(f"Access Token: {response.token}")
```
#### Example Implementation
```typescript lines theme={null}
// TTS and STT access
const response = await fetch("https://api.cartesia.ai/access-token", {
method: "POST",
headers: {
"Content-Type": "application/json",
"Cartesia-Version": "2025-04-16",
Authorization: "Bearer ",
},
body: JSON.stringify({
grants: { tts: true, stt: true },
expires_in: 60, // 1 minute
}),
});
// TTS-only access
const responseTTS = await fetch("https://api.cartesia.ai/access-token", {
method: "POST",
headers: {
"Content-Type": "application/json",
"Cartesia-Version": "2025-04-16",
Authorization: "Bearer ",
},
body: JSON.stringify({
grants: { tts: true },
expires_in: 60, // 1 minute
}),
});
const { token } = await response.json();
```
For detailed API specifications, see the [Token API Reference](/api-reference/auth/access-token).
### 2. Token Storage (Client-side)
Store the token securely, such as by setting an HTTP-only cookie with a matching token expiration. The cookie should be `httpOnly`, `secure`, and `sameSite: "strict"`.
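One possible pattern, sketched here with Flask (an arbitrary framework choice; the route and cookie names are illustrative): your backend mints a short-lived token and hands it to the browser as an HTTP-only cookie whose lifetime matches the token's expiration, so client-side JavaScript never touches the raw token.

```python theme={null}
import os

from cartesia import Cartesia
from flask import Flask, jsonify, make_response

app = Flask(__name__)
client = Cartesia(api_key=os.environ["CARTESIA_API_KEY"])

TOKEN_TTL_SECONDS = 60  # keep token lifetimes short


@app.post("/cartesia-token")
def issue_token():
    # Generate the access token server-side; the API key never leaves the server.
    token = client.auth.access_token(grants={"tts": True}, expires_in=TOKEN_TTL_SECONDS)

    resp = make_response(jsonify({"expires_in": TOKEN_TTL_SECONDS}))
    # httpOnly + secure + SameSite=Strict, expiring with the token itself.
    resp.set_cookie(
        "cartesia_access_token",
        token.token,
        max_age=TOKEN_TTL_SECONDS,
        httponly=True,
        secure=True,
        samesite="Strict",
    )
    return resp
```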
### 3. Making Authenticated Requests
```typescript lines theme={null}
// Using TTS with access token
const ttsResponse = await fetch("https://api.cartesia.ai/tts/bytes", {
headers: {
Authorization: `Bearer ${accessToken}`,
"Content-Type": "application/json",
},
// ... request configuration
});
// Using STT with access token
const sttResponse = await fetch("https://api.cartesia.ai/stt", {
method: "POST",
headers: {
Authorization: `Bearer ${accessToken}`,
},
body: formData, // multipart/form-data with audio file
});
```
### 4. Token Refresh Strategy
Proactively refresh tokens in your app before they expire.
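As one illustration of that pattern (a sketch for a non-browser Python client; the backend URL and response shape are assumptions, and in a browser app the same logic would live in your frontend code), cache the token and request a fresh one from your backend shortly before it expires:

```python theme={null}
import time

import requests

# Hypothetical backend route that mints Cartesia Access Tokens and returns
# {"token": "...", "expires_in": 60}; adjust to match your own server.
TOKEN_ENDPOINT = "https://your-backend.example.com/cartesia-token"
REFRESH_MARGIN_SECONDS = 10  # refresh well before expiry

_cached_token: str | None = None
_expires_at: float = 0.0


def get_access_token() -> str:
    """Return a valid access token, refreshing it shortly before it expires."""
    global _cached_token, _expires_at
    now = time.monotonic()
    if _cached_token is None or now >= _expires_at - REFRESH_MARGIN_SECONDS:
        resp = requests.post(TOKEN_ENDPOINT, timeout=5)
        resp.raise_for_status()
        payload = resp.json()
        _cached_token = payload["token"]
        _expires_at = now + payload["expires_in"]
    return _cached_token
```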
## Security Best Practices
### Essential Guidelines
* ✅ Generate tokens server-side only
* ✅ Use short token lifetimes (minutes)
* ✅ Implement automatic token refresh
* ✅ Store tokens in HTTP-only cookies
* ✅ Enable secure and SameSite cookie flags
### Security Don'ts
* ❌ Never store tokens in localStorage/sessionStorage
* ❌ Never log tokens or display them in the UI
* ❌ Never transmit tokens over non-HTTPS connections
### Token Lifecycle Management
1. Generate new token upon user authentication
2. Implement automatic refresh before expiration
3. Handle token expiration gracefully
## Additional Resources
* [API Reference](/api-reference/auth/access-token) - Access Token generation endpoint documentation
# Welcome to Cartesia
Source: https://docs.cartesia.ai/get-started/overview
Our API enables developers to build real-time, multimodal AI experiences that feel natural and responsive.
The Cartesia API is the fastest, most emotive, ultra-realistic voice AI platform. Purpose-built for developers, it serves state-of-the-art models for both text-to-speech and speech-to-text, enabling seamless conversational AI experiences.
## Sonic Models for Text-to-Speech
Sonic models take text input and stream back ultra-realistic speech in response. They can also clone voices, with full control over pronunciation and accent.
**Sonic 3 is the world's fastest, most emotive, ultra-realistic text-to-speech model.** It can stream out the first byte of audio in just 90ms, making it perfect for real-time and conversational experiences as well as dubbing, narration, AI avatars, and more. (To put things into perspective, 90ms is about twice as fast as the blink of an eye.)
**If real-time performance is your top priority,** Sonic Turbo offers even better performance, streaming out the first byte of audio in just 40ms.
Learn more about available Sonic model variants and their capabilities in the [TTS Models](../build-with-cartesia/tts-models/latest) section.
## Ink Models for Speech-to-Text
Ink models provide streaming speech-to-text transcription optimized for real-time voice applications.
**Ink-Whisper**, our debut model, is specifically engineered for conversational AI—handling telephony artifacts, background noise, accents, and proper nouns that typically challenge standard STT systems.
Ink-Whisper uses advanced dynamic chunking to process variable-length audio segments, reducing errors and hallucinations during pauses or audio gaps. At just \$0.13/hour, it's the most affordable streaming STT model available.
Learn more about the Ink model and its capabilities in the [STT Models](../build-with-cartesia/stt-models) section.
## Support
Join our Discord server to chat with the Cartesia team, engage with the community, and get help with your projects.
Email us at [support@cartesia.ai](mailto:support@cartesia.ai) to get help with integrating Cartesia, your account, or billing.
# Realtime Text to Speech Quickstart
Source: https://docs.cartesia.ai/get-started/realtime-text-to-speech-quickstart
Stream text to Cartesia over a WebSocket and receive audio in real time.
Using the Cartesia WebSocket API allows you to simultaneously stream text input and audio output: send text in chunks and receive audio chunks back in real time. This is ideal for realtime use cases such as voice agents, where text is generated incrementally, for example from an LLM.
## Prerequisites
* A Cartesia API key. [Create one here](https://play.cartesia.ai/keys), then add it to your `.bashrc` or `.zshrc`:
```sh theme={null}
export CARTESIA_API_KEY=
```
* `ffplay` (part of FFmpeg), used to play audio output:
```sh theme={null}
brew install ffmpeg
```
```sh theme={null}
sudo apt install ffmpeg
```
## Stream text and play audio
```sh theme={null}
pip install 'cartesia[websockets]'
```
```python realtime-tts.py theme={null}
from cartesia import Cartesia
import subprocess
import os
client = Cartesia(api_key=os.getenv("CARTESIA_API_KEY"))
print("Starting ffplay to play streaming audio output...")
player = subprocess.Popen(
["ffplay", "-f", "f32le", "-ar", "44100", "-probesize", "32", "-analyzeduration", "0", "-nodisp", "-autoexit", "-loglevel", "quiet", "-"],
stdin=subprocess.PIPE,
bufsize=0,
)
print("Connecting to Cartesia via websockets...")
with client.tts.websocket_connect() as connection:
ctx = connection.context(
model_id="sonic-3",
voice={"mode": "id", "id": "f786b574-daa5-4673-aa0c-cbe3e8534c02"},
output_format={
"container": "raw",
"encoding": "pcm_f32le",
"sample_rate": 44100,
},
)
print("Sending chunked text input...")
for part in ["Hi there! ", "Welcome to ", "Cartesia Sonic."]:
ctx.push(part)
ctx.no_more_inputs()
for response in ctx.receive():
if response.type == "chunk" and response.audio:
print(f"Received audio chunk ({len(response.audio)} bytes)")
# Here we pipe audio to ffplay. In a production app you might play audio in
# a client, or forward it to another service, eg, a telephony provider.
player.stdin.write(response.audio)
elif response.type == "done":
break
player.stdin.close()
player.wait()
```
```sh theme={null}
python3 realtime-tts.py
```
This will stream text inputs to Cartesia, and play the streaming audio output using `ffplay`. (Make sure your device volume is turned on!)
```sh theme={null}
npm install @cartesia/cartesia-js ws
```
In the browser, you don't need the `ws` package — the native WebSocket API is used instead. However, you will need to use ephemeral access tokens for authentication. See [Authenticate Your Client Applications](/get-started/authenticate-your-client-applications).
Create a file named `realtime-tts.js` with the following code:
```js realtime-tts.js theme={null}
import Cartesia from "@cartesia/cartesia-js";
import { spawn } from "child_process";
const client = new Cartesia({ apiKey: process.env["CARTESIA_API_KEY"] });
console.log("Starting ffplay to play streaming audio output...");
const player = spawn("ffplay", ["-f", "f32le", "-ar", "44100", "-probesize", "32", "-analyzeduration", "0", "-nodisp", "-autoexit", "-loglevel", "quiet", "-"], {
stdio: ["pipe", "ignore", "ignore"],
});
console.log("Connecting to Cartesia via websockets...");
const ws = await client.tts.websocket();
const ctx = ws.context({
model_id: "sonic-3",
voice: { mode: "id", id: "f786b574-daa5-4673-aa0c-cbe3e8534c02" },
output_format: { container: "raw", encoding: "pcm_f32le", sample_rate: 44100 },
});
console.log("Sending chunked text input...");
const transcriptChunks = ["Hi there! ", "Welcome to ", "Cartesia Sonic."]
for (const part of transcriptChunks) {
await ctx.push({ transcript: part });
}
await ctx.no_more_inputs();
for await (const event of ctx.receive()) {
if (event.type === "chunk" && event.audio) {
console.log("Received audio chunk (%d bytes)", event.audio.length);
// Here we pipe audio to ffplay. In a production app you might play audio in
// a client, or forward it to another service, eg, a telephony provider.
player.stdin.write(event.audio);
} else if (event.type === "done") {
break;
}
}
player.stdin.end();
ws.close();
```
```sh theme={null}
node realtime-tts.js
```
This will stream text inputs to Cartesia, and play the streaming audio output using `ffplay`. (Make sure your device volume is turned on!)
## How it works
The WebSocket connection manages multiple *contexts*, each representing an independent, continuous stream of speech. A Cartesia context works much like an LLM context: on our servers, we store the previously generated speech so that new speech matches it in tone.
To summarize, here's what our code does after establishing a WebSocket connection:
1. **Create a context** with `context()`.
2. **Push text** incrementally with `push()`. Each chunk continues seamlessly from the previous one using [continuations](/build-with-cartesia/capability-guides/stream-inputs-using-continuations).
3. **Signal completion** with `no_more_inputs()` to tell the model no more text is coming.
4. **Receive audio** chunks as they are generated.
This maps directly to LLM token streaming — push each token or sentence fragment as it arrives, and audio begins streaming back even if the full text is not yet available.
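Here's a compact sketch of that pattern using the same Python SDK calls as above; `stream_llm_tokens()` is a stand-in for your LLM client's streaming output:

```python theme={null}
import os

from cartesia import Cartesia


def stream_llm_tokens():
    # Stand-in for your LLM client's streaming output.
    yield from ["Sure, ", "here is the ", "summary you ", "asked for."]


client = Cartesia(api_key=os.getenv("CARTESIA_API_KEY"))

with client.tts.websocket_connect() as connection:
    ctx = connection.context(
        model_id="sonic-3",
        voice={"mode": "id", "id": "f786b574-daa5-4673-aa0c-cbe3e8534c02"},
        output_format={
            "container": "raw",
            "encoding": "pcm_f32le",
            "sample_rate": 44100,
        },
    )

    # Push each fragment as soon as the LLM produces it; audio starts
    # streaming back before the full text exists.
    for fragment in stream_llm_tokens():
        ctx.push(fragment)
    ctx.no_more_inputs()

    audio = bytearray()
    for response in ctx.receive():
        if response.type == "chunk" and response.audio:
            audio.extend(response.audio)
        elif response.type == "done":
            break
```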
## What's next
* Deep dive into context management and buffering.
* Browse voices and learn how to pick the right one for your use case.
* Pick the right output format, sample rate, and encoding for your use case.
# LiveKit
Source: https://docs.cartesia.ai/integrations/live-kit
**LiveKit** is a WebRTC-first platform for realtime **video, voice, and data** in your product. **LiveKit Agents** is its framework for conversational agents.
**Cartesia** integrates in two ways: through **LiveKit Inference**, which hosts **cartesia/sonic-3** and related model IDs in the agent runtime (keys and pricing are handled through **LiveKit**; see [LiveKit's Cartesia TTS guide](https://docs.livekit.io/agents/models/tts/inference/cartesia)), and through the open-source **[livekit-plugins-cartesia](https://github.com/livekit/agents/tree/main/livekit-plugins/livekit-plugins-cartesia)** Python package, which provides **TTS and STT** using your own **Cartesia** credentials from the worker.
# Demo
Here's a demo of a voice assistant built with LiveKit Agents and Cartesia:
Try out the LiveKit Cartesia demo.
The source code for this demo is available [here](https://github.com/livekit-examples/cartesia-voice-agent).
# Overview
Source: https://docs.cartesia.ai/integrations/overview
Partner integrations for Cartesia TTS and STT in your own app—not Cartesia-hosted agents.
Cartesia provides first-party speech APIs and SDKs, and integrates with many other products and developer frameworks. The pages in this section describe each path at a high level; detailed setup usually lives in partner documentation and repositories.
## Prerequisites
You’ll need these for almost every integration below. Individual pages also list extras (partner accounts, runtimes, SDK installs).
* **[Cartesia API key](https://play.cartesia.ai/keys)** — create and manage keys in the Playground.
* **A voice** — pick one in the Playground or API; see [Choosing a voice](/build-with-cartesia/capability-guides/choosing-a-voice) and [Voice IDs](/build-with-cartesia/tts-models/voice-ids).
## Integrations
* **LiveKit**: realtime rooms and agents; Cartesia via LiveKit Inference or the Cartesia plugin.
* **Pipecat**: Python voice and multimodal agents with official Cartesia TTS/STT services.
* **Twilio**: Programmable Voice and Media Streams with Cartesia TTS (Node walkthrough).
* **Tencent RTC**: TRTC realtime media with Cartesia for conversational AI workloads.
* **Thoughtly**: no-code phone agents; Cartesia is the default voice stack for new agents.
* **Rasa**: Rasa Pro voice assistants with Cartesia as the TTS backend.
* **Vision Agents by Stream**: Stream's Vision Agents framework with a Cartesia TTS plugin.
* **MCP**: `cartesia-mcp` for Cursor, Claude Desktop, and other MCP clients.
# Pipecat
Source: https://docs.cartesia.ai/integrations/pipecat
## Overview
[**Pipecat**](https://www.pipecat.ai/) is an open-source Python framework for realtime **voice** agents.
Building voice agents requires the creation and orchestration of pipelines, media and communication transports (such as Daily or LiveKit), and pluggable AI models.
**Cartesia** is available as a first-party provider plugin for **[TTS and STT services](https://github.com/pipecat-ai/pipecat/tree/main/src/pipecat/services/cartesia)** in the Pipecat repo.
## Prerequisites
Pipecat’s examples require a recent Python installation (see the Pipecat repo's [root-level README](https://github.com/pipecat-ai/pipecat/tree/main#prerequisites) for current prerequisites).
Install the **`pipecat-ai`** Python package with the **`cartesia`** extra for TTS/STT (bracket syntax):
```
pip install "pipecat-ai[cartesia,...]"
# or
uv add "pipecat-ai[cartesia,...]"
```
You'll also need to install the **transport** extras your sample needs; match whatever the upstream README lists for that example.
## Getting Started - TTS (Websockets)
Pipecat's getting-started examples provide a small, copy-friendly path to wire Cartesia TTS into a Pipecat pipeline via the [TTS WebSocket API](https://docs.cartesia.ai/api-reference/tts/websocket):
Getting-started examples in the Pipecat repo.
## Getting Started - TTS and STT (Websockets & HTTP)
For smaller voice-focused samples using Cartesia STT and TTS, you can choose between two transports, WebSockets or HTTP:
Voice bot using Cartesia STT & TTS over WebSocket.
Same flow using Cartesia STT & TTS over HTTP.
## Orchestrated Conversational AI
For a fuller example app that shows an end-to-end voice agent experience (VAD -> STT -> LLM -> TTS) orchestrated with Pipecat, see StudyPal:
StudyPal example in the pipecat-examples repo.
# Rasa
Source: https://docs.cartesia.ai/integrations/rasa
**Rasa** is an open dialogue stack; **voice streaming with Cartesia** is documented for **Rasa Pro** (commercial) assistants. Configure a voice channel in **`credentials.yml`** with `tts: name: cartesia` and **`CARTESIA_API_KEY`** per Rasa’s speech-integrations reference. Start with their walkthrough, then use the reference for parameters (`model_id`, `voice`, multilingual `language_map`, etc.):
Full tutorial for building a voice agent with Rasa and Cartesia.
For implementation details, see their documentation:
Rasa reference for Cartesia TTS in voice channels.
# Tencent RTC
Source: https://docs.cartesia.ai/integrations/tencent-rtc
**Tencent Real-Time Communication (TRTC)** is Tencent Cloud’s stack for realtime audio and video—calls, live streaming, and conferencing.
**TRTC Conversational AI** is Tencent’s packaged stack for realtime voice agents. Tencent and Cartesia have a **public partnership** to combine TRTC networking with Cartesia **Sonic** TTS and **Ink-Whisper** STT for low-latency conversational AI (see Tencent’s [TRTC × Cartesia solution overview](https://trtc.tencentcloud.com/solutions/trtc-cartesia)). Integration steps and SDK details live in **Tencent’s** console and docs.
# Demo
Experience the TRTC × Cartesia voice assistant here:
[TRTC x Cartesia Demo](https://trtc.io/demo/homepage/#/cartesia)
# Thoughtly
Source: https://docs.cartesia.ai/integrations/thoughtly
**Thoughtly** is a no-code platform for **inbound and outbound AI phone agents** (sales, support, routing): visual flows, CRM and calendar integrations, analytics, and telephony. Following the [Thoughtly × Cartesia partnership](https://www.thoughtly.com/blog/thoughtly-upgrades-its-voice-library-through-partnership-with-cartesia/), **new agents default to Cartesia voices** (low-latency TTS, expanded library, cloning from a short sample in-product); Thoughtly notes existing agents can keep prior voices during migration.
# Demo
See a demo of Cartesia on Thoughtly.
# Integrate with Twilio
Source: https://docs.cartesia.ai/integrations/twilio
How to integrate Twilio with Cartesia to generate audio from text and send it as a voice call.
Use **Twilio Programmable Voice** with **Media Streams** so a phone call receives audio generated by **Cartesia TTS** over WebSockets. This walkthrough uses **Node.js**: a small server bridges Twilio’s stream to Cartesia and plays TTS audio on the callee’s line.
## Prerequisites
Before you begin, make sure you have the following:
1. [Node.js](https://nodejs.org/en/download) installed.
2. A [Twilio account](https://www.twilio.com/en-us/try-twilio). You will need your Account SID and Auth Token.
3. A [Cartesia API key](https://play.cartesia.ai/keys).
4. A phone number that you want to call.
5. A Twilio phone number to call from.
6. An [ngrok authtoken](https://dashboard.ngrok.com/get-started/your-authtoken) (a free account works).
## Get Started
1. Create a new directory for your project and navigate to it in your terminal.
2. Initialize a new Node.js project:
```bash lines theme={null}
npm init -y
```
3. Install the required dependencies:
```bash lines theme={null}
npm install twilio ws @ngrok/ngrok dotenv
```
Create a `.env` file in your project root and add the following:
```sh lines theme={null}
TWILIO_ACCOUNT_SID="your_twilio_account_sid"
TWILIO_AUTH_TOKEN="your_twilio_auth_token"
CARTESIA_API_KEY="your_cartesia_api_key"
NGROK_AUTHTOKEN="your_ngrok_authtoken"
```
Replace the placeholder values with your actual credentials.
Create a file named `app.js` (or any name you prefer) and add the following code:
```javascript lines theme={null}
const twilio = require('twilio');
const WebSocket = require('ws');
const http = require('http');
const ngrok = require('@ngrok/ngrok');
const dotenv = require('dotenv');
const crypto = require('crypto');
// Load environment variables
dotenv.config({ override: true });
// Function to get a value from environment variable or command line argument
function getConfig(key, defaultValue = undefined) {
return process.env[key] || process.argv.find(arg => arg.startsWith(`${key}=`))?.split('=')[1] || defaultValue;
}
// Configuration
const config = {
TWILIO_ACCOUNT_SID: getConfig('TWILIO_ACCOUNT_SID'),
TWILIO_AUTH_TOKEN: getConfig('TWILIO_AUTH_TOKEN'),
CARTESIA_API_KEY: getConfig('CARTESIA_API_KEY'),
NGROK_AUTHTOKEN: getConfig('NGROK_AUTHTOKEN'),
};
// Validate required configuration
const requiredConfig = ['TWILIO_ACCOUNT_SID', 'TWILIO_AUTH_TOKEN', 'CARTESIA_API_KEY', 'NGROK_AUTHTOKEN'];
for (const key of requiredConfig) {
if (!config[key]) {
console.error(`Missing required configuration: ${key}`);
process.exit(1);
}
}
const client = twilio(config.TWILIO_ACCOUNT_SID, config.TWILIO_AUTH_TOKEN);
```
In the script, you'll find a configuration section for Cartesia TTS. Make sure to set the following variables according to your needs:
```javascript lines theme={null}
const TTS_WEBSOCKET_URL = `wss://api.cartesia.ai/tts/websocket?cartesia_version=2025-03-01`;
const modelId = 'sonic-3';
const voice = {
'mode': 'id',
// You can check available voices using the Cartesia API or at https://play.cartesia.ai
'id': "e07c00bc-4134-4eae-9ea4-1a55fb45746b"
};
const partialResponse = 'Hi there, my name is Cartesia. I hope you\'re having a great day!';
```
Configure your Twilio outbound and inbound numbers:
```javascript lines theme={null}
const outbound = "+1234567890"; // Replace with the number you want to call
const inbound = "+1234567890"; // Replace with your Twilio number
```
The `main()` function orchestrates the entire process:
1. Connects to the Cartesia TTS WebSocket
2. Tests the TTS WebSocket
3. Sets up a Twilio WebSocket server
4. Creates an ngrok tunnel for the Twilio WebSocket
5. Initiates the call using Twilio
```javascript expandable lines theme={null}
let ttsWebSocket;
let callSid;
let messageComplete = false;
let audioChunksReceived = 0;
function log(message) {
console.log(`[${new Date().toISOString()}] ${message}`);
}
function connectToTTSWebSocket() {
return new Promise((resolve, reject) => {
log('Attempting to connect to TTS WebSocket');
ttsWebSocket = new WebSocket(TTS_WEBSOCKET_URL, {
headers: { 'X-Api-Key': config.CARTESIA_API_KEY }
});
ttsWebSocket.on('open', () => {
log('Connected to TTS WebSocket');
resolve(ttsWebSocket);
});
ttsWebSocket.on('error', (error) => {
log(`TTS WebSocket error: ${error.message}`);
reject(error);
});
ttsWebSocket.on('close', (code, reason) => {
log(`TTS WebSocket closed. Code: ${code}, Reason: ${reason}`);
reject(new Error('TTS WebSocket closed unexpectedly'));
});
});
}
function sendTTSMessage(message) {
const textMessage = {
'model_id': modelId,
'transcript': message,
'voice': voice,
'output_format': {
'container': 'raw',
'encoding': 'pcm_mulaw',
'sample_rate': 8000
},
// create a new context for each message since each is a complete transcript
'context_id': crypto.randomUUID()
};
log(`Sending message to TTS WebSocket: ${message}`);
ttsWebSocket.send(JSON.stringify(textMessage));
}
function testTTSWebSocket() {
return new Promise((resolve, reject) => {
const testMessage = 'This is a test message';
let receivedAudio = false;
sendTTSMessage(testMessage);
const timeout = setTimeout(() => {
if (!receivedAudio) {
reject(new Error('Timeout: No audio received from TTS WebSocket'));
}
}, 10000); // 10 second timeout
ttsWebSocket.on('message', (audioChunk) => {
if (!receivedAudio) {
log(audioChunk);
log('Received audio chunk from TTS for test message');
receivedAudio = true;
clearTimeout(timeout);
resolve();
}
});
});
}
async function startCall(twilioWebsocketUrl) {
try {
log(`Initiating call with WebSocket URL: ${twilioWebsocketUrl}`);
const call = await client.calls.create({
// Connect the call's audio to our WebSocket server via a Media Stream
twiml: `<Response><Connect><Stream url="${twilioWebsocketUrl}" /></Connect></Response>`,
to: outbound, // Replace with the phone number you want to call
from: inbound // Replace with your Twilio phone number
});
callSid = call.sid;
log(`Call initiated. SID: ${callSid}`);
} catch (error) {
log(`Error initiating call: ${error.message}`);
throw error;
}
}
async function hangupCall() {
try {
log(`Attempting to hang up call: ${callSid}`);
await client.calls(callSid).update({status: 'completed'});
log('Call hung up successfully');
} catch (error) {
log(`Error hanging up call: ${error.message}`);
}
}
function setupTwilioWebSocket() {
return new Promise((resolve, reject) => {
const server = http.createServer((req, res) => {
log(`Received HTTP request: ${req.method} ${req.url}`);
res.writeHead(200);
res.end('WebSocket server is running');
});
const wss = new WebSocket.Server({ server });
log('WebSocket server created');
wss.on('connection', (twilioWs, request) => {
log(`Twilio WebSocket connection attempt from ${request.socket.remoteAddress}`);
let streamSid = null;
twilioWs.on('message', (message) => {
try {
const msg = JSON.parse(message);
log(`Received message from Twilio: ${JSON.stringify(msg)}`);
if (msg.event === 'start') {
log('Media stream started');
streamSid = msg.start.streamSid;
log(`Stream SID: ${streamSid}`);
sendTTSMessage(partialResponse);
} else if (msg.event === 'media' && !messageComplete) {
log('Received media event');
} else if (msg.event === 'stop') {
log('Media stream stopped');
hangupCall();
}
} catch (error) {
log(`Error processing Twilio message: ${error.message}`);
}
});
twilioWs.on('close', (code, reason) => {
log(`Twilio WebSocket disconnected. Code: ${code}, Reason: ${reason}`);
});
twilioWs.on('error', (error) => {
log(`Twilio WebSocket error: ${error.message}`);
});
// Handle incoming audio chunks from TTS WebSocket
ttsWebSocket.on('message', (audioChunk) => {
log('Received audio chunk from TTS');
try {
if (streamSid) {
twilioWs.send(JSON.stringify({
event: 'media',
streamSid: streamSid,
media: {
payload: JSON.parse(audioChunk)['data']
}
}));
audioChunksReceived++;
log(`Audio chunks received: ${audioChunksReceived}`);
if (audioChunksReceived >= 50) {
messageComplete = true;
log('Message complete, preparing to hang up');
setTimeout(hangupCall, 2000);
}
} else {
log('Warning: Received audio chunk but streamSid is not set');
}
} catch (error) {
log(`Error sending audio chunk to Twilio: ${error.message}`);
}
});
log('Twilio WebSocket connected and handlers set up');
});
wss.on('error', (error) => {
log(`WebSocket server error: ${error.message}`);
});
server.listen(0, () => {
const port = server.address().port;
log(`Twilio WebSocket server is running on port ${port}`);
resolve(port);
});
server.on('error', (error) => {
log(`HTTP server error: ${error.message}`);
reject(error);
});
});
}
async function setupNgrokTunnel(port) {
try {
const listener = await ngrok.forward({
addr: port,
authtoken: config.NGROK_AUTHTOKEN,
});
const wssUrl = listener.url().replace('https://', 'wss://');
log(`ngrok tunnel established: ${wssUrl}`);
return wssUrl;
} catch (error) {
log(`Error setting up ngrok tunnel: ${error.message}`);
throw error;
}
}
async function main() {
try {
log('Starting application');
await connectToTTSWebSocket();
log('TTS WebSocket connected successfully');
await testTTSWebSocket();
log('TTS WebSocket test passed successfully');
const twilioWebsocketPort = await setupTwilioWebSocket();
log(`Twilio WebSocket server set up on port ${twilioWebsocketPort}`);
const twilioWebsocketUrl = await setupNgrokTunnel(twilioWebsocketPort);
await startCall(twilioWebsocketUrl);
} catch (error) {
log(`Error in main function: ${error.message}`);
}
}
// Run the script
main();
```
To run the application, use the following command:
```bash lines theme={null}
node app.js
```
## How It Works
1. The script establishes a connection to Cartesia's TTS WebSocket.
2. It sets up a local WebSocket server to communicate with Twilio.
3. An ngrok tunnel is created to expose the local WebSocket server to the internet.
4. A call is initiated using Twilio, connecting to the ngrok tunnel.
5. When the call connects, the script sends the predefined message to Cartesia's TTS.
6. Cartesia converts the text to speech and sends audio chunks back.
7. The script forwards these audio chunks to Twilio, which plays them on the call.
## Customization
* To change the spoken message, modify the `partialResponse` variable.
* Adjust the voice parameters in the `voice` object to change the TTS voice characteristics.
* Modify the `audioChunksReceived` threshold to control when the call should end.
## Troubleshooting
* If you encounter any issues, check the console logs for detailed error messages.
* Ensure all required environment variables are correctly set.
* If you see `invalid tunnel configuration`, make sure you're using the better supported `@ngrok/ngrok` package and not `ngrok`.
# Vision Agents by Stream
Source: https://docs.cartesia.ai/integrations/vision-agents-by-stream
[Stream](https://getstream.io/) maintains **[Vision Agents](https://visionagents.ai)**—an open-source Python framework for voice- and vision-driven agents with realtime media over **Stream**’s WebRTC edge. Cartesia is supported as the **TTS** provider; install steps, environment variables, and parameters are in Stream’s **[Cartesia integration](https://visionagents.ai/integrations/cartesia)**.
You need a **Stream** developer account for realtime transport and a **Cartesia API key** for speech.
The ["Simple Agent"](https://github.com/GetStream/Vision-Agents/tree/main/examples/01_simple_agent_example) example in GitHub and the [voice](https://visionagents.ai/introduction/voice-agents) / [video](https://visionagents.ai/introduction/video-agents) intros are good starting points.
# Demo
Try out the Simple Agent Cartesia demo.
# CLI documentation
Source: https://docs.cartesia.ai/line/cli
Create, deploy, and manage voice agents from the command line.
## Installation
By running the quick install commands, you are accepting Cartesia's [Terms of Service (TOS)](https://cartesia.ai/legal/terms.html). Please make sure to review the full TOS before proceeding.
Install and download from our servers:
```zsh lines theme={null}
curl -fsSL https://cartesia.sh | sh
```
Update to the latest version:
```zsh lines theme={null}
cartesia update
```
## Quick Start
Authenticate with your Cartesia API key.
To make an API key, go to [play.cartesia.ai/keys](https://play.cartesia.ai/keys) and select your organization.
```zsh lines theme={null}
cartesia auth login # paste your API key when prompted
```
Clone an example agent from the Line repository.
```zsh lines theme={null}
cartesia create my-agent
# Choose any example you like.
cd my-agent
```
Give your agent a name and link it to your organization.
```zsh lines theme={null}
cartesia init
```
Deploy your agent to make it available in the playground.
```zsh lines theme={null}
cartesia deploy
```
## Features
### Initialize a Project
Link any directory to a new or existing Cartesia agent:
```zsh lines theme={null}
cartesia init
```
Create a project from an example:
```zsh lines theme={null}
cartesia create
```
Inside a project directory, the CLI auto-detects the agent. Run `cartesia status` to see the current agent ID.
### Chat with Your Agent
Test your agent's text reasoning locally.
Terminal 1. Run your agent's text-logic FastAPI server:
```zsh lines theme={null}
PORT=8000 uv run python main.py
```
Terminal 2. Run the CLI to chat with your agent:
```zsh lines theme={null}
cartesia chat 8000
```
## Commands
### Authentication
To get an API key, go to [play.cartesia.ai/keys](https://play.cartesia.ai/keys), select your organization, and generate a new key.
```zsh lines theme={null}
cartesia auth login
```
To validate the existing API key:
```zsh lines theme={null}
cartesia auth status
```
To logout (clears cached credentials):
```zsh lines theme={null}
cartesia auth logout
```
### Voice Agents
Deploy your agent to Cartesia cloud.
```zsh lines theme={null}
cartesia deploy
```
List out all the agents in your organization:
```zsh lines theme={null}
cartesia agents ls
```
#### Managed Deployments
Versions of your agent running on Cartesia's cloud. Each deployment rebuilds the environment, instantiates your project, and runs a health check.
To see all of your deployments:
```zsh lines theme={null}
cartesia deployments ls
```
Check the status of a deployment:
```zsh lines theme={null}
cartesia status [<agent-id> or <deployment-id>]
```
#### Self-Hosted Agent Code
While Cartesia's managed deployments are the simplest way to deploy low-latency voice agents, if you'd like to manage your own deployments of your agent code, you can pass us a URL for your agent to connect to during calls.
Connect an existing agent to your self-hosted code:
```zsh lines theme={null}
cartesia connect --agent-id <agent-id> --url https://my-agent.example.com
```
Or run without `--agent-id` to interactively select an existing agent or create a new one:
```zsh lines theme={null}
cartesia connect --url https://my-agent.example.com
```
Disconnect an agent from your self-hosted code:
```zsh lines theme={null}
cartesia disconnect --agent-id <agent-id>
```
### Environment Variables
Create, list, and remove environment variables for your agent.
Set environment variables for your agent:
```zsh lines theme={null}
cartesia env set API_KEY=FOOBAR MY_CONFIG=FOOBAZ
```
Environment variables are encrypted for storage and can only be accessed by your code.
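Inside your agent code, read them like any other environment variables (a trivial sketch; the variable names match the example above):

```python theme={null}
import os

# Values set with `cartesia env set` are available to your running agent
# as ordinary environment variables.
api_key = os.environ["API_KEY"]
my_config = os.getenv("MY_CONFIG", "default-value")
```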
Port environment variables from a `.env` file:
```zsh lines theme={null}
cartesia env set --from .env
```
```text .env theme={null}
API_KEY=FOOBAR
MY_CONFIG=FOOBAZ
```
Remove an environment variable:
```zsh lines theme={null}
cartesia env rm <KEY>
```
### Help Menu
For more details on any command:
```zsh lines theme={null}
cartesia --help
```
# Release Notes
Source: https://docs.cartesia.ai/line/developer-tools/release-notes
Updates to the Line SDK and platform.
## March 2026
Platform-wide API, PVC, and client library updates for this month are in [Changelog 2026](/changelog/2026) (March 2026).
***
## February 4, 2026
### AgentUpdateCall Output Event
Added `AgentUpdateCall` event for dynamically updating call configuration during a conversation:
```python theme={null}
from line.events import AgentUpdateCall
# In an agent's process method:
yield AgentUpdateCall(voice_id="5ee9feff-1265-424a-9d7f-8e4d431a12c7")
yield AgentUpdateCall(pronunciation_dict_id="dict-123")
```
| Field | Description |
| ----------------------- | ------------------------------------ |
| `voice_id` | Updates the agent's voice |
| `pronunciation_dict_id` | Updates the pronunciation dictionary |
All fields are optional—only set fields are updated. See [Events](/line/sdk/events#dynamic-configuration) for details.
***
## February 1, 2026
### Line SDK v0.2 — Major Release
We're releasing **Line SDK v0.2**, a complete redesign of the voice agent framework focused on simplicity, streaming performance, and seamless LLM integration. This release introduces a new async iterable architecture that replaces the previous event bus system.
**Breaking Changes**: v0.2 is not backwards compatible with v0.1.x. See the [Migration Guide](#migration-guide-from-v0-1-x-to-v0-2) below for detailed upgrade instructions.
**What's changing?** Line SDK v0.2 makes it much simpler to build voice agents. Instead of manually wiring together multiple components (systems, bridges, nodes), you now write a single function that returns your agent. The SDK handles audio, interruptions, and conversation flow automatically.
**Why upgrade?**
* **Faster development** — Build agents in hours instead of days with less boilerplate code
* **Easier maintenance** — Fewer moving parts means fewer bugs and simpler debugging
* **Better reliability** — Built-in error handling, retries, and fallback models
* **More flexibility** — Switch between 100+ AI providers (OpenAI, Anthropic, Google, etc.) without code changes
* **Powerful tools** — Add capabilities like web search, call transfers, and multi-agent handoffs with one line of code
***
## What's New in v0.2
### Simplified Agent Architecture
The new architecture replaces the `VoiceAgentSystem`, `Bus`, `Bridge`, and `ReasoningNode` pattern with a single async iterable function:
```python theme={null}
import os
from line import CallRequest
from line.llm_agent import LlmAgent, LlmConfig, end_call
from line.voice_agent_app import AgentEnv, VoiceAgentApp
async def get_agent(env: AgentEnv, call_request: CallRequest):
return LlmAgent(
model="anthropic/claude-haiku-4-5-20251001",
api_key=os.getenv("ANTHROPIC_API_KEY"),
tools=[end_call],
config=LlmConfig(
system_prompt="You are a helpful assistant.",
introduction="Hello! How can I help you today?",
),
)
app = VoiceAgentApp(get_agent=get_agent)
```
**Benefits:**
* Less boilerplate code
* No manual event routing or bridge configuration
* Automatic conversation history management
* Built-in interruption handling
* Quick and easy tool definition
### Built-in LLM Support via LiteLLM
`LlmAgent` provides unified access to 100+ LLM providers through [LiteLLM](https://github.com/BerriAI/litellm):
```python theme={null}
# OpenAI
LlmAgent(model="gpt-5-nano", api_key=os.getenv("OPENAI_API_KEY"), ...)
# Anthropic
LlmAgent(model="anthropic/claude-haiku-4-5-20251001", api_key=os.getenv("ANTHROPIC_API_KEY"), ...)
# Google Gemini
LlmAgent(model="gemini/gemini-2.5-flash-preview-09-2025", api_key=os.getenv("GEMINI_API_KEY"), ...)
# With fallbacks
LlmAgent(
model="gpt-5-nano",
config=LlmConfig(fallbacks=["anthropic/claude-haiku-4-5-20251001", "gemini/gemini-2.5-flash-preview-09-2025"]),
...
)
```
### Declarative Tool System
Define agent capabilities using simple decorators. Three tool types cover all common scenarios:
| Tool Type | Decorator | What It Does | Example Use Case |
| --------------- | ------------------- | --------------------------------------------------------------- | ------------------------------------------------- |
| **Loopback** | `@loopback_tool` | Fetches information, then the agent speaks the answer naturally | Looking up order status, checking account balance |
| **Passthrough** | `@passthrough_tool` | Takes an immediate action without additional AI processing | Ending a call, transferring to a phone number |
| **Handoff** | `@handoff_tool` | Transfers the conversation to a different specialized agent | Routing to Spanish support, escalating to billing |
```python theme={null}
from typing import Annotated
from line.llm_agent import loopback_tool, passthrough_tool, handoff_tool
from line.events import AgentEndCall
@loopback_tool
async def get_weather(ctx, city: Annotated[str, "City name"]) -> str:
"""Get current weather for a city."""
return f"72°F and sunny in {city}"
@passthrough_tool
async def end_call(ctx):
"""End the call."""
yield AgentEndCall()
@handoff_tool
async def transfer_to_support(ctx, event):
"""Transfer to support agent."""
async for output in support_agent.process(ctx.turn_env, event):
yield output
```
### Background Tool Execution
Long-running tools can execute in the background without blocking the LLM:
```python theme={null}
from typing import Annotated
from line.llm_agent import loopback_tool
@loopback_tool(is_background=True)
async def check_bank_balance(ctx, account_id: Annotated[str, "Account ID"]):
"""Check account balance (may take a few seconds)."""
yield "Checking your balance..." # Immediate acknowledgment
balance = await api.get_balance(account_id) # Long operation
yield f"Your balance is ${balance:.2f}" # Triggers new LLM completion
```
### Built-in Tools
Common operations available out of the box:
```python theme={null}
from line.llm_agent import end_call, send_dtmf, transfer_call, web_search, agent_as_handoff
agent = LlmAgent(
tools=[
end_call, # End the call
send_dtmf, # Send DTMF tones
transfer_call, # Transfer to phone number
web_search, # Real-time web search
agent_as_handoff(other_agent, name="transfer_to_billing"),
],
...
)
```
### Multi-Agent Workflows
Create sophisticated agent routing with `agent_as_handoff`:
```python theme={null}
spanish_agent = LlmAgent(
model="gpt-5-nano",
config=LlmConfig(system_prompt="Speak only in Spanish.", ...),
...
)
main_agent = LlmAgent(
tools=[
agent_as_handoff(
spanish_agent,
handoff_message="Transferring to Spanish support...",
name="transfer_to_spanish",
description="Transfer when user requests Spanish.",
),
],
...
)
```
### Structured Event System
Events are how your agent communicates with the outside world. **Output events** are actions your agent takes (speaking, ending calls). **Input events** are things that happen during a call (user speaks, call starts).
**Output Events** (agent → harness):
* `AgentSendText` — Send text to be spoken
* `AgentEndCall` — End the call
* `AgentTransferCall` — Transfer to another number
* `AgentSendDtmf` — Send DTMF tone
* `AgentToolCalled` / `AgentToolReturned` — Tool execution tracking
* `LogMetric` / `LogMessage` — Observability
**Input Events** (harness → agent):
* `CallStarted` / `CallEnded` — Call lifecycle
* `UserTurnStarted` / `UserTurnEnded` — User speaking
* `UserTextSent` / `UserDtmfSent` — User content
* `AgentHandedOff` — Handoff notification
All input events include a `history` field with the complete conversation context.
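For example, a minimal agent might react to input events by yielding output events. The sketch below uses the event classes listed above; the `EchoAgent` name and the turn threshold are illustrative:
```python theme={null}
from line.events import AgentEndCall, AgentSendText, CallStarted, UserTurnEnded

class EchoAgent:
    async def process(self, env, event):
        if isinstance(event, CallStarted):
            yield AgentSendText(text="Hi! Say something and I'll repeat it.")
        elif isinstance(event, UserTurnEnded):
            user_text = event.content[0].content if event.content else ""
            # Every input event carries the conversation so far in `history`
            if len(event.history or []) > 20:
                yield AgentSendText(text="We've been chatting a while. Goodbye!")
                yield AgentEndCall()
            else:
                yield AgentSendText(text=f"You said: {user_text}")
```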
### Enhanced Configuration
Fine-tune how your agent thinks and responds. `LlmConfig` lets you control the AI's personality, response length, creativity, and reliability:
```python theme={null}
LlmConfig(
system_prompt="You are a helpful assistant.",
introduction="Hello! How can I help?",
# Sampling parameters
temperature=0.7,
max_tokens=1024,
top_p=0.95,
# Resilience
num_retries=2,
fallbacks=["gpt-5-nano"],
timeout=30.0,
# Provider-specific options
extra={"reasoning_effort": "high"},
)
```
***
## Migration Guide from v0.1.x to v0.2
This guide walks you through upgrading your existing v0.1.x agents to v0.2. The migration involves updating imports, simplifying your agent setup, and adopting the new tool system. Most agents can be migrated in under an hour.
### Overview of Changes
| v0.1.x | v0.2 |
| ------------------------------------- | ----------------------------------------- |
| `VoiceAgentSystem` + `Bus` + `Bridge` | `VoiceAgentApp` with `get_agent` callback |
| `ReasoningNode` subclasses | `LlmAgent` or custom `Agent` protocol |
| `call_handler(system, request)` | `get_agent(env, request) -> Agent` |
| Manual event routing | Automatic event dispatch with filters |
| `process_context()` method | `process(env, event)` async iterable |
### Step 1: Update Imports
```python theme={null}
# v0.1.x
from line.voice_agent_app import VoiceAgentApp
from line.voice_agent_system import VoiceAgentSystem
from line.bridge import Bridge
from line.nodes import ReasoningNode
from line.events import (
AgentSpeechSent,
UserTranscriptionReceived,
EndCall,
TransferCall,
)
# v0.2
from line.voice_agent_app import VoiceAgentApp, AgentEnv
from line.llm_agent import LlmAgent, LlmConfig
from line.llm_agent import end_call, transfer_call, loopback_tool, passthrough_tool
from line.events import (
AgentSendText,
AgentEndCall,
AgentTransferCall,
UserTurnEnded,
CallStarted,
)
```
### Step 2: Replace VoiceAgentSystem with get\_agent
In v0.1.x, event routing was configured manually via `bridge.on()`. In v0.2, event dispatch is automatic with customizable **run** and **cancel filters**.
```python v0.1.x theme={null}
from line.voice_agent_app import VoiceAgentApp
from line.voice_agent_system import VoiceAgentSystem
from line.bridge import Bridge
from line.nodes import ReasoningNode
from line.events import (
UserTranscriptionReceived,
UserStoppedSpeaking,
DTMFInputEvent,
)
class MyReasoningNode(ReasoningNode):
async def process_context(self, context):
# Your LLM logic here
response = await call_llm(context.messages)
yield AgentResponse(content=response)
async def call_handler(system: VoiceAgentSystem, call_request):
node = MyReasoningNode(system_prompt="You are helpful.")
bridge = Bridge(node)
system.with_speaking_node(node, bridge)
# Manual event routing with bridge.on()
bridge.on(UserTranscriptionReceived).map(node.add_event)
bridge.on(UserStoppedSpeaking).stream(node.generate).broadcast()
# DTMF events required explicit routing
bridge.on(DTMFInputEvent).map(node.handle_dtmf)
await system.start()
await system.send_initial_message("Hello!")
await system.wait_for_shutdown()
app = VoiceAgentApp(call_handler=call_handler)
```
```python v0.2 theme={null}
import os
from line import CallRequest
from line.voice_agent_app import VoiceAgentApp, AgentEnv
from line.llm_agent import LlmAgent, LlmConfig, end_call
from line.events import (
CallStarted,
UserTurnEnded,
UserDtmfSent,
UserTurnStarted,
CallEnded,
)
async def get_agent(env: AgentEnv, call_request: CallRequest):
agent = LlmAgent(
model="gpt-5-nano",
api_key=os.getenv("OPENAI_API_KEY"),
tools=[end_call],
config=LlmConfig(
system_prompt="You are helpful.",
introduction="Hello!",
),
)
# Default: just return the agent (uses default filters)
return agent
async def get_agent_with_dtmf(env: AgentEnv, call_request: CallRequest):
"""Alternative: include DTMF events in processing."""
agent = LlmAgent(...)
# Return an AgentSpec tuple to customize filters
run_filter = [CallStarted, UserTurnEnded, UserDtmfSent, CallEnded]
cancel_filter = [UserTurnStarted]
return (agent, run_filter, cancel_filter)
app = VoiceAgentApp(get_agent=get_agent)
```
#### Run and Cancel Filters
Filters control your agent's behavior during a call:
* **Run filters** determine what triggers your agent to respond (e.g., when the user finishes speaking)
* **Cancel filters** determine what interrupts your agent (e.g., when the user starts talking over the agent)
You can customize these by returning a tuple instead of just the agent:
```python theme={null}
from typing import Union, Tuple
AgentSpec = Union[Agent, Tuple[Agent, run_filter, cancel_filter]]
```
| Filter | Purpose | Default |
| ------------------ | ------------------------------------------ | ----------------------------------------- |
| **run\_filter** | Events that trigger agent processing | `[CallStarted, UserTurnEnded, CallEnded]` |
| **cancel\_filter** | Events that cancel in-progress agent tasks | `[UserTurnStarted]` |
**Example: Agent that responds to DTMF input**
```python theme={null}
from line.events import (
CallStarted, CallEnded, UserTurnEnded, UserTurnStarted, UserDtmfSent
)
async def get_agent(env: AgentEnv, call_request: CallRequest):
agent = LlmAgent(...)
# Include UserDtmfSent in run_filter to process DTMF
run_filter = [CallStarted, UserTurnEnded, UserDtmfSent, CallEnded]
cancel_filter = [UserTurnStarted]
return (agent, run_filter, cancel_filter)
```
**Example: Agent that doesn't get interrupted**
```python theme={null}
async def get_agent(env: AgentEnv, call_request: CallRequest):
agent = LlmAgent(...)
# Empty cancel_filter = agent won't be interrupted
run_filter = [CallStarted, UserTurnEnded, CallEnded]
cancel_filter = []
return (agent, run_filter, cancel_filter)
```
**Example: Custom filter function**
```python theme={null}
def my_run_filter(event: InputEvent) -> bool:
"""Only process events during business hours."""
if isinstance(event, CallStarted):
return is_business_hours()
return isinstance(event, (UserTurnEnded, CallEnded))
async def get_agent(env: AgentEnv, call_request: CallRequest):
agent = LlmAgent(...)
return (agent, my_run_filter, [UserTurnStarted])
```
### Step 3: Migrate Event Handling
```python v0.1.x theme={null}
# Event names
AgentSpeechSent # Agent spoke
UserTranscriptionReceived # User spoke
EndCall # End call
TransferCall # Transfer call
# Manual event handling in ReasoningNode
class MyNode(ReasoningNode):
async def process_context(self, context):
for event in context.events:
if isinstance(event, UserTranscriptionReceived):
user_message = event.transcription
```
```python v0.2 theme={null}
# Event names
AgentSendText # Output: send text to speak
AgentTextSent # Input: confirmation text was spoken
UserTurnEnded # Input: user finished speaking
AgentEndCall # Output: end call
AgentTransferCall # Output: transfer call
# Events include history automatically
async def process(self, env, event):
if isinstance(event, UserTurnEnded):
# Access user's message
user_message = event.content[0].content
# Access full conversation history
for past_event in event.history:
if isinstance(past_event, UserTextSent):
print(f"User previously said: {past_event.content}")
```
### Step 4: Migrate Custom Tools
```python v0.1.x theme={null}
# Manual tool handling in ReasoningNode
class MyNode(ReasoningNode):
async def process_context(self, context):
# Parse tool calls from LLM response
if tool_call := extract_tool_call(response):
result = await self.execute_tool(tool_call)
# Manually add to context and call LLM again
context.add_tool_result(result)
response = await call_llm(context)
```
```python v0.2 theme={null}
from typing import Annotated
from line.llm_agent import loopback_tool, passthrough_tool
from line.events import AgentSendText, AgentEndCall
# Declarative tool definitions
@loopback_tool
async def get_account_balance(ctx, account_id: Annotated[str, "Account ID"]):
"""Look up account balance."""
balance = await api.get_balance(account_id)
return f"${balance:.2f}"
@passthrough_tool
async def end_call_with_message(ctx, message: Annotated[str, "Goodbye message"]):
"""End call with a custom message."""
yield AgentSendText(text=message)
yield AgentEndCall()
# Tools are passed to LlmAgent
agent = LlmAgent(
tools=[get_account_balance, end_call_with_message],
...
)
```
### Step 5: Migrate Multi-Agent Patterns
```python v0.1.x theme={null}
# Manual agent switching
class MainNode(ReasoningNode):
def __init__(self, spanish_node):
self.spanish_node = spanish_node
self.use_spanish = False
async def process_context(self, context):
if self.should_switch_to_spanish(context):
self.use_spanish = True
# Complex manual state management
```
```python v0.2 theme={null}
from line.llm_agent import agent_as_handoff
spanish_agent = LlmAgent(
model="gpt-5-nano",
config=LlmConfig(system_prompt="Speak only in Spanish."),
...
)
main_agent = LlmAgent(
tools=[
agent_as_handoff(
spanish_agent,
handoff_message="Transferring...",
name="transfer_to_spanish",
description="Use when user requests Spanish.",
),
],
...
)
```
### Removed APIs
The following APIs from v0.1.x have been removed with no direct replacement:
| Removed | Alternative |
| --------------------- | -------------------------------------------- |
| `VoiceAgentSystem` | Use `VoiceAgentApp` with `get_agent` |
| `Bus` | Events are dispatched automatically |
| `Bridge` | Use run/cancel filters on `AgentSpec` |
| `ReasoningNode` | Use `LlmAgent` or implement `Agent` protocol |
| `ConversationHarness` | Handled internally by `ConversationRunner` |
| `EventsRegistry` | Use typed event classes directly |
### Custom Agent Protocol
If you need custom logic beyond `LlmAgent`, implement the `Agent` protocol:
```python theme={null}
from typing import AsyncIterable
from line.events import (
InputEvent,
OutputEvent,
AgentSendText,
CallStarted,
UserTurnEnded,
)
class CustomAgent:
"""Custom agent implementing the Agent protocol."""
async def process(self, env, event: InputEvent) -> AsyncIterable[OutputEvent]:
if isinstance(event, CallStarted):
yield AgentSendText(text="Hello from custom agent!")
elif isinstance(event, UserTurnEnded):
# Your custom logic here
user_message = event.content[0].content
response = await your_custom_logic(user_message, event.history)
yield AgentSendText(text=response)
```
***
## Breaking Changes Summary
This section provides a quick reference for all breaking changes. Use this as a checklist when migrating your code.
### Event Renames
| v0.1.x | v0.2 |
| --------------------------- | -------------------------------------------------- |
| `AgentSpeechSent` | `AgentSendText` (output) / `AgentTextSent` (input) |
| `UserTranscriptionReceived` | `UserTextSent` / `UserTurnEnded` |
| `UserStartedSpeaking` | `UserTurnStarted` |
| `UserStoppedSpeaking` | `UserTurnEnded` |
| `AgentStartedSpeaking` | `AgentTurnStarted` |
| `AgentStoppedSpeaking` | `AgentTurnEnded` |
| `EndCall` | `AgentEndCall` |
| `TransferCall` | `AgentTransferCall` |
| `DTMFInputEvent` | `UserDtmfSent` |
| `DTMFOutputEvent` | `AgentSendDtmf` |
**Output vs. Input events**: `AgentSendText` is an output event you **yield** to make the agent speak. `AgentTextSent` is an input event you **receive** confirming what was spoken (appears in history).
### Structural Changes
* **History in events**: All input events now include an optional `history` field with complete conversation context. When `history` is `None`, the event is inside a history list; when it contains a list, the event has full context attached.
* **Tool events**: `ToolCall`/`ToolResult` replaced with structured `AgentToolCalled`/`AgentToolReturned`
* **Event IDs**: All events now have stable `event_id` fields for tracking
### Configuration Changes
| v0.1.x | v0.2 |
| --------------------------------- | ------------------------------------- |
| `CallRequest.agent.system_prompt` | `LlmConfig.system_prompt` |
| `CallRequest.agent.introduction` | `LlmConfig.introduction` |
| Manual LLM parameters | `LlmConfig` with full LiteLLM support |
Use `LlmConfig.from_call_request(call_request, fallback_system_prompt="...", fallback_introduction="...")` to automatically inherit configuration from the Cartesia Playground while providing sensible defaults. See [Agents documentation](/line/sdk/agents#accessing-call-metadata-in-your-agent-logic) for details.
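For example, inside your `get_agent` function (a sketch; the model and fallback strings are placeholders):
```python theme={null}
import os

from line.llm_agent import LlmAgent, LlmConfig

config = LlmConfig.from_call_request(
    call_request,
    fallback_system_prompt="You are a helpful assistant.",
    fallback_introduction="Hello! How can I help you today?",
)
agent = LlmAgent(model="gpt-5-nano", api_key=os.getenv("OPENAI_API_KEY"), config=config)
```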
***
## New Dependencies
v0.2 introduces the following dependencies:
```
litellm # Multi-provider LLM support
pydantic # Type validation for events
phonenumbers >= 9.0 # Phone number validation for transfer_call
```
Optional dependencies for examples:
```
exa-py # Exa web search integration
duckduckgo-search # Fallback web search
```
***
## Getting Help
* **Documentation**: [Line SDK Overview](/line/sdk/overview)
* **Examples**: [github.com/cartesia-ai/line/examples](https://github.com/cartesia-ai/line/tree/main/examples)
* **Support**: [support@cartesia.ai](mailto:support@cartesia.ai)
# Metrics
Source: https://docs.cartesia.ai/line/evaluations/metrics
The Line platform includes a suite of tools for evaluating how your Agent is performing, both during development and in production.
You have full control over how the metrics used to evaluate your agent are defined.
## System Metrics
By default, all calls made by a Line Agent have a set of system metrics automatically calculated to help evaluate performance.
| System Metric | Description |
| ------------------------------ | ------------------------------------------------------------------------------------------------------------ |
| system\_call\_success          | A boolean indicating whether the call completed successfully, i.e. did not disconnect unexpectedly (for example, due to the reasoning code crashing) |
| system\_text\_to\_speech\_ttfb | The time to first byte of audio generated by the TTS model on the first turn of the conversation |
### LLM as a Judge
An LLM-as-a-Judge metric is created in the playground by setting a name and specifying a prompt. You can try out different prompts in
the playground against existing call transcripts by copying a call id into the metric creation field and clicking evaluate
to generate a sample output.
Write your LLM-as-a-Judge metrics to return a single value and a description field.
A metric name can only include lowercase letters, digits, and `-`, `_`, or `.` characters so that you can manage it from the CLI. Metric names must also be unique within your organization.
## Assigning Metrics
Once a metric is created, it can be assigned to an Agent via the playground from the Agent page. All subsequent calls made
to or from that Agent will have metric results calculated and available to view in the console and API. Note
that when you assign a metric to an existing Agent, it won’t be automatically run on previous calls.
# Metrics Results
Source: https://docs.cartesia.ai/line/evaluations/results
View the results from metrics run against all calls handled by your agent.
Metric results are accessible via both the API and the playground.
Each metric result contains relevant information to help you analyze your calls. Some fields include:
```
- metric_id
- metric_name
- agent_id
- call_id
- summary
- transcript
- deployment_id
- value
- status
```
To view the full schema, visit the API [List Metric Results](/api-reference/agents/metrics/list-metric-results).
## API
To get metrics via the API, you can specify a few filter parameters including `call_id`, `agent_id` and more. You can retrieve these metric results or export them into a CSV. [List Metric Results](/api-reference/agents/metrics/list-metric-results) and [Export Metric Results](/api-reference/agents/metrics/export-metric-results) have the same query parameters available and differ only in the response format.
#### Example Request for CSV Results
```zsh cURL lines theme={null}
curl --location 'https://api.cartesia.ai/agents/metrics/export?metric_id={metric_id}&limit=100&starting_after={previous_next_page_metric_id}' \
--header 'Cartesia-Version: 2025-04-16' \
--header 'Authorization: Bearer {YOUR_API_KEY}'
```
```python Python lines theme={null}
import requests
url = "https://api.cartesia.ai/agents/metrics/export"
params = {
"metric_id": "{metric_id}",
"limit": 100,
"starting_after": "{previous_next_page_metric_id}"
}
headers = {
"Content-Type": "application/json",
"Cartesia-Version": "2025-04-16",
"Authorization": "Bearer "
}
response = requests.get(url, headers=headers, params=params)
if response.status_code == 200:
# Save CSV to file
with open("metrics.csv", "w", encoding="utf-8") as f:
f.write(response.text)
print("CSV file saved as metrics.csv")
else:
print(f"Error {response.status_code}: {response.text}")
```
```typescript Javascript lines theme={null}
const response = await fetch(
"https://api.cartesia.ai/agents/metrics/export?metric_id={metric_id}&limit=100&starting_after={previous_next_page_metric_id}",
{
method: "GET",
headers: {
"Content-Type": "application/json",
"Cartesia-Version": "2025-04-16",
Authorization: "Bearer ",
},
}
);
```
## Console
Metrics are visible in the playground for a specific call record.
# Deployments
Source: https://docs.cartesia.ai/line/infrastructure/deployments
Deployments are instances of your agent running on Cartesia's servers.
## State
Only deployments in the `ready` state can handle inbound or outbound calls. At any time, only one deployment is active.
Deployments that fail health checks will not receive traffic.
## Creating a deployment
Use `cartesia deploy` or push to a linked GitHub repository to create a deployment.
Cartesia servers:
1. Build the virtual environment
2. Load `main.py` and instantiate a FastAPI app
3. Run a health check
4. Set the deployment to `ready` and start receiving traffic
Line supports Python 3.9–3.13 (specify in `pyproject.toml`). FastAPI servers only; more frameworks coming soon.
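As a rough sketch, a deployable `main.py` can reuse the v0.2 `VoiceAgentApp` pattern shown elsewhere in these docs (the model, prompt, and environment variable are placeholders):
```python theme={null}
# main.py
import os

from line import CallRequest
from line.llm_agent import LlmAgent, LlmConfig, end_call
from line.voice_agent_app import AgentEnv, VoiceAgentApp

async def get_agent(env: AgentEnv, call_request: CallRequest):
    return LlmAgent(
        model="gpt-5-nano",
        api_key=os.getenv("OPENAI_API_KEY"),
        tools=[end_call],
        config=LlmConfig(system_prompt="You are helpful.", introduction="Hello!"),
    )

app = VoiceAgentApp(get_agent=get_agent)

if __name__ == "__main__":
    app.run()
```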
**Pre-Call Initialization**
Inbound calls will ring for five seconds to allow your application logic to warm up any required resources and establish
connections.
# Observability
Source: https://docs.cartesia.ai/line/infrastructure/observability
Get full visibility into how your Agent is performing.
Monitor every deployment and call.
## Deployment
Each deployment generates a unique ID. View logs in the console.
## Call Logs
You can click into a call and view any logging statements generated by your reasoning code.
## Transcripts
Each call has a transcript that separates the user's transcribed audio from the text generated by the agent. When you export these
transcripts with the API or CLI, they include more granular turn-level timestamps.
## Loggable Events
Record events without tying them to tool calls.
### SDK
In the SDK, yield `LogMessage` events from your agent or tools to record custom events:
```python theme={null}
from line.events import LogMessage
@loopback_tool
async def process_order(ctx, order_id: Annotated[str, "Order ID"]):
"""Process a customer order."""
result = await api.process_order(order_id)
# Log a custom event
yield LogMessage(
name="order_processed",
level="info",
message=f"Processed order {order_id}",
metadata={"status": result.status, "order_id": order_id}
)
return f"Order {order_id} processed: {result.status}"
```
Events are automatically sent to the platform when yielded.
### Websocket
If you're not using the SDK and instead just relying on the bare websocket, logging events will look like this:
```json theme={null}
{
"type": "log_event",
"event": "event_name",
"metadata": {
"key": "value"
}
}
```
### Playground
You can view these events in the Playground under the `Transcript` tab of the call.
## Loggable Metrics
Record metrics at any point in your workflow.
### SDK
In the context of the SDK, we can log a metric by broadcasting the `LogMetric` event.
Here's a snippet from the form filling template that exhibits this:
```python theme={null}
# Record the answer in form manager
success = self.form_manager.record_answer(answer)
if success:
# Log metric for the answered question
if current_question:
metric_name = current_question["id"]
yield LogMetric(name=metric_name, value=answer)
logger.info(f"📊 Logged metric: {metric_name}={answer}")
```
The user bridge is subscribed to the `LogMetric` event by default and logs it over the websocket whenever it sees that a `LogMetric` has been broadcast.
### Websocket
If you're not using the SDK and instead just relying on the bare websocket, logging metrics will look like this:
```json theme={null}
{
"type": "log_metric",
"name": "metric_name",
"value": "metric_value"
}
```
### Playground
You can view these events in the Playground under the `Transcript` tab of the call.
## Call Recordings
Call recordings can be downloaded from the playground.
## Webhooks
Cartesia sends webhook events to your **HTTPS** endpoint throughout the call lifecycle. Expose **`POST`** + **`application/json`** and verify the **`x-webhook-secret`** header matches your stored secret.
### Verify the webhook secret
```python theme={null}
if request.headers.get("x-webhook-secret") != os.environ["LINE_WEBHOOK_SECRET"]:
return jsonify({"error": "unauthorized"}), 401
```
```typescript theme={null}
if (req.headers["x-webhook-secret"] !== process.env.LINE_WEBHOOK_SECRET)
return res.status(401).json({ error: "unauthorized" });
```
### Event types
| Event | When | Typed field |
| -------------------- | ------------------------------ | ----------- |
| `call_started` | Call session begins | `call` |
| `call_completed` | Call ends normally | `call` |
| `call_failed` | Call ends with error | `call` |
| `call_turn` | Each conversational turn | `turn` |
| `post_call_analysis` | After async analysis completes | `analysis` |
### Envelope fields
Every webhook event includes these top-level fields:
| Field | Description |
| ------------ | ----------------------------- |
| `type` | Event type (see table above). |
| `call_id` | Call identifier. |
| `agent_id` | Agent that handled the call. |
| `webhook_id` | Webhook config id. |
| `timestamp` | RFC 3339 event time. |
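A minimal receiver might verify the secret and branch on `type` like this. This is a sketch assuming Flask, to match the verification snippet above; adapt it to your framework:
```python theme={null}
import os

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.post("/webhooks/cartesia")
def cartesia_webhook():
    if request.headers.get("x-webhook-secret") != os.environ["LINE_WEBHOOK_SECRET"]:
        return jsonify({"error": "unauthorized"}), 401
    event = request.get_json()
    if event["type"] in ("call_completed", "call_failed"):
        call = event["call"]
        print(f"Call {event['call_id']} ended: {call.get('end_reason')}")
    elif event["type"] == "post_call_analysis":
        print(f"Summary for {event['call_id']}: {event['analysis']['summary']}")
    return jsonify({"ok": True}), 200
```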
### `call`
Present on `call_started`, `call_completed`, and `call_failed` events. Matches the [GET /agents/calls/\{call\_id}](/api-reference/agents/calls/get-call) response. Some events (e.g. `call_started`) may omit fields like `end_time` that do not yet have a valid value.
| Field | Description |
| ------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
| `id` | Call identifier. |
| `agent_id` / `agent_name` | Agent details. |
| `status` | `started`, `completed`, or `failed`. |
| `start_time` / `end_time` | RFC 3339 timestamps. |
| `end_reason` | Why the call ended (e.g. `client_hangup`, `agent_hangup`, `inactivity`). See [EndReason](/api-reference/agents/calls/get-call) for all values. |
| `transcript` | Array of turns (see `turn` below). |
| `telephony_params` | `from`, `to`, `direction`, `call_sid`, `connection_type`. |
| `error_message` | Error detail (failed calls only). |
| `metadata` | User-supplied metadata passed at call start. |
| `summary` | Call summary (if available at event time). |
### `turn`
Present on `call_turn` events. One turn per agent or user utterance.
| Field | Description |
| ----------------------------------- | ----------------------------------------------------- |
| `role` | `assistant` or `user`. |
| `text` | Turn text. |
| `start_timestamp` / `end_timestamp` | Seconds from call start. |
| `tts_ttfb` | Agent TTS time-to-first-byte (seconds), when present. |
| `tool_calls` | Tool calls made during this turn, when present. |
### `analysis`
Present on `post_call_analysis` events. Sent after async analysis completes (currently summary generation; evaluations and metrics will be added here in the future).
| Field | Description |
| --------- | -------------------------- |
| `summary` | 1-2 sentence call summary. |
### Example: `call_completed`
```json theme={null}
{
"type": "call_completed",
"call_id": "ac_sid_gqkgRWUz2u64qFUjA1mZyr",
"agent_id": "agent_rwh4HGMgyhK7rM5ucVqbiC",
"webhook_id": "agent_webhook_P3MgdLf1cpaucZJ7xWehCC",
"end_reason": "client_hangup",
"timestamp": "2026-04-16T01:08:50.061907836Z",
"call": {
"id": "ac_sid_gqkgRWUz2u64qFUjA1mZyr",
"agent_id": "agent_rwh4HGMgyhK7rM5ucVqbiC",
"agent_name": "My Agent",
"status": "completed",
"start_time": "2026-04-16T01:08:37.413659Z",
"end_time": "2026-04-16T01:08:50.036327Z",
"end_reason": "client_hangup",
"telephony_params": {
"from": "websocket",
"to": "agent_rwh4HGMgyhK7rM5ucVqbiC",
"connection_type": "websocket"
},
"transcript": [
{
"role": "assistant",
"text": "Hi there! How can I help you today?",
"start_timestamp": 0.41,
"end_timestamp": 3.2,
"tts_ttfb": 0.065
},
{
"role": "user",
"text": "I want to schedule an appointment.",
"start_timestamp": 3.5,
"end_timestamp": 5.8
}
]
}
}
```
### Example: `post_call_analysis`
```json theme={null}
{
"type": "post_call_analysis",
"call_id": "ac_sid_gqkgRWUz2u64qFUjA1mZyr",
"agent_id": "agent_rwh4HGMgyhK7rM5ucVqbiC",
"webhook_id": "agent_webhook_P3MgdLf1cpaucZJ7xWehCC",
"timestamp": "2026-04-16T01:08:50.955058787Z",
"analysis": {
"summary": "The caller requested to schedule an appointment. The agent confirmed availability and booked a slot."
}
}
```
### Test your endpoint
```bash theme={null}
curl -sS -X POST "https://your-server.example/webhooks/cartesia" \
-H "Content-Type: application/json" \
-H "x-webhook-secret: YOUR_WEBHOOK_SECRET" \
-d '{
"type": "call_completed",
"call_id": "ac_test_123",
"agent_id": "agent_demo",
"webhook_id": "agent_webhook_test",
"timestamp": "2026-01-01T00:00:00.000000000Z",
"call": {
"id": "ac_test_123",
"agent_id": "agent_demo",
"agent_name": "Test Agent",
"status": "completed",
"end_reason": "client_hangup",
"transcript": []
}
}'
```
For backwards compatibility, `call_completed` and `call_failed` events also include `body` (transcript array) and a top-level `end_reason`. These are deprecated — use `call.transcript` and `call.end_reason` instead.
# Scaling
Source: https://docs.cartesia.ai/line/infrastructure/scaling
## Compute Resources
Each call has access to 1GB memory and 0.5 vCPU. Contact support to increase limits.
## Concurrency
Concurrent call limits by subscription tier:
| Subscription Tier | Concurrency Limit |
| ----------------- | ----------------- |
| Free | 8 |
| Pro | 12 |
| Startup | 20 |
| Scale | 60 |
**Outbound Concurrency**
When triggering outbound calls, you are limited to one call per second, and the overall concurrency limits still apply.
# Calls API
Source: https://docs.cartesia.ai/line/integrations/calls-api
Stream audio between your application and your voice agent via WebSocket. Use this for web apps, mobile apps, or to bridge your own telephony provider.
## Quick start
```javascript theme={null}
const ws = new WebSocket(
`wss://api.cartesia.ai/agents/stream/${agentId}`,
{
headers: {
Authorization: `Bearer ${accessToken}`,
"Cartesia-Version": "2025-04-16",
},
}
);
// Initialize the stream
ws.onopen = () => {
ws.send(JSON.stringify({
event: "start",
config: { input_format: "pcm_44100" },
}));
};
// Handle agent audio
ws.onmessage = (msg) => {
const data = JSON.parse(msg.data);
if (data.event === "media_output") {
playAudio(atob(data.media.payload));
}
};
// Send user audio
function sendAudio(audioData) {
ws.send(JSON.stringify({
event: "media_input",
stream_id: streamId,
media: { payload: btoa(audioData) },
}));
}
```
Get an access token from the `/access-token` [endpoint](/api-reference/auth/access-token#body-grants-agent). See [Authenticating Client Apps](/get-started/authenticate-your-client-applications) for details.
***
## Connection
Connect to the WebSocket endpoint:
```
wss://api.cartesia.ai/agents/stream/{agent_id}
```
**Headers:**
| Header | Value |
| ------------------ | ---------------- |
| `Authorization` | `Bearer {token}` |
| `Cartesia-Version` | `2025-04-16` |
## Protocol Overview
The WebSocket connection uses JSON messages for control events and base64-encoded audio for media.
The client sends a `start` event, the server responds with `ack`, then both sides exchange audio and control events until the connection closes.
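In Python, that exchange looks roughly like the sketch below. It assumes `ws` is an already-connected WebSocket client; the helper name and single round-trip are illustrative:
```python theme={null}
import base64
import json

async def exchange_one_chunk(ws, mic_chunk: bytes):
    """Send the start event and one audio chunk, then return any agent audio received."""
    await ws.send(json.dumps({"event": "start", "config": {"input_format": "pcm_44100"}}))
    ack = json.loads(await ws.recv())  # server confirms with the `ack` event
    stream_id = ack["stream_id"]

    await ws.send(json.dumps({
        "event": "media_input",
        "stream_id": stream_id,
        "media": {"payload": base64.b64encode(mic_chunk).decode()},
    }))

    msg = json.loads(await ws.recv())
    if msg["event"] == "media_output":
        return base64.b64decode(msg["media"]["payload"])  # agent audio to play back
    return None
```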
## Client events
### Start Event
Initializes the audio stream configuration.
* `config` overrides your agent's default input audio settings
* `stream_id` is optional. If not provided, the server generates one and returns it in the `ack` event
**This must be the first message sent.**
```json theme={null}
{
"event": "start",
"stream_id": "unique_id",
"config": {
"input_format": "pcm_44100",
"voice_id": "a0e99841-438c-4a64-b679-ae501e7d6091"
},
"agent": {
"introduction": "Hello, I'm an AI assistant",
"system_prompt": "### Your Role \n You are a helpful assistant"
},
"metadata": {
"to": "user@example.com",
"from": "+1234567890"
}
}
```
**Fields:**
* `stream_id` (optional): Stream identifier. If not provided, server generates one
* `config.input_format`: Audio format for client audio input (`mulaw_8000`, `pcm_16000`, `pcm_24000`, `pcm_44100`)
* `config.voice_id` (optional): Override the agent's default TTS voice
* `agent` (optional): Allows configuring individual agent calls via API and previewing changes in introduction or prompt without publishing to production
* `metadata` (optional): Custom metadata object. These will be passed through to the agent code, but there are some special fields you can use as well:
* `to` (optional): Destination identifier for call routing (defaults to agent ID)
* `from` (optional): Source identifier for the call (defaults to "websocket")
### Media Input Event
Audio data sent from the client to the server. `payload` audio data should be base64 encoded.
```json theme={null}
{
"event": "media_input",
"stream_id": "unique_id",
"media": {
"payload": "base64_encoded_audio_data"
}
}
```
**Fields:**
* `stream_id`: Unique identifier for the Stream from the ack response
* `media.payload`: Base64-encoded audio data in the format specified in the start event
### DTMF Event
Sends DTMF (dual-tone multi-frequency) tones.
```json theme={null}
{
"event": "dtmf",
"stream_id": "example_id",
"dtmf": "1"
}
```
**Fields:**
* `stream_id`: Stream identifier
* `dtmf`: DTMF digit (0-9, \*, #)
### Custom Event
Sends custom metadata to the agent.
```json theme={null}
{
"event": "custom",
"stream_id": "example_id",
"metadata": {
"user_id": "user123",
"session_info": "custom_data"
}
}
```
**Fields:**
* `stream_id`: Stream identifier
* `metadata`: Object containing key-value pairs of custom data
## Server events
### Ack Event
Confirms stream configuration. Returns the server-generated `stream_id` if one wasn't provided in the `start` event.
```json theme={null}
{
"event": "ack",
"stream_id": "example_id",
"config": {
"input_format": "pcm_44100",
"voice_id": "a0e99841-438c-4a64-b679-ae501e7d6091"
},
"agent": {
"system_prompt": "### Your Role \n You are a helpful assistant",
"introduction": "Hello, I'm an AI assistant"
}
}
```
### Media Output Event
The server sends the agent's audio response. `payload` is base64-encoded audio data.
```json theme={null}
{
"event": "media_output",
"stream_id": "example_id",
"media": {
"payload": "base64_encoded_audio_data"
}
}
```
### Clear Event
Indicates the agent wants to clear/interrupt the current audio stream.
```json theme={null}
{
"event": "clear",
"stream_id": "example_id"
}
```
### Transfer Call Event
Indicates the agent wants to transfer the call to a phone number. The client is responsible for initiating the transfer on its telephony side.
```json theme={null}
{
"event": "transfer_call",
"stream_id": "example_id",
"transfer": {
"target_phone_number": "+1234567890"
}
}
```
**Fields:**
* `stream_id`: Stream identifier
* `transfer.target_phone_number`: E.164 phone number to transfer the call to
## Connection Management
### Inactivity Timeout
The server closes idle connections after **180 seconds**. Any client message resets the timer:
* Application messages (media\_input, dtmf, custom events)
* Standard WebSocket ping frames
* Any other valid WebSocket message
When the timeout occurs, the connection is closed with:
* **Code:** 1000 (Normal Closure)
* **Reason:** `"connection idle timeout"`
### Ping/Pong Keepalive
To prevent inactivity timeouts during periods of silence, use standard WebSocket ping frames for periodic keepalive:
```python theme={null}
# Client sends ping to reset inactivity timer
pong_waiter = await websocket.ping()
latency = await pong_waiter
```
```javascript theme={null}
// Requires the Node.js `ws` library — the browser WebSocket API does not expose ping()
setInterval(() => {
if (websocket.readyState === WebSocket.OPEN) {
websocket.ping();
}
}, 60000); // Send ping every 60 seconds
```
The server automatically responds to ping frames with pong frames and resets the inactivity timer upon receiving any message.
### Connection Close
The connection can be closed by either the client or server using WebSocket close frames.
**Client-initiated close:**
```python theme={null}
await websocket.close(code=1000, reason="session completed")
```
**Server-initiated close:**
When the agent ends the call, the server closes the connection with:
* **Code:** 1000 (Normal Closure)
* **Reason:** `"call ended by agent"` or `"call ended by agent, reason: {specific_reason}"` if additional context is available
## Best Practices
1. **Send `start` first** — The connection closes if any other event is sent before `start`.
2. **Choose the right audio format** — Match the format to your source: `mulaw_8000` for telephony, `pcm_44100` for web clients.
3. **Handle closes cleanly** — Always capture close codes and reasons for debugging and recovery.
4. **Keep the connection alive** — Send WebSocket ping frames every 60–90 seconds to avoid the 180-second inactivity timeout.
5. **Manage stream IDs** — Provide your own `stream_id` values to improve observability across systems.
6. **Recover from idle timeouts** — On `1000 / connection idle timeout`, reconnect and resend a `start` event.
# Overview
Source: https://docs.cartesia.ai/line/integrations/overview
Your Line agent needs audio input to work. Choose based on your use case.
## Telephony
Use [Cartesia Telephony](/line/integrations/telephony/phone-numbers) for managed phone numbers. Cartesia provisions numbers and handles the telephony infrastructure for inbound and outbound use cases.
You can also use your own telephony stack by connecting to the [Calls API](/line/integrations/calls-api).
Bringing your own phone numbers or CCaaS provider is on the roadmap.
## Web and Mobile Apps
Use the [Calls API](/line/integrations/calls-api) to stream audio between your application and the agent via WebSocket.
```javascript theme={null}
const ws = new WebSocket(`wss://api.cartesia.ai/agents/stream/${agentId}`);
```
This option works great for:
* Web applications with browser microphone access
* Mobile apps with native audio capture
## Pricing
| Feature | Price per Minute | Notes |
| ------------------------ | ---------------- | ------------------------------------- |
| Agent Calling | \$0.06 | Base rate for all voice agent calls |
| Telephony (add-on) | +\$0.014 | Additional when using managed numbers |
| **Total with Telephony** | **\$0.074** | Combined cost for phone-based calls |
View your usage and remaining Voice Agent credits on the [Subscription](https://play.cartesia.ai/subscription) page.
# Outbound
Source: https://docs.cartesia.ai/line/integrations/telephony/outbound-dialing
Agents can make outbound dials with an API request. Simply specify a set of target phone numbers and your agent ID to place the call.
**Compliance**
You are solely responsible for remaining compliant with relevant local regulations for dialing, including the Telephone Consumer Protection Act (TCPA).
See Cartesia's [Acceptable Use Policy](https://cartesia.ai/legal/acceptable-use.html) for more detail.
```bash Bash lines theme={null}
curl -X POST "https://api.cartesia.ai/twilio/call/outbound" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $CARTESIA_API_KEY" \
-H "Cartesia-Version: 2025-04-16" \
-d '{
"target_numbers": ["YOUR_PHONE_NUMBER"],
"agent_id": "YOUR_AGENT_ID",
"metadata": {
"customer_id": "cust_123",
"custom_prompt": "Be extra friendly"
}
}'
```
```python Python lines theme={null}
import requests
url = "https://api.cartesia.ai/twilio/call/outbound"
headers = {
"Content-Type": "application/json",
"Authorization": "Bearer YOUR_CARTESIA_API_KEY",
"Cartesia-Version": "2025-04-16"
}
payload = {
"target_numbers": ["YOUR_PHONE_NUMBER"],
"agent_id": "YOUR_AGENT_ID",
"metadata": {
"customer_id": "cust_123",
"custom_prompt": "Be extra friendly"
}
}
response = requests.post(url, headers=headers, json=payload)
print("Status Code:", response.status_code)
print("Response:", response.json())
```
```bash CLI theme={null}
# Trigger an outbound call from a deployed agent to a specific number
cartesia call
```
The `metadata` field accepts any JSON object up to 1MB. This data is passed to your agent code deployment and can be accessed to customize agent behavior per call.
You can access the metadata in your agent code via the `call_request.metadata` object in your `get_agent` function.
```python theme={null}
async def get_agent(env, call_request):
if call_request.metadata:
logger.info(f"Received metadata: {call_request.metadata}")
# Use metadata to customize agent behavior
return LlmAgent(...)
```
You are limited to one outbound dial per second; any requests faster than that will be queued.
# Phone Numbers
Source: https://docs.cartesia.ai/line/integrations/telephony/phone-numbers
Cartesia Telephony provides managed phone numbers so your agent can receive and make real phone calls without setting up your own telephony infrastructure.
## Provisioning
The platform automatically provisions a phone number for each agent when you promote to production. When an agent is deleted, the assigned phone number is released and cannot be re-assigned to another agent.
Bringing your own phone numbers or CCaaS provider is on the roadmap.
## Finding Your Phone Number
When viewing your Line agents from the Playground, you can see the provisioned phone number on the card on the Agents page, or in the header once you navigate to the agent's page.
You can also retrieve your phone number using the [CLI](/line/cli).
List all agents to see their phone numbers:
```bash theme={null}
cartesia agents ls
```
Or get detailed information for a specific agent:
```bash theme={null}
cartesia status
```
This returns agent information including name, deployments, and phone numbers.
# Introduction
Source: https://docs.cartesia.ai/line/introduction
Build intelligent, low-latency voice agents with Line.
## What is Line?
Line brings voice to your text agents with Cartesia's state-of-the-art speech models. We handle audio orchestration, deployment, and observability so you can focus on your agent's reasoning.
## Get Started
* Build, deploy, and call your first agent
* Prototype and iterate on agents without code
* Write your custom reasoning logic in code
## Audio Orchestration
Line deploys your code in seconds in our managed runtime with auto-scaling and blazing fast audio processing, using [Ink](https://cartesia.ai/ink) for speech-to-text and [Sonic](https://cartesia.ai/sonic) for text-to-speech.
## What You Can Build
Line gives you full control over your agent's behavior through code: connect any LLM, call external APIs, query databases, and handle interruptions and turn-taking.
## Developer Tools
* **[CLI](/line/cli)** – Deploy and test agents from your terminal
* **[Call logs](/line/infrastructure/observability#call-logs)** – Debug conversations and monitor performance
* **[Evaluations](/line/evaluations/metrics)** – Measure agent quality with custom metrics
* **[Deployments](/line/infrastructure/observability#deployment)** – Track versions and roll back changes
# Agents
Source: https://docs.cartesia.ai/line/sdk/agents
Agents process input events and yield output events to control the conversation.
## What is an Agent?
An Agent controls the input/output event loop. The `process` method receives events (user speech, call start, etc.) and yields responses.
An Agent can be:
1. A **class** with a `process` method
2. A **function** with the same signature `(env, event) -> AsyncIterable[OutputEvent]`
```python theme={null}
from line.events import CallStarted, UserTurnEnded, AgentSendText
class HelloAgent:
async def process(self, env, event):
if isinstance(event, CallStarted):
yield AgentSendText(text="Hello!")
elif isinstance(event, UserTurnEnded):
yield AgentSendText(text="I heard you!")
```
**How an Agent works:**
* Events arrive (user speaks, call starts, button pressed)
* SDK calls `agent.process(env, event)`
* Agent yields output events (speech, tool calls, handoffs)
* SDK handles audio, LLM calls, and state management
***
## LlmAgent
Use the built-in `LlmAgent`, which wraps 100+ LLM providers via LiteLLM:
```python theme={null}
from line.llm_agent import LlmAgent, LlmConfig
agent = LlmAgent(
model="anthropic/claude-haiku-4-5-20251001", # Or "gpt-5.2", "gemini/gemini-2.5-flash", etc.
api_key="your-api-key",
tools=[...], # Optional list of tools
config=LlmConfig(
system_prompt="You are a helpful assistant...",
introduction="Hello! How can I help you today?",
),
)
```
### Prompting
Use `system_prompt` to define your agent's personality and `introduction` for the greeting:
```python theme={null}
import os
from line import CallRequest
from line.llm_agent import LlmAgent, LlmConfig, end_call
from line.voice_agent_app import AgentEnv, VoiceAgentApp
SYSTEM_PROMPT = """You are a friendly customer service agent.
Rules:
- Be polite and empathetic
- Confirm understanding before taking action
- end_call to gracefully end conversations
"""
async def get_agent(env: AgentEnv, call_request: CallRequest):
return LlmAgent(
model="anthropic/claude-haiku-4-5-20251001",
api_key=os.getenv("ANTHROPIC_API_KEY"),
tools=[end_call],
config=LlmConfig(
system_prompt=SYSTEM_PROMPT,
introduction="Hello! How can I help you today?",
),
)
app = VoiceAgentApp(get_agent=get_agent)
if __name__ == "__main__":
app.run()
```
### Supported Models
| Provider | Model Examples |
| ------------------------------------------------------------------- | ---------------------------------------------------------------------- |
| Anthropic | `anthropic/claude-haiku-4-5-20251001`, `anthropic/claude-sonnet-4-5` |
| OpenAI | `gpt-5.4`, `gpt-5.2` |
| Google | `gemini/gemini-2.5-flash-preview-09-2025`, `gemini/gemini-3.0-preview` |
| And 100+ more via [LiteLLM](https://docs.litellm.ai/docs/providers) | |
### LlmConfig Options
| Option | Type | Description |
| ------------------- | --------------------- | ---------------------------------------------------------- |
| `system_prompt` | `str` | The system prompt defining agent behavior |
| `introduction`      | `Optional[str]`       | Message sent on call start. Set to `None` or `""` to wait for the user to speak first |
| `temperature` | `Optional[float]` | Sampling temperature |
| `max_tokens` | `Optional[int]` | Maximum tokens per response |
| `top_p` | `Optional[float]` | Nucleus sampling threshold |
| `stop` | `Optional[List[str]]` | Stop sequences |
| `seed` | `Optional[int]` | Random seed for reproducibility |
| `presence_penalty` | `Optional[float]` | Presence penalty for token generation |
| `frequency_penalty` | `Optional[float]` | Frequency penalty for token generation |
| `num_retries` | `int` | Number of retries on failure (default: 2) |
| `fallbacks` | `Optional[List[str]]` | Fallback models if primary fails |
| `timeout` | `Optional[float]` | Request timeout in seconds |
| `reasoning_effort` | `Optional[str]` | `none`, `low`, `medium`, or `high`. Dependent on provider. |
| `extra` | `Dict[str, Any]` | Provider-specific options passed through to LiteLLM |
### History Management
`LlmAgent` exposes a `history` attribute for structured control over the conversation history the LLM sees.
**Adding entries:**
```python theme={null}
# Append a user note (role="user" is the default)
agent.history.add_entry("The user prefers formal language.")
# Insert before a specific event
agent.history.add_entry("Context about the caller.", before=some_event)
```
**Replacing history segments:**
```python theme={null}
# Replace the entire history
agent.history.update(new_events)
# Replace everything from `start` onward
agent.history.update(new_events, start=some_event)
# Replace a specific segment
agent.history.update(new_events, start=start_event, end=end_event)
```
### Per-Turn Overrides
`process()` accepts keyword arguments that apply to just that turn without mutating the agent:
```python theme={null}
# Higher temperature for just this turn
await agent.process(env, event, config=LlmConfig(temperature=0.9))
# Swap a specific tool for one turn
await agent.process(env, event, tools=[custom_lookup_tool])
# Inject ephemeral context
await agent.process(env, event, context="The user is a VIP customer.")
# Completely override history for one turn
await agent.process(env, event, history=custom_history_list)
```
Only explicitly set `LlmConfig` fields take effect — unset fields fall through to the agent's stored config.
To change tools permanently (e.g., enabling `end_call` after a certain point), modify `agent.tools` directly instead of using per-turn overrides.
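For example (a sketch; `form_complete` is an illustrative flag from your own logic):
```python theme={null}
from line.llm_agent import end_call

# Permanently enable end_call once the form has been completed
if form_complete:
    agent.tools = [*agent.tools, end_call]
```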
***
## Controlling the Conversational Loop
Use **event filters** to control when your agent’s `process` method runs, and which events can interrupt it.
### Default Behavior
```python theme={null}
# Agent processes these events:
run_filter = [CallStarted, UserTurnEnded, CallEnded]
# These events interrupt the agent:
cancel_filter = [UserTurnStarted]
```
This means the agent greets on call start, responds when the user finishes speaking, and can be interrupted.
### Customizing Filters
Return a tuple from `get_agent` to override defaults:
```python theme={null}
from line.events import CallStarted, UserTurnEnded, UserTurnStarted, CallEnded
async def get_agent(env, call_request):
agent = LlmAgent(...)
# Customize behavior
run_filter = [CallStarted, UserTurnEnded, CallEnded]
cancel_filter = [UserTurnStarted]
return (agent, run_filter, cancel_filter)
```
### Common Customizations
**More responsive (process partial transcriptions):**
```python theme={null}
from line.events import CallStarted, UserTurnEnded, UserTextSent, CallEnded
run_filter = [CallStarted, UserTurnEnded, UserTextSent, CallEnded]
cancel_filter = [UserTurnStarted]
```
This makes your agent start processing before the user finishes speaking, creating a more responsive experience.
**Uninterruptible turns:**
If you want a single message to complete without being interrupted by the user, mark the output as `interruptible=False` when sending it with `AgentSendText`.
```python theme={null}
from line.events import AgentSendText
yield AgentSendText(
text="Before we continue, I need to share a quick disclaimer.",
interruptible=False,
)
```
**Custom logic with functions:**
```python theme={null}
from datetime import datetime
from line.events import CallStarted, CallEnded, UserTurnEnded, UserTurnStarted

def business_hours_only(event):
    hour = datetime.now().hour
    if isinstance(event, (CallStarted, CallEnded)):
        return True
    return isinstance(event, UserTurnEnded) and 9 <= hour < 17

async def get_agent(env, call_request):
    agent = LlmAgent(...)
    return (agent, business_hours_only, [UserTurnStarted])
```
For advanced patterns like guardrails, routing, and agent wrappers, see [Advanced Patterns](./patterns#agent-wrappers).
***
## Handling Incoming Calls
When a call arrives, you can inspect caller information and configure how your agent responds before it starts.
1. A call arrives from a web client or telephony provider
2. Your `pre_call_handler` receives a `CallRequest` with caller details
3. You return configuration (voice, language) or reject the call
4. Your `get_agent` function creates an agent using the enriched request
### Parsing the CallRequest
Contains information about the incoming call:
| Field | Type | Description |
| --------------- | ---------------- | ----------------------------------------------- |
| `call_id` | `str` | Unique identifier for the call |
| `from_` | `str` | Caller identifier (phone number or client ID) |
| `to` | `str` | Called number or agent ID |
| `agent_call_id` | `str` | Agent call ID for logging/correlation |
| `metadata` | `Optional[dict]` | Custom data passed from your client application |
| `agent` | `AgentConfig` | Prompts configured in Playground or via API |
The `agent` field contains an `AgentConfig` with:
| Field | Type | Description |
| --------------- | --------------- | ------------------------------------------------------------------ |
| `system_prompt` | `Optional[str]` | System prompt configured in Playground or via the Calls API |
| `introduction` | `Optional[str]` | Introduction message configured in Playground or via the Calls API |
### Returning a PreCallResult
Use `pre_call_handler` to set voice, language, or reject calls before your agent starts:
```python theme={null}
from line.voice_agent_app import CallRequest, PreCallResult, VoiceAgentApp
async def pre_call_handler(call_request: CallRequest):
return PreCallResult(
metadata={"tier": "premium"}, # Merged into call_request.metadata
config={
"tts": {
"voice_id": "a0e99841-438c-4a64-b679-ae501e7d6091",
"model": "sonic-3",
"language": "en",
}
}
)
app = VoiceAgentApp(get_agent=get_agent, pre_call_handler=pre_call_handler)
```
Your client application can pass metadata (user ID, language preference, account tier) in the call request. Your `pre_call_handler` reads this and configures TTS/STT accordingly.
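For example, a handler might pick the voice language from client-supplied metadata. This is a sketch: it assumes a `language` metadata key and that STT options live under an `stt` key alongside `tts`:
```python theme={null}
async def pre_call_handler(call_request: CallRequest):
    lang = (call_request.metadata or {}).get("language", "en")
    return PreCallResult(
        config={
            "tts": {"model": "sonic-3", "language": lang},
            "stt": {"language": lang},
        }
    )
```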
#### Configuration Options
**TTS Options:**
| Option | Type | Description |
| ----------------------- | ------ | ---------------------------------------------------------------------------------------- |
| `voice_id` | string | Voice identifier (UUID) |
| `model` | string | TTS model (`sonic-3`, `sonic-turbo`) |
| `language` | string | Language code (`en`, `es`, `hi`, etc.) |
| `pronunciation_dict_id` | string | [Custom pronunciation dictionary](/build-with-cartesia/sonic-3/custom-pronunciations) ID |
**STT Options:**
| Option | Type | Description |
| ---------- | ------ | ------------------------------------ |
| `language` | string | Language code for speech recognition |
#### Rejecting Calls
Return `None` to reject a call with a 403 status:
```python theme={null}
async def pre_call_handler(call_request: CallRequest):
if is_blocked(call_request.from_):
return None # Rejects with 403
return PreCallResult()
```
#### Custom Pronunciations
Use a [pronunciation dictionary](/build-with-cartesia/sonic-3/custom-pronunciations) to control how specific words are spoken:
```python theme={null}
async def pre_call_handler(call_request: CallRequest):
return PreCallResult(
config={
"tts": {
"voice_id": "a0e99841-438c-4a64-b679-ae501e7d6091",
"model": "sonic-3",
"pronunciation_dict_id": "your-dict-id",
}
}
)
```
### Accessing call metadata in your Agent logic
The `CallRequest` is available in `get_agent`:
```python theme={null}
async def get_agent(env, call_request):
# Log call information
logger.info(f"Call {call_request.call_id} from {call_request.from_}")
# Access metadata passed from your application (or added in pre_call_handler)
customer_id = call_request.metadata.get("customer_id") if call_request.metadata else None
customer_name = call_request.metadata.get("customer_name") if call_request.metadata else None
# Build a personalized system prompt using metadata
base_prompt = call_request.agent.system_prompt or "You are a helpful customer service agent."
if customer_id:
base_prompt += f"\n\nCurrent customer ID: {customer_id}"
if customer_name:
base_prompt += f"\nCustomer name: {customer_name}"
return LlmAgent(
model="gpt-5-nano",
api_key=os.getenv("OPENAI_API_KEY"),
config=LlmConfig(
system_prompt=base_prompt,
introduction=call_request.agent.introduction,
),
)
```
`LlmConfig.from_call_request()` handles the priority chain automatically:
1. `CallRequest.agent.system_prompt` value (if set)
2. Your fallback value (if provided)
3. SDK default
```python theme={null}
async def get_agent(env, call_request):
return LlmAgent(
model="anthropic/claude-haiku-4-5-20251001",
api_key=os.getenv("ANTHROPIC_API_KEY"),
tools=[end_call],
config=LlmConfig.from_call_request(
call_request,
fallback_system_prompt="You are a sales assistant.",
fallback_introduction="Hi! How can I help with your purchase?",
temperature=0.7, # Additional LlmConfig options
),
)
```
Using `CallRequest` lets you iterate on system prompts from the Playground instantly, while code handles the technical configuration and fallback defaults.
### Letting The User Speak First
Set `introduction` to an empty string to wait for the user to speak first:
```python theme={null}
config=LlmConfig.from_call_request(
call_request,
fallback_system_prompt=SYSTEM_PROMPT,
fallback_introduction="",
)
```
***
## Custom Agent Function
For advanced use cases, you can build agents from scratch as functions:
```python theme={null}
from line.events import UserTurnEnded, AgentSendText, CallStarted
async def my_agent(env, event):
if isinstance(event, CallStarted):
yield AgentSendText(text="Hello! How can I help?")
elif isinstance(event, UserTurnEnded):
user_text = event.content[0].content if event.content else ""
yield AgentSendText(text=f"You said: {user_text}")
```
## Custom Agent Class
Or as classes with state:
```python theme={null}
class GreetingAgent:
def __init__(self, greeting: str):
self.greeting = greeting
self.greeted = False
async def process(self, env, event):
if isinstance(event, CallStarted) and not self.greeted:
yield AgentSendText(text=self.greeting)
self.greeted = True
```
Most developers can use `LlmAgent` with tools rather than building custom agents from scratch! Custom agents are powerful when you need full control over the event processing logic without LLM reasoning.
# Events
Source: https://docs.cartesia.ai/line/sdk/events
Events are typed Python objects for communication between your agent and the Cartesia platform. Your agent receives **input events** from the harness and yields **output events** to control the conversation.
To learn which events trigger your agent and how to customize this behavior (e.g., responding to DTMF, preventing interruptions), see [Controlling the Conversational Loop](/line/sdk/agents#controlling-the-conversational-loop).
## Input Events
Input events are received by your agent from the Cartesia harness. All input events include an optional `history` field containing the complete conversation history. When `history` is `None`, the event is being used within a history list; when `history` contains a list, the event has the full conversation context attached.
### Call Lifecycle
| Event | Description |
| ------------- | ---------------------- |
| `CallStarted` | The call has connected |
| `CallEnded` | The call has ended |
```python theme={null}
from line.events import CallStarted, CallEnded
async def process(self, env, event):
if isinstance(event, CallStarted):
yield AgentSendText(text="Hello! How can I help?")
elif isinstance(event, CallEnded):
# Perform cleanup
pass
```
### User Turn Events
| Event | Description |
| ----------------- | --------------------------------------------------------------- |
| `UserTurnStarted` | The user started speaking (triggers interruption by default) |
| `UserTurnEnded` | The user finished speaking (triggers new agent turn by default) |
| `UserTextSent` | User text content (within `UserTurnEnded.content`) |
| `UserDtmfSent` | User pressed a DTMF button |
```python theme={null}
from line.events import UserTurnEnded, UserTextSent
if isinstance(event, UserTurnEnded):
for content in event.content:
if isinstance(content, UserTextSent):
user_message = content.content
```
### Agent Turn Events (in history)
| Event | Description |
| ------------------ | -------------------------- |
| `AgentTurnStarted` | Agent started its turn |
| `AgentTurnEnded` | Agent finished its turn |
| `AgentTextSent` | Agent text that was spoken |
| `AgentDtmfSent` | DTMF tone sent by agent |
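These events appear in the conversation history attached to input events. As a minimal sketch, a custom agent could scan that history for the most recent `AgentTextSent` to recall what it last said:
```python theme={null}
from typing import Optional

from line.events import AgentTextSent

def last_agent_utterance(event) -> Optional[str]:
    # Walk the attached history from newest to oldest and return the most
    # recent text the agent spoke, if any.
    for past_event in reversed(event.history or []):
        if isinstance(past_event, AgentTextSent):
            return past_event.content
    return None
```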
### Handoff Event
| Event | Description |
| ---------------- | ------------------------------------- |
| `AgentHandedOff` | Control transferred to a handoff tool |
### Custom Event
| Event | Description |
| ---------------- | ------------------------------------------------------------------------------------------------------------------ |
| `UserCustomSent` | Custom metadata sent from the client via the WebSocket [`custom` event](/line/integrations/calls-api#custom-event) |
Received when your client application sends a `custom` WebSocket event to the call stream. The event carries a `metadata` dict with whatever key-value pairs the client included:
```python theme={null}
from line.events import UserCustomSent
async def process(self, env, event):
if isinstance(event, UserCustomSent):
action = event.metadata.get("action")
# React to client-side triggers (e.g., button clicks, form submissions)
```
***
## Output Events
Output events are yielded by your agent to control the conversation.
### Speech
You can choose to send messages with `AgentSendText`.
```python theme={null}
from line.events import AgentSendText
yield AgentSendText(text="Hello! How can I help you today?")
```
By default, users can interrupt the agent. If you have a disclaimer or another important message that should be uninterruptible, set the `interruptible` flag to `False`.
```python theme={null}
from line.events import AgentSendText
yield AgentSendText(
text="Before we continue, I need to share a quick disclaimer.",
interruptible=False,
)
```
### Call Control
```python theme={null}
from line.events import AgentEndCall, AgentTransferCall, AgentSendDtmf
# End the call
yield AgentEndCall()
# Transfer to another number
yield AgentTransferCall(target_phone_number="+14155551234")
# Send DTMF tone
yield AgentSendDtmf(button="1")
```
### Dynamic Configuration
Update call settings (voice, pronunciation, language) mid-conversation using `AgentUpdateCall`:
```python theme={null}
from line.events import AgentUpdateCall
# Change voice
yield AgentUpdateCall(voice_id="5ee9feff-1265-424a-9d7f-8e4d431a12c7")
# Change pronunciation dictionary
yield AgentUpdateCall(pronunciation_dict_id="dict-123")
# Change language
yield AgentUpdateCall(language="es")
# Update multiple settings at once
yield AgentUpdateCall(
voice_id="spanish-voice-id",
pronunciation_dict_id="spanish-dict-id",
language="es"
)
```
**AgentUpdateCall Parameters:**
| Field | Type | Description |
| ----------------------- | ------------------------ | --------------------------------------------------------------------------------- |
| `type` | `Literal["update_call"]` | Event type identifier (automatically set) |
| `voice_id` | `Optional[str]` | Updates the agent's voice |
| `pronunciation_dict_id` | `Optional[str]` | Updates the pronunciation dictionary |
| `language` | `Optional[str]` | Updates the language used on speech-to-text (STT) and text-to-speech (TTS) models |
All fields are optional—only set fields are updated.
### Tool Events
These are emitted by `LlmAgent` to track tool execution:
```python theme={null}
from line.events import AgentToolCalled, AgentToolReturned
# Emitted when LLM calls a tool
yield AgentToolCalled(
tool_call_id="call_123",
tool_name="get_weather",
tool_args={"city": "San Francisco"}
)
# Emitted when tool returns
yield AgentToolReturned(
tool_call_id="call_123",
tool_name="get_weather",
tool_args={"city": "San Francisco"},
result="72°F and sunny"
)
```
### Logging
```python theme={null}
from line.events import LogMetric, LogMessage
# Log a metric
yield LogMetric(name="response_time_ms", value=150)
# Log a message
yield LogMessage(
name="order_lookup",
level="info",
message="Found order #12345",
metadata={"order_id": "12345"}
)
```
### Custom Events
Send arbitrary metadata from your agent to the harness:
```python theme={null}
from line.events import AgentSendCustom
yield AgentSendCustom(metadata={"action": "show_form", "form_id": "checkout"})
```
Pair with [`UserCustomSent`](#custom-event) for bidirectional metadata exchange.
### Voice & Language Control
Change voice or speech recognition language mid-call:
```python theme={null}
from line.events import AgentUpdateCall
# Switch to Spanish voice and speech recognition
yield AgentUpdateCall(voice_id="spanish-voice-id", language="es")
# Enable multilingual auto-detect mode
yield AgentUpdateCall(language="multilingual")
```
The `language` field sets the ASR (speech recognition) language. Pass any language code supported by [Ink STT](/build-with-cartesia/stt-models), or `"multilingual"` for automatic language detection.
***
## Event History
All input events include an optional `history` field containing the conversation history. When `history` is `None`, the event is inside a history list; when it contains a list, full conversation context is attached. `LlmAgent` handles this automatically—you only need to understand history if building custom agents.
### Accessing History
```python theme={null}
from line.events import UserTextSent, AgentTextSent
async def process(self, env, event):
for past_event in event.history:
if isinstance(past_event, UserTextSent):
print(f"User said: {past_event.content}")
elif isinstance(past_event, AgentTextSent):
print(f"Agent said: {past_event.content}")
```
Events in the history list have `history=None` to avoid redundant nesting. The event types are the same as regular input events:
| Event Type | Description |
| ------------------ | ------------------------- |
| `CallStarted` | Call began |
| `UserTurnStarted` | User started speaking |
| `UserTextSent` | User's transcribed speech |
| `UserDtmfSent` | User's DTMF button press |
| `UserTurnEnded` | User finished speaking |
| `AgentTurnStarted` | Agent started responding |
| `AgentTextSent` | Agent's spoken text |
| `AgentDtmfSent` | Agent's DTMF tone |
| `AgentTurnEnded` | Agent finished responding |
| `CallEnded` | Call ended |
`LlmAgent` automatically converts the event history to LLM messages:
* **User messages**: From `UserTextSent` events
* **Assistant messages**: From `AgentTextSent` events
* **Tool calls**: From `AgentToolCalled` and `AgentToolReturned` events
This means the LLM sees full context including previous tool calls and results, enabling it to reference that information without making redundant API calls.
If building a custom agent (not using `LlmAgent`), you can use history for context, summarization, or pattern detection:
```python theme={null}
class CustomAgent:
async def process(self, env, event):
user_turns = sum(
1 for e in event.history
if isinstance(e, UserTurnEnded)
)
if user_turns > 5:
yield AgentSendText(text="We've been chatting for a while! Is there anything else I can help with?")
```
# SDK Overview
Source: https://docs.cartesia.ai/line/sdk/overview
The [Line SDK](https://github.com/cartesia-ai/line/) is a Python framework for building voice agents. It handles audio infrastructure, speech recognition, and conversation flow.
```bash theme={null}
uv add cartesia-line
```
New to Line? Start with the [Quickstart](/line/start-building/quickstart) to build and deploy your first agent.
## Core Concepts
| Component | Purpose |
| --------------------------------------------------- | ----------------------------------------------------------------------- |
| [`Agent`](./agents) | Controls the input/output event loop via a `process` method |
| [`LlmAgent`](./agents#llmagent) | Built-in agent that wraps 100+ LLM providers via LiteLLM |
| [`Tools`](./tools) | Functions your agent can call—database lookups, handoffs, web search |
| [`VoiceAgentApp`](./agents#handling-incoming-calls) | HTTP server that connects your agent to Cartesia's audio infrastructure |
```python theme={null}
import os
from line.llm_agent import LlmAgent, LlmConfig, end_call
from line.voice_agent_app import VoiceAgentApp
async def get_agent(env, call_request):
return LlmAgent(
model="anthropic/claude-haiku-4-5-20251001",
api_key=os.getenv("ANTHROPIC_API_KEY"),
tools=[end_call],
config=LlmConfig(
system_prompt="You are a helpful assistant.",
introduction="Hello! How can I help you today?",
),
)
app = VoiceAgentApp(get_agent=get_agent)
```
The agent speaks the `introduction` when a call starts, then responds to whatever the user says using the LLM.
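As in the [Quickstart](/line/start-building/quickstart), you can run this server locally by adding the standard entrypoint to the same file:
```python theme={null}
if __name__ == "__main__":
    app.run()  # serves the agent over HTTP; the Quickstart uses the PORT environment variable to pick the port
```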
## Features
* **Real-time interruption support** — Handles audio interruptions and turn-taking out-of-the-box.
* **Tool calling** — Connect to databases, APIs, and external services
* **Multi-agent handoffs** — Route conversations between specialized agents
* **Web search** — Built-in tool for real-time information lookup
## Add Capabilities
### Look up information
```python theme={null}
from typing import Annotated
from line.llm_agent import loopback_tool
@loopback_tool
async def get_order_status(ctx, order_id: Annotated[str, "The order ID"]):
"""Look up an order's current status."""
order = await db.get_order(order_id)
return f"Order {order_id} is {order.status}"
```
### Handoff to another agent
```python theme={null}
from line.llm_agent import LlmAgent, LlmConfig, agent_as_handoff, end_call
spanish_agent = LlmAgent(
model="gpt-5-nano",
api_key=os.getenv("OPENAI_API_KEY"),
tools=[end_call],
config=LlmConfig(
system_prompt="You speak only in Spanish.",
introduction="¡Hola! ¿Cómo puedo ayudarte?",
),
)
main_agent = LlmAgent(
model="anthropic/claude-haiku-4-5-20251001",
api_key=os.getenv("ANTHROPIC_API_KEY"),
tools=[
end_call,
agent_as_handoff(
spanish_agent,
name="transfer_to_spanish",
description="Transfer when user requests Spanish.",
),
],
config=LlmConfig(...),
)
```
### Search the web
```python theme={null}
from line.llm_agent import end_call, web_search
agent = LlmAgent(
tools=[end_call, web_search], # Add built-in web search
...
)
```
See [Tools](./tools) for the full guide.
## Code Examples
| Example | Description |
| ----------------------------------------------------------------------------------------- | -------------------------------------------------- |
| [Basic Chat](https://github.com/cartesia-ai/line/tree/main/examples/basic_chat) | Simple conversational agent |
| [Chat Supervisor](https://github.com/cartesia-ai/line/tree/main/examples/chat_supervisor) | Fast chat model with powerful reasoning escalation |
| [Form Filler](https://github.com/cartesia-ai/line/tree/main/examples/form_filler) | Collect structured data via conversation |
| [Multi-Agent](https://github.com/cartesia-ai/line/tree/main/examples/transfer_agent) | Hand off between specialized agents |
### Integrations
| Integration | Description |
| --------------------------------------------------------------------------------------------- | ------------------------ |
| [Exa Web Research](https://github.com/cartesia-ai/line/tree/main/example_integrations/exa) | Real-time web search |
| [Browserbase](https://github.com/cartesia-ai/line/tree/main/example_integrations/browserbase) | Fill web forms via voice |
## Next Steps
Configure prompts, LLMs, and conversation flow
Add custom tools and multi-agent handoffs
# Advanced Patterns
Source: https://docs.cartesia.ai/line/sdk/patterns
Patterns for production voice agents: observability, tool design, multi-agent systems, and guardrails.
## Complete Example: Multi-Agent Customer Service
This example combines prompting, all three tool types, and multi-agent handoffs:
```python theme={null}
import os
from typing import Annotated
from line import CallRequest
from line.llm_agent import (
LlmAgent, LlmConfig, loopback_tool, passthrough_tool,
agent_as_handoff, end_call
)
from line.events import AgentSendText, AgentTransferCall
from line.voice_agent_app import AgentEnv, VoiceAgentApp
# Loopback tool: Fetch order info for LLM to contextualize
@loopback_tool
async def get_order_status(ctx, order_id: Annotated[str, "The order ID"]):
"""Look up order status by ID."""
order = await db.get_order(order_id)
return f"Order {order_id}: {order.status}, delivers {order.delivery_date}"
# Passthrough tool: Deterministic transfer action
@passthrough_tool
async def transfer_to_human(ctx):
"""Transfer to a human agent."""
yield AgentSendText(text="Let me connect you with a team member who can help further.")
yield AgentTransferCall(target_phone_number="+18005551234")
SYSTEM_PROMPT = """You are a friendly customer service agent for Acme Corp.
You can:
- Look up order status using get_order_status
- Transfer to a human agent using transfer_to_human
- Transfer to Spanish support using transfer_to_spanish
- End calls politely using end_call
Rules:
- Always confirm the order ID before looking it up
- Offer to transfer to a human if you can't resolve the issue
- Transfer to Spanish support if the user speaks Spanish or requests it
- Be empathetic and professional
"""
async def get_agent(env: AgentEnv, call_request: CallRequest):
# Spanish-speaking specialist agent
spanish_agent = LlmAgent(
model="gpt-5-nano",
api_key=os.getenv("OPENAI_API_KEY"),
tools=[get_order_status, transfer_to_human, end_call],
config=LlmConfig(
system_prompt="Eres un agente de servicio al cliente amigable para Acme Corp. Habla solo en español.",
introduction="¡Hola! Gracias por llamar a Acme Corp. ¿Cómo puedo ayudarte hoy?",
),
)
# Main English-speaking agent with handoff capability
return LlmAgent(
model="anthropic/claude-haiku-4-5-20251001",
api_key=os.getenv("ANTHROPIC_API_KEY"),
tools=[
get_order_status,
transfer_to_human,
agent_as_handoff(
spanish_agent,
handoff_message="Transferring you to our Spanish-speaking team...",
name="transfer_to_spanish",
description="Transfer to Spanish support when user speaks Spanish or requests it.",
),
end_call,
],
config=LlmConfig(
system_prompt=SYSTEM_PROMPT,
introduction="Hi! Thanks for calling Acme Corp. How can I help you today?",
),
)
app = VoiceAgentApp(get_agent=get_agent)
if __name__ == "__main__":
app.run()
```
***
## Observability
### Log Metrics
Track performance and business metrics:
```python theme={null}
from line.events import LogMetric, LogMessage
@loopback_tool
async def process_order(ctx, order_id: Annotated[str, "Order ID"]):
"""Process a customer order."""
import time
start = time.time()
result = await api.process_order(order_id)
# Log timing metric
yield LogMetric(name="order_processing_ms", value=(time.time() - start) * 1000)
# Log business event
yield LogMessage(
name="order_processed",
level="info",
message=f"Processed order {order_id}",
metadata={"status": result.status}
)
return f"Order {order_id} processed: {result.status}"
```
### Built-in LLM Agent Metrics
`LlmAgent` automatically emits three timing metrics on every turn — no code needed:
| Metric | Description |
| -------------------- | -------------------------------------------------------------------------------------- |
| `llm_first_chunk_ms` | Time from start of response generation to first chunk (text or tool call) from the LLM |
| `llm_first_text_ms` | Time from start of response generation to first text chunk |
| `agent_turn_ms` | Total agent processing time for the turn |
***
## Tool Patterns
### Validation in Tools
Validate inputs before processing:
```python theme={null}
@loopback_tool
async def book_appointment(
ctx,
date: Annotated[str, "Date in YYYY-MM-DD format"],
time: Annotated[str, "Time in HH:MM format"]
):
"""Book an appointment."""
from datetime import datetime
try:
dt = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M")
except ValueError:
return "Invalid date or time format. Please use YYYY-MM-DD and HH:MM."
if dt < datetime.now():
return "Cannot book appointments in the past."
# Proceed with booking
return f"Appointment booked for {dt.strftime('%B %d at %I:%M %p')}"
```
### Async Operations in Tools
Handle long-running operations with proper timeout handling:
```python theme={null}
import asyncio
@loopback_tool
async def search_inventory(ctx, query: Annotated[str, "Search query"]):
"""Search inventory with timeout protection."""
try:
result = await asyncio.wait_for(
inventory_api.search(query),
timeout=5.0
)
return f"Found {len(result.items)} items matching '{query}'"
except asyncio.TimeoutError:
return "Search is taking longer than expected. Please try a more specific query."
```
### Error Handling
Handle errors gracefully in tools:
```python theme={null}
@loopback_tool
async def get_account_info(ctx, account_id: Annotated[str, "Account ID"]):
"""Look up account information."""
try:
account = await api.get_account(account_id)
return f"Account {account_id}: Balance ${account.balance:.2f}"
except AccountNotFoundError:
return f"Account {account_id} not found."
except Exception as e:
logger.error(f"Error fetching account: {e}")
return "Sorry, I couldn't retrieve that account information right now."
```
***
## Agent Wrappers
Agent wrappers add cross-cutting behavior (logging, validation, routing) without modifying the underlying agent.
### Guardrails: Safety and Content Filtering
Wrappers are ideal for implementing guardrails that filter unsafe content in both directions:
```python theme={null}
class GuardrailsAgent:
def __init__(self, inner_agent, safety_api):
self.inner = inner_agent
self.safety_api = safety_api
async def process(self, env, event):
# Pre-processing: Check user input for unsafe content
if isinstance(event, UserTurnEnded):
user_text = event.content[0].content if event.content else ""
if await self.safety_api.is_unsafe(user_text):
yield AgentSendText(text="I'm here to help with appropriate requests. Let's keep our conversation respectful.")
return
# Post-processing: Check agent output for safety issues
async for output in self.inner.process(env, event):
if isinstance(output, AgentSendText):
if await self.safety_api.is_unsafe(output.text):
yield LogMessage(
name="safety_violation",
level="warning",
message=f"Blocked unsafe output: {output.text[:100]}..."
)
yield AgentSendText(text="I apologize, but I can't provide that information.")
continue
yield output
```
Common guardrail patterns:
* Content safety filtering (toxicity, hate speech, PII)
* Rate limiting and abuse prevention
* Compliance checks (HIPAA, financial regulations)
* Brand safety (off-brand responses)
### Routing Between Multiple Agents
Dynamically switch between specialized agents based on conversation context:
```python theme={null}
class RouterAgent:
def __init__(self, default_agent, specialists: dict):
self.default = default_agent
self.specialists = specialists
self.current = default_agent
async def process(self, env, event):
# Switch agent based on user input
if isinstance(event, UserTurnEnded):
user_text = event.content[0].content if event.content else ""
if "billing" in user_text.lower():
self.current = self.specialists.get("billing", self.default)
elif "technical" in user_text.lower():
self.current = self.specialists.get("technical", self.default)
async for output in self.current.process(env, event):
yield output
```
Use with `LlmAgent`:
```python theme={null}
async def get_agent(env, call_request):
return RouterAgent(
default_agent=LlmAgent(
model="gpt-5-nano",
api_key=os.getenv("OPENAI_API_KEY"),
config=LlmConfig(system_prompt="You are a helpful assistant..."),
),
specialists={
"billing": LlmAgent(
model="gpt-5-nano",
api_key=os.getenv("OPENAI_API_KEY"),
config=LlmConfig(system_prompt="You are a billing specialist..."),
),
"technical": LlmAgent(
model="anthropic/claude-haiku-4-5-20251001",
api_key=os.getenv("ANTHROPIC_API_KEY"),
config=LlmConfig(system_prompt="You are a technical support specialist..."),
),
}
)
```
### Best Practices
Keep wrappers focused on a single responsibility. Use `async for` and `yield` to preserve streaming. Stack simple wrappers rather than building one complex one.
```python theme={null}
# Composable wrappers
agent = LoggingWrapper(
ValidationWrapper(
LlmAgent(...)
)
)
```
***
## Example Implementations
Full working examples demonstrating these patterns:
| Example | Pattern | Description |
| --------------------------------------------------------------------------------------------- | ------------------- | ------------------------------------------------------ |
| [Form Filler](https://github.com/cartesia-ai/line/tree/main/examples/form_filler) | Stateful tools | Walk users through a YAML-defined form with validation |
| [Multi-Agent Transfer](https://github.com/cartesia-ai/line/tree/main/examples/transfer_agent) | `agent_as_handoff` | English/Spanish agent handoff |
| [Chat Supervisor](https://github.com/cartesia-ai/line/tree/main/examples/chat_supervisor) | Background research | Separate agents for talking and longer-thinking |
# Tools
Source: https://docs.cartesia.ai/line/sdk/tools
Tools let your agent perform actions and retrieve information. The SDK supports three tool paradigms that differ in how they affect conversation flow.
## Defining Tools
Any properly annotated function can be a tool. The SDK uses the function's docstring as the description and type annotations for parameters:
```python theme={null}
from typing import Annotated
async def get_weather(
ctx,
city: Annotated[str, "The city to check weather for"],
units: Annotated[str, "celsius or fahrenheit"] = "fahrenheit"
):
"""
Look up the current weather in a given city.
"""
return f"72°F and sunny in {city}"
```
The first parameter of every tool must be `ctx` (the tool context). This provides access to conversation state and is required for forward compatibility. Your tool parameters follow after `ctx`.
***
## Tool Types
Plain functions passed to `tools` are automatically wrapped as loopback tools. Use decorators (`@loopback_tool`, `@passthrough_tool`, `@handoff_tool`) for explicit control.
### Loopback Tools (`@loopback_tool`)
The default behavior. The tool's result is sent back to the LLM, which can then continue generating a response.
```python theme={null}
from line.llm_agent import loopback_tool
@loopback_tool
async def get_account_balance(ctx, account_id: Annotated[str, "The account ID"]):
"""Look up the balance for a customer account."""
balance = await api.get_balance(account_id)
return f"${balance:.2f}"
```
**Use for:** Information retrieval, calculations, API queries.
### Passthrough Tools (`@passthrough_tool`)
Output events go directly to the user, bypassing the LLM. Use for deterministic actions.
```python theme={null}
from line.llm_agent import passthrough_tool
from line.events import AgentSendText, AgentEndCall
@passthrough_tool
async def end_call_with_message(ctx, message: Annotated[str, "Goodbye message"]):
"""End the call with a custom goodbye message."""
yield AgentSendText(text=message)
yield AgentEndCall()
```
**Use for:** Call control (`EndCall`, `TransferCall`, `SendDtmf`), deterministic responses.
### Handoff Tools (`@handoff_tool`)
Transfers control to another handler. All future events are routed to the handoff target instead of the original agent.
```python theme={null}
from typing import Annotated
from line.llm_agent import handoff_tool
from line.events import AgentHandedOff, AgentSendText, UserTurnEnded, AgentEndCall
@handoff_tool
async def run_satisfaction_survey(
ctx,
customer_name: Annotated[str, "The customer's name"],
event
):
"""Hand off to a customer satisfaction survey at the end of the call."""
if isinstance(event, AgentHandedOff):
# First call - send introduction
yield AgentSendText(
text=f"Thank you for your call, {customer_name}. "
"Please stay on the line for a brief satisfaction survey. "
"On a scale of 1 to 5, how would you rate your experience today?"
)
return
# Subsequent calls - handle survey responses
if isinstance(event, UserTurnEnded):
user_response = event.content[0].content if event.content else ""
yield AgentSendText(text=f"You rated us {user_response}. Thank you for your feedback!")
yield AgentEndCall()
```
**Use for:** Custom multi-step flows, specialized handlers with their own logic.
When using a handoff tool, the `event` parameter receives different values depending on timing:
* **First call**: `event` is `AgentHandedOff` — use this to send a transition message
* **Subsequent calls**: `event` is the actual `InputEvent` (`UserTurnEnded`, etc.)
Once a handoff occurs, the original agent no longer receives events. The handoff tool function handles all future conversation turns.
To hand off to another `LlmAgent`, use the [`agent_as_handoff`](#agent_as_handoff) helper instead of writing a raw `@handoff_tool`. It handles the delegation automatically.
***
## Built-in Tools
```python theme={null}
from line.llm_agent import end_call, send_dtmf, transfer_call, web_search
agent = LlmAgent(
model="anthropic/claude-haiku-4-5-20251001",
api_key=os.getenv("ANTHROPIC_API_KEY"),
tools=[end_call, send_dtmf, transfer_call, web_search],
config=LlmConfig(...),
)
```
| Tool | Description | When to Use |
| --------------- | ------------------------------------------ | ------------------------------------------------------------- |
| `end_call`      | Ends the call                              | User says "goodbye" or the agent's objective has been met     |
| `send_dtmf`     | Sends a DTMF tone                          | Navigating IVR phone menus or entering extensions              |
| `transfer_call` | Transfers to another number (E.164 format) | Escalating to human agents, routing to departments            |
| `web_search`    | Searches the web for real-time info        | Current events, live prices, recent news the LLM doesn't know |
**Examples:**
```python theme={null}
# End call: Let the LLM decide when conversation is complete
tools=[end_call] # LLM calls this when user says "thanks, bye!"
# Transfer: Route to human support
tools=[transfer_call] # LLM calls transfer_call(target_phone_number="+18005551234")
# Web search with custom context size
tools=[web_search(search_context_size="high")] # More context for complex queries
```
### `end_call`
Ends the current call and disconnects. The actual hangup occurs after the agent's final speech completes, so the user hears the full goodbye message before disconnection.
```python theme={null}
from line.llm_agent import LlmAgent, LlmConfig, end_call
agent = LlmAgent(
model="anthropic/claude-haiku-4-5-20251001",
api_key=os.getenv("ANTHROPIC_API_KEY"),
tools=[end_call],
config=LlmConfig(...),
)
```
By default, `end_call` uses a conservative policy that only ends the call when:
* The user's objective is fully complete
* The user explicitly says goodbye
* The agent has said a natural goodbye
#### Custom Description
We recommend providing a custom description tailored to your use case. The description **fully replaces** the default—it is not appended—so include complete instructions with explicit do/don't guidance.
```python theme={null}
from line.llm_agent import end_call
# Restaurant reservation agent
tools=[end_call(description="""Ends the call and disconnects.
Call when ALL of the following are true:
- The reservation is confirmed with date, time, party size, and name.
- You have repeated the reservation details back to the guest.
- The guest confirms the details are correct or says goodbye.
Do not call when:
- The guest asks to modify the reservation.
- Details are missing or unconfirmed.
- The guest says 'okay' or 'thanks' without an explicit goodbye.
If unsure, ask: 'Is there anything else I can help you with for your reservation?'
""")]
# Order confirmation agent
tools=[end_call(description="""Ends the call and disconnects.
Call when ALL of the following are true:
- The order is placed and confirmed.
- You have provided the order number and estimated delivery time.
- The customer acknowledges with a goodbye phrase.
Do not call when:
- The customer has questions about their order.
- Payment has not been confirmed.
- The customer says 'got it' without saying goodbye.
""")]
```
| Parameter | Type | Description |
| ------------- | --------------- | ----------------------------------------------------------------------------------------------------------- |
| `description` | `Optional[str]` | Fully replaces the default description. Include complete instructions for when the LLM should end the call. |
### `agent_as_handoff`
Creates a handoff tool from another `Agent`—the easiest way to implement multi-agent workflows.
```python theme={null}
from line.llm_agent import LlmAgent, LlmConfig, agent_as_handoff, end_call, UpdateCallConfig
spanish_agent = LlmAgent(
model="gpt-5-nano",
api_key=os.getenv("OPENAI_API_KEY"),
tools=[end_call],
config=LlmConfig(
system_prompt="You speak only in Spanish.",
introduction="¡Hola! ¿Cómo puedo ayudarte?",
),
)
main_agent = LlmAgent(
model="anthropic/claude-haiku-4-5-20251001",
api_key=os.getenv("ANTHROPIC_API_KEY"),
tools=[
end_call,
agent_as_handoff(
spanish_agent,
handoff_message="Transferring to Spanish support...",
update_call=UpdateCallConfig(
voice_id="spanish-voice-id",
pronunciation_dict_id="spanish-pronunciation-dict-id"
),
name="transfer_to_spanish",
description="Use when user requests Spanish.",
),
],
config=LlmConfig(...),
)
```
| Parameter | Type | Description |
| ----------------- | ---------------------------- | --------------------------------------------------------------------------------------- |
| `agent` | `Agent` | The agent to hand off to |
| `handoff_message` | `Optional[str]` | Message spoken before the handoff |
| `update_call` | `Optional[UpdateCallConfig]` | Optional config to update call settings (voice, pronunciation, language) before handoff |
| `name` | `Optional[str]` | Tool name for the LLM |
| `description` | `Optional[str]` | When the LLM should use this tool |
When called, `agent_as_handoff` automatically sends the handoff message, updates the call settings if specified, triggers the new agent's introduction, and routes all future events to it.
See [Advanced Patterns](/line/sdk/patterns) for a complete multi-agent example with loopback, passthrough, and handoff tools.
***
## Long-Running Tools
By default, tool calls are terminated when the agent is interrupted (though any reasoning and tool call response values already produced are preserved for use in the next generation).
For tools that are expected to take a long time to complete, set `is_background=True`. The tool will continue running in the background until completion regardless of interruptions, then loop back to the LLM to produce a response.
```python theme={null}
from typing import Annotated
from line.llm_agent import loopback_tool
@loopback_tool(is_background=True)
async def search_database(ctx, query: Annotated[str, "Search query"]) -> str:
"""Search the database - may take several seconds."""
results = await slow_database_search(query)
return format_results(results)
@loopback_tool(is_background=True)
async def generate_report(ctx, report_type: Annotated[str, "Type of report"]) -> str:
"""Generate a detailed report - runs in background."""
report = await compile_report(report_type)
return report
```
Background tools are useful when:
* The operation may take longer than typical user patience (e.g., complex searches, report generation)
* You want the user to be able to speak while the operation completes
* The result should be delivered even if the user interrupts with another question
# Agent Builder
Source: https://docs.cartesia.ai/line/start-building/agent-builder
Prototype voice agents in the Playground. Test prompts, configure voices, and deploy in seconds.
## Create your agent
Go to [play.cartesia.ai/agents](https://play.cartesia.ai/agents) and select **Start in Playground**.
Customize your agent's behavior, voice, and greeting.
**System Prompt** — Define your agent's role and guidelines. You can also provide a natural language description of your agent and the platform will generate a structured system prompt.
**Voice** — Choose from Cartesia's voice library. Preview voices before selecting.
**Initial Message** — Set the greeting your agent speaks when calls start. Check **Skip agent introduction** to have the agent wait for the user to speak first.
**Background Sound** — Add ambient audio for call center atmospheres or office environments.
**Preview** changes before publishing.
## Continue building in code
Connect your Playground agent to GitHub to customize with code.
On your agent page, click **Connect to GitHub**. Authorize Cartesia to create a repository.
```bash theme={null}
git clone https://github.com/your-org/your-agent.git
cd your-agent
```
```bash theme={null}
uv pip install .
```
Open `main.py` to add tools, custom logic, or modify the prompt.
Push to deploy your changes.
```bash theme={null}
git push
```
## Next steps
Build agents with the SDK
Prompts, voices, and pre-call configuration
# Quickstart
Source: https://docs.cartesia.ai/line/start-building/quickstart
Build an agent, deploy it, and make your first call within minutes.
## Prerequisites
* A free Cartesia account ([sign up here](https://play.cartesia.ai))
* Python 3.9+
* An LLM API key (Anthropic, OpenAI, Google, etc.)
* [uv](https://docs.astral.sh/uv/) (Python package and project manager)
## Install the CLI
```bash theme={null}
curl -fsSL https://cartesia.sh | sh
cartesia auth login
```
## Install uv
Install [uv](https://docs.astral.sh/uv/), a fast Python package manager to manage dependencies and virtual environments.
```bash theme={null}
curl -LsSf https://astral.sh/uv/install.sh | sh
```
## Create your agent
Create a new project and install dependencies. uv will automatically set up a virtual environment and manage your packages.
```bash theme={null}
uv init my-voice-agent && cd my-voice-agent
uv add cartesia-line
```
Create `main.py`:
```python theme={null}
import os
from line.llm_agent import LlmAgent, LlmConfig, end_call
from line.voice_agent_app import VoiceAgentApp
async def get_agent(env, call_request):
return LlmAgent(
model="anthropic/claude-haiku-4-5-20251001", # Or "gpt-5-nano", "gemini/gemini-2.5-flash", etc.
api_key=os.getenv("ANTHROPIC_API_KEY"),
tools=[end_call],
config=LlmConfig(
system_prompt="You are a helpful assistant.",
introduction="Hello! How can I help you today?",
),
)
app = VoiceAgentApp(get_agent=get_agent)
if __name__ == "__main__":
app.run()
```
## Test locally
Start your agent server.
```bash theme={null}
ANTHROPIC_API_KEY=your-api-key PORT=8000 uv run python main.py
```
In a separate terminal, chat with your agent by running:
```bash theme={null}
cartesia chat 8000
```
This lets you test your agent's reasoning before deploying.
## Deploy
Link your project and deploy.
```bash theme={null}
cartesia init # Choose "Create new" and name your agent
cartesia deploy
```
Your agent deploys in under 30 seconds on Cartesia's managed runtime.
## Set environment variables
Configure your API key for the deployed agent.
```bash theme={null}
cartesia env set ANTHROPIC_API_KEY=your-api-key
```
Or import from a `.env` file:
```bash theme={null}
cartesia env set --from .env
```
## Make a call
Call your agent from your phone.
```bash theme={null}
cartesia call +1XXXXXXXXXX
```
Or visit the [Playground](https://play.cartesia.ai/agents) to call from the web.
## Next steps
Connect databases, APIs, and external services
Customize system prompts and conversation flow
Connect web clients via WebSocket
Build agents visually in the Playground
# LLMs documentation files
Source: https://docs.cartesia.ai/tools/ai/llms-txt
Machine-readable index files for assistants and tooling that ingest Cartesia documentation.
Plain-text, machine-readable exports of the documentation.
Designed for systems that fetch and parse docs over HTTP, such as agents, MCP servers, and crawlers.
## Endpoints
Both endpoints are public over HTTPS and require no API key. Fetch directly inside a tool or pipeline: agents with web fetch or read URL tools, MCP servers, or custom crawlers.
**[llms.txt](https://docs.cartesia.ai/llms.txt)** (default)\
Condensed index aligned with the `llms.txt` convention ([llmstxt.org](https://llmstxt.org/)). Use it when your system can fetch specific docs over HTTP.
* Smaller context
* Faster processing
* Better for retrieval workflows
**[llms-full.txt](https://docs.cartesia.ai/llms-full.txt)**\
Fuller coverage of the docs site. Use it when your system needs broader upfront content in one fetch. Consumes more context and tokens when fed to your LLM.
* More URLs and text
* Higher recall
* Better for indexing and batch jobs
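For example, a pipeline can fetch either file directly over HTTPS; a minimal sketch using the `requests` library:
```python theme={null}
import requests

# Both files are public over HTTPS and need no API key.
index = requests.get("https://docs.cartesia.ai/llms.txt", timeout=30)
index.raise_for_status()
print(index.text[:500])  # first part of the condensed index
```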
# MCP
Source: https://docs.cartesia.ai/tools/ai/mcp
The **`cartesia-mcp`** package exposes Cartesia through the **Model Context Protocol (MCP)** so MCP-capable clients—**Cursor**, **Claude Desktop**, **OpenAI Agents**, and similar—can list voices, run **TTS**, and use other Cartesia-backed tools via the protocol instead of custom scripts.
You need a [Cartesia API key](https://play.cartesia.ai/keys). The [PyPI package](https://pypi.org/project/cartesia-mcp/) currently requires **Python 3.13 or newer**; confirm the supported version on PyPI before you install.
**Installation**, the **uvx** shortcut, and **MCP client configuration** (executable path, environment variables, Claude Desktop or Cursor) are documented in the **[cartesia-mcp](https://github.com/cartesia-ai/cartesia-mcp)** README so setup stays in sync with releases.
The official Cartesia MCP Server
# JavaScript/TypeScript
Source: https://docs.cartesia.ai/tools/client-libraries/javascript-typescript
The library that powers the Cartesia Playground.
The official TS/JS client for the Cartesia API.
# Python
Source: https://docs.cartesia.ai/tools/client-libraries/python
The official Python library for the Cartesia API.
# API Conventions
Source: https://docs.cartesia.ai/use-the-api/api-conventions
All endpoints use HTTPS. HTTP is not supported. API keys used to call the API over HTTP may be subject to automatic rotation.
All API requests use the following base URL: `https://api.cartesia.ai`. (For WebSockets the corresponding protocol is `wss://`.)
### Always send a `Cartesia-Version` header
Each request you send our API should have a `Cartesia-Version` header containing the date (`YYYY-MM-DD`) when you tested your integration. For WebSockets, you can alternately use the `?cartesia_version` query parameter, which will take precedence.
This will help us provide you with timely deprecation notices and enable us to provide automatic backwards compatibility where possible.
For a given `Cartesia-Version`, we will preserve existing input and output fields, but we may make non-breaking changes, such as:
1. Add optional request fields.
2. Add additional response fields.
3. Change conditions for specific error types
4. Add variants to enum-like output values.
Our versioning scheme is inspired by the [Anthropic API](https://docs.anthropic.com/en/api/versioning).
### Use API keys when making requests from a server
Create a new API key at [play.cartesia.ai/keys](https://play.cartesia.ai/keys). Include your API key as a bearer token in the `Authorization` header of your requests.
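A minimal sketch of a server-side request with both the version and authorization headers set, using the `requests` library; the endpoint path is a placeholder and `CARTESIA_API_KEY` is an assumed environment variable name:
```python theme={null}
import os
import requests

BASE_URL = "https://api.cartesia.ai"

headers = {
    # Date (YYYY-MM-DD) when you last tested your integration
    "Cartesia-Version": "2026-03-01",
    # Server-side calls authenticate with an API key as a bearer token
    "Authorization": f"Bearer {os.environ['CARTESIA_API_KEY']}",
}

# "/your-endpoint" is a placeholder; substitute the endpoint you are calling.
response = requests.get(f"{BASE_URL}/your-endpoint", headers=headers)
response.raise_for_status()
```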
### Use access tokens when making requests from a client app
Never use API keys in client apps; they grant full account access and can be extracted from browser or mobile code.
Instead, your server can generate a short-lived access token and send it to the client. See the [Access Token API Reference](/api-reference/auth/access-token) for how to generate one.
* For HTTP requests, include the access token as a bearer token in the `Authorization` header.
* For WebSocket connections, pass the token as the `?access_token=` query parameter since browsers can't set headers on WebSocket handshakes.
### Check response codes
Our API uses standard HTTP response codes; refer to [httpstatuses.io](https://httpstatuses.io).
### Parse structured error responses
For `Cartesia-Version` values on or after `2026-03-01`, Cartesia returns structured JSON errors.
For the full error reference (all current error codes, schemas, and field nullability), see [API Errors](/use-the-api/api-errors).
```json HTTP error response (Cartesia-Version 2026-03-01 and newer) theme={null}
{
"error_code": "concurrency_limited",
"title": "Too many concurrent requests",
"message": "You have exceeded your plan's concurrency limit.",
"request_id": "550e8400-e29b-41d4-a716-446655440000"
}
```
Field meanings:
1. `error_code`: machine-readable identifier for client logic; can be `null`.
2. `title`: short human-readable summary.
3. `message`: detailed human-readable explanation.
4. `request_id`: request identifier for support/debugging.
5. `doc_url`: optional link to docs for the specific error (when available).
Common `error_code` values today include `quota_exceeded`, `concurrency_limited`, `voice_model_mismatch`, `voice_not_found`, `model_not_found`, `language_not_supported`, `file_too_large`, `unsupported_audio_format`, and `plan_upgrade_required`.
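A minimal sketch of consuming these fields on the client, assuming `Cartesia-Version: 2026-03-01` or newer and a `requests.Response` from a failed HTTP call; the retry policy shown is illustrative:
```python theme={null}
import requests

RETRYABLE = {"concurrency_limited"}  # illustrative retry policy

def handle_error(response: requests.Response) -> None:
    if response.ok:
        return
    error = response.json()
    code = error.get("error_code")        # machine-readable; may be None
    request_id = error.get("request_id")  # quote this when contacting support
    if code in RETRYABLE:
        ...  # back off and retry the request
    else:
        # Treat unrecognized codes gracefully; new codes may be added over time.
        raise RuntimeError(f"{error['title']}: {error['message']} (request {request_id})")
```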
WebSocket and SSE error events include the same error fields plus transport context:
```json WebSocket/SSE error event (Cartesia-Version 2026-03-01 and newer) theme={null}
{
"type": "error",
"done": true,
"status_code": 429,
"error_code": "concurrency_limited",
"title": "Too many concurrent requests",
"message": "You have exceeded your plan's concurrency limit.",
"request_id": "550e8400-e29b-41d4-a716-446655440000:happy-monkeys-fly:8a0f5f3a-3b2f-4f28-b73e-8c5f27e2f8bb",
"context_id": "happy-monkeys-fly"
}
```
Notes:
1. `context_id` appears for TTS WebSocket errors when available.
2. `status_code` is included in WebSocket/SSE error payloads; for HTTP, status remains in the HTTP response status line.
3. `request_id` is always a string; HTTP and SSE request IDs are UUIDs, while WebSocket request IDs may include additional context.
For `Cartesia-Version` values before `2026-03-01` (and invalid versions), legacy error formats are returned instead:
1. HTTP errors are plain text in `Title: Message` format.
2. WebSocket/SSE errors use legacy envelopes with string-only error messages.
### Pass data according to the method
All GET requests use query parameters to pass data. All POST requests use a JSON body or `multipart/form-data`.
# API Errors
Source: https://docs.cartesia.ai/use-the-api/api-errors
For `Cartesia-Version: 2026-03-01` and newer, Cartesia returns structured JSON error objects.
For older API versions, errors may be plain text (for example `Title: Message`).
## HTTP Error Object
```json HTTP error response theme={null}
{
"error_code": "concurrency_limited",
"title": "Too many concurrent requests",
"message": "You have exceeded your plan's concurrency limit.",
"request_id": "550e8400-e29b-41d4-a716-446655440000"
}
```
| Field | Type | Required | Nullable | Notes |
| ------------ | --------------- | -------- | -------- | ----------------------------------------------------------------- |
| `error_code` | `string` | Yes | Yes | Machine-readable code. Can be `null` if no specific code applies. |
| `title` | `string` | Yes | No | Short human-readable error summary. |
| `message` | `string` | Yes | No | Detailed human-readable error explanation. |
| `request_id` | `string` (UUID) | Yes | No | Request identifier for support/debugging. |
| `doc_url` | `string` | No | No | Optional docs link for the error. Omitted when not available. |
## WebSocket Error Event Object
```json WebSocket error event theme={null}
{
"type": "error",
"done": true,
"status_code": 429,
"error_code": "concurrency_limited",
"title": "Too many concurrent requests",
"message": "You have exceeded your plan's concurrency limit.",
"request_id": "550e8400-e29b-41d4-a716-446655440000:happy-monkeys-fly:8a0f5f3a-3b2f-4f28-b73e-8c5f27e2f8bb",
"context_id": "happy-monkeys-fly"
}
```
| Field | Type | Required | Nullable | Notes |
| ------------- | --------- | -------- | -------- | --------------------------------------------------------------------------------------------------------------- |
| `type` | `string` | Yes | No | Always `"error"`. |
| `done` | `boolean` | Yes | No | Currently always `true` for error events. |
| `status_code` | `integer` | Yes | No | HTTP-like status code for the error. |
| `error_code` | `string` | Yes | Yes | Machine-readable code. Can be `null`. |
| `title` | `string` | Yes | No | Short human-readable error summary. |
| `message` | `string` | Yes | No | Detailed human-readable error explanation. |
| `request_id` | `string` | Yes | No | Request identifier for support/debugging. For WebSocket, this may be a UUID or a derived per-message ID string. |
| `doc_url` | `string` | No | No | Optional docs link for the error. Omitted when not available. |
| `context_id` | `string` | No | No | TTS context identifier. Present when available. |
## SSE Error Event Object
SSE errors are sent with `event: error` and JSON in the `data:` line.
```text SSE error event theme={null}
event: error
data: {"type":"error","done":true,"status_code":500,"error_code":null,"title":"Unexpected error","message":"An unexpected error occurred, please contact support@cartesia.ai if the problem persists.","request_id":"550e8400-e29b-41d4-a716-446655440000"}
```
| Field | Type | Required | Nullable | Notes |
| ------------- | --------------- | -------- | -------- | ------------------------------------------------------------- |
| `type` | `string` | Yes | No | Always `"error"`. |
| `done` | `boolean` | Yes | No | Currently always `true` for error events. |
| `status_code` | `integer` | Yes | No | HTTP-like status code for the error. |
| `error_code` | `string` | Yes | Yes | Machine-readable code. Can be `null`. |
| `title` | `string` | Yes | No | Short human-readable error summary. |
| `message` | `string` | Yes | No | Detailed human-readable error explanation. |
| `request_id` | `string` (UUID) | Yes | No | Request identifier for support/debugging. |
| `doc_url` | `string` | No | No | Optional docs link for the error. Omitted when not available. |
## Current Error Codes
More error codes may be added in the future. Integrations should handle unknown
`error_code` values gracefully.
| `error_code` | Meaning |
| -------------------------- | ------------------------------------------------------------------------- |
| `quota_exceeded` | The account has exceeded quota (for example credits or agents usage). |
| `concurrency_limited` | The account has exceeded the plan's concurrency limit. |
| `voice_model_mismatch` | The requested voice is incompatible with the requested model. |
| `voice_not_found` | The requested voice does not exist. |
| `model_not_found` | The requested model does not exist. |
| `language_not_supported` | The requested language is not supported for the requested model or voice. |
| `file_too_large` | The uploaded file is too large. |
| `unsupported_audio_format` | The provided audio format is not supported. |
| `plan_upgrade_required` | The feature requires a higher plan tier. |
# Compare TTS Endpoints
Source: https://docs.cartesia.ai/use-the-api/compare-tts-endpoints
How bytes, SSE, and WebSocket differ for text-to-speech, and when to use each.
Cartesia exposes three ways to turn text into speech. The same models, voices, and core parameters apply everywhere. What changes is how you connect, how audio is framed on the wire, and whether you get timestamps, continuations (streaming model output into one spoken line), or many generations on one connection.
All three endpoints stream audio as it is produced. The bytes endpoint delivers that stream as a single HTTP response body (the same pattern the playground uses). SSE and WebSocket stream too; they chunk audio into multiple events or messages, which is how per-chunk metadata such as timestamps is carried.
## Feature comparison
| | Multiple generations per connection | Timestamps | Continuations |
| --------- | ----------------------------------- | ---------- | ------------- |
| WebSocket | Yes | Yes | Yes |
| Bytes | No (one `POST` per generation) | No | No |
| SSE | No (one `POST` per generation) | Yes | No |
An **utterance** is one stretch of speech you want pronounced as a single unit (usually a sentence or a line of UI copy). **Continuations** let you send that utterance as several WebSocket messages that share a `context_id`. See [Stream inputs using continuations](/build-with-cartesia/capability-guides/stream-inputs-using-continuations) and [contexts](/use-the-api/tts-websocket/contexts).
```mermaid theme={null}
flowchart TD
Q1{"Are you streaming text from an LLM or other partial input?"}
Q2{"Do you need timestamps on HTTP without WebSocket?"}
Q3{"Will you speak often enough that repeated connect/TLS cost hurts?"}
WS["WebSocket"]
SSE["SSE"]
Bytes["Bytes"]
Q1 -- "Yes" --> WS
Q1 -- "No" --> Q2
Q2 -- "Yes" --> SSE
Q2 -- "No" --> Q3
Q3 -- "Yes" --> WS
Q3 -- "No" --> Bytes
```
If you care about time-to-first-byte on every turn, remember that a new HTTPS request pays for TCP and TLS again; that overhead is often on the same order as TTFB for the audio itself. WebSocket amortizes that cost when you keep the socket open.
SSE is still supported for stacks that already consume Server-Sent Events or when you want timestamps while staying on HTTP. For audio only, bytes is usually the better HTTP choice (smaller encoding than JSON-wrapped chunks).
## Pick an endpoint in one minute
| What you are building | Use this | Short label |
| -------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------- | ----------------------------------- |
| Full transcript in one request; you want a streaming HTTP body (efficient; same pattern as the playground) | [`POST /tts/bytes`](/api-reference/tts/bytes) | Stream speech (bytes) |
| Full transcript in one request; you need timestamps without WebSocket, or your stack already uses SSE | [`POST /tts/sse`](/api-reference/tts/sse) | Stream speech with timestamps (SSE) |
| Long-lived session, partial transcript (for example LLM tokens), lowest latency across many turns, timestamps, or several utterances on one socket | [WebSocket `/tts/websocket`](/api-reference/tts/websocket) | Live session (WebSocket) |
If the full transcript is not known up front, use WebSocket with contexts, not bytes or SSE.
***
## Bytes (`POST /tts/bytes`)
Best for batch jobs, caching files, notifications, and anywhere one `POST` per generation is enough.
The response body streams while audio is generated. You can read progressively or buffer to the end. For many output formats this is leaner on the wire than SSE because you receive raw or file bytes instead of JSON-wrapped chunks.
Typical flow:
1. One JSON payload with the full `transcript`, voice, model, and output format (WAV, MP3, raw PCM, and so on).
2. `POST` to `/tts/bytes`.
3. Read the body as data arrives, or consume it to completion.
One request is one generation. For another line of speech, send another `POST`.
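A minimal sketch of that flow with the `requests` library; the payload is left as a placeholder since the exact fields are in the bytes reference, and the filename assumes you requested WAV output:
```python theme={null}
import os
import requests

headers = {
    "Cartesia-Version": "2026-03-01",
    "Authorization": f"Bearer {os.environ['CARTESIA_API_KEY']}",
}
payload = {}  # transcript, model, voice, and output format per the bytes reference

with requests.post("https://api.cartesia.ai/tts/bytes", headers=headers, json=payload, stream=True) as resp:
    resp.raise_for_status()
    with open("speech.wav", "wb") as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)  # audio streams while generation is still in progress
```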
See [bytes reference](/api-reference/tts/bytes).
***
## SSE (`POST /tts/sse`)
Best when you need timestamps while staying on HTTP without WebSocket, or when your integration already uses SSE. If you only need audio and not SSE-shaped events, bytes is usually simpler. WebSocket is otherwise the full-featured option for real-time use and supports timestamps as well.
SSE remains available largely for backward compatibility and for teams that standardize on Server-Sent Events.
Typical flow:
1. Same as bytes: one JSON body with the full transcript.
2. `POST` to `/tts/sse`.
3. Consume Server-Sent Events; each event carries the next chunk until completion.
Bytes vs SSE:
| | Bytes | SSE |
| ---------- | ----------------------------------------------- | ---------------------------------------------- |
| Shape | One streaming response body (raw or file bytes) | Many SSE events (often JSON plus base64 audio) |
| Timestamps | No | Yes (in the event payload) |
You still send one full transcript per request: SSE does not support WebSocket-style continuations across multiple `POST`s. An optional `context_id` is echoed for your logs; it does not merge multiple HTTP requests into one utterance. To send text in pieces over time, use WebSocket.
See [SSE reference](/api-reference/tts/sse).
***
## WebSocket (`/tts/websocket`)
Best for assistants, games, telephony-style stacks, or any case where the connection stays open and transcript text may arrive over time.
Why people choose WebSocket:
1. Latency: you pay connect cost once; later generations avoid extra TCP/TLS round trips (often tens to low hundreds of ms per turn).
2. Streaming input: send fragments as they arrive (for example from an LLM) and keep prosody across them. See [continuations](/build-with-cartesia/capability-guides/stream-inputs-using-continuations) and [contexts](/use-the-api/tts-websocket/contexts).
3. Timestamps: word- or segment-level timing (model and language limits apply; see WebSocket docs).
4. Multiplexing: multiple `context_id` values on one connection for overlapping utterances.
Typical flow:
1. Open the WebSocket.
2. Send JSON messages. When one utterance is split across messages, use `context_id` and `continue`: set `continue: true` on partials, and `continue: false` on the last part of that utterance (or use the empty-transcript pattern in [contexts](/use-the-api/tts-websocket/contexts) if you cannot know the final string yet).
3. Read audio until the server finishes that context.
See [WebSocket reference](/api-reference/tts/websocket).
***
## Continuations
If you are not streaming text from a model, start with bytes or SSE, not continuations.
When you do use WebSocket streaming, keep one stable `context_id` per utterance, `continue: true` on every partial, and `continue: false` on the final message for that utterance (see [contexts](/use-the-api/tts-websocket/contexts)).
Things that break text or prosody:
* Concatenation: chunks are joined verbatim. A missing space produces `"...world!How..."` instead of `"...world! How..."`.
* SSML and numbers: avoid splitting tokens that must stay together (for example decimals in SSML). See `max_buffer_delay_ms` in the [continuations guide](/build-with-cartesia/capability-guides/stream-inputs-using-continuations).
If you leave `continue: true` longer than you meant, contexts eventually expire on their own and audio is still generated and flushed according to server rules. It is not a runaway failure mode. You should still send `continue: false` when you know the utterance is complete so your client state matches the server. Do not reuse old `context_id` values for unrelated utterances.
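As a sketch of the message shapes, showing only the continuation-related fields named above (see the WebSocket reference for the full schema), one utterance split across three sends might look like:
```python theme={null}
context_id = "greeting-1"  # one stable ID per utterance

messages = [
    # Note the trailing spaces: chunks are concatenated verbatim.
    {"context_id": context_id, "transcript": "Hello there! ", "continue": True},
    {"context_id": context_id, "transcript": "How can I help ", "continue": True},
    {"context_id": context_id, "transcript": "you today?", "continue": False},  # finalizes the utterance
]
```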
***
## Why WebSocket uses `context_id` (and HTTP does not)
On `POST /tts/bytes` and `POST /tts/sse`, you send a complete transcript in one JSON body. There is no continuation protocol across requests.
`context_id` and `continue` matter on WebSocket when one utterance's text is split across multiple messages. The server concatenates chunks that share a `context_id`. `continue: true` means more text is coming; `continue: false` finalizes that utterance.
Mental model:
* Whole line of speech in one string: bytes or SSE. No context API.
* Text arrives in pieces: WebSocket, one `context_id` per utterance, with continuations.
***
## API ergonomics (all endpoints)
* For server-side calls, prefer the API key in the `Authorization` header instead of query strings (headers are less likely to appear in access logs). In browsers, WebSocket handshakes cannot set headers, so pass a short-lived access token via the `?access_token=` query parameter instead.
* Model IDs, voices, and core generation parameters match across bytes, SSE, and WebSocket. What differs is wire format, how chunks are exposed, and whether input can be streamed with continuations.
***
## Where to go next
* Bytes (`/tts/bytes`): one POST, streaming response body
* SSE (`/tts/sse`): timestamps and SSE-chunked audio
* WebSocket (`/tts/websocket`): streaming input, multiplexing, lowest latency across turns
# Concurrency and WebSocket Limits
Source: https://docs.cartesia.ai/use-the-api/concurrency-limits-and-timeouts
Learn about concurrency limits and timeouts with the Cartesia API.
Your account is subject to two types of rate limits: WebSocket limits and generation concurrency limits.
## Concurrency limits by subscription plan
Your subscription plan determines how many requests can be processed simultaneously. Sonic Text-to-Speech (TTS) and Ink Speech-to-Text (STT) each have their own concurrency limit, shown per plan below.
| Plan | TTS Concurrent Requests | STT Concurrent Requests |
| ---------- | ----------------------- | ----------------------- |
| Free | 2 | 8 |
| Pro | 3 | 12 |
| Startup | 5 | 20 |
| Scale | 15 | 60 |
| Enterprise | Custom | Custom |
Sonic (Text-to-Speech) and Ink (Speech-to-Text) services have separate concurrent request limits. For example, if you're on the Scale plan, you can have up to 15 concurrent TTS requests AND 60 concurrent STT requests running simultaneously.
## Text-to-Speech (TTS) Concurrency
We measure TTS generation concurrency in terms of the number of unique contexts active at a given time.
* For HTTP endpoints, each request is treated as a separate context and counts toward your concurrency limit.
* For WebSockets, a unique `context_id` defines a context. Sending additional requests with the same `context_id` does not increase your concurrency usage, because requests to the same context are processed sequentially.
* STT **does not** count towards your TTS concurrency limit
If you exceed your TTS concurrency limit, you will receive a `429 Too Many Requests` error. You can check your concurrency limit and upgrade it on the playground at [play.cartesia.ai](https://play.cartesia.ai).
### Interpreting TTS concurrency limits
How you interpret your TTS concurrency limit depends on how you're using the Sonic model family.
For real-time conversational use cases, such as powering voice agents, we've found that the number of parallel conversations you can support is effectively 4X your concurrency limit. This is just a rule of thumb, and depends on the types of conversations you're supporting. You can reach out to us to discuss your specific use case.
For example, if you have a TTS concurrency limit of 15, you can typically support 60 parallel conversations.
For non-conversational use cases, such as generating speech in batch jobs, there is a more direct relationship between your concurrency limit and the number of parallel generations you can support.
For example, if you have a TTS concurrency limit of 15, you can typically support 15 parallel TTS generations. You can use a connection pool to ensure you don't exceed your concurrency limit.
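For batch workloads, one simple way to enforce that cap is a semaphore sized to your concurrency limit. A minimal sketch, assuming an asyncio-based client; `generate_tts` is a placeholder for your own request helper:
```python theme={null}
import asyncio

TTS_CONCURRENCY_LIMIT = 15  # e.g. the Scale plan
semaphore = asyncio.Semaphore(TTS_CONCURRENCY_LIMIT)


async def generate_tts(transcript: str) -> bytes:
    """Placeholder: replace with your actual TTS request (HTTP or WebSocket)."""
    await asyncio.sleep(0.1)  # simulate a generation
    return b""


async def generate_with_limit(transcript: str) -> bytes:
    # At most TTS_CONCURRENCY_LIMIT generations are in flight at once,
    # which keeps you under the limit and avoids 429 responses.
    async with semaphore:
        return await generate_tts(transcript)


async def run_batch(transcripts: list[str]) -> list[bytes]:
    return await asyncio.gather(*(generate_with_limit(t) for t in transcripts))


# asyncio.run(run_batch(["First line.", "Second line."]))
```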
### TTS WebSocket limits
We limit the number of parallel TTS WebSocket connections to 10X your concurrency limit. For example, if you have a concurrency limit of 15, you can have up to 150 parallel TTS WebSocket connections.
If you exceed your WebSocket limit, you will receive a `429 Too Many Requests` error on trying to open a new WebSocket connection.
Usually, when users run into TTS WebSocket limits (even at scale), it's because they're not properly closing idle connections. Beyond closing idle connections, you can also create a connection pool to ensure you don't exceed your WebSocket limit.
### TTS WebSocket timeouts
We close idle TTS WebSocket connections after 5 minutes. We recommend closing connections that stay idle for long periods and opening a new WebSocket connection when you next need one.
## Speech-to-Text (STT) Concurrency
Each active transcription stream counts as one concurrent request, regardless of whether you're using HTTP or WebSocket connections.
* Each concurrent HTTP or WebSocket connection counts toward your STT concurrency limit
* Idle STT WebSockets still count towards your STT concurrency limit
* TTS **does not** count towards your STT concurrency limit
If you exceed your STT concurrency limit, you will receive a `429 Too Many Requests` error.
### STT WebSocket timeouts
We close idle STT WebSocket connections after 3 minutes. We recommend closing connections that stay idle for long periods and opening a new WebSocket connection when you next need one.
# Migrating From OpenAI Whisper to Cartesia Ink
Source: https://docs.cartesia.ai/use-the-api/migrate-from-open-ai
Use Cartesia's Batch Speech-to-Text API with OpenAI's client libraries
Batch Speech-to-Text: This documentation covers OpenAI SDK compatibility for Cartesia Ink's batched transcription endpoint.
For real-time transcription, use our [Streaming STT endpoint](/api-reference/stt/stt).
Cartesia's Batch Speech-to-Text API is compatible with OpenAI's client libraries, enabling seamless migration from OpenAI Whisper.
## Endpoints
**Cartesia Native:** `/stt` - Full feature support\
**OpenAI Compatible:** `/audio/transcriptions` - Drop-in replacement for Whisper on the OpenAI SDK
## Migration Guide for OpenAI SDK
Replace your OpenAI base URL with `https://api.cartesia.ai` to use Cartesia's compatibility layer, as shown in the examples below.
### Parameter Support
**Supported Parameters**:
* `file` - The audio file to transcribe
* `model` - Use `ink-whisper` for Cartesia's latest model
* `language` - Input audio language (ISO-639-1 format)
* `timestamp_granularities` - Include `["word"]` to get word-level timestamps
**Response Format**: Always returns JSON with transcribed text, duration, language, and optionally word timestamps.
For the complete parameter reference, see our [Batch STT API documentation](/api-reference/stt/transcribe).
### Python Example
```python theme={null}
from openai import OpenAI

client = OpenAI(
    api_key="your-cartesia-api-key",
    base_url="https://api.cartesia.ai"
)

with open("audio.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        file=audio_file,
        model="ink-whisper",
        language="en",
        timestamp_granularities=["word"]
    )

print(transcript.text)
```
### Node.js Example
```typescript theme={null}
import OpenAI from 'openai';
import fs from 'fs';
const client = new OpenAI({
apiKey: 'your-cartesia-api-key',
baseURL: 'https://api.cartesia.ai'
});
const transcription = await client.audio.transcriptions.create({
file: fs.createReadStream('audio.wav'),
model: 'ink-whisper',
language: 'en',
timestamp_granularities: ['word']
});
console.log(transcription.text);
```
## Direct API Usage
Both endpoints accept identical parameters and return the same JSON response format:
### Cartesia Native Endpoint
```bash theme={null}
curl -X POST https://api.cartesia.ai/stt \
-H "X-API-Key: your-cartesia-api-key" \
-F "file=@audio.wav" \
-F "model=ink-whisper" \
-F "language=en" \
-F "timestamp_granularities[]=word"
```
### OpenAI-Compatible Endpoint
```bash theme={null}
curl -X POST https://api.cartesia.ai/audio/transcriptions \
-H "X-API-Key: your-cartesia-api-key" \
-F "file=@audio.wav" \
-F "model=ink-whisper" \
-F "language=en" \
-F "timestamp_granularities[]=word"
```
## Migration from OpenAI
To migrate from OpenAI's Whisper API to Cartesia:
1. **Update the base URL**: Change from `https://api.openai.com/v1` to `https://api.cartesia.ai`
2. **Update authentication**: Replace your OpenAI API key with your Cartesia API key
3. **Update model names**: Use `ink-whisper` instead of OpenAI's model names
4. **Keep the same endpoint**: Continue using `/audio/transcriptions`
5. **Avoid unsupported parameters**: Remove `prompt`, `temperature`, and `response_format` parameters
6. **Use `timestamp_granularities` (optional)**: Add `timestamp_granularities: ["word"]` to get word-level timestamps
The core functionality remains the same, with JSON responses containing transcribed text and optional word timestamps.
# Buffering
Source: https://docs.cartesia.ai/use-the-api/tts-websocket/buffering
Control how text is buffered before speech generation to balance prosody and latency.
Cartesia supports two buffering modes for streaming TTS: **managed buffering** and **custom buffering**. The right choice depends on how much control you need over the prosody-latency tradeoff.
**Start with managed buffering.** It produces natural-sounding speech with minimal integration effort. Switch to custom buffering only if you need fine-grained control.
## Managed buffering
Stream LLM tokens directly to Cartesia and let the API decide when to start generating speech. This is the same approach used in Cartesia's managed voice agents platform.
Set `max_buffer_delay_ms` to a value greater than 0 (the default is 3000ms) and stream text token by token.
```json theme={null}
{
"model_id": "sonic-3",
"transcript": "Hello",
"voice": {
"mode": "id",
"id": "a0e99841-438c-4a64-b679-ae501e7d6091"
},
"context_id": "my-context",
"continue": true,
"max_buffer_delay_ms": 3000
}
```
The API buffers incoming text until it has enough context to produce high-quality speech, or until `max_buffer_delay_ms` elapses—whichever comes first. This produces results similar to sentence-level aggregation while still optimizing for latency.
**When to use managed buffering:**
* You're streaming LLM output token by token
* You want natural-sounding speech without building buffering logic
* You want a simple integration with good defaults
## Custom buffering
Handle buffering yourself and send complete phrases or sentences to Cartesia. Set `max_buffer_delay_ms` to `0` so the API generates speech immediately from whatever you provide.
```json theme={null}
{
"model_id": "sonic-3",
"transcript": "Hello, my name is Sonic.",
"voice": {
"mode": "id",
"id": "a0e99841-438c-4a64-b679-ae501e7d6091"
},
"context_id": "my-context",
"continue": true,
"max_buffer_delay_ms": 0
}
```
With custom buffering, you control the prosody-latency tradeoff directly:
* **Full sentences** produce the best prosody but add latency while you wait for the sentence to complete.
* **Partial sentences** reduce latency but may result in less natural speech at chunk boundaries.
**When to use custom buffering:**
* You need precise control over when speech generation starts
* You have your own sentence detection or text aggregation logic
* You're optimizing for a specific latency target
## Avoid the middle ground
A common mistake is to aggregate text client-side into sentences or phrases *and* use the default `max_buffer_delay_ms` of 3000ms. This can cause unnecessary latency—after receiving a complete sentence, the API may wait up to 3000ms for additional input before generating speech.
Pick one approach:
* **Managed buffering:** Stream tokens with `max_buffer_delay_ms > 0` and let Cartesia handle aggregation.
* **Custom buffering:** Aggregate text yourself and set `max_buffer_delay_ms = 0`.
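For the custom-buffering path, here is a minimal sketch of client-side sentence aggregation. `send_to_cartesia` is a placeholder for your own WebSocket send, and the sentence splitting is deliberately naive:
```python theme={null}
SENTENCE_END = (".", "?", "!")


def send_to_cartesia(message: dict) -> None:
    """Placeholder: replace with your WebSocket send (see the JSON examples above)."""
    print(message)


def aggregate_sentences(tokens):
    """Yield complete sentences from a stream of LLM tokens."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer
            buffer = ""
    if buffer.strip():
        yield buffer  # flush any trailing partial text


def speak(tokens, context_id: str) -> None:
    for sentence in aggregate_sentences(tokens):
        send_to_cartesia({
            "transcript": sentence,
            "context_id": context_id,
            "continue": True,
            "max_buffer_delay_ms": 0,  # generate immediately from each complete sentence
        })
    # Finalize the context so the model doesn't wait for more text.
    send_to_cartesia({"transcript": "", "context_id": context_id, "continue": False})
```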
## Configuration reference
`max_buffer_delay_ms`: Maximum time in milliseconds the API waits for additional input before generating speech from buffered text.
* **Range:** 0–5000ms
* **Default:** 3000ms
* Set to `0` for custom buffering (no server-side buffering)
* Set to `> 0` for managed buffering
If you use `speed` or `volume` [SSML tags](/build-with-cartesia/sonic-3/ssml-tags) with managed buffering, make sure decimal values are not split across tokens. Submitting `1.0` as `1`, `.`, `0` will cause parsing errors.
## Tips for best results
* **End sentences with punctuation.** Without closing punctuation (`.`, `?`, `!`), the model may treat text as incomplete and wait for the buffer delay to elapse before generating. See [streaming inputs with continuations](/build-with-cartesia/capability-guides/stream-inputs-using-continuations) for more details.
* **Signal when input is done.** When a turn is complete, use `continue: false` (WebSocket) or `no_more_inputs()` (SDK) so the model doesn't wait for more text.
* **Test with realistic input patterns.** Buffering behavior depends on how text arrives—test with actual LLM output rather than pre-written text.
# Context Flushing and Flush IDs
Source: https://docs.cartesia.ai/use-the-api/tts-websocket/context-flushing-and-flush-i-ds
Learn about managing multiple transcript generations with context flushing.
## Overview
When using [context IDs with the WebSocket API](/use-the-api/tts-websocket/contexts), all audio chunks for transcripts submitted to a single context share the same context ID. This makes it difficult to determine which audio chunks correspond to specific transcript submissions.
While this behavior works well for streaming audio, some implementations require the ability to map audio chunks back to their originating transcripts.
## Manual Flushing
Manual flushing creates clear boundaries between transcript submissions within the same context.
### How It Works
Each time you trigger a manual flush, the system increments a `flush_id` counter. This ID is included in corresponding response audio chunk payloads, allowing you to track which transcript generated specific audio chunks.
### Implementation
To trigger a manual flush:
1. Send a request with these parameters:
* `continue=True` (indicates you're continuing with the same context)
* `flush=True` (triggering the flush operation)
* Empty transcript
* Same context ID as your previous request
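Put together, such a flush request might look like the sketch below; the other generation fields required on the context (model ID, voice, output format) are omitted for brevity but must stay consistent with the rest of the context.
```python theme={null}
import json

# A flush request on an open context: empty transcript, continue=true, flush=true.
flush_message = {
    "context_id": "happy-monkeys-fly",
    "transcript": "",
    "continue": True,
    "flush": True,
}
# await ws.send(json.dumps(flush_message))
print(json.dumps(flush_message))
```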
### Example Flow
```
1. Submit transcript 1 on context 1
2. Flush context 1
3. Submit transcript 2 on context 1
```
In this flow:
* All audio chunks from transcript 1 will have `flush_id=1`
* The manual flush increments the ID
* All audio chunks from transcript 2 will have `flush_id=2`
## Payload Structure
Each audio chunk payload includes a `flush_id` field that serves as a transcript identifier. This ID increments with each manual flush operation, creating a clear boundary between transcript submissions.
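For illustration, a small sketch of demultiplexing chunks by `flush_id`; the `data` field name for the base64 audio payload is an assumption here, so check the WebSocket reference for the exact payload schema.
```python theme={null}
import base64
from collections import defaultdict

# Accumulate audio per transcript submission, keyed by flush_id.
audio_by_flush_id: dict[int, bytes] = defaultdict(bytes)


def handle_chunk(message: dict) -> None:
    # `data` (base64 audio) is an assumed field name for illustration.
    if message.get("data") is not None:
        audio_by_flush_id[message["flush_id"]] += base64.b64decode(message["data"])
```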
## When to Use Manual Flushing
Consider using manual flushing when:
* You need to associate audio chunks with their originating transcripts
* Your application architecture expects a one-to-one relationship between transcripts and response streams
* You're integrating with frameworks that assume each transcript has a corresponding generator
This feature is particularly helpful when using multiple providers, as it aligns the Cartesia API with systems that expect discrete generator responses per transcript.
# Contexts
Source: https://docs.cartesia.ai/use-the-api/tts-websocket/contexts
This is a hands-on guide to input streaming using WebSocket contexts. For a conceptual overview of how input streaming works in Sonic, see the [input streaming guide](/build-with-cartesia/capability-guides/stream-inputs-using-continuations).
> In many real time use cases, you don't have your transcripts available upfront—like when you're generating them using an LLM. For these cases, Sonic supports input streaming.
The context IDs you pass to the Cartesia API identify speech contexts. Contexts maintain prosody between their inputs—so you can send a transcript in multiple parts and receive seamless speech in return.
To stream in inputs on a context, just pass a `continue` flag (set to `true`) for every input that you expect will be followed by more inputs. (By default, this flag is set to `false`.)
To finish a context, just set `continue` to `false`. If you do not know the last transcript in advance, you can send an input with an empty transcript and `continue` set to `false`.
Contexts automatically expire 1 second after the last audio output is streamed out. Attempting to send another input on the same context ID after expiry is not supported.
`continue`: Whether this input may be followed by more inputs. Defaults to `false`.
### Input Format
1. Inputs on the same context must keep all fields except `transcript`, `continue`, and `duration` the same.
2. Transcripts are concatenated verbatim, so make sure they form a valid transcript when joined together. Include any spaces between words and punctuation as necessary. For example, in languages with spaces, include a space at the end of the preceding transcript, e.g. transcript 1 is `Thanks for coming, ` and transcript 2 is `it was great to see you.`
### Example
Let's say you're trying to generate speech for "Hello, Sonic! I'm streaming inputs." You should stream in the following inputs (repeated fields omitted for brevity). Note: all other fields (e.g. `model_id`, `language`) are required and should be passed unchanged between requests with input streaming.
```json Input Streaming theme={null}
{"transcript": "Hello, Sonic!", "continue": true, "context_id": "happy-monkeys-fly"}
{"transcript": " I'm streaming ", "continue": true, "context_id": "happy-monkeys-fly"}
{"transcript": "inputs.", "continue": false, "context_id": "happy-monkeys-fly"}
```
If [streaming in input tokens](/build-with-cartesia/capability-guides/stream-inputs-using-continuations), we recommend using `max_buffer_delay_ms`, which sets the maximum time the model will buffer text before starting generation.
If you set this option to `0`, the model will start generating immediately on each request, giving you full control over buffering of inputs.
If you don't know the last transcript in advance, you can send an input with an empty transcript and `continue` set to `false`:
```json Input Streaming theme={null}
{"transcript": "Hello, Sonic!", "continue": true, "context_id": "happy-monkeys-fly"}
{"transcript": " I'm streaming ", "continue": true, "context_id": "happy-monkeys-fly"}
{"transcript": "inputs.", "continue": true, "context_id": "happy-monkeys-fly"}
{"transcript": "", "continue": false, "context_id": "happy-monkeys-fly"}
```
### Output
You will only receive `done: true` after outputs for the entire context have been returned.
Outputs for a given context will always be in order of the inputs you streamed in. (That is, if you send input A and then input B on a context, you will first receive the chunks corresponding to input A, and then the chunks corresponding to input B.)
## Cancelling Requests
You can also cancel pending requests through the WebSocket.
To cancel a request, send a JSON message with the following structure:
```json WebSocket Request theme={null}
{
"context_id": "happy-monkeys-fly",
"cancel": true
}
```
When you send a cancel request:
1. It will only halt requests that have not begun generating a response yet.
2. Any currently generating request will continue sending responses until completion.
The `context_id` in the cancel request should match the `context_id` of the request you want to cancel.
# Get API Key
Source: https://docs.cartesia.ai/api-reference/api-keys/get
/latest.yml GET /api-keys/{id}
Returns metadata for a single API key.
# List API Keys
Source: https://docs.cartesia.ai/api-reference/api-keys/list
/latest.yml GET /api-keys
Returns a paginated list of standard API keys owned by the authenticating organization. Only metadata is returned, not the keys themselves. Admin API keys are not included.
# Generate a New Access Token
Source: https://docs.cartesia.ai/api-reference/auth/access-token
/latest.yml POST /access-token
Generates a new Access Token for the client. These tokens are short-lived and should be used to make requests to the API from authenticated clients.
# Create
Source: https://docs.cartesia.ai/api-reference/datasets/create
/latest.yml POST /datasets/
Create a new dataset
# Delete
Source: https://docs.cartesia.ai/api-reference/datasets/delete
/latest.yml DELETE /datasets/{id}
Delete a dataset
# Delete file
Source: https://docs.cartesia.ai/api-reference/datasets/delete-file
/latest.yml DELETE /datasets/{id}/files/{fileID}
Remove a file from a dataset
# Get
Source: https://docs.cartesia.ai/api-reference/datasets/get
/latest.yml GET /datasets/{id}
Retrieve a specific dataset by ID
# List
Source: https://docs.cartesia.ai/api-reference/datasets/list
/latest.yml GET /datasets/
Paginated list of datasets
# List files
Source: https://docs.cartesia.ai/api-reference/datasets/list-files
/latest.yml GET /datasets/{id}/files
Paginated list of files in a dataset
# Update
Source: https://docs.cartesia.ai/api-reference/datasets/update
/latest.yml PATCH /datasets/{id}
Update an existing dataset
# Upload file
Source: https://docs.cartesia.ai/api-reference/datasets/upload-file
/latest.yml POST /datasets/{id}/files
Upload a new file to a dataset
# Create
Source: https://docs.cartesia.ai/api-reference/fine-tunes/create
/latest.yml POST /fine-tunes/
Create a new fine-tune
# Delete
Source: https://docs.cartesia.ai/api-reference/fine-tunes/delete
/latest.yml DELETE /fine-tunes/{id}
Delete a fine-tune
# Get
Source: https://docs.cartesia.ai/api-reference/fine-tunes/get
/latest.yml GET /fine-tunes/{id}
Retrieve a specific fine-tune by ID
# List
Source: https://docs.cartesia.ai/api-reference/fine-tunes/list
/latest.yml GET /fine-tunes/
Paginated list of all fine-tunes for the authenticated user
# List Voices
Source: https://docs.cartesia.ai/api-reference/fine-tunes/list-voices
/latest.yml GET /fine-tunes/{id}/voices
List all voices created from a fine-tune
# Infill (Bytes)
Source: https://docs.cartesia.ai/api-reference/infill/bytes
/latest.yml POST /infill/bytes
Generate audio that smoothly connects two existing audio segments. This is useful for inserting new speech between existing speech segments while maintaining natural transitions.
**The cost is 1 credit per character of the infill text plus a fixed cost of 300 credits.**
At least one of `left_audio` or `right_audio` must be provided.
As with all generative models, there's some inherent variability, but here are some tips we recommend to get the best results from infill:
- Use longer infill transcripts
- This gives the model more flexibility to adapt to the rest of the audio
- Target natural pauses in the audio when deciding where to clip
- This means you don't need word-level timestamps to be as precise
- Clip right up to the start and end of the audio segment you want infilled, keeping as much silence in the left/right audio segments as possible
- This helps the model generate more natural transitions
# Create
Source: https://docs.cartesia.ai/api-reference/pronunciation-dicts/create
/latest.yml POST /pronunciation-dicts/
Create a new pronunciation dictionary
# Delete
Source: https://docs.cartesia.ai/api-reference/pronunciation-dicts/delete
/latest.yml DELETE /pronunciation-dicts/{id}
Delete a pronunciation dictionary
# Get
Source: https://docs.cartesia.ai/api-reference/pronunciation-dicts/get
/latest.yml GET /pronunciation-dicts/{id}
Retrieve a specific pronunciation dictionary by ID
# List
Source: https://docs.cartesia.ai/api-reference/pronunciation-dicts/list
/latest.yml GET /pronunciation-dicts/
List all pronunciation dictionaries for the authenticated user
# Update
Source: https://docs.cartesia.ai/api-reference/pronunciation-dicts/update
/latest.yml PATCH /pronunciation-dicts/{id}
Update a pronunciation dictionary
# Get Agent Usage
Source: https://docs.cartesia.ai/api-reference/usage/agents
/latest.yml GET /usage/agents
Returns your agent usage over time, bucketed by the requested interval.
# Get Credit Usage
Source: https://docs.cartesia.ai/api-reference/usage/credits
/latest.yml GET /usage/credits
Returns your credit usage over time, bucketed by the requested interval.
# Voice Changer (Bytes)
Source: https://docs.cartesia.ai/api-reference/voice-changer/bytes
/latest.yml POST /voice-changer/bytes
Takes an audio file of speech, and returns an audio file of speech spoken with the same intonation, but with a different voice.
This endpoint is priced at 15 characters per second of input audio.
# Voice Changer (SSE)
Source: https://docs.cartesia.ai/api-reference/voice-changer/sse
/latest.yml POST /voice-changer/sse
# Audio encodings
Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/audio-encodings
Pick the encoding that matches your downstream pipeline.
## TTS output encodings
Used in the `output_format.encoding` field when generating audio.
| Encoding | Bit depth | Best for | Pair with sample rate |
| ----------- | ---------------- | --------------------------------------------------------------- | --------------------------------- |
| `pcm_s16le` | 16-bit int | General-purpose playback, browsers, audio players, most devices | 44100 (CD quality) or 16000–48000 |
| `pcm_f32le` | 32-bit float | ML post-processing, high-fidelity recording, audio analysis | 48000 |
| `pcm_mulaw` | 8-bit compressed | North American / Japanese telephony (G.711μ), Twilio | 8000 |
| `pcm_alaw` | 8-bit compressed | European / international telephony (G.711A) | 8000 |
### `pcm_s16le`
16-bit signed integer PCM, little-endian. Matches the standard audio CD format and is the most widely supported encoding across audio players, browsers, and hardware. Use this as your default unless you have a specific reason to choose another format.
```json theme={null}
{
"container": "raw",
"encoding": "pcm_s16le",
"sample_rate": 44100
}
```
### `pcm_f32le`
32-bit floating point PCM, little-endian. Provides the highest precision and dynamic range. Use when your pipeline handles float audio end-to-end—for example, feeding generated audio into an ML model, performing signal processing with NumPy/SciPy, or recording to a lossless format for later mastering.
```json theme={null}
{
"container": "raw",
"encoding": "pcm_f32le",
"sample_rate": 48000
}
```
### `pcm_mulaw`
8-bit μ-law compressed PCM. The standard encoding for North American and Japanese telephone networks (G.711μ). Use this when sending audio to Twilio or any telephony provider that expects μ-law. Always pair with an 8000 Hz sample rate to match the telephony standard.
```json theme={null}
{
"container": "raw",
"encoding": "pcm_mulaw",
"sample_rate": 8000
}
```
### `pcm_alaw`
8-bit A-law compressed PCM. The standard encoding for European and international telephone networks (G.711A). Use when your telephony infrastructure expects A-law rather than μ-law. Always pair with an 8000 Hz sample rate.
```json theme={null}
{
"container": "raw",
"encoding": "pcm_alaw",
"sample_rate": 8000
}
```
## STT input encodings
Used in the `encoding` parameter when sending audio for transcription. Must match the actual encoding of your audio source.
| Encoding | Bit depth | Common sources |
| ----------- | ---------------- | ------------------------------------------------------------------- |
| `pcm_s16le` | 16-bit int | Microphones, browsers (Web Audio API), most audio capture libraries |
| `pcm_s32le` | 32-bit int | Professional audio interfaces |
| `pcm_f16le` | 16-bit float | Half-precision ML pipelines |
| `pcm_f32le` | 32-bit float | ML models, Web Audio API `AudioWorklet` nodes, NumPy/SciPy |
| `pcm_mulaw` | 8-bit compressed | North American telephony, Twilio streams |
| `pcm_alaw` | 8-bit compressed | European telephony systems |
For best STT performance, resample your audio to `pcm_s16le` at 16000 Hz before sending.
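For example, here is a minimal resampling sketch assuming the `soundfile` and `scipy` packages; swap in whatever audio tooling your pipeline already uses.
```python theme={null}
import soundfile as sf
from scipy.signal import resample_poly


def to_stt_input(src_path: str, dst_path: str, target_rate: int = 16000) -> None:
    data, rate = sf.read(src_path, dtype="float32")
    if data.ndim > 1:
        data = data.mean(axis=1)  # downmix to mono
    if rate != target_rate:
        data = resample_poly(data, target_rate, rate)
    sf.write(dst_path, data, target_rate, subtype="PCM_16")  # 16-bit signed PCM


# to_stt_input("meeting.wav", "meeting_16k.wav")
```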
# Choosing a Voice
Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/choosing-a-voice
How to pick the best voice for your Voice Agents
When designing a voice agent experience, the voice that your agents will speak in is a critical choice that will influence your customers' experience.
Cartesia offers 500+ voices out of the box, as well as the ability to clone your own voices.
### Featured Voices
We feature a set of Voices that we've found work well for our customers and pass our internal quality checks. These voices are a great starting point to find the best Voice for your voice agent.
Featured Voices are displayed with a check mark icon next to their names on [play.cartesia.ai](https://play.cartesia.ai/).
### Stable voices (best for voice agents)
For voice agents in production, we've found that more stable, realistic voices perform better than studio quality, emotive voices. From our testing, we think these are the top performing English Voices for voice agents in Sonic 3:
* **Male**: Ronald, Carson
* **Female**: Katie, Jacqueline, Brooke
### Emotive voices (best for AI characters)
Our latest model, Sonic 3, is very expressive. Some voices, like Tessa and Maya, are labeled as emotive in the playground and respond well to [emotion instructions](/build-with-cartesia/sonic-3/volume-speed-emotion).
If your use case requires more expressive speech (e.g. companion apps, game characters), then we suggest trying:
* **Male**: Kyle, Cory
* **Female**: Tessa, Ariana
We tag such voices as Emotive in our playground and you can see a full list [here](https://play.cartesia.ai/voices?tags=Emotive).
# Choosing TTS parameters
Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/choosing-tts-parameters
Our Text-to-Speech API includes many parameters that can be bewildering to developers who have not
worked with audio before.
In general, you should pick the highest precision and sample rate supported by every stage of your audio
pipeline, including telephony and device outputs.
A typical digital audio setup will perform well with these settings, which match the standard audio CD format:
```
output_format: {
container: "raw",
encoding: "pcm_s16le",
sample_rate: 44100,
}
```
If you know your pipeline supports a higher encoding and sample rate end to end, the highest quality settings are:
```
output_format: {
container: "raw",
encoding: "pcm_f32le",
sample_rate: 48000,
}
```
## Reference
`container`: The container format (if any) for the audio output.
Available options: `RAW`, `WAV`, `MP3`. Only the Bytes endpoint supports all container formats;
our streaming endpoints (SSE, WebSockets) only support `RAW`.
`encoding`: The encoding of the output audio. Available options: `pcm_f32le`, `pcm_s16le`, `pcm_mulaw`, `pcm_alaw`.
For detailed guidance on when to use each encoding, see [Audio encodings](/build-with-cartesia/capability-guides/audio-encodings).
`sample_rate`: The sample rate of the output audio. Remember that to represent a given signal, the sample rate
must be at least twice the highest frequency component of the signal (Nyquist theorem).
Available options: `8000`, `16000`, `22050`, `24000`, `44100`, `48000`.
## Examples
### Audio CD quality
Standard audio CDs are encoded as `pcm_s16le` at 44.1kHz sample rate:
```
output_format: {
container: "raw",
encoding: "pcm_s16le",
sample_rate: 44100,
}
```
This performs well for consumer digital audio setups.
### Telephony
Many customers send their audio output over Twilio. Since all audio sent over Twilio is
transcoded to µlaw encoding with 8kHz sample rate (to match the telephony standard), you should
specify the following `output_format`:
```
output_format: {
container: "raw",
encoding: "pcm_mulaw",
sample_rate: 8000,
}
```
### Bluetooth headsets
If you happen to know that the user is using a Bluetooth headset (such as AirPods) to multiplex
both microphone input and headphone output, the user will be on the Bluetooth Hands-Free Profile
(HFP), limiting sample rate to 16kHz. (In practice, it's difficult to programmatically determine the
end-user's microphone/speaker devices, so this example is a bit contrived.)
```
output_format: {
container: "raw"
encoding: "pcm_s16le",
sample_rate: 16000,
}
```
# Clone Voices
Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/clone-voices
Learn how to get the best voice clones from your audio clips.
Voice cloning is available through the [playground](https://play.cartesia.ai) and the [API](/2024-11-13/api-reference/voices/clone). With current API versions, instant cloning uses **high-similarity** mode: clones sound more like the source clip, but may reproduce background noise. For the legacy **stability** workflow, pin API version `2024-11-13` and see [Older TTS models](/build-with-cartesia/tts-models/older-models).
For the best voice clones, we recommend following these best practices:
## General best practices for voice cloning
1. **Choose an appropriate script to speak.** You want your recording to align as closely as possible with the voice you want to generate. For example, don't read a colorless transcript in a monotone voice unless you're aiming for a monotonous clone. Instead, prepare a script that is suited to your use case and has the right energy.
2. **Speak as clearly as possible and avoid background noise.** For example, when recording yourself, try to use a high-quality microphone and be in a quiet space.
3. **Avoid long pauses.** Pauses in the recording, such as between sentences, will be mimicked by the cloned voice. Ensure your recording matches the pacing you want your voice to follow.
4. **Trim your recording.** The audio you provide should contain speech roughly from start to finish. Make sure the speaker is not cut off and that there's no excessive silence at the beginning or end. You can use a tool like Audacity or our playground to make the perfect clip from your recording.
5. **Speak in the target language.** For instance, if you want the cloned voice to speak Spanish, speak Spanish in the recording. If this is not possible, you can use Cartesia's localization feature—available in the playground and in the API—to convert your clone to a different language.
## Best practices for high-similarity clones
1. **Limit your recording to ten seconds.** This is the sweet spot for high-similarity clones. A longer clip will not result in a better clone.
2. **Set `enhance` to `false` when cloning.** Unless your source clip has substantial background noise, any postprocessing will reduce the similarity of the clone to the source clip.
# End-to-end Pro Voice Cloning (Python)
Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/clone-voices-pro/api
Use Cartesia's REST API to create a Pro Voice Clone.
> **Prerequisites**
>
> 1. You have a **Cartesia API key** (export it as `CARTESIA_API_KEY`, which the script below reads).
> 2. You have at least 1M credits on your account.
> 3. You have a folder called `samples/` with one or more `.wav` files.
```python lines theme={null}
"""
End-to-end Pro Voice Cloning example.
Steps
-----
1. Create a dataset.
2. Upload audio files from samples/ to the dataset.
3. Kick off a fine-tune from that dataset.
4. Poll until fine-tune is completed.
5. Get the voices produced by the fine-tune.
"""
import os
import time
from pathlib import Path
import requests
API_BASE = "https://api.cartesia.ai"
API_HEADERS = {
"Cartesia-Version": "2025-04-16",
"Authorization": f"Bearer {os.environ['CARTESIA_API_KEY']}",
}
def create_dataset(name: str, description: str) -> str:
"""POST /datasets → dataset id."""
res = requests.post(
f"{API_BASE}/datasets",
headers=API_HEADERS,
json={"name": name, "description": description},
)
res.raise_for_status()
return res.json()["id"]
def upload_file_to_dataset(dataset_id: str, path: Path) -> None:
"""POST /datasets/{dataset_id}/files (multipart/form-data)."""
with path.open("rb") as fp:
res = requests.post(
f"{API_BASE}/datasets/{dataset_id}/files",
headers=API_HEADERS,
files={"file": fp, "purpose": (None, "fine_tune")},
)
res.raise_for_status()
def create_fine_tune(dataset_id: str, *, name: str, language: str, model_id: str) -> str:
"""POST /fine-tunes → fine-tune id."""
body = {
"name": name,
"description": "Pro Voice Clone demo",
"language": language,
"model_id": model_id,
"dataset": dataset_id,
}
res = requests.post(f"{API_BASE}/fine-tunes", headers=API_HEADERS, json=body, timeout=60)
res.raise_for_status()
return res.json()["id"]
def wait_for_fine_tune(ft_id: str, every: float = 10.0) -> None:
"""Poll GET /fine-tunes/{id} until status == completed."""
start = time.monotonic()
while True:
res = requests.get(f"{API_BASE}/fine-tunes/{ft_id}", headers=API_HEADERS)
res.raise_for_status()
status = res.json()["status"]
print(f"fine-tune {ft_id} -> {status}. Elapsed: {time.monotonic() - start:.0f}s")
if status == "completed":
return
if status == "failed":
raise RuntimeError(f"fine-tune ended with status={status}")
time.sleep(every)
def list_voices(ft_id: str) -> list[dict]:
"""GET /fine-tunes/{id}/voices → list of voices."""
res = requests.get(f"{API_BASE}/fine-tunes/{ft_id}/voices", headers=API_HEADERS)
res.raise_for_status()
return res.json()["data"]
if __name__ == "__main__":
# Create the dataset
DATASET_ID = create_dataset("PVC demo", "Samples for a Pro Voice Clone")
print("Created dataset:", DATASET_ID)
# Upload .wav files to the dataset
for wav_path in Path("samples").glob("*.wav"):
upload_file_to_dataset(DATASET_ID, wav_path)
print(f"Uploaded {wav_path.name} to dataset {DATASET_ID}")
# Ask for confirmation before kicking off the fine-tune
confirmation = input(
"Are you sure you want to start the fine-tune? It will cost 1M credits upon successful completion (yes/no): "
)
if confirmation.lower() != "yes":
print("Fine-tuning cancelled by user.")
exit()
# Kick off the fine-tune
FINE_TUNE_ID = create_fine_tune(
DATASET_ID,
name="PVC demo",
language="en",
model_id="sonic-2",
)
print(f"Started fine-tune: {FINE_TUNE_ID}")
# Wait for training to finish
wait_for_fine_tune(FINE_TUNE_ID)
print("Fine-tune completed!")
# Fetch the voices created by the fine-tune
voices = list_voices(FINE_TUNE_ID)
print("Voices IDs:")
for voice in voices:
print(voice["id"])
```
# Pro Voice Cloning
Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/clone-voices-pro/playground
## Why use Pro Voice Cloning?
A Professional Voice Clone (PVC) is a voice that uses a fine-tune of our TTS model on your data, which allows it to create an almost exact replica of the voice it hears including accent, speaking style, and audio quality.
Compared to [Instant Voice Cloning](/build-with-cartesia/capability-guides/clone-voices), Pro Voice Cloning can capture the exact nuances of your hours of studio-quality audio voice data.
## Overview
Pro Voice Cloning is available in the [Playground](https://play.cartesia.ai/pro-voice-cloning) for anyone with a Cartesia subscription of Startup or higher. It allows you to create highly accurate voice clones by leveraging a larger amount of data compared to instant cloning.
| Feature | Required audio data | Pricing: cost to create | Pricing: cost to use for TTS |
| ------------------- | ------------------- | ----------------------- | ---------------------------- |
| Instant Voice Clone | 10 seconds | Free | 1 credit per character |
| Pro Voice Clone | 3 hours | 1M credits on success | 1.5 credits per character |
When you create a Pro Voice Clone, Cartesia first fine-tunes a model on your data, then creates Voices from selected clips of your data. These Voices are tied to the fine-tuned model, which is used automatically when you generate text-to-speech with them.
## Get started
Visit the Pro Voice Clone tab to get started on your first PVC. On the home page, you can see all your fine-tuned models and their statuses (Draft, Failed, Training, or Completed).
Fill out the form to create a Pro Voice Clone.
Then, upload all of the audio files you want to use for training. You can upload multiple
files at once. Files must be one of the following audio formats:
* .wav
* .mp3
* .flac
* .ogg
* .oga
* .ogx
* .aac
* .wma
* .m4a
* .opus
* .ac3
* .webm
Pro Voice Clones require a minimum of 30 minutes of audio, but we recommend 2 hours of audio for optimal balance of quality and effort. The Pro Voice Clone will closely match your uploaded data, so make sure it sounds the way you like in terms of background noise, loudness, and speech quality.
Generally, it's better to upload audio with only the speaker you wish to clone. Multi-speaker audio can interfere with cloning quality.
You can also reuse data from past Pro Voice Clones: switch to the **Select dataset** tab to view previous datasets. These datasets can be edited separately from your PVCs and are helpful for managing your audio files.
Training should take 3 hours to complete. You'll only be charged if the training is successful. If training fails, you can click the `Re-attempt Training` button to try again or contact [support](mailto:support@cartesia.ai) if the failures persist.
Once training is complete, we'll automatically create four Voices based on different source audio clips from your dataset. These Voices are internally linked to your fine-tuned model, which will be used when you specify the model ID of the fine-tuned model in your requests.
The Voices are also available in the Voice Library under My Voices and can be used through the API.
**Note about base model updates:**
We've fine-tuned the latest base model available in production, which is reflected in the displayed model ID. This means that the fine-tuned model is fixed to this particular model ID and will not be activated if you use a different `model_id`. PVCs will not automatically be updated for future base models, and will need to be retrained on each new base model.
Retraining a new fine-tuned model with new data or the latest base model will again cost 1M credits.
# Localize voices
Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/localize-voices
Learn how to localize voices for your brand or product.
The localization feature accepts a voice to localize, the gender of the voice, and the target language and accent to localize to, and produces a Voice that you can use to generate speech (or save as a new voice).
# Stream Inputs using Continuations
Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/stream-inputs-using-continuations
Learn how to stream input text to Sonic TTS.
In many real-time use cases, you don't have input text available upfront—like when you're generating it on the fly using a language model. For these cases, we support input streaming through a feature we call *continuations*.
This guide will cover how input streaming works from the perspective of the TTS model. If you just want to implement input streaming, see [the WebSocket API reference](/api-reference/tts/tts), which implements continuations using *contexts*.
## Continuations
Continuations are generations that extend already generated speech. They're called continuations because you're continuing the generation from where the last one left off, maintaining the *prosody* of the previous generation.
If you don't use continuations, you get sudden changes in prosody that create seams in the audio.
Prosody refers to the rhythm, intonation, and stress in speech. It's what makes speech flow naturally and sound human-like.
Let's say we're using an LLM and it generates a transcript in three parts, with a one second delay between each part:
1. `Hello, my name is Sonic.`
2. ` It's very nice`
3. ` to meet you.`
To generate speech for the whole transcript, we might think to generate speech for each part independently and stitch the audios together:
Unfortunately, we end up with speech that has sudden changes in prosody and strange pacing.
Now, let's try the same transcripts, but using continuations. The setup looks like this:
Here's what we get:
As you can hear, this output sounds seamless and natural.
You can scale up continuations to any number of inputs. There is no limit.
## Caveat: Streamed inputs should form a valid transcript when joined
This means that `"Hello, world!"` can be followed by `" How are you?"` (note the leading space) but not `"How are you?"`, since when joined they form the invalid transcript `"Hello, world!How are you?"`.
In practice, this means you should maintain spacing and punctuation in your streamed inputs.
**End complete sentences with closing punctuation** (for example `.`, `?`, or `!`).
If a streamed chunk does not end with sentence-ending punctuation, the model often treats it as an incomplete sentence. That can cause:
* **Extra latency:** Text may stay in the automatic input buffer until the model sees a clearer boundary or until `max_buffer_delay_ms` elapses (**3000ms by default**), so audio starts later than you expect.
* **Audio artifacts:** The model expects natural sentence endings; without closing punctuation, the generated audio sometimes ends with odd or distorted sounds.
When a user-facing utterance is finished, put terminal punctuation on the final segment (and signal that no more text is coming on the context when appropriate, for example `no_more_inputs()` in the SDK or `continue: false` over the WebSocket).
## Automatic buffering with `max_buffer_delay_ms`
When streaming inputs from LLMs word by word or token by token, we buffer text until it reaches the optimal transcript length for our model. The default buffer delay is 3000ms; if you wish to modify this, you can use the `max_buffer_delay_ms` parameter, though we *do not recommend making this change*.
If you plan on using `speed` or `volume` [SSML tags](/build-with-cartesia/sonic-3/ssml-tags) with buffering, make sure decimal values are not split up.
Submitting `1.0` as `1`, `.`, `0` will result in unintended failure modes.
### How it works
When set, the model will buffer incoming text chunks until it's confident it has enough context to generate high-quality speech, or the buffer delay elapses, whichever comes first.
Without this buffer, the model would immediately start generating with each input, which could result in choppy audio or unnatural prosody if inputs are very small (like single words or tokens).
### Configuration
* **Range**: Values between 0-5000ms are supported
* **Default**: 3000ms
Use this *only* if
* you have custom buffering client side, in which case you can set this to 0
* you have choppiness even at 3000ms, in which case you can try a higher value
```js lines theme={null}
// Example WebSocket request with `max_buffer_delay_ms`
{
"model_id": "sonic-3",
"transcript": "Hello", // First word/token
"voice": {
"mode": "id",
"id": "a0e99841-438c-4a64-b679-ae501e7d6091"
},
"context_id": "my-conversation-123",
"continue": true,
"max_buffer_delay_ms": 3000 // Buffer up to 3000ms
}
```
Let's try the following transcripts with continuations and the default `max_buffer_delay_ms=3000`: `['Hello', 'my name', 'is Sonic.', "It's ", 'very ', 'nice ', 'to ', 'meet ', 'you.']`
# Custom Pronunciations
Source: https://docs.cartesia.ai/build-with-cartesia/sonic-3/custom-pronunciations
Learn how to specify custom pronunciations for words that are hard to get right, like proper nouns or domain-specific terms.
All models in the Sonic TTS family support custom pronunciations in your transcripts. Try out the pronunciation tool on our [demo](https://play.cartesia.ai/demos/pronunciation) page.
`sonic-3` supports custom pronunciation dictionaries, which allow specifying how to pronounce a specific word or words more easily and sustainably.
At its core, a dictionary is a simple search and replace that directs the model to substitute another string for the matched text in the transcript. The pronunciation can either be an [IPA pronunciation](/build-with-cartesia/sonic-3/phonemes) or "sounds-like" guidance:
```json lines theme={null}
[
{
"text": "bayou",
"pronunciation": "<<ˈ|b|ɑ|ˈ|j|u>>"
},
{
"text": "jambalaya",
"pronunciation": "<<ˈ|dʒ|ə|m|ˈ|b|ə|ˈ|l|aɪ|ˈ|ə>>"
},
{
"text": "tchoupitoulas",
"pronunciation": "chop-uh-TOO-liss"
}
]
```
These JSONs can then be saved as pronunciation dictionaries [through our API](https://docs.cartesia.ai/api-reference/pronunciation-dicts/create) or through our [playground](https://play.cartesia.ai/pronunciation), which also provides UI affordances for creating and editing dictionaries.
Once a dictionary is created, it can be used in any of the TTS APIs by specifying its ID in `pronunciation_dict_id`.
With the above dictionary, the string `I ate some jambalaya on tchoupitoulas street` would become `I ate some <<ˈ|dʒ|ə|m|ˈ|b|ə|ˈ|l|aɪ|ˈ|ə>> on chop-uh-TOO-liss street` before being handed off to the model, which in turn does a better job of pronouncing it properly.
## Case Sensitivity
Dictionary matching is **case-sensitive**, with one exception: a lowercase entry also matches its sentence-start capitalized form. For example, `cat` matches both `cat` and `Cat`, but not `CAT`. An entry for `CAT` only matches `CAT`.
This applies to multi-word entries too. An entry for `green valley` matches `green valley` and `Green valley`, but not `Green Valley`.
**Use lowercase entries for common words.** These match the word both mid-sentence (`cat`) and at the start of a sentence (`Cat`), covering the two most common positions.
**Use exact capitalization for proper nouns.** A term like "LaTeX" should be entered as `LaTeX` so it doesn't collide with a different pronunciation for the common word `latex`. For multi-word proper nouns, enter the exact casing as it appears in your transcripts — for example, `Green Valley` if the transcript capitalizes both words.
> For the best controllability around pronunciation, we recommend using `sonic-3`.
`sonic-2` and `sonic-turbo` use MFA-style IPA for all languages. Of these two, `sonic-2` offers the best controllability around pronunciation.
You can also get custom pronunciations with older Sonic models.
The `sonic`, `sonic-2024-12-12`, and `sonic-2024-10-19` models use Sonic-flavored IPA phonemes for English.
The `sonic` and `sonic-2024-12-12` models use MFA-style IPA for languages other than English, and the Sonic Preview model uses MFA-style IPA for all languages.
Note that `sonic-2024-10-19` does not support custom pronunciations for languages other than English.
We will soon be updating all models to use MFA-style IPA.
Custom words should be wrapped in double angle brackets `<<` `>>`, with pipe characters `|` between phonemes and no whitespace.
For example:
* `Can I get <> on that?` (MFA-style IPA)
* `Can I get <> on that?` (Sonic-flavored IPA)
Each individual word should be wrapped in its own set of angle brackets.
# MFA-style IPA
## Constructing Pronunciations
We use the IPA phoneset as defined by the [Montreal Forced Aligner](https://montreal-forced-aligner.readthedocs.io/en/latest/). Because of the size and complexity of this phoneset, you may find it easier to construct your custom pronunciation starting from existing words with known phonemizations. We suggest the following workflow for constructing a custom pronunciation for a word:
1. Go to the [MFA pronunciation dictionary index](https://mfa-models.readthedocs.io/en/latest/dictionary/index.html) and find the page corresponding to your language. Make sure the phoneset is MFA, and try to download the latest version (for most languages, v3.0 or v3.1).
1. This page will give you the full range of acceptable phones for your language under the “phones” section.
2. Scroll down to the `Installation` section and click on the `Download from the release page` link.
3. Scroll to the bottom of the release page and download the .dict file; this is a text file mapping words to their constituent phonemes.
1. The first column in the file contains words, and the last column contains space delimited phonemes. Ignore the other columns.
4. Look up your word or words that sound similar to your intended pronunciation in the dictionary. Use these pronunciations as a starting point to construct your custom pronunciation.
Automatic pronunciation suggestions based on audio samples will be added in a future update. Note that MFA-style IPA does not support stress markers.
## Example
Suppose I want to generate the text “This is a generation from Cartesia” and the model is not pronouncing “Cartesia” correctly. I would do the following:
1. Go to the [MFA pronunciation dictionary index](https://mfa-models.readthedocs.io/en/latest/dictionary/index.html) and look for English pronunciation dictionaries. I see that for US English, the most recent version is v3.1.
1. I note that the page says that the acceptable phones for US english are `aj aw b bʲ c cʰ cʷ d dʒ dʲ d̪ ej f fʲ h i iː j k kʰ kʷ l m mʲ m̩ n n̩ ow p pʰ pʲ pʷ s t tʃ tʰ tʲ tʷ t̪ v vʲ w z æ ç ð ŋ ɐ ɑ ɑː ɒ ɒː ɔj ə ɚ ɛ ɝ ɟ ɟʷ ɡ ɡʷ ɪ ɫ ɫ̩ ɱ ɲ ɹ ɾ ɾʲ ɾ̃ ʃ ʉ ʉː ʊ ʎ ʒ ʔ θ`
2. Download the .dict file from the bottom of the [release page](https://github.com/MontrealCorpusTools/mfa-models/releases/tag/dictionary-english_us_mfa-v3.1.0).
3. Find a word in this dictionary that sounds similar to how I want “Cartesia” to be pronounced. I see this entry in the dictionary:
`cartesian 0.99 0.14 1.0 1.0 kʰ ɑ ɹ tʲ i ʒ ə n`
4. Ignore the middle four numeric columns. I want to cut off the part of the pronunciation that corresponds to “-an” and replace it with an “uh” sound. I know that the MFA phoneme for “uh” is `ɐ` (if I didn’t know that, I could also look up “uh” in the dictionary). So the pronunciation I want is `kʰ ɑ ɹ tʲ i ʒ ɐ`.
5. Format the phonemes in angle brackets with pipe characters between phonemes and no whitespace. So my transcript is `This is a generation from <<kʰ|ɑ|ɹ|tʲ|i|ʒ|ɐ>>`.
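If you do this often, it may help to script the lookup. Here is a minimal sketch that pulls a word's phonemes out of a downloaded `.dict` file and formats them for a transcript, based on the column layout described above; the file name is illustrative.
```python theme={null}
def _is_number(token: str) -> bool:
    try:
        float(token)
        return True
    except ValueError:
        return False


def mfa_pronunciation(dict_path: str, word: str) -> str | None:
    """Look up `word` in an MFA .dict file and format it as <<p1|p2|...>>."""
    with open(dict_path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if parts and parts[0] == word:
                # Skip the numeric probability columns; keep only the phonemes.
                phonemes = [p for p in parts[1:] if not _is_number(p)]
                return "<<" + "|".join(phonemes) + ">>"
    return None


# mfa_pronunciation("english_us_mfa.dict", "cartesian")
# -> "<<kʰ|ɑ|ɹ|tʲ|i|ʒ|ə|n>>"
```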
# (Deprecated) Sonic-flavored IPA
Sonic-flavored IPA is only for `sonic`; users of our latest models (`sonic-2` and `sonic-turbo`) should use MFA-style IPA.
Here is a pronunciation guide for Sonic-flavored IPA.
It follows the [English phonology article on Wikipedia](https://en.wikipedia.org/wiki/English_phonology) for most phonemes,
but in spots where our model requires different notation than you may expect, we've included a blue `<=` in the margins.
You can copy/paste some of these uncommon symbols from the original [charts here](https://docs.google.com/spreadsheets/d/1OJbiKtxLyodpNPqVfOu43X2HloLsAixTtFppEuQ_4pI/edit?usp=sharing).
## Stresses and vowel length markers
Sonic English requires stress markers for first (`ˈ`) and second (`ˌ`) stressed syllables, which go directly before the vowel. We also use annotations for vowel length (`ː`). The model can also operate without them, but you will have noticeably better robustness and control when using them.
# Prompting tips
Source: https://docs.cartesia.ai/build-with-cartesia/sonic-3/prompting-tips
1. **Use appropriate punctuation.** Add punctuation where appropriate and at the end of each transcript whenever possible.
2. **Use dates in MM/DD/YYYY form.** For example, 04/20/2023.
3. **Add spaces between time and AM/PM.** For example, `7:00 PM`, `7 PM`, `7:00 P.M`.
4. **Insert pauses.** To insert pauses, insert "-" or use [break tags](/build-with-cartesia/formatting-text-for-sonic-2/inserting-breaks-pauses) where you need the pause. These tags are counted as 1 character and do not need to be separated from adjacent text with a space, so to save credits you can remove spaces around break tags.
5. **Match the voice to the language.** Each voice has a language that it works best with. You can use the playground to quickly understand which voices are most appropriate for a language.
6. **Stream in inputs for contiguous audio.** Use [continuations](/build-with-cartesia/capability-guides/stream-inputs-using-continuations) if generating audio that should sound contiguous in separate chunks.
7. **Specify [custom pronunciations](/build-with-cartesia/sonic-3/custom-pronunciations) for domain-specific or ambiguous words.** You may want to do this for proper nouns and trademarks, as well as for words that are spelled the same but pronounced differently, like the city of Nice and the adjective "nice."
8. **Force [spelling out numbers and letters](/build-with-cartesia/sonic-3/ssml-tags#spelling-out-numbers-and-letters).** You may want to do this for IDs, email addresses, or numeric values.
For sonic-2, see [Formatting Text for Sonic-2](/build-with-cartesia/formatting-text-for-sonic-2/best-practices).
# SSML Tags
Source: https://docs.cartesia.ai/build-with-cartesia/sonic-3/ssml-tags
Tags for volume, speed, and emotions are in beta and subject to change in the future.
Sonic-3 supports SSML-like (Speech Synthesis Markup Language) tags to control generated speech.
## Speed
Note that if you're streaming token by token, you'll need to buffer the whole value of the speed or volume tags.
Passing in `1`, `.`, `0` as separate inputs, for example, will result in reading out the tags.
You can guide the speed of a TTS generation with a `speed` tag, which takes a scalar between `0.6` and `1.5`.
This value is roughly a multiplier on the default speed. For example, `1.5` will generate audio at roughly 1.5x the
default speed.
```xml theme={null}
I like to speak quickly because it makes me sound smart.
```
## Volume
You can guide the volume of a TTS generation with a `volume` tag, which takes a value between `0.5` and `2.0`. The default volume is `1`.
```xml theme={null}
I will speak softly.
```
## Emotion Beta
Emotion control is highly experimental, particularly when emotion shifts occur
mid-generation. If you need to change the emotion in a transcript, we recommend
using separate generation contexts for each emotion. For best results, use [Voices
tagged as "Emotive"](https://play.cartesia.ai/voices?tags=Emotive), as emotions may not work reliably with other Voices.
```xml theme={null}
I will not allow you to continue this! I was hoping for a peaceful resolution.
```
## Pauses and breaks
To insert breaks (or pauses) in generated speech, use a `break` tag with one attribute, `time`. For example, `<break time="1s" />`. You can specify the time in seconds (`s`) or milliseconds (`ms`).
For accounting purposes, break tags count as 1 character and do not need a space separating them from adjacent text, so you can remove spaces around them to save credits.
```xml theme={null}
Hello, my name is Sonic.<break time="1s" />Nice to meet you.
```
## Spelling out numbers and letters
To spell out input text, you can wrap it in `<spell>` tags.
This is particularly useful for pronouncing long numbers or identifiers, such as credit card numbers, phone numbers, or unique IDs.
```xml theme={null}
My name is Bob, spelled <spell>Bob</spell>, my account number is <spell>ABC-123</spell>, my phone number is <spell>(123) 456-7890</spell>, and my credit card is <spell>1234-5678-9012-3456</spell>.
```
If you want to spell out numbers or identifiers and have planned breaks between the generations (e.g. taking a break between the area code of a phone number and the rest of that number), you can combine `<spell>` and `<break>` tags. These tags each count as 1 character and do not need a space separating them from adjacent text, so you can remove spaces around break and spell tags to save credits.
```xml theme={null}
My phone number is <spell>(123)</spell><break time="0.5s" /><spell>4712177</spell> and my credit card number is <spell>1234</spell><break time="0.5s" /><spell>5678</spell><break time="0.5s" /><spell>6347</spell><break time="0.5s" /><spell>4537</spell>.
```
# Volume, Speed, and Emotion
Source: https://docs.cartesia.ai/build-with-cartesia/sonic-3/volume-speed-emotion
Sonic-3 provides rich controls for the speed, volume, and emotion of generated speech. These controls are available on play.cartesia.ai using the UI controls, or by passing in a `generation_config` parameter, or by using SSML tags within the transcript itself.
**Sonic-3 interprets these parameters as guidance** instead of as strict adjustments to ensure natural speech, so we recommend testing them against your content to ensure the output matches your expectations.
## Speed and Volume Controls
You can guide the speed and volume of a TTS generation with the `generation_config.speed` and `generation_config.volume` parameters. These values are roughly a multiplier on the default speed and volume, e.g., `1.5` will generate audio at 1.5x the default speed.
* `generation_config.speed`: the speed of the generation, ranging from `0.6` to `1.5`.
* `generation_config.volume`: the volume of the generation, ranging from `0.5` to `2.0`.
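As a minimal sketch of the API-parameter route, assuming the Python SDK forwards a `generation_config` object with the `speed` and `volume` fields named exactly as above (check the TTS API reference for the precise request shape); `your-api-key` and `your-voice-id` are placeholders:

```python theme={null}
from cartesia import Cartesia

client = Cartesia(api_key="your-api-key")

# Guidance, not a strict adjustment: roughly 1.2x speed at slightly reduced volume.
audio = client.tts.bytes(
    model_id="sonic-3",
    transcript="Thanks for calling! Let me pull up your account.",
    voice={"mode": "id", "id": "your-voice-id"},
    language="en",
    output_format={"container": "wav", "encoding": "pcm_f32le", "sample_rate": 44100},
    generation_config={"speed": 1.2, "volume": 0.8},  # assumed kwarg; mirrors generation_config.speed/.volume
)
with open("guided.wav", "wb") as f:
    for chunk in audio:
        f.write(chunk)
```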
You can also specify these inside the transcript itself, using [SSML](/build-with-cartesia/sonic-3/ssml-tags), for example:
```xml lines theme={null}
<speed value="1.5">I like to speak quickly because it makes me sound smart.</speed>
<volume value="1.5">And I can be loud, too!</volume>
```
## Emotion Controls Beta
By default, the model attempts to interpret the emotional subtext present in the provided transcript. You can guide the emotion of a TTS generation, like a director providing guidance to an actor, using the `generation_config.emotion` parameter.
Emotion tags are good for pushing the model to be more emotive, but they only work when the emotion is consistent with the transcript. For instance, the mismatch below is unlikely to work well:
```xml theme={null}
<emotion value="sad">I'm so excited!</emotion>
```
* `generation_config.emotion`: the emotional guidance for a generation, one of the emotions below.
The primary emotions, for which we have the most data and produce the best results, are: `neutral`, `angry`, `excited`, `content`, `sad`, and `scared`.
The complete list of available emotions is: `happy`, `excited`, `enthusiastic`, `elated`, `euphoric`, `triumphant`, `amazed`, `surprised`, `flirtatious`, `joking/comedic`, `curious`, `content`, `peaceful`, `serene`, `calm`, `grateful`, `affectionate`, `trust`, `sympathetic`, `anticipation`, `mysterious`, `angry`, `mad`, `outraged`, `frustrated`, `agitated`, `threatened`, `disgusted`, `contempt`, `envious`, `sarcastic`, `ironic`, `sad`, `dejected`, `melancholic`, `disappointed`, `hurt`, `guilty`, `bored`, `tired`, `rejected`, `nostalgic`, `wistful`, `apologetic`, `hesitant`, `insecure`, `confused`, `resigned`, `anxious`, `panicked`, `alarmed`, `scared`, `neutral`, `proud`, `confident`, `distant`, `skeptical`, `contemplative`, `determined`.
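Under the same assumptions as the speed/volume sketch above (the `generation_config` kwarg and the placeholder key and voice ID are illustrative), emotion guidance that matches the transcript's subtext might look like:

```python theme={null}
from cartesia import Cartesia

client = Cartesia(api_key="your-api-key")

audio = client.tts.bytes(
    model_id="sonic-3",
    transcript="We did it! I can't believe we actually pulled it off!",
    voice={"mode": "id", "id": "your-voice-id"},  # an Emotive-tagged voice works best
    language="en",
    output_format={"container": "wav", "encoding": "pcm_f32le", "sample_rate": 44100},
    generation_config={"emotion": "excited"},  # one of the emotions listed above
)
```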
The Voices with the best emotional response are:
* [Leo](https://play.cartesia.ai/voices/0834f3df-e650-4766-a20c-5a93a43aa6e3) (id: `0834f3df-e650-4766-a20c-5a93a43aa6e3`)
* [Jace](https://play.cartesia.ai/voices/6776173b-fd72-460d-89b3-d85812ee518d) (id: `6776173b-fd72-460d-89b3-d85812ee518d`)
* [Kyle](https://play.cartesia.ai/voices/c961b81c-a935-4c17-bfb3-ba2239de8c2f) (id: `c961b81c-a935-4c17-bfb3-ba2239de8c2f`)
* [Gavin](https://play.cartesia.ai/voices/f4a3a8e4-694c-4c45-9ca0-27caf97901b5) (id: `f4a3a8e4-694c-4c45-9ca0-27caf97901b5`)
* [Maya](https://play.cartesia.ai/voices/cbaf8084-f009-4838-a096-07ee2e6612b1) (id: `cbaf8084-f009-4838-a096-07ee2e6612b1`)
* [Tessa](https://play.cartesia.ai/voices/6ccbfb76-1fc6-48f7-b71d-91ac6298247b) (id: `6ccbfb76-1fc6-48f7-b71d-91ac6298247b`)
* [Dana](https://play.cartesia.ai/voices/cc00e582-ed66-4004-8336-0175b85c85f6) (id: `cc00e582-ed66-4004-8336-0175b85c85f6`)
* [Marian](https://play.cartesia.ai/voices/26403c37-80c1-4a1a-8692-540551ca2ae5) (id: `26403c37-80c1-4a1a-8692-540551ca2ae5`)
View the full list of emotive Voices on our [Voice Library with voices tagged 'Emotive'](https://play.cartesia.ai/voices?tags=Emotive).
You can also use [SSML](/build-with-cartesia/sonic-3/ssml-tags) tags for emotions, for example:
```xml theme={null}
<emotion value="angry">How dare you speak to me like I'm just a robot!</emotion>
```
## Nonverbalisms
Insert `[laughter]` in your transcript to make the model laugh. In the future, we plan to add more non-speech verbalisms like sighs and coughs.
# STT Models
Source: https://docs.cartesia.ai/build-with-cartesia/stt-models
Ink is a new family of streaming speech-to-text (STT) models for developers building real-time voice applications.
Each base model name (e.g. `ink-whisper`) points to the latest **stable** snapshot of the model, so we recommend using the base model name to stay on the stable version. In many cases the stable and preview snapshots are the same, but the preview snapshot may include additional features or improvements.
## `ink-whisper`
Ink Whisper is the fastest, most affordable speech-to-text model — engineered for enterprise deployment in production-grade voice agents. It delivers higher accuracy than baseline Whisper and is optimized for real-time performance in a wide variety of real-world conditions.
Additional Capabilities:
* Handles variable-length audio chunks and interruptions gracefully using dynamic chunking.
* Reliably transcribes speech with background noise.
* Accurately transcribes audio with telephony artifacts, accents, and disfluencies.
* Excels at transcribing proper nouns and domain-specific terminology.
| Snapshot | Release Date | Languages | Status |
| ------------------------------------ | ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------ |
| `ink-whisper` | June 10, 2025 | `en`, `zh`, `de`, `es`, `ru`, `ko`, `fr`, `ja`, `pt`, `tr`, `pl`, `ca`, `nl`, `ar`, `sv`, `it`, `id`, `hi`, `fi`, `vi`, `he`, `uk`, `el`, `ms`, `cs`, `ro`, `da`, `hu`, `ta`, `no`, `th`, `ur`, `hr`, `bg`, `lt`, `la`, `mi`, `ml`, `cy`, `sk`, `te`, `fa`, `lv`, `bn`, `sr`, `az`, `sl`, `kn`, `et`, `mk`, `br`, `eu`, `is`, `hy`, `ne`, `mn`, `bs`, `kk`, `sq`, `sw`, `gl`, `mr`, `pa`, `si`, `km`, `sn`, `yo`, `so`, `af`, `oc`, `ka`, `be`, `tg`, `sd`, `gu`, `am`, `yi`, `lo`, `uz`, `fo`, `ht`, `ps`, `tk`, `nn`, `mt`, `sa`, `lb`, `my`, `bo`, `tl`, `mg`, `as`, `tt`, `haw`, `ln`, `ha`, `ba`, `jw`, `su`, `yue` | Stable |
| `ink-whisper-2025-06-04` | June 4, 2025 | `en`, `zh`, `de`, `es`, `ru`, `ko`, `fr`, `ja`, `pt`, `tr`, `pl`, `ca`, `nl`, `ar`, `sv`, `it`, `id`, `hi`, `fi`, `vi`, `he`, `uk`, `el`, `ms`, `cs`, `ro`, `da`, `hu`, `ta`, `no`, `th`, `ur`, `hr`, `bg`, `lt`, `la`, `mi`, `ml`, `cy`, `sk`, `te`, `fa`, `lv`, `bn`, `sr`, `az`, `sl`, `kn`, `et`, `mk`, `br`, `eu`, `is`, `hy`, `ne`, `mn`, `bs`, `kk`, `sq`, `sw`, `gl`, `mr`, `pa`, `si`, `km`, `sn`, `yo`, `so`, `af`, `oc`, `ka`, `be`, `tg`, `sd`, `gu`, `am`, `yi`, `lo`, `uz`, `fo`, `ht`, `ps`, `tk`, `nn`, `mt`, `sa`, `lb`, `my`, `bo`, `tl`, `mg`, `as`, `tt`, `haw`, `ln`, `ha`, `ba`, `jw`, `su`, `yue` | Stable |
To learn how to use the Ink STT family, see [the Speech-to-Text API Reference](/api-reference/stt/stt). For a detailed mapping of codes to languages, see the [STT supported languages](/api-reference/stt/stt#request.query.language) list.
## Selecting a Model
When making API calls, you can specify either:
```python lines theme={null}
# Use the base model (automatically routes to the latest snapshot)
model = "ink-whisper"

# Or specify a particular snapshot for consistency
model = "ink-whisper-2025-06-04"
```
### Continuous updates
All models have a base model name (e.g. `ink-whisper`) and date-versioned names (e.g. `ink-whisper-2025-06-04`).
We recommend using base model names for prototyping and development, then switching to a date-versioned model in production to ensure stability.
## Future Updates
New snapshots are released periodically with improvements in performance, additional language support, and new capabilities. Check back regularly for updates.
# API Changes
Source: https://docs.cartesia.ai/build-with-cartesia/tts-models/api-changes
Starting June 1, 2026, we are discontinuing several models, snapshots, and languages, and removing voice embeddings from our voice API. Migrate to `sonic-3` for improved naturalness, 42-language support, and fine-grained controls.
## Deprecated models and languages
You can check if you're making requests to deprecated models on [play.cartesia.ai/deprecation/traffic](https://play.cartesia.ai/deprecation/traffic).
### Fully deprecated models
These models will stop serving requests on June 1, 2026.
| Model | Snapshots affected | Deprecated languages |
| -------------------- | ------------------------ | -------------------- |
| `sonic` | All | All |
| `sonic-english` | — | All |
| `sonic-multilingual` | — | All |
| `sonic-2` | `sonic-2-2025-03-07` | All |
| `sonic-turbo` | `sonic-turbo-2025-03-07` | All |
### Partially deprecated models
These models will continue to serve a reduced set of languages. The languages listed below will be discontinued on June 1, 2026.
| Model | Snapshots affected | Deprecated languages |
| ------------- | ---------------------------------------------------------------- | -------------------------- |
| `sonic-2` | `sonic-2-2025-04-16`, `sonic-2-2025-05-08`, `sonic-2-2025-06-11` | it, nl, pl, ru, sv, tr, hi |
| `sonic-turbo` | `sonic-turbo-2025-06-04` | it, nl, pl, ru, sv, tr |
## Stable offerings
The following will remain available beyond June 1.
| Model | Snapshots | Supported Languages |
| ------------- | ---------------------------------------------------------------- | ----------------------------------------------------------------------------------- |
| `sonic-3` | All | 42 languages — [full list](/build-with-cartesia/tts-models/latest#language-support) |
| `sonic-2` | `sonic-2-2025-04-16`, `sonic-2-2025-05-08`, `sonic-2-2025-06-11` | en, de, es, fr, ja, ko, pt, zh |
| `sonic-turbo` | `sonic-turbo-2025-06-04` | en, de, es, fr, ja, ko, pt, zh, hi |
## API changes
These endpoints will be discontinued on June 1, 2026.
| Discontinued Endpoint | Replacement |
| ------------------------------------------ | ------------------------------------------ |
| Voice Embedding: `POST /voices/clone/clip` | [Clone Voice](/api-reference/voices/clone) |
| Mix Voices: `POST /voices/mix` | — |
| Create Voice: `POST /voices` | [Clone Voice](/api-reference/voices/clone) |
These endpoints will stop accepting voice embeddings on June 1, 2026.
| Endpoint with a breaking change | Replacement |
| ------------------------------------- | ------------------------------------------------------ |
| TTS (bytes): `POST /tts/bytes` | [Voice IDs](/build-with-cartesia/tts-models/voice-ids) |
| TTS (SSE): `POST /tts/sse` | [Voice IDs](/build-with-cartesia/tts-models/voice-ids) |
| TTS (WebSocket): `WSS /tts/websocket` | [Voice IDs](/build-with-cartesia/tts-models/voice-ids) |
You can test these API changes by setting your [Cartesia Version](/use-the-api/api-conventions#always-send-a-cartesia-version-header) to `2026-03-01`. We recommend upgrading your Cartesia Version on production traffic before June 1 to make sure nothing breaks.
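For example, a hedged sketch of sending the version header on a raw request; the `X-API-Key` header and the `/tts/bytes` payload shape follow the other examples in these docs, and `your-api-key` and `your-voice-id` are placeholders:

```python theme={null}
import requests

resp = requests.post(
    "https://api.cartesia.ai/tts/bytes",
    headers={
        "Cartesia-Version": "2026-03-01",  # opt in to the new behavior before June 1, 2026
        "X-API-Key": "your-api-key",
        "Content-Type": "application/json",
    },
    json={
        "model_id": "sonic-3",
        "transcript": "Testing against the 2026-03-01 API version.",
        "voice": {"mode": "id", "id": "your-voice-id"},
        "language": "en",
        "output_format": {"container": "wav", "encoding": "pcm_f32le", "sample_rate": 44100},
    },
)
resp.raise_for_status()
with open("version-test.wav", "wb") as f:
    f.write(resp.content)
```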
### Moving off of deprecated endpoints
1. Change how you create voices — see [Migrating Voices](/build-with-cartesia/tts-models/migrating-voices).
2. Switch from voice embeddings to IDs — see [Voice IDs](/build-with-cartesia/tts-models/voice-ids).
## Full Checklist
1. Move off of [deprecated models / snapshots / languages](/build-with-cartesia/tts-models/api-changes#deprecated-models-and-languages) onto `sonic-3` or another stable model
2. Move off of [deprecated endpoints](/build-with-cartesia/tts-models/api-changes#api-changes) when creating voices
3. Use [Voice IDs](/build-with-cartesia/tts-models/voice-ids)
4. Check your deprecated model traffic on [play.cartesia.ai/deprecation/traffic](https://play.cartesia.ai/deprecation/traffic)
5. Make sure your voices are migrated on [play.cartesia.ai/deprecation/voices](https://play.cartesia.ai/deprecation/voices)
6. (Optional) Upgrade your [Cartesia Version](/use-the-api/api-conventions#always-send-a-cartesia-version-header) to `2026-03-01`
## Why are we doing this?
Since the launch of Sonic 3, we've made improvements across pacing, prosody, and naturalness; the vast majority of our customers have migrated to these models with great success. In order to increase our capacity, availability, and serving performance, we have to discontinue our oldest models.
Additionally, our newer models have evolved beyond voice embeddings in order to sound more natural. The parts of our API that accept voice embeddings cannot be made forward-compatible. Migrating to voice IDs will allow us to continue to improve both our models and your voices in tandem.
If you have questions, reach out to [support@cartesia.ai](mailto:support@cartesia.ai).
# Migrating Voices
Source: https://docs.cartesia.ai/build-with-cartesia/tts-models/migrating-voices
On June 1, 2026, we are discontinuing our voice embedding (aka stability) TTS models.
Voices listed on [play.cartesia.ai/deprecation/voices](https://play.cartesia.ai/deprecation/voices) will stop working. Simply click "Auto Migrate" to make these voices compatible with the latest Sonic 3, 2, and Turbo snapshots.
If you use voice embeddings rather than voice IDs, see [Voice IDs](/build-with-cartesia/tts-models/voice-ids).
For an overview of all changes, see [API Changes](/build-with-cartesia/tts-models/api-changes).
## Where do these voices come from?
Voices created by these endpoints rely on our voice embedding models:
* [POST /voices](/2024-06-10/api-reference/voices/create)
* [POST /voices/mix](/2024-06-10/api-reference/voices/mix)
* `POST /voices/clone/clip`
## Creating voices
You can move to our [Clone Voice API](/api-reference/voices/clone) or use our [web UI](https://play.cartesia.ai/voices/create/clone) to create voices from 3–10 seconds of source audio.
You can test these API changes by setting your [Cartesia Version](/use-the-api/api-conventions#always-send-a-cartesia-version-header) to `2026-03-01`. We recommend upgrading your Cartesia Version on production traffic before June 1 to make sure nothing breaks.
Here is an example using the Cartesia SDK:
```python theme={null}
from cartesia import Cartesia

your_api_key: str = ""
client = Cartesia(api_key=your_api_key)
print("Cloning a voice")
with open("3 to 10 seconds of source audio.wav", mode="rb") as f:
voice = client.voices.clone(
clip=f,
# this must match the source audio
language="en",
name="My Voice",
mode="similarity",
)
print(f"Cloned voice {voice.id}")
print("Generating audio...")
generated_audio = client.tts.bytes(
# voice embeddings will not work after June 1, 2026!
voice={"mode": "id", "id": voice.id},
model_id="sonic-3",
transcript="Hello from Cartesia!",
language="en",
output_format={
"container": "wav",
"encoding": "pcm_f32le",
"sample_rate": 44100
},
)
# Save the generated audio so you can listen to the cloned voice.
with open("cloned_voice_sample.wav", "wb") as out:
    for chunk in generated_audio:
        out.write(chunk)
```
# Older TTS Models
Source: https://docs.cartesia.ai/build-with-cartesia/tts-models/older-models
We recommend using [Sonic 3](/build-with-cartesia/tts-models/latest) for best
results, most languages, and controllability. We continue to serve these older
models for compatibility.
Some models and snapshots are being discontinued on June 1, 2026 — see [API Changes](/build-with-cartesia/tts-models/api-changes) for details.
The base model name always points to the latest **stable** snapshot of a model. In the tables below, a Status of **EOL June 1, 2026** marks snapshots and languages to be discontinued on that date.
All models have a base model name (e.g. `sonic-2`, `sonic-turbo`) and date-versioned model names
(e.g. `sonic-2-2025-06-11`).
We recommend using base model names for prototyping and development, then switching to a date-versioned model for production use cases to ensure stability.
When making API calls, you can specify either:
```python lines theme={null}
# Use the base model
# (automatically routes to the latest stable snapshot)
model_id = "sonic-3"
# Or specify a particular snapshot for consistency
model_id = "sonic-3-2026-01-12"
```
## `sonic-2`
Sonic-2 provides ultra-realistic speech with accurate transcript following, minimal hallucinations, and excellent voice cloning. It's latency optimized and achieves 90ms model latency.
Additional Capabilities:
* Higher fidelity voice cloning
* Timestamps for all 15 languages
* [Infill](/2024-11-13/api-reference/infill/bytes) support
| Snapshot | Release Date | Languages | Status |
| ------------------------------------------- | -------------- | ---------------------------------------------------------- | ---------------- |
| `sonic-2-2025-06-11` | June 11, 2025 | en, fr, de, es, pt, zh, ja, ko | Stable |
| `sonic-2-2025-06-11` | June 11, 2025 | hi, it, nl, pl, ru, sv, tr | EOL June 1, 2026 |
| `sonic-2-2025-05-08` | May 8, 2025 | en, fr, de, es, pt, zh, ja, ko | Stable |
| `sonic-2-2025-05-08` | May 8, 2025 | hi, it, nl, pl, ru, sv, tr | EOL June 1, 2026 |
| `sonic-2-2025-04-16` | April 16, 2025 | en, fr, de, es, pt, zh, ja, ko | Stable |
| `sonic-2-2025-04-16` | April 16, 2025 | hi, it, nl, pl, ru, sv, tr | EOL June 1, 2026 |
| `sonic-2-2025-03-07` | March 7, 2025 | en, fr, de, es, pt, zh, ja, hi, it, ko, nl, pl, ru, sv, tr | EOL June 1, 2026 |
Read these pages to learn more about how to use Sonic-2:
* [Best practices](/build-with-cartesia/formatting-text-for-sonic-2/best-practices)
* [Inserting breaks](/build-with-cartesia/formatting-text-for-sonic-2/inserting-breaks-pauses)
* [Spelling text](/build-with-cartesia/formatting-text-for-sonic-2/spelling-out-input-text)
## `sonic-turbo`
All the power of Sonic, with half the latency (as low as 40ms).
| Snapshot | Release Date | Languages | Status |
| ----------------------------------------------- | ------------- | ---------------------------------------------------------- | ---------------- |
| `sonic-turbo-2025-06-04` | June 6, 2025 | en, fr, de, es, pt, zh, ja, hi, ko | Stable |
| `sonic-turbo-2025-06-04` | June 6, 2025 | it, nl, pl, ru, sv, tr | EOL June 1, 2026 |
| `sonic-turbo-2025-03-07` | March 7, 2025 | en, fr, de, es, pt, zh, ja, hi, it, ko, nl, pl, ru, sv, tr | EOL June 1, 2026 |
## `sonic`
The first version of our flagship text-to-speech model. It produces high-accuracy, expressive speech, and is optimized for efficiency to achieve low latency.
| Snapshot | Release Date | Languages | Status |
| ----------------------------------------- | ----------------- | ---------------------------------------------------------- | ---------------- |
| `sonic-2024-12-12` | December 12, 2024 | en, fr, de, es, pt, zh, ja, hi, it, ko, nl, pl, ru, sv, tr | EOL June 1, 2026 |
| `sonic-2024-10-19` | October 19, 2024 | en, fr, de, es, pt, zh, ja, hi, it, ko, nl, pl, ru, sv, tr | EOL June 1, 2026 |
## Deprecated and Preview Model Aliases
The following model aliases are now deprecated. Please use the recommended model names instead:
| Deprecated Alias | Use Instead |
| ------------------------------------------- | ----------------------------------------- |
| `sonic-3-preview` | `sonic-3` |
| `sonic-preview` | `sonic-2` |
| `sonic-english` | `sonic-2024-10-19` |
| `sonic-multilingual` | `sonic-2024-10-19` |
# Sonic 3
Source: https://docs.cartesia.ai/build-with-cartesia/tts-models/sonic-3
`sonic-3` is our streaming TTS model, with high naturalness, accurate transcript following, and industry-leading latency. It provides fine-grained control on volume, speed, and emotion.
Key Features:
* **42 languages** supported
* **Volume, speed, and emotion** controls, supported through API parameters and SSML tags
* **Laughter** through `[laughter]` tags
For more information, see [Volume, Speed, and Emotion](/build-with-cartesia/sonic-3/volume-speed-emotion).
### Voice selection
Choosing voices that work best for your use case is key to getting the best performance out of Sonic 3.
* **For voice agents**: We've found stable, realistic voices work better for voice agents than studio, emotive voices. Example American English voices include Katie (ID: `f786b574-daa5-4673-aa0c-cbe3e8534c02`) and Kiefer (ID: `228fca29-3a0a-435c-8728-5cb483251068`).
* **For expressive characters**: We've tagged our most expressive and emotive voices with the `Emotive` tag. Example American English voices include Tessa (ID: `6ccbfb76-1fc6-48f7-b71d-91ac6298247b`) and Kyle (ID: `c961b81c-a935-4c17-bfb3-ba2239de8c2f`).
For more information and recommendations, see [Choosing a Voice](/build-with-cartesia/capability-guides/choosing-a-voice).
### Language support
Sonic-3 supports the following 42 languages: English (`en`), French (`fr`), German (`de`), Spanish (`es`), Portuguese (`pt`), Chinese (`zh`), Japanese (`ja`), Hindi (`hi`), Italian (`it`), Korean (`ko`), Dutch (`nl`), Polish (`pl`), Russian (`ru`), Swedish (`sv`), Turkish (`tr`), Tagalog (`tl`), Bulgarian (`bg`), Romanian (`ro`), Arabic (`ar`), Czech (`cs`), Greek (`el`), Finnish (`fi`), Croatian (`hr`), Malay (`ms`), Slovak (`sk`), Danish (`da`), Tamil (`ta`), Ukrainian (`uk`), Hungarian (`hu`), Norwegian (`no`), Vietnamese (`vi`), Bengali (`bn`), Thai (`th`), Hebrew (`he`), Georgian (`ka`), Indonesian (`id`), Telugu (`te`), Gujarati (`gu`), Kannada (`kn`), Malayalam (`ml`), Marathi (`mr`), and Punjabi (`pa`).
## Selecting a Model
| Snapshot | Release Date | Languages | Status |
| ------------------------------------------- | ---------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------ |
| `sonic-3-2026-01-12` | January 12, 2026 | en, de, es, fr, ja, pt, zh, hi, ko, it, nl, pl, ru, sv, tr, tl, bg, ro, ar, cs, el, fi, hr, ms, sk, da, ta, uk, hu, no, vi, bn, th, he, ka, id, te, gu, kn, ml, mr, pa | Stable |
| `sonic-3-2025-10-27` | October 27, 2025 | en, de, es, fr, ja, pt, zh, hi, ko, it, nl, pl, ru, sv, tr, tl, bg, ro, ar, cs, el, fi, hr, ms, sk, da, ta, uk, hu, no, vi, bn, th, he, ka, id, te, gu, kn, ml, mr, pa | Stable |
The base model name `sonic-3` always points to the latest **stable** snapshot of the model.
When making API calls, you can specify either:
```python lines theme={null}
# Use the base model
# (automatically routes to the latest stable snapshot)
model_id = "sonic-3"
# Or specify a particular snapshot for consistency
model_id = "sonic-3-2026-01-12"
# Try the latest (beta) model (can be 'hot swapped')
model_id = "sonic-3-latest"
```
### Continuous updates and model snapshots
All models have a base model name (e.g. `sonic-3`) and a dated snapshot (e.g. `sonic-3-2025-10-27`). Using the base model will automatically keep you up to date with the most recent stable snapshot of that model. If pinning a specific version is important for your use case, we recommend using the dated version.
For testing our latest capabilities, we recommend using `sonic-3-latest`, which is a non-snapshotted version. `sonic-3-latest` can be updated without notice and is not recommended for production.
To summarize:
| **Model ID** | Model update behavior | Recommended for |
| -------------------- | :---------------------------------------------------------- | ------------------------------------------------------------------------------------------ |
| `sonic-3-YYYY-MM-DD` | Snapshotted, will never change | Customers who want to run internal evals before any updates |
| `sonic-3` | Will be updated to point to the most recent stable snapshot | Customers who want stable releases, but want to be up-to-date with the recent capabilities |
| `sonic-3-latest` | Will always be updated to our latest beta releases | Testing purposes |
## Older Models
For information on `sonic-2`, `sonic-turbo`, `sonic-multilingual`, and `sonic`, see our page on [Older Models](/build-with-cartesia/tts-models/older-models).
# Voice IDs
Source: https://docs.cartesia.ai/build-with-cartesia/tts-models/voice-ids
On June 1, 2026, we are discontinuing our voice embedding (aka stability) TTS models.
If you are currently making generation requests with voice embeddings like this:
```json theme={null}
{
"voice": {
"mode": "embedding",
"embedding": [1, 2, ..., 3, 4]
},
"model_id": "sonic-2",
// ...
}
```
You will need to switch to using voice IDs like this:
```json theme={null}
{
"voice": {
"mode": "id",
"id": "e07c00bc-4134-4eae-9ea4-1a55fb45746b"
},
"model_id": "sonic-2",
// ...
}
```
If you already use voice IDs, see [Migrating Voices](/build-with-cartesia/tts-models/migrating-voices) to make sure your voices will continue to work after the change.
For an overview of all changes, see [API Changes](/build-with-cartesia/tts-models/api-changes).
## Get a voice ID
Choose one of the following options.
### Check out the voice library
Our featured voices have all gone through rigorous evaluations and are ready to use in production.
Check them out at [play.cartesia.ai/voices](https://play.cartesia.ai/voices) and copy the ID of any voice you'd like to use.
### Clone a voice
If you have source audio, create a cloned voice via the [playground](https://play.cartesia.ai/voices/create/clone) or the [API](/api-reference/voices/clone). Cloning returns a voice ID you can use immediately.
### Generate source audio from your existing embedding
If you no longer have the original audio clip used to create your embedding, generate a short sample with `sonic` or `sonic-2` and then clone a new voice.
You can do this on our playground:
1. [play.cartesia.ai/text-to-speech](https://play.cartesia.ai/text-to-speech)
2. [play.cartesia.ai/voices/create/clone](https://play.cartesia.ai/voices/create/clone)
Or with our API:
1. [Text to Speech (Bytes)](/2024-11-13/api-reference/tts/bytes)
2. [Clone Voice](/api-reference/voices/clone)
Here is an example using our SDK:
```python theme={null}
from cartesia import Cartesia
# inputs
your_api_key: str = ""
your_voice_embedding: list[float] = []
language = "en"
transcript = """
It's nice to meet you.
Hope you're having a great day!
Could we reschedule our meeting tomorrow?
Please call me back as soon as possible.
"""
source_tts_model_id = "sonic"
client = Cartesia(api_key=your_api_key)
# Step 1: generate an audio sample
print(f"Generating audio sample {source_tts_model_id=}")
source_audio_iterator = client.tts.bytes(
voice={"mode": "embedding", "embedding": your_voice_embedding},
model_id=source_tts_model_id,
transcript=transcript,
language=language,
output_format={
"container": "wav",
"encoding": "pcm_f32le",
"sample_rate": 44100
},
)
# Step 2: clone a voice
print("Cloning a voice")
voice = client.voices.clone(
name="My Voice",
language=language,
clip=b"".join(source_audio_iterator),
mode="similarity",
)
print(f"Cloned voice {voice.id}")
# you can now use the voice like this
migrate_to_model = "sonic-3"
generated_sample_file_name = f"{migrate_to_model}_{voice.id}.wav"
cloned_audio_iterator = client.tts.bytes(
voice={"mode": "id", "id": voice.id},
model_id=migrate_to_model,
transcript=transcript,
language=language,
output_format={
"container": "wav",
"encoding": "pcm_f32le",
"sample_rate": 44100
},
)
with open(generated_sample_file_name, "wb") as f:
for chunk in cloned_audio_iterator:
f.write(chunk)
print(f"Listen to your new voice: {generated_sample_file_name}")
try:
import subprocess
subprocess.run(
[
"ffplay",
"-loglevel",
"quiet",
"-autoexit",
"-nodisp",
generated_sample_file_name,
]
)
except FileNotFoundError:
pass
```
## Using Voice IDs
See [TTS (Bytes)](/api-reference/tts/bytes), [TTS (SSE)](/api-reference/tts/sse), and [TTS (WebSocket)](/api-reference/tts/websocket) for full API documentation.
You can test these API changes by setting your [Cartesia Version](/use-the-api/api-conventions#always-send-a-cartesia-version-header) to `2026-03-01`. We recommend upgrading your Cartesia Version on production traffic before June 1 to make sure nothing breaks.
# Changelog 2024
Source: https://docs.cartesia.ai/changelog/2024
Product, API, and platform changes for 2024
### API
* Pricing updates; character usage columns migrated to bigint; presign URLs for Pro Voice Clone; **`voices/[id]/conditioning`** endpoint; file to dataset in presign; userID-level endpoint restrictions; Stripe Customer ID on checkout.
* EU deployment and Hindi HC fixes.
### Playground
* New model on Playground highlighting **transcript following** improvements (demo, not GA).
* Blog and play.cartesia.ai live.
### Models / Voices
* Model aliasing updated for **`sonic`** and **`sonic-preview`**; twilight-morning in API and enterprise; conditioning entries for voice clone and multilingual.
* Embedding search for LoRA voice selection.
### Other
* Infrastructure and scaling updates.
* State of Voice blog and map.
### API
* **Cartesia-Version 2024-11-13** — Upgrade to new API version; **unified clone voice endpoint**; datasets support; files endpoint pagination; FineTuneRequest status; fine-tunes API in Playground; presign URLs for Pro Voice Clone; **Flush Done** event for manual WebSocket flushing; **``** tag for continuations.
* GCP Enterprise.
### Playground
* Changes for new API; replay suite; GCP Enterprise.
### Models / Voices
* **Flush Done** event for manual flushing in WebSocket; **``** tag for continuations within a single transcript; spelling fixes; manual flush and flush ID.
* Empty encoding field allowed for mp3.
### Docs
* API version **2024-11-13**: Sonic 2, capability guides (clone, pronunciations, speed/emotion, continuations, localize), formatting for Sonic 2.
* Integrations: LiveKit, Pipecat, Rasa, Thoughtly, Twilio, MCP. Enterprise: SSO, organizations. See [API Conventions](/use-the-api/api-conventions).
### API
* Cartesia JS bytes endpoint; gen blocks removed from character counting; health checks and middleware; **user-level queueing** with queue length cap and timeout; 10× queue size rejection; Slang (continuations) and ConditioningData; voice changer JS SDK.
* Remove max limit from Playground.
### Playground
* GCP: API and ingress for GCP US Central. Queueing: user-level queueing in API gateway; queue length cap and `queuedRequest` timeout.
* Voice Changer: Playground UI polish; ConditioningData as part of ResolvedVoice; Slang rollout; flush on start/end of spell tags.
* LoRA release UI; onboarding data upsert fix; welcome page submit loading state; enterprise contact links.
### Other
* Canonical linking and sitemap.
* Blog and navigation (Blog, Careers) updates.
### API
* User-level queueing; queue size and websocket queueing rejection; **`api_status`** field for voice API usability; LoRA pricing and UX cleanup; **flush all audio on DONE token** (including CB); user option to obfuscate transcripts in logs.
* LoRA and load balancer improvements.
### Playground
* **Function calling**; agent creation, tests, and dev setup; voice agent infrastructure enabled.
* LoRA: HiFi cloning endpoint and Playground page; 8 new voices on Playground; Indian accent.
* **Voice Changer** Playground UI; JS SDK for voice changer. Language added to TTS request from `voices/[id]`; flush all audio on DONE token; user option to obfuscate transcripts in logs.
### Docs
* Blog and sitemap updates.
### API
* Reject invalid transcripts (docs and API gateway); `no_more_inputs` in WebSockets can use `voice_embedding` instead of `voice_id`.
* Improved bad model id handling.
### Playground
* **Localization** page in Playground and JS client; dialects and future-compatibility. Switch Playground to voice ID; allow both `id` and embedding for `TTSRequest`; archive voices (kept accessible via API).
* Replay button; feedback form; fix multilingual recommended voices when switching back to English; better error messaging.
### Models / Voices
* **LoRA** support (multiple voices per LoRA, new cache key, easy-brook-lora, vc-flowing-dream).
### Other
* On-device homepage launch; proper links for "Request a demo" button.
* **LoRA**: multiple voices per LoRA.
### API
* **Voice Conversion endpoint** — New API endpoint. **Timestamps** on WebSocket endpoint; **per-generation voice controls** (speed, emotion) in API; polar-tree deployed (`sonic-multilingual`); continuous batching support; VocalWave (English) and long-generation support; `sonic-english` → vocal-wave, `sonic-multilingual` → ancient-voice aliasing.
* **`buffer`** and **`mp3`** params on `/bytes`; MP3 streaming and WAV encoding fixes; request cancellation; empty transcript allowed when `continue=false`; Stripe webhook cache clear; subscription cancellation/reactivation; Redis cache for overage; keys endpoints.
* Clerk-based auth in API.
### Playground
* Optional **`enhance`** flag for voice cloning in JS client, Python client, and Playground; voice update endpoint and docs; gate voice cloning for free users.
* Prevent playing audio while playback in progress; download button disabled until generation finished; API key deletion clearer with copy button; character usage indicator; subscription and checkout fixes; gating clone form for free users.
### Docs
* Voice cloning docs; timestamps and continuations; user guides for voice control and Twilio; emotion control and timestamps; "phonemes" terminology.
* Voice cloning from file.
### Other
* Python client: continuations support, custom `base_url`, fallback for websockets; JS client v1.0.1: `onError` prop on useTTS.
* Voice controls (speed, emotion) in Python client and docs.
### API
* **Continuations** — Support for streaming input via SSE and Bytes; **`NoMoreInputs`** signal. **Cartesia Version** enforced via header; Playground and checkout/subscription endpoints send it.
* 48 kHz added to valid sample rates; `.wav` byte streaming; HTTP streaming endpoint for raw bytes; API standardization (backwards-compatible); new voices endpoints; mulaw and alaw backwards compatibility; Python client v1.0.0 (overhaul, `output_format`); JS client: `pcm_s16le`, `pcm_alaw`, `pcm_mulaw` and improved typing; caching for voices; **`context_id`** in WebSocket response and docs.
* Stripe webhooks for renewals and expiration; OpenAPI spec update.
### Playground
* Multilingual: `language` parameter on voices API and in API; Playground language selection; multilingual copy on homepage; default `sonic-english` → feasible-haze.
* Mobile layout improvements; multilingual UI papercuts; voice cloning and empty transcript styling fixes; filtering moved from `voices/[id]` to Speak page.
### Models / Voices
* **`sonic-multilingual`** and **`sonic-english`** aliasing; `language` column on voices.
* Recommended voices.
### Docs
* Version **2024-06-10**: get-started, API conventions, integrations (LiveKit, Pipecat, Rasa, Thoughtly, Twilio, MCP), clone voices, embeddings/voice mixing. See [API Conventions](/use-the-api/api-conventions).
### Other
* ToS changes; revised pricing tiers; legal notices on sign-in and sign-up; overage toggle in Playground.
* Character usage limit blocks WebSocket when exceeded.
### API
* **Cartesia Version** header; HTTP streaming for raw bytes; new voices endpoints; mulaw/alaw backwards compatibility; API standardization (backwards-compatible); Python client v1.0.0; JS client structure overhaul.
* Clone voice upload fix.
### Playground
* Redesign and Sonic launch copy; subscriptions page; favoriting voices; **emotion and speed sliders**; User vs Default voices; **tags** (Age, Accent) in DB and Playground; **`sample_text`** field (API Gateway and Playground); buffer streamed audio before playback; character usage indicator; API key auto-created on user creation; custom sign-in/sign-up and 404 on sign-out fix; disable generation button while audio playing; human-readable model names and skilled-cherry.
* Character limit increase.
### Models / Voices
* Human-readable model names; skilled-cherry; polar-tree (`sonic-multilingual`); continuations and output format; Python client numpy array support.
* Voice cloning disclaimer.
### Docs
* Mintlify docs added.
### Other
* Stripe webhooks for subscriptions; subscription cancellation and reactivation; character usage checks on generation routes; free subscription by default; Scale plan limit (8M chars/month); checkout and receipts.
* Custom sign-in/sign-up pages.
### API
* **`model_id`** added as parameter to generate; minimum transcript length enforced; `voice` moved to `AudioGenerationRequest`; experimental router removed; speed controls and voice edit page; video generation endpoint.
* WhisperX removed from dependencies.
### API
* WebSocket interrupt support; get voice embedding route; Redis cache for API keys; streaming switched from Octet to JSON; new model `genial-planet-1346`; `voice` param required on requests; formatting support.
* WhisperX for transcription (later removed).
### Playground
* Voice cloning in the UI; connection info in JS client; audio downloadable; transcript length validation (max 400 chars, empty rejected); improved UX and crash handling when API key missing; welcome message and icons.
* API key creation on sign-up via Clerk webhooks.
### Other
* Voice cloning and connection info in JS client.
# Changelog 2025
Source: https://docs.cartesia.ai/changelog/2025
Product, API, and platform changes for 2025
### API
* **sonic-3-latest** (preview) and dated **sonic-3-YYYY-MM-DD** snapshots.
* **sonic-3-latest** added to Playground TTS with banner when selected. See [Changelog 2026](/changelog/2026).
### Voice changes
* **Voice Library** — December: 25 new voices across 6 languages (12 English, 6 Hindi, 4 Arabic, 1 Spanish, 1 Japanese); 14 featured.
* Voice library changes; featured voice badge on voice page; **`/voices/recent`** endpoint.
### Playground
* **Report generation** (report button, alert when user reports).
* **Voice move**; **archive and publish** voices.
* **PVC**: custom PVC voices UI, multiple user errors surfaced to UI, feature flag for custom model during creation.
* **Pronunciation dicts**: new backend APIs, generator on create/edit, case sensitivity badge.
* **Agents**: new text-to-agent UI, create agent from **Github repo tarball**, system prompt generator for UI agent.
* **Narrations sunset** notice; TTS History pagination; auth strategy for access-tokens.
* **sonic-3-latest** banner and naming.
### Other
* PVC, STT, and agent improvements.
* Error handling and error codes.
### API
* Improved error handling and public error responses; cache invalidation by voice ID.
* IPVC train API (remove **`markAsReady`**); dataset files overfetch fix; default voice logic fix.
### Playground
* Pronunciation dicts migrate to new backend APIs; persist visual theme to DB; PVC pipeline error and recommendations.
* Call logs conversation view default; TTS textarea height fix; Sonic-3 model for partners shown.
* Billing overage "blood bar" and alert fixes; PVC gate for Startup plan.
* Pronunciation dict generator on create/edit; API version in dialog; featured voice toggle; narrations model selection.
### Line / Agents
* No user audio warning (250ms); Pipecat DeepgramNovaVADFilter.
* Call recording and artifact storage fixes.
### Models / Voices
* Sonic 3 PVC and normalizer updates; LoRA and PVC error handling; expand option for dataset file count.
* **`preview_file_url`**; **`tags_operator`** on GET /voices; restrict delete to non-public voices; **`owner_id`** check for fine tune voices; **`user_errors`** for PVCs.
* New Arabic accents; African French and Canadian French.
### Model changes
* **Sonic 3 launch (Oct 27)** — **sonic-3-2025-10-27** stable snapshot released; 42 languages; volume, speed, and emotion controls.
* Real-time conversation with emotion and laughter; \~190ms median latency. See [Sonic 3](/build-with-cartesia/tts-models/latest) and [Volume, Speed, and Emotion](/build-with-cartesia/sonic-3/volume-speed-emotion).
### Other
* Continued PVC, STT, and agent improvements; error handling and public errors; manifone voices; Sonic 3 PVC and normalizer updates.
* Transcript buffer multilingual and Thai pronunciation dictionary fix; TTFA buffering and reporting; Voice Conversion operator reload; audio norm operator.
### API
* **`user_id`** to **`owner_id`** in API (model aliasing / ownership).
* Improved error handling and version/limit checks.
### Line / Agents
* Warning if no user audio for 250ms+; Pipecat **DeepgramNovaVADFilter** for spurious `on_speech_started`.
* Call recording and artifact storage fixes.
### Models / Voices
* STT: Migrate STT providers to Deepgram where appropriate; Deepgram for non-English or language-detect agents; word-level user text chunks.
* Sonic 3 / PVC: Sonic 3 PVC updates; Hindi Sonic 3 normalizer revert; LoRA data processing and expand option for dataset file count; PVC errors to webhook.
* Manifone new voice; African French and Canadian French accents; partner agents can configure TTS models.
### Other
* LoRA bugfixes.
### API
* Production-facing agent WebSocket; **cancel endpoint** for ending live calls.
* Improved error handling and public error codes; cache invalidation by voice ID.
### Playground
* Telephony: stop billing for customer-managed numbers; Cartesia vs Twilio param separation.
* Outbound number management columns.
### Line / Agents
* **Deepgram Nova VAD** (`utterance_end_ms` configurable via **`vad_stop_secs`**).
### Models / Voices
* New endpoint for **`
### API
* **`deploy_error`** status fix.
### Playground
* **LangChain** launched voice agents with Cartesia Sonic TTS.
* Billing: Stripe customer for enterprise if needed; call runtime logs in call logs side panel; Call Logs UI nits (from June work).
### Line / Agents
* Partner pipeline parity with User Agent; **concurrency fix** (negative concurrency); agent metric LLM credit usage for evals; AgentEvaluations functionality.
* User Code Connector WS handlers fix; agent end turn handling; summarization system prompt; **`user_prompt`** in API; transcript removed from agent metric result; deadlock fix in WS timeout.
### Other
* Flushing and concurrency fixes.
### API
* **UserCodeAgent** deployment URL; **cancel endpoint** for force-ending live calls via API; Agent EoUD metric; cartesia agent speed-up; user prompt stored separately in agent metrics; **`agent_evaluations`** table; async flush for aggregator; User Code Connector WS and last bot turn handling; deployment URL delay on pickup.
* Concurrency and WS timeout fixes; improved goroutine handling; agent workers **`/chats`** timeout increase.
### Playground
* **Call Logs** page for agents with data table and side panel; **Agents demo** with Twilio web dialer, visualizer, and like/dislike feedback; deployment detail page and list; **Twilio number provisioning** (Parts 1 & 2); GitConnector redeploy on commit; deployment logs; zip upload for deployment; feature flag by organization; agents gated behind feature flag; **Deepgram as default STT** for agents; orgs v2 (frontend and backend); 20K credits for organizations; enterprise free trial days and email invoice options.
* **Credit usage**: separate TTS & STT concurrency panels; STT and Infill charts; voice page copyable fields; call runtime logs in call logs panel.
### Models / Voices
* STT: Whisper large v3; serve multiple models in STT pipeline; word-level user text chunks.
* FinetunedSTTContext fixes.
### API
* Voice conversion in enterprise.
### Playground
* Post–April: Following [April 2025](/changelog/2025) API changes (embeddings removed; use [Voice IDs](/build-with-cartesia/tts-models/voice-ids) and [Clone Voice](/api-reference/voices/clone)).
### Line / Agents
* User code deployments from DB; **`agent_deployments`** table; STT cartesia-streaming and Pipecat streaming Whisper; Bedrock proxy for OpenAI-compatible; timestamp bug fixes and default to original timestamps.
* Partner `/chat` and `/config` updates; DTMF support in UserCodeConnector; endpointing architecture.
### Models / Voices
* STT: Batch engine utilization; Pipecat streaming Whisper.
* Deepgram STT client `url`/`base_url` fix.
### Other
* Voice clone uploads fix.
### Breaking
* **sonic-2-2025-04-16** — Starting with **`sonic-2-2025-04-16`**, we're removing support for: Embeddings; **`stability`** cloning mode; Experimental controls for speed and emotion. The **`similarity`** cloning mode is dramatically better. To control speed and emotion today, use Instant Voice Cloning (e.g. FFMPEG, Voice Changer, or instant clones from **`sonic-2-2025-03-07`** embeddings). Users who need embeddings or experimental controls can use API version **`2024-11-13`** with model **`sonic-2-2025-03-07`** (both still available). See [Older models](/build-with-cartesia/tts-models/older-models).
### API
* listVoices by ID for single voice; warm-monkey PVC; **access tokens** (JWT); Cartesia-Version 2024-11-13; phoneme/original timestamps language check; TTS History source; LoRA from fine-tune checkpoints; context expiry replaced by input stream delay.
* **`sonic-2`** and **`sonic-2-2025-04-16`** ignore experimental controls on TTS generations; voice cloning supports only **`similarity`** clones.
* Removed embeddings from all endpoints; voices may only be specified by Voice ID; **`/tts`** cannot be called with voice embeddings.
* Deprecated **`/voices/create`** and **`/voices/mix`**.
### Breaking
* **sonic-2-2025-03-07** is the last Sonic 2 snapshot supporting voice embeddings and experimental controls. Use with API version **`2024-11-13`** for legacy behavior.
* sonic-preview → JollyTotem, RoseLion deprecated; sonic-2 alias to jolly-totem for speaker switching. See [Older models](/build-with-cartesia/tts-models/older-models).
### API
* **Cartesia-Version** updated to **2024-11-13**; model latency via header on bytes endpoint; new Sonic PVC model warm-monkey; listVoices by ID (single voice); **access tokens** (JWT signing, validation); API-level check for languages supporting phoneme and original timestamps.
* Organizations and billing; **free credits** 10k → 20k; overages product; subscription cache invalidation webhook; TTS History **source** column (api, playground, narrations); LoRA voices from base VoiceVariation and checkpoint for fine tunes.
### Playground
* **sonic-2** and **sonic-turbo** aliases launched; Sonic 2 / Sonic Turbo messaging (Turbo = 40ms latency).
* cartesia.ai/sonic and playground updates.
### Line / Agents
* Agent ID in websocket URL; telephony info on partner calls; Pipecat version upgrade; partner demo tool calls; warm-monkey PVC model; prespeak and function call flow updates.
* Twilio voice routes support agent IDs; Keypad DTMF on agent; half-duplex STT and LLM context; original timestamps support in API.
### Other
* **sonic-pvc** alias and DryVoice as sonic-pvc model. **Python SDK** announced.
### API
* **listVoices** by ID; localize endpoint voice name fix; 400s for bad body params; text forcing max transcript length; **OpenAI-compatible STT server**; agent with local STT; voice tags; on-device transcripts in evals; jolly-totem as default sonic-preview.
* S2S and Agents foundational blocks.
### Playground
* Instant cloning enabled for free users; voice tags; localize refactored to use conditioning; listVoices can query by ID for single voice; Sarah (Similarity) and Southern Woman migrations; on-device transcripts.
* Narrations settings (JSONB).
### Line / Agents
* Agent with local STT; foundational S2S + Agents blocks; design and pipeline work.
### Models / Voices
* STT: cartesia-streaming and Pipecat streaming Whisper; on-device transcripts.
### API
* **sonic-lite** added to API; EU deployment for prod API; save option for TTS bytes handler; CORS header for **Cartesia-File-ID**; Stripe credits default to `char_limit` in checkout; Redis cache for overage settings; polar-mountain and VC in EU; ListFiles paginator fix.
* Eval break/spell tags and replacement/normalization mode.
### Models / Voices
* sonic-preview routed to MisunderstoodFrog; polar-mountain added and staged; visionary-yogurt timestamp requests for any language.
* jolly-totem as default sonic-preview.
# Changelog 2026
Source: https://docs.cartesia.ai/changelog/2026
Product, API, and platform changes for 2026
### Sonic 3.5
*Sonic 3.5 is now available on `sonic-3-latest`. We'd love for you to try it and tell us what you think.*
#### Why you should try it
* **More natural speech, pacing, and emotional expression**, especially noticeable on expressive, conversational, and support-style transcripts.
* **Cleaner audio quality** across all languages and voices.
* **Better alphanumeric read-out** — confirmation codes, order numbers, phone numbers, IDs, and emails sound meaningfully more natural, in all supported languages.
* **Step-change multilingual performance**, particularly Hebrew, Japanese, Spanish, Hindi, German, Korean, and French.
* **English heteronyms** — tricky English heteronyms like "read," "bass," and "bow" now pronounce correctly in context.
#### How to try it
1. Point your API call or Playground request to the model ID `sonic-3-latest`.
2. Keep your existing voice IDs, request shape, and prompting — no code changes required for most customers.
3. Send us feedback on any voice or transcript that behaves differently than you expect.
As with any `-latest` alias, `sonic-3-latest` can be updated without notice and is not recommended for production. For production traffic, pin to the stable alias (`sonic-3`) or a dated snapshot (e.g. `sonic-3-2026-01-12`).
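As a minimal sketch of step 1 with the Python SDK (only the model ID changes; the key and voice ID below are placeholders):

```python theme={null}
from cartesia import Cartesia

client = Cartesia(api_key="your-api-key")

audio = client.tts.bytes(
    model_id="sonic-3-latest",  # the only change: point at the preview alias
    transcript="Trying out Sonic 3.5 on the latest alias.",
    voice={"mode": "id", "id": "your-existing-voice-id"},
    language="en",
    output_format={"container": "wav", "encoding": "pcm_f32le", "sample_rate": 44100},
)
```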
#### What to know to be successful
* **Spell tags still work the same way.** If you already wrap alphanumerics in `<spell>...</spell>`, you don't need to change anything — you'll just get better-sounding output. See [here](/build-with-cartesia/sonic-3-5/prompting-tips#controlling-pacing-and-spelling) for more details.
* **If you use custom delimiters** (commas/periods between characters or groups) to control pacing, our recommended format has changed. Use **spaces between characters** and **commas between groups**, e.g. `A B C, 1 2 3` instead of `A, B, C. 1, 2, 3.`. See [Prompting tips for Sonic 3.5](/build-with-cartesia/sonic-3-5/prompting-tips) for more details.
* **Speed and volume controls are temporarily disabled** on `sonic-3-latest`. If you rely on speed or volume augmentation (including via SSML), stay on `sonic-3` for now. We believe that Sonic 3.5 has more natural pacing and you may find that you don't need to use speed control as much when using this model.
* **Timestamps behave slightly differently.** If you use end-of-word timestamps for interruption handling, you should not see a meaningful change. If you depend on beginning-of-word timestamps, please test carefully and reach out if you see regressions for your use case.
* **Existing Professional Voice Clones (PVCs) do not carry over to `sonic-3-latest`.** Professional Voice Clones are pinned to the base model they were trained on (e.g. `sonic-3`) and will function as a standard voice clone for this model. For more information, see [Clone Voices (Pro)](/build-with-cartesia/capability-guides/clone-voices-pro/playground).
* **Providing proper context to the model improves naturalness.** Please see our buffering guide [here](/use-the-api/tts-websocket/buffering) for more details.
#### Where to look for help
* [Sonic 3.5 model overview](/build-with-cartesia/tts-models/sonic-3-5)
* [Prompting tips for Sonic 3.5](/build-with-cartesia/sonic-3-5/prompting-tips)
* [Model aliases and snapshots](/build-with-cartesia/tts-models/latest#continuous-updates-and-model-snapshots)
### Breaking
* **Text-to-Agent (T2A) API** — Text-to-Agent workflow for Line is **deprecated**.
### API
* **Error responses** — For `Cartesia-Version: 2026-03-01`, we now return structured JSON. See [API Errors](/use-the-api/api-errors).
* API versions before `2026-03-01` continue to return legacy error formats (for example HTTP `Title: Message`).
* **Voices** — `PATCH /voices/{id}`: voice owners can now update accent and gender. Voice creation validates language. Invalid voice UUIDs and pronunciation-dictionary IDs return 404 instead of ambiguous errors.
* **PVC model routing** — PVC voices require a dated model ID (e.g. **`sonic-3-2026-01-12`**) instead of **`sonic-3`**. See [Clone Voices (Pro)](/build-with-cartesia/capability-guides/clone-voices-pro/api).
* **Voice search** — Name and metadata search is **diacritics-insensitive**.
### Playground
* **Pro voice clones**
* Clearer **language mismatch** messaging
* **Background noise removal** is now a simple on/off control
* **Fine-tuning model support**:
* Removed support for older models
* Now only **sonic-3-2026-01-12** is supported
* **Multilingual agents** — Multilingual agent configuration is now supported in the Playground.
* **Agents UI** — Search by **call ID** and **agent ID**.
### Billing
* **Concurrency** — Organizations can receive **notifications** when concurrency nears configured **limits**.
### Model / voice
* **Professional Voice Clones** — Backend updates improve stability of the professional voice cloning workflow.
* **Accents & filters** — Additional **accent** options (e.g. **Irish**, **New Zealand**, **South African**, **Belgian**) and **locale aliases** for accent filtering in APIs and Playground.
* **Voice Library** — **94** new voices across **17** locales (including Arabic, German, English variants, Spanish, Finnish, French, Hebrew, Hindi, Japanese, Korean, Polish, Portuguese, Swedish, Telugu, Thai, and more).
### Self-hosted
* **On-premises** — API for managing voices on self-hosted deployments.
### Cartesia SDK
* **cartesia-js v3.0.0** (Mar 2) — Major updates:
* New features: `flush_id` included in chunk and voice changer binary responses; `output_format` and infill support; inline WebSocket response types; byte endpoint returns **ArrayBuffer**; improved **WebPlayer** and client export.
* Fixes: memory leak and timing issues with abort signals/listeners, handling of empty `Content-Length`, and **TimeoutError** now includes a message.
See [cartesia-js releases](https://github.com/cartesia-ai/cartesia-js/releases) for full details.
### Line
* **[History Management API](/line/sdk/agents#history-management)**: You can add or replace the history provided to your agent, for example, to summarize a long conversation.
* **[Custom User Events](/line/sdk/events#custom-event)**: You can send bidirectional custom events between your client and the agent. You could use this, for example, if you have a web application with UI interactions.
* **[Uninterruptible Messages](/line/sdk/events#speech)**: You can set messages as uninterruptible. A common use case is a legal disclaimer at the beginning of a call.
* **End Tool Call Improvements**: The default end call tool call is more conservative to prevent calls from ending prematurely.
### API
* Increased reliability of API connections
### Cartesia SDK
* **cartesia-python v3.0.0** (Feb 9). See full details in [cartesia-python releases](https://github.com/cartesia-ai/cartesia-python/releases).
### Playground
* Shipped a new TTS page
* Shipped a new Voice Creation page
* Shipped a new Agents page
### Model changes
* **Improved pronunciation of real-world text patterns across languages**
* Enhanced support for structured and formatted speech patterns: numbers, dates, times, currency, phone numbers, IDs, percentages, and amounts/measurements.
* Support for various date formats (YYYY-MM-DD, YYYY/MM/DD, 年月日).
* Support for measurement units (meters, kg, tablespoon, gigabytes, etc.) with locale awareness.
* Support for domestic and international phone number formats with locale-specific chunking for French, Italian, German, Portuguese, Korean, and more.
* Improved alphanumeric ID handling with katakana/hiragana readings and Latin acronym transliteration to katakana for Japanese.
* Improves all languages except English, Hindi & other Indic languages, Arabic, Hebrew, Chinese, Swedish, Georgian, Bulgarian, and Tagalog (targeted for future updates).
* **Support for regional and locale-specific pronunciation within languages**
* Regional voices use region-specific terms in addition to accent (e.g. Belgian and Swiss French "nonante" vs. Canadian and French "quatre-vingt-dix").
* Region-specific number terminology, currency symbols, date formats, and measurement units.
* Locale-aware date and time formatting (e.g. Russian year suffixes, French/Spanish time conventions).
* Locale-aware currency symbol handling (e.g. \$ as "dollars" in en\_US and "pesos" in es\_MX).
* Locale pronunciation falls back to the primary country for that language (e.g. US for English, Brazil for Portuguese). We will continue to expand locale-aware support.
* Improves all languages except English, Hindi & other Indic languages, Arabic, Hebrew, Chinese, Swedish, Georgian, Bulgarian, and Tagalog (targeted for future updates). Existing regional pronunciation for English voices (e.g. British) is unaffected. A short usage sketch follows this list.
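No request changes are needed to benefit from these improvements: structured text can be passed in the transcript as-is, with the locale of the selected voice driving how dates, phone numbers, and currency are read. Below is a minimal sketch, reusing the Python SDK's `client.tts.generate` call shown in the Error Handling example later in this document; the API key and French-locale voice ID are placeholders.

```python theme={null}
from cartesia import Cartesia

client = Cartesia(api_key="YOUR_API_KEY")

# Dates, phone numbers, and currency can be left in their written form;
# the locale of the selected voice determines how they are read aloud.
_response = client.tts.generate(
    model_id="sonic-3",
    transcript=(
        "Votre rendez-vous est le 2026-03-14 à 14h30. "
        "Appelez le 01 23 45 67 89. Le total est de 45,50 €."
    ),
    voice={"mode": "id", "id": "YOUR_FRENCH_VOICE_ID"},  # placeholder: any French-locale voice
    output_format={"container": "wav", "encoding": "pcm_f32le", "sample_rate": 44100},
)
```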
### Voice changes
* **Voice Library**: 39 new voices across 21 locales
### Breaking changes effective June 1, 2026
The following model snapshots and languages are discontinued effective June 1, 2026:
| Model | Snapshots | Languages |
| -------------------- | ---------------------------------------------------------------- | -------------------------- |
| `sonic` | All | All |
| `sonic-english` | — | All |
| `sonic-multilingual` | — | All |
| `sonic-2` | `sonic-2-2025-04-16`, `sonic-2-2025-05-08`, `sonic-2-2025-06-11` | it, nl, pl, ru, sv, tr, hi |
| | `sonic-2-2025-03-07` | All |
| `sonic-turbo` | `sonic-turbo-2025-06-04` | it, nl, pl, ru, sv, tr |
| | `sonic-turbo-2025-03-07` | All |
The following endpoints are discontinued effective June 1, 2026:
| Discontinued Endpoint | Replacement |
| ------------------------------------------ | ------------------------------------------ |
| Voice Embedding: `POST /voices/clone/clip` | [Clone Voice](/api-reference/voices/clone) |
| Mix Voices: `POST /voices/mix` | — |
| Create Voice: `POST /voices` | [Clone Voice](/api-reference/voices/clone) |
The following endpoints stop accepting voice embeddings effective June 1, 2026:
| Endpoint with a breaking change | Replacement |
| ------------------------------------- | ----------- |
| TTS (bytes): `POST /tts/bytes` | Voice ID |
| TTS (SSE): `POST /tts/sse` | Voice ID |
| TTS (WebSocket): `WSS /tts/websocket` | Voice ID |
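For requests that currently pass a voice embedding, the migration is to reference the voice by its ID instead. Here is a minimal sketch, assuming the Python SDK's `client.tts.generate` call shown in the Error Handling example later in this document; the API key and voice ID are placeholders.

```python theme={null}
from cartesia import Cartesia

client = Cartesia(api_key="YOUR_API_KEY")

# After June 1, 2026 the TTS endpoints no longer accept raw voice embeddings;
# reference the voice by its ID instead.
_response = client.tts.generate(
    model_id="sonic-3",
    transcript="Hello, world!",
    voice={"mode": "id", "id": "YOUR_VOICE_ID"},  # placeholder voice ID
    output_format={"container": "wav", "encoding": "pcm_f32le", "sample_rate": 44100},
)
```

Voices created with the [Clone Voice](/api-reference/voices/clone) endpoint have IDs that can be used this way.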
### API
* **Regionalization** — API calls are now routed to US, EU, or APAC regions based on their origin.
* **Parameterized outbound calls** — [Docs](/line/integrations/telephony/outbound-dialing)
* **Pronunciation dictionaries** — [Docs](/line/sdk/agents#custom-pronunciations)
### Model changes
* **Sonic-3 model versioning scheme introduced**
* New preview track: **`sonic-3-latest`** (continuous updates for early access and feedback).
* Stable track: **`sonic-3`** always points to the most recent stable release.
* Immutable dated snapshots: **`sonic-3-YYYY-MM-DD`** never change.
* Details: [Continuous updates and model snapshots](/build-with-cartesia/tts-models/latest#continuous-updates-and-model-snapshots) — a short selection sketch follows this list.
* **Promotion to stable checkpoint:** **`sonic-3-2026-01-12`**
* Included improvements: consistent speed & volume, custom IPA pronunciations with stronger adherence, Hindi prosody improvements, Korean prosody/intonation improvements.
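As a rough illustration of choosing between the three tracks, the sketch below reuses the Python `client.tts.generate` call from the Error Handling example later in this document; the API key and voice ID are placeholders.

```python theme={null}
from cartesia import Cartesia

client = Cartesia(api_key="YOUR_API_KEY")

# model_id selects the release track:
#   "sonic-3-latest"     -> preview track, continuously updated
#   "sonic-3"            -> stable track, most recent stable release
#   "sonic-3-2026-01-12" -> immutable dated snapshot, never changes
_response = client.tts.generate(
    model_id="sonic-3-2026-01-12",  # pin to a snapshot for reproducible behavior
    transcript="Hello, world!",
    voice={"mode": "id", "id": "YOUR_VOICE_ID"},  # placeholder voice ID
    output_format={"container": "wav", "encoding": "pcm_f32le", "sample_rate": 44100},
)
```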
### Voice changes
* **Featured Voices launched** — Curated set of 30+ best-performing voices (e.g. [Cathy](https://play.cartesia.ai/voices/e8e5fffb-252c-436d-b842-8879b84445b6), [Henry](https://play.cartesia.ai/voices/87286a8d-7ea7-4235-a41a-dd9fa6630feb)).
* **Voice Library** — December: 25 new voices across 6 languages.
* **Voice Library** — January: 9 Spanish voices (Mexican, Colombian, Castilian).
### Playground
* Voice library usability improvements (test with your own scripts, call an agent per voice).
* One-click **Report Issue** on TTS Playground.
* Mini voice picker (recently used + saved) on TTS page.
* Professional Voice Clone (PVC) UI and reliability improvements (loading skeletons, error messages, better behavior with large datasets and silence).
### Line
* **Line SDK v0.2** — [Repo](https://github.com/cartesia-ai/line). Improved DX, long-running tool-call handling, **committed turns**, better turn-taking and transcription.
# Error Handling
Source: https://docs.cartesia.ai/examples/error-handling
Example of error handling with SDK exceptions.
```python theme={null}
def error_handling_example(client: Cartesia) -> None:
"""Example of error handling with SDK exceptions."""
try:
_response = client.tts.generate(
model_id="sonic-3",
transcript="Hello, world!",
voice={"mode": "id", "id": "6ccbfb76-1fc6-48f7-b71d-91ac6298247b"},
output_format={"container": "wav", "encoding": "pcm_f32le", "sample_rate": 44100},
)
except BadRequestError as e:
print(f"Bad request: {e}")
except AuthenticationError as e:
print(f"Auth failed: {e}")
except NotFoundError as e:
print(f"Not found: {e}")
except RateLimitError as e:
print(f"Rate limited: {e}")
except APIError as e:
print(f"API error: {e}")
```
From [cartesia-python/examples/examples.py:545](https://github.com/cartesia-ai/cartesia-python/blob/main/examples/examples.py#L545)
```typescript theme={null}
async function errorHandling(client: Cartesia): Promise<void> {
/** Example of error handling with SDK exceptions. */
try {
await client.tts.generate({
model_id: 'sonic-3',
transcript: 'Hello, world!',
voice: { mode: 'id', id: '6ccbfb76-1fc6-48f7-b71d-91ac6298247b' },
output_format: { container: 'wav', encoding: 'pcm_f32le', sample_rate: 44100 },
});
} catch (e) {
if (e instanceof BadRequestError) {
console.log(`Bad request: ${e.message}`);
} else if (e instanceof AuthenticationError) {
console.log(`Auth failed: ${e.message}`);
} else if (e instanceof NotFoundError) {
console.log(`Not found: ${e.message}`);
} else if (e instanceof RateLimitError) {
console.log(`Rate limited: ${e.message}`);
} else if (e instanceof APIError) {
console.log(`API error: ${e.message}`);
} else {
throw e;
}
}
}
```
From [cartesia-js/examples/node\_examples.ts:398](https://github.com/cartesia-ai/cartesia-js/blob/main/examples/node_examples.ts#L398)
## Run this example
```sh theme={null}
cd cartesia-python
CARTESIA_API_KEY=YOUR_KEY python3 examples/examples.py error_handling_example
```
```sh theme={null}
cd cartesia-js
CARTESIA_API_KEY=YOUR_KEY npx ts-node examples/node_examples.ts errorHandling
```
# Create Infill Audio
Source: https://docs.cartesia.ai/examples/infill-create
Create infill audio between two clips.
```python theme={null}
def infill_create(client: Cartesia) -> None:
"""Create infill audio between two clips."""
from pathlib import Path
# Can pass file paths directly (as Path objects)
response = client.tts.infill(
model_id="sonic-3",
language="en",
transcript="Infill text",
left_audio=Path("left.wav"),
right_audio=Path("right.wav"),
voice_id="6ccbfb76-1fc6-48f7-b71d-91ac6298247b",
output_format={"container": "wav", "encoding": "pcm_f32le", "sample_rate": 44100},
)
response.write_to_file("infill_output.wav")
print(f"Saved audio to infill_output.wav")
print(f"Play with: ffplay -f wav infill_output.wav")
```
From [cartesia-python/examples/examples.py:504](https://github.com/cartesia-ai/cartesia-python/blob/main/examples/examples.py#L504)
```python theme={null}
async def infill_create_async(client: AsyncCartesia) -> None:
"""Create infill audio between two clips."""
from pathlib import Path
response = await client.tts.infill(
model_id="sonic-3",
language="en",
transcript="Infill text",
left_audio=Path("left.wav"),
right_audio=Path("right.wav"),
voice_id="6ccbfb76-1fc6-48f7-b71d-91ac6298247b",
output_format={"container": "raw", "encoding": "pcm_f32le", "sample_rate": 44100},
)
await response.write_to_file("infill_output_async.wav")
print("Saved audio to infill_output_async.wav")
print("Play with: ffplay -f wav infill_output_async.wav")
```
From [cartesia-python/examples/async\_examples.py:341](https://github.com/cartesia-ai/cartesia-python/blob/main/examples/async_examples.py#L341)
## Run this example
```sh theme={null}
cd cartesia-python
CARTESIA_API_KEY=YOUR_KEY python3 examples/examples.py infill_create
```
```sh theme={null}
cd cartesia-python
CARTESIA_API_KEY=YOUR_KEY python3 examples/async_examples.py infill_create_async
```
# Next.js Full Example
Source: https://docs.cartesia.ai/examples/nextjs
A complete Next.js application with batch TTS, HTTP streaming, and WebSocket streaming.
A full Next.js app demonstrating three approaches to Cartesia TTS in the browser:
batch generation, HTTP streaming, and WebSocket streaming. Includes a server-side
token endpoint so API keys are never exposed to the client.
## Token Endpoint
```typescript app/api/token/route.ts theme={null}
import Cartesia from "@cartesia/cartesia-js";
const client = new Cartesia({ apiKey: process.env.CARTESIA_API_KEY });
export async function POST() {
const { token } = await client.accessToken.create({
grants: { tts: true },
expires_in: 300,
});
return Response.json({ token });
}
```
## Batch and HTTP Streaming
```tsx app/page.tsx theme={null}
"use client";
import { useRef, useState } from "react";
import Cartesia from "@cartesia/cartesia-js";
const SAMPLE_RATE = 44100;
const BYTES_PER_SAMPLE = 4; // f32le
async function getToken(): Promise<string> {
const res = await fetch("/api/token", { method: "POST" });
const { token } = await res.json();
return token;
}
// =============================================================================
// Batch: waits for the full response, then plays via