# Delete Agent Source: https://docs.cartesia.ai/api-reference/agents/agents/delete /latest.yml DELETE /agents/{agent_id} # Get Agent Source: https://docs.cartesia.ai/api-reference/agents/agents/get /latest.yml GET /agents/{agent_id} Returns the details of a specific agent. To create an agent, use the CLI or the Playground for the best experience and integration with GitHub. # List Agents Source: https://docs.cartesia.ai/api-reference/agents/agents/list /latest.yml GET /agents Lists all agents associated with your account. # List Phone Numbers Source: https://docs.cartesia.ai/api-reference/agents/agents/phone-numbers /latest.yml GET /agents/{agent_id}/phone-numbers List the phone numbers associated with an agent. Currently, each agent can have only one phone number, and these numbers are provisioned by Cartesia. # List Templates Source: https://docs.cartesia.ai/api-reference/agents/agents/templates /latest.yml GET /agents/templates List of public, Cartesia-provided agent templates to help you get started. # Update Agent Source: https://docs.cartesia.ai/api-reference/agents/agents/update /latest.yml PATCH /agents/{agent_id} # Download Call Audio Source: https://docs.cartesia.ai/api-reference/agents/calls/download-call-audio /latest.yml GET /agents/calls/{call_id}/audio This endpoint streams the call audio to the client as a .wav (WAV format) file. # Get Call Source: https://docs.cartesia.ai/api-reference/agents/calls/get-call /latest.yml GET /agents/calls/{call_id} # Get Call Runtime Logs Source: https://docs.cartesia.ai/api-reference/agents/calls/get-call-logs /latest.yml GET /agents/calls/{call_id}/logs Returns the runtime logs for a specific call. These are the logs produced by your agent's code during the call. Logs may not be available if the call is still in progress or if they have been removed due to data retention settings. # List Calls Source: https://docs.cartesia.ai/api-reference/agents/calls/list-calls /latest.yml GET /agents/calls Lists calls for a specific agent, sorted by start time in descending order. `agent_id` is required. To include `transcript` in the response, add `expand=transcript` to the request. This endpoint is paginated. # Get Deployment Source: https://docs.cartesia.ai/api-reference/agents/deployments/get-deployment /latest.yml GET /agents/deployments/{deployment_id} Get a deployment by its ID. # List Deployments Source: https://docs.cartesia.ai/api-reference/agents/deployments/list-deployments /latest.yml GET /agents/{agent_id}/deployments List of all deployments associated with an agent. # Add Metric to Agent Source: https://docs.cartesia.ai/api-reference/agents/metrics/add-metric-to-agent /latest.yml POST /agents/{agent_id}/metrics/{metric_id} Add a metric to an agent. Once the metric is added, it will automatically run on all subsequent calls made to the agent. # Create Metric Source: https://docs.cartesia.ai/api-reference/agents/metrics/create-metric /latest.yml POST /agents/metrics Create a new metric. # Export Metric Results as CSV Source: https://docs.cartesia.ai/api-reference/agents/metrics/export-metric-results /latest.yml GET /agents/metrics/results/export Export metric results to a CSV file. This endpoint streams at most 100,000 results as a CSV file directly to the client. Use the optional filters to narrow down the results to export. # Get Metric Source: https://docs.cartesia.ai/api-reference/agents/metrics/get-metric /latest.yml GET /agents/metrics/{metric_id} Get a metric by its ID.
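For illustration, here is a minimal sketch of calling the Get Metric endpoint with Python's `requests`. The base URL, bearer-token `Authorization` header, and `Cartesia-Version` header follow the conventions shown in the Pro Voice Cloning example later in these docs; treat their applicability to the Agents API as an assumption and confirm against the API reference.

```python theme={null}
import os

import requests

API_BASE = "https://api.cartesia.ai"
HEADERS = {
    # Header names and version string mirror the Pro Voice Cloning example below;
    # confirm the required values for the Agents API in the API reference.
    "Cartesia-Version": "2025-04-16",
    "Authorization": f"Bearer {os.environ['CARTESIA_API_KEY']}",
}

# Fetch a single metric by its ID (GET /agents/metrics/{metric_id}).
metric_id = "your-metric-id"  # hypothetical placeholder
res = requests.get(f"{API_BASE}/agents/metrics/{metric_id}", headers=HEADERS)
res.raise_for_status()
print(res.json())
```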
# List Metric Results Source: https://docs.cartesia.ai/api-reference/agents/metrics/list-metric-results /latest.yml GET /agents/metrics/results Paginated list of metric results. Filter results using the query parameters. # List Metrics Source: https://docs.cartesia.ai/api-reference/agents/metrics/list-metrics /latest.yml GET /agents/metrics List of all LLM-as-a-Judge metrics owned by your account. # Remove Metric from Agent Source: https://docs.cartesia.ai/api-reference/agents/metrics/remove-metric-from-agent /latest.yml DELETE /agents/{agent_id}/metrics/{metric_id} Remove a metric from an agent. Once the metric is removed, it will no longer run automatically on calls made to the agent. Existing metric results will remain. # API Status and Version Source: https://docs.cartesia.ai/api-reference/api-status/get /latest.yml GET / # Speech-to-Text (Streaming) Source: https://docs.cartesia.ai/api-reference/stt/stt This endpoint creates a bidirectional WebSocket connection for real-time speech transcription. Our STT endpoint lets you send a stream of audio as bytes and provides transcription results as they become available. **Usage Pattern**: 1. Connect to the WebSocket with appropriate query parameters 2. Send audio chunks as binary WebSocket messages in the specified encoding format 3. Receive transcription messages as JSON with word-level timestamps 4. Send `finalize` as a text message to flush any remaining audio (receives `flush_done` acknowledgment) 5. Send `done` as a text message to close the session cleanly (receives `done` acknowledgment and closes) **Performance Recommendation**: For best performance, resample your audio before streaming and send audio chunks in `pcm_s16le` format at a 16kHz sample rate. **Pricing**: Speech-to-text streaming is priced at **1 credit per 1 second** of audio streamed in. For WebSocket connection limits, see the [concurrency limits and timeouts](/use-the-api/concurrency-limits-and-timeouts) page. # Speech-to-Text (Batch) Source: https://docs.cartesia.ai/api-reference/stt/transcribe /latest.yml POST /stt Transcribes audio files into text using Cartesia's Speech-to-Text API. Upload an audio file and receive a complete transcription response. Supports arbitrarily long audio files with automatic intelligent chunking for longer audio. **Supported audio formats:** flac, m4a, mp3, mp4, mpeg, mpga, oga, ogg, wav, webm **Response format:** Returns JSON with transcribed text, duration, and language. Include `timestamp_granularities: ["word"]` to get word-level timestamps. **Pricing:** Batch transcription is priced at **1 credit per 2 seconds** of audio processed. For migrating from the OpenAI SDK, see our [OpenAI Whisper to Cartesia Ink Migration Guide](/use-the-api/migrate-from-open-ai). # Text to Speech (Bytes) Source: https://docs.cartesia.ai/api-reference/tts/bytes /latest.yml POST /tts/bytes # Text to Speech (SSE) Source: https://docs.cartesia.ai/api-reference/tts/sse /latest.yml POST /tts/sse # Text to Speech (WebSocket) Source: https://docs.cartesia.ai/api-reference/tts/websocket This endpoint creates a bidirectional WebSocket connection. The connection supports multiplexing, so you can send multiple requests and receive the corresponding responses in parallel. The WebSocket API is built around contexts: - When you send a generation request, you pass a `context_id`.
Further inputs on the same `context_id` will [continue the generation](/build-with-cartesia/capability-guides/stream-inputs-using-continuations), maintaining prosody. - Responses for a context contain the `context_id` you passed in so that you can match requests and responses. Read the guide [on working with contexts](/use-the-api/tts-websocket/contexts) to learn more. For the best performance, we recommend the following usage pattern: 1. **Do many generations over a single WebSocket**. Just use a separate context for each generation. The WebSocket scales up to dozens of concurrent generations. 2. **Set up the WebSocket before the first generation**. This ensures you don’t incur latency when you start generating speech. 3. **Include necessary spaces and punctuation**: This allows Sonic to generate speech more accurately and with better prosody. For conversational agent use cases, we recommend the following usage pattern: 1. **Each turn in a conversation should correspond to a context**: For example, if you are using Sonic to power a voice agent, each turn in the conversation should be a new context. 2. **Start a new context for interruptions**: If the user interrupts the agent, start a new context for the agent’s response. To learn more about managing concurrent generations and WebSocket connection limits, see the [concurrency limits and timeouts](/use-the-api/concurrency-limits-and-timeouts) page. # Clone Voice Source: https://docs.cartesia.ai/api-reference/voices/clone /latest.yml POST /voices/clone Clone a high similarity voice from an audio clip. Clones are more similar to the source clip, but may reproduce background noise. For these, use an audio clip about 5 seconds long. # Delete Voice Source: https://docs.cartesia.ai/api-reference/voices/delete /latest.yml DELETE /voices/{id} # Get Voice Source: https://docs.cartesia.ai/api-reference/voices/get /latest.yml GET /voices/{id} # List Voices Source: https://docs.cartesia.ai/api-reference/voices/list /latest.yml GET /voices # Localize Voice Source: https://docs.cartesia.ai/api-reference/voices/localize /latest.yml POST /voices/localize Create a new voice from an existing voice localized to a new language and dialect. # Update Voice Source: https://docs.cartesia.ai/api-reference/voices/update /latest.yml PATCH /voices/{id} Update the name, description, and gender of a voice. To set the gender back to the default, set the gender to `null`. If gender is not specified, the gender will not be updated. # Audio encodings Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/audio-encodings Pick the encoding that matches your downstream pipeline. ## TTS output encodings Used in the `output_format.encoding` field when generating audio. | Encoding | Bit depth | Best for | Pair with sample rate | | ----------- | ---------------- | --------------------------------------------------------------- | --------------------------------- | | `pcm_s16le` | 16-bit int | General-purpose playback, browsers, audio players, most devices | 44100 (CD quality) or 16000–48000 | | `pcm_f32le` | 32-bit float | ML post-processing, high-fidelity recording, audio analysis | 48000 | | `pcm_mulaw` | 8-bit compressed | North American / Japanese telephony (G.711μ), Twilio | 8000 | | `pcm_alaw` | 8-bit compressed | European / international telephony (G.711A) | 8000 | ### `pcm_s16le` 16-bit signed integer PCM, little-endian. Matches the standard audio CD format and is the most widely supported encoding across audio players, browsers, and hardware. 
Use this as your default unless you have a specific reason to choose another format. ```json theme={null} { "container": "raw", "encoding": "pcm_s16le", "sample_rate": 44100 } ``` ### `pcm_f32le` 32-bit floating point PCM, little-endian. Provides the highest precision and dynamic range. Use when your pipeline handles float audio end-to-end—for example, feeding generated audio into an ML model, performing signal processing with NumPy/SciPy, or recording to a lossless format for later mastering. ```json theme={null} { "container": "raw", "encoding": "pcm_f32le", "sample_rate": 48000 } ``` ### `pcm_mulaw` 8-bit μ-law compressed PCM. The standard encoding for North American and Japanese telephone networks (G.711μ). Use this when sending audio to Twilio or any telephony provider that expects μ-law. Always pair with an 8000 Hz sample rate to match the telephony standard. ```json theme={null} { "container": "raw", "encoding": "pcm_mulaw", "sample_rate": 8000 } ``` ### `pcm_alaw` 8-bit A-law compressed PCM. The standard encoding for European and international telephone networks (G.711A). Use when your telephony infrastructure expects A-law rather than μ-law. Always pair with an 8000 Hz sample rate. ```json theme={null} { "container": "raw", "encoding": "pcm_alaw", "sample_rate": 8000 } ``` ## STT input encodings Used in the `encoding` parameter when sending audio for transcription. Must match the actual encoding of your audio source. | Encoding | Bit depth | Common sources | | ----------- | ---------------- | ------------------------------------------------------------------- | | `pcm_s16le` | 16-bit int | Microphones, browsers (Web Audio API), most audio capture libraries | | `pcm_s32le` | 32-bit int | Professional audio interfaces | | `pcm_f16le` | 16-bit float | Half-precision ML pipelines | | `pcm_f32le` | 32-bit float | ML models, Web Audio API `AudioWorklet` nodes, NumPy/SciPy | | `pcm_mulaw` | 8-bit compressed | North American telephony, Twilio streams | | `pcm_alaw` | 8-bit compressed | European telephony systems | For best STT performance, resample your audio to `pcm_s16le` at 16000 Hz before sending. # Choosing a Voice Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/choosing-a-voice How to pick the best voice for your Voice Agents When designing a voice agent experience, the voice that your agents will speak in is a critical choice that will influence your customers' experience. Cartesia offers 500+ voices out of the box, as well as the ability to clone your own voices. ### Featured Voices We feature a set of Voices that we've found work well for our customers and pass our internal quality checks. These voices are a great starting point to find the best Voice for your voice agent. Featured Voices are displayed with a check mark icon next to their names on [play.cartesia.ai](https://play.cartesia.ai/). ### Stable voices (best for voice agents) For voice agents in production, we've found that more stable, realistic voices perform better than studio-quality, emotive voices. From our testing, we think these are the top-performing English Voices for voice agents in Sonic 3: * **Male**: Ronald, Carson * **Female**: Katie, Jacqueline, Brooke ### Emotive voices (best for AI characters) Our latest model, Sonic 3, is very expressive; some voices, like Tessa and Maya, are labeled as emotive in the playground and respond well to [emotion instructions](/build-with-cartesia/sonic-3/volume-speed-emotion). If your use case requires more expressive speech (e.g.
companion apps, game characters), then we suggest trying: * **Male**: Kyle, Cory * **Female**: Tessa, Ariana We tag such voices as Emotive in our playground, and you can see a full list [here](https://play.cartesia.ai/voices?tags=Emotive). # Choosing TTS parameters Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/choosing-tts-parameters Our Text-to-Speech API includes many parameters that can be bewildering to developers who have not worked with audio before. In general, you should pick the highest precision and sample rate supported by every stage of your audio pipeline, including telephony and device outputs. A typical digital audio setup will perform well with these settings, which match the standard audio CD format: ``` output_format: { container: "raw", encoding: "pcm_s16le", sample_rate: 44100, } ``` If you know your pipeline supports a higher encoding and sample rate end to end, the highest quality settings are: ``` output_format: { container: "raw", encoding: "pcm_f32le", sample_rate: 48000, } ``` ## Reference The container format (if any) for the audio output. Available options: `RAW`, `WAV`, `MP3`. Only the Bytes endpoint supports all container formats; our streaming endpoints (SSE, WebSockets) only support `RAW`. The encoding of the output audio. Available options: `pcm_f32le`, `pcm_s16le`, `pcm_mulaw`, `pcm_alaw`. For detailed guidance on when to use each encoding, see [Audio encodings](/build-with-cartesia/capability-guides/audio-encodings). The sample rate of the output audio. Remember that to represent a given signal, the sample rate must be at least twice the highest frequency component of the signal (Nyquist theorem). Available options: `8000`, `16000`, `22050`, `24000`, `44100`, `48000`. ## Examples ### Audio CD quality Standard audio CDs are encoded as `pcm_s16le` at 44.1kHz sample rate: ``` output_format: { container: "raw", encoding: "pcm_s16le", sample_rate: 44100, } ``` This performs well for consumer digital audio setups. ### Telephony Many customers send their audio output over Twilio. Since all audio sent over Twilio is transcoded to µlaw encoding with 8kHz sample rate (to match the telephony standard), you should specify the following output\_format: ``` output_format: { container: "raw", encoding: "pcm_mulaw", sample_rate: 8000, } ``` ### Bluetooth headsets If you happen to know that the user is using a Bluetooth headset (such as AirPods) to multiplex both microphone input and headphone output, the user will be on the Bluetooth Hands-Free Profile (HFP), limiting the sample rate to 16kHz. (In practice, it's difficult to programmatically determine the end-user's microphone/speaker devices, so this example is a bit contrived.) ``` output_format: { container: "raw", encoding: "pcm_s16le", sample_rate: 16000, } ``` # Clone Voices Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/clone-voices Learn how to get the best voice clones from your audio clips. Voice cloning is available through the [playground](https://play.cartesia.ai) and the [API](/2024-11-13/api-reference/voices/clone). With current API versions, instant cloning uses **high-similarity** mode: clones sound more like the source clip, but may reproduce background noise. For the legacy **stability** workflow, pin API version `2024-11-13` and see [Older TTS models](/build-with-cartesia/tts-models/older-models). For the best voice clones, we recommend following these best practices: ## General best practices for voice cloning 1.
**Choose an appropriate script to speak.** You want your recording to align as closely as possible with the voice you want to generate. For example, don't read a colorless transcript in a monotone voice unless you're aiming for a monotonous clone. Instead, prepare a script that is suited to your use case and has the right energy. 2. **Speak as clearly as possible and avoid background noise.** For example, when recording yourself, try to use a high-quality microphone and be in a quiet space. 3. **Avoid long pauses.** Pauses in the recording, such as between sentences, will be mimicked by the cloned voice. Ensure your recording matches the pacing you want your voice to follow. 4. **Trim your recording.** The audio you provide should roughly contain speech from start to finish. Make sure the speaker is not cut off and that there's no excessive silence at the beginning or end. You can use a tool like Audacity or our playground to make the perfect clip from your recording. 5. **Speak in the target language.** For instance, if you want the cloned voice to speak Spanish, speak Spanish in the recording. If this is not possible, you can use Cartesia's localization feature—available in the playground and in the API—to convert your clone to a different language. ## Best practices for high-similarity clones 1. **Limit your recording to ten seconds.** This is the sweet spot for high-similarity clones. A longer clip will not result in a better clone. 2. **Set `enhance` to `false` when cloning.** Unless your source clip has substantial background noise, any postprocessing will reduce the similarity of the clone to the source clip. # End-to-end Pro Voice Cloning (Python) Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/clone-voices-pro/api Use Cartesia's REST API to create a Pro Voice Clone. > **Prerequisites** > > 1. You have a **Cartesia API key** (export it as `CARTESIA_API_KEY`). > 2. You have at least 1M credits on your account. > 3. You have a folder called `samples/` with one or more `.wav` files. ```python lines theme={null} """ End-to-end Pro Voice Cloning example. Steps ----- 1. Create a dataset. 2. Upload audio files from samples/ to the dataset. 3. Kick off a fine-tune from that dataset. 4. Poll until fine-tune is completed. 5. Get the voices produced by the fine-tune.
""" import os import time from pathlib import Path import requests API_BASE = "https://api.cartesia.ai" API_HEADERS = { "Cartesia-Version": "2025-04-16", "Authorization": f"Bearer {os.environ['CARTESIA_API_KEY']}", } def create_dataset(name: str, description: str) -> str: """POST /datasets → dataset id.""" res = requests.post( f"{API_BASE}/datasets", headers=API_HEADERS, json={"name": name, "description": description}, ) res.raise_for_status() return res.json()["id"] def upload_file_to_dataset(dataset_id: str, path: Path) -> None: """POST /datasets/{dataset_id}/files (multipart/form-data).""" with path.open("rb") as fp: res = requests.post( f"{API_BASE}/datasets/{dataset_id}/files", headers=API_HEADERS, files={"file": fp, "purpose": (None, "fine_tune")}, ) res.raise_for_status() def create_fine_tune(dataset_id: str, *, name: str, language: str, model_id: str) -> str: """POST /fine-tunes → fine-tune id.""" body = { "name": name, "description": "Pro Voice Clone demo", "language": language, "model_id": model_id, "dataset": dataset_id, } res = requests.post(f"{API_BASE}/fine-tunes", headers=API_HEADERS, json=body, timeout=60) res.raise_for_status() return res.json()["id"] def wait_for_fine_tune(ft_id: str, every: float = 10.0) -> None: """Poll GET /fine-tunes/{id} until status == completed.""" start = time.monotonic() while True: res = requests.get(f"{API_BASE}/fine-tunes/{ft_id}", headers=API_HEADERS) res.raise_for_status() status = res.json()["status"] print(f"fine-tune {ft_id} -> {status}. Elapsed: {time.monotonic() - start:.0f}s") if status == "completed": return if status == "failed": raise RuntimeError(f"fine-tune ended with status={status}") time.sleep(every) def list_voices(ft_id: str) -> list[dict]: """GET /fine-tunes/{id}/voices → list of voices.""" res = requests.get(f"{API_BASE}/fine-tunes/{ft_id}/voices", headers=API_HEADERS) res.raise_for_status() return res.json()["data"] if __name__ == "__main__": # Create the dataset DATASET_ID = create_dataset("PVC demo", "Samples for a Pro Voice Clone") print("Created dataset:", DATASET_ID) # Upload .wav files to the dataset for wav_path in Path("samples").glob("*.wav"): upload_file_to_dataset(DATASET_ID, wav_path) print(f"Uploaded {wav_path.name} to dataset {DATASET_ID}") # Ask for confirmation before kicking off the fine-tune confirmation = input( "Are you sure you want to start the fine-tune? It will cost 1M credits upon successful completion (yes/no): " ) if confirmation.lower() != "yes": print("Fine-tuning cancelled by user.") exit() # Kick off the fine-tune FINE_TUNE_ID = create_fine_tune( DATASET_ID, name="PVC demo", language="en", model_id="sonic-2", ) print(f"Started fine-tune: {FINE_TUNE_ID}") # Wait for training to finish wait_for_fine_tune(FINE_TUNE_ID) print("Fine-tune completed!") # Fetch the voices created by the fine-tune voices = list_voices(FINE_TUNE_ID) print("Voices IDs:") for voice in voices: print(voice["id"]) ``` # Pro Voice Cloning Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/clone-voices-pro/playground ## Why use Pro Voice Cloning? A Professional Voice Clone (PVC) is a voice that uses a fine-tune of our TTS model on your data, which allows it to create an almost exact replica of the voice it hears including accent, speaking style, and audio quality. Compared to [Instant Voice Cloning](/build-with-cartesia/capability-guides/clone-voices), Pro Voice Cloning can capture the exact nuances of your hours of studio-quality audio voice data. 
## Overview Pro Voice Cloning is available in the [Playground](https://play.cartesia.ai/pro-voice-cloning) for anyone with a Cartesia subscription of Startup or higher. It allows you to create highly accurate voice clones by leveraging a larger amount of data compared to instant cloning. | Feature | Required audio data | Pricing: cost to create | Pricing: cost to use for TTS | | ------------------- | ------------------- | ----------------------- | ---------------------------- | | Instant Voice Clone | 10 seconds | Free | 1 credit per character | | Pro Voice Clone | 3 hours | 1M credits on success | 1.5 credits per character | When you create a Pro Voice Clone, Cartesia first fine-tunes a model on your data, then creates Voices from selected clips of your data. These Voices are tied to the fine-tuned model, which is automatically used whenever you generate text-to-speech with them. ## Get started Visit the Pro Voice Clone tab to get started on your first PVC. On the home page, you can see all your fine-tuned models and their statuses (i.e., Draft, Failed, Training, Completed). Fill out the form to create a Pro Voice Clone. Then, upload all of the audio files you want to use for training. You can upload multiple files at once. Files must be one of the following audio formats: * .wav * .mp3 * .flac * .ogg * .oga * .ogx * .aac * .wma * .m4a * .opus * .ac3 * .webm Pro Voice Clones require a minimum of 30 minutes of audio, but we recommend 2 hours of audio for an optimal balance of quality and effort. The Pro Voice Clone will closely match your uploaded data, so make sure it sounds the way you like in terms of background noise, loudness, and speech quality. Generally, it's better to upload audio with only the speaker you wish to clone. Multi-speaker audio can interfere with cloning quality. You can also reuse data from past Pro Voice Clones: switch to the **Select dataset** tab to view previous datasets. These datasets can be edited separately from your PVCs and are helpful for managing your audio files. Training should take 3 hours to complete. You'll only be charged if the training is successful. If training fails, you can click the `Re-attempt Training` button to try again or contact [support](mailto:support@cartesia.ai) if the failures persist. Once training is complete, we'll automatically create four Voices based on different source audio clips from your dataset. These Voices are internally linked to your fine-tuned model, which will be used when you specify the model ID of the fine-tuned model in your requests. The Voices are also available in the Voice Library under My Voices and can be used through the API. **Note about base model updates:** Your PVC is a fine-tune of the latest base model available in production, which is reflected in the displayed model ID. This means that the fine-tuned model is fixed to this particular model ID and will not be activated if you use a different `model_id`. PVCs will not automatically be updated for future base models, and will need to be retrained on each new base model. Retraining a new fine-tuned model with new data or the latest base model will again cost 1M credits. # Localize voices Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/localize-voices Learn how to localize voices for your brand or product. The localization feature accepts a voice to localize, the gender of that voice, and the target language and accent, and produces a Voice that you can use to generate speech (or save as a new voice).
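As an illustration only, here is a minimal sketch of calling the Localize Voice endpoint (`POST /voices/localize`) with Python's `requests`. The request-body field names below are assumptions based on the inputs described above (the source voice, its gender, and the target language and accent), not a confirmed schema — check the [Localize Voice API reference](/api-reference/voices/localize) for the authoritative parameters.

```python theme={null}
import os

import requests

API_BASE = "https://api.cartesia.ai"
HEADERS = {
    "Cartesia-Version": "2025-04-16",  # version header as used elsewhere in these docs
    "Authorization": f"Bearer {os.environ['CARTESIA_API_KEY']}",
}

# Field names below (voice_id, language, original_speaker_gender, dialect) are
# illustrative assumptions matching the inputs described above, not a confirmed schema.
body = {
    "voice_id": "your-source-voice-id",   # the voice to localize
    "language": "fr",                     # target language
    "original_speaker_gender": "female",  # gender of the source voice
    "dialect": "fr",                      # target accent/dialect, if applicable
}

res = requests.post(f"{API_BASE}/voices/localize", headers=HEADERS, json=body)
res.raise_for_status()
print(res.json())  # the localized Voice you can use for TTS or save as a new voice
```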
# Stream Inputs using Continuations Source: https://docs.cartesia.ai/build-with-cartesia/capability-guides/stream-inputs-using-continuations Learn how to stream input text to Sonic TTS. In many real-time use cases, you don't have input text available upfront—like when you're generating it on the fly using a language model. For these cases, we support input streaming through a feature we call *continuations*. This guide will cover how input streaming works from the perspective of the TTS model. If you just want to implement input streaming, see [the WebSocket API reference](/api-reference/tts/tts), which implements continuations using *contexts*. ## Continuations Continuations are generations that extend already generated speech. They're called continuations because you're continuing the generation from where the last one left off, maintaining the *prosody* of the previous generation. If you don't use continuations, you get sudden changes in prosody that create seams in the audio. Prosody refers to the rhythm, intonation, and stress in speech. It's what makes speech flow naturally and sound human-like. Let's say we're using an LLM and it generates a transcript in three parts, with a one-second delay between each part: 1. `Hello, my name is Sonic.` 2. ` It's very nice` 3. ` to meet you.` To generate speech for the whole transcript, we might think to generate speech for each part independently and stitch the audios together. Unfortunately, we end up with speech that has sudden changes in prosody and strange pacing. Now, let's try the same transcripts, but using continuations, with each part extending the previous generation on the same context. This time, the output sounds seamless and natural. You can scale up continuations to any number of inputs. There is no limit. ## Caveat: Streamed inputs should form a valid transcript when joined This means that `"Hello, world!"` can be followed by `" How are you?"` (note the leading space) but not `"How are you?"`, since when joined they form the invalid transcript `"Hello, world!How are you?"`. In practice, this means you should maintain spacing and punctuation in your streamed inputs. **End complete sentences with closing punctuation** (for example `.`, `?`, or `!`). If a streamed chunk does not end with sentence-ending punctuation, the model often treats it as an incomplete sentence. That can cause: * **Extra latency:** Text may stay in the automatic input buffer until the model sees a clearer boundary or until `max_buffer_delay_ms` elapses (**3000ms by default**), so audio starts later than you expect. * **Audio artifacts:** The model expects natural sentence endings; without closing punctuation, the generated audio sometimes ends with odd or distorted sounds. When a user-facing utterance is finished, put terminal punctuation on the final segment (and signal that no more text is coming on the context when appropriate, for example `no_more_inputs()` in the SDK or `continue: false` over the WebSocket). ## Automatic buffering with `max_buffer_delay_ms` When streaming inputs from LLMs word-by-word or token-by-token, we buffer text until it reaches the optimal transcript length for our model. The default buffer delay is 3000ms; if you wish to modify this, you can use the `max_buffer_delay_ms` parameter, though we *do not recommend making this change*.
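To make the word-by-word flow concrete, here is a minimal sketch of streaming small text chunks on a single context over the TTS WebSocket and letting the default buffering batch them. The request fields (`context_id`, `continue`, `model_id`, `voice`, `output_format`) follow the examples on this page; the query-parameter names and the response message fields are assumptions — confirm them against the TTS WebSocket API reference.

```python theme={null}
import asyncio
import json
import os

import websockets  # third-party "websockets" package


async def stream_chunks() -> None:
    # Query parameter names here are assumptions; see the TTS WebSocket reference.
    url = (
        "wss://api.cartesia.ai/tts/websocket"
        f"?api_key={os.environ['CARTESIA_API_KEY']}&cartesia_version=2025-04-16"
    )
    chunks = ["Hello, ", "my name ", "is Sonic. ", "It's very nice ", "to meet you."]

    async with websockets.connect(url) as ws:
        for i, chunk in enumerate(chunks):
            await ws.send(json.dumps({
                "model_id": "sonic-3",
                "transcript": chunk,
                "voice": {"mode": "id", "id": "a0e99841-438c-4a64-b679-ae501e7d6091"},
                "context_id": "my-conversation-123",
                # Keep the context open until the final chunk.
                "continue": i < len(chunks) - 1,
                "output_format": {
                    "container": "raw",
                    "encoding": "pcm_s16le",
                    "sample_rate": 44100,
                },
            }))

        # Read responses for the context. The "done" field below is illustrative;
        # see the API reference for the exact response message schema.
        async for message in ws:
            data = json.loads(message)
            if data.get("done"):
                break


asyncio.run(stream_chunks())
```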
If you plan on using `speed` or `volume` [SSML tags](/build-with-cartesia/sonic-3/ssml-tags) with buffering, make sure decimal values are not split up. Submitting `1.0` as `1`, `.`, `0` will result in unintended failure modes. ### How it works When set, the model will buffer incoming text chunks until it's confident it has enough context to generate high-quality speech, or the buffer delay elapses, whichever comes first. Without this buffer, the model would immediately start generating with each input, which could result in choppy audio or unnatural prosody if inputs are very small (like single words or tokens). ### Configuration * **Range**: Values between 0-5000ms are supported * **Default**: 3000ms Use this *only* if * you have custom buffering client-side, in which case you can set this to 0 * you have choppiness even at 3000ms, in which case you can try a higher value ```js lines theme={null} // Example WebSocket request with `max_buffer_delay_ms` { "model_id": "sonic-3", "transcript": "Hello", // First word/token "voice": { "mode": "id", "id": "a0e99841-438c-4a64-b679-ae501e7d6091" }, "context_id": "my-conversation-123", "continue": true, "max_buffer_delay_ms": 3000 // Buffer up to 3000ms } ``` Let's try the following transcripts with continuations and the default `max_buffer_delay_ms=3000`: `['Hello', 'my name', 'is Sonic.', "It's ", 'very ', 'nice ', 'to ', 'meet ', 'you.']` # Custom Pronunciations Source: https://docs.cartesia.ai/build-with-cartesia/sonic-3/custom-pronunciations Learn how to specify custom pronunciations for words that are hard to get right, like proper nouns or domain-specific terms. All models in the Sonic TTS family support custom pronunciations in your transcripts. Try out the pronunciation tool on our [demo](https://play.cartesia.ai/demos/pronunciation) page. `sonic-3` supports custom pronunciation dictionaries, which make it easier to specify and maintain pronunciations for specific words. At its core, a dictionary is a simple search-and-replace that directs the model to use another string in place of the matched text in the transcript. The pronunciation can either be an [IPA pronunciation](/build-with-cartesia/sonic-3/phonemes), or a "sounds-like" guidance: ```json lines theme={null} [ { "text": "bayou", "pronunciation": "<<ˈ|b|ɑ|ˈ|j|u>>" }, { "text": "jambalaya", "pronunciation": "<<ˈ|dʒ|ə|m|ˈ|b|ə|ˈ|l|aɪ|ˈ|ə>>" }, { "text": "tchoupitoulas", "pronunciation": "chop-uh-TOO-liss" } ] ``` These JSONs can then be saved as pronunciation dictionaries [through our API](https://docs.cartesia.ai/api-reference/pronunciation-dicts/create), or through our [playground](https://play.cartesia.ai/pronunciation). The playground also provides UI affordances for creating and editing dictionaries directly. Once the dictionaries are created, they can be used in any of the TTS APIs by specifying the ID in `pronunciation_dict_id`. With the above dictionary, the string: `I ate some jambalaya on tchoupitoulas street` would become `I ate some <<ˈ|dʒ|ə|m|ˈ|b|ə|ˈ|l|aɪ|ˈ|ə>> on chop-uh-TOO-liss street` before being handed off to the model, which, in turn, does a better job of pronouncing it properly. ## Case Sensitivity Dictionary matching is **case-sensitive**, with one exception: a lowercase entry also matches its sentence-start capitalized form. For example, `cat` matches both `cat` and `Cat`, but not `CAT`. An entry for `CAT` only matches `CAT`. This applies to multi-word entries too.
An entry for `green valley` matches `green valley` and `Green valley`, but not `Green Valley`. **Use lowercase entries for common words.** These match the word both mid-sentence (`cat`) and at the start of a sentence (`Cat`), covering the two most common positions. **Use exact capitalization for proper nouns.** A term like "LaTeX" should be entered as `LaTeX` so it doesn't collide with a different pronunciation for the common word `latex`. For multi-word proper nouns, enter the exact casing as it appears in your transcripts — for example, `Green Valley` if the transcript capitalizes both words. > For the best controllability around pronunciation, we recommend using `sonic-3`. `sonic-2` and `sonic-turbo` use MFA-style IPA for all languages. You can also get custom pronunciations with older Sonic models. The `sonic`, `sonic-2024-12-12`, and `sonic-2024-10-19` models use Sonic-flavored IPA phonemes for English. The `sonic` and `sonic-2024-12-12` models use MFA-style IPA for languages other than English, and the Sonic Preview model uses MFA-style IPA for all languages. Note that `sonic-2024-10-19` does not support custom pronunciations for languages other than English. We will soon be updating all models to use MFA-style IPA. Custom words should be wrapped in double angle brackets `<<` `>>`, with pipe characters `|` between phonemes and no whitespace. For example: * `Can I get <> on that?` (MFA-style IPA) * `Can I get <> on that?` (Sonic-flavored IPA) Each individual word should be wrapped in its own set of angle brackets. # MFA-style IPA ## Constructing Pronunciations We use the IPA phoneset as defined by the [Montreal Forced Aligner](https://montreal-forced-aligner.readthedocs.io/en/latest/). Because of the size and complexity of this phoneset, you may find it easier to construct your custom pronunciation starting from existing words with known phonemizations. We suggest the following workflow for constructing a custom pronunciation for a word: 1. Go to the [MFA pronunciation dictionary index](https://mfa-models.readthedocs.io/en/latest/dictionary/index.html) and find the page corresponding to your language. Make sure the phoneset is MFA, and try to download the latest version (for most languages, v3.0 or v3.1). 1. This page will give you the full range of acceptable phones for your language under the “phones” section. 2. Scroll down to the `Installation` section and click on the `Download from the release page` link. 3. Scroll to the bottom of the release page and download the .dict file; this is a text file mapping words to their constituent phonemes. 1. The first column in the file contains words, and the last column contains space-delimited phonemes. Ignore the other columns. 4. Look up your word or words that sound similar to your intended pronunciation in the dictionary. Use these pronunciations as a starting point to construct your custom pronunciation. Automatic pronunciation suggestions based on audio samples will be added in a future update. Note that MFA-style IPA does not support stress markers. ## Example Suppose I want to generate the text “This is a generation from Cartesia” and the model is not pronouncing “Cartesia” correctly. I would do the following: 1. Go to the [MFA pronunciation dictionary index](https://mfa-models.readthedocs.io/en/latest/dictionary/index.html) and look for English pronunciation dictionaries. I see that for US English, the most recent version is v3.1. 1.
I note that the page says that the acceptable phones for US English are `aj aw b bʲ c cʰ cʷ d dʒ dʲ d̪ ej f fʲ h i iː j k kʰ kʷ l m mʲ m̩ n n̩ ow p pʰ pʲ pʷ s t tʃ tʰ tʲ tʷ t̪ v vʲ w z æ ç ð ŋ ɐ ɑ ɑː ɒ ɒː ɔj ə ɚ ɛ ɝ ɟ ɟʷ ɡ ɡʷ ɪ ɫ ɫ̩ ɱ ɲ ɹ ɾ ɾʲ ɾ̃ ʃ ʉ ʉː ʊ ʎ ʒ ʔ θ` 2. Download the .dict file from the bottom of the [release page](https://github.com/MontrealCorpusTools/mfa-models/releases/tag/dictionary-english_us_mfa-v3.1.0). 3. Find a word in this dictionary that sounds similar to how I want “Cartesia” to be pronounced. I see this entry in the dictionary: `cartesian 0.99 0.14 1.0 1.0 kʰ ɑ ɹ tʲ i ʒ ə n` 4. Ignore the middle four numeric columns. I want to cut off the part of the pronunciation that corresponds to “-an” and replace it with an “uh” sound. I know that the MFA phoneme for “uh” is `ɐ` (if I didn’t know that, I could also look up “uh” in the dictionary). So the pronunciation I want is `kʰ ɑ ɹ tʲ i ʒ ɐ`. 5. Format the phonemes in angle brackets with pipe characters between phonemes and no whitespace. So my transcript is `This is a generation from <>`. # (Deprecated) Sonic-flavored IPA Sonic-flavored IPA is only for `sonic`; users of our latest models (`sonic-2` and `sonic-turbo`) should use MFA-style IPA. Here is a pronunciation guide for Sonic-flavored IPA. It follows the [English phonology article on Wikipedia](https://en.wikipedia.org/wiki/English_phonology) for most phonemes, but in spots where our model requires different notation than you may expect, we've included a blue `<=` in the margins. You can copy/paste some of these uncommon symbols from the original [charts here](https://docs.google.com/spreadsheets/d/1OJbiKtxLyodpNPqVfOu43X2HloLsAixTtFppEuQ_4pI/edit?usp=sharing). ## Stresses and vowel length markers Sonic English requires stress markers for primary (`ˈ`) and secondary (`ˌ`) stressed syllables, which go directly before the vowel. We also use annotations for vowel length (`ː`). The model can also operate without them, but you will have noticeably better robustness and control when using them. # Prompting tips Source: https://docs.cartesia.ai/build-with-cartesia/sonic-3/prompting-tips 1. **Use appropriate punctuation.** Add punctuation where appropriate and at the end of each transcript whenever possible. 2. **Use dates in MM/DD/YYYY form.** For example, 04/20/2023. 3. **Add spaces between time and AM/PM.** For example, `7:00 PM`, `7 PM`, `7:00 P.M.`. 4. **Insert pauses.** To insert pauses, insert "-" or use [break tags](/build-with-cartesia/formatting-text-for-sonic-2/inserting-breaks-pauses) where you need the pause. These tags are considered 1 character and do not need to be separated from adjacent text by a space -- to save credits, you can remove spaces around break tags. 5. **Match the voice to the language.** Each voice has a language that it works best with. You can use the playground to quickly understand which voices are most appropriate for a language. 6. **Stream in inputs for contiguous audio.** Use [continuations](/build-with-cartesia/capability-guides/stream-inputs-using-continuations) if generating audio that should sound contiguous in separate chunks. 7. **Specify [custom pronunciations](/build-with-cartesia/sonic-3/custom-pronunciations) for domain-specific or ambiguous words.** You may want to do this for proper nouns and trademarks, as well as for words that are spelled the same but pronounced differently, like the city of Nice and the adjective "nice."
8. **Force [spelling out numbers and letters](/build-with-cartesia/sonic-3/ssml-tags#spelling-out-numbers-and-letters).** You may want to do this for IDs, email addresses, or numeric values. For sonic-2, see [Formatting Text for Sonic-2](/build-with-cartesia/formatting-text-for-sonic-2/best-practices). # SSML Tags Source: https://docs.cartesia.ai/build-with-cartesia/sonic-3/ssml-tags Tags for volume, speed, and emotion are in beta and subject to change in the future. Sonic-3 supports SSML-like (Speech Synthesis Markup Language) tags to control generated speech. ## Speed Note that if you're streaming token by token, you'll need to buffer the whole value of the speed or volume tags. Passing in `1`, `.`, `0` as separate inputs, for example, will result in reading out the tags. You can guide the speed of a TTS generation with a `speed` tag, which takes a scalar between `0.6` and `1.5`. This value is roughly a multiplier on the default speed. For example, `1.5` will generate audio at roughly 1.5x the default speed. ```xml theme={null} I like to speak quickly because it makes me sound smart. ``` ## Volume You can guide the volume of a TTS generation with a `volume` tag, which is a number between `0.5` and `2.0`. The default volume is `1`. ```xml theme={null} I will speak softly. ``` ## Emotion Beta Emotion control is highly experimental, particularly when emotion shifts occur mid-generation. If you need to change the emotion in a transcript, we recommend using separate generation contexts for each emotion. For best results, use [Voices tagged as "Emotive"](https://play.cartesia.ai/voices?tags=Emotive), as emotions may not work reliably with other Voices. ```xml theme={null} I will not allow you to continue this! I was hoping for a peaceful resolution. ``` ## Pauses and breaks To insert breaks (or pauses) in generated speech, use a `break` tag with one attribute, `time`. For example, `<break time="1s" />`. You can specify the time in seconds (`s`) or milliseconds (`ms`). For accounting purposes, these tags are considered 1 character and do not need to be separated from adjacent text by a space -- to save credits, you can remove spaces around break tags. ```xml theme={null} Hello, my name is Sonic.<break time="1s" />Nice to meet you. ``` ## Spelling out numbers and letters To spell out input text, you can wrap it in `<spell>` tags. This is particularly useful for pronouncing long numbers or identifiers, such as credit card numbers, phone numbers, or unique IDs. ```xml theme={null} My name is Bob, spelled <spell>Bob</spell>, my account number is <spell>ABC-123</spell>, my phone number is <spell>(123) 456-7890</spell>, and my credit card is <spell>1234-5678-9012-3456</spell>. ``` If you want to spell out numbers or identifiers and have planned breaks between the generations (e.g. taking a break between the area code of a phone number and the rest of that number), you can combine `<spell>` and `<break>` tags. These tags are considered 1 character and do not need to be separated from adjacent text by a space -- to save credits, you can remove spaces around break and spell tags. ```xml theme={null} My phone number is (123)4712177 and my credit card number is 12345678 63474537. ``` # Volume, Speed, and Emotion Source: https://docs.cartesia.ai/build-with-cartesia/sonic-3/volume-speed-emotion Sonic-3 provides rich controls for the speed, volume, and emotion of generated speech. These controls are available on play.cartesia.ai using the UI controls, or by passing in a `generation_config` parameter, or by using SSML tags within the transcript itself.
**Sonic-3 interprets these parameters as guidance** instead of as strict adjustments to ensure natural speech, so we recommend testing them against your content to ensure the output matches your expectations. ## Speed and Volume Controls You can guide the speed and volume of a TTS generation with the `generation_config.speed` and `generation_config.volume` parameters. These values are roughly a multiplier on the default speed and volume, eg, `1.5` will generate audio at 1.5x the default speed. The speed of the generation, ranging from `0.6` to `1.5`. The volume of the generation, ranging from `0.5` to `2.0`. You can also specify these inside the transcript itself, using [SSML](/build-with-cartesia/sonic-3/ssml-tags), for example: ```xml lines theme={null} I like to speak quickly because it makes me sound smart. And I can be loud, too! ``` ## Emotion Controls Beta By default, the model attempts to interpret the emotional subtext present in the provided transcript. You can guide the emotion of a TTS generation, like a director providing guidance to an actor, using the `generation_config.emotion` parameter. Emotion tags are good to push the model to be more emotive, but they only work when the emotion is consistent with transcript. For instance, the mismatch below is unlikely to work well: ```xml theme={null} I'm so excited! ``` The emotional guidance for a generation, one of the emotions below. The primary emotions, for which we have the most data and produce the best results, are: `neutral`, `angry`, `excited`, `content`, `sad`, and `scared`. The complete list of available emotions is: `happy`, `excited`, `enthusiastic`, `elated`, `euphoric`, `triumphant`, `amazed`, `surprised`, `flirtatious`, `joking/comedic`, `curious`, `content`, `peaceful`, `serene`, `calm`, `grateful`, `affectionate`, `trust`, `sympathetic`, `anticipation`, `mysterious`, `angry`, `mad`, `outraged`, `frustrated`, `agitated`, `threatened`, `disgusted`, `contempt`, `envious`, `sarcastic`, `ironic`, `sad`, `dejected`, `melancholic`, `disappointed`, `hurt`, `guilty`, `bored`, `tired`, `rejected`, `nostalgic`, `wistful`, `apologetic`, `hesitant`, `insecure`, `confused`, `resigned`, `anxious`, `panicked`, `alarmed`, `scared`, `neutral`, `proud`, `confident`, `distant`, `skeptical`, `contemplative`, `determined`. The Voices with the best emotional response are: * [Leo](https://play.cartesia.ai/voices/0834f3df-e650-4766-a20c-5a93a43aa6e3) (id: `0834f3df-e650-4766-a20c-5a93a43aa6e3`) * [Jace](https://play.cartesia.ai/voices/6776173b-fd72-460d-89b3-d85812ee518d) (id: `6776173b-fd72-460d-89b3-d85812ee518d`) * [Kyle](https://play.cartesia.ai/voices/c961b81c-a935-4c17-bfb3-ba2239de8c2f) (id: `c961b81c-a935-4c17-bfb3-ba2239de8c2f`) * [Gavin](https://play.cartesia.ai/voices/f4a3a8e4-694c-4c45-9ca0-27caf97901b5) (id: `f4a3a8e4-694c-4c45-9ca0-27caf97901b5`) * [Maya](https://play.cartesia.ai/voices/cbaf8084-f009-4838-a096-07ee2e6612b1) (id: `cbaf8084-f009-4838-a096-07ee2e6612b1`) * [Tessa](https://play.cartesia.ai/voices/6ccbfb76-1fc6-48f7-b71d-91ac6298247b) (id: `6ccbfb76-1fc6-48f7-b71d-91ac6298247b`) * [Dana](https://play.cartesia.ai/voices/cc00e582-ed66-4004-8336-0175b85c85f6) (id: `cc00e582-ed66-4004-8336-0175b85c85f6`) * [Marian](https://play.cartesia.ai/voices/26403c37-80c1-4a1a-8692-540551ca2ae5) (id: `26403c37-80c1-4a1a-8692-540551ca2ae5`) View the full list of emotive Voices on our [Voice Library with voices tagged 'Emotive'](https://play.cartesia.ai/voices?tags=Emotive). 
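As an illustration, here is a minimal sketch of passing `generation_config` on a TTS request body, using the `speed`, `volume`, and `emotion` fields described above. The surrounding request fields mirror the SDK example later in these docs; the exact placement of `generation_config` and the version header value are assumptions — confirm them against the TTS API reference.

```python theme={null}
import os

import requests

# Request body sketch for POST /tts/bytes. The generation_config values use the
# documented ranges (speed 0.6-1.5, volume 0.5-2.0, emotion from the list above);
# its top-level placement here is an assumption.
payload = {
    "model_id": "sonic-3",
    "transcript": "I will not allow you to continue this!",
    "voice": {"mode": "id", "id": "6ccbfb76-1fc6-48f7-b71d-91ac6298247b"},  # Tessa
    "language": "en",
    "output_format": {"container": "wav", "encoding": "pcm_f32le", "sample_rate": 44100},
    "generation_config": {"speed": 1.1, "volume": 1.2, "emotion": "angry"},
}

res = requests.post(
    "https://api.cartesia.ai/tts/bytes",
    headers={
        # Version value copied from an earlier example; a newer version may be
        # required for Sonic 3 features.
        "Cartesia-Version": "2025-04-16",
        "Authorization": f"Bearer {os.environ['CARTESIA_API_KEY']}",
    },
    json=payload,
)
res.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(res.content)
```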
You can also use [SSML](/build-with-cartesia/sonic-3/ssml-tags) tags for emotions, for example: ```xml theme={null} How dare you speak to me like I'm just a robot! ``` ## Nonverbalisms Insert `[laughter]`in your transcript to make the model laugh. In the future we plan to add more non-speech verbalisms like sighs and coughs. # STT Models Source: https://docs.cartesia.ai/build-with-cartesia/stt-models Ink is a new family of streaming speech-to-text (STT) models for developers building real-time voice applications. * the latest **stable** snapshot of the model To use the stable version of the model, we recommend using the base model name (e.g. `ink-whisper`). In many cases the stable and preview snapshots are the same, but in some cases the preview snapshot may have additional features or improvements. ## `ink-whisper` Ink Whisper is the fastest, most affordable speech-to-text model — engineered for enterprise deployment in production-grade voice agents. It delivers higher accuracy than baseline Whisper and is optimized for real-time performance in a wide variety of real-world conditions. Additional Capabilities: * Handles variable-length audio chunks and interruptions gracefully using dynamic chunking. * Reliably transcribes speech with background noise. * Accurately transcribes audio with telephony artifacts, accents, and disfluencies. * Excels at transcribing proper nouns and domain-specific terminology. | Snapshot | Release Date | Languages | Status | | ------------------------------------ | ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------ | | `ink-whisper` | June 10, 2025 | `en`, `zh`, `de`, `es`, `ru`, `ko`, `fr`, `ja`, `pt`, `tr`, `pl`, `ca`, `nl`, `ar`, `sv`, `it`, `id`, `hi`, `fi`, `vi`, `he`, `uk`, `el`, `ms`, `cs`, `ro`, `da`, `hu`, `ta`, `no`, `th`, `ur`, `hr`, `bg`, `lt`, `la`, `mi`, `ml`, `cy`, `sk`, `te`, `fa`, `lv`, `bn`, `sr`, `az`, `sl`, `kn`, `et`, `mk`, `br`, `eu`, `is`, `hy`, `ne`, `mn`, `bs`, `kk`, `sq`, `sw`, `gl`, `mr`, `pa`, `si`, `km`, `sn`, `yo`, `so`, `af`, `oc`, `ka`, `be`, `tg`, `sd`, `gu`, `am`, `yi`, `lo`, `uz`, `fo`, `ht`, `ps`, `tk`, `nn`, `mt`, `sa`, `lb`, `my`, `bo`, `tl`, `mg`, `as`, `tt`, `haw`, `ln`, `ha`, `ba`, `jw`, `su`, `yue` | Stable | | `ink-whisper-2025-06-04` | June 4, 2025 | `en`, `zh`, `de`, `es`, `ru`, `ko`, `fr`, `ja`, `pt`, `tr`, `pl`, `ca`, `nl`, `ar`, `sv`, `it`, `id`, `hi`, `fi`, `vi`, `he`, `uk`, `el`, `ms`, `cs`, `ro`, `da`, `hu`, `ta`, `no`, `th`, `ur`, `hr`, `bg`, `lt`, `la`, `mi`, `ml`, `cy`, `sk`, `te`, `fa`, `lv`, `bn`, `sr`, `az`, `sl`, `kn`, `et`, `mk`, `br`, `eu`, `is`, `hy`, `ne`, `mn`, `bs`, `kk`, `sq`, `sw`, `gl`, `mr`, `pa`, `si`, `km`, `sn`, `yo`, `so`, `af`, `oc`, `ka`, `be`, `tg`, `sd`, `gu`, `am`, `yi`, `lo`, `uz`, `fo`, `ht`, `ps`, `tk`, `nn`, `mt`, `sa`, `lb`, `my`, `bo`, `tl`, `mg`, `as`, `tt`, `haw`, `ln`, `ha`, `ba`, `jw`, `su`, `yue` | Stable | To learn how to use the Ink STT family, see [the Speech-to-Text API Reference](/api-reference/stt/stt). 
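For a quick start with batch transcription, here is a minimal sketch of uploading an audio file to `POST /stt` with Python's `requests`. The multipart field names (`file`, `model`, `language`, and the `timestamp_granularities` encoding) and the response keys are assumptions based on the endpoint description — check the [Speech-to-Text (Batch) reference](/api-reference/stt/transcribe) for the exact parameters.

```python theme={null}
import os

import requests

# Batch transcription sketch: upload an audio file and print the transcript.
# Field names and response keys below are illustrative assumptions, not a confirmed schema.
with open("meeting.wav", "rb") as audio:
    res = requests.post(
        "https://api.cartesia.ai/stt",
        headers={
            "Cartesia-Version": "2025-04-16",  # version header as used elsewhere in these docs
            "Authorization": f"Bearer {os.environ['CARTESIA_API_KEY']}",
        },
        files={"file": audio},
        data={
            "model": "ink-whisper",
            "language": "en",
            # Word-level timestamps, as described in the batch STT docs;
            # the array-style form encoding here is an assumption.
            "timestamp_granularities[]": "word",
        },
    )
res.raise_for_status()
result = res.json()
print(result["text"], result["duration"], result["language"])
```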
For a detailed mapping of codes to languages, see the [STT supported languages](/api-reference/stt/stt#request.query.language) list. ## Selecting a Model When making API calls, you can specify either: ```python lines theme={null}
# Use the base model (automatically routes to the latest snapshot)
model = "ink-whisper"

# Or specify a particular snapshot for consistency
model = "ink-whisper-2025-06-04"
``` ### Continuous updates All models have a base model name (e.g. `ink-whisper`). We recommend using these for prototyping and development, then switching to a date-versioned model for production use cases to ensure stability. ## Future Updates New snapshots are released periodically with improvements in performance, additional language support, and new capabilities. Check back regularly for updates. # API Changes Source: https://docs.cartesia.ai/build-with-cartesia/tts-models/api-changes Starting June 1, 2026, we are discontinuing several models, snapshots, and languages, and removing voice embeddings from our voice API. Migrate to `sonic-3` for improved naturalness, 42-language support, and fine-grained controls. ## Deprecated models and languages You can check if you're making requests to deprecated models on [play.cartesia.ai/deprecation/traffic](https://play.cartesia.ai/deprecation/traffic). ### Fully deprecated models These models will stop serving requests on June 1, 2026. | Model | Snapshots affected | Deprecated languages | | -------------------- | ------------------------ | -------------------- | | `sonic` | All | All | | `sonic-english` | — | All | | `sonic-multilingual` | — | All | | `sonic-2` | `sonic-2-2025-03-07` | All | | `sonic-turbo` | `sonic-turbo-2025-03-07` | All | ### Partially deprecated models These models will continue to serve a reduced set of languages. The languages listed below will be discontinued on June 1, 2026. | Model | Snapshots affected | Deprecated languages | | ------------- | ---------------------------------------------------------------- | -------------------------- | | `sonic-2` | `sonic-2-2025-04-16`, `sonic-2-2025-05-08`, `sonic-2-2025-06-11` | it, nl, pl, ru, sv, tr, hi | | `sonic-turbo` | `sonic-turbo-2025-06-04` | it, nl, pl, ru, sv, tr | ## Stable offerings The following will remain available beyond June 1. | Model | Snapshots | Supported Languages | | ------------- | ---------------------------------------------------------------- | ----------------------------------------------------------------------------------- | | `sonic-3` | All | 42 languages — [full list](/build-with-cartesia/tts-models/latest#language-support) | | `sonic-2` | `sonic-2-2025-04-16`, `sonic-2-2025-05-08`, `sonic-2-2025-06-11` | en, de, es, fr, ja, ko, pt, zh | | `sonic-turbo` | `sonic-turbo-2025-06-04` | en, de, es, fr, ja, ko, pt, zh, hi | ## API changes These endpoints will be discontinued on June 1, 2026. | Discontinued Endpoint | Replacement | | ------------------------------------------ | ------------------------------------------ | | Voice Embedding: `POST /voices/clone/clip` | [Clone Voice](/api-reference/voices/clone) | | Mix Voices: `POST /voices/mix` | — | | Create Voice: `POST /voices` | [Clone Voice](/api-reference/voices/clone) | These endpoints will stop accepting voice embeddings on June 1, 2026.
| Endpoint with a breaking change | Replacement | | ------------------------------------- | ------------------------------------------------------ | | TTS (bytes): `POST /tts/bytes` | [Voice IDs](/build-with-cartesia/tts-models/voice-ids) | | TTS (SSE): `POST /tts/sse` | [Voice IDs](/build-with-cartesia/tts-models/voice-ids) | | TTS (WebSocket): `WSS /tts/websocket` | [Voice IDs](/build-with-cartesia/tts-models/voice-ids) | You can test these API changes by setting your [Cartesia Version](/use-the-api/api-conventions#always-send-a-cartesia-version-header) to `2026-03-01`. We recommend upgrading your Cartesia Version on production traffic before June 1 to make sure nothing breaks. ### Moving off of deprecated endpoints 1. Change how you create voices — see [Migrating Voices](/build-with-cartesia/tts-models/migrating-voices). 2. Switch from voice embeddings to IDs — see [Voice IDs](/build-with-cartesia/tts-models/voice-ids). ## Full Checklist 1. Move off of [deprecated models / snapshots / languages](/build-with-cartesia/tts-models/api-changes#deprecated-models-and-languages) onto `sonic-3` or another stable model 2. Move off of [deprecated endpoints](/build-with-cartesia/tts-models/api-changes#api-changes) when creating voices 3. Use [Voice IDs](/build-with-cartesia/tts-models/voice-ids) 4. Check your deprecated model traffic on [play.cartesia.ai/deprecation/traffic](https://play.cartesia.ai/deprecation/traffic) 5. Make sure your voices are migrated on [play.cartesia.ai/deprecation/voices](https://play.cartesia.ai/deprecation/voices) 6. (Optional) Upgrade your [Cartesia Version](/use-the-api/api-conventions#always-send-a-cartesia-version-header) to `2026-03-01` ## Why are we doing this? Since the launch of Sonic 3, we've made improvements across pacing, prosody, and naturalness; the vast majority of our customers have migrated to these models with great success. In order to increase our capacity, availability, and serving performance, we have to discontinue our oldest models. Additionally, our newer models have evolved beyond voice embeddings in order to sound more natural. The parts of our API that accept voice embeddings cannot be made forward-compatible. Migrating to voice IDs will allow us to continue to improve both our models and your voices in tandem. If you have questions, reach out to [support@cartesia.ai](mailto:support@cartesia.ai). # Migrating Voices Source: https://docs.cartesia.ai/build-with-cartesia/tts-models/migrating-voices On June 1, 2026, we are discontinuing our voice embedding (aka stability) TTS models. Voices listed on [play.cartesia.ai/deprecation/voices](https://play.cartesia.ai/deprecation/voices) will stop working. Simply click "Auto Migrate" to make these voices compatible with the latest Sonic 3, 2, and Turbo snapshots. If you use voice embeddings rather than voice IDs, see [Voice IDs](/build-with-cartesia/tts-models/voice-ids). For an overview of all changes, see [API Changes](/build-with-cartesia/tts-models/api-changes). ## Where do these voices come from? Voices created by these endpoints rely on our voice embedding models: * [POST /voices](/2024-06-10/api-reference/voices/create) * [POST /voices/mix](/2024-06-10/api-reference/voices/mix) * `POST /voices/clone/clip` ## Creating voices You can move to our [Clone Voice API](/api-reference/voices/clone) or use our [web UI](https://play.cartesia.ai/voices/create/clone) to create voices from 3–10 seconds of source audio. 
You can test these API changes by setting your [Cartesia Version](/use-the-api/api-conventions#always-send-a-cartesia-version-header) to `2026-03-01`. We recommend upgrading your Cartesia Version on production traffic before June 1 to make sure nothing breaks.

Here is an example using the Cartesia SDK:

```python theme={null}
from cartesia import Cartesia

your_api_key: str = ""

client = Cartesia(api_key=your_api_key)

print("Cloning a voice")
with open("3 to 10 seconds of source audio.wav", mode="rb") as f:
    voice = client.voices.clone(
        clip=f,
        language="en",  # this must match the source audio
        name="My Voice",
        mode="similarity",
    )
print(f"Cloned voice {voice.id}")

print("Generating audio...")
generated_audio = client.tts.bytes(
    # voice embeddings will not work after June 1, 2026!
    voice={"mode": "id", "id": voice.id},
    model_id="sonic-3",
    transcript="Hello from Cartesia!",
    language="en",
    output_format={
        "container": "wav",
        "encoding": "pcm_f32le",
        "sample_rate": 44100
    },
)
```

# Older TTS Models

Source: https://docs.cartesia.ai/build-with-cartesia/tts-models/older-models

We recommend using [Sonic 3](/build-with-cartesia/tts-models/latest) for best results, most languages, and controllability. We continue to serve these older models for compatibility. Some models and snapshots are being discontinued on June 1, 2026 — see [API Changes](/build-with-cartesia/tts-models/api-changes) for details.

In the tables below, **Stable** marks the latest stable snapshot of a model, and **EOL June 1, 2026** marks snapshots and languages to be discontinued on June 1, 2026.

All models have a base model name (e.g. `sonic-2`, `sonic-turbo`) and date-versioned model names (e.g. `sonic-2-2025-06-11`). We recommend using base model names for prototyping and development, then switching to a date-versioned model for production use cases to ensure stability. When making API calls, you can specify either:

```python lines theme={null}
# Use the base model
# (automatically routes to the latest stable snapshot)
model_id = "sonic-2"

# Or specify a particular snapshot for consistency
model_id = "sonic-2-2025-06-11"
```

## `sonic-2`

Sonic-2 provides ultra-realistic speech with accurate transcript following, minimal hallucinations, and excellent voice cloning. It's latency optimized and achieves 90ms model latency.
Additional Capabilities: * Higher fidelity voice cloning * Timestamps for all 15 languages * [Infill](/2024-11-13/api-reference/infill/bytes) support | Snapshot | Release Date | Languages | Status | | ------------------------------------------- | -------------- | ---------------------------------------------------------- | ---------------- | | `sonic-2-2025-06-11` | June 11, 2025 | en, fr, de, es, pt, zh, ja, ko | Stable | | `sonic-2-2025-06-11` | June 11, 2025 | hi, it, nl, pl, ru, sv, tr | EOL June 1, 2026 | | `sonic-2-2025-05-08` | May 8, 2025 | en, fr, de, es, pt, zh, ja, ko | Stable | | `sonic-2-2025-05-08` | May 8, 2025 | hi, it, nl, pl, ru, sv, tr | EOL June 1, 2026 | | `sonic-2-2025-04-16` | April 16, 2025 | en, fr, de, es, pt, zh, ja, ko | Stable | | `sonic-2-2025-04-16` | April 16, 2025 | hi, it, nl, pl, ru, sv, tr | EOL June 1, 2026 | | `sonic-2-2025-03-07` | March 7, 2025 | en, fr, de, es, pt, zh, ja, hi, it, ko, nl, pl, ru, sv, tr | EOL June 1, 2026 | Read these pages to learn more about how to use Sonic-2: * [Best practices](/build-with-cartesia/formatting-text-for-sonic-2/best-practices) * [Inserting breaks](/build-with-cartesia/formatting-text-for-sonic-2/inserting-breaks-pauses) * [Spelling text](/build-with-cartesia/formatting-text-for-sonic-2/spelling-out-input-text) ## `sonic-turbo` All the power of Sonic, with half the latency (as low as 40ms). | Snapshot | Release Date | Languages | Status | | ----------------------------------------------- | ------------- | ---------------------------------------------------------- | ---------------- | | `sonic-turbo-2025-06-04` | June 6, 2025 | en, fr, de, es, pt, zh, ja, hi, ko | Stable | | `sonic-turbo-2025-06-04` | June 6, 2025 | it, nl, pl, ru, sv, tr | EOL June 1, 2026 | | `sonic-turbo-2025-03-07` | March 7, 2025 | en, fr, de, es, pt, zh, ja, hi, it, ko, nl, pl, ru, sv, tr | EOL June 1, 2026 | ## `sonic` The first version of our flagship text-to-speech model. It produces high-accuracy, expressive speech, and is optimized for efficiency to achieve low latency. | Snapshot | Release Date | Languages | Status | | ----------------------------------------- | ----------------- | ---------------------------------------------------------- | ---------------- | | `sonic-2024-12-12` | December 12, 2024 | en, fr, de, es, pt, zh, ja, hi, it, ko, nl, pl, ru, sv, tr | EOL June 1, 2026 | | `sonic-2024-10-19` | October 19, 2024 | en, fr, de, es, pt, zh, ja, hi, it, ko, nl, pl, ru, sv, tr | EOL June 1, 2026 | ## Deprecated and Preview Model Aliases The following model aliases are now deprecated. Please use the recommended model names instead: | Deprecated Alias | Use Instead | | ------------------------------------------- | ----------------------------------------- | | `sonic-3-preview` | `sonic-3` | | `sonic-preview` | `sonic-2` | | `sonic-english` | `sonic-2024-10-19` | | `sonic-multilingual` | `sonic-2024-10-19` | # Sonic 3 Source: https://docs.cartesia.ai/build-with-cartesia/tts-models/sonic-3 `sonic-3` is our streaming TTS model, with high naturalness, accurate transcript following, and industry-leading latency. It provides fine-grained control on volume, speed, and emotion. Key Features: * **42 languages** supported * **Volume, speed, and emotion** controls, supported through API parameters and SSML tags * **Laughter** through `[laughter]` tags For more information, see [Volume, Speed, and Emotion](/build-with-cartesia/sonic-3/volume-speed-emotion). 
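For a concrete sense of how the `[laughter]` tag is used in a Sonic 3 request, here is a minimal sketch with the Python SDK. The API key and voice ID are placeholders, and the transcript is purely illustrative.

```python theme={null}
from cartesia import Cartesia

# Minimal sketch: a Sonic 3 generation with an inline [laughter] tag.
# Placeholders: substitute your own API key and any voice ID from the library.
client = Cartesia(api_key="YOUR_API_KEY")

audio = client.tts.bytes(
    model_id="sonic-3",
    voice={"mode": "id", "id": "YOUR_VOICE_ID"},
    transcript="That joke was great! [laughter] Tell me another one.",
    language="en",
    output_format={
        "container": "wav",
        "encoding": "pcm_f32le",
        "sample_rate": 44100,
    },
)

# tts.bytes streams the audio back as chunks of bytes.
with open("sonic3-laughter.wav", "wb") as f:
    for chunk in audio:
        f.write(chunk)
```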
### Voice selection Choosing voices that work best for your use case is key to getting the best performance out of Sonic 3. * **For voice agents**: We've found stable, realistic voices work better for voice agents than studio, emotive voices. Example American English voices include Katie (ID: `f786b574-daa5-4673-aa0c-cbe3e8534c02`) and Kiefer (ID: `228fca29-3a0a-435c-8728-5cb483251068`). * **For expressive characters**: We've tagged our most expressive and emotive voices with the `Emotive` tag. Example American English voices include Tessa (ID: `6ccbfb76-1fc6-48f7-b71d-91ac6298247b`) and Kyle (ID: `c961b81c-a935-4c17-bfb3-ba2239de8c2f`). For more information and recommendations, see [Choosing a Voice](/build-with-cartesia/capability-guides/choosing-a-voice). ### Language support Sonic-3 supports the following languages:
English (`en`), French (`fr`), German (`de`), Spanish (`es`), Portuguese (`pt`), Chinese (`zh`), Japanese (`ja`), Hindi (`hi`), Italian (`it`), Korean (`ko`), Dutch (`nl`), Polish (`pl`), Russian (`ru`), Swedish (`sv`), Turkish (`tr`), Tagalog (`tl`), Bulgarian (`bg`), Romanian (`ro`), Arabic (`ar`), Czech (`cs`), Greek (`el`), Finnish (`fi`), Croatian (`hr`), Malay (`ms`), Slovak (`sk`), Danish (`da`), Tamil (`ta`), Ukrainian (`uk`), Hungarian (`hu`), Norwegian (`no`), Vietnamese (`vi`), Bengali (`bn`), Thai (`th`), Hebrew (`he`), Georgian (`ka`), Indonesian (`id`), Telugu (`te`), Gujarati (`gu`), Kannada (`kn`), Malayalam (`ml`), Marathi (`mr`), and Punjabi (`pa`).
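As a quick illustration of selecting one of these languages, the sketch below requests Hindi (`hi`) output with the Python SDK. The API key and voice ID are placeholders; pick a voice suited to the target language.

```python theme={null}
from cartesia import Cartesia

client = Cartesia(api_key="YOUR_API_KEY")

# Sketch only: pass the language code that matches your transcript.
audio = client.tts.bytes(
    model_id="sonic-3",
    voice={"mode": "id", "id": "YOUR_VOICE_ID"},  # placeholder; use a Hindi-capable voice
    transcript="नमस्ते! आपसे मिलकर खुशी हुई।",
    language="hi",
    output_format={"container": "wav", "encoding": "pcm_f32le", "sample_rate": 44100},
)
```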
## Selecting a Model | Snapshot | Release Date | Languages | Status | | ------------------------------------------- | ---------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------ | | `sonic-3-2026-01-12` | January 12, 2026 | en, de, es, fr, ja, pt, zh, hi, ko, it, nl, pl, ru, sv, tr, tl, bg, ro, ar, cs, el, fi, hr, ms, sk, da, ta, uk, hu, no, vi, bn, th, he, ka, id, te, gu, kn, ml, mr, pa | Stable | | `sonic-3-2025-10-27` | October 27, 2025 | en, de, es, fr, ja, pt, zh, hi, ko, it, nl, pl, ru, sv, tr, tl, bg, ro, ar, cs, el, fi, hr, ms, sk, da, ta, uk, hu, no, vi, bn, th, he, ka, id, te, gu, kn, ml, mr, pa | Stable | the latest **stable** snapshot of the model When making API calls, you can specify either: ```python lines theme={null} # Use the base model # (automatically routes to the latest stable snapshot) model_id = "sonic-3" # Or specify a particular snapshot for consistency model_id = "sonic-3-2026-01-12" # Try the latest (beta) model (can be 'hot swapped') model_id = "sonic-3-latest" ``` ### Continuous updates and model snapshots All models have a base model name (e.g. `sonic-3`) and a dated snapshot (e.g. `sonic-3-2025-10-27`). Using the base model will automatically keep you up to date with the most recent stable snapshot of that model. If pinning a specific version is important for your use case, we recommend using the dated version. For testing our latest capabilities, we recommend using `sonic-3-latest`, which is a non-snapshotted version. `sonic-3-latest` can be updated with no notice, and not recommended for production. To summarize: | **Model ID** | Model update behavior | Recommended for | | -------------------- | :---------------------------------------------------------- | ------------------------------------------------------------------------------------------ | | `sonic-3-YYYY-MM-DD` | Snapshotted, will never change | Customers who want to run internal evals before any updates | | `sonic-3` | Will be updated to point to the most recent stable snapshot | Customers who want stable releases, but want to be up-to-date with the recent capabilities | | `sonic-3-latest` | Will always be updated to our latest beta releases | Testing purposes | ## Older Models For information on `sonic-2`, `sonic-turbo`, `sonic-multilingual`, and `sonic`, see our page on [Older Models](/build-with-cartesia/tts-models/older-models). # Voice IDs Source: https://docs.cartesia.ai/build-with-cartesia/tts-models/voice-ids On June 1, 2026, we are discontinuing our voice embedding (aka stability) TTS models. If you are currently making generation requests with voice embeddings like this: ```json theme={null} { "voice": { "mode": "embedding", "embedding": [1, 2, ..., 3, 4] }, "model_id": "sonic-2", // ... } ``` You will need to switch to using voice IDs like this: ```json theme={null} { "voice": { "mode": "id", "id": "e07c00bc-4134-4eae-9ea4-1a55fb45746b" }, "model_id": "sonic-2", // ... } ``` If you already use voice IDs, see [Migrating Voices](/build-with-cartesia/tts-models/migrating-voices) to make sure your voices will continue to work after the change. For an overview of all changes, see [API Changes](/build-with-cartesia/tts-models/api-changes). ## Get a voice ID Choose one of the following options. ### Check out the voice library Our featured voices have all gone through rigorous evaluations and are ready to use in production. 
Check them out at [play.cartesia.ai/voices](https://play.cartesia.ai/voices) and copy the ID of any voice you'd like to use. ### Clone a voice If you have source audio, create a cloned voice via the [playground](https://play.cartesia.ai/voices/create/clone) or the [API](/api-reference/voices/clone). Cloning returns a voice ID you can use immediately. ### Generate source audio from your existing embedding If you no longer have the original audio clip used to create your embedding, generate a short sample with `sonic` or `sonic-2` and then clone a new voice. You can do this on our playground: 1. [play.cartesia.ai/text-to-speech](https://play.cartesia.ai/text-to-speech) 2. [play.cartesia.ai/voices/create/clone](https://play.cartesia.ai/voices/create/clone) Or with our API: 1. [Text to Speech (Bytes)](/2024-11-13/api-reference/tts/bytes) 2. [Clone Voice](/api-reference/voices/clone) Here is an example using our SDK: ```python theme={null} from cartesia import Cartesia # inputs your_api_key: str = "" your_voice_embedding: list[float] = [] language = "en" transcript = """ It's nice to meet you. Hope you're having a great day! Could we reschedule our meeting tomorrow? Please call me back as soon as possible. """ source_tts_model_id = "sonic" client = Cartesia(api_key=your_api_key) # Step 1: generate an audio sample print(f"Generating audio sample {source_tts_model_id=}") source_audio_iterator = client.tts.bytes( voice={"mode": "embedding", "embedding": your_voice_embedding}, model_id=source_tts_model_id, transcript=transcript, language=language, output_format={ "container": "wav", "encoding": "pcm_f32le", "sample_rate": 44100 }, ) # Step 2: clone a voice print("Cloning a voice") voice = client.voices.clone( name="My Voice", language=language, clip=b"".join(source_audio_iterator), mode="similarity", ) print(f"Cloned voice {voice.id}") # you can now use the voice like this migrate_to_model = "sonic-3" generated_sample_file_name = f"{migrate_to_model}_{voice.id}.wav" cloned_audio_iterator = client.tts.bytes( voice={"mode": "id", "id": voice.id}, model_id=migrate_to_model, transcript=transcript, language=language, output_format={ "container": "wav", "encoding": "pcm_f32le", "sample_rate": 44100 }, ) with open(generated_sample_file_name, "wb") as f: for chunk in cloned_audio_iterator: f.write(chunk) print(f"Listen to your new voice: {generated_sample_file_name}") try: import subprocess subprocess.run( [ "ffplay", "-loglevel", "quiet", "-autoexit", "-nodisp", generated_sample_file_name, ] ) except FileNotFoundError: pass ``` ## Using Voice IDs See [TTS (Bytes)](/api-reference/tts/bytes), [TTS (SSE)](/api-reference/tts/sse), and [TTS (WebSocket)](/api-reference/tts/websocket) for full API documentation. You can test these API changes by setting your [Cartesia Version](/use-the-api/api-conventions#always-send-a-cartesia-version-header) to `2026-03-01`. We recommend upgrading your Cartesia Version on production traffic before June 1 to make sure nothing breaks. # Set up an organization Source: https://docs.cartesia.ai/enterprise/set-up-an-organization Organization workspaces enable seamless collaboration between multiple team members. All users in an organization share the same view of resources, including voices, API keys, and datasets. The only exceptions are playground generation history and starred voices, which remain private to each individual user. By default, your Cartesia account initializes as an organization workspace on the Free subscription plan with a limit of one member. 
To invite team members, you must first upgrade your organization to the Startup tier or higher. After upgrading, you can invite unlimited users at no additional cost.

## Manage your organization

Organizations must be upgraded to the Startup tier or above before team members can be invited. Each workspace has its own billing and credit limits, so make sure you are on the intended organization before proceeding to upgrade your subscription.

Upgrade organization

Once you've upgraded your organization, you can use the "Manage" button in the workspace switcher to manage it:

Organization manage button in switcher

This pops up a modal where you can change your profile and invite your team:

Organization manager modal

There are two membership types in an organization:

1. Admin: has the ability to manage the organization profile, invitations, and members.
2. Member: can use all functionality included in the subscription, but cannot alter organization settings.

Organization membership types

You can invite unlimited team members to an organization once it is on the Startup tier or higher. Once your organization is upgraded, voices, Line agents, API keys, and other resources will be available to all users in the organization.

## Create additional organizations

If you want separate workspaces on different subscriptions, you can create another organization by going to the playground at [https://play.cartesia.ai](https://play.cartesia.ai), selecting the workspace switcher, and clicking **Create organization**.

Create organization

This will bring up a dialog where you can name your organization and upload a logo.

Organization creation dialog

Please reach out to us at [support@cartesia.ai](mailto:support@cartesia.ai) if you run into any trouble with your organization.

# Set up SSO

Source: https://docs.cartesia.ai/enterprise/set-up-sso

We support Single Sign-On (SSO) for customers on the Enterprise plan via SAML. This integration is processed through our identity provider, [Clerk](https://clerk.com).

## Set up SSO with Okta

1. Send us your SSO domain.
2. We will send you a service provider configuration, which consists of a single sign-on URL and an audience URI (SP entity ID).
3. Follow steps 2, 3, 4, and 5 in [the Clerk SSO guide](https://clerk.com/docs/authentication/enterprise-connections/saml/okta), and send us the metadata URL you get from step 6.1.

After you are done, we will complete the remaining SSO setup and send you a confirmation that SSO is enabled for your organization.

# Authenticate your client applications

Source: https://docs.cartesia.ai/get-started/authenticate-your-client-applications

Secure client access to Cartesia APIs using Access Tokens

You may want to make Cartesia API requests directly from your client application, e.g., a web app. However, shipping your API key to the app is not secure: a malicious user could extract your API key and issue API requests billed to your account. Access Tokens provide a secure way to authenticate client-side requests to Cartesia's APIs without exposing your API key.

Access Tokens are intended for contexts like web apps that should not be bundled with an API key. For trusted contexts like server applications, local scripts, or IPython notebooks, you should simply use API keys.

## Prerequisites

Before implementing Access Tokens:

1. Configure your server with a Cartesia API key
2. Implement user authentication in your application
3. Establish secure client-server communication

### Available Grants

Access Tokens support granular permissions through grants.
Both TTS and STT grants are optional: **TTS Grant**: With `grants: { tts: true }`, clients have access to: * `/tts/bytes` - Synchronous TTS generation streamed with chunked encoding * `/tts/sse` - Server-sent events for streaming * `/tts/websocket` - WebSocket-based streaming **STT Grant**: With `grants: { stt: true }`, clients have access to: * `/stt/websocket` - WebSocket-based speech-to-text streaming * `/stt` - Batch speech-to-text processing * `/audio/transcriptions` - OpenAI-compatible transcription endpoint **Agents Grant**: With `grants: { agent: true }`, clients have access to: * the Agents websocket calling endpoint You can request multiple grants in a single token: ```json theme={null} grants: { tts: true, stt: true, agent: false } ``` ## Implementation Guide ### 1. Token Generation (Server-side) Make a request to generate a new access token: ```bash cURL lines theme={null} # TTS and STT access curl --location 'https://api.cartesia.ai/access-token' \ -H 'Cartesia-Version: 2025-04-16' \ -H 'Content-Type: application/json' \ -H 'Authorization: Bearer sk_car_...' \ -d '{ "grants": {"tts": true, "stt": true}, "expires_in": 60}' # TTS-only access curl --location 'https://api.cartesia.ai/access-token' \ -H 'Cartesia-Version: 2025-04-16' \ -H 'Content-Type: application/json' \ -H 'Authorization: Bearer sk_car_...' \ -d '{ "grants": {"tts": true}, "expires_in": 60}' ``` ```javascript JavaScript lines theme={null} import { CartesiaClient } from "@cartesia/cartesia-js"; const client = new CartesiaClient({ apiKey: "YOUR_API_KEY" }); // TTS and STT access await client.auth.accessToken({ grants: { tts: true, stt: true }, expires_in: 60 }); // TTS-only access await client.auth.accessToken({ grants: { tts: true }, expires_in: 60 }); ``` ```python Python lines theme={null} from cartesia import Cartesia client = Cartesia( token="YOUR_API_KEY" ) # TTS and STT access response = client.auth.access_token( grants={"tts": True, "stt": True}, # Grant both permissions expires_in=60 # Token expires in 60 seconds ) # TTS-only access response = client.auth.access_token( grants={"tts": True}, # Grant TTS permissions only expires_in=60 # Token expires in 60 seconds ) # The response will contain the access token print(f"Access Token: {response.token}") ``` #### Example Implementation ```typescript lines theme={null} // TTS and STT access const response = await fetch("https://api.cartesia.ai/access-token", { method: "POST", headers: { "Content-Type": "application/json", "Cartesia-Version": "2025-04-16", Authorization: "Bearer ", }, body: JSON.stringify({ grants: { tts: true, stt: true }, expires_in: 60, // 1 minute }), }); // TTS-only access const responseTTS = await fetch("https://api.cartesia.ai/access-token", { method: "POST", headers: { "Content-Type": "application/json", "Cartesia-Version": "2025-04-16", Authorization: "Bearer ", }, body: JSON.stringify({ grants: { tts: true }, expires_in: 60, // 1 minute }), }); const { token } = await response.json(); ``` For detailed API specifications, see the [Token API Reference](/api-reference/auth/access-token). ### 2. Token Storage (Client-side) Store the token securely, such as setting HTTP-only cookie with matching token expiration. The cookie should be `httpOnly`, `secure`, and `sameSite: "strict"`. ### 3. 
Making Authenticated Requests

```typescript lines theme={null}
// Using TTS with access token
const ttsResponse = await fetch("https://api.cartesia.ai/tts/bytes", {
  headers: {
    Authorization: `Bearer ${accessToken}`,
    "Content-Type": "application/json",
  },
  // ... request configuration
});

// Using STT with access token
const sttResponse = await fetch("https://api.cartesia.ai/stt", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${accessToken}`,
  },
  body: formData, // multipart/form-data with audio file
});
```

### 4. Token Refresh Strategy

Proactively refresh tokens in your app before they expire.

## Security Best Practices

### Essential Guidelines

* ✅ Generate tokens server-side only
* ✅ Use short token lifetimes (minutes)
* ✅ Implement automatic token refresh
* ✅ Store tokens in HTTP-only cookies
* ✅ Enable secure and SameSite cookie flags

### Security Don'ts

* ❌ Never store tokens in localStorage/sessionStorage
* ❌ Never log tokens or display them in the UI
* ❌ Never transmit tokens over non-HTTPS connections

### Token Lifecycle Management

1. Generate a new token upon user authentication
2. Implement automatic refresh before expiration
3. Handle token expiration gracefully

## Additional Resources

* [API Reference](/api-reference/auth/access-token) - Access Token generation endpoint documentation

# Welcome to Cartesia

Source: https://docs.cartesia.ai/get-started/overview

Our API enables developers to build real-time, multimodal AI experiences that feel natural and responsive.

The Cartesia API is the fastest, most emotive, ultra-realistic voice AI platform. Purpose-built for developers, it serves state-of-the-art models for both text-to-speech and speech-to-text, enabling seamless conversational AI experiences.

## Sonic Models for Text-to-Speech

Sonic models take text input and stream back ultra-realistic speech in response. They can also clone voices, with full control over pronunciation and accent.

**Sonic 3 is the world's fastest, most emotive, ultra-realistic text-to-speech model.** It can stream out the first byte of audio in just 90ms, making it perfect for real-time and conversational experiences as well as dubbing, narration, AI avatars, and more. (To put things into perspective, 90ms is about twice as fast as the blink of an eye.)

**If real-time performance is your top priority,** Sonic Turbo offers even lower latency, streaming out the first byte of audio in just 40ms.

Learn more about available Sonic model variants and their capabilities in the [TTS Models](../build-with-cartesia/tts-models/latest) section.

## Ink Models for Speech-to-Text

Ink models provide streaming speech-to-text transcription optimized for real-time voice applications. **Ink-Whisper**, our debut model, is specifically engineered for conversational AI—handling telephony artifacts, background noise, accents, and proper nouns that typically challenge standard STT systems.

Ink-Whisper uses advanced dynamic chunking to process variable-length audio segments, reducing errors and hallucinations during pauses or audio gaps. At just \$0.13/hour, it's the most affordable streaming STT model available.

Learn more about the Ink model and its capabilities in the [STT Models](../build-with-cartesia/stt-models) section.

## Support

Join our Discord server to chat with the Cartesia team, engage with the community, and get help with your projects.

Email us at [support@cartesia.ai](mailto:support@cartesia.ai) to get help with integrating Cartesia, your account, or billing.
# Realtime Text to Speech Quickstart

Source: https://docs.cartesia.ai/get-started/realtime-text-to-speech-quickstart

Stream text to Cartesia over a WebSocket and receive audio in real time.

The Cartesia WebSocket API lets you stream text input and audio output simultaneously: you send text in chunks and receive audio chunks back in real time. This is ideal for realtime use cases such as voice agents, where text is generated incrementally, for example by an LLM.

## Prerequisites

* A Cartesia API key. [Create one here](https://play.cartesia.ai/keys), then add it to your `.bashrc` or `.zshrc`:

```sh theme={null}
export CARTESIA_API_KEY=
```

* `ffplay` (part of FFmpeg), used to play audio output:

```sh theme={null}
brew install ffmpeg
```

```sh theme={null}
sudo apt install ffmpeg
```

## Stream text and play audio

```sh theme={null}
pip install 'cartesia[websockets]'
```

```python realtime-tts.py theme={null}
from cartesia import Cartesia
import subprocess
import os

client = Cartesia(api_key=os.getenv("CARTESIA_API_KEY"))

print("Starting ffplay to play streaming audio output...")
player = subprocess.Popen(
    ["ffplay", "-f", "f32le", "-ar", "44100", "-probesize", "32", "-analyzeduration", "0",
     "-nodisp", "-autoexit", "-loglevel", "quiet", "-"],
    stdin=subprocess.PIPE,
    bufsize=0,
)

print("Connecting to Cartesia via websockets...")
with client.tts.websocket_connect() as connection:
    ctx = connection.context(
        model_id="sonic-3",
        voice={"mode": "id", "id": "f786b574-daa5-4673-aa0c-cbe3e8534c02"},
        output_format={
            "container": "raw",
            "encoding": "pcm_f32le",
            "sample_rate": 44100,
        },
    )

    print("Sending chunked text input...")
    for part in ["Hi there! ", "Welcome to ", "Cartesia Sonic."]:
        ctx.push(part)
    ctx.no_more_inputs()

    for response in ctx.receive():
        if response.type == "chunk" and response.audio:
            print(f"Received audio chunk ({len(response.audio)} bytes)")
            # Here we pipe audio to ffplay. In a production app you might play audio in
            # a client, or forward it to another service, e.g., a telephony provider.
            player.stdin.write(response.audio)
        elif response.type == "done":
            break

player.stdin.close()
player.wait()
```

```sh theme={null}
python3 realtime-tts.py
```

This will stream text inputs to Cartesia and play the streaming audio output using `ffplay`. (Make sure your device volume is turned on!)

```sh theme={null}
npm install @cartesia/cartesia-js ws
```

In the browser, you don't need the `ws` package — the native WebSocket API is used instead. However, you will need to use ephemeral access tokens for authentication. See [Authenticate Your Client Applications](/get-started/authenticate-your-client-applications).
Create a file named `realtime-tts.js` with the following code: ```js realtime-tts.js theme={null} import Cartesia from "@cartesia/cartesia-js"; import { spawn } from "child_process"; const client = new Cartesia({ apiKey: process.env["CARTESIA_API_KEY"] }); console.log("Starting ffplay to play streaming audio output..."); const player = spawn("ffplay", ["-f", "f32le", "-ar", "44100", "-probesize", "32", "-analyzeduration", "0", "-nodisp", "-autoexit", "-loglevel", "quiet", "-"], { stdio: ["pipe", "ignore", "ignore"], }); console.log("Connecting to Cartesia via websockets..."); const ws = await client.tts.websocket(); const ctx = ws.context({ model_id: "sonic-3", voice: { mode: "id", id: "f786b574-daa5-4673-aa0c-cbe3e8534c02" }, output_format: { container: "raw", encoding: "pcm_f32le", sample_rate: 44100 }, }); console.log("Sending chunked text input..."); const transcriptChunks = ["Hi there! ", "Welcome to ", "Cartesia Sonic."] for (const part of transcriptChunks) { await ctx.push({ transcript: part }); } await ctx.no_more_inputs(); for await (const event of ctx.receive()) { if (event.type === "chunk" && event.audio) { console.log("Received audio chunk (%d bytes)", event.audio.length); // Here we pipe audio to ffplay. In a production app you might play audio in // a client, or forward it to another service, eg, a telephony provider. player.stdin.write(event.audio); } else if (event.type === "done") { break; } } player.stdin.end(); ws.close(); ``` ```sh theme={null} node realtime-tts.js ``` This will stream text inputs to Cartesia, and play the streaming audio output using `ffplay`. (Make sure your device volume is turned on!) ## How it works The WebSocket connection manages multiple *contexts*, each representing an independent, continuous stream of speech. Cartesia context is exactly like an LLM context: on our servers, we store the previously-generated speech so that new speech matches it in tone. To summarize, here's what our code does, after establishing a Websocket connection: 1. **Create a context** with `context()`. 2. **Push text** incrementally with `push()`. Each chunk continues seamlessly from the previous one using [continuations](/build-with-cartesia/capability-guides/stream-inputs-using-continuations). 3. **Signal completion** with `no_more_inputs()` to tell the model no more text is coming. 4. **Receive audio** chunks as they are generated. This maps directly to LLM token streaming — push each token or sentence fragment as it arrives, and audio begins streaming back even if the full text is not yet available. ## What's next Deep dive into context management and buffering. Browse voices and learn how to pick the right one for your use case. Pick the right output format, sample rate, and encoding for your use case. # LiveKit Source: https://docs.cartesia.ai/integrations/live-kit LiveKit Agents logo **LiveKit** is a WebRTC-first platform for realtime **video, voice, and data** in your product. **LiveKit Agents** is its framework for conversational agents. **Cartesia** integrates in two ways: **LiveKit Inference** (hosted **cartesia/sonic-3** and related model IDs in the agent runtime; keys and pricing are through **LiveKit**—see [LiveKit’s Cartesia TTS guide](https://docs.livekit.io/agents/models/tts/inference/cartesia)) and the open-source **[livekit-plugins-cartesia](https://github.com/livekit/agents/tree/main/livekit-plugins/livekit-plugins-cartesia)** Python package for **TTS and STT** using your **Cartesia** credentials from the worker. 
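For the plugin path, here is a minimal, hedged sketch of constructing Cartesia TTS and STT components with `livekit-plugins-cartesia`. The voice ID and model name are placeholders, and constructor parameters can differ between plugin releases, so verify against LiveKit's plugin documentation before using it.

```python theme={null}
# Hedged sketch of the plugin path (your own Cartesia credentials, not LiveKit Inference).
# Import paths and parameter names can vary between plugin releases — check the
# livekit-plugins-cartesia docs before copying this.
from livekit.plugins import cartesia

tts = cartesia.TTS(
    model="sonic-2",                 # any Cartesia TTS model
    voice="YOUR_CARTESIA_VOICE_ID",  # placeholder voice ID
)
stt = cartesia.STT()  # streaming speech-to-text (Ink-Whisper)

# Both typically read CARTESIA_API_KEY from the environment and plug into a
# LiveKit Agents session as its tts/stt components.
```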
# Demo

Here's a demo of a voice assistant built with LiveKit Agents and Cartesia:

Try out the LiveKit Cartesia demo.

The source code for this demo is available [here](https://github.com/livekit-examples/cartesia-voice-agent).

# Overview

Source: https://docs.cartesia.ai/integrations/overview

Partner integrations for Cartesia TTS and STT in your own app—not Cartesia-hosted agents.

Cartesia provides first-party speech APIs and SDKs, and integrates with many other products and developer frameworks. The pages in this section describe each path at a high level; detailed setup usually lives in partner documentation and repositories.

## Prerequisites

You’ll need these for almost every integration below. Individual pages also list extras (partner accounts, runtimes, SDK installs).

* **[Cartesia API key](https://play.cartesia.ai/keys)** — create and manage keys in the Playground.
* **A voice** — pick one in the Playground or API; see [Choosing a voice](/build-with-cartesia/capability-guides/choosing-a-voice) and [Voice IDs](/build-with-cartesia/tts-models/voice-ids).

## Integrations

* Realtime rooms and agents—Cartesia via LiveKit Inference or the Cartesia plugin.
* Python voice and multimodal agents with official Cartesia TTS/STT services.
* Programmable Voice and Media Streams with Cartesia TTS (Node walkthrough).
* TRTC realtime media with Cartesia for conversational AI workloads.
* No-code phone agents; Cartesia is the default voice stack for new agents.
* Rasa Pro voice assistants with Cartesia as the TTS backend.
* Stream’s Vision Agents framework with a Cartesia TTS plugin.
* `cartesia-mcp` for Cursor, Claude Desktop, and other MCP clients.

# Pipecat

Source: https://docs.cartesia.ai/integrations/pipecat

Pipecat logo

## Overview

[**Pipecat**](https://www.pipecat.ai/) is an open-source Python framework for realtime **voice** agents. Building voice agents requires the creation and orchestration of pipelines, media and communication transports (such as Daily or LiveKit), and pluggable AI models. **Cartesia** is available as a first-party provider plugin for **[TTS and STT services](https://github.com/pipecat-ai/pipecat/tree/main/src/pipecat/services/cartesia)** in the Pipecat repo.

## Prerequisites

Pipecat’s examples require a recent Python installation (see the Pipecat repo's [root-level README](https://github.com/pipecat-ai/pipecat/tree/main#prerequisites) for current prerequisites).

Install the **`pipecat-ai`** Python package with the **`cartesia`** extra for TTS/STT (bracket syntax):

```
pip install "pipecat-ai[cartesia,...]"
# or
uv add "pipecat-ai[cartesia,...]"
```

You'll also need the **transport** extras your sample requires; match whatever the upstream README lists for that example.

## Getting Started - TTS (Websockets)

Pipecat's getting-started example provides a small, copy-friendly path to wire Cartesia TTS into a Pipecat pipeline over our [TTS WebSocket API](https://docs.cartesia.ai/api-reference/tts/websocket):

Getting-started examples in the Pipecat repo.

## Getting Started - TTS and STT (Websockets & HTTP)

For smaller voice-focused samples using Cartesia STT and TTS, you can choose between two transports, WebSockets or HTTP:

Voice bot using Cartesia STT & TTS over WebSocket.

Same flow using Cartesia STT & TTS over HTTP.

## Orchestrated Conversational AI

For a fuller example app that shows an end-to-end voice agent experience (VAD -> STT -> LLM -> TTS) orchestrated with Pipecat, see StudyPal:

StudyPal example in the pipecat-examples repo.
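Across these examples, the Cartesia pieces boil down to two service objects. Here is a hedged sketch of what constructing them looks like; import paths and constructor arguments move between Pipecat releases, so confirm them against the services directory linked above. The voice ID is a placeholder.

```python theme={null}
import os

# Hedged sketch — verify the current import paths and parameters in the Pipecat repo.
from pipecat.services.cartesia.stt import CartesiaSTTService
from pipecat.services.cartesia.tts import CartesiaTTSService

stt = CartesiaSTTService(api_key=os.getenv("CARTESIA_API_KEY"))

tts = CartesiaTTSService(
    api_key=os.getenv("CARTESIA_API_KEY"),
    voice_id="YOUR_CARTESIA_VOICE_ID",  # placeholder; any voice ID works
)

# These two services then slot into a Pipecat Pipeline alongside your transport and LLM.
```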
# Rasa Source: https://docs.cartesia.ai/integrations/rasa **Rasa** is an open dialogue stack; **voice streaming with Cartesia** is documented for **Rasa Pro** (commercial) assistants. Configure a voice channel in **`credentials.yml`** with `tts: name: cartesia` and **`CARTESIA_API_KEY`** per Rasa’s speech-integrations reference. Start with their walkthrough, then use the reference for parameters (`model_id`, `voice`, multilingual `language_map`, etc.): Full tutorial for building a voice agent with Rasa and Cartesia. For implementation details, see their documentation: Rasa reference for Cartesia TTS in voice channels. # Tencent RTC Source: https://docs.cartesia.ai/integrations/tencent-rtc Cartesia & Tencent **Tencent Real-Time Communication (TRTC)** is Tencent Cloud’s stack for realtime audio and video—calls, live streaming, and conferencing. **TRTC Conversational AI** is Tencent’s packaged stack for realtime voice agents. Tencent and Cartesia have a **public partnership** to combine TRTC networking with Cartesia **Sonic** TTS and **Ink-Whisper** STT for low-latency conversational AI (see Tencent’s [TRTC × Cartesia solution overview](https://trtc.tencentcloud.com/solutions/trtc-cartesia)). Integration steps and SDK details live in **Tencent’s** console and docs. # Demo Experience the TRTC × Cartesia voice assistant here: [TRTC x Cartesia Demo](https://trtc.io/demo/homepage/#/cartesia) # Thoughtly Source: https://docs.cartesia.ai/integrations/thoughtly
Thoughtly logo
**Thoughtly** is a no-code platform for **inbound and outbound AI phone agents** (sales, support, routing): visual flows, CRM and calendar integrations, analytics, and telephony. Following the [Thoughtly × Cartesia partnership](https://www.thoughtly.com/blog/thoughtly-upgrades-its-voice-library-through-partnership-with-cartesia/), **new agents default to Cartesia voices** (low-latency TTS, expanded library, cloning from a short sample in-product); Thoughtly notes existing agents can keep prior voices during migration. # Demo See a demo of Cartesia on Thoughtly. # Integrate with Twilio Source: https://docs.cartesia.ai/integrations/twilio How to integrate Twilio with Cartesia to generate audio from text and send it as a voice call. Use **Twilio Programmable Voice** with **Media Streams** so a phone call receives audio generated by **Cartesia TTS** over WebSockets. This walkthrough uses **Node.js**: a small server bridges Twilio’s stream to Cartesia and plays TTS audio on the callee’s line. ## Prerequisites Before you begin, make sure you have the following: 1. [Node.js](https://nodejs.org/en/download) installed. 2. A [Twilio account](https://www.twilio.com/en-us/try-twilio). You will need your Account SID and Auth Token. 3. A [Cartesia API key](https://play.cartesia.ai/keys). 4. A phone number that you want to call. 5. A Twilio phone number to call from. 6. An [ngrok authtoken](https://dashboard.ngrok.com/get-started/your-authtoken) (a free account works). ## Get Started 1. Create a new directory for your project and navigate to it in your terminal. 2. Initialize a new Node.js project: ```bash lines theme={null} npm init -y ``` 3. Install the required dependencies: ```bash lines theme={null} npm install twilio ws http @ngrok/ngrok dotenv ``` Create a `.env` file in your project root and add the following: ```sh lines theme={null} TWILIO_ACCOUNT_SID="your_twilio_account_sid" TWILIO_AUTH_TOKEN="your_twilio_auth_token" CARTESIA_API_KEY="your_cartesia_api_key" NGROK_AUTHTOKEN="your_ngrok_authtoken" ``` Replace the placeholder values with your actual credentials. Create a file named `app.js` (or any name you prefer) and add the following code: ```javascript lines theme={null} const twilio = require('twilio'); const WebSocket = require('ws'); const http = require('http'); const ngrok = require('@ngrok/ngrok'); const dotenv = require('dotenv'); const crypto = require('crypto'); // Load environment variables dotenv.config({ override: true }); // Function to get a value from environment variable or command line argument function getConfig(key, defaultValue = undefined) { return process.env[key] || process.argv.find(arg => arg.startsWith(`${key}=`))?.split('=')[1] || defaultValue; } // Configuration const config = { TWILIO_ACCOUNT_SID: getConfig('TWILIO_ACCOUNT_SID'), TWILIO_AUTH_TOKEN: getConfig('TWILIO_AUTH_TOKEN'), CARTESIA_API_KEY: getConfig('CARTESIA_API_KEY'), NGROK_AUTHTOKEN: getConfig('NGROK_AUTHTOKEN'), }; // Validate required configuration const requiredConfig = ['TWILIO_ACCOUNT_SID', 'TWILIO_AUTH_TOKEN', 'CARTESIA_API_KEY', 'NGROK_AUTHTOKEN']; for (const key of requiredConfig) { if (!config[key]) { console.error(`Missing required configuration: ${key}`); process.exit(1); } } const client = twilio(config.TWILIO_ACCOUNT_SID, config.TWILIO_AUTH_TOKEN); ``` In the script, you'll find a configuration section for Cartesia TTS. 
Make sure to set the following variables according to your needs: ```javascript lines theme={null} const TTS_WEBSOCKET_URL = `wss://api.cartesia.ai/tts/websocket?cartesia_version=2025-03-01`; const modelId = 'sonic-3'; const voice = { 'mode': 'id', // You can check available voices using the Cartesia API or at https://play.cartesia.ai 'id': "e07c00bc-4134-4eae-9ea4-1a55fb45746b" }; const partialResponse = 'Hi there, my name is Cartesia. I hope youre having a great day!'; ``` Configure your Twilio outbound and inbound numbers: ```javascript lines theme={null} const outbound = "+1234567890"; // Replace with the number you want to call const inbound = "+1234567890"; // Replace with your Twilio number ``` The `main()` function orchestrates the entire process: 1. Connects to the Cartesia TTS WebSocket 2. Tests the TTS WebSocket 3. Sets up a Twilio WebSocket server 4. Creates an ngrok tunnel for the Twilio WebSocket 5. Initiates the call using Twilio ```javascript expandable lines theme={null} let ttsWebSocket; let callSid; let messageComplete = false; let audioChunksReceived = 0; function log(message) { console.log(`[${new Date().toISOString()}] ${message}`); } function connectToTTSWebSocket() { return new Promise((resolve, reject) => { log('Attempting to connect to TTS WebSocket'); ttsWebSocket = new WebSocket(TTS_WEBSOCKET_URL, { headers: { 'X-Api-Key': config.CARTESIA_API_KEY } }); ttsWebSocket.on('open', () => { log('Connected to TTS WebSocket'); resolve(ttsWebSocket); }); ttsWebSocket.on('error', (error) => { log(`TTS WebSocket error: ${error.message}`); reject(error); }); ttsWebSocket.on('close', (code, reason) => { log(`TTS WebSocket closed. Code: ${code}, Reason: ${reason}`); reject(new Error('TTS WebSocket closed unexpectedly')); }); }); } function sendTTSMessage(message) { const textMessage = { 'model_id': modelId, 'transcript': message, 'voice': voice, 'output_format': { 'container': 'raw', 'encoding': 'pcm_mulaw', 'sample_rate': 8000 }, // create a new context for each message since each is a complete transcript 'context_id': crypto.randomUUID() }; log(`Sending message to TTS WebSocket: ${message}`); ttsWebSocket.send(JSON.stringify(textMessage)); } function testTTSWebSocket() { return new Promise((resolve, reject) => { const testMessage = 'This is a test message'; let receivedAudio = false; sendTTSMessage(testMessage); const timeout = setTimeout(() => { if (!receivedAudio) { reject(new Error('Timeout: No audio received from TTS WebSocket')); } }, 10000); // 10 second timeout ttsWebSocket.on('message', (audioChunk) => { if (!receivedAudio) { log(audioChunk); log('Received audio chunk from TTS for test message'); receivedAudio = true; clearTimeout(timeout); resolve(); } }); }); } async function startCall(twilioWebsocketUrl) { try { log(`Initiating call with WebSocket URL: ${twilioWebsocketUrl}`); const call = await client.calls.create({ twiml: ``, to: outbound, // Replace with the phone number you want to call from: inbound // Replace with your Twilio phone number }); callSid = call.sid; log(`Call initiated. 
SID: ${callSid}`); } catch (error) { log(`Error initiating call: ${error.message}`); throw error; } } async function hangupCall() { try { log(`Attempting to hang up call: ${callSid}`); await client.calls(callSid).update({status: 'completed'}); log('Call hung up successfully'); } catch (error) { log(`Error hanging up call: ${error.message}`); } } function setupTwilioWebSocket() { return new Promise((resolve, reject) => { const server = http.createServer((req, res) => { log(`Received HTTP request: ${req.method} ${req.url}`); res.writeHead(200); res.end('WebSocket server is running'); }); const wss = new WebSocket.Server({ server }); log('WebSocket server created'); wss.on('connection', (twilioWs, request) => { log(`Twilio WebSocket connection attempt from ${request.socket.remoteAddress}`); let streamSid = null; twilioWs.on('message', (message) => { try { const msg = JSON.parse(message); log(`Received message from Twilio: ${JSON.stringify(msg)}`); if (msg.event === 'start') { log('Media stream started'); streamSid = msg.start.streamSid; log(`Stream SID: ${streamSid}`); sendTTSMessage(partialResponse); } else if (msg.event === 'media' && !messageComplete) { log('Received media event'); } else if (msg.event === 'stop') { log('Media stream stopped'); hangupCall(); } } catch (error) { log(`Error processing Twilio message: ${error.message}`); } }); twilioWs.on('close', (code, reason) => { log(`Twilio WebSocket disconnected. Code: ${code}, Reason: ${reason}`); }); twilioWs.on('error', (error) => { log(`Twilio WebSocket error: ${error.message}`); }); // Handle incoming audio chunks from TTS WebSocket ttsWebSocket.on('message', (audioChunk) => { log('Received audio chunk from TTS'); try { if (streamSid) { twilioWs.send(JSON.stringify({ event: 'media', streamSid: streamSid, media: { payload: JSON.parse(audioChunk)['data'] } })); audioChunksReceived++; log(`Audio chunks received: ${audioChunksReceived}`); if (audioChunksReceived >= 50) { messageComplete = true; log('Message complete, preparing to hang up'); setTimeout(hangupCall, 2000); } } else { log('Warning: Received audio chunk but streamSid is not set'); } } catch (error) { log(`Error sending audio chunk to Twilio: ${error.message}`); } }); log('Twilio WebSocket connected and handlers set up'); }); wss.on('error', (error) => { log(`WebSocket server error: ${error.message}`); }); server.listen(0, () => { const port = server.address().port; log(`Twilio WebSocket server is running on port ${port}`); resolve(port); }); server.on('error', (error) => { log(`HTTP server error: ${error.message}`); reject(error); }); }); } async function setupNgrokTunnel(port) { try { const listener = await ngrok.forward({ addr: port, authtoken: config.NGROK_AUTHTOKEN, }); const wssUrl = listener.url().replace('https://', 'wss://'); log(`ngrok tunnel established: ${wssUrl}`); return wssUrl; } catch (error) { log(`Error setting up ngrok tunnel: ${error.message}`); throw error; } } async function main() { try { log('Starting application'); await connectToTTSWebSocket(); log('TTS WebSocket connected successfully'); await testTTSWebSocket(); log('TTS WebSocket test passed successfully'); const twilioWebsocketPort = await setupTwilioWebSocket(); log(`Twilio WebSocket server set up on port ${twilioWebsocketPort}`); const twilioWebsocketUrl = await setupNgrokTunnel(twilioWebsocketPort); await startCall(twilioWebsocketUrl); } catch (error) { log(`Error in main function: ${error.message}`); } } // Run the script main(); ``` To run the application, use the following command: ```bash 
lines theme={null} node app.js ``` ## How It Works 1. The script establishes a connection to Cartesia's TTS WebSocket. 2. It sets up a local WebSocket server to communicate with Twilio. 3. An ngrok tunnel is created to expose the local WebSocket server to the internet. 4. A call is initiated using Twilio, connecting to the ngrok tunnel. 5. When the call connects, the script sends the predefined message to Cartesia's TTS. 6. Cartesia converts the text to speech and sends audio chunks back. 7. The script forwards these audio chunks to Twilio, which plays them on the call. ## Customization * To change the spoken message, modify the `partialResponse` variable. * Adjust the voice parameters in the `voice` object to change the TTS voice characteristics. * Modify the `audioChunksReceived` threshold to control when the call should end. ## Troubleshooting * If you encounter any issues, check the console logs for detailed error messages. * Ensure all required environment variables are correctly set. * If you see `invalid tunnel configuration`, make sure you're using the better supported `@ngrok/ngrok` package and not `ngrok`. # Vision Agents by Stream Source: https://docs.cartesia.ai/integrations/vision-agents-by-stream Vision Agents logo [Stream](https://getstream.io/) maintains **[Vision Agents](https://visionagents.ai)**—an open-source Python framework for voice- and vision-driven agents with realtime media over **Stream**’s WebRTC edge. Cartesia is supported as the **TTS** provider; install steps, environment variables, and parameters are in Stream’s **[Cartesia integration](https://visionagents.ai/integrations/cartesia)**. You need a **Stream** developer account for realtime transport and a **Cartesia API key** for speech. The ["Simple Agent"](https://github.com/GetStream/Vision-Agents/tree/main/examples/01_simple_agent_example) example in GitHub and the [voice](https://visionagents.ai/introduction/voice-agents) / [video](https://visionagents.ai/introduction/video-agents) intros are good starting points. # Demo Try out the Simple Agent Cartesia demo. # CLI documentation Source: https://docs.cartesia.ai/line/cli Create, deploy, and manage voice agents from the command line. ## Installation By running the quick install commands, you are accepting Cartesia's [Terms of Service (TOS)](https://cartesia.ai/legal/terms.html). Please make sure to review the full TOS here before proceeding. Install and download from our servers: ```zsh lines theme={null} curl -fsSL https://cartesia.sh | sh ``` Update to the latest version: ```zsh lines theme={null} cartesia update ``` ## Quick Start Authenticate with your Cartesia API key. To make an API key, go to [play.cartesia.ai/keys](https://play.cartesia.ai/keys) and select your organization. ```zsh lines theme={null} cartesia auth login # paste your API key when prompted ``` Clone an example agent from the Line repository. ```zsh lines theme={null} cartesia create my-agent # Choose any example you like. cd my-agent ``` Give your agent a name and link it to your organization. ```zsh lines theme={null} cartesia init ``` Deploy your agent to make it available in the playground. ```zsh lines theme={null} cartesia deploy ``` ## Features ### Initialize a Project Link any directory to a new or existing Cartesia agent: ```zsh lines theme={null} cartesia init ``` Create a project from an example: ```zsh lines theme={null} cartesia create ``` Inside a project directory, the CLI auto-detects the agent. Run `cartesia status` to see the current agent ID. 
### Chat with Your Agent Test your agent's text reasoning locally. Terminal 1. Run your text logic fastapi server: ```zsh lines theme={null} PORT=8000 uv run python main.py ``` Terminal 2. Run the CLI to chat with your agent: ```zsh lines theme={null} cartesia chat 8000 ``` ## Commands ### Authentication To get an API key, go to [play.cartesia.ai/keys](https://play.cartesia.ai/keys), select your organization, and generate a new key. ```zsh lines theme={null} cartesia auth login ``` To validate the existing API key: ```zsh lines theme={null} cartesia auth status ``` To logout (clears cached credentials): ```zsh lines theme={null} cartesia auth logout ``` ### Voice Agents Deploy your agent to Cartesia cloud. ```zsh lines theme={null} cartesia deploy ``` List out all the agents in your organization: ```zsh lines theme={null} cartesia agents ls ``` #### Managed Deployments Versions of your agent running on Cartesia's cloud. Each deployment rebuilds the environment, instantiates your project, and runs a health check. To see all of your deployments: ```zsh lines theme={null} cartesia deployments ls ``` Check the status of a deployment: ```zsh lines theme={null} cartesia status [ or ] ``` #### Self-Hosted Agent Code While Cartesia's managed deployments are the simplest way to deploy low-latency voice agents, if you'd like to manage your own deployments of your agent code, you can pass us a URL for your agent to connect to during calls. Connect an existing agent to your self-hosted code: ```zsh lines theme={null} cartesia connect --agent-id --url https://my-agent.example.com ``` Or run without `--agent-id` to interactively select an existing agent or create a new one: ```zsh lines theme={null} cartesia connect --url https://my-agent.example.com ``` Disconnect an agent from your self-hosted code: ```zsh lines theme={null} cartesia disconnect --agent-id ``` ### Environment Variables Create, list, and remove environment variables for your agent. Set environment variables for your agent: ```zsh lines theme={null} cartesia env set API_KEY=FOOBAR MY_CONFIG=FOOBAZ ``` Environment variables are encrypted for storage and can only be accessed by your code. Port environment variables from a `.env` file: ```zsh lines theme={null} cartesia env set --from .env ``` ```text .env theme={null} API_KEY=FOOBAR MY_CONFIG=FOOBAZ ``` Remove an environment variable: ```zsh lines theme={null} cartesia env rm ``` ### Help Menu For more details on any command: ```zsh lines theme={null} cartesia --help ``` # Release Notes Source: https://docs.cartesia.ai/line/developer-tools/release-notes Updates to the Line SDK and platform. ## March 2026 Platform-wide API, PVC, and client library updates for this month are in [Changelog 2026](/changelog/2026) (March 2026). *** ## February 4, 2026 ### AgentUpdateCall Output Event Added `AgentUpdateCall` event for dynamically updating call configuration during a conversation: ```python theme={null} from line.events import AgentUpdateCall # In an agent's process method: yield AgentUpdateCall(voice_id="5ee9feff-1265-424a-9d7f-8e4d431a12c7") yield AgentUpdateCall(pronunciation_dict_id="dict-123") ``` | Field | Description | | ----------------------- | ------------------------------------ | | `voice_id` | Updates the agent's voice | | `pronunciation_dict_id` | Updates the pronunciation dictionary | All fields are optional—only set fields are updated. See [Events](/line/sdk/events#dynamic-configuration) for details. 
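For example, one way to emit this event is from a tool. The hedged sketch below uses the v0.2 `@passthrough_tool` decorator to switch the agent's voice mid-call; the voice ID is a placeholder.

```python theme={null}
from line.events import AgentUpdateCall
from line.llm_agent import passthrough_tool

@passthrough_tool
async def switch_to_backup_voice(ctx):
    """Switch the agent to a different voice for the rest of the call."""
    # Placeholder voice ID — substitute one from your own account.
    yield AgentUpdateCall(voice_id="YOUR_VOICE_ID")
```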
*** ## February 1, 2026 ### Line SDK v0.2 — Major Release We're releasing **Line SDK v0.2**, a complete redesign of the voice agent framework focused on simplicity, streaming performance, and seamless LLM integration. This release introduces a new async iterable architecture that replaces the previous event bus system. **Breaking Changes**: v0.2 is not backwards compatible with v0.1.x. See the [Migration Guide](#migration-guide-from-v0-1-x-to-v0-2) below for detailed upgrade instructions. **What's changing?** Line SDK v0.2 makes it much simpler to build voice agents. Instead of manually wiring together multiple components (systems, bridges, nodes), you now write a single function that returns your agent. The SDK handles audio, interruptions, and conversation flow automatically. **Why upgrade?** * **Faster development** — Build agents in hours instead of days with less boilerplate code * **Easier maintenance** — Fewer moving parts means fewer bugs and simpler debugging * **Better reliability** — Built-in error handling, retries, and fallback models * **More flexibility** — Switch between 100+ AI providers (OpenAI, Anthropic, Google, etc.) without code changes * **Powerful tools** — Add capabilities like web search, call transfers, and multi-agent handoffs with one line of code *** ## What's New in v0.2 ### Simplified Agent Architecture The new architecture replaces the `VoiceAgentSystem`, `Bus`, `Bridge`, and `ReasoningNode` pattern with a single async iterable function: ```python theme={null} import os from line import CallRequest from line.llm_agent import LlmAgent, LlmConfig, end_call from line.voice_agent_app import AgentEnv, VoiceAgentApp async def get_agent(env: AgentEnv, call_request: CallRequest): return LlmAgent( model="anthropic/claude-haiku-4-5-20251001", api_key=os.getenv("ANTHROPIC_API_KEY"), tools=[end_call], config=LlmConfig( system_prompt="You are a helpful assistant.", introduction="Hello! How can I help you today?", ), ) app = VoiceAgentApp(get_agent=get_agent) ``` **Benefits:** * Less boilerplate code * No manual event routing or bridge configuration * Automatic conversation history management * Built-in interruption handling * Quick, and easy tool definition ### Built-in LLM Support via LiteLLM `LlmAgent` provides unified access to 100+ LLM providers through [LiteLLM](https://github.com/BerriAI/litellm): ```python theme={null} # OpenAI LlmAgent(model="gpt-5-nano", api_key=os.getenv("OPENAI_API_KEY"), ...) # Anthropic LlmAgent(model="anthropic/claude-haiku-4-5-20251001", api_key=os.getenv("ANTHROPIC_API_KEY"), ...) # Google Gemini LlmAgent(model="gemini/gemini-2.5-flash-preview-09-2025", api_key=os.getenv("GEMINI_API_KEY"), ...) # With fallbacks LlmAgent( model="gpt-5-nano", config=LlmConfig(fallbacks=["anthropic/claude-haiku-4-5-20251001", "gemini/gemini-2.5-flash-preview-09-2025"]), ... ) ``` ### Declarative Tool System Define agent capabilities using simple decorators. 
Three tool types cover all common scenarios: | Tool Type | Decorator | What It Does | Example Use Case | | --------------- | ------------------- | --------------------------------------------------------------- | ------------------------------------------------- | | **Loopback** | `@loopback_tool` | Fetches information, then the agent speaks the answer naturally | Looking up order status, checking account balance | | **Passthrough** | `@passthrough_tool` | Takes an immediate action without additional AI processing | Ending a call, transferring to a phone number | | **Handoff** | `@handoff_tool` | Transfers the conversation to a different specialized agent | Routing to Spanish support, escalating to billing | ```python theme={null} from typing import Annotated from line.llm_agent import loopback_tool, passthrough_tool, handoff_tool from line.events import AgentEndCall @loopback_tool async def get_weather(ctx, city: Annotated[str, "City name"]) -> str: """Get current weather for a city.""" return f"72°F and sunny in {city}" @passthrough_tool async def end_call(ctx): """End the call.""" yield AgentEndCall() @handoff_tool async def transfer_to_support(ctx, event): """Transfer to support agent.""" async for output in support_agent.process(ctx.turn_env, event): yield output ``` ### Background Tool Execution Long-running tools can execute in the background without blocking the LLM: ```python theme={null} from typing import Annotated from line.llm_agent import loopback_tool @loopback_tool(is_background=True) async def check_bank_balance(ctx, account_id: Annotated[str, "Account ID"]): """Check account balance (may take a few seconds).""" yield "Checking your balance..." # Immediate acknowledgment balance = await api.get_balance(account_id) # Long operation yield f"Your balance is ${balance:.2f}" # Triggers new LLM completion ``` ### Built-in Tools Common operations available out of the box: ```python theme={null} from line.llm_agent import end_call, send_dtmf, transfer_call, web_search, agent_as_handoff agent = LlmAgent( tools=[ end_call, # End the call send_dtmf, # Send DTMF tones transfer_call, # Transfer to phone number web_search, # Real-time web search agent_as_handoff(other_agent, name="transfer_to_billing"), ], ... ) ``` ### Multi-Agent Workflows Create sophisticated agent routing with `agent_as_handoff`: ```python theme={null} spanish_agent = LlmAgent( model="gpt-5-nano", config=LlmConfig(system_prompt="Speak only in Spanish.", ...), ... ) main_agent = LlmAgent( tools=[ agent_as_handoff( spanish_agent, handoff_message="Transferring to Spanish support...", name="transfer_to_spanish", description="Transfer when user requests Spanish.", ), ], ... ) ``` ### Structured Event System Events are how your agent communicates with the outside world. **Output events** are actions your agent takes (speaking, ending calls). **Input events** are things that happen during a call (user speaks, call starts). 
**Output Events** (agent → harness): * `AgentSendText` — Send text to be spoken * `AgentEndCall` — End the call * `AgentTransferCall` — Transfer to another number * `AgentSendDtmf` — Send DTMF tone * `AgentToolCalled` / `AgentToolReturned` — Tool execution tracking * `LogMetric` / `LogMessage` — Observability **Input Events** (harness → agent): * `CallStarted` / `CallEnded` — Call lifecycle * `UserTurnStarted` / `UserTurnEnded` — User speaking * `UserTextSent` / `UserDtmfSent` — User content * `AgentHandedOff` — Handoff notification All input events include a `history` field with the complete conversation context. ### Enhanced Configuration Fine-tune how your agent thinks and responds. `LlmConfig` lets you control the AI's personality, response length, creativity, and reliability: ```python theme={null} LlmConfig( system_prompt="You are a helpful assistant.", introduction="Hello! How can I help?", # Sampling parameters temperature=0.7, max_tokens=1024, top_p=0.95, # Resilience num_retries=2, fallbacks=["gpt-5-nano"], timeout=30.0, # Provider-specific options extra={"reasoning_effort": "high"}, ) ``` *** ## Migration Guide from v0.1.x to v0.2 This guide walks you through upgrading your existing v0.1.x agents to v0.2. The migration involves updating imports, simplifying your agent setup, and adopting the new tool system. Most agents can be migrated in under an hour. ### Overview of Changes | v0.1.x | v0.2 | | ------------------------------------- | ----------------------------------------- | | `VoiceAgentSystem` + `Bus` + `Bridge` | `VoiceAgentApp` with `get_agent` callback | | `ReasoningNode` subclasses | `LlmAgent` or custom `Agent` protocol | | `call_handler(system, request)` | `get_agent(env, request) -> Agent` | | Manual event routing | Automatic event dispatch with filters | | `process_context()` method | `process(env, event)` async iterable | ### Step 1: Update Imports ```python theme={null} # v0.1.x from line.voice_agent_app import VoiceAgentApp from line.voice_agent_system import VoiceAgentSystem from line.bridge import Bridge from line.nodes import ReasoningNode from line.events import ( AgentSpeechSent, UserTranscriptionReceived, EndCall, TransferCall, ) # v0.2 from line.voice_agent_app import VoiceAgentApp, AgentEnv from line.llm_agent import LlmAgent, LlmConfig from line.llm_agent import end_call, transfer_call, loopback_tool, passthrough_tool from line.events import ( AgentSendText, AgentEndCall, AgentTransferCall, UserTurnEnded, CallStarted, ) ``` ### Step 2: Replace VoiceAgentSystem with get\_agent In v0.1.x, event routing was configured manually via `bridge.on()`. In v0.2, event dispatch is automatic with customizable **run** and **cancel filters**. 
```python v0.1.x theme={null} from line.voice_agent_app import VoiceAgentApp from line.voice_agent_system import VoiceAgentSystem from line.bridge import Bridge from line.nodes import ReasoningNode from line.events import ( UserTranscriptionReceived, UserStoppedSpeaking, DTMFInputEvent, ) class MyReasoningNode(ReasoningNode): async def process_context(self, context): # Your LLM logic here response = await call_llm(context.messages) yield AgentResponse(content=response) async def call_handler(system: VoiceAgentSystem, call_request): node = MyReasoningNode(system_prompt="You are helpful.") bridge = Bridge(node) system.with_speaking_node(node, bridge) # Manual event routing with bridge.on() bridge.on(UserTranscriptionReceived).map(node.add_event) bridge.on(UserStoppedSpeaking).stream(node.generate).broadcast() # DTMF events required explicit routing bridge.on(DTMFInputEvent).map(node.handle_dtmf) await system.start() await system.send_initial_message("Hello!") await system.wait_for_shutdown() app = VoiceAgentApp(call_handler=call_handler) ``` ```python v0.2 theme={null} import os from line import CallRequest from line.voice_agent_app import VoiceAgentApp, AgentEnv from line.llm_agent import LlmAgent, LlmConfig, end_call from line.events import ( CallStarted, UserTurnEnded, UserDtmfSent, UserTurnStarted, CallEnded, ) async def get_agent(env: AgentEnv, call_request: CallRequest): agent = LlmAgent( model="gpt-5-nano", api_key=os.getenv("OPENAI_API_KEY"), tools=[end_call], config=LlmConfig( system_prompt="You are helpful.", introduction="Hello!", ), ) # Default: just return the agent (uses default filters) return agent async def get_agent_with_dtmf(env: AgentEnv, call_request: CallRequest): """Alternative: include DTMF events in processing.""" agent = LlmAgent(...) # Return an AgentSpec tuple to customize filters run_filter = [CallStarted, UserTurnEnded, UserDtmfSent, CallEnded] cancel_filter = [UserTurnStarted] return (agent, run_filter, cancel_filter) app = VoiceAgentApp(get_agent=get_agent) ``` #### Run and Cancel Filters Filters control your agent's behavior during a call: * **Run filters** determine what triggers your agent to respond (e.g., when the user finishes speaking) * **Cancel filters** determine what interrupts your agent (e.g., when the user starts talking over the agent) You can customize these by returning a tuple instead of just the agent: ```python theme={null} from typing import Union, Tuple AgentSpec = Union[Agent, Tuple[Agent, run_filter, cancel_filter]] ``` | Filter | Purpose | Default | | ------------------ | ------------------------------------------ | ----------------------------------------- | | **run\_filter** | Events that trigger agent processing | `[CallStarted, UserTurnEnded, CallEnded]` | | **cancel\_filter** | Events that cancel in-progress agent tasks | `[UserTurnStarted]` | **Example: Agent that responds to DTMF input** ```python theme={null} from line.events import ( CallStarted, CallEnded, UserTurnEnded, UserTurnStarted, UserDtmfSent ) async def get_agent(env: AgentEnv, call_request: CallRequest): agent = LlmAgent(...) # Include UserDtmfSent in run_filter to process DTMF run_filter = [CallStarted, UserTurnEnded, UserDtmfSent, CallEnded] cancel_filter = [UserTurnStarted] return (agent, run_filter, cancel_filter) ``` **Example: Agent that doesn't get interrupted** ```python theme={null} async def get_agent(env: AgentEnv, call_request: CallRequest): agent = LlmAgent(...) 
# Empty cancel_filter = agent won't be interrupted run_filter = [CallStarted, UserTurnEnded, CallEnded] cancel_filter = [] return (agent, run_filter, cancel_filter) ``` **Example: Custom filter function** ```python theme={null} def my_run_filter(event: InputEvent) -> bool: """Only process events during business hours.""" if isinstance(event, CallStarted): return is_business_hours() return isinstance(event, (UserTurnEnded, CallEnded)) async def get_agent(env: AgentEnv, call_request: CallRequest): agent = LlmAgent(...) return (agent, my_run_filter, [UserTurnStarted]) ``` ### Step 3: Migrate Event Handling ```python v0.1.x theme={null} # Event names AgentSpeechSent # Agent spoke UserTranscriptionReceived # User spoke EndCall # End call TransferCall # Transfer call # Manual event handling in ReasoningNode class MyNode(ReasoningNode): async def process_context(self, context): for event in context.events: if isinstance(event, UserTranscriptionReceived): user_message = event.transcription ``` ```python v0.2 theme={null} # Event names AgentSendText # Output: send text to speak AgentTextSent # Input: confirmation text was spoken UserTurnEnded # Input: user finished speaking AgentEndCall # Output: end call AgentTransferCall # Output: transfer call # Events include history automatically async def process(self, env, event): if isinstance(event, UserTurnEnded): # Access user's message user_message = event.content[0].content # Access full conversation history for past_event in event.history: if isinstance(past_event, UserTextSent): print(f"User previously said: {past_event.content}") ``` ### Step 4: Migrate Custom Tools ```python v0.1.x theme={null} # Manual tool handling in ReasoningNode class MyNode(ReasoningNode): async def process_context(self, context): # Parse tool calls from LLM response if tool_call := extract_tool_call(response): result = await self.execute_tool(tool_call) # Manually add to context and call LLM again context.add_tool_result(result) response = await call_llm(context) ``` ```python v0.2 theme={null} from typing import Annotated from line.llm_agent import loopback_tool, passthrough_tool from line.events import AgentSendText, AgentEndCall # Declarative tool definitions @loopback_tool async def get_account_balance(ctx, account_id: Annotated[str, "Account ID"]): """Look up account balance.""" balance = await api.get_balance(account_id) return f"${balance:.2f}" @passthrough_tool async def end_call_with_message(ctx, message: Annotated[str, "Goodbye message"]): """End call with a custom message.""" yield AgentSendText(text=message) yield AgentEndCall() # Tools are passed to LlmAgent agent = LlmAgent( tools=[get_account_balance, end_call_with_message], ... ) ``` ### Step 5: Migrate Multi-Agent Patterns ```python v0.1.x theme={null} # Manual agent switching class MainNode(ReasoningNode): def __init__(self, spanish_node): self.spanish_node = spanish_node self.use_spanish = False async def process_context(self, context): if self.should_switch_to_spanish(context): self.use_spanish = True # Complex manual state management ``` ```python v0.2 theme={null} from line.llm_agent import agent_as_handoff spanish_agent = LlmAgent( model="gpt-5-nano", config=LlmConfig(system_prompt="Speak only in Spanish."), ... ) main_agent = LlmAgent( tools=[ agent_as_handoff( spanish_agent, handoff_message="Transferring...", name="transfer_to_spanish", description="Use when user requests Spanish.", ), ], ... 
)
```

### Removed APIs

The following APIs from v0.1.x have been removed. None of them has a direct one-to-one replacement; use the alternatives below instead:

| Removed               | Alternative                                  |
| --------------------- | -------------------------------------------- |
| `VoiceAgentSystem`    | Use `VoiceAgentApp` with `get_agent`         |
| `Bus`                 | Events are dispatched automatically          |
| `Bridge`              | Use run/cancel filters on `AgentSpec`        |
| `ReasoningNode`       | Use `LlmAgent` or implement `Agent` protocol |
| `ConversationHarness` | Handled internally by `ConversationRunner`   |
| `EventsRegistry`      | Use typed event classes directly             |

### Custom Agent Protocol

If you need custom logic beyond `LlmAgent`, implement the `Agent` protocol:

```python theme={null}
from typing import AsyncIterable

from line.events import (
    InputEvent,
    OutputEvent,
    AgentSendText,
    CallStarted,
    UserTurnEnded,
)


class CustomAgent:
    """Custom agent implementing the Agent protocol."""

    async def process(self, env, event: InputEvent) -> AsyncIterable[OutputEvent]:
        if isinstance(event, CallStarted):
            yield AgentSendText(text="Hello from custom agent!")
        elif isinstance(event, UserTurnEnded):
            # Your custom logic here
            user_message = event.content[0].content
            response = await your_custom_logic(user_message, event.history)
            yield AgentSendText(text=response)
```

***

## Breaking Changes Summary

This section provides a quick reference for all breaking changes. Use this as a checklist when migrating your code.

### Event Renames

| v0.1.x                      | v0.2                                                |
| --------------------------- | --------------------------------------------------- |
| `AgentSpeechSent`           | `AgentSendText` (output) / `AgentTextSent` (input)  |
| `UserTranscriptionReceived` | `UserTextSent` / `UserTurnEnded`                    |
| `UserStartedSpeaking`       | `UserTurnStarted`                                   |
| `UserStoppedSpeaking`       | `UserTurnEnded`                                     |
| `AgentStartedSpeaking`      | `AgentTurnStarted`                                  |
| `AgentStoppedSpeaking`      | `AgentTurnEnded`                                    |
| `EndCall`                   | `AgentEndCall`                                      |
| `TransferCall`              | `AgentTransferCall`                                 |
| `DTMFInputEvent`            | `UserDtmfSent`                                      |
| `DTMFOutputEvent`           | `AgentSendDtmf`                                     |

**Output vs. Input events**: `AgentSendText` is an output event you **yield** to make the agent speak. `AgentTextSent` is an input event you **receive** confirming what was spoken (appears in history).

### Structural Changes

* **History in events**: All input events now include an optional `history` field with complete conversation context. When `history` is `None`, the event is inside a history list; when it contains a list, the event has full context attached.
* **Tool events**: `ToolCall`/`ToolResult` replaced with structured `AgentToolCalled`/`AgentToolReturned`
* **Event IDs**: All events now have stable `event_id` fields for tracking

### Configuration Changes

| v0.1.x                            | v0.2                                  |
| --------------------------------- | ------------------------------------- |
| `CallRequest.agent.system_prompt` | `LlmConfig.system_prompt`             |
| `CallRequest.agent.introduction`  | `LlmConfig.introduction`              |
| Manual LLM parameters             | `LlmConfig` with full LiteLLM support |

Use `LlmConfig.from_call_request(call_request, fallback_system_prompt="...", fallback_introduction="...")` to automatically inherit configuration from the Cartesia Playground while providing sensible defaults. See [Agents documentation](/line/sdk/agents#accessing-call-metadata-in-your-agent-logic) for details.
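For example, a minimal `get_agent` that inherits Playground configuration could look like the sketch below. It combines `from_call_request` with the `LlmAgent` setup shown earlier; the model, fallback prompt, and introduction are placeholders, and the sketch assumes `from_call_request` returns an `LlmConfig` you can pass straight to `LlmAgent`:

```python theme={null}
import os

from line import CallRequest
from line.llm_agent import LlmAgent, LlmConfig, end_call
from line.voice_agent_app import AgentEnv, VoiceAgentApp


async def get_agent(env: AgentEnv, call_request: CallRequest):
    # Inherit system_prompt and introduction from the Cartesia Playground when
    # they are set on the call; otherwise fall back to the values given here.
    config = LlmConfig.from_call_request(
        call_request,
        fallback_system_prompt="You are a helpful assistant.",
        fallback_introduction="Hello! How can I help you today?",
    )
    return LlmAgent(
        model="gpt-5-nano",
        api_key=os.getenv("OPENAI_API_KEY"),
        tools=[end_call],
        config=config,
    )


app = VoiceAgentApp(get_agent=get_agent)
```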
***

## New Dependencies

v0.2 introduces the following dependencies:

```
litellm              # Multi-provider LLM support
pydantic             # Type validation for events
phonenumbers >= 9.0  # Phone number validation for transfer_call
```

Optional dependencies for examples:

```
exa-py             # Exa web search integration
duckduckgo-search  # Fallback web search
```

***

## Getting Help

* **Documentation**: [Line SDK Overview](/line/sdk/overview)
* **Examples**: [github.com/cartesia-ai/line/examples](https://github.com/cartesia-ai/line/tree/main/examples)
* **Support**: [support@cartesia.ai](mailto:support@cartesia.ai)

# Metrics

Source: https://docs.cartesia.ai/line/evaluations/metrics

The Line platform includes a suite of tools for evaluating how your agent is performing, both during the development phase and in production. You have full control over how the metrics used to evaluate your agent are defined.