Realtime Text to Speech Quickstart

The Cartesia WebSocket API lets you stream text input and receive audio output simultaneously. It is best suited to realtime use cases such as voice agents, where text is generated incrementally, for example by an LLM. You stream text to Cartesia in chunks and receive audio chunks in real time.

Prerequisites

A Cartesia API key. Create one here, then add it to your .bashrc or .zshrc:
```
export CARTESIA_API_KEY=<your api key here>
```
ffplay (part of FFmpeg), used to play audio output:
- macOS: brew install ffmpeg
- Ubuntu: sudo apt install ffmpeg
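
Before running the quickstart, it can help to confirm that both prerequisites are in place. The following is a minimal sketch using only the Python standard library; the helper name `check_prerequisites` is hypothetical and not part of the Cartesia SDK.

```python
import os
import shutil


def check_prerequisites(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    # The quickstart reads the API key from this environment variable.
    if not env.get("CARTESIA_API_KEY"):
        problems.append("CARTESIA_API_KEY is not set")
    # ffplay ships with FFmpeg and must be discoverable on PATH.
    if which("ffplay") is None:
        problems.append("ffplay was not found on PATH (install FFmpeg)")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(f"Missing prerequisite: {problem}")
```

If the script prints nothing, the API key is exported in the current shell and ffplay is installed. Remember to open a new terminal (or source your shell profile) after editing .bashrc or .zshrc.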

Stream text and play audio

Python

Install the SDK

pip install 'cartesia[websockets]'

Stream text over a WebSocket

realtime-tts.py

from cartesia import Cartesia
import subprocess
import os

client = Cartesia(api_key=os.getenv("CARTESIA_API_KEY"))

print("Starting ffplay to play streaming audio output...")
player = subprocess.Popen(
    ["ffplay", "-f", "f32le", "-ar", "44100", "-probesize", "32", "-analyzeduration", "0", "-nodisp", "-autoexit", "-loglevel", "quiet", "-"],
    stdin=subprocess.PIPE,
    bufsize=0,
)

print("Connecting to Cartesia via websockets...")
with client.tts.websocket_connect() as connection:
    ctx = connection.context(
        model_id="sonic-3.5",
        voice={"mode": "id", "id": "f786b574-daa5-4673-aa0c-cbe3e8534c02"},
        output_format={
            "container": "raw",
            "encoding": "pcm_f32le",
            "sample_rate": 44100,
        },
    )

    print("Sending chunked text input...")
    for part in ["Hi there! ", "Welcome to ", "Cartesia Sonic."]:
        ctx.push(part)

    ctx.no_more_inputs()

    for response in ctx.receive():
        if response.type == "chunk" and response.audio:
            print(f"Received audio chunk ({len(response.audio)} bytes)")
            # Here we pipe audio to ffplay. In a production app you might play audio in
            # a client or forward it to another service, e.g. a telephony provider.
            player.stdin.write(response.audio)
        elif response.type == "done":
            break

player.stdin.close()
player.wait()

Run the quickstart

python3 realtime-tts.py

This will stream text inputs to Cartesia, and play the streaming audio output using ffplay. (Make sure your device volume is turned on!)
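
The ffplay flags in the script have to agree with the requested output_format, and the byte counts it prints can be converted into playback time. The sketch below illustrates both relationships. The helper names are hypothetical, the `pcm_s16le` entry is an assumed alternative encoding that the quickstart does not use, and mono audio (`channels=1`) is an assumption as well.

```python
# Maps an output encoding to (ffplay -f name, bytes per sample).
# "pcm_f32le" comes from the quickstart; "pcm_s16le" is an assumed alternative.
PCM_FORMATS = {
    "pcm_f32le": ("f32le", 4),
    "pcm_s16le": ("s16le", 2),
}


def ffplay_format_args(output_format):
    """Build the ffplay -f/-ar flags that match an output_format dict."""
    ffplay_name, _ = PCM_FORMATS[output_format["encoding"]]
    return ["-f", ffplay_name, "-ar", str(output_format["sample_rate"])]


def chunk_duration_seconds(num_bytes, output_format, channels=1):
    """Playback length of a raw PCM chunk; channels=1 (mono) is an assumption."""
    _, bytes_per_sample = PCM_FORMATS[output_format["encoding"]]
    bytes_per_second = output_format["sample_rate"] * bytes_per_sample * channels
    return num_bytes / bytes_per_second


quickstart_format = {"container": "raw", "encoding": "pcm_f32le", "sample_rate": 44100}
print(ffplay_format_args(quickstart_format))              # ['-f', 'f32le', '-ar', '44100']
print(chunk_duration_seconds(176400, quickstart_format))  # 1.0
```

With 4-byte float samples at 44100 Hz, one second of mono audio is 44100 × 4 = 176400 bytes, so a mismatch between the two settings shows up as audio played at the wrong speed or as noise.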

TypeScript

Install the SDK

npm init -y
npm pkg set type=module
npm install @cartesia/cartesia-js ws
npm install --save-dev tsx typescript @types/node

In the browser, you don’t need the ws package because the native WebSocket API is used instead. However, you will need to authenticate with ephemeral access tokens. See Authenticate Your Client Applications.

Stream text over a WebSocket

Create a file named realtime-tts.ts with the following code:

realtime-tts.ts

import Cartesia from "@cartesia/cartesia-js";
import { spawn } from "child_process";

const apiKey = process.env["CARTESIA_API_KEY"];
if (!apiKey) {
  throw new Error("Missing CARTESIA_API_KEY");
}

const client = new Cartesia({ apiKey });

console.log("Starting ffplay to play streaming audio output...");
const { stdin } = spawn("ffplay", ["-f", "f32le", "-ar", "44100", "-probesize", "32", "-analyzeduration", "0", "-nodisp", "-autoexit", "-loglevel", "quiet", "-"], {
  stdio: ["pipe", "ignore", "ignore"],
});
if (!stdin) {
  throw new Error("ffplay stdin not available");
}

console.log("Connecting to Cartesia via websockets...");
const ws = await client.tts.websocket();

const ctx = ws.context({
  model_id: "sonic-3.5",
  voice: { mode: "id", id: "f786b574-daa5-4673-aa0c-cbe3e8534c02" },
  output_format: { container: "raw", encoding: "pcm_f32le", sample_rate: 44100 },
});

console.log("Sending chunked text input...");
const transcriptChunks = ["Hi there! ", "Welcome to ", "Cartesia Sonic."];
for (const part of transcriptChunks) {
  await ctx.push({ transcript: part });
}

await ctx.no_more_inputs();

for await (const event of ctx.receive()) {
  if (event.type === "chunk" && event.audio) {
    console.log(`Received audio chunk (${event.audio.length} bytes)`);
    // Here we pipe audio to ffplay. In a production app you might play audio in
    // a client or forward it to another service, e.g. a telephony provider.
    stdin.write(event.audio);
  } else if (event.type === "done") {
    break;
  }
}
stdin.end();
ws.close();

Run the quickstart

npx tsx realtime-tts.ts

This will stream text inputs to Cartesia, and play the streaming audio output using ffplay. (Make sure your device volume is turned on!)

How it works

The WebSocket connection manages multiple contexts, each representing an independent, continuous stream of speech. A Cartesia context is similar to an LLM context: our servers store the previously generated speech so that new speech matches it in tone. To summarize, here’s what our code does after establishing a WebSocket connection:

- Create a context with context().
- Push text incrementally with push(). Each call sends the chunk with continue: true, telling the model that more text will follow. See continuations for details.
- Signal completion with no_more_inputs(), which sends continue: false to tell the model that no more text is coming.
- Receive audio chunks with receive() as they are generated.
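
To make the sequence above concrete, here is a rough sketch of the messages one context might produce, built as plain JSON without opening a connection. The SDK does this for you. Only the `continue` flag and the `model_id`, `voice`, `output_format` and `transcript` names appear in this guide; the `context_id` field, the empty final transcript and the helper `build_messages` are assumptions for illustration, not a description of the exact wire format.

```python
import json
import uuid


def build_messages(chunks, model_id, voice, output_format, context_id=None):
    """Return the JSON messages for one context, in the order they would be sent."""
    # Assumption: each context is identified by a caller-chosen id.
    context_id = context_id or str(uuid.uuid4())
    base = {
        "context_id": context_id,
        "model_id": model_id,
        "voice": voice,
        "output_format": output_format,
    }
    # One message per push(): more text will follow, so continue is true.
    messages = [{**base, "transcript": chunk, "continue": True} for chunk in chunks]
    # no_more_inputs(): assumed to be an empty transcript with continue set to false.
    messages.append({**base, "transcript": "", "continue": False})
    return [json.dumps(message) for message in messages]


for message in build_messages(
    ["Hi there! ", "Welcome to ", "Cartesia Sonic."],
    model_id="sonic-3.5",
    voice={"mode": "id", "id": "f786b574-daa5-4673-aa0c-cbe3e8534c02"},
    output_format={"container": "raw", "encoding": "pcm_f32le", "sample_rate": 44100},
    context_id="demo-context",
):
    print(message)
```

Three pushes followed by no_more_inputs() yield four messages, and only the last one carries continue set to false. Because every message names its context, several contexts can share one connection.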

This maps directly to LLM token streaming: push each token or sentence fragment as it arrives, and audio begins streaming back before the full text is available.
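
If you would rather push sentence fragments than individual tokens, one option is to buffer the token stream and flush at sentence boundaries. The generator below is a minimal sketch of that idea. `sentence_fragments` is a hypothetical helper, and treating ".", "!" or "?" followed by whitespace as a boundary is a simplifying assumption that will misfire on abbreviations.

```python
import re

# A fragment ends at sentence punctuation followed by whitespace.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def sentence_fragments(tokens):
    """Group streamed tokens into fragments that end at sentence boundaries."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _BOUNDARY.split(buffer)
        # Every part except the last is a complete sentence; restore the space
        # that the split consumed so consecutive fragments join cleanly.
        for sentence in parts[:-1]:
            yield sentence + " "
        buffer = parts[-1]
    # Flush whatever remains once the token stream ends.
    if buffer:
        yield buffer


llm_tokens = ["Hi", " there", "!", " Wel", "come to", " Cartesia", " Sonic", "."]
print(list(sentence_fragments(llm_tokens)))
# ['Hi there! ', 'Welcome to Cartesia Sonic.']
```

In the Python quickstart, the loop over the hard-coded list would become `for fragment in sentence_fragments(llm_tokens): ctx.push(fragment)`, followed by `ctx.no_more_inputs()` as before.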

What’s next

Stream inputs using continuations

Deep dive into context management and buffering.

Choose a Voice

Browse voices and learn how to pick the right one for your use case.

Choosing TTS parameters

Pick the right output format, sample rate, and encoding for your use case.
