Calls API - Cartesia Docs

Stream audio between your application and your voice agent via WebSocket. Use this for web apps, mobile apps, or to bridge your own telephony provider.

Quick start

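// Runs under Node.js with the `ws` package; the browser WebSocket
// constructor does not accept custom headers.
import WebSocket from "ws";

const ws = new WebSocket(
  `wss://api.cartesia.ai/agents/stream/${agentId}`,
  {
    headers: {
      Authorization: `Bearer ${accessToken}`,
      "Cartesia-Version": "2025-04-16",
    },
  }
);

let streamId; // Set when the server's ack event arrives

// Initialize the stream
ws.onopen = () => {
  ws.send(JSON.stringify({
    event: "start",
    config: { input_format: "pcm_44100" },
  }));
};

// Handle server events
ws.onmessage = (msg) => {
  const data = JSON.parse(msg.data);
  if (data.event === "ack") {
    streamId = data.stream_id;
  } else if (data.event === "media_output") {
    // playAudio is a placeholder for your own playback code
    playAudio(Buffer.from(data.media.payload, "base64"));
  }
};

// Send user audio (audioData is a Buffer of raw audio in the input_format above)
function sendAudio(audioData) {
  ws.send(JSON.stringify({
    event: "media_input",
    stream_id: streamId,
    media: { payload: audioData.toString("base64") },
  }));
}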

Get an access token from the /access-token endpoint. See Authenticating Client Apps for details.

Connection

Connect to the WebSocket endpoint:

wss://api.cartesia.ai/agents/stream/{agent_id}

Headers:

Header	Value
`Authorization`	`Bearer {token}`
`Cartesia-Version`	`2025-04-16`

Protocol Overview

The WebSocket connection uses JSON messages for control events and base64-encoded audio for media. The client sends a start event, the server responds with ack, then both sides exchange audio and control events until the connection closes.

Client events

Start Event

Initializes the audio stream configuration.

config overrides your agent’s default input audio settings.
stream_id is optional. If not provided, the server generates one and returns it in the ack event.

This must be the first message sent.

{
  "event": "start",
  "stream_id": "unique_id",
  "config": {
    "input_format": "pcm_44100",
    "voice_id": "a0e99841-438c-4a64-b679-ae501e7d6091"
  },
  "agent": {
    "introduction": "Hello, I'm an AI assistant",
    "system_prompt": "### Your Role \n You are a helpful assistant"
  },
  "metadata": {
    "to": "user@example.com",
    "from": "+1234567890"
  }
}

Fields:

stream_id (optional): Stream identifier. If not provided, the server generates one.
config.input_format: Format of the audio the client sends (mulaw_8000, pcm_16000, pcm_24000, or pcm_44100).
config.voice_id (optional): Overrides the agent’s default TTS voice.
agent (optional): Per-call overrides for the agent’s introduction and system_prompt. Use it to configure individual calls via the API, or to preview changes without publishing to production.
metadata (optional): Custom metadata object that is passed through to the agent code. Two keys have special meaning:
- to (optional): Destination identifier for call routing (defaults to the agent ID)
- from (optional): Source identifier for the call (defaults to “websocket”)
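
As an illustration of the fields above, here is a minimal Python sketch that assembles a start event. The helper name build_start_event and the choice to require input_format are assumptions made for this example, not part of the API; only the field names and format values come from the list above.

```python
import json

# Formats listed in the Fields section above.
SUPPORTED_INPUT_FORMATS = {"mulaw_8000", "pcm_16000", "pcm_24000", "pcm_44100"}


def build_start_event(input_format, stream_id=None, voice_id=None,
                      agent=None, metadata=None):
    """Return the JSON text of a start event (hypothetical helper)."""
    if input_format not in SUPPORTED_INPUT_FORMATS:
        raise ValueError(f"unsupported input_format: {input_format}")
    config = {"input_format": input_format}
    if voice_id is not None:
        config["voice_id"] = voice_id
    event = {"event": "start", "config": config}
    # Optional fields are left out entirely rather than sent as null.
    if stream_id is not None:
        event["stream_id"] = stream_id
    if agent is not None:
        event["agent"] = agent
    if metadata is not None:
        event["metadata"] = metadata
    return json.dumps(event)
```

Calling build_start_event("mulaw_8000", metadata={"from": "+1234567890"}) would produce a start message for a telephony bridge that lets the server generate the stream_id.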

Media Input Event

Audio data sent from the client to the server. The payload must be base64-encoded.

{
  "event": "media_input",
  "stream_id": "unique_id",
  "media": {
    "payload": "base64_encoded_audio_data"
  }
}

Fields:

stream_id: Stream identifier returned in the ack event
media.payload: Base64-encoded audio data in the format specified in the start event
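
The following Python sketch shows one way to split raw audio into media_input messages. The function name and the chunk_size default are assumptions for illustration; this page does not state a required or maximum chunk size.

```python
import base64
import json


def media_input_events(audio: bytes, stream_id: str, chunk_size: int = 4096):
    """Yield media_input JSON messages, one per chunk of raw audio bytes.

    chunk_size is an arbitrary illustrative value. For 16-bit PCM, keep it
    even so that a sample is never split across two messages.
    """
    for start in range(0, len(audio), chunk_size):
        chunk = audio[start:start + chunk_size]
        yield json.dumps({
            "event": "media_input",
            "stream_id": stream_id,
            # b64encode returns bytes; JSON needs a str.
            "media": {"payload": base64.b64encode(chunk).decode("ascii")},
        })
```

Each yielded string can be passed directly to the WebSocket client's send method once the ack event has supplied the stream_id.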

DTMF Event

Sends DTMF (dual-tone multi-frequency) tones.

{
  "event": "dtmf",
  "stream_id": "example_id",
  "dtmf": "1"
}

Fields:

stream_id: Stream identifier
dtmf: DTMF digit (0-9, *, #)
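
A small Python sketch for sending a sequence of keypresses follows. It assumes one digit per dtmf event, as in the example above; the helper name dtmf_events is hypothetical.

```python
import json

# Digits accepted by the dtmf field, per the Fields list above.
VALID_DTMF = set("0123456789*#")


def dtmf_events(digits: str, stream_id: str):
    """Return one dtmf JSON message per character in `digits`."""
    bad = [d for d in digits if d not in VALID_DTMF]
    if bad:
        raise ValueError(f"invalid DTMF digits: {bad}")
    return [
        json.dumps({"event": "dtmf", "stream_id": stream_id, "dtmf": d})
        for d in digits
    ]
```

Validating before building any message means a bad character rejects the whole sequence instead of sending part of it.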

Custom Event

Sends custom metadata to the agent.

{
  "event": "custom",
  "stream_id": "example_id",
  "metadata": {
    "user_id": "user123",
    "session_info": "custom_data"
  }
}

Fields:

stream_id: Stream identifier
metadata: Object containing key-value pairs of custom data

Server events

Ack Event

Confirms stream configuration. Returns the server-generated stream_id if one wasn’t provided in the start event.

{
  "event": "ack",
  "stream_id": "example_id",
  "config": {
    "input_format": "pcm_44100",
    "voice_id": "a0e99841-438c-4a64-b679-ae501e7d6091"
  },
  "agent": {
    "system_prompt": "### Your Role \n You are a helpful assistant",
    "introduction": "Hello, I'm an AI assistant"
  }
}

Media Output Event

Agent audio sent from the server to the client. The payload is base64-encoded audio data.

{
  "event": "media_output",
  "stream_id": "example_id",
  "media": {
    "payload": "base64_encoded_audio_data"
  }
}

Clear Event

Indicates the agent wants to clear/interrupt the current audio stream.

{
  "event": "clear",
  "stream_id": "example_id"
}

Transfer Call Event

Indicates the agent wants to transfer the call to a phone number. The client is responsible for initiating the transfer on its telephony side.

{
  "event": "transfer_call",
  "stream_id": "example_id",
  "transfer": {
    "target_phone_number": "+1234567890"
  }
}

Fields:

stream_id: Stream identifier
transfer.target_phone_number: E.164 phone number to transfer the call to
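
The four server events can be handled with a single dispatch function. The Python sketch below is illustrative only: StreamState and handle_server_message are hypothetical names, and treating clear as "discard audio that has not played yet" is an assumption about how a client might react.

```python
import base64
import json
from collections import deque


class StreamState:
    """Client-side state for one stream (hypothetical structure)."""

    def __init__(self):
        self.stream_id = None
        self.playback = deque()      # decoded audio chunks awaiting playback
        self.transfer_target = None  # set when the agent requests a transfer


def handle_server_message(state: StreamState, raw: str) -> str:
    """Apply one server JSON message to `state` and return its event name."""
    data = json.loads(raw)
    event = data.get("event")
    if event == "ack":
        state.stream_id = data["stream_id"]
    elif event == "media_output":
        state.playback.append(base64.b64decode(data["media"]["payload"]))
    elif event == "clear":
        # Assumption: drop agent audio that has not been played yet.
        state.playback.clear()
    elif event == "transfer_call":
        state.transfer_target = data["transfer"]["target_phone_number"]
    return event
```

Keeping the state separate from the socket makes the handler easy to exercise with recorded messages before wiring it to a live connection.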

Connection Management

Inactivity Timeout

The server closes idle connections after 180 seconds. Any client message resets the timer:

Application messages (media_input, dtmf, custom events)
Standard WebSocket ping frames
Any other valid WebSocket message

When the timeout occurs, the connection is closed with:

Code: 1000 (Normal Closure)
Reason: "connection idle timeout"

Ping/Pong Keepalive

To prevent inactivity timeouts during periods of silence, use standard WebSocket ping frames for periodic keepalive:

# Client sends ping to reset inactivity timer
pong_waiter = await websocket.ping()
latency = await pong_waiter

// Requires the Node.js `ws` library — the browser WebSocket API does not expose ping()
setInterval(() => {
  if (websocket.readyState === WebSocket.OPEN) {
    websocket.ping();
  }
}, 60000); // Send ping every 60 seconds

The server automatically responds to ping frames with pong frames and resets the inactivity timer upon receiving any message.
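
For a long-running client, the keepalive can run as a background task. The Python sketch below is library-agnostic: ping is any zero-argument coroutine function that you supply (for example, a thin wrapper around your WebSocket client's ping call), and keepalive is a hypothetical helper name.

```python
import asyncio


async def keepalive(ping, interval: float, stop: asyncio.Event) -> int:
    """Call `ping()` every `interval` seconds until `stop` is set.

    Returns the number of pings sent.
    """
    sent = 0
    while not stop.is_set():
        try:
            # Wake early if stop is set; otherwise time out and ping.
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            await ping()
            sent += 1
    return sent
```

Start it with asyncio.create_task(keepalive(ping, 60, stop)) after the connection opens, and call stop.set() before closing so the task exits promptly instead of waiting out the interval.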

Connection Close

The connection can be closed by either the client or server using WebSocket close frames. Client-initiated close:

await websocket.close(code=1000, reason="session completed")

Server-initiated close: When the agent ends the call, the server closes the connection with:

Code: 1000 (Normal Closure)
Reason: "call ended by agent" or "call ended by agent, reason: {specific_reason}" if additional context is available
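
Because both the idle timeout and an agent-ended call use code 1000, a client has to inspect the reason string to tell them apart. The Python sketch below shows one way to do that; the function name and the returned dictionary shape are assumptions for illustration.

```python
def classify_close(code: int, reason: str) -> dict:
    """Map a WebSocket close code and reason to a simple description."""
    agent_prefix = "call ended by agent"
    if code == 1000 and reason == "connection idle timeout":
        return {"kind": "idle_timeout", "detail": None}
    if code == 1000 and reason.startswith(agent_prefix):
        # Optional suffix: ", reason: {specific_reason}"
        _, sep, detail = reason.partition(", reason: ")
        return {"kind": "agent_ended", "detail": detail if sep else None}
    # Anything else, including abnormal closures, is reported as-is.
    return {"kind": "other", "detail": reason or None}
```

A client might reconnect and resend a start event on idle_timeout, and treat agent_ended as the normal end of the call.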

Best Practices

Send start first — The connection closes if any other event is sent before start.
Choose the right audio format — Match the format to your source: mulaw_8000 for telephony, pcm_44100 for web clients.
Handle closes cleanly — Always capture close codes and reasons for debugging and recovery.
Keep the connection alive — Send WebSocket ping frames every 60–90 seconds to avoid the 180-second inactivity timeout.
Manage stream IDs — Provide your own stream_id values to improve observability across systems.
Recover from idle timeouts — On 1000 / connection idle timeout, reconnect and resend a start event.
