Stream audio between your application and your voice agent via WebSocket. Use this for web apps, mobile apps, or to bridge your own telephony provider.

Quick start

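// The `headers` option is not part of the browser WebSocket API;
// this example assumes the Node.js `ws` library.
import WebSocket from "ws";

// agentId, accessToken, and playAudio are supplied by your application.
const ws = new WebSocket(
  `wss://api.cartesia.ai/agents/stream/${agentId}`,
  {
    headers: {
      Authorization: `Bearer ${accessToken}`,
      "Cartesia-Version": "2025-04-16",
    },
  }
);

let streamId; // set from the server's ack event

// Initialize the stream
ws.onopen = () => {
  ws.send(JSON.stringify({
    event: "start",
    config: { input_format: "pcm_44100" },
  }));
};

// Handle server events
ws.onmessage = (msg) => {
  const data = JSON.parse(msg.data);
  if (data.event === "ack") {
    streamId = data.stream_id; // needed for every media_input event
  } else if (data.event === "media_output") {
    playAudio(Buffer.from(data.media.payload, "base64")); // decoded audio bytes
  }
};

// Send user audio (audioData is a Buffer or Uint8Array of raw audio bytes)
function sendAudio(audioData) {
  if (streamId === undefined) return; // wait for ack before sending audio
  ws.send(JSON.stringify({
    event: "media_input",
    stream_id: streamId,
    media: { payload: Buffer.from(audioData).toString("base64") },
  }));
}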
Get an access token from the /access-token endpoint. See Authenticating Client Apps for details.

Connection

Connect to the WebSocket endpoint:
wss://api.cartesia.ai/agents/stream/{agent_id}
Headers:
  • Authorization: Bearer {token}
  • Cartesia-Version: 2025-04-16

Protocol Overview

The WebSocket connection uses JSON messages for control events and base64-encoded audio for media. The client sends a start event, the server responds with ack, then both sides exchange audio and control events until the connection closes.
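The ordering described above (start, then ack, then media) can be enforced on the client with a small state tracker. The following is a minimal sketch, not part of any Cartesia SDK: the `StreamSession` class and its method names are hypothetical, and waiting for ack before sending audio is an assumption based on media_input needing the stream_id from the ack event.

```python
import json


class StreamSession:
    """Tracks start -> ack -> media ordering for one connection (hypothetical helper)."""

    def __init__(self):
        self.state = "new"  # "new" -> "starting" -> "ready"
        self.stream_id = None

    def start_message(self, input_format):
        # start must be the first message, and is sent only once.
        if self.state != "new":
            raise RuntimeError("start has already been sent")
        self.state = "starting"
        return json.dumps({"event": "start", "config": {"input_format": input_format}})

    def on_server_message(self, raw):
        # Record the stream_id once the server acknowledges the start event.
        msg = json.loads(raw)
        if msg.get("event") == "ack" and self.state == "starting":
            self.stream_id = msg["stream_id"]
            self.state = "ready"
        return msg

    def media_message(self, payload_b64):
        # Assumption: hold audio until ack arrives, so stream_id is known.
        if self.state != "ready":
            raise RuntimeError("wait for ack before sending media_input")
        return json.dumps({
            "event": "media_input",
            "stream_id": self.stream_id,
            "media": {"payload": payload_b64},
        })
```

A caller would send `start_message(...)` when the socket opens, pass every incoming text frame to `on_server_message`, and only then call `media_message`.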

Client events

Start Event

Initializes the audio stream configuration.
  • config overrides your agent’s default input audio settings
  • stream_id is optional. If not provided, the server generates one and returns it in the ack event
This must be the first message sent.
{
  "event": "start",
  "stream_id": "unique_id",
  "config": {
    "input_format": "pcm_44100",
    "voice_id": "a0e99841-438c-4a64-b679-ae501e7d6091"
  },
  "agent": {
    "introduction": "Hello, I'm an AI assistant",
    "system_prompt": "### Your Role \n You are a helpful assistant"
  },
  "metadata": {
    "to": "user@example.com",
    "from": "+1234567890"
  }
}
Fields:
  • stream_id (optional): Stream identifier. If not provided, server generates one
  • config.input_format: Audio format for client audio input (mulaw_8000, pcm_16000, pcm_24000, pcm_44100)
  • config.voice_id (optional): Override the agent’s default TTS voice
  • agent (optional): Per-call overrides for the agent’s introduction and system_prompt. Use it to configure individual calls via the API or to preview changes without publishing to production
  • metadata (optional): Custom metadata object, passed through to the agent code. Two fields have special meaning:
    • to (optional): Destination identifier for call routing (defaults to agent ID)
    • from (optional): Source identifier for the call (defaults to “websocket”)
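As an illustration of the fields above, here is a minimal sketch of a helper that assembles a start event and rejects formats not listed for config.input_format. The function name `build_start_event` and its parameters are hypothetical; only the JSON field names and the four format strings come from this page.

```python
import json

# Formats listed for config.input_format in the field reference.
SUPPORTED_INPUT_FORMATS = {"mulaw_8000", "pcm_16000", "pcm_24000", "pcm_44100"}


def build_start_event(input_format, stream_id=None, voice_id=None, agent=None, metadata=None):
    """Return the JSON text of a start event, omitting optional fields left unset."""
    if input_format not in SUPPORTED_INPUT_FORMATS:
        raise ValueError(f"unsupported input_format: {input_format!r}")
    config = {"input_format": input_format}
    if voice_id is not None:
        config["voice_id"] = voice_id
    event = {"event": "start", "config": config}
    if stream_id is not None:
        event["stream_id"] = stream_id  # otherwise the server generates one
    if agent:
        event["agent"] = agent          # e.g. {"introduction": ..., "system_prompt": ...}
    if metadata:
        event["metadata"] = metadata    # may include the special "to" and "from" keys
    return json.dumps(event)
```

For a telephony bridge, a call might look like `build_start_event("mulaw_8000", metadata={"from": "+1234567890"})`.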

Media Input Event

Audio data sent from the client to the server. The payload must be base64-encoded.
{
  "event": "media_input",
  "stream_id": "unique_id",
  "media": {
    "payload": "base64_encoded_audio_data"
  }
}
Fields:
  • stream_id: Stream identifier, as returned in the ack event
  • media.payload: Base64-encoded audio data in the format specified in the start event
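A minimal sketch of turning raw audio bytes into media_input messages follows. The helper name `media_input_events` and the default `chunk_size` of 3200 bytes are assumptions made for illustration; this page does not specify a chunk size.

```python
import base64
import json


def media_input_events(audio: bytes, stream_id: str, chunk_size: int = 3200):
    """Yield one media_input JSON message per chunk of raw audio bytes."""
    for offset in range(0, len(audio), chunk_size):
        chunk = audio[offset:offset + chunk_size]
        yield json.dumps({
            "event": "media_input",
            "stream_id": stream_id,
            # Base64 output is ASCII, so it can be embedded directly in JSON.
            "media": {"payload": base64.b64encode(chunk).decode("ascii")},
        })
```

Each yielded string can be sent as one WebSocket text message. The audio bytes must already be in the format declared in the start event.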

DTMF Event

Sends DTMF (dual-tone multi-frequency) tones.
{
  "event": "dtmf",
  "stream_id": "example_id",
  "dtmf": "1"
}
Fields:
  • stream_id: Stream identifier
  • dtmf: DTMF digit (0-9, *, #)
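To show how a keypad sequence might map onto this event, here is a minimal sketch that validates digits and emits one dtmf message per digit. The helper name `dtmf_events` is hypothetical, and sending one digit per event is an assumption based on the single-digit example above.

```python
import json

# Digits accepted by the dtmf field: 0-9, *, #.
VALID_DTMF = set("0123456789*#")


def dtmf_events(digits: str, stream_id: str):
    """Return one dtmf JSON message per digit, rejecting anything outside 0-9, *, #."""
    invalid = [d for d in digits if d not in VALID_DTMF]
    if invalid:
        raise ValueError(f"invalid DTMF digits: {invalid}")
    return [
        json.dumps({"event": "dtmf", "stream_id": stream_id, "dtmf": d})
        for d in digits
    ]
```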

Custom Event

Sends custom metadata to the agent.
{
  "event": "custom",
  "stream_id": "example_id",
  "metadata": {
    "user_id": "user123",
    "session_info": "custom_data"
  }
}
Fields:
  • stream_id: Stream identifier
  • metadata: Object containing key-value pairs of custom data

Server events

Ack Event

Confirms stream configuration. Returns the server-generated stream_id if one wasn’t provided in the start event.
{
  "event": "ack",
  "stream_id": "example_id",
  "config": {
    "input_format": "pcm_44100",
    "voice_id": "a0e99841-438c-4a64-b679-ae501e7d6091"
  },
  "agent": {
    "system_prompt": "### Your Role \n You are a helpful assistant",
    "introduction": "Hello, I'm an AI assistant"
  }
}

Media Output Event

Agent audio sent from the server to the client. payload is base64-encoded audio data.
{
  "event": "media_output",
  "stream_id": "example_id",
  "media": {
    "payload": "base64_encoded_audio_data"
  }
}

Clear Event

Indicates the agent wants to clear/interrupt the current audio stream.
{
  "event": "clear",
  "stream_id": "example_id"
}
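The three server events can be handled by a single dispatcher. The following is a minimal sketch under stated assumptions: the `ServerEventHandler` class and its playback queue are hypothetical, and treating clear as "drop any agent audio that has not been played yet" is one plausible reading of the interrupt semantics, not confirmed behavior.

```python
import base64
import json
from collections import deque


class ServerEventHandler:
    """Routes ack, media_output, and clear events (hypothetical helper)."""

    def __init__(self):
        self.stream_id = None
        self.playback_queue = deque()  # decoded audio chunks awaiting playback

    def handle(self, raw):
        msg = json.loads(raw)
        event = msg.get("event")
        if event == "ack":
            # Keep the stream_id for later media_input, dtmf, and custom events.
            self.stream_id = msg["stream_id"]
        elif event == "media_output":
            self.playback_queue.append(base64.b64decode(msg["media"]["payload"]))
        elif event == "clear":
            # Assumption: interrupting means discarding audio not yet played.
            self.playback_queue.clear()
        return event
```

An audio player would pop chunks from `playback_queue`; after a clear event the queue is empty, so the interrupted response stops once the current chunk finishes.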

Connection Management

Inactivity Timeout

The server closes idle connections after 180 seconds. Any client message resets the timer:
  • Application messages (media_input, dtmf, custom events)
  • Standard WebSocket ping frames
  • Any other valid WebSocket message
When the timeout occurs, the connection is closed with:
  • Code: 1000 (Normal Closure)
  • Reason: "connection idle timeout"

Ping/Pong Keepalive

To prevent inactivity timeouts during periods of silence, use standard WebSocket ping frames for periodic keepalive.
Python:
# Client sends ping to reset inactivity timer
pong_waiter = await websocket.ping()
latency = await pong_waiter
JavaScript:
// Requires the Node.js `ws` library; the browser WebSocket API does not expose ping()
setInterval(() => {
  if (websocket.readyState === WebSocket.OPEN) {
    websocket.ping();
  }
}, 60000); // Send ping every 60 seconds
The server automatically responds to ping frames with pong frames and resets the inactivity timer upon receiving any message.
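For Python clients, the single ping shown earlier can be wrapped in a background task. This is a minimal sketch: the `keepalive` function is hypothetical, and it assumes `websocket` is any connection object whose `ping()` coroutine returns an awaitable that resolves when the pong arrives, as in the Python snippet above.

```python
import asyncio


async def keepalive(websocket, interval: float = 60.0):
    """Send a ping every `interval` seconds until the task is cancelled."""
    while True:
        await asyncio.sleep(interval)
        pong_waiter = await websocket.ping()  # resets the server's inactivity timer
        await pong_waiter                     # resolves once the pong is received
```

Start it with `task = asyncio.create_task(keepalive(websocket))` after connecting, and call `task.cancel()` when the session ends. An interval of 60 seconds stays well inside the 180-second timeout.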

Connection Close

The connection can be closed by either the client or the server using WebSocket close frames.
Client-initiated close:
await websocket.close(code=1000, reason="session completed")
Server-initiated close: When the agent ends the call, the server closes the connection with:
  • Code: 1000 (Normal Closure)
  • Reason: "call ended by agent" or "call ended by agent, reason: {specific_reason}" if additional context is available
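Because idle timeouts and agent-initiated closes both use code 1000, a client has to inspect the reason string to tell them apart. The following is a minimal sketch; the function name `classify_close` and the returned labels are hypothetical, while the reason strings are the ones documented on this page.

```python
AGENT_PREFIX = "call ended by agent"
IDLE_REASON = "connection idle timeout"


def classify_close(code: int, reason: str):
    """Map a close frame to a (kind, detail) pair."""
    if code == 1000 and reason == IDLE_REASON:
        return ("idle_timeout", None)
    if code == 1000 and reason.startswith(AGENT_PREFIX):
        # The optional suffix has the form ", reason: {specific_reason}".
        _, sep, detail = reason.partition(", reason: ")
        return ("agent_ended", detail if sep else None)
    return ("other", reason or None)
```

A client might reconnect and send a new start event on `idle_timeout`, and end the session normally on `agent_ended`.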

Best Practices

  1. Send start first: the connection closes if any other event is sent before start.
  2. Choose the right audio format: match the format to your source, such as mulaw_8000 for telephony or pcm_44100 for web clients.
  3. Handle closes cleanly: always capture close codes and reasons for debugging and recovery.
  4. Keep the connection alive: send WebSocket ping frames every 60 to 90 seconds to avoid the 180-second inactivity timeout.
  5. Manage stream IDs: provide your own stream_id values to improve observability across systems.
  6. Recover from idle timeouts: if the connection closes with code 1000 and reason "connection idle timeout", reconnect and send a new start event. Agent-initiated closes also use code 1000, so check the reason string.