The Agents WebSocket provides real-time, bidirectional communication between web clients and Cartesia voice agents. It enables streaming audio input and real-time agent responses for browser-based or custom applications.

Connection

Connect to the WebSocket endpoint:
wss://api.cartesia.ai/agents/stream/{agent_id}
Headers:
  • Authorization: Bearer {token}
  • Cartesia-Version: 2025-04-16
The token must be an Access Token created from the /access-token endpoint. To learn more, see Authenticating Client Apps.
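As a minimal sketch of the connection parameters above, the helper below assembles the endpoint URL and the two headers. The function name build_connection and the sample agent ID and token are hypothetical. How the headers are passed at connect time depends on your WebSocket client library and is not shown here.

```python
from urllib.parse import quote

BASE_URL = "wss://api.cartesia.ai/agents/stream"
API_VERSION = "2025-04-16"


def build_connection(agent_id, access_token):
    """Return (url, headers) for the opening WebSocket handshake."""
    if not agent_id or not access_token:
        raise ValueError("agent_id and access_token are required")
    # Percent-encode the agent ID so it is safe as a single path segment.
    url = f"{BASE_URL}/{quote(agent_id, safe='')}"
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Cartesia-Version": API_VERSION,
    }
    return url, headers


# Placeholder values; a real token comes from the /access-token endpoint.
url, headers = build_connection("agent_123", "example-access-token")
```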

Protocol Overview

The WebSocket connection uses JSON messages for control events and base64-encoded audio for media. The lifecycle follows this sequence:
  1. Client → Server: Send a start event to initialize the stream.
  2. Server → Client: Receive an ack event confirming configuration.
  3. Bidirectional exchange: The client and server exchange streaming audio and control events until either side closes the connection or the inactivity timeout fires.
  4. Close: Either side ends the session with a standard WebSocket close frame.
If the client doesn’t provide a stream_id in the initial start event, the server generates one and returns it in the ack response.
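The first two steps of the lifecycle can be sketched as follows. This assumes a connection object with async send() and recv() methods, as asyncio WebSocket clients commonly provide. The open_stream helper and the FakeSocket stand-in are hypothetical and exist only so the sketch runs without a network connection.

```python
import asyncio
import json


async def open_stream(ws, input_format="pcm_44100", stream_id=None):
    """Send the start event, wait for the ack, and return the stream_id."""
    start = {"event": "start", "config": {"input_format": input_format}}
    if stream_id is not None:
        start["stream_id"] = stream_id
    await ws.send(json.dumps(start))

    ack = json.loads(await ws.recv())
    if ack.get("event") != "ack":
        raise RuntimeError(f"expected ack, got {ack.get('event')!r}")
    # The ack carries the stream_id, whether client-supplied or server-generated.
    return ack["stream_id"]


class FakeSocket:
    """Offline stand-in that answers a start event with an ack."""

    def __init__(self):
        self.sent = []

    async def send(self, message):
        self.sent.append(message)

    async def recv(self):
        start = json.loads(self.sent[-1])
        return json.dumps({
            "event": "ack",
            "stream_id": start.get("stream_id", "server_generated_id"),
            "config": start["config"],
        })


fake = FakeSocket()
assigned_id = asyncio.run(open_stream(fake))
```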

Client events

Start Event

Initializes the audio stream configuration.
  • The config parameter optionally overrides the input audio settings from your default agent configuration.
  • Set stream_id yourself if you want to track the stream on the client side for observability. If you omit it, the server generates one and returns it in the ack event.
This must be the first message sent.
{
  "event": "start",
  "stream_id": "unique_id",
  "config": {
    "input_format": "pcm_44100"
  },
  "metadata": {
    "to": "user@example.com",
    "from": "+1234567890"
  }
}
Fields:
  • stream_id (optional): Stream identifier. If not provided, the server generates one
  • config.input_format: Audio format for client audio input (mulaw_8000, pcm_16000, pcm_24000, pcm_44100)
  • metadata (optional): Custom metadata object. Its contents are passed through to the user code. Some fields have special meaning:
    • to (optional): Destination identifier for call routing (defaults to agent ID)
    • from (optional): Source identifier for the call (defaults to “websocket”)
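One way to guard against typos in the start event is to validate the format on the client before sending. The sketch below is illustrative only: build_start_event is a hypothetical helper, and the set of formats is copied from the field list above.

```python
import json

SUPPORTED_INPUT_FORMATS = {"mulaw_8000", "pcm_16000", "pcm_24000", "pcm_44100"}


def build_start_event(input_format=None, stream_id=None, metadata=None):
    """Serialize a start event, including only the fields that were given."""
    if input_format is not None and input_format not in SUPPORTED_INPUT_FORMATS:
        raise ValueError(f"unsupported input_format: {input_format!r}")
    event = {"event": "start"}
    if stream_id is not None:
        event["stream_id"] = stream_id
    if input_format is not None:
        event["config"] = {"input_format": input_format}
    if metadata is not None:
        event["metadata"] = dict(metadata)
    return json.dumps(event)


example_start = build_start_event(
    "pcm_44100",
    stream_id="unique_id",
    metadata={"to": "user@example.com", "from": "+1234567890"},
)
```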

Media Input Event

Audio data sent from the client to the server. The payload must be base64-encoded.
{
  "event": "media_input",
  "stream_id": "unique_id",
  "media": {
    "payload": "base64_encoded_audio_data"
  }
}
Fields:
  • stream_id: Stream identifier returned in the ack response
  • media.payload: Base64-encoded audio data in the format specified in the start event
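A sketch of turning raw audio bytes into media_input events might look like this. The generator name and the 4096-byte chunk size are assumptions made for illustration. This reference does not specify a required chunk size, so pick one that suits your capture pipeline.

```python
import base64
import json


def media_input_events(stream_id, audio, chunk_bytes=4096):
    """Yield one serialized media_input event per chunk of raw audio bytes."""
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    for offset in range(0, len(audio), chunk_bytes):
        chunk = audio[offset:offset + chunk_bytes]
        yield json.dumps({
            "event": "media_input",
            "stream_id": stream_id,
            # JSON cannot carry raw bytes, so the payload is base64 text.
            "media": {"payload": base64.b64encode(chunk).decode("ascii")},
        })


# 10240 bytes of dummy data standing in for captured audio.
dummy_audio = bytes(range(256)) * 40
events = list(media_input_events("unique_id", dummy_audio))
```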

DTMF Event

Sends DTMF (dual-tone multi-frequency) tones.
{
  "event": "dtmf",
  "stream_id": "example_id",
  "dtmf": "1"
}
Fields:
  • stream_id: Stream identifier
  • dtmf: DTMF digit (0-9, *, #)
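For multi-digit input such as a PIN, one option is to emit one dtmf event per digit. That is an assumption based on the single-digit example above, not a documented requirement, and build_dtmf_events is a hypothetical helper.

```python
import json

DTMF_DIGITS = frozenset("0123456789*#")


def build_dtmf_events(stream_id, digits):
    """Return one serialized dtmf event per digit, rejecting invalid digits."""
    events = []
    for digit in digits:
        if digit not in DTMF_DIGITS:
            raise ValueError(f"unsupported DTMF digit: {digit!r}")
        events.append(json.dumps({
            "event": "dtmf",
            "stream_id": stream_id,
            "dtmf": digit,
        }))
    return events


pin_events = build_dtmf_events("example_id", "42#")
```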

Custom Event

Sends custom metadata to the agent.
{
  "event": "custom",
  "stream_id": "example_id",
  "metadata": {
    "user_id": "user123",
    "session_info": "custom_data"
  }
}
Fields:
  • stream_id: Stream identifier
  • metadata: Object containing key-value pairs of custom data

Server events

Ack Event

Server acknowledgment of the start event, confirming stream configuration. If stream_id wasn’t provided in the initial start event, the ack is where the client obtains the server-generated stream_id.
{
  "event": "ack",
  "stream_id": "example_id",
  "config": {
    "input_format": "pcm_44100"
  }
}

Media Output Event

The server sends the agent’s audio response. The payload is base64-encoded audio data.
{
  "event": "media_output",
  "stream_id": "example_id",
  "media": {
    "payload": "base64_encoded_audio_data"
  }
}

Clear Event

Indicates the agent wants to clear/interrupt the current audio stream.
{
  "event": "clear",
  "stream_id": "example_id"
}

DTMF Event

Server sends DTMF tones from the agent.
{
  "event": "dtmf",
  "stream_id": "example_id",
  "dtmf": "5"
}

Custom Event

Server sends custom metadata from the agent.
{
  "event": "custom",
  "stream_id": "example_id",
  "metadata": {
    "agent_state": "processing",
    "confidence": 0.95,
    "custom_data": "value"
  }
}
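The server events above could be routed on the client with a small dispatcher like the one sketched here. ClientState and handle_server_message are hypothetical names. Treating clear as "drop audio that is queued but not yet played" is one interpretation of the event, not something this reference states.

```python
import base64
import json
from collections import deque


class ClientState:
    """Minimal client-side state updated by incoming server events."""

    def __init__(self):
        self.stream_id = None
        self.audio_queue = deque()   # decoded audio chunks awaiting playback
        self.dtmf_digits = []
        self.custom_metadata = []


def handle_server_message(state, raw):
    """Apply one JSON message from the server and return its event name."""
    msg = json.loads(raw)
    event = msg.get("event")
    if event == "ack":
        state.stream_id = msg["stream_id"]
    elif event == "media_output":
        state.audio_queue.append(base64.b64decode(msg["media"]["payload"]))
    elif event == "clear":
        # Assumption: an interruption means queued audio should not be played.
        state.audio_queue.clear()
    elif event == "dtmf":
        state.dtmf_digits.append(msg["dtmf"])
    elif event == "custom":
        state.custom_metadata.append(msg.get("metadata", {}))
    return event


state = ClientState()
handle_server_message(state, '{"event": "ack", "stream_id": "example_id", "config": {"input_format": "pcm_44100"}}')
```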

Connection Management

Inactivity Timeout

The server automatically closes idle WebSocket connections after 30 seconds of inactivity. Activity is defined as receiving any message from the client, including:
  • Application messages (media_input, dtmf, custom events)
  • Standard WebSocket ping frames
  • Any other valid WebSocket message
When the timeout occurs, the connection is closed with:
  • Code: 1000 (Normal Closure)
  • Reason: "connection idle timeout"

Ping/Pong Keepalive

To prevent inactivity timeouts during periods of silence, use standard WebSocket ping frames for periodic keepalive:
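# Python example (for instance with the websockets library): the client sends a ping to reset the inactivity timer
pong_waiter = await websocket.ping()
latency = await pong_waiter
// JavaScript example (Node.js ws library; the browser WebSocket API does not expose ping())
setInterval(() => {
  if (websocket.readyState === WebSocket.OPEN) {
    websocket.ping();
  }
}, 20000); // Send a ping every 20 seconds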
The server automatically responds to ping frames with pong frames and resets the inactivity timer upon receiving any message.
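A background keepalive task could be sketched as follows, assuming a connection object whose ping() coroutine returns an awaitable that resolves when the pong arrives (the same shape as the Python snippet above). The keepalive function, the stop event, and FakePingSocket are hypothetical. The demo uses a very short interval so it finishes quickly.

```python
import asyncio


async def keepalive(ws, stop, interval_s=20.0):
    """Ping every interval_s seconds until stop is set; return the ping count."""
    sent = 0
    while not stop.is_set():
        try:
            # Sleep for one interval, but wake early if stop is set.
            await asyncio.wait_for(stop.wait(), timeout=interval_s)
        except asyncio.TimeoutError:
            pong_waiter = await ws.ping()
            await pong_waiter
            sent += 1
    return sent


class FakePingSocket:
    """Offline stand-in whose pings are answered immediately."""

    def __init__(self):
        self.pings = 0

    async def ping(self):
        self.pings += 1
        pong = asyncio.get_running_loop().create_future()
        pong.set_result(0.0)
        return pong


async def demo():
    ws = FakePingSocket()
    stop = asyncio.Event()
    task = asyncio.create_task(keepalive(ws, stop, interval_s=0.01))
    await asyncio.sleep(0.1)   # let a few intervals elapse
    stop.set()
    return await task, ws.pings


pings_reported, pings_seen = asyncio.run(demo())
```

In a real client the task would run alongside the audio send and receive loops and be stopped when the session ends.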

Connection Close

The connection can be closed by either the client or server using WebSocket close frames.
Client-initiated close:
await websocket.close(code=1000, reason="session completed")
Server-initiated close: When the agent ends the call, the server closes the connection with:
  • Code: 1000 (Normal Closure)
  • Reason: "call ended by agent" or "call ended by agent, reason: {specific_reason}" if additional context is available
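Because both the idle timeout and an agent hang-up use code 1000, the reason string is what tells them apart. A minimal classifier, using only the reason strings listed in this document, might look like the sketch below. The classify_close name and its return labels are hypothetical.

```python
IDLE_TIMEOUT_REASON = "connection idle timeout"
AGENT_ENDED_REASON = "call ended by agent"
AGENT_REASON_MARKER = AGENT_ENDED_REASON + ", reason: "


def classify_close(code, reason):
    """Map a close code and reason to a (kind, detail) pair."""
    if code == 1000 and reason == IDLE_TIMEOUT_REASON:
        return ("idle_timeout", None)
    if code == 1000 and reason == AGENT_ENDED_REASON:
        return ("agent_ended", None)
    if code == 1000 and reason.startswith(AGENT_REASON_MARKER):
        # Keep the agent-supplied context after the marker.
        return ("agent_ended", reason[len(AGENT_REASON_MARKER):])
    return ("other", reason or None)


kind, detail = classify_close(1000, "call ended by agent, reason: user said goodbye")
```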

Best Practices

  1. Send start first — The connection closes if any other event is sent before start.
  2. Choose the right audio format — Match the format to your source: mulaw_8000 for telephony, pcm_44100 for web clients.
  3. Handle closes cleanly — Always capture close codes and reasons for debugging and recovery.
  4. Keep the connection alive — Send WebSocket ping frames every 20–25 seconds to avoid the 30-second inactivity timeout.
  5. Manage stream IDs — Provide your own stream_id values to improve observability across systems.
  6. Recover from idle timeouts — On 1000 / connection idle timeout, reconnect and resend a start event.
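Practices 1 and 6 can be combined in a small reconnect loop, sketched here under assumptions: connect is a caller-supplied coroutine that opens a new connection, and run_session is a caller-supplied coroutine that runs until the connection closes and returns the close code and reason. Both names, and the retry limit, are hypothetical.

```python
import asyncio
import json

IDLE_TIMEOUT_REASON = "connection idle timeout"


async def run_with_reconnect(connect, run_session, start_event, max_attempts=3):
    """Reconnect after idle timeouts; return (attempts, code, reason)."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    payload = json.dumps(start_event)
    for attempt in range(1, max_attempts + 1):
        ws = await connect()
        await ws.send(payload)   # start is the first message on every connection
        code, reason = await run_session(ws)
        if (code, reason) != (1000, IDLE_TIMEOUT_REASON):
            break                # any other close ends the loop
    return attempt, code, reason


# Offline demo: the first session idles out, the second is ended by the agent.
class FakeSocket:
    def __init__(self):
        self.sent = []

    async def send(self, message):
        self.sent.append(message)


sockets = []
scripted_closes = [(1000, "connection idle timeout"), (1000, "call ended by agent")]


async def fake_connect():
    sockets.append(FakeSocket())
    return sockets[-1]


async def fake_session(ws):
    return scripted_closes[len(sockets) - 1]


result = asyncio.run(run_with_reconnect(
    fake_connect,
    fake_session,
    {"event": "start", "config": {"input_format": "pcm_44100"}},
))
```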