# Calls API

Stream audio between your application and your voice agent via WebSocket. Use this for web apps, mobile apps, or to bridge your own telephony provider.

## Quick start

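```javascript
// The `headers` option is supported by the Node.js `ws` library;
// the browser WebSocket constructor does not accept custom headers.
import WebSocket from "ws";

const ws = new WebSocket(
  `wss://api.cartesia.ai/agents/stream/${agentId}`,
  {
    headers: {
      Authorization: `Bearer ${accessToken}`,
      "Cartesia-Version": "2025-04-16",
    },
  }
);

let streamId; // set from the server's ack event

// Initialize the stream
ws.onopen = () => {
  ws.send(JSON.stringify({
    event: "start",
    config: { input_format: "pcm_44100" },
  }));
};

// Handle server events
ws.onmessage = (msg) => {
  const data = JSON.parse(msg.data);
  if (data.event === "ack") {
    streamId = data.stream_id;
  } else if (data.event === "media_output") {
    // playAudio is your own playback function; it receives raw audio bytes
    playAudio(Buffer.from(data.media.payload, "base64"));
  }
};

// Send user audio (audioData is a Buffer of raw audio in the input format)
function sendAudio(audioData) {
  ws.send(JSON.stringify({
    event: "media_input",
    stream_id: streamId,
    media: { payload: audioData.toString("base64") },
  }));
}
```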

Get an access token from the `/access-token` [endpoint](/api-reference/auth/access-token#body-grants-agent). See [Authenticating Client Apps](/get-started/authenticate-your-client-applications) for details.

***

## Connection

Connect to the WebSocket endpoint:

```
wss://api.cartesia.ai/agents/stream/{agent_id}
```

**Headers:**

| Header             | Value            |
| ------------------ | ---------------- |
| `Authorization`    | `Bearer {token}` |
| `Cartesia-Version` | `2025-04-16`     |

## Protocol Overview

The WebSocket connection uses JSON messages for control events and base64-encoded audio for media.

The client sends a `start` event, the server responds with `ack`, then both sides exchange audio and control events until the connection closes.

## Client events

### Start Event

Initializes the audio stream configuration.

* `config` overrides your agent's default input audio settings
* `stream_id` is optional. If not provided, the server generates one and returns it in the `ack` event

**This must be the first message sent.**

```json
{
  "event": "start",
  "stream_id": "unique_id",
  "config": {
    "input_format": "pcm_44100",
    "voice_id": "a0e99841-438c-4a64-b679-ae501e7d6091"
  },
  "agent": {
    "introduction": "Hello, I'm an AI assistant",
    "system_prompt": "### Your Role \n You are a helpful assistant"
  },
  "metadata": {
    "to": "user@example.com",
    "from": "+1234567890"
  }
}
```

**Fields:**

* `stream_id` (optional): Stream identifier. If not provided, the server generates one
* `config.input_format`: Audio format for client audio input (`mulaw_8000`, `pcm_16000`, `pcm_24000`, `pcm_44100`)
* `config.voice_id` (optional): Override the agent's default TTS voice
* `agent` (optional): Per-call agent overrides. Use this to configure individual calls via the API, or to preview changes to the introduction or prompt without publishing them to production
* `metadata` (optional): Custom metadata object, passed through to the agent code. Two fields have special meaning:
  * `to` (optional): Destination identifier for call routing (defaults to agent ID)
  * `from` (optional): Source identifier for the call (defaults to "websocket")
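
As a rough illustration of the fields above, the sketch below builds a `start` message and rejects input formats that are not in the list. The helper name `build_start_event` and its keyword arguments are hypothetical and not part of any Cartesia SDK; only the JSON field names come from this page.

```python
import json
from typing import Any, Optional

# Input formats listed in this section.
SUPPORTED_INPUT_FORMATS = {"mulaw_8000", "pcm_16000", "pcm_24000", "pcm_44100"}


def build_start_event(
    input_format: str,
    stream_id: Optional[str] = None,
    voice_id: Optional[str] = None,
    agent: Optional[dict[str, Any]] = None,
    metadata: Optional[dict[str, Any]] = None,
) -> str:
    """Return a JSON-encoded `start` message (hypothetical helper)."""
    if input_format not in SUPPORTED_INPUT_FORMATS:
        raise ValueError(f"unsupported input_format: {input_format!r}")

    config: dict[str, Any] = {"input_format": input_format}
    if voice_id is not None:
        config["voice_id"] = voice_id

    event: dict[str, Any] = {"event": "start", "config": config}
    # Optional fields are omitted rather than sent as null.
    if stream_id is not None:
        event["stream_id"] = stream_id
    if agent is not None:
        event["agent"] = agent
    if metadata is not None:
        event["metadata"] = metadata
    return json.dumps(event)
```

For a telephony bridge, a call such as `build_start_event("mulaw_8000", metadata={"from": "+1234567890"})` would produce the first message to send after the socket opens.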

### Media Input Event

Audio data sent from the client to the server. The `payload` must be base64-encoded.

```json
{
  "event": "media_input",
  "stream_id": "unique_id",
  "media": {
    "payload": "base64_encoded_audio_data"
  }
}
```

**Fields:**

* `stream_id`: Stream identifier, as confirmed in the `ack` event
* `media.payload`: Base64-encoded audio data in the format specified in the start event
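
A minimal sketch of the encoding step, assuming the raw audio is already in the format declared in the `start` event. The function name `media_input_events` and the 4096-byte default chunk size are illustrative assumptions; this page does not specify a chunk size.

```python
import base64
import json
from typing import Iterator


def media_input_events(
    stream_id: str, audio: bytes, chunk_size: int = 4096
) -> Iterator[str]:
    """Yield JSON `media_input` messages, one per chunk of raw audio bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(audio), chunk_size):
        chunk = audio[start:start + chunk_size]
        yield json.dumps({
            "event": "media_input",
            "stream_id": stream_id,
            # JSON cannot carry raw bytes, so the chunk is base64-encoded.
            "media": {"payload": base64.b64encode(chunk).decode("ascii")},
        })
```

Each yielded string can be passed directly to your WebSocket client's send method.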

### DTMF Event

Sends DTMF (dual-tone multi-frequency) tones.

```json
{
  "event": "dtmf",
  "stream_id": "example_id",
  "dtmf": "1"
}
```

**Fields:**

* `stream_id`: Stream identifier
* `dtmf`: DTMF digit (0-9, \*, #)
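
To send a sequence such as a PIN, one option is to validate the digits and emit one event per digit. The sketch below assumes each `dtmf` message carries a single digit, as in the example above; the helper name `dtmf_events` is hypothetical.

```python
import json

# Digits accepted by the `dtmf` field, per the list above.
VALID_DTMF_DIGITS = set("0123456789*#")


def dtmf_events(stream_id: str, digits: str) -> list[str]:
    """Return one JSON `dtmf` message per digit, validating all digits first."""
    invalid = [d for d in digits if d not in VALID_DTMF_DIGITS]
    if invalid:
        raise ValueError(f"invalid DTMF digits: {invalid}")
    return [
        json.dumps({"event": "dtmf", "stream_id": stream_id, "dtmf": d})
        for d in digits
    ]
```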

### Custom Event

Sends custom metadata to the agent.

```json
{
  "event": "custom",
  "stream_id": "example_id",
  "metadata": {
    "user_id": "user123",
    "session_info": "custom_data"
  }
}
```

**Fields:**

* `stream_id`: Stream identifier
* `metadata`: Object containing key-value pairs of custom data

## Server events

### Ack Event

Confirms stream configuration. Returns the server-generated `stream_id` if one wasn't provided in the `start` event.

```json
{
  "event": "ack",
  "stream_id": "example_id",
  "config": {
    "input_format": "pcm_44100",
    "voice_id": "a0e99841-438c-4a64-b679-ae501e7d6091"
  },
  "agent": {
    "system_prompt": "### Your Role \n You are a helpful assistant",
    "introduction": "Hello, I'm an AI assistant"
  }
}
```

### Media Output Event

Agent audio sent from the server to the client. `payload` is base64-encoded audio data.

```json
{
  "event": "media_output",
  "stream_id": "example_id",
  "media": {
    "payload": "base64_encoded_audio_data"
  }
}
```

### Clear Event

Indicates the agent wants to clear/interrupt the current audio stream.

```json
{
  "event": "clear",
  "stream_id": "example_id"
}
```

### Transfer Call Event

Indicates the agent wants to transfer the call to a phone number. The client is responsible for initiating the transfer on its telephony side.

```json
{
  "event": "transfer_call",
  "stream_id": "example_id",
  "transfer": {
    "target_phone_number": "+1234567890"
  }
}
```

**Fields:**

* `stream_id`: Stream identifier
* `transfer.target_phone_number`: E.164 phone number to transfer the call to
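
Taken together, the server events in this section can be handled with a small dispatcher. The sketch below is one possible shape, not a reference client: `StreamState` and `handle_server_event` are hypothetical names, and treating `clear` as "drop audio that is queued but not yet played" is an assumption about how a client might honor the interrupt.

```python
import base64
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StreamState:
    """Client-side view of one stream (hypothetical structure)."""
    stream_id: Optional[str] = None
    playback_queue: list[bytes] = field(default_factory=list)
    transfer_target: Optional[str] = None


def handle_server_event(raw: str, state: StreamState) -> Optional[str]:
    """Apply one server message to `state` and return the event name."""
    data = json.loads(raw)
    event = data.get("event")
    if event == "ack":
        # Adopt the stream_id, which the server may have generated.
        state.stream_id = data["stream_id"]
    elif event == "media_output":
        # Decode agent audio and queue it for playback.
        state.playback_queue.append(base64.b64decode(data["media"]["payload"]))
    elif event == "clear":
        # Assumption: drop audio that is queued but not yet played.
        state.playback_queue.clear()
    elif event == "transfer_call":
        # The client performs the actual transfer on its telephony side.
        state.transfer_target = data["transfer"]["target_phone_number"]
    # Unknown events are ignored and their name is returned unchanged.
    return event
```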

## Connection Management

### Inactivity Timeout

The server closes idle connections after **180 seconds**. Any client message resets the timer:

* Application messages (media\_input, dtmf, custom events)
* Standard WebSocket ping frames
* Any other valid WebSocket message

When the timeout occurs, the connection is closed with:

* **Code:** 1000 (Normal Closure)
* **Reason:** `"connection idle timeout"`

### Ping/Pong Keepalive

To prevent inactivity timeouts during periods of silence, use standard WebSocket ping frames for periodic keepalive:

```python
# Client sends ping to reset inactivity timer
# (this call shape matches, for example, the `websockets` library)
pong_waiter = await websocket.ping()
latency = await pong_waiter
```

```javascript
// Requires the Node.js `ws` library — the browser WebSocket API does not expose ping()
setInterval(() => {
  if (websocket.readyState === WebSocket.OPEN) {
    websocket.ping();
  }
}, 60000); // Send ping every 60 seconds
```

The server automatically responds to ping frames with pong frames and resets the inactivity timer upon receiving any message.
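
For asyncio clients, the periodic ping can run as a background task. The sketch below is library-agnostic: `send_ping` is a hypothetical callable that you supply (for example, a wrapper around your WebSocket library's ping method), and the 60-second default simply mirrors the interval used above.

```python
import asyncio
from typing import Awaitable, Callable


async def keepalive(
    send_ping: Callable[[], Awaitable[object]],
    interval_s: float = 60.0,
) -> None:
    """Call `send_ping` every `interval_s` seconds until the task is cancelled."""
    while True:
        await asyncio.sleep(interval_s)
        await send_ping()
```

Start it with `asyncio.create_task(keepalive(...))` after the `ack` arrives and cancel the task when the connection closes.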

### Connection Close

The connection can be closed by either the client or server using WebSocket close frames.

**Client-initiated close:**

```python
await websocket.close(code=1000, reason="session completed")
```

**Server-initiated close:**

When the agent ends the call, the server closes the connection with:

* **Code:** 1000 (Normal Closure)
* **Reason:** `"call ended by agent"` or `"call ended by agent, reason: {specific_reason}"` if additional context is available
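
Because both the idle timeout and an agent-ended call use code 1000, the reason string is what distinguishes them. The sketch below shows one way to classify a close using only the reason strings documented on this page; `CloseInfo` and `classify_close` are hypothetical names.

```python
from typing import NamedTuple, Optional

IDLE_REASON = "connection idle timeout"
AGENT_ENDED_PREFIX = "call ended by agent"


class CloseInfo(NamedTuple):
    kind: str               # "idle_timeout", "agent_ended", or "other"
    detail: Optional[str]   # specific reason from the agent, if present


def classify_close(code: int, reason: str) -> CloseInfo:
    """Map a WebSocket close code and reason to a coarse category."""
    if code == 1000 and reason == IDLE_REASON:
        return CloseInfo("idle_timeout", None)
    if code == 1000 and reason.startswith(AGENT_ENDED_PREFIX):
        # Extract the optional ", reason: ..." suffix.
        _, sep, detail = reason.partition(", reason: ")
        return CloseInfo("agent_ended", detail if sep else None)
    return CloseInfo("other", None)
```

A client might reconnect and resend `start` when `kind` is `"idle_timeout"`, and treat `"agent_ended"` as a normal end of session.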

## Best Practices

1. **Send `start` first** — The connection closes if any other event is sent before `start`.
2. **Choose the right audio format** — Match the format to your source: `mulaw_8000` for telephony, `pcm_44100` for web clients.
3. **Handle closes cleanly** — Always capture close codes and reasons for debugging and recovery.
4. **Keep the connection alive** — Send WebSocket ping frames every 60–90 seconds to avoid the 180-second inactivity timeout.
5. **Manage stream IDs** — Provide your own `stream_id` values to improve observability across systems.
6. **Recover from idle timeouts** — On `1000 / connection idle timeout`, reconnect and resend a `start` event.
