Quick start
/access-token endpoint. See Authenticating Client Apps for details.
Connection
Connect to the WebSocket endpoint with the following headers:

| Header | Value |
|---|---|
| Authorization | Bearer {token} |
| Cartesia-Version | 2025-04-16 |
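
As a minimal sketch, the two headers in the table can be assembled as below. The helper name `build_headers` is hypothetical, and the endpoint URL is deliberately left out because it is not given here. Pass the resulting dictionary to whichever WebSocket client you use.

```python
def build_headers(token: str, version: str = "2025-04-16") -> dict[str, str]:
    """Return the headers to send with the WebSocket handshake."""
    if not token:
        raise ValueError("an access token is required")
    return {
        "Authorization": f"Bearer {token}",
        "Cartesia-Version": version,
    }
```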
Protocol Overview
The WebSocket connection uses JSON messages for control events and base64-encoded audio for media. The client sends a start event, the server responds with ack, then both sides exchange audio and control events until the connection closes.
Client events
Start Event
Initializes the audio stream configuration.
- config overrides your agent’s default input audio settings
- stream_id is optional. If not provided, the server generates one and returns it in the ack event

Fields:
- stream_id (optional): Stream identifier. If not provided, the server generates one
- config.input_format: Audio format for client audio input (mulaw_8000, pcm_16000, pcm_24000, pcm_44100)
- config.voice_id (optional): Override the agent’s default TTS voice
- agent (optional): Allows configuring individual agent calls via API and previewing changes in introduction or prompt without publishing to production
- metadata (optional): Custom metadata object. Its fields are passed through to the agent code, and a few special fields are also recognized:
  - to (optional): Destination identifier for call routing (defaults to agent ID)
  - from (optional): Source identifier for the call (defaults to “websocket”)
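
A start message could be built along these lines. This is a sketch under assumptions: the function name `build_start_event` is hypothetical, and the `"event"` key used to carry the event type is assumed rather than confirmed by the field list above. The agent field is omitted because its structure is not described here.

```python
import json
from typing import Any, Optional

VALID_INPUT_FORMATS = {"mulaw_8000", "pcm_16000", "pcm_24000", "pcm_44100"}


def build_start_event(
    input_format: str,
    stream_id: Optional[str] = None,
    voice_id: Optional[str] = None,
    metadata: Optional[dict[str, Any]] = None,
) -> str:
    """Serialize a start event, leaving out optional fields that are unset."""
    if input_format not in VALID_INPUT_FORMATS:
        raise ValueError(f"unsupported input_format: {input_format}")
    config: dict[str, Any] = {"input_format": input_format}
    if voice_id is not None:
        config["voice_id"] = voice_id
    # ASSUMPTION: the event type travels in an "event" key.
    message: dict[str, Any] = {"event": "start", "config": config}
    if stream_id is not None:
        message["stream_id"] = stream_id  # otherwise the server generates one
    if metadata is not None:
        message["metadata"] = metadata  # may include the special "to"/"from" fields
    return json.dumps(message)
```

Leaving stream_id unset relies on the server returning a generated value in the ack event, so the client should read it from there before sending audio.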
Media Input Event
Audio data sent from the client to the server. The payload audio data should be base64-encoded.
- stream_id: Unique identifier for the stream, taken from the ack response
- media.payload: Base64-encoded audio data in the format specified in the start event
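
The encoding step might look like the following sketch. The helper name `build_media_input` is hypothetical, and both the `"event"` key and the `"media_input"` event name are assumptions. Only stream_id and media.payload come from the field list above.

```python
import base64
import json


def build_media_input(stream_id: str, audio: bytes) -> str:
    """Wrap one chunk of raw audio bytes in a media input message."""
    payload = base64.b64encode(audio).decode("ascii")
    return json.dumps(
        {
            "event": "media_input",  # ASSUMPTION: event key and name
            "stream_id": stream_id,
            "media": {"payload": payload},
        }
    )
```

The audio bytes are expected to already be in the format declared in the start event; this helper does no resampling or conversion.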
DTMF Event
Sends DTMF (dual-tone multi-frequency) tones.
- stream_id: Stream identifier
- dtmf: DTMF digit (0-9, *, #)
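
A small validation sketch for the allowed digits follows. The function name `build_dtmf` is hypothetical and the `"event"` key is an assumption. Sending one digit per message is also assumed, based on the field description.

```python
import json

VALID_DTMF_DIGITS = set("0123456789*#")


def build_dtmf(stream_id: str, digit: str) -> str:
    """Serialize a DTMF message after checking the digit is allowed."""
    if digit not in VALID_DTMF_DIGITS:
        raise ValueError(f"invalid DTMF digit: {digit!r}")
    # ASSUMPTION: the event type travels in an "event" key.
    return json.dumps({"event": "dtmf", "stream_id": stream_id, "dtmf": digit})
```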
Custom Event
Sends custom metadata to the agent.
- stream_id: Stream identifier
- metadata: Object containing key-value pairs of custom data
Server events
Ack Event
Confirms stream configuration. Returns the server-generated stream_id if one wasn’t provided in the start event.
Media Output Event
The server sends the agent’s audio response. The payload is base64-encoded audio data.
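
One way a client might consume server audio is a small playback buffer, sketched below. Everything about the message shape is an assumption: the `"event"` key, the `"media_output"` and `"clear"` event names, and the nesting of the payload under `media` (mirrored from the media input event). The class name `PlaybackBuffer` is hypothetical. The clear branch anticipates the Clear Event described in the next section.

```python
import base64
import json


class PlaybackBuffer:
    """Queue decoded agent audio and drop it when the server asks to clear."""

    def __init__(self) -> None:
        self.chunks: list[bytes] = []

    def handle(self, raw: str) -> None:
        message = json.loads(raw)
        event = message.get("event")  # ASSUMPTION: event type key
        if event == "media_output":
            # ASSUMPTION: payload is nested under "media", as in media input.
            self.chunks.append(base64.b64decode(message["media"]["payload"]))
        elif event == "clear":
            # Discard audio that has not been played yet.
            self.chunks.clear()
```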
Clear Event
Indicates the agent wants to clear/interrupt the current audio stream.
Connection Management
Inactivity Timeout
The server closes idle connections after 180 seconds. Any client message resets the timer:
- Application messages (media_input, dtmf, custom events)
- Standard WebSocket ping frames
- Any other valid WebSocket message

On timeout, the server closes the connection with:
- Code: 1000 (Normal Closure)
- Reason: "connection idle timeout"
Ping/Pong Keepalive
To prevent inactivity timeouts during periods of silence, use standard WebSocket ping frames for periodic keepalive.
Connection Close
The connection can be closed by either the client or the server using WebSocket close frames. The client can close the connection at any time by sending a close frame. When the agent ends the call, the server closes the connection with:
- Code: 1000 (Normal Closure)
- Reason: "call ended by agent" or "call ended by agent, reason: {specific_reason}" if additional context is available
Best Practices
- Send start first. The connection closes if any other event is sent before start.
- Choose the right audio format. Match the format to your source: mulaw_8000 for telephony, pcm_44100 for web clients.
- Handle closes cleanly. Always capture close codes and reasons for debugging and recovery.
- Keep the connection alive. Send WebSocket ping frames every 60–90 seconds to avoid the 180-second inactivity timeout.
- Manage stream IDs. Provide your own stream_id values to improve observability across systems.
- Recover from idle timeouts. On 1000 / connection idle timeout, reconnect and resend a start event.
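
The keepalive and idle-timeout recovery practices above can be sketched with asyncio alone. The names `keepalive` and `should_reconnect` are hypothetical, and `send_ping` stands in for whatever ping call your WebSocket client exposes. This is an illustration under those assumptions, not a definitive implementation.

```python
import asyncio
from typing import Awaitable, Callable

IDLE_TIMEOUT_REASON = "connection idle timeout"


async def keepalive(
    send_ping: Callable[[], Awaitable[None]],
    interval: float = 60.0,
) -> None:
    """Call send_ping every `interval` seconds until the task is cancelled."""
    while True:
        await asyncio.sleep(interval)
        await send_ping()


def should_reconnect(code: int, reason: str) -> bool:
    """Return True when a close frame matches the documented idle timeout."""
    return code == 1000 and reason == IDLE_TIMEOUT_REASON
```

Run `keepalive` as a background task next to the receive loop and cancel it when the connection closes. A 60-second interval stays well inside the 180-second timeout, leaving room for a delayed or dropped ping. When `should_reconnect` returns True, open a new connection and send a start event again before any other message.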