Connection
Connect to the WebSocket endpoint with the following headers:

| Header | Value |
|---|---|
| Authorization | Bearer {token} |
| Cartesia-Version | 2025-04-16 |

token must be an Access Token created from the /access-token endpoint. To learn more, see Authenticating Client Apps.
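As a minimal sketch, the two handshake headers can be assembled as shown below. The helper name build_connection_headers is hypothetical, and how the headers are attached to the handshake depends on the WebSocket client library you use.

```python
CARTESIA_VERSION = "2025-04-16"  # version value listed in the header table


def build_connection_headers(access_token):
    """Return the HTTP headers sent with the WebSocket handshake.

    Hypothetical helper: access_token is assumed to be an Access Token
    already obtained from the /access-token endpoint.
    """
    if not access_token:
        raise ValueError("an access token is required")
    return {
        "Authorization": "Bearer " + access_token,
        "Cartesia-Version": CARTESIA_VERSION,
    }
```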
Protocol Overview
The WebSocket connection uses JSON messages for control events and base64-encoded audio for media. The lifecycle follows this sequence:
- Client → Server: Send a start event to initialize the stream.
- Server → Client: Receive an ack event confirming configuration.
- Bidirectional exchange: The client and server exchange streaming audio and control events until either side closes the connection or the inactivity timeout fires.
- Close: Either side ends the session with a standard WebSocket close frame.
If you omit stream_id in the initial start event, the server generates one and returns it in the ack response.
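The sequence above can be sketched as a small client-side state tracker. The class and method names are hypothetical, and refusing to send further events until the ack arrives is a conservative assumption rather than a documented server requirement.

```python
from enum import Enum


class Phase(Enum):
    CONNECTED = "connected"  # socket open, start not yet sent
    STARTING = "starting"    # start sent, waiting for ack
    STREAMING = "streaming"  # ack received, audio and control events flow
    CLOSED = "closed"        # close frame sent or received


class SessionTracker:
    """Hypothetical helper that enforces start -> ack -> exchange -> close."""

    def __init__(self):
        self.phase = Phase.CONNECTED
        self.stream_id = None

    def on_client_send(self, event_name):
        """Call before sending a client event; raises if it is out of order."""
        if self.phase is Phase.CLOSED:
            raise RuntimeError("session is closed")
        if self.phase is Phase.CONNECTED:
            if event_name != "start":
                raise RuntimeError("start must be the first event")
            self.phase = Phase.STARTING
        elif self.phase is Phase.STARTING:
            # Assumption: hold other events until the server acknowledges start.
            raise RuntimeError("wait for ack before sending " + event_name)

    def on_ack(self, stream_id):
        """Record the stream_id from the ack and move to the exchange phase."""
        if self.phase is not Phase.STARTING:
            raise RuntimeError("unexpected ack")
        self.stream_id = stream_id
        self.phase = Phase.STREAMING

    def on_close(self):
        self.phase = Phase.CLOSED
```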
Client events
Start Event
Initializes the audio stream configuration.
- The config parameter optionally alters the input audio settings, overriding your agent's default configuration.
- The stream_id can be set manually if you want to track it on the client side for observability. If not specified, the server generates one and returns it in the ack event.
Fields:
- stream_id (optional): Stream identifier. If not provided, the server generates one.
- config.input_format: Audio format for client audio input (mulaw_8000, pcm_16000, pcm_24000, pcm_44100).
- metadata (optional): Custom metadata object. It is passed through to the user code, and some fields have special meaning:
  - to (optional): Destination identifier for call routing (defaults to the agent ID)
  - from (optional): Source identifier for the call (defaults to “websocket”)
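A minimal sketch of building a start message from the fields listed above. The function name is hypothetical, and the top-level "event" key used to name the message type is an assumption, since the field list does not show how the event type is encoded.

```python
import json

# Input formats listed for config.input_format.
ALLOWED_INPUT_FORMATS = {"mulaw_8000", "pcm_16000", "pcm_24000", "pcm_44100"}


def build_start_event(input_format=None, stream_id=None, metadata=None):
    """Serialize a start event; every argument is optional."""
    event = {"event": "start"}  # assumed discriminator key, not confirmed here
    if stream_id is not None:
        event["stream_id"] = stream_id
    if input_format is not None:
        if input_format not in ALLOWED_INPUT_FORMATS:
            raise ValueError("unsupported input_format: " + input_format)
        event["config"] = {"input_format": input_format}
    if metadata is not None:
        # May carry the special "to" and "from" routing fields.
        event["metadata"] = metadata
    return json.dumps(event)
```

For example, build_start_event(input_format="mulaw_8000", metadata={"from": "kiosk-7"}) leaves stream_id out so that the server generates it.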
Media Input Event
Audio data sent from the client to the server. The payload audio data must be base64-encoded.
- stream_id: Unique identifier for the stream, taken from the ack response
- media.payload: Base64-encoded audio data in the format specified in the start event
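The encoding step can be sketched as follows. The function names, the "event" key, and the "media_input" value it carries are assumptions based on the event names used in this document; the chunk size is arbitrary and not a documented limit.

```python
import base64
import json


def build_media_input(stream_id, audio_chunk):
    """Wrap raw audio bytes in a media input message."""
    payload = base64.b64encode(audio_chunk).decode("ascii")
    return json.dumps({
        "event": "media_input",  # assumed discriminator key and value
        "stream_id": stream_id,
        "media": {"payload": payload},
    })


def iter_media_inputs(stream_id, audio, chunk_size=3200):
    """Yield one message per chunk_size bytes of audio (size is arbitrary)."""
    for offset in range(0, len(audio), chunk_size):
        yield build_media_input(stream_id, audio[offset:offset + chunk_size])
```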
DTMF Event
Sends DTMF (dual-tone multi-frequency) tones.
- stream_id: Stream identifier
- dtmf: DTMF digit (0-9, *, #)
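A short sketch that validates the digit before building the message. The function name and the "event" key are assumptions; sending one digit per message follows from the singular "DTMF digit" in the field description.

```python
import json

VALID_DTMF_DIGITS = set("0123456789*#")  # digits listed for the dtmf field


def build_dtmf_event(stream_id, digit):
    """Serialize a single-digit DTMF event, rejecting anything else."""
    if len(digit) != 1 or digit not in VALID_DTMF_DIGITS:
        raise ValueError("dtmf must be one of 0-9, * or #")
    return json.dumps({
        "event": "dtmf",  # assumed discriminator key
        "stream_id": stream_id,
        "dtmf": digit,
    })
```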
Custom Event
Sends custom metadata to the agent.
- stream_id: Stream identifier
- metadata: Object containing key-value pairs of custom data
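A custom event might be built as in this sketch. The function name and the "event" key are assumptions. The type check only guards against passing something other than a key-value object.

```python
import json


def build_custom_event(stream_id, metadata):
    """Serialize a custom event carrying a key-value metadata object."""
    if not isinstance(metadata, dict):
        raise TypeError("metadata must be a dict of key-value pairs")
    return json.dumps({
        "event": "custom",  # assumed discriminator key
        "stream_id": stream_id,
        "metadata": metadata,
    })
```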
Server events
Ack Event
Server acknowledgment of the start event, confirming the stream configuration. If stream_id wasn’t provided in the initial start event, the client obtains the server-generated stream_id here.
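A sketch of reading the stream_id from the ack. The function name and the "event" key are assumptions. Falling back to the client-supplied ID when the ack carries none is a defensive choice, not documented behavior.

```python
import json


def stream_id_from_ack(raw_message, requested_stream_id=None):
    """Return the stream_id to use for all later events on this connection."""
    message = json.loads(raw_message)
    if message.get("event") != "ack":  # assumed discriminator key
        raise ValueError("expected an ack event")
    stream_id = message.get("stream_id") or requested_stream_id
    if not stream_id:
        raise ValueError("ack did not include a stream_id")
    return stream_id
```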
Media Output Event
The server sends the agent's audio response. The payload is base64-encoded audio data.
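Decoding the agent audio can be sketched as below. The function name, the "media_output" event value, and the media.payload nesting (mirroring the media input event) are assumptions.

```python
import base64
import json


def decode_media_output(raw_message):
    """Return the raw audio bytes carried by a media output message."""
    message = json.loads(raw_message)
    if message.get("event") != "media_output":  # assumed key and value
        raise ValueError("expected a media_output event")
    # Assumed shape: payload nested under "media", as in media input.
    payload = message["media"]["payload"]
    return base64.b64decode(payload, validate=True)
```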
Clear Event
Indicates the agent wants to clear/interrupt the current audio stream.
DTMF Event
Server sends DTMF tones from the agent.
Custom Event
Server sends custom metadata from the agent.
Connection Management
Inactivity Timeout
The server automatically closes idle WebSocket connections after 30 seconds of inactivity. Activity is defined as receiving any message from the client, including:
- Application messages (media_input, dtmf, custom events)
- Standard WebSocket ping frames
- Any other valid WebSocket message
When the timeout fires, the server closes the connection with:
- Code: 1000 (Normal Closure)
- Reason: "connection idle timeout"
Ping/Pong Keepalive
To prevent inactivity timeouts during periods of silence, send standard WebSocket ping frames periodically as a keepalive.
Connection Close
The connection can be closed by either the client or the server using WebSocket close frames. When the agent ends the call, the close frame carries:
- Code: 1000 (Normal Closure)
- Reason: "call ended by agent", or "call ended by agent, reason: {specific_reason}" if additional context is available
Best Practices
- Send start first: The connection closes if any other event is sent before start.
- Choose the right audio format: Match the format to your source, for example mulaw_8000 for telephony and pcm_44100 for web clients.
- Handle closes cleanly: Always capture close codes and reasons for debugging and recovery.
- Keep the connection alive: Send WebSocket ping frames every 20–25 seconds to avoid the 30-second inactivity timeout.
- Manage stream IDs: Provide your own stream_id values to improve observability across systems.
- Recover from idle timeouts: On 1000 / connection idle timeout, reconnect and resend a start event.
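The keepalive and idle-timeout practices above can be sketched with asyncio. The function names are hypothetical, and send_ping stands in for whatever coroutine your WebSocket client library provides for sending a ping frame. The 20-second default follows the interval recommended in the list.

```python
import asyncio

IDLE_TIMEOUT_REASON = "connection idle timeout"  # reason string quoted above


async def keepalive(send_ping, stop, interval_seconds=20.0):
    """Call send_ping every interval_seconds until stop is set.

    send_ping: hypothetical zero-argument coroutine function that sends
    one WebSocket ping frame. stop: asyncio.Event set when the session ends.
    Returns the number of pings sent.
    """
    sent = 0
    while not stop.is_set():
        try:
            # Return early if the session ends; otherwise time out and ping.
            await asyncio.wait_for(stop.wait(), timeout=interval_seconds)
        except asyncio.TimeoutError:
            await send_ping()
            sent += 1
    return sent


def should_restart_stream(code, reason):
    """True when a close matches the idle timeout described above,
    meaning the client should reconnect and resend a start event."""
    return code == 1000 and reason == IDLE_TIMEOUT_REASON
```

Running keepalive as a background task next to the receive loop keeps pings flowing during silence. Setting the stop event when the socket closes ends the task without cancelling it.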