Messages

{
  "type": "error",
  "title": "Invalid model",
  "message": "The model is not valid, make sure it is a valid model ID.",
  "error_code": "model_not_found",
  "doc_url": "https://docs.cartesia.ai/build-with-cartesia/stt-models/latest",
  "status_code": 400,
  "request_id": "2ff8af53-4d38-479d-8287-58940f01c701"
}

Realtime Speech-to-Text

A bidirectional WebSocket connection for real-time speech transcription with native turn detection. It is the recommended endpoint for building voice agents.

This API is organized around user turns (human user starts talking, stops talking), not transcript segments. The model itself signals when a user turn begins and ends, so your agent reacts to events rather than running its own voice activity detection.

See Turn Events for details on handling turn events.
See Realtime Speech-to-Text (External VAD) if you don’t want the model to perform turn detection, or if you want to control when transcripts are emitted to minimize latency.

All emitted text is final: this API sends only high-accuracy transcripts. Later events append to the transcript and never modify text sent by earlier events.

For WebSocket connection limits, see the concurrency limits and timeouts page.

WSS /stt/turns/websocket

model

type:string

required

ID of the model to use for transcription, e.g. ink-2. See Models for available models.

encoding

type:string

required

The encoding format of the audio data. This determines how the server interprets the raw binary audio data you send.

Supported encodings: pcm_s16le, pcm_s32le, pcm_f16le, pcm_f32le, pcm_mulaw, pcm_alaw.

For guidance on choosing an encoding, see Audio encodings.

sample_rate

type:string

required

The sample rate of the audio in Hz.

cartesia_version

type:string

required

API version. Provide this either by adding cartesia_version=2026-03-01 as a URL query parameter or Cartesia-Version: 2026-03-01 as a request header.

Browser WebSockets do not support request headers and should add the query parameter in the URL.

X-API-Key

type:httpApiKey

API key passed in a header.

access_token

type:httpApiKey

A short-lived access token passed in a query param to make API requests from a client. This is particularly useful in the browser, where WebSockets do not support headers. See Authenticate client apps to generate an access token.
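
As a rough illustration of how the query parameters above fit together, the sketch below assembles a connection URL. The host name `api.cartesia.ai` is an assumption (this page does not state it), the path is assembled from the stt, turns, and websocket labels shown for this endpoint, and `build_stt_url` is a hypothetical helper, not part of any SDK.

```python
from urllib.parse import urlencode

# Assumption: the host is not given on this page.
ASSUMED_HOST = "api.cartesia.ai"
# Assumption: path assembled from the "stt", "turns", "websocket" labels.
ASSUMED_PATH = "/stt/turns/websocket"


def build_stt_url(model, encoding, sample_rate,
                  cartesia_version="2026-03-01", access_token=None):
    """Return a wss:// URL carrying the required query parameters."""
    params = {
        "model": model,
        "encoding": encoding,
        "sample_rate": str(sample_rate),
        "cartesia_version": cartesia_version,
    }
    # Only browser-style clients need the token in the URL; server-side
    # clients can send the X-API-Key header instead.
    if access_token is not None:
        params["access_token"] = access_token
    return f"wss://{ASSUMED_HOST}{ASSUMED_PATH}?{urlencode(params)}"
```

A server-side client would typically omit `access_token` and pass `X-API-Key` as a request header when opening the socket; a browser client would pass the token in the URL, since it cannot set headers.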

Send Audio Data

type:string

Send WebSocket binary messages containing raw audio data as specified by the encoding and sample_rate query parameters.

Audio Requirements:

Send audio in small chunks (e.g., 100ms intervals) for optimal latency
Audio format must match the encoding and sample_rate parameters
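
The chunking guidance above can be sketched as follows. The bytes-per-sample values follow from the encoding names (16-bit, 32-bit, and 8-bit companded formats); mono audio is an assumption, and `chunk_size_bytes` and `iter_chunks` are hypothetical helper names.

```python
# Sample width in bytes for each supported encoding.
BYTES_PER_SAMPLE = {
    "pcm_s16le": 2,
    "pcm_s32le": 4,
    "pcm_f16le": 2,
    "pcm_f32le": 4,
    "pcm_mulaw": 1,
    "pcm_alaw": 1,
}


def chunk_size_bytes(encoding, sample_rate, chunk_ms=100, channels=1):
    """Bytes in one chunk of chunk_ms milliseconds (mono assumed by default)."""
    # Compute whole samples first so a chunk never splits a sample.
    samples = sample_rate * chunk_ms // 1000
    if samples <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    return samples * BYTES_PER_SAMPLE[encoding] * channels


def iter_chunks(audio, encoding, sample_rate, chunk_ms=100):
    """Yield successive chunks of raw audio bytes; the last may be shorter."""
    size = chunk_size_bytes(encoding, sample_rate, chunk_ms)
    for start in range(0, len(audio), size):
        yield audio[start:start + size]
```

Each yielded chunk would be sent as one WebSocket binary message. For 16 kHz `pcm_s16le` audio, a 100 ms chunk is 3200 bytes.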

Close Command

type:object

Send a JSON-encoded close command as a WebSocket text message to close the session cleanly. The model processes all buffered audio into events before the session closes.

Connected

type:object

Fires once when the WebSocket connection is established.

You do not need to wait for this event before sending audio.

Turn Start

type:object

Marks the start of a user turn. Fires quickly after the user begins speaking.

This event can be used to interrupt your agent to avoid talking over the user.

Turn Update

type:object

Fires repeatedly as the model transcribes the current user turn.

Turn Eager End [PREVIEW]

type:object

Fires when the model predicts that the user might be done speaking.

Turn Resume [PREVIEW]

type:object

Fires after turn.eager_end if the user turn has not actually ended.

Turn End

type:object

Marks the end of a user turn.
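
One way an agent might react to the turn events above is a small state tracker. This is a sketch under assumptions: only `turn.eager_end` is named on this page, so the other event type strings (`turn.start`, `turn.update`, `turn.resume`, `turn.end`) are inferred by analogy, and the returned action names are hypothetical labels for one possible agent policy.

```python
class TurnTracker:
    """Tracks the user-turn state from event type strings (names assumed)."""

    def __init__(self):
        self.state = "idle"
        self.completed_turns = 0

    def handle(self, event_type):
        """Update state for one event and return a suggested agent action."""
        if event_type == "turn.start":
            # The user began speaking: stop the agent from talking over them.
            self.state = "speaking"
            return "interrupt_agent"
        if event_type == "turn.eager_end":
            # The user might be done: begin preparing a reply speculatively.
            self.state = "maybe_ended"
            return "start_speculative_reply"
        if event_type == "turn.resume":
            # The turn did not actually end: discard the speculative reply.
            self.state = "speaking"
            return "cancel_speculative_reply"
        if event_type == "turn.end":
            # The turn is over: the agent can respond.
            self.state = "idle"
            self.completed_turns += 1
            return "reply"
        # turn.update and unknown events do not change the turn state.
        return "none"
```

Keeping the tracker free of transcript handling avoids guessing at payload fields that this page does not describe.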

Error Response

type:object

Error information for STT WebSocket connections.
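
A minimal sketch of parsing an error message, using only the fields that appear in the example error payload on this page. `SttError` and `parse_error` are hypothetical names, and treating every field as optional is a defensive assumption rather than documented behavior.

```python
import json
from dataclasses import dataclass


@dataclass
class SttError:
    title: str
    message: str
    error_code: str
    status_code: int
    request_id: str
    doc_url: str = ""


def parse_error(raw):
    """Return an SttError if raw is an error message, otherwise None."""
    data = json.loads(raw)
    if data.get("type") != "error":
        return None
    return SttError(
        title=data.get("title", ""),
        message=data.get("message", ""),
        error_code=data.get("error_code", ""),
        status_code=int(data.get("status_code", 0)),
        request_id=data.get("request_id", ""),
        doc_url=data.get("doc_url", ""),
    )
```

Logging `request_id` alongside `error_code` makes it easier to reference a specific failure when contacting support.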
