Speech to Text (Streaming)

This endpoint creates a bidirectional WebSocket connection for real-time speech transcription.

The STT endpoint accepts a stream of raw audio bytes (16-bit little-endian PCM at a 16 kHz sample rate) and returns transcription results as they become available.

Usage Pattern:

  1. Connect to the WebSocket with appropriate query parameters
  2. Send audio chunks as binary WebSocket messages in pcm_s16le format at a 16 kHz sample rate
  3. Receive transcription messages as JSON
  4. Send finalize as a text message to flush any remaining audio (the server replies with a flush_done acknowledgment)
  5. Send done as a text message to close the session cleanly (the server replies with a done acknowledgment and closes the connection)
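The steps above can be sketched as a minimal asynchronous Python client. This assumes the third-party websockets package; the URL, query parameters, and the finalize/done commands come from this page, but the fields inside transcript messages are not specified here, so only the type field is inspected:

```python
# Sketch of the five-step flow above, using the third-party `websockets`
# package (pip install websockets). Error handling is omitted for brevity.
import json


async def transcribe(pcm_chunks, api_key):
    # Lazy import so the sketch can be read without the dependency installed.
    import websockets  # third-party package, assumed available

    url = (
        "wss://api.cartesia.ai/stt/websocket"
        "?model=ink-whisper&language=en&encoding=pcm_s16le"
        f"&sample_rate=16000&api_key={api_key}"
    )
    # The kwarg is `additional_headers` in websockets >= 14 (`extra_headers` before).
    async with websockets.connect(
        url, additional_headers={"Cartesia-Version": "2025-04-16"}
    ) as ws:
        for chunk in pcm_chunks:
            await ws.send(chunk)      # step 2: binary frames of pcm_s16le audio
        await ws.send("finalize")     # step 4: flush any remaining audio
        await ws.send("done")         # step 5: ask the server to close cleanly
        async for raw in ws:          # step 3: JSON transcription messages
            msg = json.loads(raw)
            if msg.get("type") == "transcript":
                print(msg)
            elif msg.get("type") == "done":
                break                 # server acknowledged; session is over
```

In real use the receive loop would normally run concurrently with sending (for example via asyncio.gather), so transcripts arrive while audio is still streaming.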

Handshake

GET
wss://api.cartesia.ai/stt/websocket

Headers

Cartesia-Version "2025-04-16" Required

Query parameters

model string Required

ID of the model to use for transcription. Use ink-whisper for the latest Cartesia Whisper model.

language string Optional

The language of the input audio in ISO-639-1 format. Defaults to en.

encoding enum Optional

The encoding format of the audio data. This determines how the server interprets the raw binary audio data you send.

Currently supported: pcm_s16le - 16-bit signed integer PCM, little-endian (default)

Allowed values: pcm_s16le

sample_rate integer Required

The sample rate of the audio in Hz. Only 16000 is supported, and it is the default. Must match the actual sample rate of your audio data.

api_key string Required

You can specify this instead of the X-API-Key header. This is particularly useful in the browser, where WebSockets do not support custom headers.

You do not need to specify this if you are passing the header.
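Putting the query parameters together, the handshake URL can be assembled with the standard library. A sketch; the defaults shown are the documented ones:

```python
from urllib.parse import urlencode


def build_handshake_url(api_key, model="ink-whisper", language="en"):
    """Build the STT WebSocket URL with the key in the query string,
    for browser-style clients that cannot send an X-API-Key header."""
    params = {
        "model": model,
        "language": language,       # ISO-639-1, defaults to en
        "encoding": "pcm_s16le",    # only currently supported encoding
        "sample_rate": 16000,       # only supported sample rate
        "api_key": api_key,
    }
    return "wss://api.cartesia.ai/stt/websocket?" + urlencode(params)
```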

Send

Send Audio Data or Commands string Required

In Practice:

  • Send binary WebSocket messages containing raw audio data in the format specified by the encoding parameter
  • Send text WebSocket messages with commands:
    • finalize - Flush any remaining audio and receive flush_done acknowledgment
    • done - Flush remaining audio, close session, and receive done acknowledgment

Timeout Behavior:

  • If no audio data is sent for 20 seconds, the WebSocket will automatically disconnect
  • The timeout resets with each message (audio data or text command) sent to the server

Audio Requirements:

  • Send audio in small chunks (e.g., 100ms intervals) for optimal latency
  • Audio format must match the encoding and sample_rate parameters
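Both requirements can be handled before sending: convert samples to pcm_s16le and slice them into 100 ms chunks. At 16 kHz mono, 16-bit, 100 ms of audio is 16000 × 2 × 0.1 = 3200 bytes. A stdlib-only sketch (the clipping and scaling conventions are the usual ones, not mandated by the API):

```python
import struct


def floats_to_pcm_s16le(samples):
    """Convert float samples in [-1.0, 1.0] to raw 16-bit little-endian PCM."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    # '<' = little-endian, 'h' = signed 16-bit integer
    return struct.pack("<%dh" % len(clipped), *(int(s * 32767) for s in clipped))


def chunk_pcm(pcm, chunk_ms=100, sample_rate=16000):
    """Split raw pcm_s16le mono audio into fixed-duration chunks."""
    step = sample_rate * 2 * chunk_ms // 1000  # 2 bytes per 16-bit sample
    return [pcm[i:i + step] for i in range(0, len(pcm), step)]
```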

Receive

Receive Transcription object Required

The server will send transcription results as they become available. Messages can be of type transcript, flush_done, done, or error.
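Incoming messages can be dispatched on their type field. The payload fields beyond type are not listed on this page, so this sketch inspects type only:

```python
import json


def handle_message(raw):
    """Return the message type for transcript/flush_done/done; raise on error.

    Only the `type` field is documented on this page, so nothing else
    in the payload is assumed here.
    """
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "error":
        raise RuntimeError("server error: %s" % msg)
    if kind not in ("transcript", "flush_done", "done"):
        raise ValueError("unexpected message type: %r" % kind)
    return kind
```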