Speech-to-Text (Streaming)

This endpoint creates a bidirectional WebSocket connection for real-time speech transcription.

Our STT endpoint accepts a stream of audio bytes and returns transcription results as they become available.

Usage Pattern (sketched in code after this list):

  1. Connect to the WebSocket with appropriate query parameters
  2. Send audio chunks as binary WebSocket messages in the specified encoding format
  3. Receive transcription messages as JSON with word-level timestamps
  4. Send finalize as a text message to flush any remaining audio (receives flush_done acknowledgment)
  5. Send done as a text message to close the session cleanly (receives done acknowledgment and closes)
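
As a concrete sketch of this flow, here is a minimal client using the third-party websockets Python package. The URL, query parameters, and the finalize/done commands come from this page; the API key, file name, and the "type" field on responses are placeholder assumptions.

```python
import asyncio
import json

import websockets

URL = (
    "wss://api.cartesia.ai/stt/websocket"
    "?model=ink-whisper"
    "&encoding=pcm_s16le"
    "&sample_rate=16000"
    "&api_key=YOUR_API_KEY"  # placeholder; or pass the X-API-Key header instead
)

async def transcribe(path: str) -> None:
    # Step 1: connect (older websockets releases spell this `extra_headers`).
    async with websockets.connect(
        URL, additional_headers={"Cartesia-Version": "2025-04-16"}
    ) as ws:
        # Step 2: stream raw audio as binary messages.
        with open(path, "rb") as f:
            while chunk := f.read(3200):  # 100 ms of 16 kHz mono pcm_s16le
                await ws.send(chunk)
        # Steps 4 and 5: flush any remaining audio, then close the session.
        await ws.send("finalize")
        await ws.send("done")
        # Step 3: print results until the server acknowledges "done".
        async for raw in ws:
            msg = json.loads(raw)
            print(msg)
            if msg.get("type") == "done":
                break

asyncio.run(transcribe("speech_16khz.raw"))
```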

Performance Recommendation: Resample audio before streaming and send chunks in pcm_s16le format at a 16 kHz sample rate; the sketch below shows one way to do this.
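
One way to do that resampling with only the Python standard library is sketched below; note that audioop is deprecated since Python 3.11 and removed in 3.13, and any resampler (ffmpeg, soxr, etc.) works just as well.

```python
import audioop
import wave

# Read the source WAV file (assumed PCM; the name is a placeholder).
with wave.open("input.wav", "rb") as wav:
    rate = wav.getframerate()
    width = wav.getsampwidth()
    channels = wav.getnchannels()
    frames = wav.readframes(wav.getnframes())

if channels == 2:
    frames = audioop.tomono(frames, width, 0.5, 0.5)  # downmix stereo to mono
if width != 2:
    frames = audioop.lin2lin(frames, width, 2)        # convert to 16-bit samples
pcm_s16le, _ = audioop.ratecv(frames, 2, 1, rate, 16000, None)  # resample to 16 kHz
```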

Pricing: Speech-to-text streaming is priced at 1 credit per second of audio streamed in.

Concurrency: STT has a dedicated concurrency limit, which determines the maximum number of active WebSocket connections you can have at any time. If you exceed your concurrency limit, new connections will be rejected with a 429 error. Idle WebSocket connections are automatically closed after 20 seconds of inactivity (no audio being streamed).
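
If a connection is rejected with a 429, one option is to retry with exponential backoff until a slot frees up. A hedged sketch, assuming the websockets package (recent releases raise InvalidStatus carrying the failed handshake's HTTP response):

```python
import asyncio

import websockets
from websockets.exceptions import InvalidStatus

async def connect_with_backoff(url: str, max_attempts: int = 5):
    delay = 1.0
    for _ in range(max_attempts):
        try:
            return await websockets.connect(url)
        except InvalidStatus as exc:
            if exc.response.status_code != 429:
                raise                     # not a concurrency rejection
            await asyncio.sleep(delay)    # wait for a connection slot to free up
            delay *= 2
    raise RuntimeError("still over the concurrency limit after retries")
```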

Handshake

WSS
wss://api.cartesia.ai/stt/websocket

Headers

Cartesia-Version"2025-04-16"Required

Query parameters

model (string, Required)

ID of the model to use for transcription. Use ink-whisper for the latest Cartesia Whisper model.

language (string, Optional)

The language of the input audio in ISO-639-1 format. Defaults to en.

  • en (English)
  • zh (Chinese)
  • de (German)
  • es (Spanish)
  • ru (Russian)
  • ko (Korean)
  • fr (French)
  • ja (Japanese)
  • pt (Portuguese)
  • tr (Turkish)
  • pl (Polish)
  • ca (Catalan)
  • nl (Dutch)
  • ar (Arabic)
  • sv (Swedish)
  • it (Italian)
  • id (Indonesian)
  • hi (Hindi)
  • fi (Finnish)
  • vi (Vietnamese)
  • he (Hebrew)
  • uk (Ukrainian)
  • el (Greek)
  • ms (Malay)
  • cs (Czech)
  • ro (Romanian)
  • da (Danish)
  • hu (Hungarian)
  • ta (Tamil)
  • no (Norwegian)
  • th (Thai)
  • ur (Urdu)
  • hr (Croatian)
  • bg (Bulgarian)
  • lt (Lithuanian)
  • la (Latin)
  • mi (Maori)
  • ml (Malayalam)
  • cy (Welsh)
  • sk (Slovak)
  • te (Telugu)
  • fa (Persian)
  • lv (Latvian)
  • bn (Bengali)
  • sr (Serbian)
  • az (Azerbaijani)
  • sl (Slovenian)
  • kn (Kannada)
  • et (Estonian)
  • mk (Macedonian)
  • br (Breton)
  • eu (Basque)
  • is (Icelandic)
  • hy (Armenian)
  • ne (Nepali)
  • mn (Mongolian)
  • bs (Bosnian)
  • kk (Kazakh)
  • sq (Albanian)
  • sw (Swahili)
  • gl (Galician)
  • mr (Marathi)
  • pa (Punjabi)
  • si (Sinhala)
  • km (Khmer)
  • sn (Shona)
  • yo (Yoruba)
  • so (Somali)
  • af (Afrikaans)
  • oc (Occitan)
  • ka (Georgian)
  • be (Belarusian)
  • tg (Tajik)
  • sd (Sindhi)
  • gu (Gujarati)
  • am (Amharic)
  • yi (Yiddish)
  • lo (Lao)
  • uz (Uzbek)
  • fo (Faroese)
  • ht (Haitian Creole)
  • ps (Pashto)
  • tk (Turkmen)
  • nn (Nynorsk)
  • mt (Maltese)
  • sa (Sanskrit)
  • lb (Luxembourgish)
  • my (Myanmar)
  • bo (Tibetan)
  • tl (Tagalog)
  • mg (Malagasy)
  • as (Assamese)
  • tt (Tatar)
  • haw (Hawaiian)
  • ln (Lingala)
  • ha (Hausa)
  • ba (Bashkir)
  • jw (Javanese)
  • su (Sundanese)
  • yue (Cantonese)

encoding (enum, Required)

The encoding format of the audio data. This determines how the server interprets the raw binary audio data you send.

Required field - you must specify the encoding format that matches your audio data. We recommend using pcm_s16le for best performance.

sample_rate (integer, Required)

The sample rate of the audio in Hz.

Required field - must match the actual sample rate of your audio data. We recommend using 16000 for best performance.

min_volume (double, Optional)

Volume threshold for voice activity detection; audio below this threshold is treated as silence. Range: 0.0-1.0. Higher values filter quiet speech more aggressively.

max_silence_duration_secs (double, Optional)

Maximum duration of silence (in seconds) before the system considers the utterance complete and triggers endpointing. Higher values allow for longer pauses within utterances.

api_key (string, Required)

You can specify this instead of the X-API-Key header. This is particularly useful in the browser, where the WebSocket API does not support custom headers.

You do not need to specify this if you are passing the header.
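
Putting the query parameters together, a sketch of building the handshake URL with urllib.parse; the VAD values shown are arbitrary examples, not recommended defaults:

```python
from urllib.parse import urlencode

params = {
    "model": "ink-whisper",
    "encoding": "pcm_s16le",
    "sample_rate": 16000,
    "min_volume": 0.15,                  # arbitrary example threshold
    "max_silence_duration_secs": 0.8,    # arbitrary example endpointing window
    "api_key": "YOUR_API_KEY",           # placeholder
}
url = "wss://api.cartesia.ai/stt/websocket?" + urlencode(params)
```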

Send

Send Audio Data or Commands (string, Required)

In Practice:

  • Send binary WebSocket messages containing raw audio data in the format specified by the encoding parameter
  • Send text WebSocket messages with commands:
    • finalize - Flush any remaining audio and receive flush_done acknowledgment
    • done - Flush remaining audio, close session, and receive done acknowledgment

Timeout Behavior:

  • If no audio data is sent for 20 seconds, the WebSocket will automatically disconnect
  • The timeout resets with each message (audio data or text command) sent to the server

Audio Requirements:

  • Send audio in small chunks (e.g., 100ms intervals) for optimal latency; see the chunk-size sketch below
  • Audio format must match the encoding and sample_rate parameters
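
The chunk-size arithmetic for the recommended format: at 16 kHz mono pcm_s16le, 100 ms of audio is 16000 samples/s × 2 bytes × 0.1 s = 3200 bytes. A sketch of a sender paced roughly in real time (ws is an open connection as in the earlier sketches):

```python
import asyncio

SAMPLE_RATE = 16000
BYTES_PER_SAMPLE = 2     # pcm_s16le, mono
CHUNK_SECONDS = 0.1      # 100 ms chunks for low latency
CHUNK_BYTES = int(SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_SECONDS)  # 3200

async def stream_file(ws, path: str) -> None:
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_BYTES):
            await ws.send(chunk)                # binary frame; resets the 20 s idle timeout
            await asyncio.sleep(CHUNK_SECONDS)  # pace approximately in real time
```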

Receive

Receive Transcription (object, Required)

The server will send transcription results as they become available. Messages can be of type transcript, flush_done, done, or error. Each transcript response includes word-level timestamps.
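
A sketch of a receive loop dispatching on those message types. The type values come from this page; the payload field names (text, words, word, start, end) are assumptions for illustration only.

```python
import json

async def receive_loop(ws) -> None:
    async for raw in ws:
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind == "transcript":
            print(msg.get("text"))
            for w in msg.get("words", []):  # word-level timestamps (assumed shape)
                print(w.get("word"), w.get("start"), w.get("end"))
        elif kind == "flush_done":
            pass        # acknowledgment of a "finalize" command
        elif kind == "done":
            break       # acknowledgment of "done"; the session is closing
        elif kind == "error":
            raise RuntimeError(msg)
```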