WSS
wss://api.cartesia.ai/stt/websocket
Messages
model
type:string
required

ID of the model to use for transcription. Use 'ink-whisper' for the latest Cartesia Whisper model.

language
type:string
required

The language of the input audio, as an ISO 639-1 code. Defaults to en.

Supported languages: en, zh, de, es, ru, ko, fr, ja, pt, tr, pl, ca, nl, ar, sv, it, id, hi, fi, vi, he, uk, el, ms, cs, ro, da, hu, ta, no, th, ur, hr, bg, lt, la, mi, ml, cy, sk, te, fa, lv, bn, sr, az, sl, kn, et, mk, br, eu, is, hy, ne, mn, bs, kk, sq, sw, gl, mr, pa, si, km, sn, yo, so, af, oc, ka, be, tg, sd, gu, am, yi, lo, uz, fo, ht, ps, tk, nn, mt, sa, lb, my, bo, tl, mg, as, tt, haw, ln, ha, ba, jw, su, yue

encoding
type:string
required

The encoding format of the audio data. This determines how the server interprets the raw binary audio data you send.

You must specify the encoding that matches your audio data. We recommend pcm_s16le for best performance.

sample_rate
type:string
required

The sample rate of the audio in Hz.

Must match the actual sample rate of your audio data. We recommend 16000 for best performance.

min_volume
type:string
required

Volume threshold for voice activity detection; audio below this threshold is treated as silence. Range: 0.0-1.0. Higher values filter quiet speech more aggressively.

max_silence_duration_secs
type:string
required

Maximum duration of silence (in seconds) before the system considers the utterance complete and triggers endpointing. Higher values allow for longer pauses within utterances.

api_key
type:string
required

You can pass your API key here instead of in the X-API-Key header. This is particularly useful in the browser, where the WebSocket API does not support custom headers. You do not need to specify this if you are passing the header.
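
As a sketch, assuming the connection parameters above are passed as query parameters on the handshake URL (YOUR_API_KEY is a placeholder), a browser-compatible URL can be built like this:

```python
from urllib.parse import urlencode

def build_stt_url(api_key: str, *, model: str = "ink-whisper",
                  language: str = "en", encoding: str = "pcm_s16le",
                  sample_rate: int = 16000) -> str:
    """Build the STT WebSocket URL with the API key as a query parameter."""
    params = urlencode({
        "model": model,
        "language": language,
        "encoding": encoding,
        "sample_rate": sample_rate,
        "api_key": api_key,  # stands in for the X-API-Key header in browsers
    })
    return f"wss://api.cartesia.ai/stt/websocket?{params}"

url = build_stt_url("YOUR_API_KEY")
```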

X-API-Key
type:httpApiKey

API key passed in header

api_key
type:httpApiKey

API key passed as query parameter (useful for browser WebSockets)

Send Audio Data or Commands
type:string

In Practice:

  • Send binary WebSocket messages containing raw audio data in the format specified by encoding parameter
  • Send text WebSocket messages with commands:
      • finalize: flush any remaining audio and receive a flush_done acknowledgment
      • done: flush remaining audio, close the session, and receive a done acknowledgment

Timeout Behavior:

  • If no audio data is sent for 3 minutes, the WebSocket will automatically disconnect
  • The timeout resets with each message (audio data or text command) sent to the server

Audio Requirements:

  • Send audio in small chunks (e.g., 100ms intervals) for optimal latency
  • Audio format must match the encoding and sample_rate parameters
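
A minimal sketch of the chunk-size math: pcm_s16le carries 2 bytes per sample, so a 100 ms chunk at 16000 Hz is 16000 * 2 * 0.1 = 3200 bytes. The helper below (an illustration, not part of the API) splits a raw buffer accordingly:

```python
def chunk_pcm(audio: bytes, sample_rate: int = 16000,
              chunk_ms: int = 100) -> list[bytes]:
    """Split raw pcm_s16le audio into ~chunk_ms frames for streaming."""
    # 2 bytes per sample for 16-bit PCM; the final chunk may be shorter.
    chunk_bytes = sample_rate * 2 * chunk_ms // 1000
    return [audio[i:i + chunk_bytes]
            for i in range(0, len(audio), chunk_bytes)]

# One second of 16 kHz silence splits into ten 3200-byte chunks.
chunks = chunk_pcm(bytes(32000))
```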
Finalize Command
type:string

Send 'finalize' as a text message to flush any remaining audio and receive a flush_done acknowledgment.

Done Command
type:string

Send 'done' as a text message to flush remaining audio, close the session, and receive a done acknowledgment.
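
A minimal end-to-end sketch, assuming the third-party `websockets` package and a pre-built connection URL; the send order mirrors the commands above (binary audio frames, then finalize, then done):

```python
import asyncio
import json

async def transcribe(url: str, audio_chunks) -> list[str]:
    """Stream audio chunks, then flush and close; collect transcript text."""
    # Third-party dependency: pip install websockets
    import websockets

    texts = []
    async with websockets.connect(url) as ws:
        for chunk in audio_chunks:
            await ws.send(chunk)       # binary frame: raw audio
            await asyncio.sleep(0.1)   # pace chunks at ~100 ms intervals
        await ws.send("finalize")      # text frame: flush buffered audio
        await ws.send("done")          # text frame: flush and close session
        async for raw in ws:
            msg = json.loads(raw)
            if msg["type"] == "transcript":
                # The "text" field name is an assumption about the payload.
                texts.append(msg.get("text", ""))
            elif msg["type"] == "done":
                break                  # server acknowledged session close
    return texts
```

For brevity this sketch reads responses only after sending everything; a production client would read transcripts concurrently with sending so results arrive with low latency.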

Receive Transcription
type:object

The server will send transcription results as they become available. Messages can be of type transcript, flush_done, done, or error. Each transcript response includes word-level timestamps.
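
A sketch of a client-side dispatcher over those four message types; the payload field names `text`, `words`, and `message` are assumptions for illustration, not confirmed by this reference:

```python
import json

def handle_message(raw: str) -> str:
    """Dispatch a server message by its `type` field; return a summary."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "transcript":
        # `text` and `words` (word-level timestamps) are assumed field names.
        words = msg.get("words", [])
        return f"transcript: {msg.get('text', '')!r} (words={len(words)})"
    if kind == "flush_done":
        return "server flushed remaining audio"
    if kind == "done":
        return "session closing"
    if kind == "error":
        return f"error: {msg.get('message', 'unknown')}"
    return f"unhandled message type: {kind!r}"
```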

Flush Done Response
type:object

Acknowledgment that the finalize command was received.

Done Response
type:object

Acknowledgment that the session is closing.

STT Error Response
type:object

Error information for STT