{
  "type": "transcript",
  "is_final": false,
  "request_id": "58dfa4d4-91c5-410c-8529-6824c8f7aedc",
  "text": "How are you doing today?",
  "duration": 0.5,
  "language": "en",
  "words": [
    { "word": "How", "start": 0, "end": 0.12 },
    { "word": "are", "start": 0.15, "end": 0.25 },
    { "word": "you", "start": 0.28, "end": 0.35 },
    { "word": "doing", "start": 0.38, "end": 0.55 },
    { "word": "today?", "start": 0.58, "end": 0.78 }
  ]
}

{
  "type": "flush_done",
  "request_id": "b67e1c5d-2f4c-4c3d-9f82-96eb4d2f12a8"
}

{
  "type": "done",
  "request_id": "b67e1c5d-2f4c-4c3d-9f82-96eb4d2f12a8"
}

{
  "type": "<string>",
  "error": "<string>",
  "request_id": "<string>"
}

ID of the model to use for transcription. Use 'ink-whisper' for the latest Cartesia Whisper model.
The language of the input audio in ISO-639-1 format. Defaults to en.
Supported languages: en, zh, de, es, ru, ko, fr, ja, pt, tr, pl, ca, nl, ar, sv, it, id, hi, fi, vi, he, uk, el, ms, cs, ro, da, hu, ta, no, th, ur, hr, bg, lt, la, mi, ml, cy, sk, te, fa, lv, bn, sr, az, sl, kn, et, mk, br, eu, is, hy, ne, mn, bs, kk, sq, sw, gl, mr, pa, si, km, sn, yo, so, af, oc, ka, be, tg, sd, gu, am, yi, lo, uz, fo, ht, ps, tk, nn, mt, sa, lb, my, bo, tl, mg, as, tt, haw, ln, ha, ba, jw, su, yue
The encoding format of the audio data. This determines how the server interprets the raw binary audio data you send.
Required field - you must specify the encoding format that matches your audio data. We recommend using pcm_s16le for best performance.
The sample rate of the audio in Hz.
Required field - must match the actual sample rate of your audio data. We recommend using 16000 for best performance.
Volume threshold for voice activity detection. Audio below this threshold is treated as silence. Range: 0.0-1.0. Higher values filter quiet speech more aggressively.
Maximum duration of silence (in seconds) before the system considers the utterance complete and triggers endpointing. Higher values allow for longer pauses within utterances.
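The connection parameters described above are passed as query parameters on the WebSocket URL. A minimal sketch in Python; the endpoint path and the exact parameter names (model, language, encoding, sample_rate, min_volume, max_silence_duration_secs) are assumptions based on the field descriptions above and should be checked against the API reference:

```python
from urllib.parse import urlencode

# Hypothetical endpoint URL; verify against the official API reference.
STT_WS_ENDPOINT = "wss://api.cartesia.ai/stt/websocket"

def build_stt_url(model="ink-whisper", language="en",
                  encoding="pcm_s16le", sample_rate=16000,
                  min_volume=None, max_silence_duration_secs=None):
    """Build the streaming STT URL.

    encoding and sample_rate are required and must match the raw audio
    you stream; the two voice-activity tuning parameters are optional.
    Parameter names here are assumptions, not confirmed API names.
    """
    params = {
        "model": model,
        "language": language,
        "encoding": encoding,
        "sample_rate": sample_rate,
    }
    if min_volume is not None:
        params["min_volume"] = min_volume
    if max_silence_duration_secs is not None:
        params["max_silence_duration_secs"] = max_silence_duration_secs
    return f"{STT_WS_ENDPOINT}?{urlencode(params)}"
```

The recommended defaults (pcm_s16le at 16000 Hz) are baked in, so most callers only override them when their audio source differs.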
You can specify this instead of the X-API-Key header. This is particularly useful in the browser, where WebSockets do not support custom headers. You do not need to specify this if you are passing the header.
API key passed in header
API key passed as query parameter (useful for browser WebSockets)
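The two authentication paths can be sketched as follows. The X-API-Key header name and the api_key query parameter come from the text above; the helper itself is illustrative:

```python
from urllib.parse import urlencode

def with_auth(url, api_key, use_query_param=False):
    """Return a (url, headers) pair for opening the WebSocket.

    Browsers cannot attach headers to WebSocket connections, so pass
    the key as an api_key query parameter there; server-side clients
    should prefer the X-API-Key header. Only one of the two is needed.
    """
    if use_query_param:
        sep = "&" if "?" in url else "?"
        return url + sep + urlencode({"api_key": api_key}), {}
    return url, {"X-API-Key": api_key}
```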
Send 'finalize' as a text message to flush any remaining audio; the server replies with a flush_done acknowledgment.
Send 'done' as a text message to flush remaining audio and close the session; the server replies with a done acknowledgment.
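The send side of a session is therefore: stream binary audio frames, then the two text commands. A sketch written against any send callable (for example, the send method of your WebSocket client), so it carries no client-library assumptions:

```python
def stream_session(send, audio_chunks):
    """Stream raw audio, then flush and close as described above.

    `send` is whatever your WebSocket client uses to transmit a frame.
    """
    for chunk in audio_chunks:
        send(chunk)       # binary frame: raw audio in the declared encoding
    send("finalize")      # text frame: flush remaining audio (expect flush_done)
    send("done")          # text frame: flush and close the session (expect done)
```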
The server will send transcription results as they become available. Messages can be of type transcript, flush_done, done, or error. Each transcript response includes word-level timestamps.
Acknowledgment that finalize command was received
Acknowledgment that session is closing
Error information for STT
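A receive loop dispatches on the type field. A minimal handler covering the four message shapes shown above, including the word-level timestamps carried by every transcript:

```python
import json

def handle_message(raw):
    """Dispatch one server message by its 'type' field."""
    msg = json.loads(raw)
    kind = msg["type"]
    if kind == "transcript":
        # every transcript carries word-level start/end timestamps
        return [(w["word"], w["start"], w["end"]) for w in msg["words"]]
    if kind == "flush_done":
        return f"finalize acknowledged for {msg['request_id']}"
    if kind == "done":
        return f"session closing for {msg['request_id']}"
    if kind == "error":
        raise RuntimeError(msg["error"])
    raise ValueError(f"unexpected message type: {kind!r}")
```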