Speech-to-Text (Streaming)
This endpoint creates a bidirectional WebSocket connection for real-time speech transcription.
Our STT endpoint accepts a stream of raw audio bytes and returns transcription results as they become available.
Usage Pattern:
- Connect to the WebSocket with the appropriate query parameters
- Send audio chunks as binary WebSocket messages in the specified encoding format
- Receive transcription messages as JSON with word-level timestamps
- Send `finalize` as a text message to flush any remaining audio (receives a `flush_done` acknowledgment)
- Send `done` as a text message to close the session cleanly (receives a `done` acknowledgment, then the connection closes)
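The steps above can be sketched as a small driver function. This is a minimal illustration, not the official SDK: the transport wrapper and the JSON field names (`type`, `text`) are assumptions to check against the response schema.

```python
import json

def run_session(ws, audio_chunks):
    """Drive one STT session over `ws`, any object exposing
    send(bytes | str) and recv() -> str (e.g. a WebSocket wrapper).

    For clarity this sends all audio first and reads replies afterwards;
    a production client would read transcripts concurrently while
    sending. The message field names ("type", "text") are assumptions.
    """
    texts = []
    for chunk in audio_chunks:
        ws.send(chunk)            # binary audio frames
    ws.send("finalize")           # flush any remaining audio
    while True:                   # drain until the flush is acknowledged
        msg = json.loads(ws.recv())
        if msg["type"] == "transcript":
            texts.append(msg["text"])
        elif msg["type"] == "flush_done":
            break
    ws.send("done")               # close the session cleanly
    while json.loads(ws.recv())["type"] != "done":
        pass                      # server acknowledges, then closes
    return texts
```

In a real client, `ws` would wrap an established WebSocket connection; here it is injected so the command ordering (audio, `finalize`, `done`) is easy to follow.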
Performance Recommendation:
For best performance, it is recommended to resample audio before streaming and send audio chunks in `pcm_s16le` format at a 16 kHz sample rate.
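The recommended format makes buffer sizing simple arithmetic: `pcm_s16le` is 16-bit (2-byte) samples, so mono audio at 16 kHz is 32,000 bytes per second. A small helper (the function name is ours, not part of the API):

```python
# Byte-rate arithmetic for the recommended format: pcm_s16le
# (16-bit signed little-endian PCM, 2 bytes per sample), mono, 16 kHz.
BYTES_PER_SAMPLE = 2
SAMPLE_RATE = 16_000
BYTES_PER_SECOND = BYTES_PER_SAMPLE * SAMPLE_RATE  # 32,000 bytes/sec

def chunk_size_bytes(interval_ms: int) -> int:
    """Bytes of pcm_s16le audio covering `interval_ms` of speech."""
    return BYTES_PER_SECOND * interval_ms // 1000
```

For example, a 100 ms chunk is 3,200 bytes.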
Pricing: Speech-to-text streaming is priced at 1 credit per second of audio streamed in.
Concurrency: STT has a dedicated concurrency limit, which determines the maximum number of active WebSocket connections you can have at any time. If you exceed your concurrency limit, new connections will be rejected with a 429 error. Idle WebSocket connections are automatically closed after 20 seconds of inactivity (no audio being streamed).
Handshake
Headers
Query parameters
ID of the model to use for transcription. Use `ink-whisper` for the latest Cartesia Whisper model.
The language of the input audio in ISO-639-1 format. Defaults to `en`.
Supported languages: `en` (English), `zh` (Chinese), `de` (German), `es` (Spanish), `ru` (Russian), `ko` (Korean), `fr` (French), `ja` (Japanese), `pt` (Portuguese), `tr` (Turkish), `pl` (Polish), `ca` (Catalan), `nl` (Dutch), `ar` (Arabic), `sv` (Swedish), `it` (Italian), `id` (Indonesian), `hi` (Hindi), `fi` (Finnish), `vi` (Vietnamese), `he` (Hebrew), `uk` (Ukrainian), `el` (Greek), `ms` (Malay), `cs` (Czech), `ro` (Romanian), `da` (Danish), `hu` (Hungarian), `ta` (Tamil), `no` (Norwegian), `th` (Thai), `ur` (Urdu), `hr` (Croatian), `bg` (Bulgarian), `lt` (Lithuanian), `la` (Latin), `mi` (Maori), `ml` (Malayalam), `cy` (Welsh), `sk` (Slovak), `te` (Telugu), `fa` (Persian), `lv` (Latvian), `bn` (Bengali), `sr` (Serbian), `az` (Azerbaijani), `sl` (Slovenian), `kn` (Kannada), `et` (Estonian), `mk` (Macedonian), `br` (Breton), `eu` (Basque), `is` (Icelandic), `hy` (Armenian), `ne` (Nepali), `mn` (Mongolian), `bs` (Bosnian), `kk` (Kazakh), `sq` (Albanian), `sw` (Swahili), `gl` (Galician), `mr` (Marathi), `pa` (Punjabi), `si` (Sinhala), `km` (Khmer), `sn` (Shona), `yo` (Yoruba), `so` (Somali), `af` (Afrikaans), `oc` (Occitan), `ka` (Georgian), `be` (Belarusian), `tg` (Tajik), `sd` (Sindhi), `gu` (Gujarati), `am` (Amharic), `yi` (Yiddish), `lo` (Lao), `uz` (Uzbek), `fo` (Faroese), `ht` (Haitian Creole), `ps` (Pashto), `tk` (Turkmen), `nn` (Nynorsk), `mt` (Maltese), `sa` (Sanskrit), `lb` (Luxembourgish), `my` (Myanmar), `bo` (Tibetan), `tl` (Tagalog), `mg` (Malagasy), `as` (Assamese), `tt` (Tatar), `haw` (Hawaiian), `ln` (Lingala), `ha` (Hausa), `ba` (Bashkir), `jw` (Javanese), `su` (Sundanese), `yue` (Cantonese)
The encoding format of the audio data. This determines how the server interprets the raw binary audio data you send. Required field - you must specify the encoding format that matches your audio data. We recommend using `pcm_s16le` for best performance.
The sample rate of the audio in Hz. Required field - must match the actual sample rate of your audio data. We recommend using `16000` for best performance.
Volume threshold for voice activity detection. Audio below this threshold is treated as silence. Range: 0.0-1.0; higher values filter quiet speech more aggressively.
Maximum duration of silence (in seconds) before the system considers the utterance complete and triggers endpointing. Higher values allow for longer pauses within utterances.
You can specify this instead of the `X-API-Key` header. This is particularly useful in the browser, where WebSockets do not support custom headers. You do not need to specify this if you are passing the header.
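For the browser-style flow, the credentials and settings all travel in the query string of the handshake URL. The sketch below shows one way to assemble it; the base URL and the exact query-parameter names (in particular the API-key parameter) are assumptions for illustration and should be checked against this reference.

```python
from urllib.parse import urlencode

def build_stt_url(api_key: str,
                  base: str = "wss://api.cartesia.ai/stt/websocket") -> str:
    """Assemble the handshake URL for a browser client.

    The base URL and the `api_key` query-parameter name are assumptions
    for illustration. When custom headers are available (i.e. outside
    the browser), prefer the X-API-Key header and omit the key from the
    query string.
    """
    params = {
        "model": "ink-whisper",     # latest Cartesia Whisper model
        "language": "en",
        "encoding": "pcm_s16le",    # recommended format
        "sample_rate": 16000,       # recommended rate (Hz)
        "api_key": api_key,
    }
    return f"{base}?{urlencode(params)}"
```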
Send
In Practice:
- Send binary WebSocket messages containing raw audio data in the format specified by the `encoding` parameter
- Send text WebSocket messages with commands:
  - `finalize` - Flush any remaining audio and receive a `flush_done` acknowledgment
  - `done` - Flush remaining audio, close the session, and receive a `done` acknowledgment
Timeout Behavior:
- If no audio data is sent for 20 seconds, the WebSocket will automatically disconnect
- The timeout resets with each message (audio data or text command) sent to the server
Audio Requirements:
- Send audio in small chunks (e.g., 100ms intervals) for optimal latency
- Audio format must match the `encoding` and `sample_rate` parameters
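Slicing a PCM buffer into interval-sized binary messages is straightforward; this generator (a helper of our own, not part of the API) yields ~100 ms frames by default:

```python
def iter_chunks(pcm: bytes, sample_rate: int = 16000,
                interval_ms: int = 100, bytes_per_sample: int = 2):
    """Yield successive `interval_ms` slices of a pcm_s16le buffer.

    Each slice is suitable as one binary WebSocket message; the final
    slice may be shorter than the rest.
    """
    step = sample_rate * bytes_per_sample * interval_ms // 1000
    for i in range(0, len(pcm), step):
        yield pcm[i:i + step]
```

With the recommended 16 kHz `pcm_s16le` settings, each full chunk is 3,200 bytes; in a live-capture client you would pace the sends to roughly real time rather than blasting the whole buffer at once.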
Receive
The server sends transcription results as they become available. Messages can be of type `transcript`, `flush_done`, `done`, or `error`. Each transcript response includes word-level timestamps.
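A receive loop typically dispatches on the message type. In this sketch the payload field names (`text`, `words`, `word`, `start`, `message`) are illustrative assumptions; consult the response schema for the actual shape of each message.

```python
import json

def handle_message(raw: str) -> str:
    """Classify one server message and return a printable summary.

    The field names inside each payload are assumptions for
    illustration, not the documented schema.
    """
    msg = json.loads(raw)
    kind = msg["type"]
    if kind == "transcript":
        # hypothetical word-level timestamp shape: {word, start, end}
        words = ", ".join(f'{w["word"]}@{w["start"]:.2f}s'
                          for w in msg.get("words", []))
        return f'transcript: "{msg.get("text", "")}" [{words}]'
    if kind in ("flush_done", "done"):
        return f"ack: {kind}"
    if kind == "error":
        return f'error: {msg.get("message", "")}'
    return f"unknown: {kind}"
```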