Speech-to-Text (Streaming)

This endpoint creates a bidirectional WebSocket connection for real-time speech transcription.

Our STT endpoint accepts a stream of audio bytes and returns transcription results as they become available.

Usage Pattern (sketched in code after this list):

  1. Connect to the WebSocket with appropriate query parameters
  2. Send audio chunks as binary WebSocket messages in the specified encoding format
  3. Receive transcription messages as JSON with word-level timestamps
  4. Send finalize as a text message to flush any remaining audio (receives flush_done acknowledgment)
  5. Send done as a text message to close the session cleanly (receives done acknowledgment and closes)
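
As a concrete sketch of this flow, here is a minimal client using the third-party websockets Python package. The URL, query parameters, and the finalize/done commands come from this page; the API key, file name, and the "type" field on responses are placeholder assumptions.

```python
import asyncio
import json

import websockets

URL = (
    "wss://api.cartesia.ai/stt/websocket"
    "?model=ink-whisper"
    "&encoding=pcm_s16le"
    "&sample_rate=16000"
    "&api_key=YOUR_API_KEY"  # placeholder; or pass the X-API-Key header instead
)

async def transcribe(path: str) -> None:
    # Step 1: connect (older websockets releases spell this `extra_headers`).
    async with websockets.connect(
        URL, additional_headers={"Cartesia-Version": "2025-04-16"}
    ) as ws:
        # Step 2: stream raw audio as binary messages.
        with open(path, "rb") as f:
            while chunk := f.read(3200):  # 100 ms of 16 kHz mono pcm_s16le
                await ws.send(chunk)
        # Steps 4 and 5: flush any remaining audio, then close the session.
        await ws.send("finalize")
        await ws.send("done")
        # Step 3: print results until the server acknowledges "done".
        async for raw in ws:
            msg = json.loads(raw)
            print(msg)
            if msg.get("type") == "done":
                break

asyncio.run(transcribe("speech_16khz.raw"))
```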

Performance Recommendation: Resample audio before streaming and send chunks in pcm_s16le format at a 16 kHz sample rate; the sketch below shows one way to do this.
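
One way to do that resampling with only the Python standard library is sketched below; note that audioop is deprecated since Python 3.11 and removed in 3.13, and any resampler (ffmpeg, soxr, etc.) works just as well.

```python
import audioop
import wave

# Read the source WAV file (assumed PCM; the name is a placeholder).
with wave.open("input.wav", "rb") as wav:
    rate = wav.getframerate()
    width = wav.getsampwidth()
    channels = wav.getnchannels()
    frames = wav.readframes(wav.getnframes())

if channels == 2:
    frames = audioop.tomono(frames, width, 0.5, 0.5)  # downmix stereo to mono
if width != 2:
    frames = audioop.lin2lin(frames, width, 2)        # convert to 16-bit samples
pcm_s16le, _ = audioop.ratecv(frames, 2, 1, rate, 16000, None)  # resample to 16 kHz
```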

Pricing: Speech-to-text streaming is priced at 1 credit per second of audio streamed in.

Concurrency: STT has a dedicated concurrency limit, which determines the maximum number of active WebSocket connections you can have at any time. If you exceed your concurrency limit, new connections will be rejected with a 429 error. Idle WebSocket connections are automatically closed after 20 seconds of inactivity (no audio being streamed).
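
If a connection is rejected with a 429, one option is to retry with exponential backoff until a slot frees up. A hedged sketch, assuming the websockets package (recent releases raise InvalidStatus carrying the failed handshake's HTTP response):

```python
import asyncio

import websockets
from websockets.exceptions import InvalidStatus

async def connect_with_backoff(url: str, max_attempts: int = 5):
    delay = 1.0
    for _ in range(max_attempts):
        try:
            return await websockets.connect(url)
        except InvalidStatus as exc:
            if exc.response.status_code != 429:
                raise                     # not a concurrency rejection
            await asyncio.sleep(delay)    # wait for a connection slot to free up
            delay *= 2
    raise RuntimeError("still over the concurrency limit after retries")
```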

Handshake

WSS
wss://api.cartesia.ai/stt/websocket

Headers

Cartesia-Version"2025-04-16"Required

Query parameters

model (string, Required)

ID of the model to use for transcription. Use ink-whisper for the latest Cartesia Whisper model.

language (string, Optional)

The language of the input audio in ISO-639-1 format. Defaults to en.

  • en (English)
  • zh (Chinese)
  • de (German)
  • es (Spanish)
  • ru (Russian)
  • ko (Korean)
  • fr (French)
  • ja (Japanese)
  • pt (Portuguese)
  • tr (Turkish)
  • pl (Polish)
  • ca (Catalan)
  • nl (Dutch)
  • ar (Arabic)
  • sv (Swedish)
  • it (Italian)
  • id (Indonesian)
  • hi (Hindi)
  • fi (Finnish)
  • vi (Vietnamese)
  • he (Hebrew)
  • uk (Ukrainian)
  • el (Greek)
  • ms (Malay)
  • cs (Czech)
  • ro (Romanian)
  • da (Danish)
  • hu (Hungarian)
  • ta (Tamil)
  • no (Norwegian)
  • th (Thai)
  • ur (Urdu)
  • hr (Croatian)
  • bg (Bulgarian)
  • lt (Lithuanian)
  • la (Latin)
  • mi (Maori)
  • ml (Malayalam)
  • cy (Welsh)
  • sk (Slovak)
  • te (Telugu)
  • fa (Persian)
  • lv (Latvian)
  • bn (Bengali)
  • sr (Serbian)
  • az (Azerbaijani)
  • sl (Slovenian)
  • kn (Kannada)
  • et (Estonian)
  • mk (Macedonian)
  • br (Breton)
  • eu (Basque)
  • is (Icelandic)
  • hy (Armenian)
  • ne (Nepali)
  • mn (Mongolian)
  • bs (Bosnian)
  • kk (Kazakh)
  • sq (Albanian)
  • sw (Swahili)
  • gl (Galician)
  • mr (Marathi)
  • pa (Punjabi)
  • si (Sinhala)
  • km (Khmer)
  • sn (Shona)
  • yo (Yoruba)
  • so (Somali)
  • af (Afrikaans)
  • oc (Occitan)
  • ka (Georgian)
  • be (Belarusian)
  • tg (Tajik)
  • sd (Sindhi)
  • gu (Gujarati)
  • am (Amharic)
  • yi (Yiddish)
  • lo (Lao)
  • uz (Uzbek)
  • fo (Faroese)
  • ht (Haitian Creole)
  • ps (Pashto)
  • tk (Turkmen)
  • nn (Nynorsk)
  • mt (Maltese)
  • sa (Sanskrit)
  • lb (Luxembourgish)
  • my (Myanmar)
  • bo (Tibetan)
  • tl (Tagalog)
  • mg (Malagasy)
  • as (Assamese)
  • tt (Tatar)
  • haw (Hawaiian)
  • ln (Lingala)
  • ha (Hausa)
  • ba (Bashkir)
  • jw (Javanese)
  • su (Sundanese)
  • yue (Cantonese)

encoding (enum, Required)

The encoding format of the audio data. This determines how the server interprets the raw binary audio data you send.

Required field - you must specify the encoding format that matches your audio data. We recommend using pcm_s16le for best performance.

sample_rate (integer, Required)

The sample rate of the audio in Hz.

Required field - must match the actual sample rate of your audio data. We recommend using 16000 for best performance.

min_volume (double, Optional)

Volume threshold for voice activity detection; audio below this threshold is treated as silence. Range: 0.0-1.0. Higher values filter quiet speech more aggressively.

max_silence_duration_secs (double, Optional)

Maximum duration of silence (in seconds) before the system considers the utterance complete and triggers endpointing. Higher values allow for longer pauses within utterances.

api_key (string, Required)

You can specify this instead of the X-API-Key header. This is particularly useful in the browser, where the WebSocket API does not support custom headers.

You do not need to specify this if you are passing the header.
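
Putting the query parameters together, a sketch of building the handshake URL with urllib.parse; the VAD values shown are arbitrary examples, not recommended defaults:

```python
from urllib.parse import urlencode

params = {
    "model": "ink-whisper",
    "encoding": "pcm_s16le",
    "sample_rate": 16000,
    "min_volume": 0.15,                  # arbitrary example threshold
    "max_silence_duration_secs": 0.8,    # arbitrary example endpointing window
    "api_key": "YOUR_API_KEY",           # placeholder
}
url = "wss://api.cartesia.ai/stt/websocket?" + urlencode(params)
```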

Send

Send Audio Data or Commands (string, Required)

In Practice:

  • Send binary WebSocket messages containing raw audio data in the format specified by the encoding parameter
  • Send text WebSocket messages with commands:
    • finalize - Flush any remaining audio and receive flush_done acknowledgment
    • done - Flush remaining audio, close session, and receive done acknowledgment

Timeout Behavior:

  • If no audio data is sent for 20 seconds, the WebSocket will automatically disconnect
  • The timeout resets with each message (audio data or text command) sent to the server

Audio Requirements:

  • Send audio in small chunks (e.g., 100ms intervals) for optimal latency; see the chunk-size sketch below
  • Audio format must match the encoding and sample_rate parameters
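
The chunk-size arithmetic for the recommended format: at 16 kHz mono pcm_s16le, 100 ms of audio is 16000 samples/s × 2 bytes × 0.1 s = 3200 bytes. A sketch of a sender paced roughly in real time (ws is an open connection as in the earlier sketches):

```python
import asyncio

SAMPLE_RATE = 16000
BYTES_PER_SAMPLE = 2     # pcm_s16le, mono
CHUNK_SECONDS = 0.1      # 100 ms chunks for low latency
CHUNK_BYTES = int(SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_SECONDS)  # 3200

async def stream_file(ws, path: str) -> None:
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_BYTES):
            await ws.send(chunk)                # binary frame; resets the 20 s idle timeout
            await asyncio.sleep(CHUNK_SECONDS)  # pace approximately in real time
```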

Receive

Receive Transcription (object, Required)

The server will send transcription results as they become available. Messages can be of type transcript, flush_done, done, or error. Each transcript response includes word-level timestamps.
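
A sketch of a receive loop dispatching on those message types. The type values come from this page; the payload field names (text, words, word, start, end) are assumptions for illustration only.

```python
import json

async def receive_loop(ws) -> None:
    async for raw in ws:
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind == "transcript":
            print(msg.get("text"))
            for w in msg.get("words", []):  # word-level timestamps (assumed shape)
                print(w.get("word"), w.get("start"), w.get("end"))
        elif kind == "flush_done":
            pass        # acknowledgment of a "finalize" command
        elif kind == "done":
            break       # acknowledgment of "done"; the session is closing
        elif kind == "error":
            raise RuntimeError(msg)
```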