Speech to Text (Streaming)

This endpoint creates a bidirectional WebSocket connection for real-time speech transcription.

The STT endpoint accepts a stream of raw audio bytes (16-bit little-endian PCM at a 16 kHz sample rate) and returns transcription results as they become available.

Usage Pattern:

  1. Connect to the WebSocket with appropriate query parameters
  2. Send audio chunks as binary WebSocket messages in pcm_s16le format at a 16 kHz sample rate
  3. Receive transcription messages as JSON
  4. Send finalize as a text message to flush any remaining audio (the server replies with a flush_done acknowledgment)
  5. Send done as a text message to close the session cleanly (the server replies with a done acknowledgment and closes the connection)
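The steps above can be sketched as a minimal asynchronous Python client. This assumes the third-party websockets package; the URL, query parameters, and the finalize/done commands come from this page, but the fields inside transcript messages are not specified here, so only the type field is inspected:

```python
# Sketch of the five-step flow above, using the third-party `websockets`
# package (pip install websockets). Error handling is omitted for brevity.
import json


async def transcribe(pcm_chunks, api_key):
    # Lazy import so the sketch can be read without the dependency installed.
    import websockets  # third-party package, assumed available

    url = (
        "wss://api.cartesia.ai/stt/websocket"
        "?model=ink-whisper&language=en&encoding=pcm_s16le"
        f"&sample_rate=16000&api_key={api_key}"
    )
    # The kwarg is `additional_headers` in websockets >= 14 (`extra_headers` before).
    async with websockets.connect(
        url, additional_headers={"Cartesia-Version": "2025-04-16"}
    ) as ws:
        for chunk in pcm_chunks:
            await ws.send(chunk)      # step 2: binary frames of pcm_s16le audio
        await ws.send("finalize")     # step 4: flush any remaining audio
        await ws.send("done")         # step 5: ask the server to close cleanly
        async for raw in ws:          # step 3: JSON transcription messages
            msg = json.loads(raw)
            if msg.get("type") == "transcript":
                print(msg)
            elif msg.get("type") == "done":
                break                 # server acknowledged; session is over
```

In real use the receive loop would normally run concurrently with sending (for example via asyncio.gather), so transcripts arrive while audio is still streaming.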

Handshake

GET
wss://api.cartesia.ai/stt/websocket

Headers

Cartesia-Version "2025-04-16" Required

Query parameters

model string Required

ID of the model to use for transcription. Use ink-whisper for the latest Cartesia Whisper model.

language string Optional

The language of the input audio in ISO-639-1 format. Defaults to en.

encoding enum Optional

The encoding format of the audio data. This determines how the server interprets the raw binary audio data you send.

Currently supported: pcm_s16le - 16-bit signed integer PCM, little-endian (default)

Allowed values: pcm_s16le

sample_rate integer Required

The sample rate of the audio in Hz. Only 16000 is supported, and it is the default. Must match the actual sample rate of your audio data.

api_key string Required

You can specify this instead of the X-API-Key header. This is particularly useful in the browser, where WebSockets do not support custom headers.

You do not need to specify this if you are passing the header.
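Putting the query parameters together, the handshake URL can be assembled with the standard library. A sketch; the defaults shown are the documented ones:

```python
from urllib.parse import urlencode


def build_handshake_url(api_key, model="ink-whisper", language="en"):
    """Build the STT WebSocket URL with the key in the query string,
    for browser-style clients that cannot send an X-API-Key header."""
    params = {
        "model": model,
        "language": language,       # ISO-639-1, defaults to en
        "encoding": "pcm_s16le",    # only currently supported encoding
        "sample_rate": 16000,       # only supported sample rate
        "api_key": api_key,
    }
    return "wss://api.cartesia.ai/stt/websocket?" + urlencode(params)
```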

Send

Send Audio Data or Commands string Required

In Practice:

  • Send binary WebSocket messages containing raw audio data in the format specified by the encoding parameter
  • Send text WebSocket messages with commands:
    • finalize - Flush any remaining audio and receive flush_done acknowledgment
    • done - Flush remaining audio, close session, and receive done acknowledgment

Timeout Behavior:

  • If no audio data is sent for 20 seconds, the WebSocket will automatically disconnect
  • The timeout resets with each message (audio data or text command) sent to the server

Audio Requirements:

  • Send audio in small chunks (e.g., 100ms intervals) for optimal latency
  • Audio format must match the encoding and sample_rate parameters
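Both requirements can be handled before sending: convert samples to pcm_s16le and slice them into 100 ms chunks. At 16 kHz mono, 16-bit, 100 ms of audio is 16000 × 2 × 0.1 = 3200 bytes. A stdlib-only sketch (the clipping and scaling conventions are the usual ones, not mandated by the API):

```python
import struct


def floats_to_pcm_s16le(samples):
    """Convert float samples in [-1.0, 1.0] to raw 16-bit little-endian PCM."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    # '<' = little-endian, 'h' = signed 16-bit integer
    return struct.pack("<%dh" % len(clipped), *(int(s * 32767) for s in clipped))


def chunk_pcm(pcm, chunk_ms=100, sample_rate=16000):
    """Split raw pcm_s16le mono audio into fixed-duration chunks."""
    step = sample_rate * 2 * chunk_ms // 1000  # 2 bytes per 16-bit sample
    return [pcm[i:i + step] for i in range(0, len(pcm), step)]
```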

Receive

Receive Transcription object Required

The server will send transcription results as they become available. Messages can be of type transcript, flush_done, done, or error.
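Incoming messages can be dispatched on their type field. The payload fields beyond type are not listed on this page, so this sketch inspects type only:

```python
import json


def handle_message(raw):
    """Return the message type for transcript/flush_done/done; raise on error.

    Only the `type` field is documented on this page, so nothing else
    in the payload is assumed here.
    """
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "error":
        raise RuntimeError("server error: %s" % msg)
    if kind not in ("transcript", "flush_done", "done"):
        raise ValueError("unexpected message type: %r" % kind)
    return kind
```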