WSS /stt/websocket


Messages
model
type:string
required

ID of the model to use for transcription. See Models for available models.

language
type:string
required

The language of the input audio in ISO-639-1 format. Defaults to en.

See Models for supported languages.

encoding
type:string
required

The encoding format of the audio data. This determines how the server interprets the raw binary audio data you send.

For guidance on choosing an encoding, see Audio encodings.

sample_rate
type:string
required

The sample rate of the audio in Hz.

min_volume
type:string
required

Volume threshold for voice activity detection, in the range 0.0 to 1.0. Audio below this threshold is treated as silence. Higher values filter quiet speech more aggressively.

max_silence_duration_secs
type:string
required

Maximum duration of silence (in seconds) before the system considers the utterance complete and triggers endpointing. Higher values allow longer pauses within an utterance.

cartesia_version
type:string
required

API version, e.g. 2024-06-10. You can specify this instead of the Cartesia-Version header. This is particularly useful in the browser, where WebSockets do not support headers. You do not need to specify this if you are passing the header.

X-API-Key
type:httpApiKey

API key passed in a header.

access_token
type:httpApiKey

A short-lived access token passed in a query param to make API requests from a client. This is particularly useful in the browser, where WebSockets do not support headers. See Authenticate client apps to generate an access token.
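The connection parameters above are passed as a query string on the WebSocket URL, with access_token used in place of the X-API-Key header when connecting from a browser. A minimal sketch of building that URL, assuming the wss://api.cartesia.ai host and the /stt/websocket path from this page's breadcrumb, and using illustrative parameter values:

```python
from urllib.parse import urlencode

def build_stt_ws_url(access_token: str, model: str, language: str = "en",
                     encoding: str = "pcm_s16le", sample_rate: int = 16000,
                     cartesia_version: str = "2024-06-10") -> str:
    """Build the STT WebSocket URL with connection parameters as a query string.

    Host and path are assumed from this page's breadcrumb; check the docs
    for the canonical base URL. cartesia_version stands in for the
    Cartesia-Version header, which browsers cannot set on WebSockets.
    """
    params = {
        "model": model,
        "language": language,
        "encoding": encoding,
        "sample_rate": sample_rate,
        "cartesia_version": cartesia_version,
        "access_token": access_token,  # short-lived client token (query param)
    }
    return "wss://api.cartesia.ai/stt/websocket?" + urlencode(params)
```

From a server-side client you could instead authenticate with the X-API-Key header and omit access_token from the query string.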

Send Audio Data
type:string

Send binary WebSocket messages containing raw audio data in the format specified by the encoding and sample_rate connection parameters.

Audio Requirements:

  • Send audio in small chunks (e.g., 100ms intervals) for optimal latency
  • Audio format must match the encoding and sample_rate parameters

Timeout Behavior:

  • If no audio data is sent for 3 minutes, the WebSocket will automatically disconnect
  • The timeout resets with each audio chunk sent to the server
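The chunking requirement above can be sketched as a helper that slices raw audio into ~100 ms binary messages. This assumes 16-bit mono PCM (2 bytes per sample); adjust bytes_per_sample for other encodings:

```python
def chunk_audio(pcm_bytes: bytes, sample_rate: int, chunk_ms: int = 100,
                bytes_per_sample: int = 2):
    """Yield ~chunk_ms slices of raw mono PCM audio.

    Each slice would be sent as one binary WebSocket message; sending
    regularly also keeps the 3-minute inactivity timeout from firing.
    """
    chunk_bytes = sample_rate * bytes_per_sample * chunk_ms // 1000
    for i in range(0, len(pcm_bytes), chunk_bytes):
        yield pcm_bytes[i:i + chunk_bytes]
```

At 16 kHz, 16-bit mono, a 100 ms chunk is 16000 * 2 * 0.1 = 3200 bytes.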
Finalize Command
type:string

Send finalize as a text message to flush any remaining audio; the server responds with a flush_done acknowledgment.

Done Command
type:string

Send done as a text message to flush any remaining audio and close the session; the server responds with a done acknowledgment.

Receive Transcription
type:object

The server will send transcription results as they become available. Messages can be of type transcript, flush_done, done, or error. Each transcript response includes word-level timestamps.
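A receive loop therefore needs to branch on the message type. A minimal dispatch sketch, assuming server messages arrive as JSON text with a "type" field holding one of the four values above (field names other than "type" are illustrative; consult the response schemas for exact shapes):

```python
import json

def handle_message(raw: str) -> dict:
    """Parse one server text message and route on its type.

    Returns a small dict tagging the message kind; a real client would
    act on each branch (append transcript text, close the socket, etc.).
    """
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "transcript":
        # per the docs, each transcript carries word-level timestamps
        return {"kind": "transcript", "payload": msg}
    if kind == "flush_done":
        return {"kind": "flush_done", "payload": msg}  # finalize acknowledged
    if kind == "done":
        return {"kind": "done", "payload": msg}        # session closing
    if kind == "error":
        return {"kind": "error", "payload": msg}
    return {"kind": "unknown", "payload": msg}
```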

Flush Done Response
type:object

Acknowledgment that the finalize command was received.

Done Response
type:object

Acknowledgment that the session is closing.

Error Response
type:object

Error information for STT WebSocket connections.