WSS /stt/websocket


Messages
model
type:string
required

ID of the model to use for transcription. See Models for available models.

language
type:string
required

The language of the input audio in ISO-639-1 format. Defaults to en.

See Models for supported languages.

encoding
type:string
required

The encoding format of the audio data. This determines how the server interprets the raw binary audio data you send.

For guidance on choosing an encoding, see Audio encodings.

sample_rate
type:string
required

The sample rate of the audio in Hz.

min_volume
type:string
required

Volume threshold for voice activity detection, in the range 0.0 to 1.0. Audio below this threshold is treated as silence. Higher values filter quiet speech more aggressively.

max_silence_duration_secs
type:string
required

Maximum duration of silence (in seconds) before the system considers the utterance complete and triggers endpointing. Higher values allow longer pauses within an utterance.

cartesia_version
type:string
required

API version, e.g. 2024-06-10. You can specify this instead of the Cartesia-Version header. This is particularly useful in the browser, where WebSockets do not support headers. You do not need to specify this if you are passing the header.

X-API-Key
type:httpApiKey

API key passed in a header.

access_token
type:httpApiKey

A short-lived access token passed in a query param to make API requests from a client. This is particularly useful in the browser, where WebSockets do not support headers. See Authenticate client apps to generate an access token.
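The connection parameters above are passed as a query string on the WebSocket URL, with access_token used in place of the X-API-Key header when connecting from a browser. A minimal sketch of building that URL, assuming the wss://api.cartesia.ai host and the /stt/websocket path from this page's breadcrumb, and using illustrative parameter values:

```python
from urllib.parse import urlencode

def build_stt_ws_url(access_token: str, model: str, language: str = "en",
                     encoding: str = "pcm_s16le", sample_rate: int = 16000,
                     cartesia_version: str = "2024-06-10") -> str:
    """Build the STT WebSocket URL with connection parameters as a query string.

    Host and path are assumed from this page's breadcrumb; check the docs
    for the canonical base URL. cartesia_version stands in for the
    Cartesia-Version header, which browsers cannot set on WebSockets.
    """
    params = {
        "model": model,
        "language": language,
        "encoding": encoding,
        "sample_rate": sample_rate,
        "cartesia_version": cartesia_version,
        "access_token": access_token,  # short-lived client token (query param)
    }
    return "wss://api.cartesia.ai/stt/websocket?" + urlencode(params)
```

From a server-side client you could instead authenticate with the X-API-Key header and omit access_token from the query string.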

Send Audio Data
type:string

Send binary WebSocket messages containing raw audio data in the format specified by the encoding and sample_rate connection parameters.

Audio Requirements:

  • Send audio in small chunks (e.g., 100ms intervals) for optimal latency
  • Audio format must match the encoding and sample_rate parameters

Timeout Behavior:

  • If no audio data is sent for 3 minutes, the WebSocket will automatically disconnect
  • The timeout resets with each audio chunk sent to the server
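The chunking requirement above can be sketched as a helper that slices raw audio into ~100 ms binary messages. This assumes 16-bit mono PCM (2 bytes per sample); adjust bytes_per_sample for other encodings:

```python
def chunk_audio(pcm_bytes: bytes, sample_rate: int, chunk_ms: int = 100,
                bytes_per_sample: int = 2):
    """Yield ~chunk_ms slices of raw mono PCM audio.

    Each slice would be sent as one binary WebSocket message; sending
    regularly also keeps the 3-minute inactivity timeout from firing.
    """
    chunk_bytes = sample_rate * bytes_per_sample * chunk_ms // 1000
    for i in range(0, len(pcm_bytes), chunk_bytes):
        yield pcm_bytes[i:i + chunk_bytes]
```

At 16 kHz, 16-bit mono, a 100 ms chunk is 16000 * 2 * 0.1 = 3200 bytes.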
Finalize Command
type:string

Send finalize as a text message to flush any remaining audio; the server responds with a flush_done acknowledgment.

Done Command
type:string

Send done as a text message to flush any remaining audio and close the session; the server responds with a done acknowledgment.

Receive Transcription
type:object

The server will send transcription results as they become available. Messages can be of type transcript, flush_done, done, or error. Each transcript response includes word-level timestamps.
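A receive loop therefore needs to branch on the message type. A minimal dispatch sketch, assuming server messages arrive as JSON text with a "type" field holding one of the four values above (field names other than "type" are illustrative; consult the response schemas for exact shapes):

```python
import json

def handle_message(raw: str) -> dict:
    """Parse one server text message and route on its type.

    Returns a small dict tagging the message kind; a real client would
    act on each branch (append transcript text, close the socket, etc.).
    """
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "transcript":
        # per the docs, each transcript carries word-level timestamps
        return {"kind": "transcript", "payload": msg}
    if kind == "flush_done":
        return {"kind": "flush_done", "payload": msg}  # finalize acknowledged
    if kind == "done":
        return {"kind": "done", "payload": msg}        # session closing
    if kind == "error":
        return {"kind": "error", "payload": msg}
    return {"kind": "unknown", "payload": msg}
```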

Flush Done Response
type:object

Acknowledgment that the finalize command was received.

Done Response
type:object

Acknowledgment that the session is closing.

Error Response
type:object

Error information for STT WebSocket connections.