Cartesia’s Realtime Speech-to-Text (External VAD) API is similar to Deepgram’s Live Audio (Nova) API. Both APIs stream audio over a WebSocket and emit transcripts as they become available, so porting an existing Nova integration is mostly a matter of renaming fields and updating a few connection parameters. If you want the API to detect user turns, see Realtime Speech-to-Text and the Deepgram Flux migration guide instead. This guide covers direct WebSocket usage. SDK-specific examples are coming soon.

Connection

Replace the Deepgram WebSocket URL and auth header with Cartesia’s.
- wss://api.deepgram.com/v1/listen?model=nova-3&encoding=linear16&sample_rate=16000
+ wss://api.cartesia.ai/stt/websocket?model=ink-2&encoding=pcm_s16le&sample_rate=16000
- Authorization: Token <DEEPGRAM_API_KEY>
+ X-API-Key: <CARTESIA_API_KEY>
Browser WebSocket clients cannot set request headers. In a browser, pass the API version as the cartesia_version query parameter, and authenticate with a short-lived access token in the access_token query parameter instead of an API key.
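As a rough illustration of the connection change, the sketch below assembles the Cartesia WebSocket URL from its query parameters. The helper name `build_stt_url` and the token value are hypothetical; the endpoint, parameter names, and version string are the ones used in this guide. The same string construction applies in any language, including browser JavaScript.

```python
from urllib.parse import urlencode

CARTESIA_STT_URL = "wss://api.cartesia.ai/stt/websocket"


def build_stt_url(model="ink-2", encoding="pcm_s16le", sample_rate=16000,
                  *, cartesia_version=None, access_token=None):
    """Build the connection URL (hypothetical helper, not part of any SDK).

    Server-side clients can leave the keyword-only arguments unset and send
    the API key in the X-API-Key header. Browser clients cannot set headers,
    so they pass cartesia_version and access_token in the query string.
    """
    params = {"model": model, "encoding": encoding, "sample_rate": sample_rate}
    if cartesia_version is not None:
        params["cartesia_version"] = cartesia_version
    if access_token is not None:
        params["access_token"] = access_token
    return f"{CARTESIA_STT_URL}?{urlencode(params)}"


# Server-side: credentials travel in the X-API-Key header, not the URL.
server_url = build_stt_url()

# Browser-style: version and short-lived token go in the query string.
browser_url = build_stt_url(cartesia_version="2026-03-01",
                            access_token="<ACCESS_TOKEN>")
```

Using `urlencode` rather than manual concatenation keeps token values with reserved characters correctly escaped.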

Query parameters

| Deepgram Nova | Cartesia Ink | Notes |
| --- | --- | --- |
| model=nova-3 (Required) | model=ink-2 (Required) | See Models for all options. |
| encoding=linear16 (Required) | encoding=pcm_s16le (Required) | linear16 → pcm_s16le, linear32 → pcm_s32le, mulaw → pcm_mulaw, alaw → pcm_alaw. |
| sample_rate (Required) | sample_rate (Required) | No change. |
| language | language | ink-2 only supports en right now. Use ink-whisper for other languages. |
| | cartesia_version=2026-03-01 | See API Conventions for details. |
| multichannel, channels | | Send a mono audio stream per WebSocket connection. |
| diarize | | Coming soon! |
| keyterm, keywords | | Coming soon! |
| endpointing, utterance_end_ms, interim_results, vad_events, punctuate, smart_format, numerals, dictation, redact, replace, search, detect_entities, profanity_filter | | No equivalent. |
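The parameter mapping above can be sketched as a small translation function. This is only an illustration: `translate_params` is a hypothetical helper, and it assumes you already hold your Deepgram query parameters as a dictionary. It renames the encoding, passes through sample_rate and language, swaps in a Cartesia model, and reports every parameter it had to drop.

```python
# Encoding names from the table: Deepgram name -> Cartesia name.
ENCODING_MAP = {
    "linear16": "pcm_s16le",
    "linear32": "pcm_s32le",
    "mulaw": "pcm_mulaw",
    "alaw": "pcm_alaw",
}

# Parameters that keep the same name on both APIs.
PASS_THROUGH = {"sample_rate", "language"}


def translate_params(deepgram_params, model="ink-2",
                     cartesia_version="2026-03-01"):
    """Return (cartesia_params, dropped_keys) for a Deepgram param dict."""
    cartesia = {"model": model, "cartesia_version": cartesia_version}
    dropped = []
    for key, value in deepgram_params.items():
        if key == "model":
            continue  # the Deepgram model is replaced, not translated
        if key == "encoding":
            if value not in ENCODING_MAP:
                raise ValueError(f"no Cartesia encoding listed for {value!r}")
            cartesia["encoding"] = ENCODING_MAP[value]
        elif key in PASS_THROUGH:
            cartesia[key] = value
        else:
            dropped.append(key)  # no equivalent, or not yet available
    return cartesia, dropped
```

Logging the dropped keys during migration is a cheap way to find features, such as punctuate or endpointing, that your application relied on and now has to handle differently.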

Sending audio

Both APIs accept raw audio as binary WebSocket frames. Your audio pipeline does not need to change; just make sure the bytes match the encoding and sample_rate you declared.

Cartesia’s control commands are bare text frames, not JSON. To force the model to flush any buffered audio and emit the transcript:
- { "type": "Finalize" }
+ finalize
To close the session cleanly:
- { "type": "CloseStream" }
+ close
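If existing code already builds Deepgram control messages, one way to port it is a thin translation layer like the sketch below. `translate_control` is a hypothetical helper; it covers only the two commands shown above and returns None for any other message type.

```python
import json
from typing import Optional

# Deepgram control message type -> Cartesia bare text command.
CONTROL_MAP = {"Finalize": "finalize", "CloseStream": "close"}


def translate_control(deepgram_message: str) -> Optional[str]:
    """Map a Deepgram JSON control message to Cartesia's text frame.

    Returns None when the message type has no Cartesia command.
    """
    message_type = json.loads(deepgram_message).get("type")
    return CONTROL_MAP.get(message_type)
```

The returned string is meant to be sent as a text frame as is, without wrapping it in JSON.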
Cartesia has no equivalent of Deepgram’s KeepAlive message. The connection has a 3-minute idle timeout that resets every time you send an audio chunk — keep streaming audio (silent or otherwise) to hold it open.
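Where a Nova integration sent KeepAlive during pauses, one possible substitute is to send a short chunk of silence instead. The sketch below generates such a chunk; it assumes mono pcm_s16le audio, where silence is all-zero 16-bit samples, and the helper name and default chunk length are illustrative choices rather than anything the API prescribes.

```python
def silence_chunk(duration_ms: int = 100, sample_rate: int = 16000) -> bytes:
    """Return duration_ms of silent mono pcm_s16le audio.

    Each sample is two zero bytes, so the chunk length is
    sample_rate * duration_ms / 1000 samples times 2 bytes.
    """
    num_samples = sample_rate * duration_ms // 1000
    return b"\x00\x00" * num_samples
```

For example, `silence_chunk()` with the defaults yields 3200 bytes (1600 samples at 2 bytes each). Send it as a binary frame whenever your real audio source has been quiet for a while, well inside the 3-minute window.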

Event mapping

Deepgram emits four server message types. Cartesia emits transcript chunks plus acknowledgments for the finalize and close commands.
| Deepgram Nova (type) | Cartesia (type) | Notes |
| --- | --- | --- |
| Results | transcript | The main transcript event. See payload diff below. |
| Metadata | | No equivalent. |
| UtteranceEnd | | No equivalent. Run client-side VAD or use Realtime STT if you need this. |
| SpeechStarted | | No equivalent. Run client-side VAD or use Realtime STT if you need this. |
| | flush_done | Acknowledgment for finalize. |
| | done | Acknowledgment for close. Sent immediately before the WebSocket closes. |
| | error | Error events on the WebSocket. |
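A receive loop for the Cartesia event types above might route messages roughly as follows. This is a sketch, not a definitive client: `dispatch` and its callback parameters are hypothetical names, and it assumes every server message is a JSON text frame with a top-level type field. It makes no assumptions about other fields in flush_done, done, or error payloads and simply hands the parsed event to the callback.

```python
import json
from typing import Callable, Optional

Handler = Optional[Callable[[dict], None]]


def dispatch(raw: str,
             on_transcript: Handler = None,
             on_flush_done: Handler = None,
             on_done: Handler = None,
             on_error: Handler = None) -> Optional[str]:
    """Parse one server message, call the matching callback, return its type."""
    event = json.loads(raw)
    handlers = {
        "transcript": on_transcript,  # replaces Deepgram's Results
        "flush_done": on_flush_done,  # acknowledgment for finalize
        "done": on_done,              # acknowledgment for close
        "error": on_error,
    }
    event_type = event.get("type")
    handler = handlers.get(event_type)
    if handler is not None:
        handler(event)
    return event_type
```

Handlers that previously listened for UtteranceEnd or SpeechStarted have nothing to attach to here; that logic moves to client-side VAD.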
A Deepgram Results message:
{
  "type": "Results",
  "channel_index": [0, 1],
  "duration": 1.7,
  "start": 0.0,
  "is_final": true,
  "speech_final": true,
  "channel": {
    "alternatives": [
      {
        "transcript": "Hi I need to cancel my subscription please.",
        "confidence": 0.98,
        "words": [...]
      }
    ]
  },
  "metadata": {...}
}
Becomes an Ink transcript event:
{
  "type": "transcript",
  "is_final": true,
  "request_id": "2ff8af53-4d38-479d-8287-58940f01c701",
  "text": "Hi I need to cancel my subscription please.",
  "duration": 1.7,
  "language": "en",
  "words": [...]
}
Cartesia’s text is a delta since the last is_final: true chunk, not a cumulative transcript for the whole session. To assemble the full transcript, concatenate the text from every chunk where is_final is true. Do not strip whitespace from text or insert whitespace between chunks; either change produces an incorrect transcript.
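The assembly rule above can be sketched as a small accumulator. The class name `TranscriptAssembler` is hypothetical. Final chunks are joined exactly as received, with no stripping and no separator. Treating the latest non-final chunk as a replaceable preview of the current segment is an assumption based on the delta description, so verify it against real traffic before relying on `live_text`.

```python
class TranscriptAssembler:
    """Accumulate Cartesia transcript events into a full transcript."""

    def __init__(self):
        self._final_parts = []  # text of every is_final chunk, in order
        self._interim = ""      # latest non-final text (assumed preview)

    def feed(self, event: dict) -> None:
        if event.get("type") != "transcript":
            return  # ignore flush_done, done, error
        if event.get("is_final"):
            self._final_parts.append(event["text"])
            self._interim = ""
        else:
            self._interim = event["text"]

    @property
    def final_text(self) -> str:
        # Plain concatenation: whitespace is already part of each chunk.
        return "".join(self._final_parts)

    @property
    def live_text(self) -> str:
        return self.final_text + self._interim


assembler = TranscriptAssembler()
assembler.feed({"type": "transcript", "is_final": True,
                "text": "Hi I need to cancel"})
assembler.feed({"type": "transcript", "is_final": True,
                "text": " my subscription please."})
```

Note that the second chunk in this made-up example carries its own leading space, which is why joining with `" ".join(...)` or calling `strip()` would corrupt the result.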

Fields that don’t have an equivalent

Cartesia does not emit:
  • channel.alternatives — Cartesia returns a single best transcript at the top level
  • channel_index, from_finalize
  • speech_final — use is_final together with silence-based finalization
  • confidence (per-word and per-utterance)
  • entities, metadata, model_info
  • punctuated_word, speaker (per-word) — diarization is coming soon
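If downstream code still expects the Deepgram Results shape, a temporary adapter along these lines may ease the migration. `to_deepgram_results` is a hypothetical helper, not something either API provides. It fills in only the fields that have a Cartesia counterpart and deliberately omits the ones listed above, so consumers that read confidence or speech_final will fail loudly instead of receiving invented values.

```python
def to_deepgram_results(event: dict) -> dict:
    """Wrap a Cartesia transcript event in a Deepgram-style Results shape.

    Omitted on purpose: confidence, speech_final, channel_index, metadata.
    """
    if event.get("type") != "transcript":
        raise ValueError("only transcript events can be adapted")
    return {
        "type": "Results",
        "is_final": event["is_final"],
        "duration": event.get("duration"),
        "channel": {
            "alternatives": [
                {
                    "transcript": event["text"],  # a delta, not cumulative
                    "words": event.get("words", []),
                }
            ]
        },
    }
```

The adapted transcript keeps Cartesia's whitespace, so any consumer that joins segments with its own separator still needs to switch to plain concatenation.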