Cartesia’s Realtime Speech-to-Text (External VAD) API is similar to Deepgram’s Live Audio (Nova) API. Both APIs stream audio over a WebSocket and emit transcripts as they become available, so porting an existing Nova integration is mostly a matter of renaming fields and updating a few connection parameters. If you want the API to detect user turns, see Realtime Speech-to-Text and the Deepgram Flux migration guide instead. This guide covers direct WebSocket usage. SDK-specific examples are coming soon.
Connection
Replace the Deepgram WebSocket URL and auth header with Cartesia’s. Where you cannot set headers, such as in a browser, pass the `cartesia_version` query param and use a short-lived access token in the `access_token` query param instead of an API key.
Query parameters
| Deepgram Nova | Cartesia Ink | Notes |
|---|---|---|
| `model=nova-3` (Required) | `model=ink-2` (Required) | See Models for all options. |
| `encoding=linear16` (Required) | `encoding=pcm_s16le` (Required) | `linear16` → `pcm_s16le`, `linear32` → `pcm_s32le`, `mulaw` → `pcm_mulaw`, `alaw` → `pcm_alaw`. |
| `sample_rate` (Required) | `sample_rate` (Required) | No change. |
| `language` | `language` | `ink-2` only supports `en` right now. Use `ink-whisper` for other languages. |
| — | `cartesia_version=2026-03-01` | See API Conventions for details. |
| `multichannel`, `channels` | — | Send a mono audio stream per WebSocket connection. |
| `diarize` | — | Coming soon! |
| `keyterm`, `keywords` | — | Coming soon! |
| `endpointing`, `utterance_end_ms`, `interim_results`, `vad_events`, `punctuate`, `smart_format`, `numerals`, `dictation`, `redact`, `replace`, `search`, `detect_entities`, `profanity_filter` | — | No equivalent. |
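
The parameter mapping in the table can be sketched as a small translation helper. This is a minimal sketch, not an official client: `translate_params`, `build_url`, and `CARTESIA_WS_URL` are hypothetical names, and the base URL is a placeholder that should be replaced with the WebSocket URL from the Cartesia API reference. Only the encoding names, the `ink-2` model, and the `2026-03-01` version come from this guide.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: take the real WebSocket URL from the Cartesia API reference.
CARTESIA_WS_URL = "wss://example.invalid/stt"

# Encoding names from the query parameter table.
ENCODING_MAP = {
    "linear16": "pcm_s16le",
    "linear32": "pcm_s32le",
    "mulaw": "pcm_mulaw",
    "alaw": "pcm_alaw",
}

# Parameters that carry over under the same name.
PASS_THROUGH = {"sample_rate", "language"}


def translate_params(deepgram_params, model="ink-2", version="2026-03-01"):
    """Return (cartesia_params, dropped) for a dict of Deepgram Nova query params."""
    cartesia = {"model": model, "cartesia_version": version}
    dropped = []
    for name, value in deepgram_params.items():
        if name == "model":
            continue  # replaced by the Cartesia model chosen above
        if name == "encoding":
            # Raises KeyError for encodings the table does not map.
            cartesia["encoding"] = ENCODING_MAP[value]
        elif name in PASS_THROUGH:
            cartesia[name] = value
        else:
            dropped.append(name)  # no Cartesia equivalent, or not available yet
    return cartesia, sorted(dropped)


def build_url(cartesia_params, base_url=CARTESIA_WS_URL):
    """Append the translated params to the (placeholder) base URL."""
    return f"{base_url}?{urlencode(cartesia_params)}"
```

Returning the dropped names separately makes it easy to log which Deepgram features an integration relied on, so you can decide whether to replace them client-side.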
Sending audio
Both APIs accept raw audio as binary WebSocket frames. No change to your audio pipeline is needed; just make sure the bytes match the `encoding` and `sample_rate` you declared.
Cartesia’s control commands are bare text frames, not JSON.
To force the model to flush any buffered audio and emit the transcript, send the text frame `finalize`.
Cartesia has no equivalent of Deepgram’s KeepAlive message. The connection has a 3-minute idle timeout that resets every time you send an audio chunk, so keep streaming audio (silent or otherwise) to hold it open.
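
If your existing code builds Deepgram JSON control messages, one way to port it is a thin translation layer. The sketch below assumes Deepgram's `Finalize`, `CloseStream`, and `KeepAlive` message types (these names come from Deepgram's API, not from this guide) and the `finalize` and `close` text commands mentioned in this guide. `translate_control` and `silence_chunk` are hypothetical helper names, and the silence helper assumes mono `pcm_s16le` audio.

```python
import json

# Deepgram JSON control message type -> Cartesia bare text frame.
# None means there is no Cartesia command for it.
COMMAND_MAP = {
    "Finalize": "finalize",
    "CloseStream": "close",
    "KeepAlive": None,  # send audio (for example, silence) instead
}


def translate_control(deepgram_message):
    """Map a Deepgram JSON control message to a Cartesia text frame, or None."""
    msg_type = json.loads(deepgram_message)["type"]
    return COMMAND_MAP[msg_type]


def silence_chunk(sample_rate, duration_ms=100):
    """Return zeroed mono pcm_s16le audio (2 bytes per sample)."""
    num_samples = sample_rate * duration_ms // 1000
    return b"\x00\x00" * num_samples
```

Where the old code sent `KeepAlive`, the ported code would send the result of `silence_chunk(...)` as a binary frame, which resets the idle timeout like any other audio chunk.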
Event mapping
Deepgram emits four server message types. Cartesia emits transcript chunks plus acknowledgments for the `finalize` and `close` commands.
| Deepgram Nova (`type`) | Cartesia (`type`) | Notes |
|---|---|---|
| `Results` | `transcript` | The main transcript event. See payload diff below. |
| `Metadata` | — | No equivalent. |
| `UtteranceEnd` | — | No equivalent. Run client-side VAD or use Realtime STT if you need this. |
| `SpeechStarted` | — | No equivalent. Run client-side VAD or use Realtime STT if you need this. |
| — | `flush_done` | Acknowledgment for `finalize`. |
| — | `done` | Acknowledgment for `close`. Sent immediately before the WebSocket closes. |
| — | `error` | Error events on the WebSocket. |
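
A receive loop can dispatch on the `type` field listed in the table. The following is a minimal sketch that assumes each server frame is a JSON object with a `type` key; `route_event` and the action labels it returns are hypothetical names chosen for illustration.

```python
import json


def route_event(raw_frame):
    """Classify one Cartesia server frame into (action, event)."""
    event = json.loads(raw_frame)
    kind = event.get("type")
    if kind == "transcript":
        return "transcript", event       # main transcript chunk
    if kind == "flush_done":
        return "finalize_acked", event   # acknowledgment for finalize
    if kind == "done":
        return "closing", event          # acknowledgment for close
    if kind == "error":
        return "error", event
    return "unknown", event              # tolerate event types added later
```

Handlers you previously attached to Deepgram's `Metadata`, `UtteranceEnd`, and `SpeechStarted` messages have no counterpart here and need to be removed or driven by client-side VAD.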
Results message:
transcript event:
`text` is a delta since the last `is_final: true` chunk, not a cumulative transcript for the whole session. To assemble the full transcript, concatenate the `text` from every chunk where `is_final` is `true`.
Do not strip whitespace from `text` or add whitespace between chunks, as this will produce an incorrect transcript.
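
The assembly rule above can be sketched as a small accumulator. This is an illustrative sketch, not part of any SDK: `TranscriptAssembler` is a hypothetical class, events are assumed to be already-parsed dicts with `type`, `text`, and `is_final` keys, and showing the latest non-final chunk after the finalized text is an assumption about how interim chunks should be displayed.

```python
class TranscriptAssembler:
    """Accumulate transcript deltas without altering their whitespace."""

    def __init__(self):
        self._final_parts = []  # text of every is_final chunk, in order
        self._interim = ""      # latest non-final chunk, replaced on each update

    def feed(self, event):
        """Consume one parsed server event; non-transcript events are ignored."""
        if event.get("type") != "transcript":
            return
        if event["is_final"]:
            self._final_parts.append(event["text"])
            self._interim = ""
        else:
            self._interim = event["text"]

    @property
    def final_text(self):
        # Plain concatenation: no strip(), no separator between chunks.
        return "".join(self._final_parts)

    @property
    def display_text(self):
        # Assumption: an interim chunk covers audio since the last final chunk.
        return self.final_text + self._interim
```

Joining with an empty string matters: any spacing between words is expected to arrive inside the `text` values themselves.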
Fields that don’t have an equivalent
Cartesia does not emit:
- `channel.alternatives`: Cartesia returns a single best transcript at the top level
- `channel_index`, `from_finalize`
- `speech_final`: use `is_final` together with silence-based finalization
- `confidence` (per-word and per-utterance)
- `entities`, `metadata`, `model_info`
- `punctuated_word`, `speaker` (per-word): diarization is coming soon