Migrating From Deepgram Nova to Cartesia Ink

Cartesia’s Realtime Speech-to-Text (External VAD) API is similar to Deepgram’s Live Audio (Nova) API. Both APIs stream audio over a WebSocket and emit transcripts as they become available, so porting an existing Nova integration is mostly a matter of renaming fields and updating a few connection parameters. If you want the API to detect user turns, see Realtime Speech-to-Text and the Deepgram Flux migration guide instead. This guide covers direct WebSocket usage. SDK-specific examples are coming soon.

Connection

Replace the Deepgram WebSocket URL and auth header with Cartesia’s.

- wss://api.deepgram.com/v1/listen?model=nova-3&encoding=linear16&sample_rate=16000
+ wss://api.cartesia.ai/stt/websocket?model=ink-2&encoding=pcm_s16le&sample_rate=16000

- Authorization: Token <DEEPGRAM_API_KEY>
+ X-API-Key: <CARTESIA_API_KEY>

Browser WebSocket APIs cannot set custom request headers. In a browser, pass the API version as the cartesia_version query parameter, and authenticate with a short-lived access token in the access_token query parameter instead of an API key.
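
As a minimal sketch of the two connection styles described above, the helper below builds the WebSocket URL and headers for either server-side (API key header) or browser-style (access token in the query string) authentication. The function name `build_connection` and its defaults are illustrative assumptions, not part of any Cartesia SDK; only the endpoint, header name, and query parameter names come from this guide.

```python
from urllib.parse import urlencode

# Endpoint taken from the connection URL shown in this guide.
CARTESIA_STT_URL = "wss://api.cartesia.ai/stt/websocket"


def build_connection(api_key=None, access_token=None, *, model="ink-2",
                     encoding="pcm_s16le", sample_rate=16000,
                     cartesia_version="2026-03-01"):
    """Return (url, headers) for a Cartesia STT WebSocket connection.

    Hypothetical helper for illustration only.
    """
    params = {
        "model": model,
        "encoding": encoding,
        "sample_rate": sample_rate,
        "cartesia_version": cartesia_version,
    }
    headers = {}
    if access_token is not None:
        # Browser-style auth: the token travels in the query string.
        params["access_token"] = access_token
    elif api_key is not None:
        # Server-side auth: the API key goes in a request header.
        headers["X-API-Key"] = api_key
    else:
        raise ValueError("provide either api_key or access_token")
    return f"{CARTESIA_STT_URL}?{urlencode(params)}", headers
```

Pass the returned URL and headers to whichever WebSocket client you already use for Deepgram.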

Query parameters

Deepgram Nova	Cartesia Ink	Notes
`model=nova-3` (Required)	`model=ink-2` (Required)	See Models for all options.
`encoding=linear16` (Required)	`encoding=pcm_s16le` (Required)	`linear16` → `pcm_s16le`, `linear32` → `pcm_s32le`, `mulaw` → `pcm_mulaw`, `alaw` → `pcm_alaw`.
`sample_rate` (Required)	`sample_rate` (Required)	No change.
`language`	`language`	`ink-2` only supports `en` right now. Use `ink-whisper` for other languages.
—	`cartesia_version=2026-03-01`	See API Conventions for details.
`multichannel`, `channels`	—	Send a mono audio stream per WebSocket connection.
`diarize`	—	Coming soon!
`keyterm`, `keywords`	—	Coming soon!
`endpointing`, `utterance_end_ms`, `interim_results`, `vad_events`, `punctuate`, `smart_format`, `numerals`, `dictation`, `redact`, `replace`, `search`, `detect_entities`, `profanity_filter`	—	No equivalent.
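
The table above can be applied mechanically. The following sketch translates a dictionary of Deepgram query parameters into Cartesia ones and reports which parameters were dropped. `translate_params` is a hypothetical helper, and treating every unlisted parameter as "dropped" is an assumption made for simplicity.

```python
# Encoding renames taken from the table above.
ENCODING_MAP = {
    "linear16": "pcm_s16le",
    "linear32": "pcm_s32le",
    "mulaw": "pcm_mulaw",
    "alaw": "pcm_alaw",
}

# Parameters that keep the same name on both APIs.
PASS_THROUGH = {"sample_rate", "language"}


def translate_params(deepgram_params, model="ink-2",
                     cartesia_version="2026-03-01"):
    """Return (cartesia_params, dropped_keys). Hypothetical helper."""
    cartesia = {"model": model, "cartesia_version": cartesia_version}
    dropped = []
    for key, value in deepgram_params.items():
        if key == "model":
            continue  # Replaced by the Cartesia model chosen above.
        if key == "encoding":
            if value not in ENCODING_MAP:
                raise ValueError(f"no Cartesia equivalent for encoding {value!r}")
            cartesia["encoding"] = ENCODING_MAP[value]
        elif key in PASS_THROUGH:
            cartesia[key] = value
        else:
            dropped.append(key)  # No equivalent, or not yet available.
    return cartesia, dropped
```

Logging the dropped keys during migration makes it easy to spot features, such as `smart_format`, that your integration relied on.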

Sending audio

Both APIs accept raw audio as binary WebSocket frames, so your audio pipeline does not need to change. Just make sure the bytes match the encoding and sample_rate you declared. Control commands do differ: Deepgram’s are JSON objects, while Cartesia’s are bare text frames. To force the model to flush any buffered audio and emit the transcript, send:

- { "type": "Finalize" }
+ finalize

To close the session cleanly:

- { "type": "CloseStream" }
+ close
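
If your existing code builds Deepgram control messages in one place, a small translation layer may be enough. The sketch below maps the two control messages shown above to their Cartesia text-frame equivalents; `to_cartesia_command` is a hypothetical name, and raising on unknown message types is a design choice rather than required behavior.

```python
import json

# Deepgram control message type -> Cartesia bare text command.
CONTROL_COMMANDS = {"Finalize": "finalize", "CloseStream": "close"}


def to_cartesia_command(deepgram_control):
    """Convert a Deepgram control message (JSON string or dict) to the
    text frame Cartesia expects. Hypothetical helper."""
    if isinstance(deepgram_control, str):
        message = json.loads(deepgram_control)
    else:
        message = deepgram_control
    msg_type = message.get("type")
    if msg_type not in CONTROL_COMMANDS:
        raise ValueError(f"no Cartesia equivalent for control message {msg_type!r}")
    return CONTROL_COMMANDS[msg_type]
```

Send the returned string as a text frame, not a binary frame.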

Cartesia has no equivalent of Deepgram’s KeepAlive message. The connection has a 3-minute idle timeout that resets every time you send an audio chunk — keep streaming audio (silent or otherwise) to hold it open.
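
One way to replace KeepAlive is to send a short chunk of silence when no real audio has gone out for a while. The sketch below assumes `pcm_s16le` mono audio, where each sample is two zero bytes for silence. The helper names and the 30-second safety margin are illustrative assumptions; only the 3-minute timeout comes from this guide.

```python
IDLE_TIMEOUT_S = 180.0  # 3-minute idle timeout described above.


def silence_chunk(duration_ms=100, sample_rate=16000):
    """Return `duration_ms` of silence as pcm_s16le mono bytes."""
    num_samples = sample_rate * duration_ms // 1000
    return b"\x00\x00" * num_samples  # 2 bytes per 16-bit sample.


def should_send_silence(last_audio_sent_at, now, margin_s=30.0):
    """True when the connection is within `margin_s` seconds of timing out.

    Both timestamps are seconds from the same clock, e.g. time.monotonic().
    """
    return now - last_audio_sent_at >= IDLE_TIMEOUT_S - margin_s
```

Call `should_send_silence` from a periodic timer and, when it returns true, send `silence_chunk()` as a binary frame.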

Event mapping

Deepgram emits four server message types. Cartesia emits transcript chunks plus acknowledgments for the finalize and close commands.

Deepgram Nova (`type`)	Cartesia (`type`)	Notes
`Results`	`transcript`	The main transcript event. See payload diff below.
`Metadata`	—	No equivalent.
`UtteranceEnd`	—	No equivalent. Run client-side VAD or use Realtime STT if you need this.
`SpeechStarted`	—	No equivalent. Run client-side VAD or use Realtime STT if you need this.
—	`flush_done`	Acknowledgment for `finalize`.
—	`done`	Acknowledgment for `close`. Sent immediately before the WebSocket closes.
—	`error`	Error events on the WebSocket.
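
A receive loop can dispatch on the Cartesia `type` values listed above. The sketch below assumes every server message arrives as a JSON text frame with a top-level `type` field, as the transcript payload in this guide suggests. `SessionState` and `handle_event` are hypothetical names.

```python
import json
from dataclasses import dataclass, field


@dataclass
class SessionState:
    """Minimal per-connection state. Hypothetical structure."""
    transcripts: list = field(default_factory=list)
    flushed: bool = False  # Set when a finalize command is acknowledged.
    closed: bool = False   # Set when a close command is acknowledged.


def handle_event(raw, state):
    """Apply one server message to `state` and return its type."""
    event = json.loads(raw)
    kind = event.get("type")
    if kind == "transcript":
        state.transcripts.append(event)
    elif kind == "flush_done":
        state.flushed = True
    elif kind == "done":
        state.closed = True
    elif kind == "error":
        raise RuntimeError(f"Cartesia STT error: {event}")
    # Unknown types are ignored so new event types do not break the loop.
    return kind
```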

A Deepgram Results message:

{
  "type": "Results",
  "channel_index": [0, 1],
  "duration": 1.7,
  "start": 0.0,
  "is_final": true,
  "speech_final": true,
  "channel": {
    "alternatives": [
      {
        "transcript": "Hi I need to cancel my subscription please.",
        "confidence": 0.98,
        "words": [...]
      }
    ]
  },
  "metadata": {...}
}

Becomes an Ink transcript event:

{
  "type": "transcript",
  "is_final": true,
  "request_id": "2ff8af53-4d38-479d-8287-58940f01c701",
  "text": "Hi I need to cancel my subscription please.",
  "duration": 1.7,
  "language": "en",
  "words": [...]
}
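
If downstream code still expects Deepgram-shaped payloads, a temporary shim can ease the transition. The sketch below wraps a Cartesia transcript event in the `channel.alternatives` structure shown above. `to_deepgram_results` is a hypothetical helper, and fields Cartesia does not emit (such as `confidence` and `speech_final`) are left out rather than invented.

```python
def to_deepgram_results(event):
    """Wrap a Cartesia transcript event in a Deepgram-like Results dict.

    Hypothetical migration shim: only fields present in both payloads
    are carried over.
    """
    if event.get("type") != "transcript":
        raise ValueError("expected a Cartesia transcript event")
    return {
        "type": "Results",
        "is_final": event["is_final"],
        "duration": event.get("duration"),
        "channel": {
            "alternatives": [
                {
                    "transcript": event["text"],
                    "words": event.get("words", []),
                }
            ]
        },
    }
```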

Cartesia’s text is a delta since the last is_final: true chunk, not a cumulative transcript for the whole session. To assemble the full transcript, concatenate the text from every chunk where is_final is true. Do not strip whitespace from text or insert whitespace between chunks; either change produces an incorrect transcript.
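
The assembly rule above can be sketched in a few lines. `assemble_transcript` is a hypothetical helper that takes already-parsed transcript events in arrival order.

```python
def assemble_transcript(events):
    """Join the text of every final transcript chunk, byte for byte.

    Non-final chunks are skipped, and no whitespace is stripped or added.
    """
    return "".join(
        event["text"]
        for event in events
        if event.get("type") == "transcript" and event.get("is_final")
    )
```

For example, final chunks with text `"Hi I need"` and `" to cancel."` are joined without any separator, and interim chunks in between are ignored.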

Fields that don’t have an equivalent

Cartesia does not emit:

channel.alternatives — Cartesia returns a single best transcript at the top level
channel_index, from_finalize
speech_final — use is_final together with silence-based finalization
confidence (per-word and per-utterance)
entities, metadata, model_info
punctuated_word, speaker (per-word) — diarization is coming soon
