Cartesia’s Realtime Speech-to-Text API is similar to Deepgram’s Turn-based Audio (Flux) API. Both APIs emit turn-based events over a WebSocket, so porting an existing Flux integration is mostly a matter of renaming fields and updating a few connection parameters. If you want to tell the API when user turns end, see Realtime Speech-to-Text (External VAD) and the Deepgram Nova migration guide instead. This guide covers direct WebSocket usage. SDK-specific examples are coming soon.
Connection
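Replace the Deepgram WebSocket URL and auth header with Cartesia’s. Specify the API version with the `cartesia_version` query param. In clients that cannot set request headers, such as browsers, authenticate with a short-lived access token passed in the `access_token` query param instead of an API key.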
Query parameters
| Deepgram Flux | Cartesia Ink 2 | Notes |
|---|---|---|
| `model=flux-general-en` (Required) | `model=ink-2` (Required) | See STT Models for all options. |
| `encoding=linear16` (Required) | `encoding=pcm_s16le` (Required) | `linear16` → `pcm_s16le`, `linear32` → `pcm_s32le`, `mulaw` → `pcm_mulaw`, `alaw` → `pcm_alaw`. |
| `sample_rate` (Required) | `sample_rate` (Required) | No change. |
| `language_hint` | — | Only English is supported right now. Multilingual support is coming soon! |
| — | `cartesia_version=2026-03-01` | See API Conventions for details. |
| `eager_eot_threshold` | — | Turn detection is controlled by the model. Configuration is coming soon! |
| `eot_threshold` | — | Turn detection is controlled by the model. Configuration is coming soon! |
| `eot_timeout_ms` | — | Turn detection is controlled by the model. Configuration is coming soon! |
| `keyterm` | — | Coming soon! |
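The parameter mapping in the table above can be sketched as a small translation helper. This is a minimal illustration, not an official client: the function names `flux_to_cartesia_params` and `build_url` are hypothetical, and the base URL in the usage note is a placeholder rather than the real endpoint. Parameters without a Cartesia equivalent are dropped and reported back so you can see what was lost.

```python
from urllib.parse import urlencode

# Encoding names, taken from the query parameter table.
ENCODING_MAP = {
    "linear16": "pcm_s16le",
    "linear32": "pcm_s32le",
    "mulaw": "pcm_mulaw",
    "alaw": "pcm_alaw",
}


def flux_to_cartesia_params(flux_params, version="2026-03-01"):
    """Translate Flux query parameters into Cartesia ones.

    Returns (cartesia_params, dropped_keys). Hypothetical helper.
    """
    out = {"model": "ink-2", "cartesia_version": version}
    dropped = []
    for key, value in flux_params.items():
        if key == "model":
            continue  # replaced by model=ink-2 above
        if key == "encoding":
            out["encoding"] = ENCODING_MAP[value]
        elif key == "sample_rate":
            out["sample_rate"] = value
        else:
            # language_hint, eot_* thresholds, keyterm, and anything
            # else have no documented equivalent, so they are dropped.
            dropped.append(key)
    return out, dropped


def build_url(base_url, params):
    # base_url is a placeholder; take the real endpoint from the API reference.
    return f"{base_url}?{urlencode(params)}"
```

For example, `flux_to_cartesia_params({"model": "flux-general-en", "encoding": "linear16", "sample_rate": 16000, "eot_threshold": 0.7})` keeps the encoding and sample rate, switches the model to `ink-2`, and reports `eot_threshold` as dropped.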
Sending audio
Both APIs accept raw audio as binary WebSocket frames. Your audio pipeline does not need to change; just make sure the bytes match the `encoding` and `sample_rate` you declared.
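One way to keep the bytes consistent with the declared format is to slice the audio on whole-sample boundaries before sending each slice as a binary frame. The sketch below assumes mono audio and a 20 ms chunk size; both are illustrative choices, not requirements stated by either API, and `chunk_pcm` is a hypothetical helper name.

```python
# Bytes per sample for each Cartesia encoding name (mono audio assumed).
BYTES_PER_SAMPLE = {
    "pcm_s16le": 2,
    "pcm_s32le": 4,
    "pcm_mulaw": 1,
    "pcm_alaw": 1,
}


def chunk_pcm(audio, encoding, sample_rate, chunk_ms=20):
    """Yield slices of `audio` that each hold about `chunk_ms` of sound."""
    width = BYTES_PER_SAMPLE[encoding]
    if len(audio) % width:
        raise ValueError("audio length is not a whole number of samples")
    samples_per_chunk = sample_rate * chunk_ms // 1000
    if samples_per_chunk <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    chunk_bytes = samples_per_chunk * width
    for start in range(0, len(audio), chunk_bytes):
        # The last slice may be shorter, but it still ends on a sample boundary.
        yield audio[start:start + chunk_bytes]
```

Each yielded slice would then be sent as one binary WebSocket frame by whichever WebSocket client you already use.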
To close the session, send a JSON-encoded WebSocket text frame:
Cartesia has no Configure control message, since there’s no need to configure end-of-turn.
Event mapping
Flux wraps all turn events in a single `TurnInfo` message with an `event` discriminator. Cartesia emits one message type per event, with the type on the top-level `type` field.
| Deepgram Flux (`TurnInfo.event`) | Cartesia (`type`) | Carries transcript? |
|---|---|---|
| `StartOfTurn` | `turn.start` | No (Flux: yes) |
| `Update` | `turn.update` | Yes |
| `EagerEndOfTurn` | `turn.eager_end` | Yes |
| `TurnResumed` | `turn.resume` | No (Flux: yes) |
| `EndOfTurn` | `turn.end` | Yes |
| `Connected` | `connected` | — |
| `Error` | `error` | — |
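If you have existing handlers written against Flux, one option is a thin adapter that reshapes each Cartesia turn event into a Flux-style dictionary. The sketch below is a hedged illustration: it assumes incoming messages are already parsed JSON objects with a top-level `type` and, where the table says so, a `transcript` string. The name `to_flux_shape` is hypothetical, and the output carries only the `event` and `transcript` fields, not the full Flux schema.

```python
# Cartesia `type` -> Flux `TurnInfo.event`, from the event mapping table.
FLUX_EVENT_FOR = {
    "turn.start": "StartOfTurn",
    "turn.update": "Update",
    "turn.eager_end": "EagerEndOfTurn",
    "turn.resume": "TurnResumed",
    "turn.end": "EndOfTurn",
}


def to_flux_shape(message):
    """Reshape a Cartesia turn event so Flux-era handlers can read it."""
    event = FLUX_EVENT_FOR.get(message.get("type"))
    if event is None:
        # connected, error, and unknown messages pass through unchanged.
        return message
    return {
        "type": "TurnInfo",
        "event": event,
        # turn.start and turn.resume carry no transcript on Cartesia,
        # so handlers see an empty string for those events.
        "transcript": message.get("transcript", ""),
    }
```

Handlers that relied on the transcript in `StartOfTurn` or `TurnResumed` will need adjusting, since the adapter cannot recover text that was never sent.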
TurnInfo message:
turn.end event:
`transcript` is cumulative within a turn.
Ink 2 has the added benefit that all emitted transcripts are final; words are not emitted until the model is confident. Later events will only append to the transcript without modifying text sent by earlier events.
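Because transcripts are cumulative and append-only within a turn, a client can derive just the newly added text from each event. The following sketch assumes parsed message dictionaries with `type` and optional `transcript` keys; the class name `TurnTranscript` and the prefix check are illustrative additions, not part of the API.

```python
class TurnTranscript:
    """Track the cumulative transcript of the current turn."""

    def __init__(self):
        self.text = ""

    def feed(self, message):
        """Return only the text that this message added to the turn."""
        if message["type"] == "turn.start":
            self.text = ""  # a new turn starts from an empty transcript
            return ""
        transcript = message.get("transcript")
        if transcript is None:
            return ""  # e.g. turn.resume carries no transcript
        if not transcript.startswith(self.text):
            # Sanity check on the append-only behaviour described above.
            raise ValueError("transcript changed earlier text")
        delta = transcript[len(self.text):]
        self.text = transcript
        return delta
```

Feeding `turn.update` events with `"hello"` and then `"hello world"` yields `"hello"` followed by `" world"`, which is convenient for streaming text to a UI or downstream model without re-sending earlier words.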
Fields that don’t have an equivalent
Cartesia does not emit:
- `turn_index`
- `audio_window_start`
- `audio_window_end`
- `words`
- `end_of_turn_confidence`
- `sequence_id`
- `languages`
- `languages_hinted`
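If downstream code depends on one of these fields, some can be approximated on the client. As a hedged example, the sketch below maintains a local turn counter in place of `turn_index`. It assumes every turn begins with a `turn.start` event and numbers turns from zero; that numbering is a choice made for this example, and `TurnIndexer` is a hypothetical name.

```python
class TurnIndexer:
    """Attach a locally counted turn_index to Cartesia turn events."""

    def __init__(self):
        self._index = -1  # becomes 0 on the first turn.start

    def annotate(self, message):
        kind = message.get("type", "")
        if kind == "turn.start":
            self._index += 1
        if kind.startswith("turn."):
            # Return a copy so the original message is left untouched.
            return {**message, "turn_index": self._index}
        return message
```

Fields such as `words` and `end_of_turn_confidence` have no comparable client-side substitute, so code that reads them needs to be changed or removed.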