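This guide covers migrating from ElevenLabs Scribe with commit_strategy=vad to Cartesia Realtime STT with auto-finalization.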
If you’re already using the Cartesia SDK, upgrade to version >=3.2.0.
Ink 2 only supports English right now.
We expect to add more languages in the coming months.
Connection
Replace the ElevenLabs WebSocket URL and auth header with Cartesia’s /stt/turns/websocket.
Set the cartesia_version query param, and use a short-lived access token via the access_token query param instead of an API key.
Connect to the auto-finalization WebSocket with the Cartesia SDK:
Query parameters
| ElevenLabs Scribe (VAD) | Cartesia Realtime STT (Auto) | Notes |
|---|---|---|
| model_id=scribe_v2_realtime required | model=ink-2 required | See Models for all options. |
| audio_format=pcm_16000 | encoding=pcm_s16le + sample_rate=16000 required | ElevenLabs bundles format and rate; Cartesia splits them. See encoding. |
| commit_strategy=vad | — | See manual finalization for manual commits. |
| language_code | — | ink-2 only supports en right now. More languages are coming soon! |
| — | cartesia_version=2026-03-01 required | See API Conventions for details. |
| vad_silence_threshold_secs, vad_threshold, min_speech_duration_ms, min_silence_duration_ms | — | Cartesia uses semantic turn detection. No VAD tuning required. |
| include_timestamps | — | Coming soon! |
| keyterms | — | Coming soon! |
| enable_logging | — | Controlled by your organization. |
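As a rough sketch of how the Cartesia parameters in the table combine into a connection URL, the Python helper below builds the query string with the standard library. The host `api.cartesia.ai` and the helper name `build_cartesia_url` are assumptions for illustration; only the path and the parameter names and values come from this guide.

```python
from urllib.parse import urlencode

# Assumption: the API host is not stated in this guide; substitute your own.
CARTESIA_WS_BASE = "wss://api.cartesia.ai/stt/turns/websocket"


def build_cartesia_url(
    access_token: str,
    sample_rate: int = 16000,
    encoding: str = "pcm_s16le",
    base: str = CARTESIA_WS_BASE,
) -> str:
    """Build the WebSocket URL (hypothetical helper, not part of the SDK)."""
    params = {
        "model": "ink-2",                  # replaces model_id=scribe_v2_realtime
        "encoding": encoding,              # split out of ElevenLabs audio_format
        "sample_rate": sample_rate,        # split out of ElevenLabs audio_format
        "cartesia_version": "2026-03-01",  # required, no ElevenLabs equivalent
        "access_token": access_token,      # short-lived token, not an API key
    }
    return f"{base}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_cartesia_url("YOUR_ACCESS_TOKEN"))
```

The ElevenLabs VAD tuning parameters, language_code, include_timestamps, keyterms, and enable_logging are simply omitted, since the table lists no Cartesia counterpart for them.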
encoding
ElevenLabs bundles the sample format and rate into a single audio_format token. Cartesia splits them into encoding and sample_rate.

| ElevenLabs audio_format | Cartesia encoding | Cartesia sample_rate |
|---|---|---|
| pcm_8000 | pcm_s16le | 8000 |
| pcm_16000 | pcm_s16le | 16000 |
| pcm_22050 | pcm_s16le | 22050 |
| pcm_24000 | pcm_s16le | 24000 |
| pcm_44100 | pcm_s16le | 44100 |
| pcm_48000 | pcm_s16le | 48000 |
| ulaw_8000 | pcm_mulaw | 8000 |

Cartesia also accepts pcm_s32le, pcm_f16le, pcm_f32le, and pcm_alaw. All Cartesia encodings support all sample rates.
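If your existing configuration stores the ElevenLabs audio_format token, a small lookup can convert it. This is a minimal sketch that encodes only the rows of the table above; the function name `split_audio_format` is hypothetical.

```python
# Mapping copied from the audio_format table: token -> (encoding, sample_rate).
AUDIO_FORMAT_TO_CARTESIA = {
    "pcm_8000": ("pcm_s16le", 8000),
    "pcm_16000": ("pcm_s16le", 16000),
    "pcm_22050": ("pcm_s16le", 22050),
    "pcm_24000": ("pcm_s16le", 24000),
    "pcm_44100": ("pcm_s16le", 44100),
    "pcm_48000": ("pcm_s16le", 48000),
    "ulaw_8000": ("pcm_mulaw", 8000),
}


def split_audio_format(audio_format: str) -> tuple[str, int]:
    """Return the Cartesia (encoding, sample_rate) pair for an ElevenLabs token."""
    try:
        return AUDIO_FORMAT_TO_CARTESIA[audio_format]
    except KeyError:
        # Fail loudly rather than guess at a format the table does not list.
        raise ValueError(f"No known Cartesia mapping for {audio_format!r}") from None


if __name__ == "__main__":
    encoding, sample_rate = split_audio_format("pcm_16000")
    print(f"encoding={encoding}&sample_rate={sample_rate}")
```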
Sending audio
ElevenLabs wraps each audio chunk in a JSON-formatted text frame and base64-encodes the audio bytes. Cartesia accepts audio chunks as binary frames, so send the raw audio bytes directly:
- No need to supply previous text
- Sample rate is determined upon connection by the sample_rate query parameter
Sending audio with the SDK
Decoding base64 encoded audio before sending
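If part of your pipeline still produces ElevenLabs-style JSON text frames, one option is to unwrap them just before sending. The sketch below assumes the base64 payload lives in a field named `audio_base_64`; that field name and the helper `elevenlabs_frame_to_binary` are assumptions, so adjust them to match the frames your code actually emits.

```python
import base64
import json


def elevenlabs_frame_to_binary(text_frame: str, audio_field: str = "audio_base_64") -> bytes:
    """Extract raw audio bytes from a JSON text frame (hypothetical helper).

    Everything except the audio payload is dropped: Cartesia takes the sample
    rate from the sample_rate query parameter and needs no previous text.
    """
    message = json.loads(text_frame)
    return base64.b64decode(message[audio_field])


if __name__ == "__main__":
    # Build a stand-in frame with four bytes of fake PCM audio.
    frame = json.dumps({"audio_base_64": base64.b64encode(b"\x00\x01\x02\x03").decode("ascii")})
    raw = elevenlabs_frame_to_binary(frame)
    print(len(raw), "bytes ready to send as one binary WebSocket frame")
```

The returned bytes are what you would pass to your WebSocket client's binary send call.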
Committing and closing
Event mapping
Scribe emits a partial_transcript, then a committed_transcript when its VAD commits a segment.
Cartesia folds the same information into a turn lifecycle: turn.start, turn.update, turn.eager_end, turn.resume, and turn.end. See Turn Detection for the full state machine.
| ElevenLabs message_type | Cartesia type | Notes |
|---|---|---|
| session_started | connected | Connection confirmed. You do not need to wait for it before sending audio. |
| partial_transcript | turn.update | Partial transcript while the user is speaking. |
| committed_transcript | turn.end | User stopped speaking; contains the complete transcript for the user turn. |
| committed_transcript_with_timestamps | turn.end | Timestamps are not yet available. |
| — | turn.start | The user began speaking. Carries no transcript. |
| — | turn.eager_end | The model predicts the user might be done speaking. Okay to ignore. |
| — | turn.resume | The user kept talking; ignore the last turn.eager_end. |
| error | error | Client or server errors. |
| auth_error | — | Cartesia will reject the WebSocket upgrade with a 401 or 403 HTTP status. |
| quota_exceeded | error | Cartesia’s error response will contain "error_code": "quota_exceeded". |
| rate_limited | error | Cartesia’s error response will contain "error_code": "concurrency_limited". |
| session_time_limit_exceeded | — | Cartesia will send a WebSocket close frame with code 1001. |
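For code that already dispatches on ElevenLabs message_type values, a thin adapter can translate incoming Cartesia events while you migrate. The sketch below follows the table above; the Cartesia `transcript` field, the ElevenLabs `text` field, the `details` key, and the function name `to_scribe_message` are assumptions for illustration, while `type`, `error_code`, and the type names come from this guide.

```python
from typing import Optional

# Cartesia type -> ElevenLabs message_type, per the event mapping table.
TYPE_TO_MESSAGE_TYPE = {
    "connected": "session_started",
    "turn.update": "partial_transcript",
    "turn.end": "committed_transcript",
}

# Cartesia error_code -> ElevenLabs message_type for the two specific errors.
ERROR_CODE_TO_MESSAGE_TYPE = {
    "quota_exceeded": "quota_exceeded",
    "concurrency_limited": "rate_limited",
}


def to_scribe_message(event: dict) -> Optional[dict]:
    """Translate one Cartesia event into a Scribe-shaped message, or None."""
    event_type = event.get("type")
    if event_type == "error":
        message_type = ERROR_CODE_TO_MESSAGE_TYPE.get(event.get("error_code"), "error")
        return {"message_type": message_type, "details": event}
    message_type = TYPE_TO_MESSAGE_TYPE.get(event_type)
    if message_type is None:
        # turn.start, turn.eager_end, and turn.resume have no Scribe equivalent.
        return None
    message = {"message_type": message_type}
    if "transcript" in event:  # assumed field name for the transcript text
        message["text"] = event["transcript"]
    return message


if __name__ == "__main__":
    print(to_scribe_message({"type": "turn.update", "transcript": "hello"}))
    print(to_scribe_message({"type": "turn.eager_end", "transcript": "hello"}))
```

Authentication failures and session time limits never reach this adapter, because they surface as a rejected WebSocket upgrade and a close frame rather than as events.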
Partial transcripts
An ElevenLabs partial_transcript:
A Cartesia turn.update:
Committed transcripts
An ElevenLabs committed_transcript:
A Cartesia turn.end:
Example Server Messages
Scribe’s transcripts are joined with spaces. Ink’s are not.
| ElevenLabs Scribe (VAD) | Cartesia Realtime STT (Auto) |
|---|---|
| — | turn.start |
| partial_transcript "Scribe's transcripts" | turn.update "Scribe's transcripts" |
| — | turn.eager_end "Scribe's transcripts" |
| — | turn.resume |
| partial_transcript "Scribe's transcripts are joined with spaces." | turn.update "Scribe's transcripts are joined with spaces." |
| — | turn.eager_end "Scribe's transcripts are joined with spaces." |
| committed_transcript "Scribe's transcripts are joined with spaces." | turn.end "Scribe's transcripts are joined with spaces." |
| committed_transcript_with_timestamps "Scribe's transcripts are joined with spaces." | — |
| — | turn.start |
| partial_transcript "Ink's are not." | turn.update " Ink's are not." |
| — | turn.eager_end " Ink's are not." |
| committed_transcript "Ink's are not." | turn.end " Ink's are not." |
| committed_transcript_with_timestamps "Ink's are not." | — |
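The practical consequence of the leading space is in how you assemble a full transcript. The sketch below replays the Cartesia column of the table and concatenates the turn.end transcripts without a separator, next to the space-joined Scribe equivalent. The `transcript` field name and both helper names are assumptions for illustration.

```python
def join_ink_turns(events: list[dict]) -> str:
    """Concatenate turn.end transcripts as-is; Ink includes its own spacing."""
    return "".join(e["transcript"] for e in events if e["type"] == "turn.end")


def join_scribe_segments(segments: list[str]) -> str:
    """Scribe segments carry no leading space, so join them with one."""
    return " ".join(segments)


# The Cartesia event sequence from the table above.
INK_EVENTS = [
    {"type": "turn.start"},
    {"type": "turn.update", "transcript": "Scribe's transcripts"},
    {"type": "turn.eager_end", "transcript": "Scribe's transcripts"},
    {"type": "turn.resume"},
    {"type": "turn.update", "transcript": "Scribe's transcripts are joined with spaces."},
    {"type": "turn.eager_end", "transcript": "Scribe's transcripts are joined with spaces."},
    {"type": "turn.end", "transcript": "Scribe's transcripts are joined with spaces."},
    {"type": "turn.start"},
    {"type": "turn.update", "transcript": " Ink's are not."},
    {"type": "turn.eager_end", "transcript": " Ink's are not."},
    {"type": "turn.end", "transcript": " Ink's are not."},
]

SCRIBE_SEGMENTS = ["Scribe's transcripts are joined with spaces.", "Ink's are not."]

if __name__ == "__main__":
    print(join_ink_turns(INK_EVENTS))
    print(join_scribe_segments(SCRIBE_SEGMENTS))
```

Both calls produce the same sentence pair. Reusing the space-joining logic on Ink output would instead leave a double space between turns.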
References
API Reference
Cartesia Realtime STT (Auto)
Full Code Example
Using the Cartesia SDK