This guide covers migrating from ElevenLabs Realtime Speech to Text when used with commit_strategy=vad.

This guide contains both bare API descriptions and SDK code. To install the SDK:
pip install cartesia
If you’re already using the Cartesia SDK, upgrade to version 3.2.0 or later.
Ink 2 only supports English right now.
We expect to add more languages in the coming months.

Connection

Replace the ElevenLabs WebSocket URL with Cartesia’s /stt/turns/websocket endpoint, and swap the auth header for Cartesia’s API key and version headers:
- wss://api.elevenlabs.io/v1/speech-to-text/realtime?model_id=scribe_v2_realtime&audio_format=pcm_16000&commit_strategy=vad
+ wss://api.cartesia.ai/stt/turns/websocket?model=ink-2&encoding=pcm_s16le&sample_rate=16000
- xi-api-key: <ELEVENLABS_API_KEY>
+ x-api-key: <CARTESIA_API_KEY>
+ cartesia-version: 2026-03-01
Browser WebSockets cannot set request headers. In a browser, pass the API version as the cartesia_version query parameter, and authenticate with a short-lived access token in the access_token query parameter rather than an API key.
Connect to the auto-finalization WebSocket with the Cartesia SDK:
import os
from cartesia import AsyncCartesia

client = AsyncCartesia(api_key=os.getenv("CARTESIA_API_KEY"))

async with client.stt.auto_finalize.websocket(
    model="ink-2", encoding="pcm_s16le", sample_rate=16000
) as connection:
    ...
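
For the browser case described above, where headers are unavailable, the connection URL has to carry everything. The sketch below only shows how such a URL could be assembled. The helper name build_browser_url and the example token are hypothetical; the endpoint and parameter names are the ones used in this guide. A browser client would build the same query string in JavaScript; Python is used here only to match the rest of the guide.

```python
from urllib.parse import urlencode

# Endpoint taken from the connection diff in this guide.
CARTESIA_STT_URL = "wss://api.cartesia.ai/stt/turns/websocket"


def build_browser_url(access_token: str, sample_rate: int = 16000) -> str:
    """Build a header-free connection URL (hypothetical helper)."""
    params = {
        "model": "ink-2",
        "encoding": "pcm_s16le",
        "sample_rate": sample_rate,
        # Takes the place of the cartesia-version header.
        "cartesia_version": "2026-03-01",
        # Takes the place of the x-api-key header; use a short-lived token.
        "access_token": access_token,
    }
    return f"{CARTESIA_STT_URL}?{urlencode(params)}"


print(build_browser_url("example-token"))
```

How the short-lived access token is issued is outside the scope of this sketch.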

Query parameters

| ElevenLabs Scribe (VAD) | Cartesia Realtime STT (Auto) | Notes |
| --- | --- | --- |
| model_id=scribe_v2_realtime (required) | model=ink-2 (required) | See Models for all options. |
| audio_format=pcm_16000 | encoding=pcm_s16le + sample_rate=16000 (required) | ElevenLabs bundles format and rate; Cartesia splits them. See encoding. |
| commit_strategy=vad | | See manual finalization for manual commits. |
| language_code | | ink-2 only supports en right now. More languages are coming soon! |
| | cartesia_version=2026-03-01 (required) | See API Conventions for details. |
| vad_silence_threshold_secs, vad_threshold, min_speech_duration_ms, min_silence_duration_ms | | Cartesia uses semantic turn detection. No VAD tuning required. |
| include_timestamps | | Coming soon! |
| keyterms | | Coming soon! |
| enable_logging | | Controlled by your organization. |
ElevenLabs bundles the sample format and rate into a single audio_format token. Cartesia splits them into encoding and sample_rate.
| ElevenLabs audio_format | Cartesia encoding | Cartesia sample_rate |
| --- | --- | --- |
| pcm_8000 | pcm_s16le | 8000 |
| pcm_16000 | pcm_s16le | 16000 |
| pcm_22050 | pcm_s16le | 22050 |
| pcm_24000 | pcm_s16le | 24000 |
| pcm_44100 | pcm_s16le | 44100 |
| pcm_48000 | pcm_s16le | 48000 |
| ulaw_8000 | pcm_mulaw | 8000 |
Cartesia also accepts pcm_s32le, pcm_f16le, pcm_f32le, and pcm_alaw.
All Cartesia encodings support all sample rates.
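
If your existing configuration stores a single ElevenLabs audio_format value, the mapping above can be applied with a small lookup. This is a minimal sketch: the names AUDIO_FORMAT_MAP and cartesia_audio_params are hypothetical, and only the formats listed in the table are covered.

```python
# ElevenLabs audio_format -> (Cartesia encoding, Cartesia sample_rate),
# copied from the table in this guide.
AUDIO_FORMAT_MAP = {
    "pcm_8000": ("pcm_s16le", 8000),
    "pcm_16000": ("pcm_s16le", 16000),
    "pcm_22050": ("pcm_s16le", 22050),
    "pcm_24000": ("pcm_s16le", 24000),
    "pcm_44100": ("pcm_s16le", 44100),
    "pcm_48000": ("pcm_s16le", 48000),
    "ulaw_8000": ("pcm_mulaw", 8000),
}


def cartesia_audio_params(audio_format: str) -> dict:
    """Return the encoding and sample_rate query parameters (hypothetical helper)."""
    try:
        encoding, sample_rate = AUDIO_FORMAT_MAP[audio_format]
    except KeyError:
        raise ValueError(f"unmapped audio_format: {audio_format!r}") from None
    return {"encoding": encoding, "sample_rate": sample_rate}


print(cartesia_audio_params("pcm_16000"))
```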

Sending audio

ElevenLabs wraps each audio chunk in a JSON-formatted text frame and base64-encodes the audio bytes.
Cartesia accepts audio chunks as binary frames, so you send the raw audio bytes directly:
- { "message_type": "input_audio_chunk", "audio_base_64": "<base64 PCM>", "commit": false, "sample_rate": 16000 }
+ <raw PCM bytes>
  • No need to supply previous text
  • Sample rate is determined upon connection by the sample_rate query parameter
If you currently commit audio mid-session with ElevenLabs, consider using Cartesia with manual finalization instead. Take a look at the guides page for details.
To commit all audio and close the session, send a JSON-formatted text frame:
{ "type": "close" }
Cartesia will transcribe all buffered audio, then close the socket for you.

Sending audio with the SDK

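# ElevenLabs
from base64 import b64encode
await elevenlabs_connection.send({
  # b64encode returns bytes; decode to str so it can be sent as JSON
  "audio_base_64": b64encode(raw_audio).decode("ascii"),
})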

# Cartesia
# raw_audio (bytes) - Raw audio data, about 100 ms at a time
await connection.send_raw(raw_audio)
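
The comment above suggests sending roughly 100 ms of audio per frame. As a rough sketch of how a longer buffer could be cut into chunks of that size, the generator below assumes pcm_s16le (2 bytes per sample) and a single channel; the channel count is an assumption, and chunk_audio is a hypothetical helper, not part of the SDK.

```python
from typing import Iterator

BYTES_PER_SAMPLE = 2  # pcm_s16le: 16-bit samples; mono is assumed here


def chunk_audio(pcm: bytes, sample_rate: int = 16000, chunk_ms: int = 100) -> Iterator[bytes]:
    """Yield consecutive slices of about chunk_ms milliseconds of raw PCM."""
    chunk_bytes = sample_rate * chunk_ms // 1000 * BYTES_PER_SAMPLE
    for start in range(0, len(pcm), chunk_bytes):
        # The final slice may be shorter than chunk_bytes.
        yield pcm[start:start + chunk_bytes]


one_second_of_silence = bytes(16000 * BYTES_PER_SAMPLE)
print(len(list(chunk_audio(one_second_of_silence))))
```

Each yielded chunk can then be passed to connection.send_raw as in the example above.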

Decoding base64 encoded audio before sending

# ElevenLabs
await elevenlabs_connection.send({
  "audio_base_64": audio_base_64,
})

# Cartesia
from base64 import b64decode
await connection.send_raw(b64decode(audio_base_64))

Committing and closing

# ElevenLabs
elevenlabs_connection.on(RealtimeEvents.COMMITTED_TRANSCRIPT, lambda: elevenlabs_connection.close())
await elevenlabs_connection.commit()

# Cartesia
await connection.send({"type": "close"})

# Cartesia: Close the socket early (optional)
await connection.close()

Event mapping

Scribe emits a partial_transcript, then a committed_transcript when its VAD commits a segment. Cartesia folds the same information into a turn lifecycle: turn.start, turn.update, turn.eager_end, turn.resume, and turn.end. See Turn Detection for the full state machine.
| ElevenLabs message_type | Cartesia type | Notes |
| --- | --- | --- |
| session_started | connected | Connection confirmed. You do not need to wait for it before sending audio. |
| partial_transcript | turn.update | Partial transcript while the user is speaking. |
| committed_transcript | turn.end | User stopped speaking; contains the complete transcript for the user turn. |
| committed_transcript_with_timestamps | turn.end | Timestamps are not yet available. |
| | turn.start | The user began speaking. Carries no transcript. |
| | turn.eager_end | The model predicts the user might be done speaking. Okay to ignore. |
| | turn.resume | The user kept talking; ignore the last turn.eager_end. |
| error | error | Client or server errors. |
| auth_error | | Cartesia will reject the WebSocket upgrade with a 401 or 403 HTTP status. |
| quota_exceeded | error | Cartesia’s error response will contain "error_code": "quota_exceeded". |
| rate_limited | error | Cartesia’s error response will contain "error_code": "concurrency_limited". |
| session_time_limit_exceeded | | Cartesia will send a WebSocket close frame with code 1001. |
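
The three events with no Scribe counterpart (turn.start, turn.eager_end, turn.resume) are safe to ignore, but if you want to act on them, the bookkeeping is small. The sketch below is one possible way to track a turn; it assumes events arrive as parsed JSON dicts with a "type" field and, where present, a "transcript" field. The TurnTracker class is hypothetical and is not the state machine documented under Turn Detection.

```python
from typing import Optional


class TurnTracker:
    """Tracks one user turn from Cartesia turn.* events (illustrative only)."""

    def __init__(self) -> None:
        self.speaking = False
        # Transcript from the latest turn.eager_end that has not been resumed.
        self.eager_transcript: Optional[str] = None

    def handle(self, event: dict) -> Optional[str]:
        """Return the final transcript on turn.end, otherwise None."""
        kind = event.get("type")
        if kind == "turn.start":
            self.speaking = True
            self.eager_transcript = None
        elif kind == "turn.eager_end":
            # The user might be done; a caller could start preparing a reply.
            self.eager_transcript = event.get("transcript")
        elif kind == "turn.resume":
            # The user kept talking; discard the speculative end.
            self.eager_transcript = None
        elif kind == "turn.end":
            self.speaking = False
            self.eager_transcript = None
            return event.get("transcript")
        return None


tracker = TurnTracker()
for evt in (
    {"type": "turn.start"},
    {"type": "turn.eager_end", "transcript": "Hello"},
    {"type": "turn.resume"},
    {"type": "turn.end", "transcript": "Hello world!"},
):
    final = tracker.handle(evt)
print(final)
```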

Partial transcripts

An ElevenLabs partial_transcript:
{
  "message_type": "partial_transcript",
  "text": "Hello"
}
Becomes a Cartesia turn.update:
{
  "type": "turn.update",
  "transcript": "Hello",
  "request_id": "33cacee6-1936-4949-a05b-ecc9f2393248"
}

Committed transcripts

An ElevenLabs committed_transcript:
{
  "message_type": "committed_transcript",
  "text": "Hello world!"
}
Becomes a Cartesia turn.end:
{
  "type": "turn.end",
  "transcript": "Hello world!",
  "request_id": "33cacee6-1936-4949-a05b-ecc9f2393248"
}
To handle these events with the Cartesia SDK:
import asyncio
from cartesia.types.stt import STTAutoFinalizeWebsocketResponse

def on_message(message: STTAutoFinalizeWebsocketResponse) -> None:
    if message.type == "turn.start":
        print("User started speaking")
    elif message.type == "turn.update":
        print(f"partial_transcript: {message.transcript}")
    elif message.type == "turn.end":
        print(f"committed_transcript: {message.transcript}")
    elif message.type == "error":
        error_code = message.error_code or "unknown_error"
        if error_code == "quota_exceeded":
            print("You are out of credits")
        elif error_code == "concurrency_limited":
            print("You have too many open STT connections")
        else:
            print(f"{error_code}: {message.message}")

connection.on("event", on_message)

# ElevenLabs dispatches to your callbacks automatically;
# with Cartesia you run the dispatch loop yourself
recv_task = asyncio.create_task(connection.dispatch_events())
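
If downstream code still expects Scribe-shaped messages, a thin adapter can ease the transition. The sketch below follows the event mapping and payload examples in this guide and works on plain dicts (parsed JSON), not on SDK objects. The names TYPE_MAP and to_elevenlabs_shape are hypothetical, and errors are deliberately left out because their ElevenLabs payload shape is not covered here.

```python
from typing import Optional

# Cartesia type -> ElevenLabs message_type, per the event mapping in this guide.
TYPE_MAP = {
    "connected": "session_started",
    "turn.update": "partial_transcript",
    "turn.end": "committed_transcript",
}


def to_elevenlabs_shape(event: dict) -> Optional[dict]:
    """Translate a Cartesia event dict into a Scribe-style dict, or None."""
    message_type = TYPE_MAP.get(event.get("type"))
    if message_type is None:
        # turn.start, turn.eager_end, and turn.resume have no Scribe equivalent.
        return None
    legacy = {"message_type": message_type}
    if "transcript" in event:
        # Scribe calls this field "text"; Cartesia calls it "transcript".
        legacy["text"] = event["transcript"]
    return legacy


print(to_elevenlabs_shape({"type": "turn.end", "transcript": "Hello world!"}))
```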

Example Server Messages

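Scribe’s transcripts need to be joined with spaces. Ink’s do not: in the example below, the second turn’s transcript already begins with a space, so consecutive turns can be concatenated directly.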
| ElevenLabs Scribe (VAD) | Cartesia Realtime STT (Auto) |
| --- | --- |
| | turn.start |
| partial_transcript "Scribe's transcripts" | turn.update "Scribe's transcripts" |
| | turn.eager_end "Scribe's transcripts" |
| | turn.resume |
| partial_transcript "Scribe's transcripts are joined with spaces." | turn.update "Scribe's transcripts are joined with spaces." |
| | turn.eager_end "Scribe's transcripts are joined with spaces." |
| committed_transcript "Scribe's transcripts are joined with spaces." | turn.end "Scribe's transcripts are joined with spaces." |
| committed_transcript_with_timestamps "Scribe's transcripts are joined with spaces." | |
| | turn.start |
| partial_transcript "Ink's are not." | turn.update " Ink's are not." |
| | turn.eager_end " Ink's are not." |
| committed_transcript "Ink's are not." | turn.end " Ink's are not." |
| committed_transcript_with_timestamps "Ink's are not." | |
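
To make the spacing difference concrete, the sketch below rebuilds the full text from the final transcripts in the example above. The helper names are hypothetical; the strings are the committed_transcript and turn.end values from the example.

```python
def join_scribe(committed: list[str]) -> str:
    """Scribe segments carry no leading space, so insert one between them."""
    return " ".join(committed)


def join_ink(turn_ends: list[str]) -> str:
    """Ink transcripts already carry their leading space; concatenate as-is."""
    return "".join(turn_ends)


scribe_segments = ["Scribe's transcripts are joined with spaces.", "Ink's are not."]
ink_segments = ["Scribe's transcripts are joined with spaces.", " Ink's are not."]

# Both approaches reconstruct the same sentence pair.
assert join_scribe(scribe_segments) == join_ink(ink_segments)
print(join_ink(ink_segments))
```

If you carry over a " ".join from your ElevenLabs code, the Ink output would end up with doubled spaces between turns.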

References

API Reference

Cartesia Realtime STT (Auto)

Full Code Example

Using the Cartesia SDK