This guide covers migrating from OpenAI Realtime Transcription when used with turn_detection: server_vad.

This guide contains both bare API descriptions and SDK code. To install the SDK:
pip install cartesia
If you’re already using the Cartesia SDK, upgrade to version 3.2.0 or later.
Ink 2 only supports English right now.
We expect to add more languages in the coming months.

Connection

Replace the OpenAI WebSocket URL and auth header with Cartesia’s /stt/turns/websocket, including your desired model and input audio format as query parameters:
- wss://api.openai.com/v1/realtime?intent=transcription
+ wss://api.cartesia.ai/stt/turns/websocket?model=ink-2&encoding=pcm_s16le&sample_rate=24000
- Authorization: Bearer <OPENAI_API_KEY>
+ Authorization: Bearer <CARTESIA_API_KEY>
+ Cartesia-Version: 2026-03-01
In browsers, the WebSocket API does not let you set custom request headers. Pass the API version as the cartesia_version query parameter, and authenticate with a short-lived access token in the access_token query parameter instead of an API key.

Connect to the auto-finalization WebSocket with the Cartesia SDK:
import os
from cartesia import AsyncCartesia

client = AsyncCartesia(api_key=os.getenv("CARTESIA_API_KEY"))

async with client.stt.auto_finalize.websocket(
    model="ink-2", encoding="pcm_s16le", sample_rate=24000
) as connection:
    ...
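
For clients that connect without the SDK, the query string described above can be assembled with the standard library. The following is a minimal sketch: `build_stt_url` is a hypothetical helper name, not part of the Cartesia SDK, and the token passed in the usage note is a placeholder.

```python
from typing import Optional
from urllib.parse import urlencode

# Endpoint taken from the connection diff above.
BASE_URL = "wss://api.cartesia.ai/stt/turns/websocket"


def build_stt_url(
    model: str = "ink-2",
    encoding: str = "pcm_s16le",
    sample_rate: int = 24000,
    cartesia_version: Optional[str] = None,
    access_token: Optional[str] = None,
) -> str:
    """Build the WebSocket URL (hypothetical helper, not an SDK function)."""
    params = {"model": model, "encoding": encoding, "sample_rate": sample_rate}
    # Browsers cannot send headers, so these two travel as query parameters.
    if cartesia_version is not None:
        params["cartesia_version"] = cartesia_version
    if access_token is not None:
        params["access_token"] = access_token
    return f"{BASE_URL}?{urlencode(params)}"
```

A server-side client would call `build_stt_url()` and send the Authorization and Cartesia-Version headers; a browser client would pass `cartesia_version` and `access_token` instead.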

Session configuration

OpenAI configures the session in the session.update payload. Cartesia takes the equivalent settings as query parameters.
| OpenAI session config | Cartesia Realtime STT (Auto) | Notes |
| --- | --- | --- |
| ?intent=transcription | | Ink only supports transcription. |
| audio.input.transcription.model (gpt-4o-transcribe) | model=ink-2 (required) | See Models for all options. |
| audio.input.format (audio/pcm, 24 kHz) | encoding=pcm_s16le + sample_rate=24000 (required) | Cartesia supports many more input audio formats. See encoding for all options. |
| audio.input.turn_detection (server_vad) | | See manual finalization to disable turn detection. |
| audio.input.transcription.language | | ink-2 only supports en right now. More languages are coming soon! |
| audio.input.transcription.delay | | Not configurable. |
| audio.input.noise_reduction | | Not required. |
| include: ["item.input_audio_transcription.logprobs"] | | Coming soon! |
| | cartesia_version=2026-03-01 (required) | See API Conventions for details. |

OpenAI sets the input format under audio.input.format. Cartesia takes encoding and sample_rate as query parameters.
| OpenAI audio.input.format | Cartesia encoding | Cartesia sample_rate |
| --- | --- | --- |
| { "type": "audio/pcm", "rate": 24000 } | pcm_s16le | 24000 |
| g711_ulaw | pcm_mulaw | 8000 |
| g711_alaw | pcm_alaw | 8000 |

OpenAI’s PCM format is 16-bit, 24 kHz, mono. Cartesia accepts that sample_rate directly, so you can stream the same audio without resampling. Cartesia also accepts pcm_s32le, pcm_f16le, and pcm_f32le.
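
The format mapping above can be expressed as a small lookup. This is a sketch only: `cartesia_audio_params` is a hypothetical name, and it assumes your existing code holds the OpenAI format either as the PCM object or as one of the G.711 strings shown in the table.

```python
from typing import Tuple, Union

# Copied from the table above: OpenAI format -> (Cartesia encoding, sample_rate).
G711_FORMATS = {
    "g711_ulaw": ("pcm_mulaw", 8000),
    "g711_alaw": ("pcm_alaw", 8000),
}


def cartesia_audio_params(openai_format: Union[str, dict]) -> Tuple[str, int]:
    """Translate an OpenAI audio.input.format value (hypothetical helper)."""
    if isinstance(openai_format, dict):
        if openai_format.get("type") == "audio/pcm":
            # OpenAI PCM is 16-bit mono at 24 kHz; the rate passes through unchanged.
            return "pcm_s16le", int(openai_format.get("rate", 24000))
        raise ValueError(f"unmapped format: {openai_format!r}")
    if openai_format in G711_FORMATS:
        return G711_FORMATS[openai_format]
    raise ValueError(f"unmapped format: {openai_format!r}")
```

The returned pair maps directly onto the encoding and sample_rate query parameters.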

Sending audio

OpenAI wraps each audio chunk in a JSON-formatted text frame and base64-encodes the audio bytes.
Cartesia accepts audio chunks as binary frames, so send the raw audio bytes directly:
- { "type": "input_audio_buffer.append", "audio": "<base64 PCM>" }
+ <raw PCM bytes>
There’s no equivalent for OpenAI’s session.update message; to change parameters, open a new WebSocket connection. To commit all audio and close the session, send a JSON-formatted text frame:
{ "type": "close" }
Cartesia will transcribe all buffered audio, then close the socket for you.
If you currently commit audio mid-session with OpenAI using input_audio_buffer.commit, consider using Cartesia with manual finalization instead. See the guides page for details.

Sending audio with the SDK

# raw_audio (bytes) - Raw audio data, about 100 ms at a time
await connection.send_raw(raw_audio)
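
To make the "about 100 ms at a time" figure concrete: 16-bit mono PCM at 24 kHz is 24000 × 2 = 48000 bytes per second, so 100 ms is 4800 bytes. The sketch below splits a buffer accordingly; `chunk_pcm` is a hypothetical helper, and it assumes mono audio.

```python
from typing import Iterator


def chunk_pcm(
    audio: bytes,
    sample_rate: int = 24000,
    bytes_per_sample: int = 2,  # pcm_s16le is 2 bytes per sample
    chunk_ms: int = 100,
) -> Iterator[bytes]:
    """Yield consecutive slices of roughly chunk_ms each (mono assumed)."""
    chunk_size = sample_rate * bytes_per_sample * chunk_ms // 1000
    for start in range(0, len(audio), chunk_size):
        # The final slice may be shorter than chunk_size.
        yield audio[start:start + chunk_size]
```

Each yielded chunk can then be passed to `connection.send_raw` as in the snippet above.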

Decoding base64 encoded audio before sending

from base64 import b64decode

await connection.send_raw(b64decode(audio_base_64))
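
If an existing pipeline already produces OpenAI-style client messages, a thin adapter can unwrap them before sending. This is a sketch under that assumption; `openai_message_to_raw_audio` is a hypothetical name, and messages other than input_audio_buffer.append are simply skipped.

```python
import json
from base64 import b64decode
from typing import Optional


def openai_message_to_raw_audio(message: str) -> Optional[bytes]:
    """Return raw audio bytes from an OpenAI append message, else None."""
    payload = json.loads(message)
    if payload.get("type") != "input_audio_buffer.append":
        # For example session.update, which has no Cartesia equivalent.
        return None
    return b64decode(payload["audio"])
```

A non-None result is what you would pass to `connection.send_raw`.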

Closing

# Commit buffered audio
# and let the server close the socket once done
await connection.send({"type": "close"})

# Close the socket early (optional)
connection.close()

Event mapping

OpenAI signals turns with input_audio_buffer.speech_started / speech_stopped / committed, then bursts transcript deltas and a completed event per turn. Cartesia folds the same information into a turn lifecycle: turn.start, turn.update, turn.eager_end, turn.resume, and turn.end. See Turn Detection for the full state machine.
| OpenAI type | Cartesia type | Notes |
| --- | --- | --- |
| session.created / session.updated | connected | Cartesia has no session-config round-trip. You do not need to wait before sending audio. |
| input_audio_buffer.speech_started | turn.start | The user began speaking. Carries no transcript. |
| conversation.item.input_audio_transcription.delta | turn.update | OpenAI bursts deltas after the turn commits; Cartesia’s turn.update streams during the turn. |
| input_audio_buffer.speech_stopped / committed | turn.end | The user stopped speaking and the turn committed. |
| conversation.item.input_audio_transcription.completed | turn.end | Final transcript for the turn. |
| | turn.eager_end | The model predicts the user might be done speaking. Okay to ignore. |
| | turn.resume | The user kept talking; ignore the last turn.eager_end. |
| error | error | Client or server errors. |
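
During a gradual migration, it can help to keep this mapping in code so that existing OpenAI-oriented handlers can be reused. The sketch below only restates the table; `CARTESIA_TO_OPENAI` and `openai_equivalents` are hypothetical names.

```python
from typing import Dict, List

# Restates the event mapping table: Cartesia type -> OpenAI type(s).
CARTESIA_TO_OPENAI: Dict[str, List[str]] = {
    "connected": ["session.created", "session.updated"],
    "turn.start": ["input_audio_buffer.speech_started"],
    "turn.update": ["conversation.item.input_audio_transcription.delta"],
    "turn.end": [
        "input_audio_buffer.speech_stopped",
        "input_audio_buffer.committed",
        "conversation.item.input_audio_transcription.completed",
    ],
    # No OpenAI counterpart; both are safe to ignore.
    "turn.eager_end": [],
    "turn.resume": [],
    "error": ["error"],
}


def openai_equivalents(cartesia_type: str) -> List[str]:
    """Return the OpenAI event types a Cartesia event stands in for."""
    return CARTESIA_TO_OPENAI.get(cartesia_type, [])
```

Note that one turn.end covers three OpenAI events, so a handler keyed on the OpenAI names may fire several times per turn.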

Completed transcripts

An OpenAI conversation.item.input_audio_transcription.completed event:
{
  "type": "conversation.item.input_audio_transcription.completed",
  "item_id": "item_003",
  "content_index": 0,
  "transcript": "Hello world!"
}
Becomes a Cartesia turn.end event:
{
  "type": "turn.end",
  "transcript": "Hello world!",
  "request_id": "33cacee6-1936-4949-a05b-ecc9f2393248"
}
turn.start and turn.resume events do not carry a transcript.
import asyncio
from cartesia.types.stt import STTAutoFinalizeWebsocketResponse

full_transcript = ""

async def receive() -> None:
    global full_transcript
    async for event in connection:
        if event.type == "turn.start":
            print("speech_started")
        elif event.type == "turn.update":
            # cumulative within a turn
            print(f"Transcript so far: {event.transcript}")
        elif event.type == "turn.end":
            # Do not strip or add spaces!
            full_transcript += event.transcript
            print(f"speech_stopped: {event.transcript}")
        elif event.type == "error":
            print(f"error: {event.message}")

# Run receive() concurrently with your audio sender:
#   await asyncio.gather(send_audio(), receive())
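
Because turn.update transcripts are cumulative while OpenAI deltas are incremental, downstream code that expects deltas may need a small shim. The following sketch assumes an update usually extends the previous transcript; whether Cartesia ever revises earlier text is not stated in this guide, so the helper flags that case instead of guessing. `transcript_delta` is a hypothetical name.

```python
from typing import Tuple


def transcript_delta(previous: str, current: str) -> Tuple[str, bool]:
    """Return (text, is_append) for a cumulative transcript update.

    is_append is True when current extends previous, and text is the new
    suffix. Otherwise text is the full current transcript and the caller
    should replace what it has shown so far.
    """
    if current.startswith(previous):
        return current[len(previous):], True
    return current, False
```

Reset `previous` to an empty string at each turn.start, since the transcript is cumulative only within a turn.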

Example Server Messages

OpenAI batches each turn. Ink streams within the turn.
| OpenAI gpt-4o-transcribe (server VAD) | Cartesia Realtime STT (Auto) |
| --- | --- |
| session.updated | connected |
| speech_started | turn.start |
| | turn.update "OpenAI batches" |
| | turn.update "OpenAI batches each turn." |
| | turn.eager_end "OpenAI batches each turn." |
| speech_stopped + committed | |
| …transcription.delta "OpenAI batches each turn." (burst after commit) | |
| …transcription.completed "OpenAI batches each turn." | turn.end "OpenAI batches each turn." |
| speech_started | turn.start |
| | turn.update "Ink streams" |
| | turn.eager_end "Ink streams" |
| | turn.resume |
| | turn.update "Ink streams within the turn." |
| | turn.eager_end "Ink streams within the turn." |
| speech_stopped + committed | |
| …transcription.delta "Ink streams within the turn." (burst after commit) | |
| …transcription.completed "Ink streams within the turn." | turn.end "Ink streams within the turn." |
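
The second turn in the table shows a turn.eager_end that is later cancelled by turn.resume. One way to act on eager ends without committing to them is to hold the transcript as pending until the turn either resumes or ends. The sketch below replays that turn as plain tuples; `EagerEndTracker` is a hypothetical class, not an SDK type.

```python
from typing import List, Optional


class EagerEndTracker:
    """Holds a speculative transcript until turn.resume or turn.end."""

    def __init__(self) -> None:
        self.pending: Optional[str] = None  # transcript from the last eager end
        self.final: List[str] = []          # transcripts of committed turns

    def handle(self, event_type: str, transcript: str = "") -> None:
        if event_type == "turn.eager_end":
            self.pending = transcript
        elif event_type == "turn.resume":
            # The user kept talking: drop the speculative transcript.
            self.pending = None
        elif event_type == "turn.end":
            self.pending = None
            self.final.append(transcript)


# Second turn from the table above, as (type, transcript) pairs.
events = [
    ("turn.start", ""),
    ("turn.update", "Ink streams"),
    ("turn.eager_end", "Ink streams"),
    ("turn.resume", ""),
    ("turn.update", "Ink streams within the turn."),
    ("turn.eager_end", "Ink streams within the turn."),
    ("turn.end", "Ink streams within the turn."),
]

tracker = EagerEndTracker()
for event_type, transcript in events:
    tracker.handle(event_type, transcript)
```

Work started from a pending transcript, such as a speculative downstream request, would be cancelled on turn.resume and kept on turn.end.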

References

API Reference

Cartesia Realtime STT (Auto)

Full Code Example

Using the Cartesia SDK