Migrating from OpenAI Realtime Transcription with Turn Detection

This guide covers migrating from OpenAI Realtime Transcription when used with turn_detection: server_vad.

This guide contains both bare API descriptions and SDK code. To install the SDK:

pip install cartesia

npm i @cartesia/cartesia-js

If you’re already using the Cartesia SDK, upgrade to version 3.2.0 or later.

Ink 2 only supports English right now.
We expect to add more languages in the coming months.

Connection

Replace the OpenAI WebSocket URL and auth header with Cartesia’s /stt/turns/websocket, including your desired model and input audio format as query parameters:

- wss://api.openai.com/v1/realtime?intent=transcription
+ wss://api.cartesia.ai/stt/turns/websocket?model=ink-2&encoding=pcm_s16le&sample_rate=24000

- Authorization: Bearer <OPENAI_API_KEY>
+ Authorization: Bearer <CARTESIA_API_KEY>
+ Cartesia-Version: 2026-03-01
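
If you connect without the SDK, it can help to assemble the query string in code rather than by hand. The following is a minimal sketch that uses only the endpoint and query parameter names that appear in this guide; the helper name `build_stt_url` is hypothetical and is not part of the Cartesia SDK.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "wss://api.cartesia.ai/stt/turns/websocket"


def build_stt_url(
    model: str = "ink-2",
    encoding: str = "pcm_s16le",
    sample_rate: int = 24000,
    cartesia_version: Optional[str] = None,
    access_token: Optional[str] = None,
) -> str:
    """Build the WebSocket URL (hypothetical helper, not an SDK function)."""
    params = {"model": model, "encoding": encoding, "sample_rate": sample_rate}
    # Optional query parameters, for clients that cannot set request headers.
    if cartesia_version is not None:
        params["cartesia_version"] = cartesia_version
    if access_token is not None:
        params["access_token"] = access_token
    return f"{BASE_URL}?{urlencode(params)}"


print(build_stt_url())
```

With the defaults, this prints the same URL as the diff above. Pass `cartesia_version` and `access_token` only when the client cannot send the `Cartesia-Version` and `Authorization` headers.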

In browsers, WebSockets do not support custom request headers. Pass the API version as the cartesia_version query parameter, and authenticate with a short-lived access token in the access_token query parameter rather than an API key.

Connect to the auto-finalization WebSocket with the Cartesia SDK:

import os
from cartesia import AsyncCartesia

client = AsyncCartesia(api_key=os.getenv("CARTESIA_API_KEY"))

async with client.stt.auto_finalize.websocket(
    model="ink-2", encoding="pcm_s16le", sample_rate=24000
) as connection:
    ...

import os
from cartesia import Cartesia

client = Cartesia(api_key=os.getenv("CARTESIA_API_KEY"))

with client.stt.auto_finalize.websocket(
    model="ink-2", encoding="pcm_s16le", sample_rate=24000
) as connection:
    ...

import Cartesia from "@cartesia/cartesia-js";

const client = new Cartesia({ apiKey: process.env.CARTESIA_API_KEY });

const connection = client.stt.autoFinalize.websocket({
  model: "ink-2",
  encoding: "pcm_s16le",
  sample_rate: 24000,
});

// Server-side: Generate access-tokens using your API key
import Cartesia from '@cartesia/cartesia-js';

const client = new Cartesia({ apiKey: process.env.CARTESIA_API_KEY });

export async function GET() {
  const { token } = await client.accessToken.create({
    grants: { stt: true, tts: false, agent: false },
    // How long the token lasts in seconds
    // Allowed values: 0–3600
    expires_in: 3600,
  });
  return Response.json({ token });
}


// Client-side
// 1. Fetch an access token from your server
// 2. Connect to Cartesia via WebSocket
import Cartesia from "@cartesia/cartesia-js";

async function getToken(): Promise<string> {
  const res = await fetch('/replace-with-your-server');
  const { token } = await res.json();
  return token;
}
const audioContext = new AudioContext();

const client = new Cartesia({ token: await getToken() });

const connection = client.stt.autoFinalize.websocket({
  model: "ink-2",
  encoding: "pcm_f32le",
  sample_rate: audioContext.sampleRate,
});

Session configuration

OpenAI configures the session in the session.update payload. Cartesia takes the equivalent settings as query parameters.

OpenAI session config	Cartesia Realtime STT (Auto)	Notes
`?intent=transcription`	—	Ink only supports transcription.
`audio.input.transcription.model` (`gpt-4o-transcribe`)	`model=ink-2` required	See Models for all options.
`audio.input.format` (`audio/pcm`, 24 kHz)	`encoding=pcm_s16le` + `sample_rate=24000` required	Cartesia supports many more input audio formats. See encoding for all options.
`audio.input.turn_detection` (`server_vad`)	—	See manual finalization to disable turn detection.
`audio.input.transcription.language`	—	`ink-2` only supports `en` right now. More languages are coming soon!
`audio.input.transcription.delay`	—	Not configurable.
`audio.input.noise_reduction`	—	Not required.
—	`cartesia_version=2026-03-01` required	See API Conventions for details.

encoding

OpenAI sets the input format under audio.input.format. Cartesia takes encoding and sample_rate as query parameters.

OpenAI `audio.input.format`	Cartesia `encoding`	Cartesia `sample_rate`
`{ "type": "audio/pcm", "rate": 24000 }`	`pcm_s16le`	`24000`
`g711_ulaw`	`pcm_mulaw`	`8000`
`g711_alaw`	`pcm_alaw`	`8000`

OpenAI’s PCM format is 16-bit, 24 kHz, mono. Cartesia accepts that sample_rate directly, so you can stream the same audio without resampling. Cartesia also accepts pcm_s32le, pcm_f16le, and pcm_f32le.
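
If your existing code stores the OpenAI input format as configuration, the table above can be turned into a small lookup. This is only a sketch: the function name `to_cartesia_params` is hypothetical, and it covers just the three formats listed in the table, rejecting anything else.

```python
from typing import Dict, Tuple, Union

# (encoding, sample_rate) pairs copied from the table above.
_G711_FORMATS: Dict[str, Tuple[str, int]] = {
    "g711_ulaw": ("pcm_mulaw", 8000),
    "g711_alaw": ("pcm_alaw", 8000),
}


def to_cartesia_params(
    openai_format: Union[str, Dict[str, object]],
) -> Tuple[str, int]:
    """Map an OpenAI audio.input.format value to (encoding, sample_rate)."""
    if isinstance(openai_format, str):
        if openai_format in _G711_FORMATS:
            return _G711_FORMATS[openai_format]
    elif (
        openai_format.get("type") == "audio/pcm"
        and openai_format.get("rate") == 24000
    ):
        return ("pcm_s16le", 24000)
    # Anything outside the table is rejected rather than guessed.
    raise ValueError(f"No known Cartesia equivalent for {openai_format!r}")


encoding, sample_rate = to_cartesia_params({"type": "audio/pcm", "rate": 24000})
print(encoding, sample_rate)
```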

Sending audio

OpenAI wraps each audio chunk in a JSON-formatted text frame and base64-encodes the audio bytes.
Cartesia accepts audio chunks as binary frames, so you send the raw audio bytes directly:

- { "type": "input_audio_buffer.append", "audio": "<base64 PCM>" }
+ <raw PCM bytes>

There’s no equivalent for OpenAI’s session.update message; to change parameters, open a new WebSocket connection. To commit all audio and close the session, send a JSON-formatted text frame:

{ "type": "close" }

Cartesia will transcribe all buffered audio, then close the socket for you.

If you currently commit audio mid-session with OpenAI using input_audio_buffer.commit, consider using Cartesia with manual finalization instead. See the migration guides page for details.

Sending audio with the SDK

# raw_audio (bytes) - Raw audio data, about 100 ms at a time
await connection.send_raw(raw_audio)

# raw_audio (bytes) - Raw audio data, about 100 ms at a time
connection.send_raw(raw_audio)

// @param {ArrayBufferLike} rawAudio - raw audio data, about 100 ms at a time
connection.sendRaw(rawAudio);
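
The comments above suggest sending about 100 ms of audio at a time. If your audio arrives as one large buffer, a sketch like the following can split it into chunks of that size. It assumes mono audio and 2 bytes per sample (pcm_s16le); `iter_chunks` is a hypothetical helper, not an SDK function.

```python
from typing import Iterator


def iter_chunks(
    pcm: bytes,
    sample_rate: int = 24000,
    bytes_per_sample: int = 2,  # pcm_s16le; assumes mono audio
    chunk_ms: int = 100,
) -> Iterator[bytes]:
    """Yield successive chunks of roughly chunk_ms of raw audio."""
    # Compute whole samples first so every chunk ends on a sample boundary.
    samples_per_chunk = sample_rate * chunk_ms // 1000
    chunk_bytes = samples_per_chunk * bytes_per_sample
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


# 1.05 seconds of silence at 24 kHz, 16-bit mono.
audio = bytes(50400)
sizes = [len(chunk) for chunk in iter_chunks(audio)]
print(len(sizes), sizes[0], sizes[-1])
```

At 24 kHz and 16 bits per sample, 100 ms is 4800 bytes, so the example yields ten full chunks and one final 2400-byte chunk. Each chunk can then be passed to `connection.send_raw(chunk)`.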

Decoding base64 encoded audio before sending

from base64 import b64decode

await connection.send_raw(b64decode(audio_base_64))

from base64 import b64decode

connection.send_raw(b64decode(audio_base_64))

connection.sendRaw(Uint8Array.fromBase64(audioBase64));

Closing

# Commit buffered audio
# and let the server close the socket once done
await connection.send({"type": "close"})

# Close the socket early (optional)
connection.close()

# Commit buffered audio
# and let the server close the socket once done
connection.send({"type": "close"})

# Close the socket early (optional)
connection.close()

// Commit buffered audio
// and let the server close the socket once done
connection.send({ type: "close" });

// Close the socket early (optional)
connection.close()

Event mapping

OpenAI signals turns with input_audio_buffer.speech_started / speech_stopped / committed, then bursts transcript deltas and a completed event per turn. Cartesia folds the same information into a turn lifecycle: turn.start, turn.update, turn.eager_end, turn.resume, and turn.end. See Turn Detection for the full state machine.

OpenAI `type`	Cartesia `type`	Notes
`session.created` / `session.updated`	`connected`	Cartesia has no session-config round-trip. You do not need to wait before sending audio.
`input_audio_buffer.speech_started`	`turn.start`	The user began speaking. Carries no transcript.
`conversation.item.input_audio_transcription.delta`	`turn.update`	OpenAI bursts deltas after the turn commits; Cartesia’s `turn.update` streams during the turn.
`input_audio_buffer.speech_stopped` / `committed`	`turn.end`	The user stopped speaking and the turn committed.
`conversation.item.input_audio_transcription.completed`	`turn.end`	Final transcript for the turn.
—	`turn.eager_end`	The model predicts the user might be done speaking. Okay to ignore.
—	`turn.resume`	The user kept talking; ignore the last `turn.eager_end`.
`error`	`error`	Client or server errors.
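
When porting existing handlers, it may be convenient to keep the table above as data, so you can check which OpenAI handlers each Cartesia event replaces. The sketch below is a hypothetical lookup built only from the table; it says nothing about payload shapes. In particular, `turn.update` transcripts are cumulative within a turn, so a handler written for OpenAI deltas will likely need adjusting.

```python
from typing import Dict, List

# Cartesia event type -> OpenAI event types it replaces (from the table above).
OPENAI_EQUIVALENTS: Dict[str, List[str]] = {
    "connected": ["session.created", "session.updated"],
    "turn.start": ["input_audio_buffer.speech_started"],
    "turn.update": ["conversation.item.input_audio_transcription.delta"],
    "turn.end": [
        "input_audio_buffer.speech_stopped",
        "input_audio_buffer.committed",
        "conversation.item.input_audio_transcription.completed",
    ],
    "turn.eager_end": [],  # no OpenAI equivalent; okay to ignore
    "turn.resume": [],     # no OpenAI equivalent
    "error": ["error"],
}


def openai_equivalents(cartesia_type: str) -> List[str]:
    """Return the OpenAI event types replaced by a Cartesia event type."""
    return OPENAI_EQUIVALENTS.get(cartesia_type, [])


print(openai_equivalents("turn.end"))
```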

Completed transcripts

An OpenAI conversation.item.input_audio_transcription.completed event:

{
  "type": "conversation.item.input_audio_transcription.completed",
  "item_id": "item_003",
  "content_index": 0,
  "transcript": "Hello world!"
}

Becomes a Cartesia turn.end event:

{
  "type": "turn.end",
  "transcript": "Hello world!",
  "request_id": "33cacee6-1936-4949-a05b-ecc9f2393248"
}

turn.start and turn.resume events do not carry a transcript.

import asyncio
from cartesia.types.stt import STTAutoFinalizeWebsocketResponse

full_transcript = ""

async def receive() -> None:
    global full_transcript
    async for event in connection:
        if event.type == "turn.start":
            print("speech_started")
        elif event.type == "turn.update":
            # cumulative within a turn
            print(f"Transcript so far: {event.transcript}")
        elif event.type == "turn.end":
            # Do not strip or add spaces!
            full_transcript += event.transcript
            print(f"speech_stopped: {event.transcript}")
        elif event.type == "error":
            print(f"error: {event.message}")

# Run receive() concurrently with your audio sender:
#   await asyncio.gather(send_audio(), receive())

from cartesia.types.stt import STTAutoFinalizeWebsocketResponse

full_transcript = ""

for event in connection:
    if event.type == "turn.start":
        print("speech_started")
    elif event.type == "turn.update":
        # cumulative within a turn
        print(f"Transcript so far: {event.transcript}")
    elif event.type == "turn.end":
        # Do not strip or add spaces!
        full_transcript += event.transcript
        print(f"speech_stopped: {event.transcript}")
    elif event.type == "error":
        print(f"error: {event.message}")

import Cartesia from '@cartesia/cartesia-js';

let fullTranscript = '';

for await (const event of connection.stream()) {
  if (event.type === 'message') {
    const m = event.message;
    switch (m.type) {
      case 'turn.start':
        console.log('speech_started');
        break;
      case 'turn.update':
        // cumulative within a turn
        console.log(`Transcript so far: ${m.transcript}`);
        break;
      case 'turn.end':
        // Do not trim or add spaces!
        fullTranscript += m.transcript;
        console.log(`speech_stopped: ${m.transcript}`);
        break;
    }
  } else if (event.type === 'error') {
    console.error(`error: ${event.error.message}`);
  }
}

Example Server Messages

OpenAI batches each turn. Ink streams within the turn.

OpenAI gpt-4o-transcribe (server VAD)	Cartesia Realtime STT (Auto)
session.updated	connected
speech_started	turn.start
—	turn.update `"OpenAI batches"`
—	turn.update `"OpenAI batches each turn."`
—	turn.eager_end `"OpenAI batches each turn."`
speech_stopped + committed	—
…transcription.delta `"OpenAI batches each turn."` (burst after commit)	—
…transcription.completed `"OpenAI batches each turn."`	turn.end `"OpenAI batches each turn."`
speech_started	turn.start
—	turn.update `"Ink streams"`
—	turn.eager_end `"Ink streams"`
—	turn.resume
—	turn.update `"Ink streams within the turn."`
—	turn.eager_end `"Ink streams within the turn."`
speech_stopped + committed	—
…transcription.delta `"Ink streams within the turn."` (burst after commit)	—
…transcription.completed `"Ink streams within the turn."`	turn.end `"Ink streams within the turn."`
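
To make the second turn in the table concrete, here is a minimal sketch that replays those Cartesia events through a small tracker. The events are written as plain dicts with the `type` and `transcript` fields shown in this guide; the `TurnTracker` class is hypothetical and is not part of the SDK.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TurnTracker:
    partial: str = ""                      # latest cumulative transcript in the turn
    eager_candidate: Optional[str] = None  # transcript from the last turn.eager_end
    finals: List[str] = field(default_factory=list)  # one entry per turn.end

    def handle(self, event: dict) -> None:
        kind = event.get("type")
        if kind == "turn.start":
            self.partial = ""
            self.eager_candidate = None
        elif kind == "turn.update":
            # Cumulative within a turn, so replace rather than append.
            self.partial = event.get("transcript", "")
        elif kind == "turn.eager_end":
            self.eager_candidate = event.get("transcript", "")
        elif kind == "turn.resume":
            # The user kept talking: discard the last eager end.
            self.eager_candidate = None
        elif kind == "turn.end":
            # Stored as-is, without stripping or adding spaces.
            self.finals.append(event.get("transcript", ""))
            self.partial = ""
            self.eager_candidate = None


events = [
    {"type": "turn.start"},
    {"type": "turn.update", "transcript": "Ink streams"},
    {"type": "turn.eager_end", "transcript": "Ink streams"},
    {"type": "turn.resume"},
    {"type": "turn.update", "transcript": "Ink streams within the turn."},
    {"type": "turn.eager_end", "transcript": "Ink streams within the turn."},
    {"type": "turn.end", "transcript": "Ink streams within the turn."},
]

tracker = TurnTracker()
for event in events:
    tracker.handle(event)

print(tracker.finals)  # ['Ink streams within the turn.']
```

Code that acts early on `turn.eager_end` would read `eager_candidate` and abandon that work when `turn.resume` clears it. Code that ignores eager ends can rely on `finals` alone.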

References

API Reference

Full Code Example
