Migrating from OpenAI Realtime Transcription without Turn Detection

This guide covers migrating from OpenAI Realtime Transcription when used with turn_detection: null.


This guide contains both bare API descriptions and SDK code. To install the SDK:

pip install cartesia

npm i @cartesia/cartesia-js

If you’re already using the Cartesia SDK, upgrade to version 3.2.0 or later.

Connection

Replace the OpenAI WebSocket URL and auth header with Cartesia’s /stt/websocket, including your desired model and input audio format as query parameters:

- wss://api.openai.com/v1/realtime?intent=transcription
+ wss://api.cartesia.ai/stt/websocket?model=ink-2&encoding=pcm_s16le&sample_rate=24000

- Authorization: Bearer <OPENAI_API_KEY>
+ Authorization: Bearer <CARTESIA_API_KEY>
+ Cartesia-Version: 2026-03-01
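For a server-side client that does not use the SDK, the URL and headers above can be assembled as in the minimal sketch below. The helper names `build_stt_url` and `build_headers` are hypothetical, and the API key is a placeholder; only the endpoint, query parameters, and header names come from this guide.

```python
from urllib.parse import urlencode

# Endpoint taken from the diff above.
CARTESIA_STT_URL = "wss://api.cartesia.ai/stt/websocket"


def build_stt_url(model: str = "ink-2",
                  encoding: str = "pcm_s16le",
                  sample_rate: int = 24000) -> str:
    """Hypothetical helper: append the model and audio format as query parameters."""
    query = urlencode({
        "model": model,
        "encoding": encoding,
        "sample_rate": sample_rate,
    })
    return f"{CARTESIA_STT_URL}?{query}"


def build_headers(api_key: str, version: str = "2026-03-01") -> dict:
    """Hypothetical helper: the two request headers shown in the diff above."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Cartesia-Version": version,
    }


url = build_stt_url()
headers = build_headers("YOUR_CARTESIA_API_KEY")  # placeholder key
print(url)
```

Pass `url` and `headers` to whichever WebSocket client library you already use.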

In browsers, WebSockets do not support request headers. Instead, pass the API version in the cartesia_version query parameter, and authenticate with a short-lived access token in the access_token query parameter rather than an API key.

Connect to the manual-finalization WebSocket with the Cartesia SDK:

import os
from cartesia import AsyncCartesia

client = AsyncCartesia(api_key=os.getenv("CARTESIA_API_KEY"))

async with client.stt.manual_finalize.websocket(
    model="ink-2", encoding="pcm_s16le", sample_rate=24000
) as connection:
    ...

import os
from cartesia import Cartesia

client = Cartesia(api_key=os.getenv("CARTESIA_API_KEY"))

with client.stt.manual_finalize.websocket(
    model="ink-2", encoding="pcm_s16le", sample_rate=24000
) as connection:
    ...

import Cartesia from "@cartesia/cartesia-js";

const client = new Cartesia({ apiKey: process.env.CARTESIA_API_KEY });

const connection = client.stt.manualFinalize.websocket({
  model: "ink-2",
  encoding: "pcm_s16le",
  sample_rate: 24000,
});

// Server-side: Generate access-tokens using your API key
import Cartesia from '@cartesia/cartesia-js';

const client = new Cartesia({ apiKey: process.env.CARTESIA_API_KEY });

export async function GET() {
  const { token } = await client.accessToken.create({
    grants: { stt: true, tts: false, agent: false },
    // How long the token lasts in seconds
    // Allowed values: 0–3600
    expires_in: 3600,
  });
  return Response.json({ token });
}


// Client-side
// 1. Fetch an access token from your server
// 2. Connect to Cartesia via WebSocket
import Cartesia from "@cartesia/cartesia-js";

async function getToken(): Promise<string> {
  const res = await fetch('/replace-with-your-server');
  const { token } = await res.json();
  return token;
}
const audioContext = new AudioContext();

const client = new Cartesia({ token: await getToken() });

const connection = client.stt.manualFinalize.websocket({
  model: "ink-2",
  encoding: "pcm_f32le",
  sample_rate: audioContext.sampleRate,
});

Session configuration

OpenAI configures the session in the session.update payload. Cartesia takes the equivalent settings as query parameters.

OpenAI session config	Cartesia Realtime STT (Manual)	Notes
`?intent=transcription`	—	Ink only supports transcription.
`audio.input.transcription.model`	`model=ink-2` (required)	`gpt-realtime-whisper` and `gpt-4o-transcribe` both map to `ink-2`.
`audio.input.format` (`audio/pcm`, 24 kHz)	`encoding=pcm_s16le` + `sample_rate=24000` (required)	Cartesia supports many more input audio formats. See encoding for all options.
`audio.input.transcription.language`	`language`	`ink-2` only supports `en` right now. Use `ink-whisper` for other languages.
`audio.input.turn_detection` (`null`)	—	See auto finalization for server-side turn detection.
`audio.input.transcription.delay`	—	Not configurable.
`audio.input.noise_reduction`	—	Not required.
—	`cartesia_version=2026-03-01` (required)	See API Conventions for details.
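The table above can be read as a translation from an OpenAI session config to Cartesia query parameters. The sketch below is one way to write that translation. It assumes the OpenAI session is a nested dict whose keys follow the dotted paths in the table; `get_path` and `cartesia_params_from_openai` are hypothetical names, and only PCM input is handled here.

```python
from typing import Any, Dict, Optional


def get_path(data: Dict[str, Any], dotted: str) -> Optional[Any]:
    """Walk a nested dict along a dotted path; return None if any key is missing."""
    current: Any = data
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def cartesia_params_from_openai(session: Dict[str, Any]) -> Dict[str, str]:
    """Hypothetical helper: map an OpenAI session config to Cartesia query parameters."""
    # This guide only covers sessions with turn detection disabled.
    if get_path(session, "audio.input.turn_detection") is not None:
        raise ValueError("expected turn_detection: null")

    fmt = get_path(session, "audio.input.format")
    if fmt is None:
        fmt = {}
    if not isinstance(fmt, dict) or fmt.get("type", "audio/pcm") != "audio/pcm":
        raise ValueError("non-PCM input: choose encoding and sample_rate from the encoding table")

    params = {
        "model": "ink-2",  # both OpenAI transcription models map to ink-2
        "encoding": "pcm_s16le",
        "sample_rate": str(fmt.get("rate", 24000)),
        "cartesia_version": "2026-03-01",
    }

    language = get_path(session, "audio.input.transcription.language")
    if language:
        params["language"] = language
        if language != "en":
            # The table says ink-2 supports only en and points to ink-whisper otherwise.
            params["model"] = "ink-whisper"
    return params


session = {
    "audio": {
        "input": {
            "format": {"type": "audio/pcm", "rate": 24000},
            "transcription": {"model": "gpt-4o-transcribe", "language": "en"},
            "turn_detection": None,
        }
    }
}
params = cartesia_params_from_openai(session)
print(params)
```

Settings with no Cartesia equivalent (delay, noise reduction) are simply ignored by this sketch.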

encoding

OpenAI sets the input format under audio.input.format. Cartesia takes encoding and sample_rate as query parameters.

OpenAI `audio.input.format`	Cartesia `encoding`	Cartesia `sample_rate`
`{ "type": "audio/pcm", "rate": 24000 }`	`pcm_s16le`	`24000`
`g711_ulaw`	`pcm_mulaw`	`8000`
`g711_alaw`	`pcm_alaw`	`8000`

OpenAI’s PCM format is 16-bit, 24 kHz, mono. Cartesia accepts that sample_rate directly, so you can stream the same audio without resampling. Cartesia also accepts pcm_s32le, pcm_f16le, and pcm_f32le.
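The encoding table can be expressed as a small lookup. This is a sketch only: `cartesia_audio_params` is a hypothetical helper, and it assumes the OpenAI format arrives either as a dict with a `type` key or as a bare string, matching the two shapes shown in the table.

```python
from typing import Dict, Tuple, Union

# OpenAI input format -> (Cartesia encoding, Cartesia sample_rate), copied from the table above.
OPENAI_TO_CARTESIA: Dict[str, Tuple[str, int]] = {
    "audio/pcm": ("pcm_s16le", 24000),
    "g711_ulaw": ("pcm_mulaw", 8000),
    "g711_alaw": ("pcm_alaw", 8000),
}


def cartesia_audio_params(openai_format: Union[dict, str]) -> Tuple[str, int]:
    """Hypothetical helper: look up the Cartesia encoding and sample rate."""
    if isinstance(openai_format, dict):
        key = openai_format["type"]
    else:
        key = openai_format
    if key not in OPENAI_TO_CARTESIA:
        raise ValueError(f"no mapping for OpenAI format {key!r}")
    return OPENAI_TO_CARTESIA[key]


encoding, sample_rate = cartesia_audio_params({"type": "audio/pcm", "rate": 24000})
print(encoding, sample_rate)
```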

Sending audio

OpenAI wraps each audio chunk in a JSON-formatted text frame and base64-encodes the audio bytes.
Cartesia accepts audio chunks as binary frames, so send the raw audio bytes directly:

- { "type": "input_audio_buffer.append", "audio": "<base64 PCM>" }
+ <raw PCM bytes>

There’s no equivalent for OpenAI’s session.update message; reconnect a new WebSocket to change parameters. Cartesia’s control commands are bare text frames, not JSON. To commit buffered audio and emit a transcript, send a finalize frame in place of input_audio_buffer.commit:

finalize

It is important to send the finalize command at the right times in the audio stream. Consider using auto finalization if you don’t know when your user is done speaking.

To transcribe all remaining audio and close the session, send a close frame:

close
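If existing code already produces OpenAI-style client messages, a thin adapter can translate them into Cartesia frames. The sketch below assumes each outgoing message is available as a JSON string; `translate_client_message` is a hypothetical helper, not part of either SDK. A `bytes` result is meant to be sent as a binary frame and a `str` result as a text frame.

```python
import base64
import json
from typing import Optional, Union


def translate_client_message(raw_json: str) -> Optional[Union[bytes, str]]:
    """Hypothetical adapter: OpenAI client message -> Cartesia frame.

    Returns bytes for a binary audio frame, a str for a bare text command,
    or None when the message has no Cartesia equivalent.
    """
    message = json.loads(raw_json)
    kind = message.get("type")
    if kind == "input_audio_buffer.append":
        # Cartesia takes the raw audio bytes, so undo the base64 wrapping.
        return base64.b64decode(message["audio"])
    if kind == "input_audio_buffer.commit":
        return "finalize"
    # For example session.update: reconnect with new query parameters instead.
    return None


pcm = b"\x00\x01\x02\x03"
append = json.dumps({
    "type": "input_audio_buffer.append",
    "audio": base64.b64encode(pcm).decode("ascii"),
})
frame = translate_client_message(append)
command = translate_client_message(json.dumps({"type": "input_audio_buffer.commit"}))
print(frame, command)
```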

Sending audio with the SDK

# raw_audio (bytes) - Raw audio data, about 100 ms at a time
await connection.send_raw(raw_audio)

# raw_audio (bytes) - Raw audio data, about 100 ms at a time
connection.send_raw(raw_audio)

// @param {ArrayBufferLike} rawAudio - raw audio data, about 100 ms at a time
connection.sendRaw(rawAudio);
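The snippets above expect roughly 100 ms of audio per call. When the audio is already in memory as one buffer, it can be split as in the sketch below. `iter_chunks` is a hypothetical helper; it assumes mono audio and takes the bytes-per-sample value as a parameter (2 for pcm_s16le).

```python
from typing import Iterator


def iter_chunks(audio: bytes,
                sample_rate: int = 24000,
                bytes_per_sample: int = 2,
                chunk_ms: int = 100) -> Iterator[bytes]:
    """Hypothetical helper: yield successive chunks of about chunk_ms of mono audio."""
    samples_per_chunk = sample_rate * chunk_ms // 1000
    step = samples_per_chunk * bytes_per_sample
    for start in range(0, len(audio), step):
        # The last chunk may be shorter than the others.
        yield audio[start:start + step]


# One second of silence as 16-bit, 24 kHz, mono PCM.
one_second = bytes(24000 * 2)
chunks = list(iter_chunks(one_second))
print(len(chunks), len(chunks[0]))
```

Each yielded chunk can then be passed to `send_raw` as shown above.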

Decoding base64 encoded audio before sending

from base64 import b64decode

await connection.send_raw(b64decode(audio_base_64))

from base64 import b64decode

connection.send_raw(b64decode(audio_base_64))

connection.sendRaw(Uint8Array.fromBase64(audioBase64));

Finalizing and closing

# Commit input audio
await connection.send("finalize")

# Transcribe remaining audio, then close the socket
await connection.send("close")

# Commit input audio
connection.send("finalize")

# Transcribe remaining audio, then close the socket
connection.send("close")

// Commit input audio
connection.send("finalize");

// Transcribe remaining audio, then close the socket
connection.send("close");

Event mapping

OpenAI streams conversation.item.input_audio_transcription.delta events and a completed event per committed turn.
Cartesia emits transcript deltas plus acknowledgments for the finalize and close commands.

OpenAI `type`	Cartesia `type`	Notes
`session.created` / `session.updated`	—	Cartesia has no session-config round-trip. Just start sending audio.
`conversation.item.input_audio_transcription.delta`	`transcript`	Ink 2 and Whisper only send `is_final: true`. See the row below.
`conversation.item.input_audio_transcription.completed`	`transcript` (`is_final: true`)	OpenAI sends the full committed transcript; Cartesia streams deltas.
`input_audio_buffer.committed`	`flush_done`	Acknowledgment that the buffer was processed after a commit / `finalize`.
—	`done`	Acknowledgment for `close`. Sent immediately before the WebSocket closes.
`error`	`error`	Client or server errors.

Completed transcripts

An OpenAI conversation.item.input_audio_transcription.completed event carries the full turn:

{
  "type": "conversation.item.input_audio_transcription.completed",
  "item_id": "item_003",
  "content_index": 0,
  "transcript": "Hello world! This is the full transcript."
}

On Cartesia, this becomes one or more transcript events, each carrying a delta:

{
  "type": "transcript",
  "is_final": true,
  "text": "Hello world!",
  "duration": 0.5,
  "words": [
    {
      "word": "Hello",
      "start": 0,
      "end": 0.2
    },
    {
      "word": " world!",
      "start": 0.2,
      "end": 0.5
    }
  ],
  "request_id": "2ff8af53-4d38-479d-8287-58940f01c701"
}

Ink 2 does not return duration or words yet.

Ink 2 and Whisper currently emit only final transcripts (is_final: true).
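Because duration and words may be absent, a client that parses the raw JSON itself should treat them as optional. The sketch below assumes server messages arrive as JSON text shaped like the example above; the `Word` and `TranscriptEvent` dataclasses and `parse_transcript` are hypothetical and are not the SDK's types.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Word:
    word: str
    start: float
    end: float


@dataclass
class TranscriptEvent:
    text: str
    is_final: bool
    request_id: Optional[str] = None
    duration: Optional[float] = None               # may be missing
    words: List[Word] = field(default_factory=list)  # may be missing


def parse_transcript(frame: str) -> Optional[TranscriptEvent]:
    """Hypothetical parser: return a TranscriptEvent, or None for other message types."""
    message = json.loads(frame)
    if message.get("type") != "transcript":
        return None
    words = [
        Word(w["word"], w["start"], w["end"])
        for w in (message.get("words") or [])
    ]
    return TranscriptEvent(
        text=message.get("text", ""),  # keep whitespace exactly as received
        is_final=bool(message.get("is_final", False)),
        request_id=message.get("request_id"),
        duration=message.get("duration"),
        words=words,
    )


minimal = parse_transcript('{"type": "transcript", "is_final": true, "text": " world!"}')
print(minimal)
```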

Cartesia’s final transcripts are deltas; concatenate them without stripping or adding whitespace.

import asyncio
from cartesia.types.stt import STTManualFinalizeWebsocketResponse

committed_transcript = ""

async def receive() -> None:
    global committed_transcript
    async for event in connection:
        if event.type == "transcript":
            if event.is_final:
                # Do not strip or add whitespace!
                committed_transcript += event.text
        elif event.type == "flush_done" or event.type == "done":
            print(f"Transcript: {committed_transcript}")
            committed_transcript = ""
        elif event.type == "error":
            print(f"error: {event.message}")

# Run receive() concurrently with your audio sender:
#   await asyncio.gather(send_audio(), receive())

from cartesia.types.stt import STTManualFinalizeWebsocketResponse

committed_transcript = ""

for event in connection:
    if event.type == "transcript":
        if event.is_final:
            # Do not strip or add whitespace!
            committed_transcript += event.text
    elif event.type == "flush_done" or event.type == "done":
        print(f"Transcript: {committed_transcript}")
        committed_transcript = ""
    elif event.type == "error":
        print(f"error: {event.message}")

import Cartesia from '@cartesia/cartesia-js';

let committedTranscript = '';

for await (const event of connection.stream()) {
  if (event.type === 'message') {
    const m = event.message;
    switch (m.type) {
      case 'transcript':
        if (m.is_final) {
          // Do not trim or add whitespace!
          committedTranscript += m.text;
        }
        break;
      case 'flush_done':
      case 'done':
        console.log(`Transcript: ${committedTranscript}`);
        committedTranscript = '';
        break;
    }
  } else if (event.type === 'error') {
    console.error(`error: ${event.error.message}`);
  }
}

Example Server Messages

GPT sends full transcripts. Ink sends deltas and may break words.

OpenAI gpt-realtime-whisper	Cartesia Realtime STT (Manual)
…transcription.delta `"GPT sends"`	is_final: true `"GPT sends"`
…transcription.delta `" full transcripts."`	is_final: true `" full transc"`
commit (client)	finalize (client)
input_audio_buffer.committed	is_final: true `"ripts."`
…transcription.completed `"GPT sends full transcripts."`	flush_done
…transcription.delta `"Ink sends deltas"`	is_final: true `" Ink sends"`
…transcription.delta `" and may break words."`	is_final: true `" deltas and may break wor"`
commit (client)	finalize (client)
input_audio_buffer.committed	is_final: true `"ds."`
…transcription.completed `"Ink sends deltas and may break words."`	flush_done
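Downstream code that still expects one OpenAI-style completed event per turn can be fed from a small accumulator. The sketch below replays the Cartesia column of the table above. It assumes events are plain dicts (the SDK yields objects instead), and `TurnAccumulator` is a hypothetical class.

```python
from typing import Dict, List, Optional


class TurnAccumulator:
    """Hypothetical adapter: joins final deltas and emits one completed event per turn."""

    def __init__(self) -> None:
        self._parts: List[str] = []

    def feed(self, event: Dict) -> Optional[Dict]:
        kind = event.get("type")
        if kind == "transcript" and event.get("is_final"):
            # Keep the text untouched; deltas can split words mid-token.
            self._parts.append(event.get("text", ""))
            return None
        if kind in ("flush_done", "done"):
            transcript = "".join(self._parts)
            self._parts = []
            return {
                "type": "conversation.item.input_audio_transcription.completed",
                "transcript": transcript,
            }
        return None


events = [
    {"type": "transcript", "is_final": True, "text": "GPT sends"},
    {"type": "transcript", "is_final": True, "text": " full transc"},
    {"type": "transcript", "is_final": True, "text": "ripts."},
    {"type": "flush_done"},
    {"type": "transcript", "is_final": True, "text": " Ink sends"},
    {"type": "transcript", "is_final": True, "text": " deltas and may break wor"},
    {"type": "transcript", "is_final": True, "text": "ds."},
    {"type": "flush_done"},
]

accumulator = TurnAccumulator()
completed = []
for event in events:
    result = accumulator.feed(event)
    if result is not None:
        completed.append(result["transcript"])
print(completed)
```

Note that the second turn starts with a space, because the first delta after the previous finalize carried one. Whether to trim it at the turn boundary is an application decision.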
