This guide covers migrating from OpenAI Realtime Transcription when used with turn_detection: null.

This guide contains both bare API descriptions and SDK code. To install the SDK:
pip install cartesia
If you’re already using the Cartesia SDK, upgrade to version >=3.2.0.

Connection

Replace the OpenAI WebSocket URL and auth header with Cartesia’s /stt/websocket, including your desired model and input audio format as query parameters:
- wss://api.openai.com/v1/realtime?intent=transcription
+ wss://api.cartesia.ai/stt/websocket?model=ink-2&encoding=pcm_s16le&sample_rate=24000
- Authorization: Bearer <OPENAI_API_KEY>
+ Authorization: Bearer <CARTESIA_API_KEY>
+ Cartesia-Version: 2026-03-01
The browser WebSocket API cannot set custom request headers. In browsers, pass the API version as the cartesia_version query parameter, and authenticate with a short-lived access token in the access_token query parameter instead of an API key.
Connect to the manual-finalization WebSocket with the Cartesia SDK:
import os
from cartesia import AsyncCartesia

client = AsyncCartesia(api_key=os.getenv("CARTESIA_API_KEY"))

async with client.stt.manual_finalize.websocket(
    model="ink-2", encoding="pcm_s16le", sample_rate=24000
) as connection:
    ...
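
For clients that cannot set headers, the query-parameter form of the connection URL can be assembled with the standard library. This is a minimal sketch, not part of the Cartesia SDK: `build_stt_url` is a hypothetical helper, and `example-token` is a placeholder for a short-lived access token obtained elsewhere.

```python
from typing import Optional
from urllib.parse import urlencode

STT_WEBSOCKET_URL = "wss://api.cartesia.ai/stt/websocket"


def build_stt_url(
    model: str,
    encoding: str,
    sample_rate: int,
    cartesia_version: Optional[str] = None,
    access_token: Optional[str] = None,
) -> str:
    """Hypothetical helper: assemble the STT WebSocket URL from query parameters."""
    query = {"model": model, "encoding": encoding, "sample_rate": sample_rate}
    # Clients that cannot send request headers (such as browsers) put the
    # API version and the access token in the query string instead.
    if cartesia_version is not None:
        query["cartesia_version"] = cartesia_version
    if access_token is not None:
        query["access_token"] = access_token
    return f"{STT_WEBSOCKET_URL}?{urlencode(query)}"


# "example-token" is a placeholder, not a real credential.
url = build_stt_url(
    "ink-2",
    "pcm_s16le",
    24000,
    cartesia_version="2026-03-01",
    access_token="example-token",
)
print(url)
```

Using `urlencode` rather than string concatenation keeps the token and any future parameter values correctly percent-encoded.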

Session configuration

OpenAI configures the session in the session.update payload. Cartesia takes the equivalent settings as query parameters.
| OpenAI session config | Cartesia Realtime STT (Manual) | Notes |
| --- | --- | --- |
| ?intent=transcription | (none) | Ink only supports transcription. |
| audio.input.transcription.model | model=ink-2 (required) | gpt-realtime-whisper and gpt-4o-transcribe both map to ink-2. |
| audio.input.format (audio/pcm, 24 kHz) | encoding=pcm_s16le + sample_rate=24000 (required) | Cartesia supports many more input audio formats. See encoding for all options. |
| audio.input.transcription.language | language | ink-2 only supports en right now. Use ink-whisper for other languages. |
| audio.input.turn_detection (null) | (none) | See auto finalization for server-side turn detection. |
| audio.input.transcription.delay | (none) | Not configurable. |
| audio.input.noise_reduction | (none) | Not required. |
| (none) | cartesia_version=2026-03-01 (required) | See API Conventions for details. |
OpenAI sets the input format under audio.input.format. Cartesia takes encoding and sample_rate as query parameters.
| OpenAI audio.input.format | Cartesia encoding | Cartesia sample_rate |
| --- | --- | --- |
| { "type": "audio/pcm", "rate": 24000 } | pcm_s16le | 24000 |
| g711_ulaw | pcm_mulaw | 8000 |
| g711_alaw | pcm_alaw | 8000 |
OpenAI’s PCM format is 16-bit, 24 kHz, mono. Cartesia accepts that sample_rate directly, so you can stream the same audio without resampling. Cartesia also accepts pcm_s32le, pcm_f16le, and pcm_f32le.
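
The format mapping above translates naturally into a lookup. The sketch below is illustrative only: `CARTESIA_FORMATS` and `cartesia_query_params` are hypothetical names, and the keys are simply the OpenAI format identifiers as they appear in the table.

```python
# Hypothetical lookup built from the format table:
# OpenAI input format identifier -> (Cartesia encoding, Cartesia sample_rate).
CARTESIA_FORMATS = {
    "audio/pcm": ("pcm_s16le", 24000),
    "g711_ulaw": ("pcm_mulaw", 8000),
    "g711_alaw": ("pcm_alaw", 8000),
}


def cartesia_query_params(openai_format: str) -> dict:
    """Return the encoding and sample_rate query parameters for an OpenAI format."""
    try:
        encoding, sample_rate = CARTESIA_FORMATS[openai_format]
    except KeyError:
        # Fail loudly rather than guess a format that is not in the table.
        raise ValueError(f"no mapping for OpenAI format {openai_format!r}") from None
    return {"encoding": encoding, "sample_rate": sample_rate}


print(cartesia_query_params("audio/pcm"))
```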

Sending audio

OpenAI wraps each audio chunk in a JSON-formatted text frame and base64-encodes the audio bytes.
Cartesia accepts audio chunks as binary frames, so send the raw audio bytes directly:
- { "type": "input_audio_buffer.append", "audio": "<base64 PCM>" }
+ <raw PCM bytes>
There’s no equivalent for OpenAI’s session.update message; open a new WebSocket connection to change parameters.
Cartesia’s control commands are bare text frames, not JSON. To commit buffered audio and emit a transcript, send a finalize frame in place of input_audio_buffer.commit:
finalize
It is important to send the finalize command at the right times in the audio stream. Consider using auto finalization if you don’t know when your user is done speaking.
To transcribe all remaining audio and close the session, send a close frame:
close
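
If existing code already produces OpenAI-style client messages, a thin adapter can translate them into the frames described above. This is a sketch under assumptions: `translate_client_message` is a hypothetical function, and how the returned frame is actually sent depends on your WebSocket library.

```python
import base64
import json
from typing import Union


def translate_client_message(openai_message: str) -> Union[bytes, str]:
    """Hypothetical adapter: turn an OpenAI client message into a Cartesia frame.

    Returns bytes for a binary frame (audio) or str for a text frame (command).
    """
    message = json.loads(openai_message)
    kind = message.get("type")
    if kind == "input_audio_buffer.append":
        # Cartesia takes the raw bytes, so undo OpenAI's base64 wrapping.
        return base64.b64decode(message["audio"])
    if kind == "input_audio_buffer.commit":
        # Control commands are bare text frames, not JSON.
        return "finalize"
    # For example session.update: parameters are fixed per connection.
    raise ValueError(f"no Cartesia equivalent for {kind!r}")


pcm = b"\x00\x01\x02\x03"
append = json.dumps(
    {"type": "input_audio_buffer.append", "audio": base64.b64encode(pcm).decode("ascii")}
)
commit = json.dumps({"type": "input_audio_buffer.commit"})

print(translate_client_message(append))  # raw bytes for a binary frame
print(translate_client_message(commit))  # text frame
```

Returning `bytes` for audio and `str` for commands mirrors the binary/text frame split, so the caller can pick the frame type from the Python type alone.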

Sending audio with the SDK

# raw_audio (bytes) - Raw audio data, about 100 ms at a time
await connection.send_raw(raw_audio)
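
To get chunks of about 100 ms from a larger buffer, the chunk size follows from the sample rate and sample width. The sketch below assumes mono pcm_s16le audio (2 bytes per sample); `chunk_pcm` is a hypothetical helper, not an SDK function.

```python
from typing import Iterator


def chunk_pcm(
    audio: bytes,
    sample_rate: int = 24000,
    bytes_per_sample: int = 2,  # pcm_s16le is 16-bit
    chunk_ms: int = 100,
) -> Iterator[bytes]:
    """Yield consecutive chunks of mono PCM, each about chunk_ms long."""
    # 24000 samples/s * 2 bytes * 100 ms / 1000 = 4800 bytes per chunk.
    chunk_size = sample_rate * bytes_per_sample * chunk_ms // 1000
    for start in range(0, len(audio), chunk_size):
        yield audio[start:start + chunk_size]


one_second = bytes(24000 * 2)  # one second of silence
chunks = list(chunk_pcm(one_second))
print(len(chunks), len(chunks[0]))  # prints: 10 4800
```

Each chunk can then be passed to `connection.send_raw(chunk)` as in the snippet above. The final chunk may be shorter than the others when the buffer length is not a multiple of the chunk size.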

Decoding base64 encoded audio before sending

from base64 import b64decode

await connection.send_raw(b64decode(audio_base_64))

Finalizing and closing

# Commit input audio
await connection.send("finalize")

# Transcribe remaining audio, then close the socket
await connection.send("close")

Event mapping

OpenAI streams conversation.item.input_audio_transcription.delta events and a completed event per committed turn.
Cartesia emits transcript deltas plus acknowledgments for the finalize and close commands.
| OpenAI type | Cartesia type | Notes |
| --- | --- | --- |
| session.created / session.updated | (none) | Cartesia has no session-config round-trip. Just start sending audio. |
| conversation.item.input_audio_transcription.delta | transcript | Ink 2 and Whisper only send is_final: true. See the row below. |
| conversation.item.input_audio_transcription.completed | transcript (is_final: true) | OpenAI sends the full committed transcript; Cartesia streams deltas. |
| input_audio_buffer.committed | flush_done | Acknowledgment that the buffer was processed after a commit / finalize. |
| (none) | done | Acknowledgment for close. Sent immediately before the WebSocket closes. |
| error | error | Client or server errors. |

Completed transcripts

An OpenAI conversation.item.input_audio_transcription.completed event carries the full turn:
{
  "type": "conversation.item.input_audio_transcription.completed",
  "item_id": "item_003",
  "content_index": 0,
  "transcript": "Hello world! This is the full transcript."
}
With Cartesia, this becomes one or more transcript events, each carrying a delta:
{
  "type": "transcript",
  "is_final": true,
  "text": "Hello world!",
  "duration": 0.5,
  "words": [
    {
      "word": "Hello",
      "start": 0,
      "end": 0.2
    },
    {
      "word": " world!",
      "start": 0.2,
      "end": 0.5
    }
  ],
  "request_id": "2ff8af53-4d38-479d-8287-58940f01c701"
}
  • Ink 2 does not return duration or words yet
  • Ink 2 and Whisper currently only emit final transcripts (is_final: true)
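
Because Ink 2 omits duration and words, a client that reads raw JSON frames may want to treat those fields as optional. The following is a minimal sketch based only on the event shape shown above; `TranscriptEvent` and `parse_transcript` are hypothetical names, not SDK types.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TranscriptEvent:
    text: str
    is_final: bool
    duration: Optional[float] = None  # absent for Ink 2
    words: List[dict] = field(default_factory=list)  # absent for Ink 2


def parse_transcript(frame: str) -> Optional[TranscriptEvent]:
    """Return a TranscriptEvent for transcript frames, None for other types."""
    payload = json.loads(frame)
    if payload.get("type") != "transcript":
        return None
    return TranscriptEvent(
        text=payload["text"],
        is_final=payload["is_final"],
        duration=payload.get("duration"),
        words=payload.get("words") or [],
    )


frame = '{"type": "transcript", "is_final": true, "text": "Hello world!"}'
print(parse_transcript(frame))
```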
Cartesia’s final transcripts are deltas; concatenate them without stripping or adding whitespace.
import asyncio
from cartesia.types.stt import STTManualFinalizeWebsocketResponse

committed_transcript = ""

async def receive() -> None:
    global committed_transcript
    async for event in connection:
        if event.type == "transcript":
            if event.is_final:
                # Do not strip or add whitespace!
                committed_transcript += event.text
        elif event.type == "flush_done" or event.type == "done":
            print(f"Transcript: {committed_transcript}")
            committed_transcript = ""
        elif event.type == "error":
            print(f"error: {event.message}")

# Run receive() concurrently with your audio sender:
#   await asyncio.gather(send_audio(), receive())

Example Server Messages

GPT sends full transcripts. Ink sends deltas and may break words.
| OpenAI gpt-realtime-whisper | Cartesia Realtime STT (Manual) |
| --- | --- |
| …transcription.delta "GPT sends" | is_final: true "GPT sends" |
| …transcription.delta " full transcripts." | is_final: true " full transc" |
| commit (client) | finalize (client) |
| input_audio_buffer.committed | is_final: true "ripts." |
| …transcription.completed "GPT sends full transcripts." | flush_done |
| …transcription.delta "Ink sends deltas" | is_final: true " Ink sends" |
| …transcription.delta " and may break words." | is_final: true " deltas and may break wor" |
| commit (client) | finalize (client) |
| input_audio_buffer.committed | is_final: true "ds." |
| …transcription.completed "Ink sends deltas and may break words." | flush_done |
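
Replaying the Cartesia column of this example through a small accumulator shows how the deltas recombine. This is a sketch that models events as plain dicts with the type, is_final, and text fields shown earlier; `committed_transcripts` is a hypothetical helper.

```python
from typing import Iterable, List


def committed_transcripts(events: Iterable[dict]) -> List[str]:
    """Join final transcript deltas, emitting one string per flush_done/done."""
    committed = []
    buffer = ""
    for event in events:
        if event["type"] == "transcript" and event.get("is_final"):
            buffer += event["text"]  # no stripping, no added whitespace
        elif event["type"] in ("flush_done", "done"):
            committed.append(buffer)
            buffer = ""
    return committed


events = [
    {"type": "transcript", "is_final": True, "text": "GPT sends"},
    {"type": "transcript", "is_final": True, "text": " full transc"},
    {"type": "transcript", "is_final": True, "text": "ripts."},
    {"type": "flush_done"},
    {"type": "transcript", "is_final": True, "text": " Ink sends"},
    {"type": "transcript", "is_final": True, "text": " deltas and may break wor"},
    {"type": "transcript", "is_final": True, "text": "ds."},
    {"type": "flush_done"},
]
print(committed_transcripts(events))
# ['GPT sends full transcripts.', ' Ink sends deltas and may break words.']
```

The second transcript keeps its leading space because the deltas are joined verbatim. Stripping each delta would also glue together words that were split across events, such as "transc" and "ripts.".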

References

API Reference: Cartesia Realtime STT (Manual)
Full Code Example: Using the Cartesia SDK