turn_detection: null.
If you’re already using the Cartesia SDK, upgrade to version >=3.2.0.
Connection
Replace the OpenAI WebSocket URL and auth header with Cartesia’s /stt/websocket, including your desired model and input audio format as query parameters:
cartesia_version query param and use a short-lived access token using the access_token query param instead of an API key.
Connect to the manual-finalization WebSocket with the Cartesia SDK:
Session configuration
OpenAI configures the session in the session.update payload. Cartesia takes the equivalent settings as query parameters.
| OpenAI session config | Cartesia Realtime STT (Manual) | Notes |
|---|---|---|
| ?intent=transcription | — | Ink only supports transcription. |
| audio.input.transcription.model | model=ink-2 required | gpt-realtime-whisper and gpt-4o-transcribe both map to ink-2. |
| audio.input.format (audio/pcm, 24 kHz) | encoding=pcm_s16le + sample_rate=24000 required | Cartesia supports many more input audio formats. See encoding for all options. |
| audio.input.transcription.language | language | ink-2 only supports en right now. Use ink-whisper for other languages. |
| audio.input.turn_detection (null) | — | See auto finalization for server-side turn detection. |
| audio.input.transcription.delay | — | Not configurable. |
| audio.input.noise_reduction | — | Not required. |
| — | cartesia_version=2026-03-01 required | See API Conventions for details. |
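The mapping in the table above can be sketched as a small helper that turns an OpenAI-style session config into a Cartesia connection URL. This is a minimal sketch under stated assumptions: the host `wss://api.cartesia.ai` and the helper name `build_stt_url` are illustrative and not taken from this guide, the nested dictionary shape simply mirrors the dotted setting names in the table, and authentication is left out.

```python
from urllib.parse import urlencode

# Assumption: the host is illustrative; only the /stt/websocket path comes from this guide.
CARTESIA_STT_URL = "wss://api.cartesia.ai/stt/websocket"


def build_stt_url(openai_session: dict, cartesia_version: str = "2026-03-01") -> str:
    """Translate an OpenAI-style session config into Cartesia query parameters."""
    audio_input = openai_session.get("audio", {}).get("input", {})

    # This sketch handles only the 16-bit PCM case from the table.
    fmt = audio_input.get("format", {"type": "audio/pcm", "rate": 24000})
    if fmt.get("type") != "audio/pcm":
        raise ValueError(f"format not handled by this sketch: {fmt!r}")

    params = {
        # gpt-realtime-whisper and gpt-4o-transcribe both map to ink-2.
        "model": "ink-2",
        "encoding": "pcm_s16le",
        "sample_rate": fmt.get("rate", 24000),
        "cartesia_version": cartesia_version,
    }

    # language is optional; turn_detection, delay and noise_reduction have no equivalent.
    language = audio_input.get("transcription", {}).get("language")
    if language:
        params["language"] = language

    return f"{CARTESIA_STT_URL}?{urlencode(params)}"
```

Settings without a Cartesia counterpart are dropped silently here; a stricter migration shim might prefer to log them.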
encoding
OpenAI sets the input format under audio.input.format. Cartesia takes encoding and sample_rate as query parameters.

| OpenAI audio.input.format | Cartesia encoding | Cartesia sample_rate |
|---|---|---|
| { "type": "audio/pcm", "rate": 24000 } | pcm_s16le | 24000 |
| g711_ulaw | pcm_mulaw | 8000 |
| g711_alaw | pcm_alaw | 8000 |

OpenAI’s PCM format is 16-bit, 24 kHz, mono. Cartesia accepts that sample_rate directly, so you can stream the same audio without resampling. Cartesia also accepts pcm_s32le, pcm_f16le, and pcm_f32le.
Sending audio
OpenAI wraps each audio chunk in a JSON-formatted text frame and base64-encodes the audio bytes.
Cartesia accepts audio chunks as binary frames; send the raw audio bytes directly:
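If your existing pipeline already produces OpenAI-style `input_audio_buffer.append` frames, one way to bridge the two formats is to unwrap each frame back into raw bytes before sending. The sketch below assumes the OpenAI frame carries the base64 payload in an `audio` field; the function name `openai_append_to_binary` is hypothetical.

```python
import base64
import json


def openai_append_to_binary(text_frame: str) -> bytes:
    """Unwrap an OpenAI-style append event into the raw audio bytes it carries."""
    event = json.loads(text_frame)
    if event.get("type") != "input_audio_buffer.append":
        raise ValueError(f"not an audio append event: {event.get('type')!r}")
    # The decoded bytes can be sent to Cartesia unchanged as one binary frame.
    return base64.b64decode(event["audio"])
```

Where you control the audio source, skipping the base64 step entirely and sending the captured bytes as they arrive avoids the extra encode and decode.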
Cartesia has no session.update message; open a new WebSocket connection to change parameters.
Cartesia’s control commands are bare text frames, not JSON.
To commit buffered audio and emit a transcript, send a finalize frame in place of input_audio_buffer.commit:
To end the session, send a close frame:
Sending audio with the SDK
Decoding base64 encoded audio before sending
Finalizing and closing
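The frame sequence for one turn can be sketched independently of any particular WebSocket client. The example assumes the control commands are the literal text frames `finalize` and `close` named in this guide, and takes a hypothetical `send` callable standing in for your client's send method. Many Python WebSocket clients choose the frame type from the argument type (`bytes` becomes a binary frame, `str` a text frame), but check the client you use.

```python
from typing import Callable, Iterable, Union

Frame = Union[bytes, str]


def stream_turn(
    send: Callable[[Frame], None],
    chunks: Iterable[bytes],
    close: bool = False,
) -> None:
    """Send one turn: raw audio chunks, then finalize, then optionally close."""
    for chunk in chunks:
        send(chunk)  # binary frame: raw audio bytes, no JSON or base64 wrapper

    # Bare text frame, used in place of OpenAI's input_audio_buffer.commit.
    send("finalize")

    if close:
        # Bare text frame asking the server to end the session.
        send("close")
```

Passing `close=False` keeps the connection open so the next turn can reuse it.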
Event mapping
OpenAI streams conversation.item.input_audio_transcription.delta events and a completed event per committed turn.
Cartesia emits transcript deltas plus acknowledgments for the finalize and close commands.
| OpenAI type | Cartesia type | Notes |
|---|---|---|
| session.created / session.updated | — | Cartesia has no session-config round-trip. Just start sending audio. |
| conversation.item.input_audio_transcription.delta | transcript | Ink 2 and Whisper only send is_final: true. See the row below. |
| conversation.item.input_audio_transcription.completed | transcript (is_final: true) | OpenAI sends the full committed transcript; Cartesia streams deltas. |
| input_audio_buffer.committed | flush_done | Acknowledgment that the buffer was processed after a commit / finalize. |
| — | done | Acknowledgment for close. Sent immediately before the WebSocket closes. |
| error | error | Client or server errors. |
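For code that already dispatches on OpenAI event names, the table above can be expressed as a lookup. This is a sketch, not part of either API: it assumes each Cartesia message is a JSON object with a `type` field, and the names `CARTESIA_TO_OPENAI` and `openai_equivalent` are hypothetical.

```python
from typing import Optional

# Closest OpenAI event for each Cartesia message type, following the table above.
CARTESIA_TO_OPENAI = {
    "transcript": "conversation.item.input_audio_transcription.delta",
    "flush_done": "input_audio_buffer.committed",
    "error": "error",
}


def openai_equivalent(cartesia_message: dict) -> Optional[str]:
    """Return the matching OpenAI event name, or None when there is no counterpart."""
    # "done" has no OpenAI counterpart, so it falls through to None.
    return CARTESIA_TO_OPENAI.get(cartesia_message.get("type"))
```

A lookup like this only renames events; it does not rebuild OpenAI's completed event, which requires joining the deltas of a turn.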
Completed transcripts
An OpenAI conversation.item.input_audio_transcription.completed event carries the full turn:
Cartesia instead sends one or more transcript events, each carrying a delta:
- Ink 2 does not return duration or words yet
- Ink 2 and Whisper currently only emit final transcripts (is_final: true)
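A client that still needs one full transcript per turn can rebuild it by concatenating deltas until the `flush_done` acknowledgment arrives. The sketch below assumes each transcript message carries its delta in a `text` field, which this guide does not spell out, so adjust the field name to the actual payload. The class name `TurnAssembler` is hypothetical.

```python
from typing import List, Optional


class TurnAssembler:
    """Joins transcript deltas into one string per finalized turn."""

    def __init__(self) -> None:
        self._parts: List[str] = []

    def feed(self, message: dict) -> Optional[str]:
        """Consume one server message; return the full turn on flush_done."""
        kind = message.get("type")
        if kind == "transcript":
            # Assumption: the delta lives under "text". Deltas may split words,
            # so they are joined verbatim with no separator.
            self._parts.append(message.get("text", ""))
            return None
        if kind == "flush_done":
            full = "".join(self._parts)
            self._parts.clear()
            return full
        if kind == "error":
            raise RuntimeError(f"server error: {message!r}")
        return None  # e.g. "done", sent just before the socket closes
```

Because deltas are joined verbatim, a turn that follows an earlier one may start with a leading space; strip it if your application expects trimmed text.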
Example Server Messages
GPT sends full transcripts. Ink sends deltas and may break words.
| OpenAI gpt-realtime-whisper | Cartesia Realtime STT (Manual) |
|---|---|
| …transcription.delta "GPT sends" | is_final: true "GPT sends" |
| …transcription.delta " full transcripts." | is_final: true " full transc" |
| commit (client) | finalize (client) |
| input_audio_buffer.committed | is_final: true "ripts." |
| …transcription.completed "GPT sends full transcripts." | flush_done |
| …transcription.delta "Ink sends deltas" | is_final: true " Ink sends" |
| …transcription.delta " and may break words." | is_final: true " deltas and may break wor" |
| commit (client) | finalize (client) |
| input_audio_buffer.committed | is_final: true "ds." |
| …transcription.completed "Ink sends deltas and may break words." | flush_done |
References
API Reference
Cartesia Realtime STT (Manual)
Full Code Example
Using the Cartesia SDK