turn_detection: server_vad.
Other ways to migrate and best practices for Cartesia Speech-to-Text
If you’re already using the Cartesia SDK, upgrade to version >=3.2.0.
Ink 2 only supports English right now.
We expect to add more languages in the coming months.
Connection
Replace the OpenAI WebSocket URL and auth header with Cartesia’s /stt/turns/websocket, including your desired model and input audio format as query parameters:
cartesia_version query param and use a short-lived access token using the access_token query param instead of an API key.
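As a rough illustration of the query-parameter style described above, the following sketch assembles a connection URL with Python's standard library. The path and parameter names come from this guide; the host `api.cartesia.ai` and the helper name `build_stt_url` are assumptions made for the example, so substitute the endpoint from the API reference.

```python
from urllib.parse import urlencode

# Assumption: this guide does not state the host; replace with the real endpoint.
CARTESIA_HOST = "api.cartesia.ai"


def build_stt_url(model="ink-2", encoding="pcm_s16le", sample_rate=24000,
                  cartesia_version="2026-03-01", access_token=None):
    """Return a WebSocket URL carrying the session settings as query parameters.

    build_stt_url is a hypothetical helper, not part of the Cartesia SDK.
    """
    params = {
        "model": model,
        "encoding": encoding,
        "sample_rate": sample_rate,
        "cartesia_version": cartesia_version,
    }
    # Only attach a short-lived access token when one is supplied.
    if access_token is not None:
        params["access_token"] = access_token
    return f"wss://{CARTESIA_HOST}/stt/turns/websocket?{urlencode(params)}"


url = build_stt_url()
```

Because every setting lives in the URL, changing any of them means building a new URL and opening a new connection.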
Connect to the auto-finalization WebSocket with the Cartesia SDK:
Session configuration
OpenAI configures the session in the session.update payload. Cartesia takes the equivalent settings as query parameters.
| OpenAI session config | Cartesia Realtime STT (Auto) | Notes |
|---|---|---|
| ?intent=transcription | — | Ink only supports transcription. |
| audio.input.transcription.model (gpt-4o-transcribe) | model=ink-2 required | See Models for all options. |
| audio.input.format (audio/pcm, 24 kHz) | encoding=pcm_s16le + sample_rate=24000 required | Cartesia supports many more input audio formats. See encoding for all options. |
| audio.input.turn_detection (server_vad) | — | See manual finalization to disable turn detection. |
| audio.input.transcription.language | — | ink-2 only supports en right now. More languages are coming soon! |
| audio.input.transcription.delay | — | Not configurable. |
| audio.input.noise_reduction | — | Not required. |
| include: ["item.input_audio_transcription.logprobs"] | — | Coming soon! |
| — | cartesia_version=2026-03-01 required | See API Conventions for details. |
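To make the mapping concrete, here is a minimal sketch that translates an OpenAI-style session config into the Cartesia query parameters listed in this guide. The function name `to_cartesia_params` is hypothetical, and the nested dictionary shape is inferred from the dotted paths in the table (for example `audio.input.format`), so treat both as assumptions rather than a published schema.

```python
# G.711 mappings as given in this guide's encoding section.
G711_ENCODINGS = {
    "g711_ulaw": ("pcm_mulaw", 8000),
    "g711_alaw": ("pcm_alaw", 8000),
}


def to_cartesia_params(openai_session: dict) -> dict:
    """Hypothetical helper: map an OpenAI session config to Cartesia query params."""
    audio_in = openai_session.get("audio", {}).get("input", {})

    # ink-2 only supports English, so reject any other requested language.
    language = audio_in.get("transcription", {}).get("language")
    if language not in (None, "en"):
        raise ValueError(f"ink-2 only supports 'en', got {language!r}")

    fmt = audio_in.get("format", {"type": "audio/pcm", "rate": 24000})
    if isinstance(fmt, dict) and fmt.get("type") == "audio/pcm":
        encoding, sample_rate = "pcm_s16le", fmt.get("rate", 24000)
    elif isinstance(fmt, str) and fmt in G711_ENCODINGS:
        encoding, sample_rate = G711_ENCODINGS[fmt]
    else:
        raise ValueError(f"no mapping in this sketch for format {fmt!r}")

    # turn_detection, delay and noise_reduction have no Cartesia equivalent
    # in the table, so they are dropped.
    return {
        "model": "ink-2",
        "encoding": encoding,
        "sample_rate": sample_rate,
        "cartesia_version": "2026-03-01",
    }


example_session = {
    "audio": {
        "input": {
            "format": {"type": "audio/pcm", "rate": 24000},
            "transcription": {"model": "gpt-4o-transcribe", "language": "en"},
            "turn_detection": {"type": "server_vad"},
        }
    }
}
params = to_cartesia_params(example_session)
```

Settings marked with a dash in the table are dropped on purpose, since the table lists no Cartesia counterpart for them.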
encoding
OpenAI sets the input format under audio.input.format. Cartesia takes encoding and sample_rate as query parameters.

| OpenAI audio.input.format | Cartesia encoding | Cartesia sample_rate |
|---|---|---|
| { "type": "audio/pcm", "rate": 24000 } | pcm_s16le | 24000 |
| g711_ulaw | pcm_mulaw | 8000 |
| g711_alaw | pcm_alaw | 8000 |

OpenAI’s PCM format is 16-bit, 24 kHz, mono. Cartesia accepts that sample_rate directly, so you can stream the same audio without resampling. Cartesia also accepts pcm_s32le, pcm_f16le, and pcm_f32le.
Sending audio
OpenAI wraps each audio chunk in a JSON-formatted text frame and base64-encodes the audio bytes.
Cartesia accepts audio chunks as binary frames, so you send the raw audio bytes directly:
Cartesia has no session.update message; open a new WebSocket connection to change parameters.
To commit all audio and close the session, send a JSON-formatted text frame:
Sending audio with the SDK
Decoding base64 encoded audio before sending
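If an existing pipeline already produces OpenAI-style `input_audio_buffer.append` text frames, one way to adapt it is to unwrap each frame back into raw bytes before sending. The sketch below shows only that decoding step with the standard library; `openai_frame_to_bytes` is a hypothetical helper name, and the actual send call depends on your WebSocket client.

```python
import base64
import json


def openai_frame_to_bytes(text_frame: str) -> bytes:
    """Extract raw audio bytes from an OpenAI input_audio_buffer.append frame."""
    event = json.loads(text_frame)
    if event.get("type") != "input_audio_buffer.append":
        raise ValueError(f"not an audio frame: {event.get('type')!r}")
    # OpenAI carries the chunk as a base64 string in the "audio" field.
    return base64.b64decode(event["audio"])


# Round trip: wrap four PCM bytes the OpenAI way, then unwrap them.
pcm_chunk = b"\x00\x01\x02\x03"
frame = json.dumps({
    "type": "input_audio_buffer.append",
    "audio": base64.b64encode(pcm_chunk).decode("ascii"),
})
raw = openai_frame_to_bytes(frame)
```

The resulting `raw` bytes are what would go out as a binary WebSocket frame.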
Closing
Event mapping
OpenAI signals turns with input_audio_buffer.speech_started / speech_stopped / committed, then bursts transcript deltas and a completed event per turn.
Cartesia folds the same information into a turn lifecycle: turn.start, turn.update, turn.eager_end, turn.resume, and turn.end. See Turn Detection for the full state machine.
| OpenAI type | Cartesia type | Notes |
|---|---|---|
| session.created / session.updated | connected | Cartesia has no session-config round-trip. You do not need to wait before sending audio. |
| input_audio_buffer.speech_started | turn.start | The user began speaking. Carries no transcript. |
| conversation.item.input_audio_transcription.delta | turn.update | OpenAI bursts deltas after the turn commits; Cartesia’s turn.update streams during the turn. |
| input_audio_buffer.speech_stopped / committed | turn.end | The user stopped speaking and the turn committed. |
| conversation.item.input_audio_transcription.completed | turn.end | Final transcript for the turn. |
| — | turn.eager_end | The model predicts the user might be done speaking. Okay to ignore. |
| — | turn.resume | The user kept talking; ignore the last turn.eager_end. |
| error | error | Client or server errors. |
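The lifecycle in the table can be sketched as a small state tracker. This is an illustrative outline, not the SDK's handler: the class name `TurnTracker` is made up, and the `transcript` field name is an assumption, since this guide names the event types but does not show the event payloads.

```python
import json


class TurnTracker:
    """Hypothetical tracker for the turn.* events described in the table."""

    def __init__(self):
        self.partial = ""        # latest in-turn transcript
        self.maybe_done = False  # True between turn.eager_end and turn.resume
        self.finals = []         # committed transcripts, one per turn.end

    def handle(self, text_frame: str) -> str:
        event = json.loads(text_frame)
        kind = event.get("type")
        # Assumption: transcripts arrive in a "transcript" field.
        text = event.get("transcript", "")
        if kind == "turn.start":
            self.partial, self.maybe_done = "", False
        elif kind == "turn.update":
            self.partial = text
        elif kind == "turn.eager_end":
            # A prediction only; turn.resume may retract it.
            self.maybe_done = True
        elif kind == "turn.resume":
            self.maybe_done = False
        elif kind == "turn.end":
            self.finals.append(text)
            self.partial, self.maybe_done = "", False
        elif kind == "error":
            raise RuntimeError(f"server reported an error: {event}")
        return kind
```

A caller that wants OpenAI-like behaviour can ignore everything except `turn.end`; one that wants lower latency can start work on `turn.eager_end` and cancel it if `turn.resume` arrives.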
Completed transcripts
An OpenAI conversation.item.input_audio_transcription.completed event:
The equivalent Cartesia turn.end event:
turn.start and turn.resume events do not carry a transcript.
Example Server Messages
OpenAI batches each turn. Ink streams within the turn.
| OpenAI gpt-4o-transcribe (server VAD) | Cartesia Realtime STT (Auto) |
|---|---|
| session.updated | connected |
| speech_started | turn.start |
| — | turn.update "OpenAI batches" |
| — | turn.update "OpenAI batches each turn." |
| — | turn.eager_end "OpenAI batches each turn." |
| speech_stopped + committed | — |
| …transcription.delta "OpenAI batches each turn." (burst after commit) | — |
| …transcription.completed "OpenAI batches each turn." | turn.end "OpenAI batches each turn." |
| speech_started | turn.start |
| — | turn.update "Ink streams" |
| — | turn.eager_end "Ink streams" |
| — | turn.resume |
| — | turn.update "Ink streams within the turn." |
| — | turn.eager_end "Ink streams within the turn." |
| speech_stopped + committed | — |
| …transcription.delta "Ink streams within the turn." (burst after commit) | — |
| …transcription.completed "Ink streams within the turn." | turn.end "Ink streams within the turn." |
References
API Reference
Cartesia Realtime STT (Auto)
Full Code Example
Using the Cartesia SDK