Overview

Realtime STT (Auto)

Most new voice agents should start with Realtime STT (Auto) /stt/turns/websocket to take advantage of built-in turn detection.
A user turn is one stretch of user speech that your app treats as a single response point.
We refer to our /stt/turns/websocket endpoint as “Realtime STT (Auto)” since user turns are automatically finalized by our model.

Realtime STT (Manual)

Cartesia also supports Realtime STT (Manual) /stt/websocket for stacks that already manage VAD themselves and want tight control over when transcripts are emitted. Send "finalize" whenever the user stops speaking.
Voice activity detection (VAD) detects speech versus non-speech in audio.
We refer to our /stt/websocket endpoint as “Realtime STT (Manual)” since user turns are manually finalized by your own VAD.
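The manual flow above can be sketched roughly as follows. This is an illustration under stated assumptions, not Cartesia's client code. It assumes 16-bit little-endian mono PCM chunks, uses a crude energy threshold as a stand-in for a real VAD, and takes two hypothetical callables, send_audio and send_text, in place of an actual WebSocket connection. The threshold, chunk length, and silence window are arbitrary illustrative values, and the exact wire format of the "finalize" message should be checked against the API reference.

```python
import math
import struct
from typing import Callable, Iterable


def rms(chunk: bytes) -> float:
    """Root-mean-square level of a chunk of 16-bit little-endian mono PCM."""
    n = len(chunk) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", chunk[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


def stream_with_finalize(
    chunks: Iterable[bytes],
    send_audio: Callable[[bytes], None],  # hypothetical: sends one audio frame
    send_text: Callable[[str], None],     # hypothetical: sends one text message
    chunk_ms: int = 20,                   # assumed duration of each chunk
    silence_ms: int = 300,                # assumed quiet time that ends a turn
    threshold: float = 500.0,             # assumed energy level that counts as speech
) -> int:
    """Forward every chunk and send "finalize" once after each stretch of speech.

    Returns the number of "finalize" messages sent.
    """
    in_speech = False
    quiet_ms = 0
    finalized = 0
    for chunk in chunks:
        send_audio(chunk)
        if rms(chunk) >= threshold:
            # Speech detected: start (or continue) a turn and reset the quiet timer.
            in_speech = True
            quiet_ms = 0
        elif in_speech:
            # Quiet chunk inside a turn: accumulate silence until the window is reached.
            quiet_ms += chunk_ms
            if quiet_ms >= silence_ms:
                send_text("finalize")
                finalized += 1
                in_speech = False
                quiet_ms = 0
    return finalized
```

In a real stack the energy check would be replaced by whatever VAD your app already runs; the point is only that the app, not the endpoint, decides when the turn ends.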

Batch STT

Use Batch STT /stt to transcribe pre-recorded audio in a single request.
Batch STT accepts the entire recording in a single request, whereas the realtime endpoints can only accept one second of audio data per second of wall-clock time. In other words, realtime audio must be sent “in real time”.
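To make the realtime constraint concrete, here is a minimal pacing sketch. It assumes raw 16 kHz, 16-bit mono PCM (an assumption; use whatever encoding and sample rate your connection is actually configured for), and the helper names chunk_duration_s and paced_chunks are hypothetical. It yields chunks no faster than the audio they contain would take to play.

```python
import time
from typing import Callable, Iterator


def chunk_duration_s(
    num_bytes: int,
    sample_rate: int = 16000,   # assumed sample rate
    bytes_per_sample: int = 2,  # assumed 16-bit samples
    channels: int = 1,          # assumed mono
) -> float:
    """Seconds of audio represented by num_bytes of raw PCM."""
    return num_bytes / (sample_rate * bytes_per_sample * channels)


def paced_chunks(
    audio: bytes,
    chunk_bytes: int = 3200,  # 100 ms at the assumed format
    sample_rate: int = 16000,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[bytes]:
    """Yield chunks of audio no faster than real time."""
    start = clock()
    sent_s = 0.0  # seconds of audio already handed out
    for offset in range(0, len(audio), chunk_bytes):
        chunk = audio[offset : offset + chunk_bytes]
        # Wait until the wall clock has caught up with the audio sent so far.
        delay = sent_s - (clock() - start)
        if delay > 0:
            sleep(delay)
        yield chunk
        sent_s += chunk_duration_s(len(chunk), sample_rate)
```

A live microphone is already paced by the hardware, so this only matters when replaying a file through a realtime endpoint, for example in tests. For files that do not need a live response, Batch STT avoids the pacing entirely.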

Comparison

| | /stt/turns/websocket (auto) | /stt/websocket (manual) | /stt (batch) |
| --- | --- | --- | --- |
| Transport | WebSocket | WebSocket | HTTP file upload |
| Best for | Natural back-and-forth voice agents | Explicit turn control | Pre-recorded files and offline jobs |
| Supported models | ink-2 only | All | ink-whisper only; ink-2 coming soon |
| Who handles VAD? | Cartesia | Your app | N/A |
| Who decides when a user turn is complete? | Cartesia | Your app | N/A |
| Do you send finalize? | No | Yes. This is crucial to ensure low latency | No |
| Audio input | Chunked stream | Chunked stream | Complete file |
| What comes back? | Turn events with complete user turn transcripts | Transcript deltas as they become available | One complete transcript |

Ink 2 only supports English right now.
We expect to add more languages in the coming months.

How to decide

If you are building a voice agent, start with Realtime STT (Auto) /stt/turns/websocket. If your app already knows exactly when to start and stop transcription, or you want tight control over when transcripts are emitted, use Realtime STT (Manual) /stt/websocket and send "finalize" whenever the user stops speaking. If you are transcribing audio that is already fully recorded, use Batch STT /stt.
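The decision rule above can be written down as a small helper. This is only a restatement of the paragraph in code: the endpoint paths come from this page, while the Endpoint enum, the choose_endpoint function, and its two boolean parameters are hypothetical names introduced for illustration.

```python
from enum import Enum


class Endpoint(str, Enum):
    AUTO = "/stt/turns/websocket"  # Realtime STT (Auto)
    MANUAL = "/stt/websocket"      # Realtime STT (Manual)
    BATCH = "/stt"                 # Batch STT


def choose_endpoint(audio_is_prerecorded: bool, app_controls_turns: bool) -> Endpoint:
    """Pick an STT endpoint following the guidance in this section."""
    if audio_is_prerecorded:
        # Fully recorded audio goes to Batch STT in a single request.
        return Endpoint.BATCH
    if app_controls_turns:
        # The app runs its own VAD and sends "finalize" itself.
        return Endpoint.MANUAL
    # Default for voice agents: let the endpoint detect user turns.
    return Endpoint.AUTO
```

For example, choose_endpoint(audio_is_prerecorded=False, app_controls_turns=False) selects the Auto endpoint, which is the suggested starting point for a new voice agent.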

Where to go next

Understand turn detection: see how user turn events work in voice agents.

Troubleshoot model behavior: transcription errors, high latency, server errors.

Check out some code examples: simple scripts using each API endpoint.