Overview
Realtime STT (Auto)
Most new voice agents should start with Realtime STT (Auto) /stt/turns/websocket to take advantage of built-in turn detection.
A user turn is one stretch of user speech that your app treats as a single response point.
We refer to our /stt/turns/websocket endpoint as “Realtime STT (Auto)” since user turns are automatically finalized by our model.
Realtime STT (Manual)
Cartesia also supports Realtime STT (Manual) /stt/websocket for stacks that already manage VAD themselves and want tight control over when transcripts are emitted.
Send "finalize" whenever the user stops speaking.
Voice activity detection (VAD) detects speech versus non-speech in audio.
We refer to our /stt/websocket endpoint as “Realtime STT (Manual)” since user turns are manually finalized by your own VAD.
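As a rough illustration of the manual flow, the sketch below forwards audio and sends "finalize" once your own VAD has reported silence for a few consecutive frames. Everything except the "finalize" message is an assumption made for this example: `send` is a hypothetical stand-in for your WebSocket client's send call, `hangover_frames` is an arbitrary threshold, and the exact wire format of the finalize message should be checked against the API reference.

```python
from typing import Callable, Iterable, Tuple


def run_manual_turns(
    frames: Iterable[Tuple[bytes, bool]],
    send: Callable[[object], None],
    hangover_frames: int = 3,
) -> None:
    """Forward audio and send "finalize" after the user stops speaking.

    frames yields (audio_chunk, is_speech) pairs, where is_speech is the
    decision of your own VAD. send is a hypothetical callback standing in
    for the WebSocket send call (an assumption, not the real client API).
    """
    in_turn = False   # True once speech has been heard in the current turn
    silent = 0        # consecutive non-speech frames since the last speech
    for chunk, is_speech in frames:
        send(chunk)  # audio keeps flowing regardless of the VAD decision
        if is_speech:
            in_turn = True
            silent = 0
        elif in_turn:
            silent += 1
            if silent >= hangover_frames:
                # The user has been quiet long enough: close the turn.
                send("finalize")
                in_turn = False
                silent = 0
```

A short hangover keeps brief pauses inside a sentence from ending the turn early. A longer one delays the finalize message and therefore the final transcript.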
Batch STT
Use Batch STT /stt to transcribe pre-recorded audio in a single request.
Batch STT accepts the entire recording in a single request, whereas the realtime endpoints accept at most one second of audio per second of wall-clock time, i.e. audio needs to be sent “in real time”.
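To make the real-time constraint concrete, here is a minimal sketch of pacing a pre-recorded buffer so that it is sent no faster than it would be spoken. The audio format (16 kHz, 16-bit mono PCM), the 100 ms chunk size, and the `send` callback are assumptions chosen for illustration and are not taken from the API.

```python
import time
from typing import Callable, Iterator


def chunk_pcm(
    audio: bytes,
    sample_rate: int = 16000,   # assumed sample rate in Hz
    bytes_per_sample: int = 2,  # assumed 16-bit mono PCM
    chunk_ms: int = 100,
) -> Iterator[bytes]:
    """Split raw PCM audio into chunks of chunk_ms milliseconds each."""
    chunk_bytes = sample_rate * bytes_per_sample * chunk_ms // 1000
    for start in range(0, len(audio), chunk_bytes):
        yield audio[start:start + chunk_bytes]


def stream_in_real_time(
    audio: bytes,
    send: Callable[[bytes], None],
    chunk_ms: int = 100,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Send one chunk, then wait for that chunk's duration before the next.

    send is a hypothetical stand-in for the WebSocket send call.
    """
    for chunk in chunk_pcm(audio, chunk_ms=chunk_ms):
        send(chunk)
        sleep(chunk_ms / 1000)
```

With these assumed settings, one second of audio is 32,000 bytes and goes out as ten 3,200-byte chunks over roughly one second. Audio captured live from a microphone is already paced this way, so the sleep only matters when replaying a file through a realtime endpoint.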
Comparison
| | /stt/turns/websocket (auto) | /stt/websocket (manual) | /stt (batch) |
|---|---|---|---|
| Transport | WebSocket | WebSocket | HTTP file upload |
| Best for | Natural back-and-forth voice agents | Explicit turn control | Pre-recorded files and offline jobs |
| Supported models | ink-2 only | All | ink-whisper only; ink-2 coming soon |
| Who handles VAD? | Cartesia | Your app | N/A |
| Who decides when a user turn is complete? | Cartesia | Your app | N/A |
| Do you send finalize? | No | Yes. This is crucial to ensure low latency | No |
| Audio input | Chunked stream | Chunked stream | Complete file |
| What comes back? | Turn events with complete user turn transcripts | Transcript deltas as they become available | One complete transcript |
Ink 2 only supports English right now.
We expect to add more languages in the coming months.
How to decide
If you are building a voice agent, start with Realtime STT (Auto) /stt/turns/websocket.
If your app already knows exactly when to start and stop transcription, or you want tight control over when transcripts are emitted, use Realtime STT (Manual) /stt/websocket and send "finalize" whenever the user stops speaking.
If you are transcribing audio that is already fully recorded, use Batch STT /stt.
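The guidance above can be condensed into a small helper. This is only a sketch: the function name and its two boolean parameters are hypothetical, and the returned strings are the endpoint paths named on this page.

```python
def choose_stt_endpoint(audio_is_prerecorded: bool, app_manages_turns: bool) -> str:
    """Return the endpoint path suggested by the decision guide.

    Both parameters are illustrative simplifications of the questions
    asked in the guide, not part of any API.
    """
    if audio_is_prerecorded:
        # Fully recorded audio goes to Batch STT in a single request.
        return "/stt"
    if app_manages_turns:
        # Your app runs its own VAD and sends "finalize" itself.
        return "/stt/websocket"
    # Default for voice agents: built-in turn detection.
    return "/stt/turns/websocket"
```

For example, a voice agent with no VAD of its own would call `choose_stt_endpoint(False, False)` and be pointed at Realtime STT (Auto).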
Where to go next
Understand turn detection
See how user turn events work in voice agents
Troubleshoot model behavior
Transcription errors, high latency, server errors
Check out some code examples
Simple scripts using each API endpoint