Ink 2 is our newest realtime STT model for voice agents. It adds built-in turn detection, so most new integrations should start with /stt/turns/websocket.
Cartesia also supports /stt/websocket for stacks that already manage VAD themselves or want full control over when transcripts are emitted.
Voice activity detection (VAD) detects speech versus non-speech in audio. A user turn is one stretch of user speech that your app treats as a single response point.
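To make the two definitions above concrete, here is a toy sketch that labels audio frames as speech or non-speech by energy and then groups speech frames into turns. It is an illustration only: the function name `detect_turns`, the energy threshold, and the `max_gap` rule are all assumptions made for this example, and nothing here describes how Cartesia's turn detection works internally.

```python
from typing import List, Tuple


def detect_turns(
    frame_energies: List[float],
    threshold: float = 0.1,  # assumed cutoff: energy at or above this counts as speech
    max_gap: int = 2,        # assumed: silent frames tolerated inside one turn
) -> List[Tuple[int, int]]:
    """Group speech frames into turns as inclusive (start_frame, end_frame) pairs.

    A turn ends once more than `max_gap` consecutive non-speech frames follow
    its last speech frame.
    """
    turns: List[Tuple[int, int]] = []
    start = None        # index of the first speech frame in the current turn
    last_speech = None  # index of the most recent speech frame
    for i, energy in enumerate(frame_energies):
        if energy >= threshold:
            if start is None:
                start = i
            last_speech = i
        elif start is not None and i - last_speech > max_gap:
            # Silence has lasted long enough: close the current turn.
            turns.append((start, last_speech))
            start = None
    if start is not None:
        # Audio ended while a turn was still open.
        turns.append((start, last_speech))
    return turns


energies = [0.0, 0.5, 0.6, 0.0, 0.4, 0.0, 0.0, 0.0, 0.7, 0.0]
print(detect_turns(energies))  # [(1, 4), (8, 8)]
```

The short pause at frame 3 stays inside the first turn, while the longer silence at frames 5 to 7 closes it. Deciding how long a pause can be before the turn is over is the judgment that /stt/turns/websocket makes for you.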
API endpoint explanation
| | /stt/turns/websocket | /stt/websocket |
|---|---|---|
| Recommended starting point | Best default for new voice agents | Use when you already manage VAD or push-to-talk |
| Who handles VAD? | Cartesia | Your app |
| Who decides when a user turn is complete? | Cartesia | Your app |
| Do you send finalize? | No | Yes |
| What comes back? | Turn events plus final transcripts | Transcript deltas as they become available |
| Best for | Natural back-and-forth voice agents | Existing audio pipelines and explicit control |
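
The "Do you send finalize?" row can be sketched as the sequence of messages a client sends for one user turn. This is a minimal sketch under stated assumptions: `outgoing_messages` is a hypothetical helper, and representing finalize as the text message `"finalize"` is an assumption, so check the API reference for the actual wire format. No network connection is opened here.

```python
from typing import Iterator, List, Union

TURNS_ENDPOINT = "/stt/turns/websocket"
MANUAL_ENDPOINT = "/stt/websocket"


def outgoing_messages(
    endpoint: str, audio_chunks: List[bytes]
) -> Iterator[Union[bytes, str]]:
    """Yield the messages a client would send for one user turn (hypothetical helper)."""
    if endpoint not in (TURNS_ENDPOINT, MANUAL_ENDPOINT):
        raise ValueError(f"unknown endpoint: {endpoint}")
    # Both endpoints receive the raw audio as it is captured.
    for chunk in audio_chunks:
        yield chunk
    if endpoint == MANUAL_ENDPOINT:
        # Your app decides the turn is over and says so explicitly.
        # Assumption: finalize is sent as a plain text message.
        yield "finalize"


chunks = [b"\x00\x01", b"\x02\x03"]
print(list(outgoing_messages(TURNS_ENDPOINT, chunks)))
print(list(outgoing_messages(MANUAL_ENDPOINT, chunks)))
```

With the turns endpoint the client only streams audio, because Cartesia decides when the turn is complete. With /stt/websocket the extra finalize message is where your own VAD or push-to-talk logic plugs in.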
If you are building a new voice agent, start with /stt/turns/websocket.
If your app already knows exactly when to start and stop transcription, or you want tight control over when transcripts are emitted, use /stt/websocket.
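The guidance above reduces to a small decision rule. The sketch below is one way to encode it in application config code; `choose_stt_endpoint` and its parameter names are hypothetical and not part of any Cartesia SDK.

```python
def choose_stt_endpoint(
    manages_own_vad: bool = False,
    needs_explicit_transcript_control: bool = False,
) -> str:
    """Return the STT endpoint path suggested by the guidance above (hypothetical helper)."""
    if manages_own_vad or needs_explicit_transcript_control:
        # The app already knows when to start and stop transcription.
        return "/stt/websocket"
    # Default for new voice agents: Cartesia handles VAD and turn detection.
    return "/stt/turns/websocket"


print(choose_stt_endpoint())                      # /stt/turns/websocket
print(choose_stt_endpoint(manages_own_vad=True))  # /stt/websocket
```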
Where to go next
- Use Cartesia's STT with turn detection: Cartesia manages VAD and turn taking.
- Understand turn detection: See how user turn events work in voice agents.
- Transcribe a file: Use file upload when you do not need realtime streaming.