Ink 2 is our newest realtime speech-to-text (STT) model for voice agents. It adds built-in turn detection, so most new integrations should start with /stt/turns/websocket. Cartesia also supports /stt/websocket for stacks that already manage VAD themselves or want full control over when transcripts are emitted.

Two terms are used throughout this page. Voice activity detection (VAD) distinguishes speech from non-speech in audio. A user turn is one stretch of user speech that your app treats as a single response point.

API endpoint explanation

| | /stt/turns/websocket | /stt/websocket |
| --- | --- | --- |
| Recommended starting point | Best default for new voice agents | Use when you already manage VAD or push-to-talk |
| Who handles VAD? | Cartesia | Your app |
| Who decides when a user turn is complete? | Cartesia | Your app |
| Do you send finalize? | No | Yes |
| What comes back? | Turn events plus final transcripts | Transcript deltas as they become available |
| Best for | Natural back-and-forth voice agents | Existing audio pipelines and explicit control |

If you are building a voice agent and do not already have strong reasons to run your own VAD, start with /stt/turns/websocket. If your app already knows exactly when to start and stop transcription, or you want tight control over when transcripts are emitted, use /stt/websocket.
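
The decision rule above can be written down as a small helper. This is a minimal sketch for illustration only, not part of any Cartesia SDK: the `SttEndpoint` class, the `choose_stt_endpoint` function, and its two flags are hypothetical names, and only the two endpoint paths and the who-does-what attributes come from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SttEndpoint:
    """Hypothetical record mirroring the comparison on this page."""
    path: str
    vad_owner: str               # who runs voice activity detection
    turn_owner: str              # who decides a user turn is complete
    client_sends_finalize: bool  # whether your app sends finalize


# Values taken from the comparison above; the constant names are assumptions.
TURNS = SttEndpoint("/stt/turns/websocket", "Cartesia", "Cartesia", False)
MANUAL = SttEndpoint("/stt/websocket", "your app", "your app", True)


def choose_stt_endpoint(
    manages_own_vad: bool = False,
    needs_explicit_emit_control: bool = False,
) -> SttEndpoint:
    """Return the endpoint suggested by the guidance on this page.

    Default to the turn-detection endpoint; fall back to the plain
    websocket only when the app already manages VAD (or push-to-talk)
    or wants tight control over when transcripts are emitted.
    """
    if manages_own_vad or needs_explicit_emit_control:
        return MANUAL
    return TURNS


if __name__ == "__main__":
    # A new voice agent with no VAD of its own.
    print(choose_stt_endpoint().path)
    # An existing pipeline that already runs its own VAD.
    print(choose_stt_endpoint(manages_own_vad=True).path)
```

Keeping the choice in one place makes it easy to see that switching to /stt/websocket also means your app takes over VAD, turn boundaries, and sending finalize.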

Where to go next

Use Cartesia's STT with turn detection: Cartesia manages VAD and turn taking.

Understand turn detection: See how user turn events work in voice agents.

Transcribe a file: Use file upload when you do not need realtime streaming.