Compare STT Endpoints - Cartesia Docs

Ink 2 is our newest realtime speech-to-text (STT) model for voice agents. It adds built-in turn detection, so most new integrations should start with `/stt/turns/websocket`. Cartesia also supports `/stt/websocket` for stacks that already manage VAD themselves or want full control over when transcripts are emitted. Two terms are used throughout this page: voice activity detection (VAD) distinguishes speech from non-speech in audio, and a user turn is one stretch of user speech that your app treats as a single response point.

API endpoint explanation

	`/stt/turns/websocket`	`/stt/websocket`
Recommended starting point	Best default for new voice agents	Use when you already manage VAD or push-to-talk
Who handles VAD?	Cartesia	Your app
Who decides when a user turn is complete?	Cartesia	Your app
Do you send `finalize`?	No	Yes
What comes back?	Turn events plus final transcripts	Transcript deltas as they become available
Best for	Natural back-and-forth voice agents	Existing audio pipelines and explicit control
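To make the "Your app" column concrete, here is a minimal sketch of what deciding turn completion on the client side could look like when using `/stt/websocket`. It is an illustration under stated assumptions, not Cartesia's implementation: `ClientTurnTracker`, the `send_control` callback, the silence threshold, and the idea of passing the literal string `"finalize"` to that callback are all hypothetical stand-ins for whatever your own VAD and connection code provide.

```python
from typing import Callable


class ClientTurnTracker:
    """Tracks speech/silence frames and signals when a user turn has ended.

    Assumption: `send_control` is a hypothetical callback that forwards a
    control message over your already-open connection. The real wire format
    is defined by the API reference, not by this sketch.
    """

    def __init__(self, send_control: Callable[[str], None],
                 silence_frames_to_end_turn: int = 25) -> None:
        self.send_control = send_control
        # Hypothetical threshold: number of consecutive silent frames
        # after speech that your app treats as the end of a turn.
        self.silence_frames_to_end_turn = silence_frames_to_end_turn
        self._silent_run = 0
        self._in_turn = False

    def on_frame(self, is_speech: bool) -> None:
        """Feed one speech/non-speech decision per audio frame from your own VAD."""
        if is_speech:
            self._in_turn = True
            self._silent_run = 0
            return
        if not self._in_turn:
            return  # silence before any speech: nothing to finalize
        self._silent_run += 1
        if self._silent_run >= self.silence_frames_to_end_turn:
            # Your app, not Cartesia, has decided the turn is complete.
            self.send_control("finalize")
            self._in_turn = False
            self._silent_run = 0
```

With `/stt/turns/websocket` none of this bookkeeping is needed, because Cartesia handles both VAD and the turn-completion decision.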

If you are building a voice agent and do not already have strong reasons to run your own VAD, start with `/stt/turns/websocket`. If your app already knows exactly when to start and stop transcription, or you want tight control over when transcripts are emitted, use `/stt/websocket`.
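The guidance above can be written down as a small decision helper. This is only a sketch: the function name and its two boolean parameters are hypothetical, and the only values taken from this page are the two endpoint paths.

```python
TURNS_ENDPOINT = "/stt/turns/websocket"  # Cartesia handles VAD and turn detection
MANUAL_ENDPOINT = "/stt/websocket"       # your app handles VAD and sends finalize


def choose_stt_endpoint(manages_own_vad: bool,
                        needs_emit_control: bool = False) -> str:
    """Return the endpoint path suggested by the comparison on this page.

    manages_own_vad: your stack already runs VAD or push-to-talk.
    needs_emit_control: you want to decide exactly when transcripts are emitted.
    """
    if manages_own_vad or needs_emit_control:
        return MANUAL_ENDPOINT
    return TURNS_ENDPOINT


if __name__ == "__main__":
    # A new voice agent with no existing VAD gets the default recommendation.
    print(choose_stt_endpoint(manages_own_vad=False))
```

Only the path is returned; the host, authentication, and query parameters are left to the API reference.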

Where to go next

Use Cartesia's STT with turn detection

Cartesia manages VAD and turn taking

Understand turn detection

See how user turn events work in voice agents

Transcribe a file

Use file upload when you do not need realtime streaming
