Compare Endpoints

Overview

Realtime STT (Auto)

Most new voice agents should start with Realtime STT (Auto) /stt/turns/websocket to take advantage of built-in turn detection.

A user turn is one stretch of user speech that your app treats as a single response point.
We refer to our /stt/turns/websocket endpoint as “Realtime STT (Auto)” since user turns are automatically finalized by our model.

Realtime STT (Manual)

Cartesia also supports Realtime STT (Manual) /stt/websocket for stacks that already manage VAD themselves and want tight control over when transcripts are emitted. Send "finalize" whenever the user stops speaking.

Voice activity detection (VAD) detects speech versus non-speech in audio.
We refer to our /stt/websocket endpoint as “Realtime STT (Manual)” since user turns are manually finalized by your own VAD.
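To make the manual flow concrete, here is a minimal sketch of how an app-side VAD might decide when to send "finalize". It is a naive energy threshold with a short hangover, not a production VAD and not anything Cartesia ships. The function names, the threshold, the hangover length, and the 16-bit little-endian mono PCM frame format are all assumptions made for illustration.

```python
import math
import struct
from typing import Iterable, Iterator


def rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a 16-bit little-endian mono PCM frame."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def finalize_points(
    frames: Iterable[bytes],
    threshold: float = 500.0,   # assumed energy threshold, tune for your audio
    hangover_frames: int = 3,   # assumed number of silent frames that ends a turn
) -> Iterator[int]:
    """Yield the index of each frame after which the app would send "finalize"."""
    in_speech = False
    silent_run = 0
    for index, frame in enumerate(frames):
        if rms(frame) >= threshold:
            # Speech detected: (re)start the turn and reset the silence counter.
            in_speech = True
            silent_run = 0
        elif in_speech:
            # Silence while inside a turn: count it toward the hangover.
            silent_run += 1
            if silent_run >= hangover_frames:
                yield index
                in_speech = False
                silent_run = 0
```

In a real client, each yielded index marks the moment to send "finalize" on the open `/stt/websocket` connection. The hangover keeps short pauses inside a sentence from ending the turn too early.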

Batch STT

Use Batch STT /stt to transcribe pre-recorded audio in a single request.

Batch STT accepts the entire recording in a single request. The realtime endpoints, by contrast, accept at most one second of audio per second of wall-clock time, so audio must be sent “in real time”.
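If you feed pre-recorded audio into a realtime endpoint (for example, while testing), the real-time constraint means you need to pace the chunks yourself. The sketch below shows one way to do that. It assumes 16-bit mono PCM at 16 kHz and 100 ms chunks; those values and the helper names are illustrative assumptions, not documented requirements.

```python
import time
from typing import Callable, Iterator

BYTES_PER_SAMPLE = 2  # assumption: 16-bit mono PCM


def chunk_size_bytes(sample_rate_hz: int, chunk_ms: int) -> int:
    """Number of bytes that hold chunk_ms milliseconds of audio."""
    return sample_rate_hz * chunk_ms // 1000 * BYTES_PER_SAMPLE


def paced_chunks(
    audio: bytes,
    sample_rate_hz: int = 16000,  # assumed sample rate
    chunk_ms: int = 100,          # assumed chunk duration
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[bytes]:
    """Yield chunks of audio, then wait for the duration of the chunk just sent."""
    size = chunk_size_bytes(sample_rate_hz, chunk_ms)
    bytes_per_second = sample_rate_hz * BYTES_PER_SAMPLE
    for offset in range(0, len(audio), size):
        chunk = audio[offset : offset + size]
        yield chunk
        # Sleeping for the chunk's own duration keeps the send rate at
        # roughly one second of audio per second of wall-clock time.
        sleep(len(chunk) / bytes_per_second)
```

A caller would send each yielded chunk over the WebSocket. The `sleep` parameter is injectable so the pacing can be checked without actually waiting.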

Comparison

| | `/stt/turns/websocket` (auto) | `/stt/websocket` (manual) | `/stt` (batch) |
| --- | --- | --- | --- |
| Transport | WebSocket | WebSocket | HTTP file upload |
| Best for | Natural back-and-forth voice agents | Explicit turn control | Pre-recorded files and offline jobs |
| Supported models | `ink-2` only | All | `ink-whisper` only |
| Who handles VAD? | Cartesia | Your app | N/A |
| Who decides when a user turn is complete? | Cartesia | Your app | N/A |
| Do you send `finalize`? | No | Yes. This is crucial to ensure low latency | No |
| Audio input | Chunked stream | Chunked stream | Complete file |
| What comes back? | Turn events with complete user turn transcripts | Transcript deltas as they become available | One complete transcript |

Ink 2 only supports English right now.
We expect to add more languages in the coming months.
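The "Supported models" row of the comparison can be encoded as a small lookup so that a client rejects an unsupported endpoint and model pairing before opening a connection. This is only a sketch: the table contents come from the comparison above, while the `SUPPORTED_MODELS` mapping and the `is_model_supported` helper are hypothetical names, not part of any Cartesia SDK.

```python
from typing import Dict, Optional, Set

# Mirrors the "Supported models" row of the comparison.
# None stands for "All" (the manual endpoint accepts every model).
SUPPORTED_MODELS: Dict[str, Optional[Set[str]]] = {
    "/stt/turns/websocket": {"ink-2"},
    "/stt/websocket": None,
    "/stt": {"ink-whisper"},
}


def is_model_supported(endpoint: str, model: str) -> bool:
    """Return True if the comparison lists the model as usable on the endpoint."""
    if endpoint not in SUPPORTED_MODELS:
        raise ValueError(f"unknown endpoint: {endpoint}")
    allowed = SUPPORTED_MODELS[endpoint]
    return allowed is None or model in allowed
```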

How to decide

If you are building a voice agent, start with Realtime STT (Auto) /stt/turns/websocket.
If your app already knows exactly when to start and stop transcription, or you want tight control over when transcripts are emitted, use Realtime STT (Manual) /stt/websocket and send "finalize" whenever the user stops speaking.
If you are transcribing audio that is already fully recorded, use Batch STT /stt.
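The decision rule above can be written as a small function. This is a sketch of the guidance only: the `Endpoint` enum and the two boolean parameters are hypothetical names chosen for illustration, and the endpoint paths are the ones listed in this page.

```python
from enum import Enum


class Endpoint(Enum):
    AUTO = "/stt/turns/websocket"   # Realtime STT (Auto)
    MANUAL = "/stt/websocket"       # Realtime STT (Manual)
    BATCH = "/stt"                  # Batch STT


def choose_endpoint(audio_is_prerecorded: bool, app_controls_turns: bool) -> Endpoint:
    """Pick an STT endpoint following the guidance in "How to decide"."""
    if audio_is_prerecorded:
        # Fully recorded audio goes to Batch STT in a single request.
        return Endpoint.BATCH
    if app_controls_turns:
        # The app runs its own VAD and sends "finalize" itself.
        return Endpoint.MANUAL
    # Default for voice agents: let the model finalize user turns.
    return Endpoint.AUTO
```

For a new voice agent with no VAD of its own, `choose_endpoint(False, False)` returns `Endpoint.AUTO`.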

Where to go next

Understand turn detection

See how user turn events work in voice agents

Troubleshoot model behavior

Transcription errors, high latency, server errors

Check out some code examples

Simple scripts using each API endpoint
