A bidirectional WebSocket connection for real-time speech transcription with native turn detection. It is the recommended endpoint for building voice agents.
This API is organized around user turns (a person starts talking, then stops talking) rather than transcript segments. The model itself signals when a user turn begins and ends, so your agent reacts to events instead of running its own voice activity detection.
All emitted text is final: this API sends only high-accuracy transcripts. Later events append to the transcript and never modify text sent by earlier events.
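As a rough illustration of the turn-based, append-only model described above, the sketch below accumulates transcript text per user turn. The event names (`turn_start`, `text`, `turn_end`), the `text` field, and the `TurnTracker` class are hypothetical placeholders rather than the actual message schema, and joining pieces with a single space is also an assumption. Consult the message reference for the real event types.

```python
from dataclasses import dataclass, field


@dataclass
class TurnTracker:
    """Accumulates append-only transcript text for each user turn.

    Event names and fields are hypothetical, not the real schema.
    """

    in_turn: bool = False
    current: list = field(default_factory=list)    # pieces of the open turn
    completed: list = field(default_factory=list)  # one string per finished turn

    def handle(self, event: dict) -> None:
        kind = event.get("type")
        if kind == "turn_start":  # hypothetical: user started talking
            self.in_turn = True
            self.current = []
        elif kind == "text":  # hypothetical: a final piece of transcript
            # Emitted text is final, so we only append and never
            # rewrite pieces received earlier.
            self.current.append(event["text"].strip())
        elif kind == "turn_end":  # hypothetical: user stopped talking
            self.completed.append(" ".join(self.current))
            self.in_turn = False
            self.current = []


# Simulated event stream standing in for messages read from the WebSocket.
events = [
    {"type": "turn_start"},
    {"type": "text", "text": "Book a table"},
    {"type": "text", "text": "for two."},
    {"type": "turn_end"},
]

tracker = TurnTracker()
for ev in events:
    tracker.handle(ev)
```

Because earlier text is never revised, the client needs no logic for replacing interim results. An agent could, for example, hand each entry in `completed` to its response logic as soon as the end-of-turn event arrives.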
For WebSocket connection limits, see the concurrency limits and timeouts page.