Realtime Speech-to-Text (External VAD)
A bidirectional WebSocket connection for real-time speech transcription that works with external voice activity detection (VAD). It is the recommended endpoint for “push-to-talk” apps.
This API relies on the `finalize` command to trigger transcription. If you do not know when the user starts and stops speaking, consider the Realtime Speech-to-Text endpoint instead, which provides user turn detection.
Basic Usage:
1. Connect to the WebSocket with appropriate query parameters
2. Send audio in small chunks (e.g. 100ms) as WebSocket binary messages
3. Send `finalize` as a WebSocket text message when the user is done speaking
4. Receive transcripts as JSON-encoded WebSocket text messages (each message is a delta and is not cumulative)
5. Repeat steps 2-4
6. Send `close` as a WebSocket text message to finalize any buffered audio and close the session cleanly
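
The client side of these steps can be sketched as below. This is a minimal illustration under stated assumptions, not a reference client: the helper names (`chunk_pcm`, `send_turn`, `join_deltas`, `close_session`) are hypothetical, the audio is assumed to be raw 16 kHz, 16-bit mono PCM, and the JSON field holding each transcript delta is assumed to be `text`. The connection itself (endpoint URL and query parameters) is left out because this section does not list them; `ws` stands for any WebSocket client object with a `send` method.

```python
import json
from typing import Iterable, Iterator


def chunk_pcm(audio: bytes, sample_rate: int = 16000,
              bytes_per_sample: int = 2, chunk_ms: int = 100) -> Iterator[bytes]:
    """Split raw mono PCM into chunks of about chunk_ms each.

    The sample rate and sample width are assumptions for illustration;
    use whatever encoding the connection was opened with.
    """
    size = sample_rate * bytes_per_sample * chunk_ms // 1000
    for start in range(0, len(audio), size):
        yield audio[start:start + size]


def send_turn(ws, audio: bytes) -> None:
    """Send one user turn: audio as binary messages, then `finalize` as text."""
    for chunk in chunk_pcm(audio):
        ws.send(chunk)       # bytes -> WebSocket binary message
    ws.send("finalize")      # str -> WebSocket text message, triggers transcription


def join_deltas(raw_messages: Iterable[str], text_key: str = "text") -> str:
    """Concatenate transcript deltas from JSON text messages.

    Each message is a delta, not a cumulative transcript, so the client
    has to stitch them together. The field name `text` is an assumption,
    as is the idea that deltas already carry any whitespace they need.
    Messages without that field are skipped.
    """
    parts = []
    for raw in raw_messages:
        message = json.loads(raw)
        if text_key in message:
            parts.append(message[text_key])
    return "".join(parts)


def close_session(ws) -> None:
    """Ask the server to finalize any buffered audio and close cleanly."""
    ws.send("close")
```

In a push-to-talk app, `send_turn` would typically run when the user releases the talk button (or the chunks would be streamed while the button is held, with `finalize` sent on release), and `close_session` once the conversation ends. How the end of a turn's transcript is signaled is not covered here, so `join_deltas` takes messages that have already been received rather than reading from the socket itself.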
For WebSocket connection limits, see the concurrency limits and timeouts page.
Messages