Cartesia exposes three ways to turn text into speech. The same models, voices, and core parameters apply everywhere. What changes is how you connect, how audio is framed on the wire, and whether you get timestamps, continuations (streaming model output into one spoken line), or many generations on one connection.

All three endpoints stream audio as it is produced. The bytes endpoint delivers that stream as a single HTTP response body (the same pattern the playground uses). SSE and WebSocket stream too; they chunk audio into multiple events or messages, which is how per-chunk metadata such as timestamps is carried.

Feature comparison

| Endpoint | Multiple generations per connection | Timestamps | Continuations |
|---|---|---|---|
| WebSocket | Yes | Yes | Yes |
| Bytes | No (one POST per generation) | No | No |
| SSE | No (one POST per generation) | Yes | No |
An utterance is one stretch of speech you want pronounced as a single unit (usually a sentence or a line of UI copy). Continuations let you send that utterance as several WebSocket messages that share a context_id. See Stream inputs using continuations and contexts.

If you care about time-to-first-byte on every turn, remember that a new HTTPS request pays for TCP and TLS again; that overhead is often on the same order as TTFB for the audio itself. WebSocket amortizes that cost when you keep the socket open. SSE is still supported for stacks that already consume Server-Sent Events or when you want timestamps while staying on HTTP. For audio only, bytes is usually the better HTTP choice (smaller encoding than JSON-wrapped chunks).

Pick an endpoint in one minute

| What you are building | Use this | Short label |
|---|---|---|
| Full transcript in one request; you want a streaming HTTP body (efficient; same pattern as the playground) | POST /tts/bytes | Stream speech (bytes) |
| Full transcript in one request; you need timestamps without WebSocket, or your stack already uses SSE | POST /tts/sse | Stream speech with timestamps (SSE) |
| Long-lived session, partial transcript (for example LLM tokens), lowest latency across many turns, timestamps, or several utterances on one socket | WebSocket /tts/websocket | Live session (WebSocket) |
If the full transcript is not known up front, use WebSocket with contexts, not bytes or SSE.
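The decision table above reduces to a few questions. As a rough sketch (an illustrative helper, not part of the Cartesia API):

```python
def pick_endpoint(full_transcript_known: bool,
                  need_timestamps: bool = False,
                  long_lived_session: bool = False) -> str:
    """Illustrative restatement of the decision table above."""
    # Partial transcripts (e.g. LLM tokens) or multi-turn sessions
    # need the WebSocket endpoint with contexts.
    if not full_transcript_known or long_lived_session:
        return "/tts/websocket"
    # Timestamps while staying on HTTP are only available via SSE.
    if need_timestamps:
        return "/tts/sse"
    # Audio-only, one-shot generation: bytes is leanest on the wire.
    return "/tts/bytes"
```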

Bytes (POST /tts/bytes)

Best for batch jobs, caching files, notifications, and anywhere one POST per generation is enough. The response body streams while audio is generated. You can read progressively or buffer to the end. For many output formats this is leaner on the wire than SSE because you receive raw or file bytes instead of JSON-wrapped chunks. Typical flow:
  1. One JSON payload with the full transcript, voice, model, and output format (WAV, MP3, raw PCM, and so on).
  2. POST to /tts/bytes.
  3. Read the body as data arrives, or consume it to completion.
One request is one generation. For another line of speech, send another POST. See bytes reference.
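The single JSON payload from step 1 might look like the sketch below. The exact field names (model_id, voice, output_format) and values are assumptions for illustration; check the bytes reference for the real schema.

```python
import json

def bytes_request_body(transcript: str) -> bytes:
    """Sketch of a /tts/bytes JSON payload. Field names and values
    here are illustrative, not the authoritative schema."""
    payload = {
        "model_id": "your-model-id",        # hypothetical model ID
        "transcript": transcript,            # full utterance, known up front
        "voice": {"id": "your-voice-id"},    # hypothetical voice selector
        "output_format": {                   # e.g. raw PCM for progressive playback
            "container": "raw",
            "encoding": "pcm_s16le",
            "sample_rate": 44100,
        },
    }
    return json.dumps(payload).encode("utf-8")
```

You would POST this body to /tts/bytes and read the response as it arrives; reading progressively lets playback begin before generation finishes.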

SSE (POST /tts/sse)

Best when you need timestamps while staying on HTTP without WebSocket, or when your integration already uses SSE. If you only need audio and not SSE-shaped events, bytes is usually simpler. WebSocket is otherwise the full-featured option for real-time use and supports timestamps as well. SSE remains available largely for backward compatibility and for teams that standardize on Server-Sent Events. Typical flow:
  1. Same as bytes: one JSON body with the full transcript.
  2. POST to /tts/sse.
  3. Consume Server-Sent Events; each event carries the next chunk until completion.
Bytes vs SSE:
| | Bytes | SSE |
|---|---|---|
| Shape | One streaming response body (raw or file bytes) | Many SSE events (often JSON plus base64 audio) |
| Timestamps | No | Yes (in the event payload) |
You still send one full transcript per request: SSE does not support WebSocket-style continuations across multiple POSTs. An optional context_id is echoed for your logs; it does not merge multiple HTTP requests into one utterance. To send text in pieces over time, use WebSocket. See SSE reference.
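Consuming the event stream (step 3) amounts to splitting on blank lines and decoding each event's data field. A minimal sketch, assuming each event's `data:` field is JSON with a base64 `audio` member (field names are assumptions; see the SSE reference for the real event schema):

```python
import base64
import json

def parse_sse_audio(stream_text: str) -> list[bytes]:
    """Minimal SSE parser: collect decoded audio chunks from events
    whose data field is JSON with a base64 `audio` member (assumed shape)."""
    chunks = []
    # SSE events are separated by a blank line.
    for block in stream_text.split("\n\n"):
        for line in block.splitlines():
            if line.startswith("data: "):
                event = json.loads(line[len("data: "):])
                if "audio" in event:  # base64-encoded audio chunk
                    chunks.append(base64.b64decode(event["audio"]))
    return chunks
```

A real client would parse incrementally as bytes arrive rather than on a complete string, but the framing is the same.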

WebSocket (/tts/websocket)

Best for assistants, games, telephony-style stacks, or any case where the connection stays open and transcript text may arrive over time. Why people choose WebSocket:
  1. Latency: you pay connect cost once; later generations avoid extra TCP/TLS round trips (often tens to low hundreds of ms per turn).
  2. Streaming input: send fragments as they arrive (for example from an LLM) and keep prosody across them. See continuations and contexts.
  3. Timestamps: word- or segment-level timing (model and language limits apply; see WebSocket docs).
  4. Multiplexing: multiple context_id values on one connection for overlapping utterances.
Typical flow:
  1. Open the WebSocket.
  2. Send JSON messages. When one utterance is split across messages, use context_id and continue: set continue: true on partials, and continue: false on the last part of that utterance (or use the empty-transcript pattern in contexts if you cannot know the final string yet).
  3. Read audio until the server finishes that context.
See WebSocket reference.
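Step 2 of the flow above can be sketched as a message builder. Only context_id, transcript, and continue are shown; a real message also carries model, voice, and output format (see the WebSocket reference for the full schema):

```python
import json

def continuation_messages(context_id: str, parts: list[str]) -> list[str]:
    """Sketch of the WebSocket messages for one utterance split across
    several sends: same context_id throughout, continue: true on every
    partial, continue: false on the final piece."""
    messages = []
    for i, part in enumerate(parts):
        messages.append(json.dumps({
            "context_id": context_id,        # same ID for the whole utterance
            "transcript": part,
            "continue": i < len(parts) - 1,  # false only on the last part
        }))
    return messages
```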

Continuations

If you are not streaming text from a model, start with bytes or SSE, not continuations. When you do use WebSocket streaming, keep one stable context_id per utterance, continue: true on every partial, and continue: false on the final message for that utterance (see contexts). Things that break text or prosody:
  • Concatenation: chunks are joined verbatim. A missing space produces "...world!How..." instead of "...world! How...".
  • SSML and numbers: avoid splitting tokens that must stay together (for example decimals in SSML). See max_buffer_delay_ms in the continuations guide.
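The concatenation pitfall is easy to reproduce locally: chunks sharing a context_id are joined verbatim, so whitespace is the sender's responsibility.

```python
def joined_transcript(chunks: list[str]) -> str:
    """Chunks that share a context_id are concatenated verbatim;
    no separator is inserted between them."""
    return "".join(chunks)

# Missing trailing space: the words run together in the spoken output.
bad = joined_transcript(["...world!", "How are you?"])    # "...world!How are you?"
good = joined_transcript(["...world! ", "How are you?"])  # "...world! How are you?"
```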
If you leave continue: true longer than you meant, contexts eventually expire on their own and audio is still generated and flushed according to server rules. It is not a runaway failure mode. You should still send continue: false when you know the utterance is complete so your client state matches the server. Do not reuse old context_id values for unrelated utterances.

Why WebSocket uses context_id (and HTTP does not)

On POST /tts/bytes and POST /tts/sse, you send a complete transcript in one JSON body. There is no continuation protocol across requests. context_id and continue matter on WebSocket when one utterance’s text is split across multiple messages. The server concatenates chunks that share a context_id. continue: true means more text is coming; continue: false finalizes that utterance. Mental model:
  • Whole line of speech in one string: bytes or SSE. No context API.
  • Text arrives in pieces: WebSocket, one context_id per utterance, with continuations.

API ergonomics (all endpoints)

  • For server-side calls, prefer the API key in the Authorization header instead of query strings (headers are less likely to appear in access logs). WebSocket URLs in browsers may need different patterns for your platform.
  • Model IDs, voices, and core generation parameters match across bytes, SSE, and WebSocket. What differs is wire format, how chunks are exposed, and whether input can be streamed with continuations.
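For the first point, a server-side call would carry the key in a request header rather than the URL. A minimal sketch (the exact header name and scheme are assumptions; check the API reference for the authentication details your account uses):

```python
def auth_headers(api_key: str) -> dict[str, str]:
    """Put the API key in a header, not the query string, so it is
    less likely to appear in access logs. Header shape is illustrative."""
    return {"Authorization": f"Bearer {api_key}"}
```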

Where to go next

  • Stream speech (bytes): one POST, streaming response body
  • Stream speech with timestamps (SSE): timestamps and SSE-chunked audio
  • Live session (WebSocket): streaming input, multiplexing, lowest latency across turns