Our Text-to-Speech API includes many parameters that can be bewildering to developers who have not worked with audio before. In general, you should try to use the same encoding and sample rate across your entire audio pipeline, including telephony and device outputs. If you're saving audio samples, we recommend using the Text-to-Speech (Bytes) API with output_format.container: "wav" or output_format.container: "mp3" so audio players can automatically detect the encoding and sample rate.
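To illustrate why a container helps, here is a minimal sketch using only Python's standard library. It does not call the API: it synthesizes a short tone as raw 16-bit little-endian PCM, wraps it in a WAV header, and reads the format back. The helper names `sine_pcm_s16le` and `wrap_in_wav` are made up for this example, and mono output is an assumption.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 44100  # one of the sample rates listed on this page
DURATION_S = 0.1


def sine_pcm_s16le(freq_hz: float, sample_rate: int, duration_s: float) -> bytes:
    """Return a mono sine tone as raw 16-bit little-endian PCM bytes."""
    n = int(sample_rate * duration_s)
    samples = (
        int(0.5 * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(n)
    )
    return b"".join(struct.pack("<h", s) for s in samples)


def wrap_in_wav(pcm: bytes, sample_rate: int) -> bytes:
    """Prepend a WAV header describing mono 16-bit PCM at sample_rate."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)      # assumption: single-channel audio
        w.setsampwidth(2)      # 2 bytes per sample = 16 bits
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return buf.getvalue()


raw = sine_pcm_s16le(440.0, SAMPLE_RATE, DURATION_S)
wav_bytes = wrap_in_wav(raw, SAMPLE_RATE)

# Raw PCM bytes carry no format information, but the WAV header does,
# so a reader can recover channels, sample width, and sample rate itself.
with wave.open(io.BytesIO(wav_bytes), "rb") as r:
    detected = (r.getnchannels(), r.getsampwidth(), r.getframerate())
print(detected)  # (1, 2, 44100)
```

With the raw bytes alone, a player would have to be told the encoding and sample rate out of band, and a wrong guess plays back at the wrong speed or as noise.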
Reference
The container format for the audio output. Available options: RAW, WAV, MP3. Only the Bytes endpoint supports all container formats; our other endpoints (SSE, Websockets) only support RAW.

The encoding of the output audio. Available options: pcm_f32le, pcm_s16le, pcm_mulaw, pcm_alaw.

The sample rate of the output audio. Remember that to represent a given signal, the sample rate must be at least twice the highest frequency component of the signal (Nyquist theorem). Available options: 8000, 16000, 22050, 24000, 44100, 48000.

output_format for RAW (PCM) Audio
When using raw audio, set the output_format parameter so that the encoding and sample rate match what your output device expects.
| Encoding | Bit depth | Commonly used for | Pair with sample rate |
|---|---|---|---|
| pcm_s16le | 16-bit int | General-purpose playback, browsers, audio players, most devices | 16000-44100 |
| pcm_f32le | 32-bit float | ML post-processing, high-fidelity recording, audio analysis | 48000 |
| pcm_mulaw | 8-bit compressed | North American / Japanese telephony (G.711μ), Twilio | 8000 |
| pcm_alaw | 8-bit compressed | European / international telephony (G.711A) | 8000 |
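The options and bit depths above can be captured as data and checked before a request is sent. The sketch below is illustrative only. The helper names are hypothetical and not part of any SDK. The `encoding` and `sample_rate` key names, the lowercase `"raw"` value, and the endpoint labels are assumptions modeled on the `output_format.container: "wav"` spelling used on this page. The byte-rate calculation assumes single-channel audio.

```python
# Options as listed on this page (lowercase container spelling is assumed).
CONTAINERS = ("raw", "wav", "mp3")
SAMPLE_RATES = (8000, 16000, 22050, 24000, 44100, 48000)

# Bytes per sample, derived from the bit depths in the table above.
BYTES_PER_SAMPLE = {
    "pcm_s16le": 2,  # 16-bit int
    "pcm_f32le": 4,  # 32-bit float
    "pcm_mulaw": 1,  # 8-bit compressed
    "pcm_alaw": 1,   # 8-bit compressed
}


def build_raw_output_format(encoding: str, sample_rate: int) -> dict:
    """Return an output_format dict for raw audio, rejecting unlisted options.

    The key names "encoding" and "sample_rate" are assumptions.
    """
    if encoding not in BYTES_PER_SAMPLE:
        raise ValueError(f"unsupported encoding: {encoding!r}")
    if sample_rate not in SAMPLE_RATES:
        raise ValueError(f"unsupported sample rate: {sample_rate!r}")
    return {"container": "raw", "encoding": encoding, "sample_rate": sample_rate}


def container_allowed(endpoint: str, container: str) -> bool:
    """Bytes supports every container; other endpoints support only raw.

    The endpoint labels ("bytes", "sse", "websocket") are hypothetical.
    """
    if container not in CONTAINERS:
        raise ValueError(f"unknown container: {container!r}")
    return endpoint == "bytes" or container == "raw"


def raw_bytes_per_second(encoding: str, sample_rate: int) -> int:
    """Size of one second of raw audio, assuming a single channel."""
    return BYTES_PER_SAMPLE[encoding] * sample_rate


print(build_raw_output_format("pcm_mulaw", 8000))
print(raw_bytes_per_second("pcm_mulaw", 8000))   # 8000
print(raw_bytes_per_second("pcm_s16le", 44100))  # 88200
print(raw_bytes_per_second("pcm_f32le", 48000))  # 192000
```

Knowing the byte rate is useful when sizing playback buffers or estimating the duration of a raw stream from its length.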
Audio CD quality
Standard audio CDs are encoded as pcm_s16le at a 44.1 kHz sample rate.
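The 44.1 kHz figure ties back to the Nyquist condition in the Reference section: a sample rate can represent frequencies up to half its value. The sketch below applies that rule to the sample rates listed on this page. The function names are hypothetical, and the example frequencies (roughly 3.4 kHz for the telephony voice band, roughly 20 kHz for the upper limit of human hearing) are commonly cited values, not figures from this API.

```python
# Sample rates listed on this page, in ascending order.
SAMPLE_RATES = (8000, 16000, 22050, 24000, 44100, 48000)


def nyquist_limit_hz(sample_rate: int) -> float:
    """Highest frequency a given sample rate can represent."""
    return sample_rate / 2


def min_sample_rate(max_freq_hz: float) -> int:
    """Smallest listed sample rate that is at least twice max_freq_hz."""
    for rate in SAMPLE_RATES:
        if rate >= 2 * max_freq_hz:
            return rate
    raise ValueError(f"no listed sample rate can represent {max_freq_hz} Hz")


print(nyquist_limit_hz(44100))  # 22050.0
print(min_sample_rate(3400))    # 8000
print(min_sample_rate(20000))   # 44100
```

This is why 8000 Hz is sufficient for telephony encodings, while full-range audio calls for 44100 Hz or higher.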