This page explains how to configure output_format for TTS responses (container, encoding, and sample_rate). In general, use a consistent encoding and sample rate across your audio pipeline (telephony, playback, and storage) to avoid unnecessary transcoding and quality loss. If you’re saving audio samples, we recommend using the Text-to-Speech (Bytes) API with output_format.container: "wav" or output_format.container: "mp3" so audio players can automatically detect the encoding and sample rate.

Reference

output_format.container
string
The container format for the audio output. Available options: raw, wav, mp3. Only the Bytes endpoint supports all container formats; our other endpoints (SSE, WebSocket) only support raw.
output_format.encoding
string
The encoding of the output audio. Available options: pcm_f32le, pcm_s16le, pcm_mulaw, pcm_alaw.
output_format.sample_rate
number
The sample rate of the output audio. Remember that to represent a given signal, the sample rate must be at least twice the highest frequency component of the signal (Nyquist theorem). Available options: 8000, 16000, 22050, 24000, 44100, 48000.
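
As a rough illustration of the Nyquist rule above, the sketch below picks the lowest listed sample rate that can represent a given highest frequency. The helper names nyquist_limit_hz and min_sample_rate_for are hypothetical and not part of any Cartesia SDK; only the list of sample rates comes from this page.

```python
# Sample rates listed in the reference above.
SAMPLE_RATES = [8000, 16000, 22050, 24000, 44100, 48000]


def nyquist_limit_hz(sample_rate: int) -> float:
    """Highest frequency a given sample rate can represent."""
    return sample_rate / 2


def min_sample_rate_for(max_freq_hz: float) -> int:
    """Lowest listed sample rate that is at least twice max_freq_hz."""
    for rate in SAMPLE_RATES:
        if rate >= 2 * max_freq_hz:
            return rate
    raise ValueError(f"No listed sample rate can represent {max_freq_hz} Hz")


if __name__ == "__main__":
    for rate in SAMPLE_RATES:
        print(f"{rate} Hz sample rate -> up to {nyquist_limit_hz(rate):.0f} Hz")
```

For example, narrowband telephone speech tops out around 3.4 kHz, which is why an 8000 Hz sample rate is enough for telephony.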

output_format for RAW (PCM) Audio

When using raw audio, set output_format so that the encoding and sample rate match what your output device or downstream system expects. Raw audio carries no header, so the receiver cannot detect these values on its own.

| Encoding | Bit depth | Commonly used for | Pair with sample rate |
| --- | --- | --- | --- |
| pcm_s16le | 16-bit int | General-purpose playback, browsers, audio players, most devices | 16000-44100 |
| pcm_f32le | 32-bit float | ML post-processing, high-fidelity recording, audio analysis | 48000 |
| pcm_mulaw | 8-bit compressed | North American / Japanese telephony (G.711μ), Twilio | 8000 |
| pcm_alaw | 8-bit compressed | European / international telephony (G.711A) | 8000 |
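
The bit depths in the table above translate directly into data rates, which is handy when sizing buffers or estimating the duration of a raw byte stream. The following is a minimal sketch that assumes single-channel (mono) audio; the helper names are hypothetical.

```python
# Bytes per sample for each raw encoding, derived from the bit depths above.
BYTES_PER_SAMPLE = {
    "pcm_s16le": 2,  # 16-bit signed integer
    "pcm_f32le": 4,  # 32-bit float
    "pcm_mulaw": 1,  # 8-bit companded
    "pcm_alaw": 1,   # 8-bit companded
}


def raw_bytes_per_second(encoding: str, sample_rate: int) -> int:
    """Data rate of raw audio, assuming one channel (mono)."""
    return BYTES_PER_SAMPLE[encoding] * sample_rate


def raw_duration_seconds(num_bytes: int, encoding: str, sample_rate: int) -> float:
    """Duration represented by num_bytes of raw mono audio."""
    return num_bytes / raw_bytes_per_second(encoding, sample_rate)


if __name__ == "__main__":
    print(raw_bytes_per_second("pcm_s16le", 44100))
    print(raw_duration_seconds(16000, "pcm_mulaw", 8000))
```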

Audio CD quality

Standard audio CDs store 16-bit PCM at a 44.1 kHz sample rate, which corresponds to pcm_s16le at 44100.
{
  "container": "raw",
  "encoding": "pcm_s16le",
  "sample_rate": 44100
}
This performs well for consumer digital audio setups.
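
If you receive raw pcm_s16le bytes and later want a file that audio players can open, one option is to wrap them in a WAV header yourself. The sketch below uses Python's standard wave module and assumes mono audio; the sine tone is only a stand-in for real TTS output, and the function names are hypothetical.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 44100  # must match output_format.sample_rate


def pcm_s16le_to_wav(pcm: bytes, sample_rate: int = SAMPLE_RATE) -> bytes:
    """Wrap raw 16-bit little-endian mono PCM in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)      # assumption: mono audio
        wav.setsampwidth(2)      # 2 bytes per sample for pcm_s16le
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


def sine_pcm_s16le(freq_hz: float = 440.0, seconds: float = 0.1,
                   sample_rate: int = SAMPLE_RATE) -> bytes:
    """Generate a test tone as raw pcm_s16le (placeholder for TTS bytes)."""
    n = round(seconds * sample_rate)
    return b"".join(
        struct.pack("<h", int(32767 * 0.5 * math.sin(2 * math.pi * freq_hz * i / sample_rate)))
        for i in range(n)
    )


if __name__ == "__main__":
    wav_bytes = pcm_s16le_to_wav(sine_pcm_s16le())
    with open("sample.wav", "wb") as f:
        f.write(wav_bytes)
```

If the sample rate written to the header differs from the one the audio was generated at, playback will sound sped up or slowed down, so keep the two values in sync.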

Telephony

North America and Japan

Many customers send their audio output over Twilio. All audio sent over Twilio is transcoded to µ-law encoding with an 8 kHz sample rate, so requesting that format directly avoids an extra transcoding step.
{
  "container": "raw",
  "encoding": "pcm_mulaw",
  "sample_rate": 8000
}
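
Telephony pipelines often forward audio in short fixed-duration frames. At 8 kHz with one byte per sample, each millisecond of µ-law audio is 8 bytes, so a 20 ms frame is 160 bytes. The sketch below splits a raw byte stream accordingly; the 20 ms frame length and the function name are assumptions for illustration, not a requirement stated on this page.

```python
def frame_audio(audio: bytes, frame_ms: int = 20,
                sample_rate: int = 8000, bytes_per_sample: int = 1) -> list[bytes]:
    """Split raw audio into fixed-duration frames (last frame may be shorter)."""
    frame_size = sample_rate * frame_ms // 1000 * bytes_per_sample
    return [audio[i:i + frame_size] for i in range(0, len(audio), frame_size)]


if __name__ == "__main__":
    # 400 bytes of placeholder data standing in for 50 ms of pcm_mulaw audio.
    frames = frame_audio(bytes(400))
    print([len(f) for f in frames])
```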

Europe, India, and others

The standard for European and international telephone networks (G.711A) is 8-bit A-law compressed PCM with an 8 kHz sample rate.
{
  "container": "raw",
  "encoding": "pcm_alaw",
  "sample_rate": 8000
}
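
Since the two telephony configurations differ only in the encoding, an application that serves several regions might build output_format from the companding law its carrier uses. This is a minimal sketch; the function name and the "mu-law" / "a-law" keys are hypothetical.

```python
import json

# Maps a companding law to the corresponding output_format.encoding value.
TELEPHONY_ENCODINGS = {
    "mu-law": "pcm_mulaw",  # North America and Japan (G.711μ)
    "a-law": "pcm_alaw",    # Europe and most other regions (G.711A)
}


def telephony_output_format(companding: str) -> dict:
    """Build a raw 8 kHz output_format for the given companding law."""
    try:
        encoding = TELEPHONY_ENCODINGS[companding]
    except KeyError:
        raise ValueError(f"Unknown companding law: {companding!r}") from None
    return {"container": "raw", "encoding": encoding, "sample_rate": 8000}


if __name__ == "__main__":
    print(json.dumps(telephony_output_format("a-law"), indent=2))
```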

Bluetooth headsets

If you happen to know that the user is using a Bluetooth headset (such as AirPods) for both microphone input and headphone output, the headset will be on the Bluetooth Hands-Free Profile (HFP), which limits the sample rate to 16 kHz. (In practice, it’s difficult to programmatically determine the end user’s microphone and speaker devices, so this example is a bit contrived.)
{
  "container": "raw",
  "encoding": "pcm_s16le",
  "sample_rate": 16000
}