Our Text-to-Speech API includes many parameters that can be bewildering to developers who have not worked with audio before. In general, you should try to use the same encoding and sample rate across your entire audio pipeline, including telephony and device outputs. If you’re saving audio samples, we recommend using the Text-to-Speech (Bytes) API with output_format.container: "wav" or output_format.container: "mp3" so audio players can automatically detect the encoding and sample rate.

Reference

output_format.container
string
The container format for the audio output. Available options: RAW, WAV, MP3. Only the Bytes endpoint supports all container formats; our other endpoints (SSE, WebSocket) only support RAW.
output_format.encoding
string
The encoding of the output audio. Available options: pcm_f32le, pcm_s16le, pcm_mulaw, pcm_alaw.
output_format.sample_rate
number
The sample rate of the output audio. Remember that to represent a given signal, the sample rate must be at least twice the highest frequency component of the signal (Nyquist theorem). Available options: 8000, 16000, 22050, 24000, 44100, 48000.
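The Nyquist rule above can be sketched as a small helper. This is an illustrative sketch only: the function names `nyquist_limit_hz` and `lowest_sufficient_rate` are hypothetical and not part of the API; the list of rates is copied from the reference above.

```python
# Sample rates listed in the reference above.
SUPPORTED_SAMPLE_RATES = (8000, 16000, 22050, 24000, 44100, 48000)


def nyquist_limit_hz(sample_rate: int) -> float:
    """Highest frequency (in Hz) that a given sample rate can represent."""
    return sample_rate / 2


def lowest_sufficient_rate(max_frequency_hz: float) -> int:
    """Smallest supported sample rate whose Nyquist limit covers max_frequency_hz."""
    for rate in SUPPORTED_SAMPLE_RATES:
        if nyquist_limit_hz(rate) >= max_frequency_hz:
            return rate
    raise ValueError(f"No supported sample rate covers {max_frequency_hz} Hz")
```

For example, an 8000 Hz sample rate can represent content up to 4000 Hz, which is why it is adequate for telephone speech but not for full-bandwidth audio.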

output_format for RAW (PCM) Audio

When using raw audio, set the output_format parameter so that the encoding and sample rate match what your output device expects.
| Encoding | Bit depth | Commonly used for | Pair with sample rate |
| --- | --- | --- | --- |
| pcm_s16le | 16-bit int | General-purpose playback, browsers, audio players, most devices | 16000-44100 |
| pcm_f32le | 32-bit float | ML post-processing, high-fidelity recording, audio analysis | 48000 |
| pcm_mulaw | 8-bit compressed | North American / Japanese telephony (G.711μ), Twilio | 8000 |
| pcm_alaw | 8-bit compressed | European / international telephony (G.711A) | 8000 |
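The bit depths in the table translate directly into data rates, which can help when sizing buffers or estimating the duration of a raw stream. A minimal sketch, assuming single-channel (mono) audio by default; the helper names are hypothetical and not part of the API.

```python
# Bytes per sample for each encoding, derived from the bit depths in the table.
BYTES_PER_SAMPLE = {
    "pcm_s16le": 2,  # 16-bit int
    "pcm_f32le": 4,  # 32-bit float
    "pcm_mulaw": 1,  # 8-bit compressed
    "pcm_alaw": 1,   # 8-bit compressed
}


def bytes_per_second(encoding: str, sample_rate: int, channels: int = 1) -> int:
    """Raw data rate for a given encoding and sample rate (mono assumed by default)."""
    return BYTES_PER_SAMPLE[encoding] * sample_rate * channels


def duration_seconds(num_bytes: int, encoding: str, sample_rate: int, channels: int = 1) -> float:
    """Playback duration of a raw audio buffer of num_bytes bytes."""
    return num_bytes / bytes_per_second(encoding, sample_rate, channels)
```

If a raw buffer is played back with the wrong encoding or sample rate, these numbers no longer line up, and the audio typically sounds like noise or plays at the wrong speed.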

Audio CD quality

Standard audio CDs are encoded as pcm_s16le at a 44.1 kHz sample rate.
{
  "container": "raw",
  "encoding": "pcm_s16le",
  "sample_rate": 44100
}
This performs well for consumer digital audio setups.
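Raw PCM carries no header, so a player cannot detect its encoding or sample rate. If you have already received raw pcm_s16le bytes and want a playable file, one option is to wrap them in a WAV container yourself with Python's standard `wave` module. This is a sketch under assumptions: the function name is hypothetical, and the audio is assumed to be mono unless you pass a different channel count.

```python
import io
import wave


def wrap_pcm_s16le_in_wav(pcm_bytes: bytes, sample_rate: int = 44100, channels: int = 1) -> bytes:
    """Return WAV file bytes containing the given raw pcm_s16le samples."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # pcm_s16le uses 2 bytes per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)
    return buffer.getvalue()
```

The `sample_rate` passed here must be the same value that was requested in output_format; the WAV header only records the rate, it does not resample the audio.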

Telephony

North America and Japan

Many customers send their audio output over Twilio. All audio sent over Twilio is transcoded to µ-law encoding with an 8 kHz sample rate.
{
  "container": "raw",
  "encoding": "pcm_mulaw",
  "sample_rate": 8000
}
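Because pcm_mulaw at 8000 Hz is one byte per sample, the raw output can be split into fixed-duration frames without any transcoding. The sketch below packages such frames as JSON messages in the shape that, to our understanding, Twilio Media Streams expects for outbound audio (`event`, `streamSid`, and a base64 `media.payload`); treat that message shape and the 20 ms frame size as assumptions and verify them against Twilio's documentation. The function name is hypothetical.

```python
import base64
import json
from typing import Iterator

SAMPLE_RATE = 8000   # Hz, matches output_format.sample_rate
FRAME_MS = 20        # illustrative frame duration, not a documented requirement
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000  # 160 bytes, since mu-law is 1 byte per sample


def twilio_media_messages(mulaw_bytes: bytes, stream_sid: str) -> Iterator[str]:
    """Yield JSON strings, each carrying one base64-encoded frame of mu-law audio."""
    for start in range(0, len(mulaw_bytes), FRAME_BYTES):
        frame = mulaw_bytes[start:start + FRAME_BYTES]
        yield json.dumps({
            "event": "media",
            "streamSid": stream_sid,
            "media": {"payload": base64.b64encode(frame).decode("ascii")},
        })
```

Each yielded string would then be sent over the call's WebSocket connection; the final frame may be shorter than 160 bytes.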

Europe, India, and others

The standard for European and international telephone networks (G.711A) is 8-bit A-law compressed PCM with an 8 kHz sample rate.
{
  "container": "raw",
  "encoding": "pcm_alaw",
  "sample_rate": 8000
}
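If your application places calls in several regions, the two telephony configurations above can be selected programmatically. A minimal sketch: the region labels and the function name are hypothetical, and any region not listed as µ-law falls back to A-law, mirroring the split described in this section.

```python
# Regions whose telephone networks use mu-law, per the section above.
MULAW_REGIONS = {"north_america", "japan"}


def telephony_output_format(region: str) -> dict:
    """Build an output_format dict for 8 kHz telephony audio in the given region."""
    encoding = "pcm_mulaw" if region.lower() in MULAW_REGIONS else "pcm_alaw"
    return {"container": "raw", "encoding": encoding, "sample_rate": 8000}
```

In practice your telephony provider's requirements take precedence over geography, so check which encoding the provider expects before relying on a region lookup like this.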

Bluetooth headsets

If you happen to know that the user is using a Bluetooth headset (such as AirPods) for both microphone input and headphone output, the headset will be on the Bluetooth Hands-Free Profile (HFP), which limits the sample rate to 16 kHz. (In practice, it’s difficult to programmatically determine the end user’s microphone and speaker devices, so this example is a bit contrived.)
{
  "container": "raw",
  "encoding": "pcm_s16le",
  "sample_rate": 16000
}