Audio encodings - Cartesia Docs

Pick the encoding that matches your downstream pipeline. If unsure, start with pcm_s16le.

TTS output encodings

Used in the output_format.encoding field when generating audio.

Encoding	Bit depth	Best for	Pair with sample rate
`pcm_s16le`	16-bit int	General-purpose playback, browsers, audio players, most devices	44100 (CD quality) or 16000–48000
`pcm_f32le`	32-bit float	ML post-processing, high-fidelity recording, audio analysis	48000
`pcm_mulaw`	8-bit compressed	North American / Japanese telephony (G.711μ), Twilio	8000
`pcm_alaw`	8-bit compressed	European / international telephony (G.711A)	8000

`pcm_s16le`

16-bit signed integer PCM, little-endian. This is the sample format used on audio CDs and the most widely supported encoding across audio players, browsers, and hardware. Use it as your default unless you have a specific reason to choose another encoding.

{
  "container": "raw",
  "encoding": "pcm_s16le",
  "sample_rate": 44100
}
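
Raw pcm_s16le output has no header, so most players need to be told the sample format before they can play it. As a minimal sketch, the snippet below wraps raw bytes in a WAV container using Python's standard `wave` module. The function name and the `pcm_bytes` argument are hypothetical, and mono audio is assumed because this page does not state a channel count.

```python
import io
import wave


def wrap_s16le_as_wav(pcm_bytes: bytes, sample_rate: int = 44100, channels: int = 1) -> bytes:
    """Wrap headerless pcm_s16le bytes in a WAV container (mono by default, an assumption)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)     # assumption: mono output
        wav_file.setsampwidth(2)            # 16-bit samples are 2 bytes each
        wav_file.setframerate(sample_rate)  # must match the requested sample_rate
        wav_file.writeframes(pcm_bytes)
    return buffer.getvalue()
```

The `sample_rate` passed here should be the same value you requested in `output_format`; a mismatch plays the audio at the wrong speed and pitch.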

`pcm_f32le`

32-bit floating-point PCM, little-endian. Provides the highest precision and dynamic range of the output encodings. Use it when your pipeline handles float audio end to end, for example when feeding generated audio into an ML model, performing signal processing with NumPy/SciPy, or recording to a lossless format for later mastering.

{
  "container": "raw",
  "encoding": "pcm_f32le",
  "sample_rate": 48000
}
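
To illustrate what "float audio" means at the byte level, here is a small sketch that unpacks raw pcm_f32le bytes into Python floats with the standard `struct` module. The function name is hypothetical; each sample occupies 4 bytes in little-endian order.

```python
import struct


def decode_f32le(pcm_bytes: bytes) -> list[float]:
    """Unpack headerless pcm_f32le bytes into Python floats (4 bytes per sample)."""
    if len(pcm_bytes) % 4 != 0:
        raise ValueError("pcm_f32le data must be a multiple of 4 bytes")
    count = len(pcm_bytes) // 4
    # "<" forces little-endian regardless of the host CPU; "f" is a 32-bit float.
    return list(struct.unpack(f"<{count}f", pcm_bytes))
```

If NumPy is already part of your pipeline, `numpy.frombuffer(pcm_bytes, dtype="<f4")` produces the same samples as an array without copying.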

`pcm_mulaw`

8-bit μ-law compressed PCM, the standard encoding for North American and Japanese telephone networks (G.711 μ-law). Use it when sending audio to Twilio or any telephony provider that expects μ-law. Always pair it with an 8000 Hz sample rate to match the telephony standard.

{
  "container": "raw",
  "encoding": "pcm_mulaw",
  "sample_rate": 8000
}
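
If you need to inspect or mix μ-law audio locally, each byte has to be expanded back to a linear sample first. The sketch below follows the widely used G.711 μ-law reference expansion; it is a generic illustration, not Cartesia-specific code, and the function names are hypothetical.

```python
def mulaw_byte_to_pcm16(value: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit linear sample."""
    value = ~value & 0xFF                      # mu-law bytes are stored bit-inverted
    magnitude = ((value & 0x0F) << 3) + 0x84   # 4-bit mantissa plus the bias (132)
    magnitude <<= (value & 0x70) >> 4          # scale by the 3-bit segment number
    # The top bit carries the sign; remove the bias after scaling.
    return (0x84 - magnitude) if (value & 0x80) else (magnitude - 0x84)


def decode_mulaw(data: bytes) -> list[int]:
    """Expand a buffer of mu-law bytes into linear 16-bit sample values."""
    return [mulaw_byte_to_pcm16(byte) for byte in data]
```

Because each μ-law sample is one byte, one second of audio at 8000 Hz is exactly 8000 bytes.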

`pcm_alaw`

8-bit A-law compressed PCM, the standard encoding for European and international telephone networks (G.711 A-law). Use it when your telephony infrastructure expects A-law rather than μ-law. Always pair it with an 8000 Hz sample rate.

{
  "container": "raw",
  "encoding": "pcm_alaw",
  "sample_rate": 8000
}
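
A-law uses a different expansion curve from μ-law, so the two are not interchangeable at the byte level. As a rough sketch based on the common G.711 A-law reference expansion (generic, not Cartesia-specific; the function name is hypothetical):

```python
def alaw_byte_to_pcm16(value: int) -> int:
    """Expand one G.711 A-law byte to a signed 16-bit linear sample."""
    value ^= 0x55                        # A-law bytes have alternate bits inverted
    magnitude = (value & 0x0F) << 4      # 4-bit mantissa
    segment = (value & 0x70) >> 4        # 3-bit segment number
    if segment == 0:
        magnitude += 8
    else:
        magnitude += 0x108
        magnitude <<= segment - 1
    # In A-law a set top bit marks a positive sample.
    return magnitude if (value & 0x80) else -magnitude
```

Decoding A-law data with a μ-law decoder (or the reverse) produces loud, distorted audio, which is a quick way to spot a mismatched encoding.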

STT input encodings

Used in the encoding parameter when sending audio for transcription. Must match the actual encoding of your audio source.

Encoding	Bit depth	Common sources
`pcm_s16le`	16-bit int	Microphones, browsers (Web Audio API), most audio capture libraries
`pcm_s32le`	32-bit int	Professional audio interfaces
`pcm_f16le`	16-bit float	Half-precision ML pipelines
`pcm_f32le`	32-bit float	ML models, Web Audio API `AudioWorklet` nodes, NumPy/SciPy
`pcm_mulaw`	8-bit compressed	North American telephony, Twilio streams
`pcm_alaw`	8-bit compressed	European telephony systems

For best STT performance, convert your audio to pcm_s16le and resample it to 16000 Hz before sending.
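
As an illustration of the format half of that advice, the sketch below converts float samples to pcm_s16le bytes. It assumes the input is normalized to the range -1.0 to 1.0 (a common convention, not something this page specifies) and is already at 16000 Hz; changing the sample rate itself is best left to a dedicated resampling library. The function name is hypothetical.

```python
import struct


def floats_to_s16le(samples: list[float]) -> bytes:
    """Convert float samples in [-1.0, 1.0] to little-endian 16-bit PCM bytes."""
    ints = []
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))   # clip out-of-range values
        ints.append(int(round(clipped * 32767)))
    # "<" forces little-endian; "h" is a signed 16-bit integer.
    return struct.pack(f"<{len(ints)}h", *ints)
```

Clipping before scaling keeps out-of-range samples from overflowing the 16-bit range, which would otherwise raise a `struct.error`.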

How to choose

Identify your output destination

Where does the audio end up? A browser, a phone call, an ML pipeline, a file on disk?

Match the encoding to the destination

Browser or device playback → pcm_s16le
ML or audio processing pipeline → pcm_f32le
Twilio or NA/JP telephony → pcm_mulaw at 8 kHz
European telephony → pcm_alaw at 8 kHz
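
The mapping above can be captured as a small lookup. This is only a sketch: the destination labels (`"playback"`, `"ml"`, and so on) and the function name are hypothetical, while the `container`, `encoding`, and `sample_rate` values are taken from the examples on this page.

```python
# Hypothetical destination labels mapped to the output_format values shown above.
OUTPUT_FORMATS = {
    "playback": {"container": "raw", "encoding": "pcm_s16le", "sample_rate": 44100},
    "ml": {"container": "raw", "encoding": "pcm_f32le", "sample_rate": 48000},
    "telephony_mulaw": {"container": "raw", "encoding": "pcm_mulaw", "sample_rate": 8000},
    "telephony_alaw": {"container": "raw", "encoding": "pcm_alaw", "sample_rate": 8000},
}


def output_format_for(destination: str) -> dict:
    """Return a copy of the output_format settings for a destination label."""
    try:
        return dict(OUTPUT_FORMATS[destination])
    except KeyError:
        known = ", ".join(sorted(OUTPUT_FORMATS))
        raise ValueError(f"unknown destination {destination!r}; expected one of: {known}") from None
```

Returning a copy lets callers adjust the sample rate for one request without changing the shared defaults.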

Pick the highest sample rate your pipeline supports

Higher sample rates preserve more audio detail. Use 44100 or 48000 for general playback, 16000 for Bluetooth HFP, and 8000 for telephony.
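
Higher sample rates also cost more bandwidth and storage. For raw audio the data rate is the sample rate multiplied by the bytes per sample, which the sketch below computes from the bit depths listed in the tables above. Mono audio is assumed, since this page does not state a channel count, and the function name is hypothetical.

```python
# Bytes per sample, derived from the bit depths in the encoding tables.
BYTES_PER_SAMPLE = {
    "pcm_s16le": 2,  # 16-bit int
    "pcm_f32le": 4,  # 32-bit float
    "pcm_mulaw": 1,  # 8-bit compressed
    "pcm_alaw": 1,   # 8-bit compressed
}


def bytes_per_second(encoding: str, sample_rate: int, channels: int = 1) -> int:
    """Raw audio data rate in bytes per second (mono by default, an assumption)."""
    return BYTES_PER_SAMPLE[encoding] * sample_rate * channels


print(bytes_per_second("pcm_s16le", 44100))  # 88200
print(bytes_per_second("pcm_f32le", 48000))  # 192000
print(bytes_per_second("pcm_mulaw", 8000))   # 8000
```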
