Audio Input - Cartesia Docs

When sending raw (PCM) audio for realtime speech-to-text transcription, specify the encoding and sample rate as query parameters since they cannot be detected from the audio data itself. In general, you should match the encoding and sample rate to whatever your upstream pipeline (microphone capture, telephony stream, ML model output) already produces.

API Reference

encoding (string)

The encoding of the input audio. Available options: pcm_s16le, pcm_s32le, pcm_f16le, pcm_f32le, pcm_mulaw, pcm_alaw.

sample_rate (number)

The sample rate of the input audio in Hz. Must match the actual sample rate of the audio you send.

Unlike the realtime endpoints, batch STT also accepts containerized audio (e.g. wav, mp3). Supply the encoding and sample_rate query parameters only when sending raw PCM audio.
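As an illustration of how the two query parameters fit together, here is a minimal sketch that validates an encoding against the options listed above and builds the query string. The helper name buildAudioQuery, the integer check on the sample rate, and the placeholder base URL are assumptions made for this example, not part of the API; substitute the endpoint from the API reference.

```javascript
// Encodings listed in the API reference above.
const SUPPORTED_ENCODINGS = [
  "pcm_s16le",
  "pcm_s32le",
  "pcm_f16le",
  "pcm_f32le",
  "pcm_mulaw",
  "pcm_alaw",
];

// Hypothetical helper: builds the "?encoding=...&sample_rate=..." query string.
function buildAudioQuery(encoding, sampleRate) {
  if (!SUPPORTED_ENCODINGS.includes(encoding)) {
    throw new Error(`unsupported encoding: ${encoding}`);
  }
  // Assumption for this sketch: sample rates are positive whole numbers of Hz.
  if (!Number.isInteger(sampleRate) || sampleRate <= 0) {
    throw new Error(`invalid sample rate: ${sampleRate}`);
  }
  const params = new URLSearchParams({
    encoding,
    sample_rate: String(sampleRate),
  });
  return `?${params.toString()}`;
}

// Placeholder base URL, not a real endpoint.
const exampleUrl = "wss://example.invalid/stt" + buildAudioQuery("pcm_s16le", 16000);
console.log(exampleUrl);
```

Validating on the client is useful here because a mismatched value may not be rejected by the server.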

Cheat sheet

When sending raw audio, the encoding and sample rate must match what your upstream source produces. The table below is a quick rule of thumb to get started:

Encoding	Bit depth	Common sources	Pair with sample rate (Hz)
`pcm_s16le`	16-bit int	Voice agent platforms, WAV files, most audio capture libraries	8000–48000
`pcm_s32le`	32-bit int	Professional audio interfaces and DAWs	44100–48000
`pcm_f16le`	16-bit float	Uncommon; some half-precision ML pipelines	16000–48000
`pcm_f32le`	32-bit float	Browsers (Web Audio API), ML models (PyTorch, NumPy/SciPy)	16000–48000
`pcm_mulaw`	8-bit compressed	North American / Japanese telephony (G.711μ), Twilio	8000
`pcm_alaw`	8-bit compressed	European / international telephony (G.711A)	8000
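The bit depths in the table translate directly into bandwidth. The sketch below computes the byte rate of a mono stream and the duration implied by a given number of bytes, which is a quick way to check that a capture is the size you expect. The function names are hypothetical, and mono audio is assumed.

```javascript
// Bytes per sample for each encoding, taken from the bit depths in the table.
const BYTES_PER_SAMPLE = {
  pcm_s16le: 2,
  pcm_s32le: 4,
  pcm_f16le: 2,
  pcm_f32le: 4,
  pcm_mulaw: 1,
  pcm_alaw: 1,
};

// Bytes per second of mono audio at the given encoding and sample rate.
function bytesPerSecond(encoding, sampleRate) {
  const width = BYTES_PER_SAMPLE[encoding];
  if (width === undefined) {
    throw new Error(`unknown encoding: ${encoding}`);
  }
  return width * sampleRate;
}

// Duration in seconds represented by `byteLength` bytes of mono audio.
function durationSeconds(byteLength, encoding, sampleRate) {
  return byteLength / bytesPerSecond(encoding, sampleRate);
}

console.log(bytesPerSecond("pcm_f32le", 48000)); // 192000
console.log(bytesPerSecond("pcm_s16le", 16000)); // 32000
console.log(bytesPerSecond("pcm_mulaw", 8000)); // 8000
```

If a ten-second recording does not come out near ten seconds under the parameters you plan to declare, the encoding or sample rate is likely wrong.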

Telephony

North America and Japan

Many customers send their audio over Twilio. All audio sent over Twilio is transcoded to 8-bit µ-law encoding at an 8 kHz sample rate.

?encoding=pcm_mulaw&sample_rate=8000
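If your audio arrives through Twilio Media Streams, each chunk is typically wrapped in a JSON message rather than delivered as raw bytes. The sketch below assumes the message shape Twilio documents for Media Streams (an event field, and a base64 string in media.payload) and unwraps the µ-law bytes in Node.js. The function name is hypothetical; check Twilio's documentation for the authoritative format.

```javascript
// Returns the raw mu-law bytes carried by a Twilio Media Streams "media"
// message, or null for any other event (for example "start" or "stop").
// Assumed message shape: { "event": "media", "media": { "payload": "<base64>" } }
function mulawBytesFromTwilioMessage(rawMessage) {
  const message = JSON.parse(rawMessage);
  if (message.event !== "media" || !message.media) {
    return null;
  }
  // The payload is base64 text; decode it back into the original bytes.
  return Buffer.from(message.media.payload, "base64");
}

// Example with a synthetic message built for illustration only.
const sample = JSON.stringify({
  event: "media",
  media: { payload: Buffer.from([0xff, 0x7f, 0x00]).toString("base64") },
});
console.log(mulawBytesFromTwilioMessage(sample)); // <Buffer ff 7f 00>
```

The decoded bytes are what the pcm_mulaw and 8000 Hz parameters describe, so they can be forwarded without further conversion.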

Europe, India, and others

The standard for European and international telephone networks (G.711A) is 8-bit A-law compressed PCM with an 8 kHz sample rate.

?encoding=pcm_alaw&sample_rate=8000

Voice agent platforms

Many voice agent platforms use pcm_s16le at a 16 kHz sample rate in their pipeline. Confirm the exact format with your specific platform.

?encoding=pcm_s16le&sample_rate=16000

Web browsers

When capturing microphone audio through the Web Audio API, the samples are pcm_f32le. An AudioContext, and the AudioWorklet nodes you read frames from, always produces 32-bit float samples. The sample rate defaults to whatever the browser selects for the user's audio hardware, commonly 48 kHz but sometimes 44.1 kHz. Read it from AudioContext.sampleRate and send the same value:

const audioContext = new AudioContext();
console.log(audioContext.sampleRate); // e.g. 48000

?encoding=pcm_f32le&sample_rate=48000

Speech recognition gains little from rates above 16 kHz, so downsampling to pcm_s16le at 16 kHz before you send cuts bandwidth with negligible impact on accuracy.
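The downsampling step can be sketched as follows, assuming the input rate is an integer multiple of the output rate (for example 48000 to 16000). Both function names are hypothetical. Averaging each group of samples is only a crude low-pass filter, so treat this as a starting point; a 44.1 kHz capture needs a proper resampler.

```javascript
// Downsample by an integer factor by averaging each group of `factor` samples.
function downsampleByAveraging(input, inputRate, outputRate) {
  if (inputRate % outputRate !== 0) {
    throw new Error("this sketch only handles integer ratios, e.g. 48000 -> 16000");
  }
  const factor = inputRate / outputRate;
  const output = new Float32Array(Math.floor(input.length / factor));
  for (let i = 0; i < output.length; i++) {
    let sum = 0;
    for (let j = 0; j < factor; j++) {
      sum += input[i * factor + j];
    }
    output[i] = sum / factor;
  }
  return output;
}

// Convert float samples in [-1, 1] to 16-bit signed little-endian PCM bytes.
function floatToPcmS16le(input) {
  const buffer = new ArrayBuffer(input.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < input.length; i++) {
    // Clamp first so out-of-range samples do not wrap around.
    const s = Math.max(-1, Math.min(1, input[i]));
    const scaled = s < 0 ? s * 0x8000 : s * 0x7fff;
    // The final `true` argument writes the value in little-endian byte order.
    view.setInt16(i * 2, Math.round(scaled), true);
  }
  return new Uint8Array(buffer);
}

// Usage: frames captured at 48 kHz become 16 kHz pcm_s16le bytes.
const frames = new Float32Array([0, 0, 0, 0.5, 0.5, 0.5]);
const pcmBytes = floatToPcmS16le(downsampleByAveraging(frames, 48000, 16000));
console.log(pcmBytes.length); // 4 (two samples, two bytes each)
```

After this conversion, the parameters to declare are ?encoding=pcm_s16le&sample_rate=16000 rather than the values read from the AudioContext.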

Double check your parameters

The model decodes your bytes using the encoding and sample_rate you declared when opening the connection. The server might not return an error if these parameters are incorrect, so validate them yourself by saving your audio data and playing it back with ffplay:

# encoding=pcm_s16le
# sample_rate=16000
# 1 channel (the API expects mono)
ffplay -f s16le -ar 16000 -ac 1 audio.raw

# general format
ffplay -f <encoding_without_pcm_prefix> -ar <sample_rate> -ac <num_channels_must_be_one> <file_path>

If the playback sounds wrong (it should be quite obvious), then your encoding or sample_rate doesn’t match the data. Correct it so your audio plays back cleanly, then send those same values to the API.
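To keep the playback check in step with the parameters you actually send, the general ffplay form above can be generated from the same values. This is a small sketch with a hypothetical helper name; it only strips the pcm_ prefix as described above, so confirm with ffplay -formats that your ffplay build accepts the resulting format name.

```javascript
// Builds the ffplay command for raw mono audio with the given API parameters.
// The file path is inserted as-is; quote it yourself if it contains spaces.
function ffplayCommand(encoding, sampleRate, filePath) {
  if (!encoding.startsWith("pcm_")) {
    throw new Error(`expected an encoding starting with "pcm_", got: ${encoding}`);
  }
  // ffplay's -f flag takes the encoding name without the "pcm_" prefix.
  const format = encoding.slice("pcm_".length);
  // -ac 1 because the API expects mono audio.
  return `ffplay -f ${format} -ar ${sampleRate} -ac 1 ${filePath}`;
}

console.log(ffplayCommand("pcm_s16le", 16000, "audio.raw"));
// ffplay -f s16le -ar 16000 -ac 1 audio.raw
```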
