When sending raw (PCM) audio for realtime speech-to-text transcription, specify the encoding and sample rate as query parameters since they cannot be detected from the audio data itself. In general, you should match the encoding and sample rate to whatever your upstream pipeline (microphone capture, telephony stream, ML model output) already produces.

API Reference

encoding
string
The encoding of the input audio. Available options: pcm_s16le, pcm_s32le, pcm_f16le, pcm_f32le, pcm_mulaw, pcm_alaw.
sample_rate
number
The sample rate of the input audio in Hz. Must match the actual sample rate of the audio you send.
Unlike realtime endpoints, batch STT also accepts containerized audio (e.g. wav, mp3). Supply the encoding and sample_rate query parameters only when sending raw PCM audio.
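
As a minimal sketch of how the two parameters end up on the connection URL, the helper below validates the encoding against the options listed above and appends both values as query parameters. The function name `buildSttUrl` and the host `stt.example.invalid` are placeholders invented for illustration; substitute the realtime endpoint from your own API reference.

```javascript
// Encodings listed in the API reference above.
const ENCODINGS = new Set([
  "pcm_s16le", "pcm_s32le", "pcm_f16le",
  "pcm_f32le", "pcm_mulaw", "pcm_alaw",
]);

// Hypothetical helper: appends encoding and sample_rate to a base URL.
function buildSttUrl(baseUrl, encoding, sampleRate) {
  if (!ENCODINGS.has(encoding)) {
    throw new Error(`Unsupported encoding: ${encoding}`);
  }
  if (!Number.isFinite(sampleRate) || sampleRate <= 0) {
    throw new Error(`Invalid sample rate: ${sampleRate}`);
  }
  const url = new URL(baseUrl);
  url.searchParams.set("encoding", encoding);
  url.searchParams.set("sample_rate", String(sampleRate));
  return url.toString();
}

// Placeholder host, not a real endpoint.
const exampleUrl = buildSttUrl("wss://stt.example.invalid/realtime", "pcm_s16le", 16000);
console.log(exampleUrl);
```

Rejecting unknown encodings on the client is a design choice in this sketch: it surfaces a typo before any audio is sent, rather than relying on the server to report it.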

Cheat sheet

When sending raw audio, the encoding and sample rate must match what your upstream source produces. Here’s a quick rule-of-thumb to get started:
| Encoding | Bit depth | Common sources | Pair with sample rate (Hz) |
| --- | --- | --- | --- |
| pcm_s16le | 16-bit int | Voice agent platforms, WAV files, most audio capture libraries | 8000–48000 |
| pcm_s32le | 32-bit int | Professional audio interfaces and DAWs | 44100–48000 |
| pcm_f16le | 16-bit float | Uncommon; some half-precision ML pipelines | 16000–48000 |
| pcm_f32le | 32-bit float | Browsers (Web Audio API), ML models (PyTorch, NumPy/SciPy) | 16000–48000 |
| pcm_mulaw | 8-bit compressed | North American / Japanese telephony (G.711μ), Twilio | 8000 |
| pcm_alaw | 8-bit compressed | European / international telephony (G.711A) | 8000 |
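
The bit depths in the table above translate directly into bytes on the wire, which is useful when sizing the chunks you send. The sketch below assumes mono audio; `BYTES_PER_SAMPLE` and `chunkBytes` are illustrative names, not part of any API.

```javascript
// Bytes per sample for each encoding in the table above (mono audio).
const BYTES_PER_SAMPLE = {
  pcm_s16le: 2,
  pcm_s32le: 4,
  pcm_f16le: 2,
  pcm_f32le: 4,
  pcm_mulaw: 1,
  pcm_alaw: 1,
};

// Size in bytes of `durationMs` milliseconds of mono audio.
function chunkBytes(encoding, sampleRate, durationMs) {
  const width = BYTES_PER_SAMPLE[encoding];
  if (width === undefined) throw new Error(`Unknown encoding: ${encoding}`);
  const samples = Math.round((sampleRate * durationMs) / 1000);
  return samples * width;
}

console.log(chunkBytes("pcm_mulaw", 8000, 20));  // 160
console.log(chunkBytes("pcm_s16le", 16000, 20)); // 640
console.log(chunkBytes("pcm_f32le", 48000, 20)); // 3840
```

If the chunks your pipeline produces do not match the size computed for the parameters you declared, that is an early hint that the encoding or sample rate is wrong.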

Telephony

North America and Japan

Many customers send their audio output over Twilio. All audio sent over Twilio is transcoded to µ-law encoding with an 8 kHz sample rate.
?encoding=pcm_mulaw&sample_rate=8000
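
If you receive the audio through Twilio Media Streams, each WebSocket message is JSON, and audio arrives in `media` events whose `media.payload` field is base64-encoded μ-law. The sketch below, which assumes a Node.js server and that message shape, extracts the raw bytes so they can be forwarded unchanged with the parameters above. The function name `twilioAudioBytes` is illustrative; check the message format against Twilio's current documentation.

```javascript
// Returns the raw mu-law bytes from one Twilio Media Streams message,
// or null for non-audio events such as "connected", "start", or "stop".
function twilioAudioBytes(rawMessage) {
  const msg = JSON.parse(rawMessage);
  if (msg.event !== "media" || !msg.media || typeof msg.media.payload !== "string") {
    return null;
  }
  return Buffer.from(msg.media.payload, "base64");
}

// Illustrative message: 160 bytes of 0xFF (mu-law silence), i.e. 20 ms at 8 kHz.
const sampleMessage = JSON.stringify({
  event: "media",
  media: { payload: Buffer.alloc(160, 0xff).toString("base64") },
});
const audioBytes = twilioAudioBytes(sampleMessage);
console.log(audioBytes.length); // 160
```

No transcoding is needed in this path: the decoded bytes are already 8-bit μ-law at 8 kHz.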

Europe, India, and others

The standard for European and international telephone networks (G.711A) is 8-bit A-law compressed PCM with an 8 kHz sample rate.
?encoding=pcm_alaw&sample_rate=8000

Voice agent platforms

Many voice agent platforms use pcm_s16le at a 16 kHz sample rate in their pipeline. You should double check with your specific platform.
?encoding=pcm_s16le&sample_rate=16000

Web browsers

When capturing microphone audio through the Web Audio API, the samples are pcm_f32le. An AudioContext—and the AudioWorklet nodes you read frames from—always produces 32-bit float. The capture sample rate defaults to whatever the user’s input hardware reports, commonly 48 kHz but sometimes 44.1 kHz. Read it from AudioContext.sampleRate and send the same value:
const audioContext = new AudioContext();
console.log(audioContext.sampleRate); // e.g. 48000
?encoding=pcm_f32le&sample_rate=48000
Speech recognition gains little from rates above 16 kHz, so downsampling to pcm_s16le at 16 kHz before you send cuts bandwidth with negligible impact on accuracy.
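
The downsampling step described above can be sketched as two small functions: one that averages neighbouring samples to reduce the rate, and one that converts 32-bit float samples to 16-bit integers. Going from pcm_f32le at 48 kHz (192,000 bytes per second) to pcm_s16le at 16 kHz (32,000 bytes per second) is a sixfold reduction. The function names are illustrative, and block averaging is only a crude anti-aliasing filter chosen to keep the example short; a production pipeline would more likely use a proper resampler.

```javascript
// Reduce the sample rate by averaging each block of input samples.
function downsample(input, inputRate, outputRate) {
  if (outputRate > inputRate) {
    throw new Error("outputRate must not exceed inputRate");
  }
  const ratio = inputRate / outputRate;
  const outLength = Math.floor(input.length / ratio);
  const output = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const start = Math.floor(i * ratio);
    const end = Math.min(Math.floor((i + 1) * ratio), input.length);
    let sum = 0;
    for (let j = start; j < end; j++) sum += input[j];
    output[i] = sum / (end - start);
  }
  return output;
}

// Convert floats in [-1, 1] to signed 16-bit integers, clamping out-of-range values.
// Int16Array uses the platform byte order, which is little-endian on
// virtually all browsers and Node.js targets.
function floatToInt16(input) {
  const output = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i]));
    output[i] = Math.round(s < 0 ? s * 0x8000 : s * 0x7fff);
  }
  return output;
}

// One second of a constant signal at 48 kHz, standing in for captured audio.
const captured = new Float32Array(48000).fill(0.5);
const pcm16 = floatToInt16(downsample(captured, 48000, 16000));
console.log(pcm16.length, pcm16.byteLength); // 16000 32000
```

After this conversion, send `pcm16.buffer` and declare `?encoding=pcm_s16le&sample_rate=16000` instead of the float parameters.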

Double check your parameters

The model decodes your bytes using the encoding and sample_rate you declared when opening the connection. The server might not return an error if these parameters are incorrect, so a mismatch can go unnoticed. You can validate your parameters by saving your audio data and playing it back with ffplay:
# encoding=pcm_s16le
# sample_rate=16000
# 1 channel (the API expects mono)
ffplay -f s16le -ar 16000 -ac 1 audio.raw

# general format
ffplay -f <encoding_without_pcm_prefix> -ar <sample_rate> -ac <num_channels_must_be_one> <file_path>
If the playback sounds wrong (it should be quite obvious), then your encoding or sample_rate doesn’t match the data. Correct it so your audio plays back cleanly, then send those same values to the API.
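
To keep the playback check in step with what you send to the API, the ffplay arguments can be derived from the same two values. The sketch below follows the general format shown above by stripping the `pcm_` prefix; `ffplayArgs` is an illustrative name, and it is an assumption that your ffplay build has a raw format for every encoding (running `ffplay -formats` lists what is available).

```javascript
// Build ffplay arguments from the API's encoding and sample_rate values.
function ffplayArgs(encoding, sampleRate, filePath) {
  if (!encoding.startsWith("pcm_")) {
    throw new Error(`Expected an encoding starting with pcm_: ${encoding}`);
  }
  const format = encoding.slice("pcm_".length); // e.g. "pcm_s16le" -> "s16le"
  // -ac 1 because the API expects mono audio.
  return ["-f", format, "-ar", String(sampleRate), "-ac", "1", filePath];
}

console.log(["ffplay", ...ffplayArgs("pcm_mulaw", 8000, "call.raw")].join(" "));
// ffplay -f mulaw -ar 8000 -ac 1 call.raw
```

Returning an argument array rather than a single string lets you pass it to a process-spawning call without worrying about shell quoting of the file path.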