# Audio Input

> How to find the right encoding and sample rate for realtime audio

When sending raw (PCM) audio for realtime speech-to-text transcription, specify the encoding and sample rate as query parameters since they cannot be detected from the audio data itself.

In general, you should match the encoding and sample rate to whatever your upstream pipeline (microphone capture, telephony stream, ML model output) already produces.

## API Reference

<ParamField query="encoding" type="string">
  The encoding of the input audio. Available options: `pcm_s16le`, `pcm_s32le`,
  `pcm_f16le`, `pcm_f32le`, `pcm_mulaw`, `pcm_alaw`.
</ParamField>

<ParamField query="sample_rate" type="number">
  The sample rate of the input audio in Hz. Must match the actual sample rate of
  the audio you send.
</ParamField>

<Info>
  Unlike realtime endpoints, batch STT also accepts containerized audio (e.g. `wav`, `mp3`).

  You should only supply the `encoding` and `sample_rate` query parameters when using raw PCM audio.
</Info>
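As a minimal sketch of how the two parameters fit together, the helper below assembles the query string for raw PCM input. The function name `buildAudioQuery` and the positive-integer check on the sample rate are assumptions made for illustration; they are not part of the API.

```ts
// Encodings listed in the API reference above.
type Encoding =
  | "pcm_s16le"
  | "pcm_s32le"
  | "pcm_f16le"
  | "pcm_f32le"
  | "pcm_mulaw"
  | "pcm_alaw";

// Hypothetical helper: builds the query string for raw PCM input.
// Assumption: sample rates are positive whole numbers of Hz.
function buildAudioQuery(encoding: Encoding, sampleRate: number): string {
  if (!Number.isInteger(sampleRate) || sampleRate <= 0) {
    throw new RangeError(`sample_rate must be a positive integer, got ${sampleRate}`);
  }
  return `?encoding=${encodeURIComponent(encoding)}&sample_rate=${sampleRate}`;
}

console.log(buildAudioQuery("pcm_s16le", 16000));
// prints "?encoding=pcm_s16le&sample_rate=16000"
```

Typing `encoding` as a union lets the compiler reject misspelled values before any audio is sent.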

## Cheat sheet

When sending raw audio, the `encoding` and `sample_rate` you declare must match what your upstream source produces.
Use this table as a rule of thumb to get started:

| Encoding    | Bit depth        | Common sources                                                 | Typical sample rate (Hz) |
| ----------- | ---------------- | -------------------------------------------------------------- | ------------------------ |
| `pcm_s16le` | 16-bit int       | Voice agent platforms, WAV files, most audio capture libraries | 8000–48000               |
| `pcm_s32le` | 32-bit int       | Professional audio interfaces and DAWs                         | 44100–48000              |
| `pcm_f16le` | 16-bit float     | Uncommon; some half-precision ML pipelines                     | 16000–48000              |
| `pcm_f32le` | 32-bit float     | Browsers (Web Audio API), ML models (PyTorch, NumPy/SciPy)     | 16000–48000              |
| `pcm_mulaw` | 8-bit compressed | North American / Japanese telephony (G.711μ), Twilio           | 8000                     |
| `pcm_alaw`  | 8-bit compressed | European / international telephony (G.711A)                    | 8000                     |
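The bit depth column also determines how much data a stream carries. The sketch below works through that arithmetic for a mono stream; `BYTES_PER_SAMPLE` and `bytesPerSecond` are illustrative names, not part of any SDK.

```ts
type Encoding =
  | "pcm_s16le"
  | "pcm_s32le"
  | "pcm_f16le"
  | "pcm_f32le"
  | "pcm_mulaw"
  | "pcm_alaw";

// Bytes per sample, taken from the "Bit depth" column of the table above.
const BYTES_PER_SAMPLE: Record<Encoding, number> = {
  pcm_s16le: 2,
  pcm_s32le: 4,
  pcm_f16le: 2,
  pcm_f32le: 4,
  pcm_mulaw: 1,
  pcm_alaw: 1,
};

// Data rate of a mono stream in bytes per second.
function bytesPerSecond(encoding: Encoding, sampleRate: number): number {
  return BYTES_PER_SAMPLE[encoding] * sampleRate;
}

console.log(bytesPerSecond("pcm_mulaw", 8000)); // 8000
console.log(bytesPerSecond("pcm_s16le", 16000)); // 32000
console.log(bytesPerSecond("pcm_f32le", 48000)); // 192000
```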

### Telephony

#### North America and Japan

Many customers stream call audio from Twilio. Twilio transcodes all audio it carries to
8-bit μ-law with an 8 kHz sample rate, so declare exactly that:

```
?encoding=pcm_mulaw&sample_rate=8000
```

#### Europe, India, and others

The standard for European and international telephone networks (G.711A) is 8-bit A-law compressed PCM with an 8 kHz sample rate.

```
?encoding=pcm_alaw&sample_rate=8000
```

### Voice agent platforms

Many voice agent platforms use `pcm_s16le` at a 16 kHz sample rate in their pipeline. Check your platform's documentation to confirm.

```
?encoding=pcm_s16le&sample_rate=16000
```

### Web browsers

When capturing microphone audio through the [Web Audio API](https://developer.mozilla.org/en-US/docs/Web/API/Web_Audio_API), the samples are `pcm_f32le`. An `AudioContext`, and any `AudioWorklet` node you read frames from, always produces 32-bit float samples.

The capture sample rate defaults to whatever the user's input hardware reports, commonly 48 kHz but sometimes 44.1 kHz. Read it from `AudioContext.sampleRate` and send the same value:

```ts
const audioContext = new AudioContext();
console.log(audioContext.sampleRate); // e.g. 48000
```

```
?encoding=pcm_f32le&sample_rate=48000
```

Speech recognition gains little from rates above 16 kHz, so converting to `pcm_s16le` and downsampling to 16 kHz before you send cuts bandwidth about sixfold relative to `pcm_f32le` at 48 kHz (32 kB/s instead of 192 kB/s), with negligible impact on accuracy.
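One way to do that conversion is sketched below. It is a simplified illustration, not a reference implementation: `downsample` and `floatToInt16` are hypothetical helpers, the averaging step is only a crude low-pass filter, and the sketch assumes the capture rate is an integer multiple of the target rate (48 kHz to 16 kHz works, 44.1 kHz to 16 kHz would need a proper resampler).

```ts
// Average each group of `factor` samples into one output sample.
// Assumption: fromRate is an integer multiple of toRate (e.g. 48000 -> 16000).
function downsample(input: Float32Array, fromRate: number, toRate: number): Float32Array {
  const factor = fromRate / toRate;
  if (!Number.isInteger(factor) || factor < 1) {
    throw new RangeError(`${fromRate} Hz is not an integer multiple of ${toRate} Hz`);
  }
  const out = new Float32Array(Math.floor(input.length / factor));
  for (let i = 0; i < out.length; i++) {
    let sum = 0;
    for (let j = 0; j < factor; j++) sum += input[i * factor + j];
    out[i] = sum / factor;
  }
  return out;
}

// Clamp float samples to [-1, 1] and scale them to 16-bit signed integers.
function floatToInt16(input: Float32Array): Int16Array {
  const out = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i]));
    out[i] = Math.round(s < 0 ? s * 0x8000 : s * 0x7fff);
  }
  return out;
}

// Six samples standing in for a 48 kHz capture frame.
const frame = new Float32Array([1, 0, -1, 0.5, 0.5, 0.5]);
const pcm16 = floatToInt16(downsample(frame, 48000, 16000));
console.log(Array.from(pcm16)); // [0, 16384]
```

`Int16Array` stores values in the platform's byte order, which is little-endian on virtually all current hardware, so the bytes of `pcm16.buffer` would correspond to `?encoding=pcm_s16le&sample_rate=16000`.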

## Double check your parameters

The model decodes your bytes using the `encoding` and `sample_rate` you declared when opening the connection. If these parameters are incorrect, our server **might not return an error**, so a mismatch can go undetected.

You can validate your parameters by saving your audio data and playing it back with [ffplay](https://ffmpeg.org/ffplay.html):

```bash
# encoding=pcm_s16le
# sample_rate=16000
# 1 channel (the API expects mono)
ffplay -f s16le -ar 16000 -ac 1 audio.raw

# general format
ffplay -f <encoding_without_pcm_prefix> -ar <sample_rate> -ac <num_channels_must_be_one> <file_path>
```

If the playback sounds wrong (typically loud static for a mismatched encoding, or sped-up or slowed-down speech for a mismatched sample rate), your `encoding` or `sample_rate` doesn't match the data. Correct them until the audio plays back cleanly, then send those same values to the API.
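To keep the playback check in sync with what you send, you could derive the `ffplay` command from the same two values, following the general format shown above. `ffplayCommand` is a hypothetical helper; it does not quote file paths, and whether your `ffplay` build accepts a given format name is worth confirming with `ffplay -formats`.

```ts
// Hypothetical helper: builds the ffplay command for a raw mono file,
// following the general format above (encoding name without "pcm_").
function ffplayCommand(encoding: string, sampleRate: number, filePath: string): string {
  const format = encoding.replace(/^pcm_/, "");
  return `ffplay -f ${format} -ar ${sampleRate} -ac 1 ${filePath}`;
}

console.log(ffplayCommand("pcm_s16le", 16000, "audio.raw"));
// prints "ffplay -f s16le -ar 16000 -ac 1 audio.raw"
console.log(ffplayCommand("pcm_mulaw", 8000, "call.raw"));
// prints "ffplay -f mulaw -ar 8000 -ac 1 call.raw"
```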
