When sending raw (PCM) audio for speech-to-text transcription, specify the encoding and sample rate as query parameters, since they cannot be detected from the audio data itself. In general, you should match the encoding and sample rate to whatever your upstream pipeline already produces (microphone capture, telephony stream, ML model output) to avoid an extra resampling step. If your audio source is flexible, we recommend pcm_s16le at 16 kHz for streaming STT.
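
As a rough illustration of passing these settings as query parameters, the sketch below builds a connection URL with the recommended defaults. The endpoint `wss://example.invalid/stt` is a placeholder, and the parameter names `encoding` and `sample_rate` are assumptions made for this example; check the API reference for the real URL and names.

```python
from urllib.parse import urlencode

# Placeholder endpoint (assumption): substitute the real STT URL.
BASE_URL = "wss://example.invalid/stt"


def build_stt_url(encoding: str = "pcm_s16le", sample_rate: int = 16000) -> str:
    """Return the STT URL with the audio format attached as query parameters."""
    # The parameter names below are assumed for illustration only.
    query = urlencode({"encoding": encoding, "sample_rate": sample_rate})
    return f"{BASE_URL}?{query}"


print(build_stt_url())
# wss://example.invalid/stt?encoding=pcm_s16le&sample_rate=16000
```

Using `urlencode` rather than string concatenation keeps the values correctly escaped if other parameters are added later.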
Reference
The encoding of the input audio. Available options: pcm_s16le, pcm_s32le, pcm_f16le, pcm_f32le, pcm_mulaw, pcm_alaw.

The sample rate of the input audio in Hz. Must match the actual sample rate of the audio you send.
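
To see why the declared sample rate has to match the audio, it can help to compute how long a buffer appears to be under a given encoding and rate. The following is a minimal sketch, assuming mono audio by default; the helper name `duration_seconds` is hypothetical, and the bytes-per-sample values follow from the bit depths of the listed encodings.

```python
# Bytes per sample for each supported encoding, derived from its bit depth.
BYTES_PER_SAMPLE = {
    "pcm_s16le": 2,
    "pcm_s32le": 4,
    "pcm_f16le": 2,
    "pcm_f32le": 4,
    "pcm_mulaw": 1,
    "pcm_alaw": 1,
}


def duration_seconds(num_bytes: int, encoding: str, sample_rate: int,
                     channels: int = 1) -> float:
    """Duration implied by a raw buffer of num_bytes in the given format."""
    if encoding not in BYTES_PER_SAMPLE:
        raise ValueError(f"unsupported encoding: {encoding}")
    frame_size = BYTES_PER_SAMPLE[encoding] * channels
    return num_bytes / (frame_size * sample_rate)


# One second of 16 kHz pcm_s16le mono is 32000 bytes.
print(duration_seconds(32000, "pcm_s16le", 16000))  # 1.0
# Declaring 8000 Hz for the same bytes makes it look twice as long.
print(duration_seconds(32000, "pcm_s16le", 8000))   # 2.0
```

The second call shows the effect of a mismatch: the same bytes are interpreted as half-speed audio, which is the kind of distortion a wrong sample rate introduces.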
Raw (PCM) Audio
When sending raw audio, the encoding and sample rate must match what your upstream source produces.

| Encoding | Bit depth | Common sources | Pair with sample rate (Hz) |
|---|---|---|---|
| pcm_s16le | 16-bit int | Microphones, browsers (Web Audio API), most audio capture libraries | 16000–44100 |
| pcm_s32le | 32-bit int | Professional audio interfaces | 16000–48000 |
| pcm_f16le | 16-bit float | Half-precision ML pipelines | 16000–48000 |
| pcm_f32le | 32-bit float | ML models, Web Audio API AudioWorklet nodes, NumPy/SciPy | 16000–48000 |
| pcm_mulaw | 8-bit compressed | North American / Japanese telephony (G.711 μ-law), Twilio | 8000 |
| pcm_alaw | 8-bit compressed | European / international telephony (G.711 A-law) | 8000 |
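
If your source produces float samples but you want to send pcm_s16le, the samples need to be converted to 16-bit little-endian integers first. The sketch below uses only the Python standard library and assumes the input is a sequence of floats nominally in the range [-1.0, 1.0]; the function name `float_to_pcm_s16le` and the symmetric scale factor of 32767 are choices made for this example, not something the API prescribes.

```python
import struct
from typing import Iterable


def float_to_pcm_s16le(samples: Iterable[float]) -> bytes:
    """Convert float samples in [-1.0, 1.0] to 16-bit little-endian PCM bytes."""
    ints = []
    for sample in samples:
        # Clip out-of-range values so they saturate instead of wrapping.
        clipped = max(-1.0, min(1.0, sample))
        ints.append(int(round(clipped * 32767)))
    # "<" selects little-endian, "h" is a signed 16-bit integer.
    return struct.pack(f"<{len(ints)}h", *ints)


pcm = float_to_pcm_s16le([0.0, 1.0, -1.0])
print(len(pcm))  # 6 (three samples, two bytes each)
```

For large buffers a vectorized approach (for example with NumPy) would be faster; the loop here is kept simple for clarity.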