Our Text-to-Speech API includes many parameters that can be bewildering to developers who have not worked with audio before. In general, you should pick the highest precision and sample rate supported by every stage of your audio pipeline, including telephony and device outputs. A typical digital audio setup will perform well with these settings, which match the standard audio CD format:
output_format: {
	container: "raw",
	encoding: "pcm_s16le",
	sample_rate: 44100,
}
If you know that every stage of your pipeline supports a higher-precision encoding and a higher sample rate, the highest-quality settings are:
output_format: {
	container: "raw",
	encoding: "pcm_f32le",
	sample_rate: 48000,
}
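The rule of picking the highest sample rate that every stage supports can be sketched as a small helper. This is a minimal illustration, not part of the API: the function name, the stage names, and their limits are all hypothetical, and only the list of available sample rates comes from the reference on this page.

```python
# Sample rates listed as available options in the reference on this page.
AVAILABLE_SAMPLE_RATES = (8000, 16000, 22050, 24000, 44100, 48000)


def pick_sample_rate(stage_max_rates):
    """Return the highest available rate that every pipeline stage supports.

    stage_max_rates maps a (hypothetical) stage name to the highest sample
    rate, in Hz, that the stage can handle.
    """
    if not stage_max_rates:
        raise ValueError("at least one pipeline stage is required")
    # The slowest stage sets the ceiling for the whole pipeline.
    ceiling = min(stage_max_rates.values())
    candidates = [rate for rate in AVAILABLE_SAMPLE_RATES if rate <= ceiling]
    if not candidates:
        raise ValueError(f"no available sample rate at or below {ceiling} Hz")
    return max(candidates)


# Hypothetical pipeline: the stage names and limits are illustrative only.
pipeline = {"tts_output": 48000, "resampler": 48000, "headset": 16000}
print(pick_sample_rate(pipeline))  # 16000
```

Requesting a rate above the slowest stage's limit would only add a resampling step later in the pipeline, so the helper takes the minimum across stages first and then rounds down to an available option.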

Reference

output_format.container
string
The container format (if any) for the audio output. Available options: RAW, WAV, MP3. Only the Bytes endpoint supports all container formats; our streaming endpoints (SSE, WebSockets) support only RAW.
output_format.encoding
string
The encoding of the output audio. Available options: pcm_f32le, pcm_s16le, pcm_mulaw, pcm_alaw.
output_format.sample_rate
number
The sample rate of the output audio, in Hz. Remember that to represent a given signal, the sample rate must be at least twice the highest frequency component of the signal (Nyquist theorem). Available options: 8000, 16000, 22050, 24000, 44100, 48000.
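The constraints in this reference can be checked on the client before a request is sent. The sketch below is an assumption-laden illustration, not the API's own validation: the function names and the endpoint labels ("bytes", "sse", "websocket") are hypothetical, container names are written in lowercase as in the examples on this page, and encoding and sample rate are assumed to live under output_format as those examples show.

```python
CONTAINERS = {"raw", "wav", "mp3"}
ENCODINGS = {"pcm_f32le", "pcm_s16le", "pcm_mulaw", "pcm_alaw"}
SAMPLE_RATES = {8000, 16000, 22050, 24000, 44100, 48000}
# Hypothetical labels for the streaming endpoints, which support only raw.
STREAMING_ENDPOINTS = {"sse", "websocket"}


def validate_output_format(output_format, endpoint):
    """Return a list of problems found; an empty list means none were found."""
    errors = []
    container = output_format.get("container")
    if container not in CONTAINERS:
        errors.append(f"unknown container: {container!r}")
    elif endpoint in STREAMING_ENDPOINTS and container != "raw":
        errors.append(f"{endpoint} supports only the raw container")
    if output_format.get("encoding") not in ENCODINGS:
        errors.append(f"unknown encoding: {output_format.get('encoding')!r}")
    if output_format.get("sample_rate") not in SAMPLE_RATES:
        errors.append(f"unsupported sample rate: {output_format.get('sample_rate')!r}")
    return errors


def nyquist_limit_hz(sample_rate):
    """Highest signal frequency that the given sample rate can represent."""
    return sample_rate / 2


fmt = {"container": "wav", "encoding": "pcm_s16le", "sample_rate": 44100}
print(validate_output_format(fmt, "sse"))    # one error: raw only on streaming
print(validate_output_format(fmt, "bytes"))  # []
print(nyquist_limit_hz(8000))                # 4000.0
```

The Nyquist helper makes the trade-off concrete: at 8000 Hz nothing above 4 kHz can be represented, while 44100 Hz covers frequencies up to 22.05 kHz.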

Examples

Audio CD quality

Standard audio CDs are encoded as pcm_s16le at a 44.1 kHz sample rate:
output_format: {
	container: "raw",
	encoding: "pcm_s16le",
	sample_rate: 44100,
}
This performs well for consumer digital audio setups.
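To see what these settings cost in bandwidth, the data rate of raw PCM can be worked out from the encoding and sample rate. The sketch below assumes single-channel (mono) output by default, which this page does not state; the bytes-per-sample values follow from the encoding names (32-bit float, 16-bit integer, and 8-bit µ-law and A-law). The function name is hypothetical.

```python
# Bytes per sample implied by each encoding name.
BYTES_PER_SAMPLE = {
    "pcm_f32le": 4,  # 32-bit float, little-endian
    "pcm_s16le": 2,  # signed 16-bit integer, little-endian
    "pcm_mulaw": 1,  # 8-bit mu-law
    "pcm_alaw": 1,   # 8-bit A-law
}


def raw_bytes_per_second(encoding, sample_rate, channels=1):
    """Data rate of headerless PCM audio; channels=1 (mono) is an assumption."""
    return BYTES_PER_SAMPLE[encoding] * sample_rate * channels


print(raw_bytes_per_second("pcm_s16le", 44100))  # 88200
print(raw_bytes_per_second("pcm_f32le", 48000))  # 192000
print(raw_bytes_per_second("pcm_mulaw", 8000))   # 8000
```

For comparison, an audio CD carries two channels, so its data rate is twice the mono figure: raw_bytes_per_second("pcm_s16le", 44100, channels=2) gives 176400 bytes per second.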

Telephony

Many customers send their audio output over Twilio. Since all audio sent over Twilio is transcoded to µ-law encoding at an 8 kHz sample rate (to match the telephony standard), you should specify the following output_format:
output_format: {
	container: "raw",
	encoding: "pcm_mulaw",
	sample_rate: 8000,
}
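Once the audio is already µ-law at 8 kHz, it can be forwarded without transcoding. The sketch below splits raw µ-law bytes into 20 ms frames and wraps each one in a JSON message of the shape used by Twilio Media Streams (a "media" event with a base64 payload). Treat that message shape, the 20 ms frame size, and the stream identifier as assumptions to check against Twilio's documentation; the function names are hypothetical.

```python
import base64
import json


def mulaw_frames(audio, frame_ms=20, sample_rate=8000):
    """Split raw mu-law bytes (one byte per sample, mono) into fixed frames."""
    frame_bytes = sample_rate * frame_ms // 1000  # 160 bytes for 20 ms at 8 kHz
    for start in range(0, len(audio), frame_bytes):
        yield audio[start:start + frame_bytes]


def twilio_media_message(stream_sid, frame):
    """Build a JSON 'media' message; the shape is assumed from Twilio's docs."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(frame).decode("ascii")},
    })


# Placeholder audio: 400 bytes stands in for 50 ms of synthesized speech.
audio = b"\xff" * 400
frames = list(mulaw_frames(audio))
print([len(frame) for frame in frames])  # [160, 160, 80]
message = twilio_media_message("hypothetical-stream-sid", frames[0])
```

Because µ-law uses one byte per sample, the frame size in bytes equals the number of samples, which keeps the chunking arithmetic simple.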

Bluetooth headsets

If you happen to know that the user is using a Bluetooth headset (such as AirPods) for both microphone input and headphone output, the headset will be on the Bluetooth Hands-Free Profile (HFP), which limits the sample rate to 16 kHz. (In practice, it's difficult to programmatically determine the end user's microphone and speaker devices, so this example is a bit contrived.)
output_format: {
	container: "raw",
	encoding: "pcm_s16le",
	sample_rate: 16000,
}
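Since the output device usually cannot be detected reliably, one option is to let the caller pass an explicit hint and fall back to the CD-quality settings otherwise. The sketch below collects the three examples from this page into presets; the preset names and the function are hypothetical, and only the format values come from the examples above.

```python
# Presets taken from the examples on this page; the keys are hypothetical.
PRESETS = {
    "default": {"container": "raw", "encoding": "pcm_s16le", "sample_rate": 44100},
    "telephony": {"container": "raw", "encoding": "pcm_mulaw", "sample_rate": 8000},
    "bluetooth_hfp": {"container": "raw", "encoding": "pcm_s16le", "sample_rate": 16000},
}


def output_format_for(route_hint=None):
    """Return a copy of the preset for the hint, or the default if unknown."""
    return dict(PRESETS.get(route_hint, PRESETS["default"]))


print(output_format_for("bluetooth_hfp")["sample_rate"])  # 16000
print(output_format_for()["sample_rate"])                 # 44100
```

Returning a copy keeps callers from accidentally mutating the shared presets when they adjust a single field.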