Generate audio that smoothly connects two existing audio segments. This is useful for inserting new speech between existing speech segments while maintaining natural transitions.
The cost is 1 credit per character of the infill text plus a fixed cost of 300 credits.
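As a rough illustration of the pricing rule above, the following sketch estimates the credit cost of an infill request. The function name `infill_cost` is hypothetical, and counting characters with Python's `len` (Unicode code points) is an assumption; the billing system may count characters differently.

```python
# Pricing constants taken from the description above.
CREDITS_PER_CHARACTER = 1
FIXED_COST_CREDITS = 300


def infill_cost(infill_text: str) -> int:
    """Estimate the credits charged for one infill request.

    Assumption: a "character" is one Unicode code point, as counted by len().
    """
    return CREDITS_PER_CHARACTER * len(infill_text) + FIXED_COST_CREDITS


if __name__ == "__main__":
    # 5 characters of infill text plus the fixed cost.
    print(infill_cost("Hello"))
```

Under these assumptions, the fixed cost dominates for short infills, so a single longer infill costs fewer credits than several short ones covering the same text.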
At least one of left_audio or right_audio must be provided.
As with all generative models, there is some inherent variability, but here are some tips we recommend for getting the best results from infill:
API version header. Must be set to the API version, e.g. '2024-06-10'.
Allowed values: 2024-06-10, 2024-11-13, 2025-04-16
Example: "2024-06-10"
The ID of the model to use for generating audio. Any model other than the first "sonic" model is supported.
The language of the transcript
The infill text to generate
The ID of the voice to use for generating audio
The format of the output audio
Allowed values: raw, wav, mp3
The sample rate of the output audio in Hz. Supported sample rates are 8000, 16000, 22050, 24000, 44100, 48000.
Required for raw and wav containers.
Allowed values: pcm_f32le, pcm_s16le, pcm_mulaw, pcm_alaw
Required for mp3 containers.
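The output format constraints above can be checked on the client before sending a request. The sketch below is a minimal example under stated assumptions: the function name `check_output_format` and the argument names `container`, `sample_rate`, and `encoding` are hypothetical and are not taken from the API. The setting that is required only for mp3 containers is not checked, because its name is not given here.

```python
from typing import List, Optional

# Values listed in the parameter descriptions above.
SUPPORTED_CONTAINERS = {"raw", "wav", "mp3"}
SUPPORTED_SAMPLE_RATES = {8000, 16000, 22050, 24000, 44100, 48000}
SUPPORTED_ENCODINGS = {"pcm_f32le", "pcm_s16le", "pcm_mulaw", "pcm_alaw"}


def check_output_format(
    container: str,
    sample_rate: int,
    encoding: Optional[str] = None,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found.

    Argument names are hypothetical; they only mirror the descriptions above.
    """
    problems = []
    if container not in SUPPORTED_CONTAINERS:
        problems.append(f"unsupported container: {container!r}")
    if sample_rate not in SUPPORTED_SAMPLE_RATES:
        problems.append(f"unsupported sample rate: {sample_rate}")
    # An encoding is required for raw and wav containers.
    if container in {"raw", "wav"} and encoding is None:
        problems.append(f"an encoding is required for {container} containers")
    if encoding is not None and encoding not in SUPPORTED_ENCODINGS:
        problems.append(f"unsupported encoding: {encoding!r}")
    return problems


if __name__ == "__main__":
    print(check_output_format("wav", 44100, "pcm_s16le"))
    print(check_output_format("raw", 11025))
```

Returning a list of problems instead of raising on the first one lets a caller report every issue with a request at once.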
Either a number between -1.0 and 1.0 or a natural language description of speed.
If you specify a number, 0.0 is the default speed, -1.0 is the slowest speed, and 1.0 is the fastest speed.
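The speed rule above accepts either a number or a description, which is easy to get wrong on the client side. The following sketch shows one way to validate a value before sending it. The function name `check_speed` is hypothetical, and rejecting empty strings is an assumption; which natural language descriptions the service accepts is not specified here.

```python
from typing import Union


def check_speed(speed: Union[int, float, str]) -> Union[float, str]:
    """Validate a speed value: a number in [-1.0, 1.0] or a non-empty description.

    0.0 is the default speed, -1.0 the slowest, and 1.0 the fastest.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(speed, bool):
        raise TypeError("speed must be a number or a string, not a bool")
    if isinstance(speed, (int, float)):
        if not -1.0 <= speed <= 1.0:
            raise ValueError(f"numeric speed must be between -1.0 and 1.0, got {speed}")
        return float(speed)
    if isinstance(speed, str):
        # Assumption: an empty description is not meaningful.
        if not speed.strip():
            raise ValueError("speed description must not be empty")
        return speed
    raise TypeError(f"unsupported speed type: {type(speed).__name__}")


if __name__ == "__main__":
    print(check_speed(0))
    print(check_speed("slow"))
```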
An array of emotion:level tags.
Supported emotions are: anger, positivity, surprise, sadness, and curiosity.
Supported levels are: lowest, low, (omit), high, highest.
Allowed values: anger:lowest, anger:low, anger, anger:high, anger:highest, positivity:lowest, positivity:low, positivity, positivity:high, positivity:highest, surprise:lowest, surprise:low, surprise, surprise:high, surprise:highest, sadness:lowest, sadness:low, sadness, sadness:high, sadness:highest, curiosity:lowest, curiosity:low, curiosity, curiosity:high, curiosity:highest
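The tag format above follows a simple pattern: an emotion name, optionally followed by a colon and a level. The sketch below builds tags from that pattern and reproduces the full list of 25 allowed values. The helper name `emotion_tag` is hypothetical and is not part of the API.

```python
from typing import Optional

# Emotions and levels from the descriptions above; None stands for the omitted level.
EMOTIONS = ("anger", "positivity", "surprise", "sadness", "curiosity")
LEVELS = ("lowest", "low", None, "high", "highest")


def emotion_tag(emotion: str, level: Optional[str] = None) -> str:
    """Build an emotion:level tag, omitting the level part when level is None."""
    if emotion not in EMOTIONS:
        raise ValueError(f"unsupported emotion: {emotion!r}")
    if level not in LEVELS:
        raise ValueError(f"unsupported level: {level!r}")
    return emotion if level is None else f"{emotion}:{level}"


# Every allowed tag, in the same order as the list above.
ALL_TAGS = [emotion_tag(e, lvl) for e in EMOTIONS for lvl in LEVELS]

if __name__ == "__main__":
    # An array of tags, as the parameter expects.
    print([emotion_tag("positivity", "high"), emotion_tag("curiosity")])
```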