Generate audio that smoothly connects two existing audio segments. This is useful for inserting new speech between existing speech segments while maintaining natural transitions.
The cost is 1 credit per character of the infill text plus a fixed cost of 300 credits.
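As a rough illustration of the pricing rule above, the following sketch estimates the credits for a single infill request. The function and constant names are hypothetical, and it assumes a "character" corresponds to one Unicode code point as counted by Python's `len()`; the service may count characters differently.

```python
# Pricing figures taken from the text above; the names are illustrative only.
CREDITS_PER_CHARACTER = 1
FIXED_INFILL_CREDITS = 300


def estimate_infill_credits(infill_text: str) -> int:
    """Estimate the credits charged for one infill request.

    Assumption: each Unicode code point counts as one character.
    """
    return CREDITS_PER_CHARACTER * len(infill_text) + FIXED_INFILL_CREDITS


if __name__ == "__main__":
    # 13 characters of infill text plus the fixed 300-credit cost.
    print(estimate_infill_credits("Hello, world!"))
```

Under these assumptions, even an empty infill text would still incur the fixed 300-credit cost.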
Only the sonic-preview model is supported for infill at this time.
At least one of left_audio or right_audio must be provided.
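The two constraints above (supported model, at least one surrounding audio segment) can be checked on the client before sending a request. This is a minimal sketch, not part of any official SDK: the function name is hypothetical, and it assumes the audio segments are held in memory as bytes.

```python
from typing import Optional

# Per the text above, only this model supports infill at this time.
SUPPORTED_INFILL_MODELS = {"sonic-preview"}


def check_infill_request(
    model_id: str,
    left_audio: Optional[bytes] = None,
    right_audio: Optional[bytes] = None,
) -> None:
    """Raise ValueError if the request breaks a documented infill constraint."""
    if model_id not in SUPPORTED_INFILL_MODELS:
        raise ValueError(f"model {model_id!r} is not supported for infill")
    # Treat None and empty bytes alike as "not provided" (an assumption).
    if not left_audio and not right_audio:
        raise ValueError("at least one of left_audio or right_audio is required")


if __name__ == "__main__":
    # Only the audio before the insertion point is supplied here.
    check_infill_request("sonic-preview", left_audio=b"\x00\x01")
    print("request looks valid")
```

Failing early like this avoids spending a network round trip on a request the service would reject.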
As with all generative models, there is some inherent variability, but here are some tips we recommend to get the best results from infill:
The ID of the model to use for generating audio
The language of the transcript
The infill text to generate
The ID of the voice to use for generating audio
The format of the output audio
The sample rate of the output audio
Required for raw and wav containers.
Required for mp3 containers.
Either a number between -1.0 and 1.0 or a natural language description of speed.
If you specify a number, 0.0 is the default speed, -1.0 is the slowest speed, and 1.0 is the fastest speed.
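The speed rule above accepts two kinds of value, which a client might normalize before building a request. The helper below is a hypothetical sketch: it checks that numbers fall within -1.0 to 1.0 and passes natural language descriptions through, without assuming which descriptions the service understands.

```python
from typing import Union


def normalize_speed(speed: Union[int, float, str]) -> Union[float, str]:
    """Validate a speed value as described above.

    Numbers must lie in [-1.0, 1.0], where 0.0 is the default speed.
    Non-empty strings are passed through as natural language descriptions.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(speed, bool):
        raise TypeError("speed must be a number or a string")
    if isinstance(speed, (int, float)):
        # This comparison is also False for NaN, which is therefore rejected.
        if not -1.0 <= speed <= 1.0:
            raise ValueError("numeric speed must be between -1.0 and 1.0")
        return float(speed)
    if isinstance(speed, str) and speed.strip():
        return speed.strip()
    raise TypeError("speed must be a number or a non-empty string")


if __name__ == "__main__":
    print(normalize_speed(0.5))
    print(normalize_speed("a relaxed pace"))
```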
An array of emotion:level tags.
Supported emotions are: anger, positivity, surprise, sadness, and curiosity.
Supported levels are: lowest, low, (omit), high, highest.
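To illustrate the emotion:level format, the sketch below builds tags from the emotions and levels listed above. The helper name is hypothetical, and it assumes that "(omit)" means leaving off the level suffix entirely to request the middle intensity, so the tag is just the emotion name.

```python
from typing import Optional

# Values taken from the lists above.
SUPPORTED_EMOTIONS = ("anger", "positivity", "surprise", "sadness", "curiosity")
SUPPORTED_LEVELS = ("lowest", "low", "high", "highest")


def emotion_tag(emotion: str, level: Optional[str] = None) -> str:
    """Build one emotion tag, e.g. "anger:high".

    Assumption: passing no level corresponds to "(omit)" above and yields
    the bare emotion name with no ":level" suffix.
    """
    if emotion not in SUPPORTED_EMOTIONS:
        raise ValueError(f"unsupported emotion: {emotion!r}")
    if level is None:
        return emotion
    if level not in SUPPORTED_LEVELS:
        raise ValueError(f"unsupported level: {level!r}")
    return f"{emotion}:{level}"


if __name__ == "__main__":
    # An array of tags, as the emotion parameter expects.
    tags = [emotion_tag("positivity", "high"), emotion_tag("curiosity")]
    print(tags)
```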