Infill (Bytes)
Generate audio that smoothly connects two existing audio segments. This is useful for inserting new speech between existing speech segments while maintaining natural transitions.
The cost is 1 credit per character of the infill text plus a fixed cost of 300 credits.
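As a worked example of the pricing rule above, the sketch below estimates the credit cost of a request. The function and constant names are hypothetical, and it assumes one "character" corresponds to one Python string character; how the service actually counts characters may differ.

```python
# Hypothetical helper; the names are illustrative and not part of the API.
PER_CHARACTER_CREDITS = 1   # 1 credit per character of infill text
FIXED_INFILL_CREDITS = 300  # fixed cost added to every infill request


def estimate_infill_credits(transcript: str) -> int:
    """Estimate credits for one infill request.

    Assumes each Python string character counts as one billed character.
    """
    return PER_CHARACTER_CREDITS * len(transcript) + FIXED_INFILL_CREDITS


print(estimate_infill_credits("Hello, world!"))  # 13 characters + 300 fixed
```

Because of the fixed 300-credit component, a 13-character infill costs 313 credits under this estimate, so very short infills are dominated by the fixed cost.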
Infilling is only available on sonic-2 at this time.
At least one of left_audio or right_audio must be provided.
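A client can check this requirement before sending a request. The sketch below is a minimal pre-flight check, not part of the API: the function name is hypothetical, and it assumes that empty bytes should be treated the same as a missing segment.

```python
from typing import Optional


def require_context_audio(
    left_audio: Optional[bytes], right_audio: Optional[bytes]
) -> None:
    """Raise if neither surrounding audio segment is supplied.

    Assumption: an empty bytes object counts as "not provided".
    """
    if not left_audio and not right_audio:
        raise ValueError("Provide at least one of left_audio or right_audio")


# Only the left segment is supplied, which satisfies the requirement.
require_context_audio(left_audio=b"\x00\x01", right_audio=None)
```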
As with all generative models, there is some inherent variability. We recommend the following tips for getting the best results from infill:
- Use longer infill transcripts
  - This gives the model more flexibility to adapt to the rest of the audio
- Target natural pauses in the audio when deciding where to clip
  - This means you don’t need word-level timestamps to be as precise
- Clip right up to the start and end of the audio segment you want infilled, keeping as much silence in the left/right audio segments as possible
  - This helps the model generate more natural transitions
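The clipping advice above can be sketched as a small helper that splits a recording around the span to be replaced. This is an illustration under assumptions: the function name is hypothetical, the audio is taken to be a sequence of decoded samples, and the start and end times are assumed to come from your own timestamps or pause detection.

```python
from typing import Sequence, Tuple


def split_for_infill(
    samples: Sequence[float],
    sample_rate: int,
    replace_start_s: float,
    replace_end_s: float,
) -> Tuple[Sequence[float], Sequence[float]]:
    """Return (left, right) segments surrounding the span to be infilled.

    Cutting exactly at the start and end of the replaced speech leaves any
    surrounding silence in the left and right segments.
    """
    if not 0 <= replace_start_s <= replace_end_s:
        raise ValueError("Expected 0 <= replace_start_s <= replace_end_s")
    start = round(replace_start_s * sample_rate)
    end = round(replace_end_s * sample_rate)
    return samples[:start], samples[end:]


# Toy input: 2 seconds of "audio" at 10 samples per second.
audio = list(range(20))
left, right = split_for_infill(audio, sample_rate=10, replace_start_s=0.5, replace_end_s=1.0)
```

In this toy run, `left` holds the first 5 samples and `right` holds the last 10; the 5 samples in between are the span the infill would replace.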
Headers
Request
The ID of the model to use for generating audio
The language of the transcript
The infill text to generate
The ID of the voice to use for generating audio
The format of the output audio
The sample rate of the output audio
Required for raw and wav containers.
Required for mp3 containers.
Either a number between -1.0 and 1.0 or a natural language description of speed.
If you specify a number, 0.0 is the default speed, -1.0 is the slowest speed, and 1.0 is the fastest speed.
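The numeric range described above can be checked client-side. The following is a minimal sketch with a hypothetical function name; it validates only the numeric range and passes any non-empty string through unchanged, since the accepted natural-language descriptions are not listed here.

```python
from typing import Union


def validate_speed(speed: Union[int, float, str]) -> Union[float, str]:
    """Validate a speed value before sending it.

    Numbers must lie in [-1.0, 1.0] (0.0 is the default speed).
    Strings are passed through as-is; their allowed values are not checked.
    """
    if isinstance(speed, bool):
        raise TypeError("speed must be a number or a string, not a bool")
    if isinstance(speed, (int, float)):
        if not -1.0 <= speed <= 1.0:
            raise ValueError("numeric speed must be between -1.0 and 1.0")
        return float(speed)
    if isinstance(speed, str) and speed.strip():
        return speed
    raise ValueError("speed must be a number or a non-empty description")


print(validate_speed(0.5))
```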
An array of emotion:level tags.
Supported emotions are: anger, positivity, surprise, sadness, and curiosity.
Supported levels are: lowest, low, (omit), high, highest.
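The tag format above can be assembled with a small helper. This sketch uses hypothetical names and assumes that "(omit)" means the level is left off entirely, so the tag is just the bare emotion name.

```python
from typing import Optional

SUPPORTED_EMOTIONS = {"anger", "positivity", "surprise", "sadness", "curiosity"}
SUPPORTED_LEVELS = {"lowest", "low", "high", "highest"}


def emotion_tag(emotion: str, level: Optional[str] = None) -> str:
    """Build one emotion tag.

    Assumption: omitting the level yields the bare emotion name.
    """
    if emotion not in SUPPORTED_EMOTIONS:
        raise ValueError(f"Unsupported emotion: {emotion!r}")
    if level is None:
        return emotion
    if level not in SUPPORTED_LEVELS:
        raise ValueError(f"Unsupported level: {level!r}")
    return f"{emotion}:{level}"


# An array of tags, as the parameter description calls for.
tags = [emotion_tag("positivity", "high"), emotion_tag("curiosity")]
```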