Infill (Bytes)
Generate audio that smoothly connects two existing audio segments. This is useful for inserting new speech between existing speech segments while maintaining natural transitions.
The cost is 1 credit per character of the infill text plus a fixed cost of 300 credits.
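As a rough illustration of the pricing rule above, the following sketch estimates the credits for one infill request. The function name `infill_cost` is hypothetical, and it assumes that a "character" corresponds to one Unicode code point as counted by Python's `len()`; the service may count characters differently.

```python
FIXED_INFILL_COST = 300      # fixed credits per infill request (from the pricing note)
CREDITS_PER_CHARACTER = 1    # credits per character of infill text (from the pricing note)


def infill_cost(transcript: str) -> int:
    """Estimate the credit cost of infilling `transcript`.

    Assumption: each Unicode code point counts as one character.
    """
    return CREDITS_PER_CHARACTER * len(transcript) + FIXED_INFILL_COST


if __name__ == "__main__":
    print(infill_cost("Hello there, how are you?"))
```

Because of the fixed 300 credits, a single longer infill costs less than several short ones covering the same text.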
Infilling is only available on sonic-2 at this time.
At least one of left_audio or right_audio must be provided.
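A client can check this requirement before sending a request. The sketch below is only illustrative: `check_infill_audio` is a hypothetical helper, and treating empty bytes as "not provided" is an assumption rather than documented behavior.

```python
from typing import Optional


def check_infill_audio(left_audio: Optional[bytes], right_audio: Optional[bytes]) -> None:
    """Raise ValueError unless at least one audio segment is supplied.

    Assumption: None and empty bytes both count as "not provided".
    """
    if not left_audio and not right_audio:
        raise ValueError("Provide left_audio, right_audio, or both.")
```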
As with all generative models, there is some inherent variability, but here are some tips we recommend to get the best results from infill:
- Use longer infill transcripts
  - This gives the model more flexibility to adapt to the rest of the audio
- Target natural pauses in the audio when deciding where to clip
  - This means you don’t need word-level timestamps to be as precise
- Clip right up to the start and end of the audio segment you want infilled, keeping as much silence in the left/right audio segments as possible
  - This helps the model generate more natural transitions
Headers
Request
Required for raw and wav containers.
Required for mp3 containers.
Either a number between -1.0 and 1.0 or a natural language description of speed.
If you specify a number, 0.0 is the default speed, -1.0 is the slowest speed, and 1.0 is the fastest speed.
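The speed rule above can be checked on the client side before a request is built. This is a minimal sketch, not part of any official SDK: `validate_speed` is a hypothetical helper, and accepting any non-empty string as a natural language description is an assumption, since the set of accepted descriptions is not listed here.

```python
from typing import Union


def validate_speed(speed: Union[int, float, str]) -> Union[float, str]:
    """Return a normalized speed value or raise if it is out of range.

    Numbers must lie in [-1.0, 1.0]; 0.0 is the default speed.
    Assumption: any non-empty string is passed through as a description.
    """
    if isinstance(speed, bool):
        # bool is a subclass of int in Python; reject it explicitly.
        raise TypeError("speed must be a number or a string, not a bool")
    if isinstance(speed, (int, float)):
        if not -1.0 <= speed <= 1.0:
            raise ValueError("numeric speed must be between -1.0 and 1.0")
        return float(speed)
    if isinstance(speed, str) and speed.strip():
        return speed
    raise TypeError("speed must be a number or a non-empty string")
```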
An array of emotion:level tags.
Supported emotions are: anger, positivity, surprise, sadness, and curiosity.
Supported levels are: lowest, low, (omit), high, highest.
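To make the tag format concrete, here is a small sketch that builds tags from the emotions and levels listed above. The helper `emotion_tag` is hypothetical, and it assumes that "(omit)" means the tag is written as the bare emotion name with no `:level` suffix.

```python
from typing import Optional

SUPPORTED_EMOTIONS = {"anger", "positivity", "surprise", "sadness", "curiosity"}
SUPPORTED_LEVELS = {"lowest", "low", "high", "highest"}


def emotion_tag(emotion: str, level: Optional[str] = None) -> str:
    """Build one emotion tag such as "anger:high".

    Assumption: omitting the level yields the bare emotion name.
    """
    if emotion not in SUPPORTED_EMOTIONS:
        raise ValueError(f"unsupported emotion: {emotion!r}")
    if level is None:
        return emotion
    if level not in SUPPORTED_LEVELS:
        raise ValueError(f"unsupported level: {level!r}")
    return f"{emotion}:{level}"


# Example array of tags built with the helper.
tags = [emotion_tag("positivity", "high"), emotion_tag("curiosity")]
```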