Infill (Bytes)
Generate audio that smoothly connects two existing audio segments. This is useful for inserting new speech between existing speech segments while maintaining natural transitions.
The cost is 1 credit per character of the infill text plus a fixed cost of 300 credits.
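As a rough illustration of the pricing rule above, the sketch below estimates the cost of an infill request. The function name `infill_cost` is hypothetical. Counting characters with Python's `len()` is also an assumption, since the way the service counts characters (for example whitespace or multi-byte characters) is not specified here.

```python
# Hypothetical helper: estimates infill cost from the stated pricing rule.
FIXED_COST_CREDITS = 300      # fixed cost per infill request
CREDITS_PER_CHARACTER = 1     # cost per character of the infill text


def infill_cost(transcript: str) -> int:
    """Return the estimated credit cost for an infill transcript.

    Assumption: one "character" equals one Python string element (len()).
    """
    return FIXED_COST_CREDITS + CREDITS_PER_CHARACTER * len(transcript)


if __name__ == "__main__":
    print(infill_cost("and then we went home"))
```

Under this assumption, longer transcripts raise the cost linearly, while the 300-credit fixed part dominates for short ones.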
Infilling is only available on sonic-2 at this time.
At least one of left_audio or right_audio must be provided.
As with all generative models, there is some inherent variability, but here are some tips we recommend to get the best results from infill:
- Use longer infill transcripts
  - This gives the model more flexibility to adapt to the rest of the audio
- Target natural pauses in the audio when deciding where to clip
  - This means you don’t need word-level timestamps to be as precise
- Clip right up to the start and end of the audio segment you want infilled, keeping as much silence in the left/right audio segments as possible
  - This helps the model generate more natural transitions
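The clipping advice above can be sketched with Python's standard `wave` module. This is a minimal sketch, not part of the API: the function name `split_wav_around` is hypothetical, and it assumes the source audio is an uncompressed PCM WAV file and that you already know the start and end times (in seconds) of the span to replace.

```python
import io
import wave
from typing import Tuple


def split_wav_around(wav_bytes: bytes, start_s: float, end_s: float) -> Tuple[bytes, bytes]:
    """Split a WAV file into the audio before start_s and the audio after end_s.

    Hypothetical helper. The span between start_s and end_s is dropped;
    everything else, including surrounding silence, is kept.
    """
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        # Clamp the cut points to the valid frame range.
        start_frame = max(0, min(total, round(start_s * rate)))
        end_frame = max(start_frame, min(total, round(end_s * rate)))
        left_frames = src.readframes(start_frame)
        src.setpos(end_frame)
        right_frames = src.readframes(total - end_frame)

    def to_wav(frames: bytes) -> bytes:
        # Re-wrap raw frames in a WAV header matching the source format.
        buf = io.BytesIO()
        with wave.open(buf, "wb") as dst:
            dst.setnchannels(params.nchannels)
            dst.setsampwidth(params.sampwidth)
            dst.setframerate(params.framerate)
            dst.writeframes(frames)
        return buf.getvalue()

    return to_wav(left_frames), to_wav(right_frames)
```

The two returned byte strings could then serve as left_audio and right_audio. Choosing `start_s` and `end_s` inside natural pauses keeps the silence on both sides, in line with the tips above.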
Headers
Authorization
Bearer authentication of the form Bearer <token>, where token is your auth token.
Cartesia-Version
Request
This endpoint expects a multipart form with multiple files.
left_audio
right_audio
model_id
The ID of the model to use for generating audio
language
The language of the transcript
transcript
The infill text to generate
voice_id
The ID of the voice to use for generating audio
output_format[container]
The format of the output audio
Allowed values:
output_format[sample_rate]
The sample rate of the output audio
output_format[encoding]
Required for raw and wav containers.
Allowed values:
output_format[bit_rate]
Required for mp3 containers.
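To show how the fields above fit together, here is a minimal sketch that assembles the multipart form parts and checks the rules stated on this page (at least one audio file; encoding for raw and wav; bit_rate for mp3). The function name `build_infill_form` is hypothetical. The default values `"en"`, `44100`, and `"pcm_s16le"` are assumptions for illustration only, since the allowed values are not listed here.

```python
from typing import Dict, Optional, Tuple


def build_infill_form(
    transcript: str,
    voice_id: str,
    left_audio: Optional[bytes] = None,
    right_audio: Optional[bytes] = None,
    model_id: str = "sonic-2",
    language: str = "en",                    # assumed value
    container: str = "wav",
    sample_rate: int = 44100,                # assumed value
    encoding: Optional[str] = "pcm_s16le",   # assumed value
    bit_rate: Optional[int] = None,
) -> Tuple[Dict[str, str], Dict[str, bytes]]:
    """Return (text fields, file fields) for an infill request. Hypothetical helper."""
    if left_audio is None and right_audio is None:
        raise ValueError("At least one of left_audio or right_audio must be provided")

    data = {
        "model_id": model_id,
        "language": language,
        "transcript": transcript,
        "voice_id": voice_id,
        "output_format[container]": container,
        "output_format[sample_rate]": str(sample_rate),
    }
    if container in ("raw", "wav"):
        if encoding is None:
            raise ValueError("encoding is required for raw and wav containers")
        data["output_format[encoding]"] = encoding
    elif container == "mp3":
        if bit_rate is None:
            raise ValueError("bit_rate is required for mp3 containers")
        data["output_format[bit_rate]"] = str(bit_rate)

    # Only include the audio parts that were actually supplied.
    files = {}
    if left_audio is not None:
        files["left_audio"] = left_audio
    if right_audio is not None:
        files["right_audio"] = right_audio
    return data, files
```

With an HTTP client such as `requests`, the returned pair could be passed as the `data=` and `files=` arguments of `requests.post`, alongside the Authorization and Cartesia-Version headers. The endpoint URL and the version value are not given on this page, so they are left out of the sketch.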
Response
This endpoint returns a file.