Infill (Bytes)

Generate audio that smoothly connects two existing audio segments. This is useful for inserting new speech between existing speech segments while maintaining natural transitions.

The cost is 1 credit per character of the infill text plus a fixed cost of 300 credits.
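
As a worked example of the pricing rule above, the following minimal sketch estimates the credits for one request. The function name is hypothetical, and it assumes a "character" is one Unicode code point (what Python's len() counts); the source does not say how characters are counted for billing.

```python
FIXED_COST_CREDITS = 300      # fixed cost per infill request
CREDITS_PER_CHARACTER = 1     # cost per character of the infill transcript


def infill_cost(transcript: str) -> int:
    """Estimate the credits charged for one infill request.

    Assumption: len() (Unicode code points) matches how characters are
    counted for billing; the actual counting rule may differ.
    """
    return CREDITS_PER_CHARACTER * len(transcript) + FIXED_COST_CREDITS


# "and then we walked home" has 23 characters, so 23 + 300 = 323 credits.
print(infill_cost("and then we walked home"))
```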

Infilling is only available on sonic-2 at this time.

At least one of left_audio or right_audio must be provided.

As with all generative models, there is some inherent variability. We recommend the following tips to get the best results from infill:

  • Use longer infill transcripts
    • This gives the model more flexibility to adapt to the rest of the audio
  • Target natural pauses in the audio when deciding where to clip
    • This means you don’t need word-level timestamps to be as precise
  • Clip right up to the start and end of the audio segment you want infilled, keeping as much silence in the left/right audio segments as possible
    • This helps the model generate more natural transitions
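
The clipping tips above can be sketched with Python's standard wave module. This is only an illustration, not part of the API: the helper name split_for_infill is hypothetical, it assumes the source recording is an uncompressed PCM WAV file, and it assumes you already know where the span to replace starts and ends (ideally at natural pauses).

```python
import io
import wave


def split_for_infill(wav_bytes: bytes, start_s: float, end_s: float) -> tuple[bytes, bytes]:
    """Cut a PCM WAV file into the audio before start_s and the audio after end_s.

    The span between start_s and end_s is the part to be replaced by the
    infill. Everything outside it, including any silence next to the cut
    points, stays in the left/right segments.
    """
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())

    frame_size = params.sampwidth * params.nchannels  # bytes per frame
    total = len(frames) // frame_size
    # Convert seconds to frame indices and clamp them to the file length.
    start = min(max(round(start_s * params.framerate), 0), total)
    end = min(max(round(end_s * params.framerate), start), total)

    def to_wav(chunk: bytes) -> bytes:
        # Re-wrap raw frames in a WAV header matching the source file.
        buf = io.BytesIO()
        with wave.open(buf, "wb") as dst:
            dst.setnchannels(params.nchannels)
            dst.setsampwidth(params.sampwidth)
            dst.setframerate(params.framerate)
            dst.writeframes(chunk)
        return buf.getvalue()

    left = to_wav(frames[: start * frame_size])
    right = to_wav(frames[end * frame_size:])
    return left, right
```

The two returned byte strings could then be sent as left_audio and right_audio.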

Headers

Authorization string Required

Bearer authentication of the form Bearer <token>, where token is your auth token.

Cartesia-Version "2025-04-16" Required

Request

This endpoint expects a multipart form with multiple files.

left_audio file Required

right_audio file Required

model_id string Required

The ID of the model to use for generating audio

language string Required

The language of the transcript

transcript string Required

The infill text to generate

voice_id string Required

The ID of the voice to use for generating audio

output_format[container] enum Required

The format of the output audio

output_format[sample_rate] integer Required

The sample rate of the output audio

output_format[encoding] enum Optional

Required for raw and wav containers.

output_format[bit_rate] integer Optional

Required for mp3 containers.
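
The conditional requirements on the output_format fields can be checked on the client before sending a request. The sketch below is a hypothetical helper, not part of any SDK. It enforces only the two rules stated above and does not validate enum values, since the allowed values are not listed in this section.

```python
def output_format_fields(container, sample_rate, encoding=None, bit_rate=None):
    """Build the output_format[...] form fields, enforcing the stated rules.

    - encoding is required for raw and wav containers
    - bit_rate is required for mp3 containers
    """
    if container in ("raw", "wav") and encoding is None:
        raise ValueError(f"encoding is required for the {container} container")
    if container == "mp3" and bit_rate is None:
        raise ValueError("bit_rate is required for the mp3 container")

    # Multipart form values are sent as text, so integers are stringified.
    fields = {
        "output_format[container]": container,
        "output_format[sample_rate]": str(sample_rate),
    }
    if encoding is not None:
        fields["output_format[encoding]"] = encoding
    if bit_rate is not None:
        fields["output_format[bit_rate]"] = str(bit_rate)
    return fields


# The sample rate and bit rate here are placeholder values, not recommendations.
print(output_format_fields("mp3", 44100, bit_rate=128000))
```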

Response

This endpoint returns a file.
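
Putting the headers, form fields, and file response together, a request might look like the following minimal sketch. Several details are assumptions not stated in this section: the endpoint URL, the "en" language code, the sample rate and bit rate values, and the upload file names and content types. It uses the third-party requests library, and the helper names are hypothetical.

```python
from pathlib import Path

# Assumption: the endpoint URL is not given in this section.
INFILL_URL = "https://api.cartesia.ai/infill/bytes"


def build_infill_request(token, transcript, voice_id,
                         left_audio=None, right_audio=None,
                         language="en", model_id="sonic-2"):
    """Assemble keyword arguments suitable for requests.post()."""
    if left_audio is None and right_audio is None:
        raise ValueError("at least one of left_audio or right_audio must be provided")

    headers = {
        "Authorization": f"Bearer {token}",
        "Cartesia-Version": "2025-04-16",
    }
    data = {
        "model_id": model_id,
        "language": language,          # assumed language code format
        "transcript": transcript,
        "voice_id": voice_id,
        "output_format[container]": "mp3",
        "output_format[sample_rate]": "44100",   # placeholder value
        "output_format[bit_rate]": "128000",     # required for mp3; placeholder value
    }
    # Each file entry is (filename, content bytes, content type).
    files = {}
    if left_audio is not None:
        files["left_audio"] = ("left.wav", left_audio, "audio/wav")
    if right_audio is not None:
        files["right_audio"] = ("right.wav", right_audio, "audio/wav")

    return {"url": INFILL_URL, "headers": headers, "data": data, "files": files}


def send_infill(request_kwargs, out_path):
    """POST the request and write the returned audio file to out_path."""
    import requests  # third-party: pip install requests

    response = requests.post(timeout=60, **request_kwargs)
    response.raise_for_status()
    path = Path(out_path)
    path.write_bytes(response.content)
    return path
```

A call such as send_infill(build_infill_request(token, "text to insert", voice_id, left_audio=left_bytes, right_audio=right_bytes), "infill.mp3") would then save the generated audio locally.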