Infill (Bytes)

Generate audio that smoothly connects two existing audio segments. This is useful for inserting new speech between existing speech segments while maintaining natural transitions.

The cost is 1 credit per character of the infill text plus a fixed cost of 300 credits.
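As a rough illustration of the pricing rule above, the credit cost can be estimated from the transcript length. This is a minimal sketch: the function and constant names are hypothetical, and it assumes "character" means one Python string character (Unicode code point), which the text above does not specify.

```python
FIXED_COST_CREDITS = 300   # fixed cost per infill request, from the pricing rule
CREDITS_PER_CHARACTER = 1  # cost per character of infill text


def infill_cost(transcript: str) -> int:
    """Estimate the credits charged for one infill request.

    Assumption: characters are counted as Python string characters.
    """
    return CREDITS_PER_CHARACTER * len(transcript) + FIXED_COST_CREDITS


print(infill_cost("Hello, world!"))  # 13 characters + 300 fixed = 313
```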

Infilling is only available on sonic-2 at this time.

At least one of left_audio or right_audio must be provided.

As with all generative models, there is some inherent variability. We recommend the following tips for getting the best results from infill:

  • Use longer infill transcripts
    • This gives the model more flexibility to adapt to the rest of the audio
  • Target natural pauses in the audio when deciding where to clip
    • This means you don’t need word-level timestamps to be as precise
  • Clip right up to the start and end of the audio segment you want infilled, keeping as much silence in the left/right audio segments as possible
    • This helps the model generate more natural transitions
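The clipping step in the tips above could be sketched as follows, assuming the source audio is an uncompressed PCM WAV file and that you already know roughly where the region to replace starts and ends. The helper name `split_wav` is hypothetical; it uses only Python's standard `wave` module and cuts exactly at the given timestamps, so any silence next to the cut stays in the left and right segments.

```python
import io
import wave


def split_wav(wav_bytes: bytes, start_s: float, end_s: float) -> tuple[bytes, bytes]:
    """Return (left, right) WAV files surrounding the region [start_s, end_s)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        # Convert timestamps to frame indices, clamped to the file length.
        start = max(0, min(total, int(start_s * rate)))
        end = max(start, min(total, int(end_s * rate)))
        left_frames = src.readframes(start)         # everything before the gap
        src.setpos(end)                             # skip the region to infill
        right_frames = src.readframes(total - end)  # everything after the gap

    def to_wav(frames: bytes) -> bytes:
        # Re-wrap raw frames in a WAV header matching the source format.
        buf = io.BytesIO()
        with wave.open(buf, "wb") as dst:
            dst.setnchannels(params.nchannels)
            dst.setsampwidth(params.sampwidth)
            dst.setframerate(params.framerate)
            dst.writeframes(frames)
        return buf.getvalue()

    return to_wav(left_frames), to_wav(right_frames)
```

Choosing `start_s` and `end_s` inside natural pauses means small timestamp errors fall in silence rather than in the middle of a word.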

Headers

X-API-Key (string, Required)
Cartesia-Version ("2024-11-13", Required)

Request

This endpoint expects a multipart form with multiple files.
left_audio (file, Required)
right_audio (file, Required)
model_id (string, Required)

The ID of the model to use for generating audio

language (string, Required)

The language of the transcript

transcript (string, Required)

The infill text to generate

voice_id (string, Required)

The ID of the voice to use for generating audio

output_format[container] (enum, Required)

The format of the output audio

Allowed values:

output_format[sample_rate] (integer, Required)

The sample rate of the output audio

output_format[encoding] (enum, Optional)

Required for raw and wav containers.

Allowed values:

output_format[bit_rate] (integer, Optional)

Required for mp3 containers.
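The conditional requirements on the output_format fields can be checked on the client before sending anything. The sketch below uses a hypothetical helper, `output_format_fields`, that returns the form fields as a dictionary. The container names raw, wav, and mp3 come from the descriptions above; the encoding value in the usage line (`pcm_s16le`) is an assumption, since the allowed values are not listed here.

```python
from typing import Optional


def output_format_fields(
    container: str,
    sample_rate: int,
    encoding: Optional[str] = None,
    bit_rate: Optional[int] = None,
) -> dict[str, str]:
    """Build the output_format[...] form fields, enforcing the documented rules."""
    if container in ("raw", "wav") and encoding is None:
        raise ValueError("encoding is required for raw and wav containers")
    if container == "mp3" and bit_rate is None:
        raise ValueError("bit_rate is required for mp3 containers")

    fields = {
        "output_format[container]": container,
        "output_format[sample_rate]": str(sample_rate),
    }
    if encoding is not None:
        fields["output_format[encoding]"] = encoding
    if bit_rate is not None:
        fields["output_format[bit_rate]"] = str(bit_rate)
    return fields


# "pcm_s16le" is an assumed encoding name; check the allowed values for your account.
print(output_format_fields("wav", 44100, encoding="pcm_s16le"))
```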

voice[__experimental_controls][speed] (double or enum, Optional)

Either a number between -1.0 and 1.0 or a natural language description of speed.

If you specify a number, 0.0 is the default speed, -1.0 is the slowest speed, and 1.0 is the fastest speed.

voice[__experimental_controls][emotion][] (list of enums, Optional)

An array of emotion:level tags.

Supported emotions are: anger, positivity, surprise, sadness, and curiosity.

Supported levels are: lowest, low, (omit), high, highest.
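To make the emotion:level tag format and the numeric speed range concrete, here is a small sketch with hypothetical helpers `emotion_tag` and `check_speed`. It assumes that "(omit)" in the level list means the tag is just the emotion name with no `:level` suffix; that reading is not confirmed by the text above.

```python
from typing import Optional

EMOTIONS = {"anger", "positivity", "surprise", "sadness", "curiosity"}
LEVELS = {"lowest", "low", "high", "highest"}


def emotion_tag(emotion: str, level: Optional[str] = None) -> str:
    """Format one emotion tag; level=None yields the bare emotion name (assumed)."""
    if emotion not in EMOTIONS:
        raise ValueError(f"unsupported emotion: {emotion}")
    if level is None:
        return emotion
    if level not in LEVELS:
        raise ValueError(f"unsupported level: {level}")
    return f"{emotion}:{level}"


def check_speed(speed: float) -> float:
    """Validate a numeric speed: -1.0 is slowest, 0.0 is default, 1.0 is fastest."""
    if not -1.0 <= speed <= 1.0:
        raise ValueError("speed must be between -1.0 and 1.0")
    return speed


tags = [emotion_tag("positivity", "high"), emotion_tag("curiosity")]
print(tags)  # ['positivity:high', 'curiosity']
```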

Response

This endpoint returns a file.
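Putting the headers and form fields together, a request could be assembled as in the sketch below, which uses only the Python standard library. Several things here are assumptions: the endpoint URL is not stated in this section, so `INFILL_URL` is a placeholder to confirm against the API reference, and `build_infill_request` and `fetch_infill` are hypothetical helper names. Form fields are passed as a list of pairs so that repeated fields such as `voice[__experimental_controls][emotion][]` can appear more than once.

```python
import urllib.request
import uuid

# Assumption: the endpoint URL is not given in this section; confirm before use.
INFILL_URL = "https://api.cartesia.ai/infill/bytes"


def build_infill_request(
    api_key: str,
    fields: list[tuple[str, str]],
    files: dict[str, tuple[str, bytes]],
    url: str = INFILL_URL,
) -> urllib.request.Request:
    """Encode fields and files as multipart/form-data and return an unsent request."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields:
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode("utf-8")
        )
    for name, (filename, data) in files.items():
        header = (
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"; '
            f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'
        ).encode("utf-8")
        parts.append(header + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))

    headers = {
        "X-API-Key": api_key,
        "Cartesia-Version": "2024-11-13",
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return urllib.request.Request(url, data=b"".join(parts), headers=headers, method="POST")


def fetch_infill(request: urllib.request.Request) -> bytes:
    """Send the request and return the audio file bytes from the response."""
    with urllib.request.urlopen(request) as response:
        return response.read()
```

A caller would pass at least one of `left_audio` or `right_audio` in `files`, for example `{"left_audio": ("left.wav", left_bytes)}`, then write the bytes returned by `fetch_infill` to a file whose extension matches the chosen container.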