Infill (Bytes)

POST

Generate audio that smoothly connects two existing audio segments. This is useful for inserting new speech between existing speech segments while maintaining natural transitions.

The cost is 1 credit per character of the infill text plus a fixed cost of 300 credits.
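As a quick illustration of the pricing rule above, the following minimal sketch estimates the credit cost of a request. The function name and constants are hypothetical, and counting characters with `len()` is an assumption; the service may count characters differently (for example, after text normalization).

```python
# Pricing constants taken from the sentence above.
INFILL_FIXED_CREDITS = 300
CREDITS_PER_CHARACTER = 1


def infill_cost_credits(transcript: str) -> int:
    """Estimate the credit cost of one infill request.

    Assumption: one "character" equals one Python code point,
    which is what len() counts.
    """
    return CREDITS_PER_CHARACTER * len(transcript) + INFILL_FIXED_CREDITS


print(infill_cost_credits("hello"))
```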

Only the sonic-preview model is supported for infill at this time.

At least one of left_audio or right_audio must be provided.

As with all generative models, there is some inherent variability, but we recommend the following tips to get the best results from infill:

  • Use longer infill transcripts
    • This gives the model more flexibility to adapt to the rest of the audio
  • Target natural pauses in the audio when deciding where to clip
    • This means you don’t need word-level timestamps to be as precise
  • Clip right up to the start and end of the audio segment you want infilled, keeping as much silence in the left/right audio segments as possible
    • This helps the model generate more natural transitions

Headers

Auth
X-API-Key (string, Required)
Cartesia-Version ("2024-06-10", Required)

Request

This endpoint expects a multipart form with multiple files.
left_audio (file, Required)
right_audio (file, Required)
model_id (string, Required)

The ID of the model to use for generating audio

language (string, Required)

The language of the transcript

transcript (string, Required)

The infill text to generate

voice_id (string, Required)

The ID of the voice to use for generating audio

output_format[container] (enum, Required)

The format of the output audio

Allowed values: raw, wav, mp3

output_format[sample_rate] (integer, Required)

The sample rate of the output audio

output_format[encoding] (enum, Optional)

Required for raw and wav containers.

Allowed values: pcm_f32le, pcm_s16le, pcm_mulaw, pcm_alaw

output_format[bit_rate] (integer, Optional)

Required for mp3 containers.
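The conditional requirements on the output_format fields can be checked before a request is sent. The sketch below is one possible client-side check; the function name is hypothetical, and sending integers as strings in the form data is an assumption about how the multipart fields are serialized.

```python
from typing import Optional

CONTAINERS = {"raw", "wav", "mp3"}
ENCODINGS = {"pcm_f32le", "pcm_s16le", "pcm_mulaw", "pcm_alaw"}


def output_format_fields(
    container: str,
    sample_rate: int,
    encoding: Optional[str] = None,
    bit_rate: Optional[int] = None,
) -> dict[str, str]:
    """Build the output_format[...] form fields, enforcing the documented rules."""
    if container not in CONTAINERS:
        raise ValueError(f"container must be one of {sorted(CONTAINERS)}")

    fields = {
        "output_format[container]": container,
        "output_format[sample_rate]": str(sample_rate),
    }
    if container in ("raw", "wav"):
        # encoding is required for raw and wav containers.
        if encoding not in ENCODINGS:
            raise ValueError(f"encoding must be one of {sorted(ENCODINGS)}")
        fields["output_format[encoding]"] = encoding
    else:
        # bit_rate is required for mp3 containers.
        if bit_rate is None:
            raise ValueError("bit_rate is required for mp3 containers")
        fields["output_format[bit_rate]"] = str(bit_rate)
    return fields
```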

voice[__experimental_controls][speed] (double or enum, Optional)

Either a number between -1.0 and 1.0 or a natural language description of speed.

If you specify a number, 0.0 is the default speed, -1.0 is the slowest speed, and 1.0 is the fastest speed.

voice[__experimental_controls][emotion][] (list of enums, Optional)

An array of emotion:level tags.

Supported emotions are: anger, positivity, surprise, sadness, and curiosity.

Supported levels are: lowest, low, (omit), high, highest.
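To make the emotion:level tag format concrete, here is a small sketch that builds and validates tags from the lists above. The helper name is hypothetical, and it assumes that "(omit)" means the tag is just the bare emotion name with no colon or level.

```python
from typing import Optional

EMOTIONS = {"anger", "positivity", "surprise", "sadness", "curiosity"}
LEVELS = {"lowest", "low", "high", "highest"}


def emotion_tag(emotion: str, level: Optional[str] = None) -> str:
    """Return an emotion tag such as "anger:high".

    Assumption: when level is omitted, the tag is the bare emotion name.
    """
    if emotion not in EMOTIONS:
        raise ValueError(f"unsupported emotion: {emotion!r}")
    if level is None:
        return emotion
    if level not in LEVELS:
        raise ValueError(f"unsupported level: {level!r}")
    return f"{emotion}:{level}"


# A list like this would be sent as repeated
# voice[__experimental_controls][emotion][] form fields.
tags = [emotion_tag("positivity", "high"), emotion_tag("curiosity")]
```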

Response

This endpoint returns a file.
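Putting the headers and form fields together, a request might be assembled as in the sketch below. This is an illustration only: the function names and the file names `left.wav` and `right.wav` are hypothetical, the endpoint URL is not given in this section and is therefore left as a parameter, and the sending step assumes the third-party `requests` library is installed.

```python
from typing import Optional


def build_infill_request(
    api_key: str,
    *,
    model_id: str,
    language: str,
    transcript: str,
    voice_id: str,
    output_format: dict[str, str],
    left_audio: Optional[bytes] = None,
    right_audio: Optional[bytes] = None,
) -> tuple[dict, dict, dict]:
    """Return (headers, data, files) for a multipart infill request."""
    # At least one of left_audio or right_audio must be provided.
    if left_audio is None and right_audio is None:
        raise ValueError("provide left_audio, right_audio, or both")

    headers = {"X-API-Key": api_key, "Cartesia-Version": "2024-06-10"}
    data = {
        "model_id": model_id,
        "language": language,
        "transcript": transcript,
        "voice_id": voice_id,
        **output_format,  # e.g. {"output_format[container]": "mp3", ...}
    }
    files = {}
    if left_audio is not None:
        files["left_audio"] = ("left.wav", left_audio)  # file name is arbitrary
    if right_audio is not None:
        files["right_audio"] = ("right.wav", right_audio)
    return headers, data, files


def send_infill(url: str, headers: dict, data: dict, files: dict) -> bytes:
    """POST the request and return the response body (the generated audio file)."""
    import requests  # third-party; imported lazily so the builder works without it

    response = requests.post(url, headers=headers, data=data, files=files)
    response.raise_for_status()
    return response.content
```

Keeping request construction separate from sending makes the field names easy to inspect or log before any credits are spent.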
File download: Base64 string or null