Infill (Bytes)

Generate audio that smoothly connects two existing audio segments. This is useful for inserting new speech between existing speech segments while maintaining natural transitions. **The cost is 1 credit per character of the infill text plus a fixed cost of 300 credits.** Infilling is only available on `sonic-2` at this time. At least one of `left_audio` or `right_audio` must be provided.

As with all generative models, there is some inherent variability, but here are some tips we recommend for getting the best results from infill:

- Use longer infill transcripts. This gives the model more flexibility to adapt to the rest of the audio.
- Target natural pauses in the audio when deciding where to clip. This means you don't need word-level timestamps to be as precise.
- Clip right up to the start and end of the audio segment you want infilled, keeping as much silence in the left/right audio segments as possible. This helps the model generate more natural transitions.
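For example, when replacing a phrase in the middle of an existing recording, the clipping step might look like the sketch below. This is illustrative only and not part of the API; the file names, timestamps, and the use of pydub are assumptions:

```python
from pydub import AudioSegment

# Load the original recording (placeholder file name).
full = AudioSegment.from_file("original.wav")

# Suppose the phrase to be replaced runs from 4.5s to 6.5s, with natural
# pauses just before and after it. Clip right up to the phrase boundaries
# so those pauses stay inside the left/right segments.
left = full[:4500]    # ends where the phrase to infill begins
right = full[6500:]   # starts where the phrase to infill ends

left.export("left.wav", format="wav")
right.export("right.wav", format="wav")
```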

Authentication

`X-API-Key` (string)
API key authentication via header

Headers

`Cartesia-Version` (string, Required). Defaults to `2024-11-13`.
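Both headers are sent on every request. A minimal sketch in Python (the key value is a placeholder):

```python
headers = {
    "X-API-Key": "your-api-key",       # placeholder; use your own API key
    "Cartesia-Version": "2024-11-13",  # the API version documented here
}
```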

Request

This endpoint expects a multipart form with multiple files.
`left_audio` (file, Required)
`right_audio` (file, Required)
`model_id` (string, Required)
The ID of the model to use for generating audio
`language` (string, Required)
The language of the transcript
`transcript` (string, Required)
The infill text to generate
`voice_id` (string, Required)
The ID of the voice to use for generating audio
`output_format[container]` (enum, Required)
The format of the output audio
Allowed values: `raw`, `wav`, `mp3`
`output_format[sample_rate]` (integer, Required)
The sample rate of the output audio
`output_format[encoding]` (enum, Optional)

Required for `raw` and `wav` containers.

`output_format[bit_rate]` (integer, Optional)

Required for `mp3` containers.

`voice[__experimental_controls][speed]` (double or enum, Optional)

Either a number between -1.0 and 1.0 or a natural language description of speed.

If you specify a number, 0.0 is the default speed, -1.0 is the slowest speed, and 1.0 is the fastest speed.

`voice[__experimental_controls][emotion][]` (list of enums, Optional)

An array of emotion:level tags.

Supported emotions are: anger, positivity, surprise, sadness, and curiosity.

Supported levels are: lowest, low, (omit), high, highest.
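Putting the request together, a multipart form request using the Python `requests` library might look like the sketch below. The endpoint URL and the `pcm_s16le` encoding value are assumptions, and the API key, voice ID, and transcript are placeholders; the field names follow the listing above:

```python
import requests

url = "https://api.cartesia.ai/infill/bytes"  # assumed URL for this endpoint

headers = {
    "X-API-Key": "your-api-key",        # placeholder
    "Cartesia-Version": "2024-11-13",
}

# Audio segments on either side of the region to infill.
files = {
    "left_audio": open("left.wav", "rb"),
    "right_audio": open("right.wav", "rb"),
}

# Form fields, named exactly as in the request listing above.
data = {
    "model_id": "sonic-2",
    "language": "en",
    "transcript": "the words to generate between the two clips",
    "voice_id": "your-voice-id",                      # placeholder
    "output_format[container]": "wav",
    "output_format[sample_rate]": 44100,
    "output_format[encoding]": "pcm_s16le",           # assumed encoding value
    "voice[__experimental_controls][speed]": 0.0,
    "voice[__experimental_controls][emotion][]": ["positivity:low"],
}

response = requests.post(url, headers=headers, files=files, data=data)
response.raise_for_status()

# The response body is the generated infill audio file itself.
with open("infill.wav", "wb") as f:
    f.write(response.content)
```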

Response

This endpoint returns a file.
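If you want a single continuous track, you can write the returned file to disk and join it back between the original segments. An illustrative sketch with pydub, assuming a `wav` output format:

```python
from pydub import AudioSegment

left = AudioSegment.from_wav("left.wav")
infill = AudioSegment.from_wav("infill.wav")   # the file returned by this endpoint
right = AudioSegment.from_wav("right.wav")

# Concatenate left + generated infill + right into one continuous clip.
combined = left + infill + right
combined.export("combined.wav", format="wav")
```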