Infill (Bytes)

curl --request POST \
  --url https://api.cartesia.ai/infill/bytes \
  --header 'Cartesia-Version: <cartesia-version>' \
  --header 'Content-Type: multipart/form-data' \
  --header 'X-API-Key: <api-key>' \
  --form 'model_id=<string>' \
  --form 'language=<string>' \
  --form 'transcript=<string>' \
  --form 'voice_id=<string>' \
  --form 'output_format[container]=raw' \
  --form 'output_format[sample_rate]=44100' \
  --form 'output_format[encoding]=pcm_f32le' \
  --form 'output_format[bit_rate]=123' \
  --form 'voice[__experimental_controls][speed]=0' \
  --form 'voice[__experimental_controls][emotion][]=anger:lowest' \
  --form left_audio=@example-file \
  --form right_audio=@example-file

Generate audio that smoothly connects two existing audio segments. This is useful for inserting new speech between existing speech segments while maintaining natural transitions.

The cost is 1 credit per character of the infill text plus a fixed cost of 300 credits.
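As a rough illustration of the pricing rule above, the following sketch estimates the credits for a given infill transcript. The function name `infill_cost` is hypothetical, and it assumes every character of the transcript (including spaces and punctuation) counts toward the total, which this page does not spell out.

```shell
# Estimate infill credits: 1 credit per transcript character plus 300 fixed.
# Assumption: spaces and punctuation count as characters.
infill_cost() {
  local transcript="$1"
  # ${#var} is the length of the string (characters under a UTF-8 locale).
  echo $(( ${#transcript} + 300 ))
}

infill_cost "and then we went home"
```

Treat the result as an estimate only; the billed amount is whatever the API reports.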

Infilling is only available on sonic-2 at this time.

At least one of left_audio or right_audio must be provided.
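A client can check this requirement before sending the request. The sketch below uses a hypothetical helper, `audio_form_args`, that prints the matching `--form` arguments for whichever sides are supplied and fails when both are missing. It is a simplification that does not handle file paths containing spaces.

```shell
# Print the audio --form arguments, or fail when neither side is given.
# $1 = path to the left audio (may be empty)
# $2 = path to the right audio (may be empty)
audio_form_args() {
  local left="$1" right="$2" args=""
  if [ -z "$left" ] && [ -z "$right" ]; then
    echo "error: provide left_audio, right_audio, or both" >&2
    return 1
  fi
  [ -n "$left" ] && args="--form left_audio=@$left"
  # Add a separating space only when the left argument is already present.
  [ -n "$right" ] && args="${args:+$args }--form right_audio=@$right"
  printf '%s\n' "$args"
}

audio_form_args left.wav right.wav
audio_form_args "" right.wav
```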

As with all generative models, there is some inherent variability. We recommend the following tips for getting the best results from infill:

- Use longer infill transcripts. This gives the model more flexibility to adapt to the rest of the audio.
- Target natural pauses in the audio when deciding where to clip. This means your word-level timestamps don’t need to be as precise.
- Clip right up to the start and end of the audio segment you want infilled, keeping as much silence in the left/right audio segments as possible. This helps the model generate more natural transitions.
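One way to produce the left and right segments is to cut the original recording at the boundaries of the span being replaced. The sketch below only prints the ffmpeg commands rather than running them. It assumes ffmpeg is installed, and the helper name `print_clip_commands`, the file names, and the timestamps are all hypothetical.

```shell
# Print (not run) ffmpeg commands that cut a recording around the span to infill.
# $1 = input file
# $2 = start of the span to replace, in seconds
# $3 = end of the span to replace, in seconds
print_clip_commands() {
  local input="$1" start="$2" end="$3"
  # Left context: from the beginning of the file up to the start of the span.
  printf 'ffmpeg -i %s -to %s left.wav\n' "$input" "$start"
  # Right context: from the end of the span to the end of the file.
  printf 'ffmpeg -ss %s -i %s right.wav\n' "$end" "$input"
}

print_clip_commands recording.wav 3.20 4.75
```

Choosing the start and end inside natural pauses, as suggested in the tips, leaves some slack in where exactly the cuts land.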

POST /infill/bytes

Authorizations

X-API-Key (string, header, required)

Headers

Cartesia-Version (enum<string>, required)

API version header. Must be set to the API version, e.g. '2024-06-10'.

Available options: 2024-06-10, 2024-11-13, 2025-04-16

Example: "2024-11-13"

Body

multipart/form-data

left_audio (file)

right_audio (file)

model_id (string)

The ID of the model to use for generating audio.

language (string)

The language of the transcript.

transcript (string)

The infill text to generate.

voice_id (string)

The ID of the voice to use for generating audio.

output_format[container] (enum<string>)

The format of the output audio.

Available options: raw, wav, mp3

output_format[sample_rate] (integer)

The sample rate of the output audio in Hz. Supported sample rates are 8000, 16000, 22050, 24000, 44100, 48000.

output_format[encoding] (enum<string>)

Required for raw and wav containers.

Available options: pcm_f32le, pcm_s16le, pcm_mulaw, pcm_alaw

output_format[bit_rate] (integer | null)

Required for mp3 containers.
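The container-dependent rules above (encoding for raw and wav, bit rate for mp3) can be checked on the client before a request is sent. This is a minimal sketch with a hypothetical helper, `output_format_fields`, that prints one form field per line. Because this page lists no valid bit rates, the mp3 branch only checks that a value is present.

```shell
# Print the output_format form fields that apply to a container, one per line.
# $1 = container (raw|wav|mp3)
# $2 = sample rate in Hz
# $3 = encoding for raw/wav, or bit rate for mp3
output_format_fields() {
  local container="$1" rate="$2" extra="$3" field
  case "$rate" in
    8000|16000|22050|24000|44100|48000) ;;
    *) echo "error: unsupported sample rate: $rate" >&2; return 1 ;;
  esac
  case "$container" in
    raw|wav)
      case "$extra" in
        pcm_f32le|pcm_s16le|pcm_mulaw|pcm_alaw) ;;
        *) echo "error: $container requires a valid encoding" >&2; return 1 ;;
      esac
      field="encoding" ;;
    mp3)
      [ -n "$extra" ] || { echo "error: mp3 requires a bit rate" >&2; return 1; }
      field="bit_rate" ;;
    *) echo "error: unknown container: $container" >&2; return 1 ;;
  esac
  printf '%s\n' "output_format[container]=$container" \
    "output_format[sample_rate]=$rate" \
    "output_format[$field]=$extra"
}

output_format_fields wav 44100 pcm_s16le
```

Each printed line can be passed to curl as the value of one `--form` option, quoted so the shell does not interpret the square brackets.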

voice[__experimental_controls][speed]

Either a number between -1.0 and 1.0 or a natural language description of speed.

If you specify a number, 0.0 is the default speed, -1.0 is the slowest speed, and 1.0 is the fastest speed.
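For the numeric form, the range can be verified before the request is built. The sketch below uses a hypothetical helper, `is_speed_in_range`, and relies on awk because POSIX shell arithmetic is integer-only. It does not validate natural language descriptions, since this page does not list the accepted ones.

```shell
# Succeed only when $1 is a plain decimal number between -1.0 and 1.0 inclusive.
is_speed_in_range() {
  printf '%s\n' "$1" | awk '
    # Accept optional sign, digits, and an optional fractional part.
    /^[-+]?([0-9]+\.?[0-9]*|\.[0-9]+)$/ { ok = ($1 >= -1.0 && $1 <= 1.0) }
    END { exit !ok }'
}

if is_speed_in_range 0.25; then
  echo "speed accepted"
fi
```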

voice[__experimental_controls][emotion][] (enum<string>[] | null)

An array of emotion:level tags.

Supported emotions are: anger, positivity, surprise, sadness, and curiosity.

Supported levels are: lowest, low, (omit the level), high, highest.
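A tag can be assembled and checked against the lists above before it is sent. In this sketch the helper name `emotion_tag` is hypothetical, and it assumes that omitting the level means sending the bare emotion name with no colon, which this page implies but does not state outright.

```shell
# Print an emotion tag such as "anger:lowest", or just "anger" when no level is given.
# $1 = emotion, $2 = level (may be empty)
emotion_tag() {
  local emotion="$1" level="$2"
  case "$emotion" in
    anger|positivity|surprise|sadness|curiosity) ;;
    *) echo "error: unsupported emotion: $emotion" >&2; return 1 ;;
  esac
  case "$level" in
    # Assumption: an omitted level is expressed as the emotion name alone.
    "") printf '%s\n' "$emotion" ;;
    lowest|low|high|highest) printf '%s\n' "${emotion}:${level}" ;;
    *) echo "error: unsupported level: $level" >&2; return 1 ;;
  esac
}

emotion_tag anger lowest
emotion_tag curiosity ""
```

Each tag goes in its own `voice[__experimental_controls][emotion][]` form field, as in the curl example.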
