POST /infill/bytes
Infill (Bytes)
curl --request POST \
  --url https://api.cartesia.ai/infill/bytes \
  --header 'Cartesia-Version: <cartesia-version>' \
  --header 'Content-Type: multipart/form-data' \
  --header 'X-API-Key: <api-key>' \
  --form left_audio='@example-file' \
  --form right_audio='@example-file' \
  --form 'transcript=<string>' \
  --form 'voice_id=<string>' \
  --form 'output_format[sample_rate]=44100' \
  --form 'output_format[bit_rate]=123' \
  --form 'voice[__experimental_controls][speed]=0.0'
"<string>"

Authorizations

X-API-Key
string
header
required

Headers

Cartesia-Version
enum<string>
required

API version header.

Available options: 2024-11-13

Example: "2024-11-13"

Body

multipart/form-data
left_audio
file

Audio clip that comes before the infill transcript: left_audio -> transcript -> right_audio

For best results, target natural pauses in the audio and clip tightly. At least one of left_audio or right_audio must be provided.

Supported audio formats: flac, mp3, mpeg, mpga, oga, ogg, wav, webm

right_audio
file

Audio clip that comes after the infill transcript: left_audio -> transcript -> right_audio

For best results, target natural pauses in the audio and clip tightly. At least one of left_audio or right_audio must be provided.

Supported audio formats: flac, mp3, mpeg, mpga, oga, ogg, wav, webm
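
The constraints on left_audio and right_audio can be checked on the client before uploading. The following is a minimal sketch, not part of the API: `check_infill_audio` is a hypothetical helper name, and judging the format by file extension is an assumption made for illustration, since this reference does not say how the server detects the format.

```python
from pathlib import Path
from typing import Optional

# Formats listed in this reference for left_audio / right_audio.
SUPPORTED_AUDIO_FORMATS = {"flac", "mp3", "mpeg", "mpga", "oga", "ogg", "wav", "webm"}


def check_infill_audio(left_audio: Optional[str], right_audio: Optional[str]) -> None:
    """Raise ValueError if the clip paths violate the documented constraints.

    Hypothetical client-side helper; the extension check is only a heuristic.
    """
    if left_audio is None and right_audio is None:
        raise ValueError("At least one of left_audio or right_audio must be provided.")
    for name, path in (("left_audio", left_audio), ("right_audio", right_audio)):
        if path is None:
            continue
        # Compare the lowercased extension, without the leading dot.
        ext = Path(path).suffix.lower().lstrip(".")
        if ext not in SUPPORTED_AUDIO_FORMATS:
            raise ValueError(f"{name}: unsupported audio format {ext!r}")
```

Calling `check_infill_audio("before.wav", None)` passes silently, while omitting both clips raises a `ValueError`.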

model_id
enum<string>

The ID of the model to use for generating audio

Available options: sonic-3, sonic-3-2026-01-12, sonic-3-2025-10-27

language
enum<string>

The language of the transcript

Available options: en, fr, de, es, pt, zh, ja, hi, it, ko, nl, pl, ru, sv, tr

transcript
string

The infill text to generate. For best results, use longer transcripts to give the model more flexibility to adapt to the rest of the audio.

voice_id
string

The ID of the voice to use for generating audio

output_format[container]
enum<string>

The format of the output audio

Available options: raw, wav, mp3

output_format[sample_rate]
integer

The sample rate of the output audio in Hz. Supported sample rates are 8000, 16000, 22050, 24000, 44100, 48000.

output_format[encoding]
enum<string> | null

Required for raw and wav containers.

Available options: pcm_f32le, pcm_s16le, pcm_mulaw, pcm_alaw

output_format[bit_rate]
integer | null

Required for mp3 containers.
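
The output_format fields depend on one another: encoding is required for raw and wav, and bit_rate is required for mp3. The sketch below builds the form fields while enforcing those rules. `output_format_fields` is a hypothetical helper name. Leaving out the field that the chosen container does not require is an assumption, because this reference does not say whether the API accepts it.

```python
from typing import Dict, Optional

# Values taken from the option lists in this reference.
CONTAINERS = {"raw", "wav", "mp3"}
ENCODINGS = {"pcm_f32le", "pcm_s16le", "pcm_mulaw", "pcm_alaw"}
SAMPLE_RATES = {8000, 16000, 22050, 24000, 44100, 48000}


def output_format_fields(
    container: str,
    sample_rate: int,
    encoding: Optional[str] = None,
    bit_rate: Optional[int] = None,
) -> Dict[str, str]:
    """Return multipart form fields for output_format (hypothetical helper)."""
    if container not in CONTAINERS:
        raise ValueError(f"unsupported container: {container!r}")
    if sample_rate not in SAMPLE_RATES:
        raise ValueError(f"unsupported sample rate: {sample_rate!r}")
    fields = {
        "output_format[container]": container,
        "output_format[sample_rate]": str(sample_rate),
    }
    if container in ("raw", "wav"):
        # encoding is required for raw and wav containers.
        if encoding not in ENCODINGS:
            raise ValueError(f"{container} requires one of {sorted(ENCODINGS)}")
        fields["output_format[encoding]"] = encoding
    else:
        # bit_rate is required for mp3 containers.
        if bit_rate is None:
            raise ValueError("mp3 requires bit_rate")
        fields["output_format[bit_rate]"] = str(bit_rate)
    return fields
```

For example, `output_format_fields("wav", 44100, encoding="pcm_s16le")` returns the container, sample_rate, and encoding fields, and raises if the encoding is missing.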

voice[__experimental_controls][speed]
number | string

Either a number between -1.0 and 1.0 or a natural language description of speed.

If you specify a number, 0.0 is the default speed, -1.0 is the slowest speed, and 1.0 is the fastest speed.
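
Because speed accepts either a number in a fixed range or free text, a small normalizer can catch out-of-range values before the request is sent. This is a sketch under assumptions: `speed_field` is a hypothetical name, and since this reference does not list which natural language descriptions are accepted, any non-empty string is passed through unchanged.

```python
from typing import Union


def speed_field(speed: Union[int, float, str]) -> str:
    """Return the form value for voice[__experimental_controls][speed].

    Hypothetical helper: numbers must lie in [-1.0, 1.0]; strings are
    passed through because the accepted descriptions are not listed here.
    """
    if isinstance(speed, bool):
        # bool is a subclass of int in Python; reject it explicitly.
        raise TypeError("speed must be a number or a string")
    if isinstance(speed, (int, float)):
        if not -1.0 <= speed <= 1.0:
            raise ValueError("numeric speed must be between -1.0 and 1.0")
        return str(float(speed))
    if isinstance(speed, str) and speed.strip():
        return speed.strip()
    raise ValueError("speed must be a number or a non-empty description")
```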

voice[__experimental_controls][emotion][]
enum<string>[] | null

An array of emotion:level tags.

Supported emotions are: anger, positivity, surprise, sadness, and curiosity.

Supported levels are: lowest, low, (omit), high, highest.

Available options:
anger:lowest, anger:low, anger, anger:high, anger:highest,
positivity:lowest, positivity:low, positivity, positivity:high, positivity:highest,
surprise:lowest, surprise:low, surprise, surprise:high, surprise:highest,
sadness:lowest, sadness:low, sadness, sadness:high, sadness:highest,
curiosity:lowest, curiosity:low, curiosity, curiosity:high, curiosity:highest
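
Each emotion tag is an emotion name with an optional `:level` suffix, so the tags can be generated rather than typed by hand. The sketch below uses the hypothetical helper names `emotion_tag` and `emotion_fields`. Repeating the `[]` field name once per tag is an assumption based on the field name shown in this reference.

```python
from typing import Iterable, List, Optional, Tuple

EMOTIONS = ("anger", "positivity", "surprise", "sadness", "curiosity")
LEVELS = ("lowest", "low", "high", "highest")  # omit the level for the middle setting

EMOTION_FIELD = "voice[__experimental_controls][emotion][]"


def emotion_tag(emotion: str, level: Optional[str] = None) -> str:
    """Build one emotion or emotion:level tag (hypothetical helper)."""
    if emotion not in EMOTIONS:
        raise ValueError(f"unsupported emotion: {emotion!r}")
    if level is None:
        return emotion
    if level not in LEVELS:
        raise ValueError(f"unsupported level: {level!r}")
    return f"{emotion}:{level}"


def emotion_fields(pairs: Iterable[Tuple[str, Optional[str]]]) -> List[Tuple[str, str]]:
    """Return one (field name, tag) pair per requested emotion."""
    return [(EMOTION_FIELD, emotion_tag(emotion, level)) for emotion, level in pairs]
```

For instance, `emotion_fields([("positivity", "high"), ("curiosity", None)])` yields two form fields carrying `positivity:high` and `curiosity`.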

Response

200 - audio/*

Audio bytes

The response is of type file.
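
Putting the request together, the following sketch sends the multipart form with Python's standard library and returns the audio bytes from the response. The function names `encode_multipart` and `infill_bytes` are hypothetical, the per-file Content-Type is guessed from the filename (an assumption; this reference does not say whether the server needs it), and field names or filenames containing double quotes are not escaped.

```python
import mimetypes
import urllib.request
import uuid
from typing import Iterable, Mapping, Tuple

API_URL = "https://api.cartesia.ai/infill/bytes"


def encode_multipart(
    fields: Iterable[Tuple[str, str]],
    files: Mapping[str, Tuple[str, bytes]],
) -> Tuple[bytes, str]:
    """Encode text fields and files as multipart/form-data.

    `files` maps a field name to (filename, content).
    Returns (body, value for the Content-Type header).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields:
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    for name, (filename, content) in files.items():
        mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
        header = (
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"; '
            f'filename="{filename}"\r\nContent-Type: {mime}\r\n\r\n'
        )
        parts.append(header.encode() + content + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def infill_bytes(api_key, fields, files, version="2024-11-13") -> bytes:
    """POST the form to /infill/bytes and return the response body."""
    body, content_type = encode_multipart(fields, files)
    request = urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Cartesia-Version": version,
            "X-API-Key": api_key,
            "Content-Type": content_type,
        },
    )
    # urlopen raises urllib.error.HTTPError for non-2xx responses.
    with urllib.request.urlopen(request) as response:
        return response.read()
```

A caller would pass fields such as `("transcript", "...")` and `("voice_id", "...")`, supply at least one of left_audio or right_audio in `files`, and write the returned bytes to a file whose extension matches the requested container.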