POST /infill/bytes
Infill (Bytes)
curl --request POST \
  --url https://api.cartesia.ai/infill/bytes \
  --header 'Authorization: Bearer <token>' \
  --header 'Cartesia-Version: <cartesia-version>' \
  --header 'Content-Type: multipart/form-data' \
  --form left_audio='@example-file' \
  --form right_audio='@example-file' \
  --form 'transcript=<string>' \
  --form 'voice_id=<string>' \
  --form 'output_format[bit_rate]=123'
"<string>"

Authorizations

Authorization
string
header
required

Cartesia API key (sk_car_...). Get one at play.cartesia.ai/keys.
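As a rough illustration, the sketch below assembles the Authorization header together with the Cartesia-Version header that this endpoint also requires. The helper name `build_headers` and the placeholder key `sk_car_example` are hypothetical. The Bearer scheme follows the curl sample on this page, and the version set contains only the option listed in this reference.

```python
# Only version listed in this reference; extend if more are published.
SUPPORTED_VERSIONS = {"2026-03-01"}


def build_headers(api_key: str, version: str = "2026-03-01") -> dict[str, str]:
    """Return the two required request headers (hypothetical helper)."""
    if not api_key:
        raise ValueError("api_key must not be empty")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported Cartesia-Version: {version!r}")
    return {
        # Bearer scheme as shown in the curl sample.
        "Authorization": f"Bearer {api_key}",
        "Cartesia-Version": version,
    }
```

Calling `build_headers("sk_car_example")` yields a dictionary that can be passed to whichever HTTP client is in use.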

Headers

Cartesia-Version
enum<string>
required

API version header.

Available options: 2026-03-01
Example:

"2026-03-01"

Body

multipart/form-data
left_audio
file

Audio clip that comes before the infill transcript. The order is left_audio -> transcript -> right_audio.

For best results, cut the clip at a natural pause and trim it tightly. At least one of left_audio or right_audio must be provided.

Supported audio formats: flac, mp3, mpeg, mpga, oga, ogg, wav, webm

right_audio
file

Audio clip that comes after the infill transcript. The order is left_audio -> transcript -> right_audio.

For best results, cut the clip at a natural pause and trim it tightly. At least one of left_audio or right_audio must be provided.

Supported audio formats: flac, mp3, mpeg, mpga, oga, ogg, wav, webm
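A client could check the two audio rules above before sending anything. The sketch below is one way to do that. The function name `check_audio_inputs` is hypothetical, and it assumes the file extension reflects the actual audio format, which the API may or may not rely on.

```python
from pathlib import Path
from typing import Optional

# Formats listed for left_audio and right_audio in this reference.
SUPPORTED_FORMATS = {"flac", "mp3", "mpeg", "mpga", "oga", "ogg", "wav", "webm"}


def check_audio_inputs(left_audio: Optional[str],
                       right_audio: Optional[str]) -> list[str]:
    """Return the provided clip paths, or raise ValueError (hypothetical helper)."""
    provided = [p for p in (left_audio, right_audio) if p is not None]
    if not provided:
        raise ValueError("at least one of left_audio or right_audio must be provided")
    for path in provided:
        # Assumption: the extension matches the real container format.
        ext = Path(path).suffix.lower().lstrip(".")
        if ext not in SUPPORTED_FORMATS:
            raise ValueError(f"unsupported audio format for {path!r}: {ext!r}")
    return provided
```

For example, `check_audio_inputs("before.wav", None)` passes because a single clip is enough, while passing `None` for both raises a `ValueError`.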

model_id
enum<string>

The ID of the model to use for generating audio

Available options: sonic-3, sonic-3-2026-01-12, sonic-3-2025-10-27

language
enum<string>

The language of the transcript

Available options: en, fr, de, es, pt, zh, ja, hi, it, ko, nl, pl, ru, sv, tr, tl, bg, ro, ar, cs, el, fi, hr, ms, sk, da, ta, uk, hu, no, vi, bn, th, he, ka, id, te, gu, kn, ml, mr, pa

transcript
string

The infill text to generate. For best results, use longer transcripts to give the model more flexibility to adapt to the rest of the audio.

voice_id
string

The ID of the voice to use for generating audio

output_format[container]
enum<string>

The format of the output audio

Available options: raw, wav, mp3

output_format[sample_rate]
enum<integer>

The sample rate of the output audio

Available options: 8000, 16000, 22050, 24000, 44100, 48000

output_format[encoding]
enum<string> | null

Required for raw and wav containers.

Available options: pcm_f32le, pcm_s16le, pcm_mulaw, pcm_alaw

output_format[bit_rate]
integer | null

Required for mp3 containers.
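The output_format fields depend on each other: encoding goes with raw and wav, bit_rate goes with mp3. The sketch below builds the bracketed form fields and enforces that rule. The helper name `output_format_fields` is hypothetical. It also assumes container and sample_rate are always supplied, which this reference does not state.

```python
from typing import Optional

# Option sets copied from this reference.
CONTAINERS = {"raw", "wav", "mp3"}
SAMPLE_RATES = {8000, 16000, 22050, 24000, 44100, 48000}
ENCODINGS = {"pcm_f32le", "pcm_s16le", "pcm_mulaw", "pcm_alaw"}


def output_format_fields(container: str,
                         sample_rate: int,
                         encoding: Optional[str] = None,
                         bit_rate: Optional[int] = None) -> dict[str, str]:
    """Build output_format[...] form fields (hypothetical helper)."""
    if container not in CONTAINERS:
        raise ValueError(f"unknown container: {container!r}")
    if sample_rate not in SAMPLE_RATES:
        raise ValueError(f"unsupported sample_rate: {sample_rate!r}")
    fields = {
        "output_format[container]": container,
        "output_format[sample_rate]": str(sample_rate),
    }
    if container in ("raw", "wav"):
        # encoding is required for raw and wav containers.
        if encoding not in ENCODINGS:
            raise ValueError(f"{container} requires one of {sorted(ENCODINGS)}")
        fields["output_format[encoding]"] = encoding
    else:
        # bit_rate is required for mp3 containers.
        if bit_rate is None:
            raise ValueError("mp3 requires bit_rate")
        fields["output_format[bit_rate]"] = str(bit_rate)
    return fields
```

The returned dictionary uses the same bracketed field names as the curl sample, so it can be merged directly into the other multipart form fields.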

Response

200 - audio/*

Audio bytes

The response is of type file.
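Because the 200 response is the audio itself, a client typically writes the body straight to a file. The sketch below builds the multipart request with only the Python standard library and saves the response. The function names, the placeholder key, and the `application/octet-stream` part type are assumptions made for illustration; the URL and header names are taken from the curl sample on this page.

```python
import urllib.request
import uuid

URL = "https://api.cartesia.ai/infill/bytes"


def encode_multipart(fields: dict[str, str],
                     files: dict[str, tuple[str, bytes]]) -> tuple[bytes, str]:
    """Encode text fields and files as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode("utf-8")
        )
    for name, (filename, data) in files.items():
        head = (
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"; '
            f'filename="{filename}"\r\n'
            # Assumption: a generic part type is accepted for audio clips.
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode("utf-8")
        parts.append(head + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("ascii"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_request(api_key: str,
                  fields: dict[str, str],
                  files: dict[str, tuple[str, bytes]],
                  version: str = "2026-03-01") -> urllib.request.Request:
    """Build the POST request without sending it."""
    body, content_type = encode_multipart(fields, files)
    return urllib.request.Request(
        URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Cartesia-Version": version,
            "Content-Type": content_type,
        },
    )


def infill_to_file(request: urllib.request.Request, out_path: str) -> None:
    """Send the request and write the returned audio bytes to out_path."""
    with urllib.request.urlopen(request) as resp, open(out_path, "wb") as out:
        out.write(resp.read())
```

Splitting request construction from sending keeps the encoding step easy to inspect offline; only `infill_to_file` touches the network.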