POST /infill/bytes
Infill (Bytes)
curl --request POST \
  --url https://api.cartesia.ai/infill/bytes \
  --header 'Cartesia-Version: <cartesia-version>' \
  --header 'Content-Type: multipart/form-data' \
  --header 'X-API-Key: <api-key>' \
  --form left_audio='@example-file' \
  --form right_audio='@example-file' \
  --form 'model_id=<string>' \
  --form 'language=<string>' \
  --form 'transcript=<string>' \
  --form 'voice_id=<string>' \
  --form 'output_format[container]=raw' \
  --form 'output_format[sample_rate]=44100' \
  --form 'output_format[encoding]=pcm_f32le' \
  --form 'output_format[bit_rate]=123' \
  --form 'voice[__experimental_controls][speed]=0.0' \
  --form 'voice[__experimental_controls][emotion][]=anger:lowest'
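
The request above sends nested options as flat form fields with bracketed names, such as output_format[container] and voice[__experimental_controls][emotion][]. The following is a minimal sketch of how a client might produce those names from a nested structure. The helper name flatten_form_fields and the sample values are hypothetical and are not part of the API.

```python
def flatten_form_fields(data, prefix=""):
    """Flatten nested dicts and lists into (field_name, value) pairs.

    Dict keys become bracketed segments (a[b][c]); list items repeat
    the field name with a trailing [] as in the curl example.
    """
    fields = []
    for key, value in data.items():
        name = f"{prefix}[{key}]" if prefix else key
        if isinstance(value, dict):
            fields.extend(flatten_form_fields(value, name))
        elif isinstance(value, (list, tuple)):
            fields.extend((f"{name}[]", str(item)) for item in value)
        elif value is not None:
            # None values are skipped so optional fields are simply omitted.
            fields.append((name, str(value)))
    return fields


# Illustrative values only; the IDs are placeholders.
fields = flatten_form_fields({
    "model_id": "<string>",
    "output_format": {
        "container": "raw",
        "sample_rate": 44100,
        "encoding": "pcm_f32le",
    },
    "voice": {
        "__experimental_controls": {
            "speed": 0.0,
            "emotion": ["anger:lowest"],
        },
    },
})
```

The resulting pairs can be passed as the text fields of a multipart/form-data request, with left_audio and right_audio attached separately as files.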

Authorizations

X-API-Key
string
header
required

Headers

Cartesia-Version
enum<string>
required

API version header. Must be set to the API version, e.g. '2024-06-10'.

Available options:
2024-06-10,
2024-11-13,
2025-04-16
Example:

"2024-06-10"
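
As a sketch of how a client might assemble the two required headers, the snippet below checks the version against the options listed above before building the header mapping. The function name build_headers is hypothetical. Content-Type is left out on the assumption that the HTTP client generates it, together with the multipart boundary, when the form body is encoded.

```python
# Version strings listed under "Available options" for Cartesia-Version.
SUPPORTED_VERSIONS = ("2024-06-10", "2024-11-13", "2025-04-16")


def build_headers(api_key, version="2024-06-10"):
    """Return the required request headers, rejecting unlisted versions."""
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(
            f"Unsupported Cartesia-Version {version!r}; "
            f"expected one of {SUPPORTED_VERSIONS}"
        )
    return {"X-API-Key": api_key, "Cartesia-Version": version}
```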

Body

multipart/form-data
left_audio
file
right_audio
file
model_id
string

The ID of the model to use for generating audio. Any model other than the first "sonic" model is supported.

language
string

The language of the transcript

transcript
string

The infill text to generate

voice_id
string

The ID of the voice to use for generating audio

output_format[container]
enum<string>

The format of the output audio

Available options:
raw,
wav,
mp3
output_format[sample_rate]
integer

The sample rate of the output audio in Hz. Supported sample rates are 8000, 16000, 22050, 24000, 44100, 48000.

output_format[encoding]
enum<string> | null

Required for raw and wav containers.

Available options:
pcm_f32le,
pcm_s16le,
pcm_mulaw,
pcm_alaw
output_format[bit_rate]
integer | null

Required for mp3 containers.
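
The output_format constraints described above (supported sample rates, encoding required for raw and wav, bit_rate required for mp3) can be checked on the client before sending a request. This is a minimal sketch under those stated rules only. The function name check_output_format is hypothetical, and the server may enforce further rules not listed here.

```python
CONTAINERS = {"raw", "wav", "mp3"}
SAMPLE_RATES = {8000, 16000, 22050, 24000, 44100, 48000}
ENCODINGS = {"pcm_f32le", "pcm_s16le", "pcm_mulaw", "pcm_alaw"}


def check_output_format(container, sample_rate, encoding=None, bit_rate=None):
    """Return a list of problems found; an empty list means no issue was detected."""
    problems = []
    if container not in CONTAINERS:
        problems.append(f"container must be one of {sorted(CONTAINERS)}")
    if sample_rate not in SAMPLE_RATES:
        problems.append(f"sample_rate must be one of {sorted(SAMPLE_RATES)}")
    if container in ("raw", "wav") and encoding not in ENCODINGS:
        # encoding is documented as required for raw and wav containers.
        problems.append(f"encoding must be one of {sorted(ENCODINGS)}")
    if container == "mp3" and bit_rate is None:
        # bit_rate is documented as required for mp3 containers.
        problems.append("bit_rate is required for mp3")
    return problems
```

For example, check_output_format("raw", 44100, encoding="pcm_f32le") returns an empty list, while an mp3 request without bit_rate yields one problem.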

voice[__experimental_controls][speed]

Either a number between -1.0 and 1.0 or a natural language description of speed.

If you specify a number, 0.0 is the default speed, -1.0 is the slowest speed, and 1.0 is the fastest speed.
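
A client-side check for the speed control might look like the sketch below. It enforces only the documented numeric range and passes any non-empty string through unchanged, since the accepted natural language descriptions are not enumerated here. The function name check_speed and the sample string "slow" are hypothetical.

```python
def check_speed(speed):
    """Validate a speed value: a number in [-1.0, 1.0] or a text description."""
    if isinstance(speed, bool):
        # bool is a subclass of int in Python; reject it explicitly.
        raise TypeError("speed must be a number or a string")
    if isinstance(speed, (int, float)):
        if not -1.0 <= speed <= 1.0:
            raise ValueError("numeric speed must be between -1.0 and 1.0")
        return speed
    if isinstance(speed, str) and speed.strip():
        # Descriptions are not validated; "slow" below is an assumed example.
        return speed
    raise TypeError("speed must be a number or a non-empty string")
```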

voice[__experimental_controls][emotion][]
enum<string>[] | null

An array of emotion:level tags.

Supported emotions are: anger, positivity, surprise, sadness, and curiosity.

Supported levels are: lowest, low, (omit), high, highest.

Available options:
anger:lowest,
anger:low,
anger,
anger:high,
anger:highest,
positivity:lowest,
positivity:low,
positivity,
positivity:high,
positivity:highest,
surprise:lowest,
surprise:low,
surprise,
surprise:high,
surprise:highest,
sadness:lowest,
sadness:low,
sadness,
sadness:high,
sadness:highest,
curiosity:lowest,
curiosity:low,
curiosity,
curiosity:high,
curiosity:highest
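
The options above follow a regular pattern: each of the five emotions combined with each of the five levels, where the middle level is expressed by omitting the suffix. A short sketch that builds a tag and regenerates the full list is shown below. The names emotion_tag and ALL_TAGS are hypothetical.

```python
EMOTIONS = ("anger", "positivity", "surprise", "sadness", "curiosity")
# None stands for the omitted (middle) level.
LEVELS = ("lowest", "low", None, "high", "highest")


def emotion_tag(emotion, level=None):
    """Build an emotion:level tag; omit the level for the middle intensity."""
    if emotion not in EMOTIONS:
        raise ValueError(f"unsupported emotion: {emotion!r}")
    if level not in LEVELS:
        raise ValueError(f"unsupported level: {level!r}")
    return emotion if level is None else f"{emotion}:{level}"


# All 25 tags, in the same order as the list of available options.
ALL_TAGS = [emotion_tag(e, lvl) for e in EMOTIONS for lvl in LEVELS]
```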

Response

204 - undefined