Infill (Bytes)
Generate audio that smoothly connects two existing audio segments
Authorizations
Cartesia API key (sk_car_...). Get one at play.cartesia.ai/keys.
Headers
API version header.
2026-03-01
Body
Audio clip that comes before the infill transcript:
left_audio -> transcript -> right_audio
For best results, target natural pauses in the audio and clip tightly.
At least one of left_audio or right_audio must be provided.
Supported audio formats: flac, mp3, mpeg, mpga, oga, ogg, wav, webm
Audio clip that comes after the infill transcript:
left_audio -> transcript -> right_audio
For best results, target natural pauses in the audio and clip tightly.
At least one of left_audio or right_audio must be provided.
Supported audio formats: flac, mp3, mpeg, mpga, oga, ogg, wav, webm
The ID of the model to use for generating audio
Available options: sonic-3, sonic-3-2026-01-12, sonic-3-2025-10-27
The language of the transcript
Available options: en, fr, de, es, pt, zh, ja, hi, it, ko, nl, pl, ru, sv, tr, tl, bg, ro, ar, cs, el, fi, hr, ms, sk, da, ta, uk, hu, no, vi, bn, th, he, ka, id, te, gu, kn, ml, mr, pa
The infill text to generate. For best results, use longer transcripts to give the model more flexibility to adapt to the rest of the audio.
The ID of the voice to use for generating audio
The format of the output audio
Available options: raw, wav, mp3
The sample rate of the output audio
Available options: 8000, 16000, 22050, 24000, 44100, 48000
Required for raw and wav containers.
Available options: pcm_f32le, pcm_s16le, pcm_mulaw, pcm_alaw
Required for mp3 containers.
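The constraints listed in the body can be checked on the client before sending a request. The following is a minimal sketch, not part of the Cartesia API: the function name check_infill_request and the parameter names container, sample_rate, encoding, and mp3_setting are hypothetical (this page does not name those fields), and judging the audio format by file extension is only a heuristic.

```python
from pathlib import Path

# Allowed values copied from the reference above.
AUDIO_FORMATS = {"flac", "mp3", "mpeg", "mpga", "oga", "ogg", "wav", "webm"}
CONTAINERS = {"raw", "wav", "mp3"}
SAMPLE_RATES = {8000, 16000, 22050, 24000, 44100, 48000}
ENCODINGS = {"pcm_f32le", "pcm_s16le", "pcm_mulaw", "pcm_alaw"}


def check_infill_request(left_audio=None, right_audio=None, *, container,
                         sample_rate, encoding=None, mp3_setting=None):
    """Return a list of problems; an empty list means every check passed.

    All parameter names except left_audio and right_audio are hypothetical.
    """
    problems = []

    # At least one of the two surrounding clips must be provided.
    if left_audio is None and right_audio is None:
        problems.append("provide at least one of left_audio or right_audio")

    # Heuristic: infer each clip's format from its file extension.
    for name, clip in (("left_audio", left_audio), ("right_audio", right_audio)):
        if clip is not None:
            ext = Path(clip).suffix.lstrip(".").lower()
            if ext not in AUDIO_FORMATS:
                problems.append(f"{name}: unsupported audio format {ext!r}")

    if container not in CONTAINERS:
        problems.append(f"unsupported container {container!r}")
    if sample_rate not in SAMPLE_RATES:
        problems.append(f"unsupported sample rate {sample_rate!r}")

    # raw and wav containers require an encoding.
    if container in ("raw", "wav"):
        if encoding is None:
            problems.append("an encoding is required for raw and wav containers")
        elif encoding not in ENCODINGS:
            problems.append(f"unsupported encoding {encoding!r}")

    # mp3 containers require the mp3-only setting (unnamed on this page).
    if container == "mp3" and mp3_setting is None:
        problems.append("the mp3-only setting is required for mp3 containers")

    return problems
```

For example, check_infill_request("before.wav", container="wav", sample_rate=44100, encoding="pcm_s16le") returns an empty list, while omitting both clips adds a problem entry.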
Response
Audio bytes
The response is of type file.
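Audio returned in a raw container has no header, so most players cannot open it directly. As a rough sketch, the bytes can be wrapped in a WAV file with Python's standard wave module. The helper name save_raw_pcm_as_wav is hypothetical, and the code assumes the response was requested as pcm_s16le and is mono; neither assumption is stated on this page.

```python
import wave


def save_raw_pcm_as_wav(pcm_bytes, path, sample_rate, channels=1):
    """Write headerless 16-bit little-endian PCM bytes to a WAV file.

    Assumes pcm_s16le samples; mono output is an assumption, not a documented fact.
    """
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(channels)     # assumed mono by default
        wav_file.setsampwidth(2)            # 16-bit samples are 2 bytes each
        wav_file.setframerate(sample_rate)  # must match the requested sample rate
        wav_file.writeframes(pcm_bytes)
```

Responses requested with a wav or mp3 container already carry their own header and can be written to disk as-is.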