Clone Voice
Clone a voice from an audio clip. This endpoint has two modes: stability and similarity.
Similarity mode clones are more similar to the source clip, but may reproduce background noise. For these, use an audio clip about 5 seconds long.
Stability mode clones are more stable, but may not sound as similar to the source clip. For these, use an audio clip 10-20 seconds long.
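The clip-length guidance above can be checked before uploading. The following is a minimal sketch for WAV files using only the Python standard library. The helper names are hypothetical, and since "about 5 seconds" has no stated bounds, the 3 to 8 second window for similarity mode is an assumption to adjust as needed.

```python
import wave

# Recommended clip lengths in seconds, taken from the guidance above.
# The similarity window is an ASSUMPTION: the text only says "about 5 seconds".
RECOMMENDED_SECONDS = {
    "similarity": (3.0, 8.0),    # assumed window around ~5 s
    "stability": (10.0, 20.0),   # stated as 10-20 s
}


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def clip_length_ok(mode: str, seconds: float) -> bool:
    """Check whether a clip length falls in the recommended range for a mode."""
    low, high = RECOMMENDED_SECONDS[mode]
    return low <= seconds <= high
```

For example, `clip_length_ok("stability", wav_duration_seconds("sample.wav"))` reports whether a local file (here a hypothetical `sample.wav`) fits the stability-mode recommendation.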
Headers
X-API-Key
Cartesia-Version
Request
This endpoint expects a multipart form containing a file.
clip
The audio clip to clone the voice from.
name
The name of the voice.
description
A description for the voice.
language
The language of the voice.
mode
Tradeoff between similarity and stability. Similarity clones sound more like the source clip, but may reproduce background noise. Stability clones always sound like a studio recording, but may not sound as similar to the source clip.
Allowed values: similarity, stability
enhance
Whether to apply AI enhancements to the clip to reduce background noise. This leads to cleaner generated speech at the cost of reduced similarity to the source clip.
base_voice_id
Optional base voice ID that the cloned voice is derived from.
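To illustrate how the headers and form fields above fit together, here is a rough sketch that assembles the multipart request with the Python standard library. Several things in it are assumptions rather than documented behavior: `CLONE_URL` and `API_VERSION` are placeholders (this page does not state the URL or a version string), the function names are hypothetical, treating `name`, `language`, and `mode` as required is a guess, and so is sending `enhance` as the text `true` or `false`.

```python
import uuid
import urllib.request
from typing import Dict, Optional, Tuple

# Placeholders: replace with the real values from the API reference.
CLONE_URL = "https://api.example.com/voices/clone"  # hypothetical URL
API_VERSION = "YYYY-MM-DD"                          # hypothetical version string


def encode_multipart(
    fields: Dict[str, str], clip_name: str, clip_bytes: bytes
) -> Tuple[bytes, str]:
    """Encode text fields plus the `clip` file as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for key, value in fields.items():
        text_part = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{key}"\r\n\r\n'
            f"{value}\r\n"
        )
        parts.append(text_part.encode("utf-8"))
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="clip"; filename="{clip_name}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    parts.append(file_header.encode("utf-8") + clip_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_clone_request(
    api_key: str,
    clip_name: str,
    clip_bytes: bytes,
    name: str,
    language: str,
    mode: str,
    description: Optional[str] = None,
    enhance: Optional[bool] = None,
    base_voice_id: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the clone form."""
    fields = {"name": name, "language": language, "mode": mode}
    if description is not None:
        fields["description"] = description
    if enhance is not None:
        # Assumed encoding of the boolean as form text.
        fields["enhance"] = "true" if enhance else "false"
    if base_voice_id is not None:
        fields["base_voice_id"] = base_voice_id

    body, content_type = encode_multipart(fields, clip_name, clip_bytes)
    headers = {
        "X-API-Key": api_key,
        "Cartesia-Version": API_VERSION,
        "Content-Type": content_type,
    }
    return urllib.request.Request(CLONE_URL, data=body, headers=headers, method="POST")
```

Once the placeholders are filled in, the request could be sent with `urllib.request.urlopen(request)`. Building the request separately from sending it keeps the form encoding easy to inspect and test without network access.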
Response
This endpoint returns an object.
id
The ID of the voice.
user_id
The ID of the user who owns the voice.
is_public
Whether the voice is publicly accessible.
name
The name of the voice.
description
The description of the voice.
created_at
The date and time the voice was created.
language
The language that the given voice should speak the transcript in.
Options: English (en), French (fr), German (de), Spanish (es), Portuguese (pt), Chinese (zh), Japanese (ja), Hindi (hi), Italian (it), Korean (ko), Dutch (nl), Polish (pl), Russian (ru), Swedish (sv), Turkish (tr).
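The response object described above could be read into a typed structure along these lines. This is a sketch under assumptions: the `Voice` class and `parse_voice` function are hypothetical names, the field types are guesses (the page does not state them), and `created_at` is kept as a plain string because the timestamp format is not specified here.

```python
import json
from dataclasses import dataclass, fields

# Language codes copied from the options listed above.
SUPPORTED_LANGUAGES = {
    "en", "fr", "de", "es", "pt", "zh", "ja", "hi",
    "it", "ko", "nl", "pl", "ru", "sv", "tr",
}


@dataclass(frozen=True)
class Voice:
    id: str
    user_id: str
    is_public: bool
    name: str
    description: str
    created_at: str  # left unparsed: timestamp format is not documented here
    language: str


def parse_voice(payload: str) -> Voice:
    """Parse a JSON response body into a Voice, ignoring unknown keys."""
    data = json.loads(payload)
    voice = Voice(**{f.name: data[f.name] for f in fields(Voice)})
    if voice.language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unexpected language code: {voice.language!r}")
    return voice
```

Unknown keys are ignored so that the parser does not break if the response carries fields beyond those listed; a missing listed field raises `KeyError`.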