Learn how to get the best voice clones from your audio clips.

Voice cloning is available through the playground and the API. Voice cloning has two modes: stability and similarity. High-similarity clones sound more like the source clip, but may reproduce background noise. Stability clones always sound like a studio recording, but may not sound as similar to the source clip.

For the best voice clones, we recommend following these best practices:

General best practices for voice cloning

  1. Choose an appropriate script to speak. You want your recording to align as closely as possible with the voice you want to generate. For example, don’t read a colorless transcript in a monotone voice unless you’re aiming for a monotonous clone. Instead, prepare a script that is suited to your use case and has the right energy.
  2. Speak as clearly as possible and avoid background noise. For example, when recording yourself, try to use a high-quality microphone, be in a quiet space, and so on.
  3. Avoid long pauses in the clip. Too many long pauses could result in the cloned voice drifting from the source clip.
  4. Speak in the target language. For instance, if you want the cloned voice to speak Spanish, speak Spanish in the recording. If this is not possible, you can use Cartesia’s localization feature—available in the playground and in the API—to convert your clone to a different language.

Best practices for high-similarity clones

  1. Limit your recording to five seconds. This is the sweet spot for high-similarity clones. A longer clip will not result in a better clone.
  2. Set enhance to false when cloning. Unless your source clip has substantial background noise, any postprocessing will reduce the similarity of the clone to the source clip.

Best practices for high-stability clones

  1. Limit your recording to 10 to 20 seconds. This is the sweet spot—a longer clip will not result in a better clone.
  2. Set enhance to true when cloning. This optimizes sample clip quality prior to voice cloning-—improving clone fidelity, especially for lower-quality samples. Note that this may increase overall volume and remove background noises.
Built with