Clone Voices - Cartesia Docs

Voice Cloning is available through the playground and the API. High-similarity clones sound more like the source clip, but may reproduce background noise. For the best voice clones, we recommend following these best practices:

General best practices for voice cloning

Choose an appropriate script to speak. You want your recording to align as closely as possible with the voice you want to generate. For example, don’t read a colorless transcript in a monotone voice unless you’re aiming for a monotonous clone. Instead, prepare a script that is suited to your use case and has the right energy.
Speak as clearly as possible and avoid background noise. For example, when recording yourself, try to use a high-quality microphone and be in a quiet space.
Avoid long pauses. The cloned voice will mimic pauses in the recording, such as those between sentences.
Speak at the desired speed. The cloned voice will speak at the same speed as the sample clip.
Trim your recording. The audio you provide should contain speech roughly from start to finish. Make sure the speaker is not cut off and that there’s no excessive silence at the beginning or end. You can use a tool like Audacity or our playground to make the perfect clip from your recording.
Speak in the target language. For instance, if you want the cloned voice to speak Spanish, speak Spanish in the recording. If this is not possible, you can use Cartesia’s localization feature (available in the playground and in the API) to convert your clone to a different language.
Limit your recording to ten seconds. This is the sweet spot for high-similarity clones. A longer clip will not result in a better clone.
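
If you prefer to prepare clips with a script rather than an audio editor, the trimming and length advice above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not part of the Cartesia API: the function names, the amplitude `threshold` of 500, and the restriction to mono 16-bit PCM WAV files are all assumptions made for the example, and the right threshold depends on your recording.

```python
import array
import sys
import wave


def trim_and_cap(samples, sample_rate, threshold=500, max_seconds=10.0):
    """Drop leading/trailing silence from mono 16-bit samples, then cap the length.

    `threshold` (absolute amplitude) and `max_seconds` are illustrative
    defaults, not values prescribed by Cartesia.
    """
    loud = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not loud:
        return array.array("h")  # the whole clip is below the threshold
    trimmed = samples[loud[0] : loud[-1] + 1]
    max_samples = int(max_seconds * sample_rate)
    return trimmed[:max_samples]


def trim_wav(src_path, dst_path, threshold=500, max_seconds=10.0):
    """Read a mono 16-bit PCM WAV, trim it, write it, and return its duration in seconds."""
    with wave.open(src_path, "rb") as src:
        if src.getnchannels() != 1 or src.getsampwidth() != 2:
            raise ValueError("this sketch only handles mono 16-bit PCM WAV")
        rate = src.getframerate()
        samples = array.array("h", src.readframes(src.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    out = trim_and_cap(samples, rate, threshold, max_seconds)
    if sys.byteorder == "big":
        out.byteswap()  # convert back before writing
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(rate)
        dst.writeframes(out.tobytes())
    return len(out) / rate
```

Calling `trim_wav("raw.wav", "clip.wav")` (hypothetical file names) writes a clip that starts and ends on speech and is at most ten seconds long. The sketch only removes silence at the edges; long pauses inside the recording are left in place, and the ten-second cap can cut the speaker off mid-word, so it is still worth listening to the result before uploading it.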
