Embeddings and Voice Mixing

Learn how Sonic uses embeddings to represent voices and enable voice mixing.

Cartesia models represent voices in the form of embeddings. An embedding is a length-192 vector of floats between -1 and 1. Collectively, these numbers capture the characteristics of the voice: speed, emotion, accent, and so on.
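
As a sketch, an embedding can be handled as a plain array of 192 floats (the snippet below is illustrative; real embeddings come from the API or playground, and NumPy is just one convenient representation):

```python
import numpy as np

# A placeholder embedding: 192 floats, each in [-1, 1].
# A real embedding would come from the API or playground; this one is random.
rng = np.random.default_rng(seed=0)
embedding = rng.uniform(-1.0, 1.0, size=192).astype(np.float32)

assert embedding.shape == (192,)
assert np.all((-1.0 <= embedding) & (embedding <= 1.0))
```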

Our text-to-speech endpoints currently require embeddings to be specified for each generation. Support for specifying voice IDs instead of embeddings is planned.

Voice Mixing (Alpha)

A neat feature that embeddings enable is voice mixing. You can interpolate between two embeddings to obtain a third voice that sounds somewhere between the two. This feature is available in the playground.

We suggest mixing embeddings with linear interpolation. To interpolate between embeddings $A$ and $B$ to obtain $C$, use the following formula:

$$C = (1 - \alpha)A + \alpha B$$

$\alpha$, or alpha, is the “interpolation coefficient”: an alpha of 1 means $C = B$, and an alpha of 0 means $C = A$. For example, with $\alpha = 0.5$, $C = 0.5A + 0.5B$.
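
Here is a minimal sketch of this interpolation in Python (`mix_voices` is a hypothetical helper for illustration, not part of any Cartesia SDK):

```python
import numpy as np

def mix_voices(a: np.ndarray, b: np.ndarray, alpha: float) -> np.ndarray:
    """Linearly interpolate between voice embeddings `a` and `b`.

    alpha = 0 returns `a`; alpha = 1 returns `b`; values in between mix.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return (1.0 - alpha) * a + alpha * b
```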

The perception of a mixed voice does not change linearly with the interpolation coefficient. For instance, to get a 50/50 perceived mix, you may need an alpha that leans toward one of the voices. For the best results, explore a few values of alpha.
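
One way to explore is to sweep several alphas and audition each mix by ear. A sketch, where `voice_a` and `voice_b` are random placeholders standing in for real embeddings:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
voice_a = rng.uniform(-1.0, 1.0, size=192)  # placeholder: use a real embedding
voice_b = rng.uniform(-1.0, 1.0, size=192)  # placeholder: use a real embedding

# Candidate mixes at several alphas; generate audio for each and pick by ear.
candidates = {
    alpha: (1.0 - alpha) * voice_a + alpha * voice_b
    for alpha in (0.3, 0.4, 0.5, 0.6, 0.7)
}
```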

Prototype embeddings

Our playground’s voice design features (i.e. the speed and emotion controls) rely on prototype embeddings. These are embeddings that capture the essence of some characteristic: fastness, slowness, anger, curiosity, and so on. If you interpolate a voice with a prototype embedding, the voice acquires the characteristics of the prototype.
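
Conceptually, applying a prototype is the same interpolation as voice mixing, with the strength of the effect set by the coefficient. In this sketch, the prototype values are random placeholders, not real prototype embeddings:

```python
import numpy as np

def apply_prototype(voice: np.ndarray, prototype: np.ndarray,
                    strength: float) -> np.ndarray:
    # Interpolate the voice toward the prototype; smaller strengths nudge
    # the characteristic while better preserving the speaker's identity.
    return (1.0 - strength) * voice + strength * prototype

rng = np.random.default_rng(seed=2)
voice = rng.uniform(-1.0, 1.0, size=192)           # placeholder voice
fast_prototype = rng.uniform(-1.0, 1.0, size=192)  # placeholder "fast" prototype

faster_voice = apply_prototype(voice, fast_prototype, strength=0.2)
```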

The current voice design sliders (speed, emotion) may alter the speaker identity. Making smaller changes should help with this, but we’re working on improving the stability of this feature!