Volume, Speed, and Emotion

Sonic provides controls for the speed, volume, and emotion of generated speech. You can set them with the UI controls on play.cartesia.ai, by passing a generation_config parameter, or with SSML tags within the transcript.

Sonic interprets these parameters as guidance rather than strict adjustments, to ensure natural speech. Test against your content to confirm the output matches your expectations.

Speed and volume controls

Guide the speed and volume of a TTS generation with the generation_config.speed and generation_config.volume parameters.

generation_config.speed (number)

The speed of the generation, ranging from 0.6 to 1.5.

generation_config.volume (number)

The volume of the generation, ranging from 0.5 to 2.0.

Set speed and volume via generation_config rather than SSML tags. Most well-punctuated transcripts are paced naturally without any adjustment, so treat these controls as a refinement for specific cases rather than a default.
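As a minimal sketch of the ranges above, the following Python helper builds a generation_config dictionary and rejects out-of-range values before a request is sent. The helper name build_generation_config, the "transcript" key, and the overall shape of request_body are assumptions made for illustration; only generation_config, speed, volume, and their ranges come from this page.

```python
# Documented ranges from this page; everything else here is illustrative.
SPEED_RANGE = (0.6, 1.5)
VOLUME_RANGE = (0.5, 2.0)


def build_generation_config(speed=None, volume=None):
    """Return a generation_config dict containing only the values that were set."""
    config = {}
    for name, value, (low, high) in (
        ("speed", speed, SPEED_RANGE),
        ("volume", volume, VOLUME_RANGE),
    ):
        if value is None:
            continue  # leave unset so the model keeps its natural pacing
        if not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}, got {value}")
        config[name] = value
    return config


# Hypothetical request body: the "transcript" key and the surrounding
# structure are assumptions, not a confirmed API schema.
request_body = {
    "transcript": "Please listen carefully to the following options.",
    "generation_config": build_generation_config(speed=0.9, volume=1.2),
}
```

Validating locally is a design choice for catching typos early. Because Sonic treats these values as guidance, an in-range value may still need listening tests against your own content.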

You can also set speed and volume inside the transcript using SSML tags:

<speed ratio="1.5"/> I like to speak quickly because it makes me sound smart.
<volume ratio="1.5"/> And I can be loud, too!
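If transcripts are assembled in code, the tags can be generated rather than typed by hand. The sketch below reproduces the two example lines above; the function names speed_tag and volume_tag are hypothetical. It does not range-check the ratio, since this page does not state whether the SSML ratio accepts the same ranges as generation_config.

```python
def speed_tag(ratio):
    """Format a self-closing SSML speed tag, e.g. <speed ratio="1.5"/>."""
    return f'<speed ratio="{ratio}"/>'


def volume_tag(ratio):
    """Format a self-closing SSML volume tag, e.g. <volume ratio="1.5"/>."""
    return f'<volume ratio="{ratio}"/>'


# Each tag is placed before the text it should influence, as in the example above.
transcript = "\n".join([
    f"{speed_tag(1.5)} I like to speak quickly because it makes me sound smart.",
    f"{volume_tag(1.5)} And I can be loud, too!",
])
```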

Emotion controls (Beta)

By default, the model interprets the emotional subtext in the provided transcript. Use the generation_config.emotion parameter to guide the emotion of a TTS generation, the way a director directs an actor.

Emotion tags push the model to be more emotive, but only work when the emotion is consistent with the transcript. The mismatch below is unlikely to work well:

<emotion value="sad"/> I'm so excited!

generation_config.emotion (string)

The emotional guidance for a generation, one of the emotions below.

The primary emotions, for which we have the most data and produce the best results, are: neutral, calm, angry, content, sad, and scared.

The complete list of available emotions is: neutral, happy, excited, enthusiastic, elated, euphoric, triumphant, amazed, surprised, flirtatious, curious, content, peaceful, serene, calm, grateful, affectionate, trust, sympathetic, anticipation, mysterious, angry, mad, outraged, frustrated, agitated, threatened, disgusted, contempt, envious, sarcastic, ironic, sad, dejected, melancholic, disappointed, hurt, guilty, bored, tired, rejected, nostalgic, wistful, apologetic, hesitant, insecure, confused, resigned, anxious, panicked, alarmed, scared, proud, confident, distant, skeptical, contemplative, determined.

The voices with the best emotional response are:

Leo (id: 0834f3df-e650-4766-a20c-5a93a43aa6e3)
Jace (id: 6776173b-fd72-460d-89b3-d85812ee518d)
Kyle (id: c961b81c-a935-4c17-bfb3-ba2239de8c2f)
Gavin (id: f4a3a8e4-694c-4c45-9ca0-27caf97901b5)
Maya (id: cbaf8084-f009-4838-a096-07ee2e6612b1)
Tessa (id: 6ccbfb76-1fc6-48f7-b71d-91ac6298247b)
Dana (id: cc00e582-ed66-4004-8336-0175b85c85f6)
Marian (id: 26403c37-80c1-4a1a-8692-540551ca2ae5)

View the full list of emotive voices in our Voice Library.

You can also use SSML tags for emotions:

<emotion value="angry"/> How dare you speak to me like I'm just a robot!
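Because the emotion value must be one of the listed names, a small check can catch misspellings before a request is made. The sketch below copies the emotion lists from this page; the names emotion_tag and is_primary are hypothetical helpers, not part of any SDK.

```python
# Emotion names copied from the lists on this page.
PRIMARY_EMOTIONS = frozenset({"neutral", "calm", "angry", "content", "sad", "scared"})

ALL_EMOTIONS = frozenset("""
neutral happy excited enthusiastic elated euphoric triumphant amazed surprised
flirtatious curious content peaceful serene calm grateful affectionate trust
sympathetic anticipation mysterious angry mad outraged frustrated agitated
threatened disgusted contempt envious sarcastic ironic sad dejected melancholic
disappointed hurt guilty bored tired rejected nostalgic wistful apologetic
hesitant insecure confused resigned anxious panicked alarmed scared proud
confident distant skeptical contemplative determined
""".split())


def is_primary(emotion):
    """True for the emotions this page says produce the best results."""
    return emotion in PRIMARY_EMOTIONS


def emotion_tag(emotion):
    """Format an SSML emotion tag, rejecting names that are not on the list."""
    if emotion not in ALL_EMOTIONS:
        raise ValueError(f"unknown emotion: {emotion!r}")
    return f'<emotion value="{emotion}"/>'


transcript = f"{emotion_tag('angry')} How dare you speak to me like I'm just a robot!"
```

A name check only guards against typos. It cannot tell whether the emotion fits the transcript, which this page notes is what determines whether the tag works well.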

Nonverbalisms

Insert [laughter] in your transcript to make the model laugh.
