Sonic-3 provides rich controls for the speed, volume, and emotion of generated speech. You can set them with the UI controls on play.cartesia.ai, by passing a generation_config parameter, or with SSML tags within the transcript itself.
To keep speech natural, Sonic-3 treats these parameters as guidance rather than strict adjustments, so we recommend testing them against your content to confirm the output matches your expectations.

Speed and Volume Controls

You can guide the speed and volume of a TTS generation with the generation_config.speed and generation_config.volume parameters. These values act roughly as multipliers on the default speed and volume; e.g., a speed of 1.5 generates audio at about 1.5x the default speed.
generation_config.speed
number
The speed of the generation, ranging from 0.6 to 1.5.
generation_config.volume
number
The volume of the generation, ranging from 0.5 to 2.0.
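As a minimal sketch of how these two parameters might be prepared before a request, the snippet below clamps values to the documented ranges and returns a plain dict. The helper name build_generation_config and the default of 1.0 (taken to mean "no change", since the values are described as multipliers) are assumptions for illustration, not part of the Cartesia API.

```python
# Documented ranges from the parameter descriptions above.
SPEED_RANGE = (0.6, 1.5)
VOLUME_RANGE = (0.5, 2.0)


def clamp(value: float, bounds: tuple[float, float]) -> float:
    """Restrict value to the inclusive range given by bounds."""
    low, high = bounds
    return max(low, min(high, value))


def build_generation_config(speed: float = 1.0, volume: float = 1.0) -> dict:
    """Hypothetical helper: build a generation_config dict with in-range values.

    The 1.0 defaults are an assumption based on the values being multipliers.
    """
    return {
        "speed": clamp(speed, SPEED_RANGE),
        "volume": clamp(volume, VOLUME_RANGE),
    }


# A speed of 2.0 is outside the documented range, so it is clamped to 1.5.
config = build_generation_config(speed=2.0, volume=0.8)
print(config)
```

Clamping is only one possible policy; rejecting out-of-range values with an error may suit stricter pipelines better.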
You can also specify these inside the transcript itself, using SSML, like: <speed ratio="1.5"/> I like to speak quickly because it makes me sound smart. <volume level="1.5"/> And I can be loud, too!
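The SSML form above can also be assembled programmatically. The following is a rough sketch; speed_tag and volume_tag are hypothetical helper names, and only the tag syntax itself comes from the example above.

```python
def speed_tag(ratio: float) -> str:
    """Hypothetical helper: format the <speed> SSML tag shown in this section."""
    return f'<speed ratio="{ratio}"/>'


def volume_tag(level: float) -> str:
    """Hypothetical helper: format the <volume> SSML tag shown in this section."""
    return f'<volume level="{level}"/>'


# Rebuild the example transcript from this section with the tags placed inline.
transcript = (
    f"{speed_tag(1.5)} I like to speak quickly because it makes me sound smart. "
    f"{volume_tag(1.5)} And I can be loud, too!"
)
print(transcript)
```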

Emotion Controls (Beta)

By default, the model attempts to interpret the emotional subtext present in the provided transcript. You can guide the emotion of a TTS generation, like a director providing guidance to an actor, using the generation_config.emotion parameter.
Emotion tags are good for pushing the model to be more emotive, but they only work when the emotion is consistent with the transcript. For instance, a transcript such as `I’m so excited!` paired with a conflicting emotion is unlikely to work well.
generation_config.emotion
string
The emotional guidance for a generation, one of the emotions below.
The primary emotions, for which we have the most data and which produce the best results, are: neutral, angry, excited, content, sad, and scared.
The complete list of available emotions is: happy, excited, enthusiastic, elated, euphoric, triumphant, amazed, surprised, flirtatious, joking/comedic, curious, content, peaceful, serene, calm, grateful, affectionate, trust, sympathetic, anticipation, mysterious, angry, mad, outraged, frustrated, agitated, threatened, disgusted, contempt, envious, sarcastic, ironic, sad, dejected, melancholic, disappointed, hurt, guilty, bored, tired, rejected, nostalgic, wistful, apologetic, hesitant, insecure, confused, resigned, anxious, panicked, alarmed, scared, neutral, proud, confident, distant, skeptical, contemplative, determined.
The Voices with the best emotional response are:
  • Leo (id: 0834f3df-e650-4766-a20c-5a93a43aa6e3)
  • Jace (id: 6776173b-fd72-460d-89b3-d85812ee518d)
  • Kyle (id: c961b81c-a935-4c17-bfb3-ba2239de8c2f)
  • Gavin (id: f4a3a8e4-694c-4c45-9ca0-27caf97901b5)
  • Maya (id: cbaf8084-f009-4838-a096-07ee2e6612b1)
  • Tessa (id: 6ccbfb76-1fc6-48f7-b71d-91ac6298247b)
  • Dana (id: cc00e582-ed66-4004-8336-0175b85c85f6)
  • Marian (id: 26403c37-80c1-4a1a-8692-540551ca2ae5)
View the full list of emotive Voices in our Voice Library, where they are tagged ‘Emotive’.
You can also use SSML tags for emotions, for example: <emotion value="angry" /> How dare you speak to me like I'm just a robot!
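To illustrate the emotion tag, here is a small sketch that prefixes a transcript with an emotion tag. Restricting input to the six primary emotions is a design choice made for this example, not an API requirement (the full list above is also accepted), and the function names are hypothetical.

```python
# The six primary emotions named in this section.
PRIMARY_EMOTIONS = {"neutral", "angry", "excited", "content", "sad", "scared"}


def emotion_tag(emotion: str) -> str:
    """Hypothetical helper: format the <emotion> SSML tag shown in this section."""
    return f'<emotion value="{emotion}" />'


def with_emotion(transcript: str, emotion: str) -> str:
    """Prefix a transcript with an emotion tag.

    Only primary emotions are allowed here, by choice of this sketch.
    """
    if emotion not in PRIMARY_EMOTIONS:
        raise ValueError(f"not a primary emotion: {emotion!r}")
    return f"{emotion_tag(emotion)} {transcript}"


print(with_emotion("How dare you speak to me like I'm just a robot!", "angry"))
```

The same emotion string could instead be passed as generation_config.emotion; which route fits best depends on whether the emotion applies to the whole request or is set inside the transcript.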

Nonverbalisms

Insert [laughter] in your transcript to make the model laugh. In the future, we plan to add more non-speech verbalisms, such as sighs and coughs.
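As a brief sketch, the marker can be placed between two pieces of text with a small helper. The function name and the sample sentences are hypothetical; only the [laughter] token comes from this section.

```python
LAUGHTER = "[laughter]"


def join_with_laughter(before: str, after: str) -> str:
    """Hypothetical helper: place a [laughter] marker between two text segments.

    Surrounding whitespace is normalized so the marker is separated by single spaces.
    """
    return f"{before.rstrip()} {LAUGHTER} {after.lstrip()}"


print(join_with_laughter("That was the worst pun I have heard all week.", "Tell me another one!"))
```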