Speed and volume controls are temporarily disabled on sonic-3.5 and will be re-enabled soon. They remain available on sonic-3 snapshots; if you rely on speed or volume adjustment today, pin to sonic-3.
Sonic provides controls for the speed, volume, and emotion of generated speech. These are available on play.cartesia.ai using the UI controls, by passing a generation_config parameter, or by using SSML tags within the transcript.
Sonic interprets these parameters as guidance rather than strict adjustments, to ensure natural speech. Test against your content to confirm the output matches your expectations.

Speed and volume controls

Guide the speed and volume of a TTS generation with the generation_config.speed and generation_config.volume parameters. These values act roughly as multipliers on the default. For example, a speed of 1.5 generates audio at about 1.5x the default speed.
generation_config.speed
number
The speed of the generation, ranging from 0.6 to 1.5.
generation_config.volume
number
The volume of the generation, ranging from 0.5 to 2.0.
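As a minimal sketch of how these two fields could be set from Python, the helper below range-checks the values against the limits documented above and nests them under generation_config. The function name build_generation_config and the surrounding payload keys (model_id, transcript) are assumptions made for illustration; only generation_config.speed, generation_config.volume, and their ranges come from this page.

```python
# Ranges documented on this page.
SPEED_RANGE = (0.6, 1.5)
VOLUME_RANGE = (0.5, 2.0)


def _check(name, value, bounds):
    """Raise ValueError if value falls outside the inclusive bounds."""
    low, high = bounds
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")
    return value


def build_generation_config(speed=None, volume=None):
    """Hypothetical helper (not part of any SDK): build a generation_config dict.

    Fields left as None are omitted so the model default applies.
    """
    config = {}
    if speed is not None:
        config["speed"] = _check("speed", speed, SPEED_RANGE)
    if volume is not None:
        config["volume"] = _check("volume", volume, VOLUME_RANGE)
    return config


# Assumed payload shape; consult the API reference for the actual request fields.
payload = {
    "model_id": "sonic-3",  # speed and volume are currently disabled on sonic-3.5
    "transcript": "I like to speak quickly because it makes me sound smart.",
    "generation_config": build_generation_config(speed=1.5, volume=1.2),
}
```

Validating on the client side is a design choice in this sketch, not documented behavior; because Sonic treats these values as guidance, an in-range value still may not produce an exact multiple of the default.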
You can also specify these inside the transcript itself using SSML:
<speed ratio="1.5"/> I like to speak quickly because it makes me sound smart.
<volume ratio="1.5"/> And I can be loud, too!
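If you assemble transcripts programmatically, the SSML tags above can be generated with plain string formatting. The sketch below reproduces the two example lines; the helper names speed_tag and volume_tag are hypothetical, and this page does not state how far a tag's effect extends into the following text, so the sketch simply places each tag directly before the sentence it is meant to influence.

```python
def speed_tag(ratio):
    """Format the SSML speed tag shown above (hypothetical helper)."""
    return f'<speed ratio="{ratio}"/>'


def volume_tag(ratio):
    """Format the SSML volume tag shown above (hypothetical helper)."""
    return f'<volume ratio="{ratio}"/>'


# Rebuild the two example lines, one tagged sentence per line.
lines = [
    f"{speed_tag(1.5)} I like to speak quickly because it makes me sound smart.",
    f"{volume_tag(1.5)} And I can be loud, too!",
]
transcript = "\n".join(lines)
print(transcript)
```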

Emotion controls (Beta)

By default, the model interprets the emotional subtext in the provided transcript. To guide the emotion of a TTS generation, much as a director guides an actor, use the generation_config.emotion parameter.
Emotion tags push the model to be more emotive, but only work when the emotion is consistent with the transcript. The mismatch below is unlikely to work well:
<emotion value="sad"/> I'm so excited!
generation_config.emotion
string
The emotional guidance for a generation, one of the emotions below.
The primary emotions, for which we have the most data and which produce the best results, are: neutral, angry, excited, content, sad, and scared.
The complete list of available emotions is: happy, excited, enthusiastic, elated, euphoric, triumphant, amazed, surprised, flirtatious, joking/comedic, curious, content, peaceful, serene, calm, grateful, affectionate, trust, sympathetic, anticipation, mysterious, angry, mad, outraged, frustrated, agitated, threatened, disgusted, contempt, envious, sarcastic, ironic, sad, dejected, melancholic, disappointed, hurt, guilty, bored, tired, rejected, nostalgic, wistful, apologetic, hesitant, insecure, confused, resigned, anxious, panicked, alarmed, scared, neutral, proud, confident, distant, skeptical, contemplative, determined.
The voices with the best emotional response are:
  • Leo (id: 0834f3df-e650-4766-a20c-5a93a43aa6e3)
  • Jace (id: 6776173b-fd72-460d-89b3-d85812ee518d)
  • Kyle (id: c961b81c-a935-4c17-bfb3-ba2239de8c2f)
  • Gavin (id: f4a3a8e4-694c-4c45-9ca0-27caf97901b5)
  • Maya (id: cbaf8084-f009-4838-a096-07ee2e6612b1)
  • Tessa (id: 6ccbfb76-1fc6-48f7-b71d-91ac6298247b)
  • Dana (id: cc00e582-ed66-4004-8336-0175b85c85f6)
  • Marian (id: 26403c37-80c1-4a1a-8692-540551ca2ae5)
View the full list of emotive voices in our Voice Library.
You can also use SSML tags for emotions:
<emotion value="angry"/> How dare you speak to me like I'm just a robot!
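The two ways of supplying emotion described in this section, the generation_config.emotion parameter and the inline SSML tag, can be sketched side by side. In the sketch below, check_primary and emotion_tag are hypothetical helpers, and the payload keys other than generation_config.emotion (including the shape of the voice field) are assumptions rather than confirmed request fields. The guard deliberately restricts input to the six primary emotions, which is stricter than the complete list above.

```python
# Primary emotions listed on this page (most data, best results).
PRIMARY_EMOTIONS = {"neutral", "angry", "excited", "content", "sad", "scared"}

# Leo, one of the voices this page lists as responding well to emotion.
LEO_VOICE_ID = "0834f3df-e650-4766-a20c-5a93a43aa6e3"


def check_primary(emotion):
    """Hypothetical guard: accept only the six primary emotions."""
    if emotion not in PRIMARY_EMOTIONS:
        raise ValueError(f"{emotion!r} is not a primary emotion")
    return emotion


def emotion_tag(emotion):
    """Format the SSML emotion tag shown above (hypothetical helper)."""
    return f'<emotion value="{emotion}"/>'


text = "How dare you speak to me like I'm just a robot!"

# Option 1: emotion as a request parameter.
# Keys other than generation_config.emotion are assumed, not confirmed.
payload = {
    "transcript": text,
    "voice": {"mode": "id", "id": LEO_VOICE_ID},
    "generation_config": {"emotion": check_primary("angry")},
}

# Option 2: emotion as an inline SSML tag at the start of the transcript.
tagged_transcript = f"{emotion_tag(check_primary('angry'))} {text}"
print(tagged_transcript)
```

Either way, the emotion should match what the transcript already implies; a guard like this cannot detect a mismatch such as a sad tag on an excited sentence.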

Nonverbalisms

Insert [laughter] in your transcript to make the model laugh. We plan to add more nonverbalisms, such as sighs and coughs, in future releases.