Sonic supports SSML-like (Speech Synthesis Markup Language) tags to control generated speech.
Speed
Available on sonic-3 only. Temporarily disabled on sonic-3.5; it will be re-enabled soon.
You can guide the speed of a TTS generation with a speed tag, which takes a scalar between 0.6 and 1.5.
This value is roughly a multiplier on the default speed. For example, 1.5 will generate audio at roughly 1.5x the
default speed.
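As a rough illustration of the multiplier, the sketch below checks a speed value against the documented 0.6 to 1.5 range and estimates how long a clip would last at that speed. The function names are hypothetical, and the estimate assumes the multiplier applies exactly, whereas the description above only says "roughly".

```python
SPEED_MIN, SPEED_MAX = 0.6, 1.5  # documented range for the speed tag


def check_speed(speed: float) -> float:
    """Return speed unchanged if it lies within the documented range."""
    if not SPEED_MIN <= speed <= SPEED_MAX:
        raise ValueError(
            f"speed must be between {SPEED_MIN} and {SPEED_MAX}, got {speed}"
        )
    return speed


def estimated_duration(default_seconds: float, speed: float) -> float:
    """Approximate duration, treating speed as an exact multiplier (an assumption)."""
    return default_seconds / check_speed(speed)


# A clip that takes 6 seconds at the default speed, generated at 1.5x.
print(estimated_duration(6.0, 1.5))  # 4.0
```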
Volume
Available on sonic-3 only. Temporarily disabled on sonic-3.5; it will be re-enabled soon.
You can guide the volume of a TTS generation with a volume tag, which is a number between 0.5
and 2.0. The default volume is 1.
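The sketch below builds a volume tag and rejects values outside the documented 0.5 to 2.0 range. The `ratio` attribute name and the self-closing form are assumptions, since this page does not show the tag syntax, and `volume_tag` is a hypothetical helper; check the API reference before relying on either.

```python
VOLUME_MIN, VOLUME_MAX = 0.5, 2.0  # documented range; the default volume is 1


def volume_tag(volume: float = 1.0) -> str:
    """Build a volume tag string. The `ratio` attribute name is an assumption."""
    if not VOLUME_MIN <= volume <= VOLUME_MAX:
        raise ValueError(
            f"volume must be between {VOLUME_MIN} and {VOLUME_MAX}, got {volume}"
        )
    # The :g format drops a trailing ".0", so 1.0 is written as "1".
    return f'<volume ratio="{volume:g}"/>'


transcript = volume_tag(0.5) + " I will speak softly."
print(transcript)
```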
Emotion (Beta)
Pauses and breaks
To insert breaks (or pauses) in generated speech, use break tags with one attribute, time. For
example, <break time="1s"/>. You can specify the time in seconds (s) or milliseconds (ms).
For accounting purposes, each break tag counts as 1 character and does not need to be separated from adjacent text by a
space, so you can remove the spaces around break tags to save credits.
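A minimal sketch of building break tags and estimating a character count under the accounting rule above. The helper names are hypothetical, and the count only illustrates the "1 character per break tag" rule; actual billing may differ in other respects.

```python
import re

# Matches tags such as <break time="1s"/> or <break time="250ms"/>.
BREAK_TAG = re.compile(r'<break time="\d+(?:\.\d+)?(?:ms|s)"\s*/>')


def break_tag(milliseconds: int) -> str:
    """Build a break tag, using whole seconds when possible, else milliseconds."""
    if milliseconds <= 0:
        raise ValueError("break duration must be positive")
    if milliseconds % 1000 == 0:
        return f'<break time="{milliseconds // 1000}s"/>'
    return f'<break time="{milliseconds}ms"/>'


def counted_characters(transcript: str) -> int:
    """Count characters, treating each break tag as a single character."""
    tags = BREAK_TAG.findall(transcript)
    return len(transcript) - sum(len(tag) for tag in tags) + len(tags)


# No spaces around the tags, since they are not required.
text = "Hello," + break_tag(1000) + "world." + break_tag(250) + "Bye."
print(text)
print(counted_characters(text))  # 16 text characters + 2 tags = 18
```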
Spelling out numbers and letters
To spell out input text, you can wrap it in <spell> tags.
This is particularly useful for pronouncing long numbers or identifiers, such as credit card numbers, phone numbers, or unique IDs.
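For example, an identifier can be wrapped before it is placed in a transcript. The `spell` helper below is hypothetical, and the paired closing `</spell>` tag is assumed from the word "wrap" above.

```python
def spell(text: str) -> str:
    """Wrap text in spell tags so that it is spelled out."""
    if not text:
        raise ValueError("nothing to spell")
    return f"<spell>{text}</spell>"


transcript = "Your confirmation code is " + spell("A1B2C3") + "."
print(transcript)
```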
<break> and <spell> tags are each considered 1 character and do not need to be separated from adjacent text by a space. To save credits, you can remove the spaces around break and spell tags.