SSML Tags - Cartesia Docs

Tags for volume, speed, and emotion are in beta and subject to change.

Sonic-3 supports SSML-like (Speech Synthesis Markup Language) tags to control generated speech.

Speed

Note that if you’re streaming input token by token, you’ll need to buffer the whole value of a speed or volume tag before sending it. Passing in 1, ., and 0 as separate inputs, for example, will cause the tag to be read aloud instead of applied.

You can guide the speed of a TTS generation with a speed tag, which takes a scalar between 0.6 and 1.5. This value is roughly a multiplier on the default speed. For example, 1.5 will generate audio at roughly 1.5x the default speed.

<speed ratio="1.5"/> I like to speak quickly because it makes me sound smart.
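The buffering requirement in the note above can be sketched as a small generator that holds back any partially received tag until its closing > arrives. This is a minimal sketch, not part of any Cartesia SDK: the name buffer_tags is hypothetical, and it assumes a literal < only ever appears in the text as the start of a tag. It buffers whole tags, which is slightly more than the note strictly requires.

```python
from typing import Iterable, Iterator


def buffer_tags(tokens: Iterable[str]) -> Iterator[str]:
    """Yield text chunks, holding back any tag that has not closed yet.

    Hypothetical helper: assumes "<" only ever starts a tag.
    """
    pending = ""
    for token in tokens:
        pending += token
        open_idx = pending.rfind("<")
        close_idx = pending.rfind(">")
        if open_idx > close_idx:
            # A tag has opened but not closed: flush only the text before it.
            ready, pending = pending[:open_idx], pending[open_idx:]
        else:
            ready, pending = pending, ""
        if ready:
            yield ready
    if pending:
        # The stream ended mid-tag; pass the remainder through unchanged.
        yield pending


tokens = ["Hi ", '<speed ratio="', "1", ".", "0", '"/>', " there"]
chunks = list(buffer_tags(tokens))
```

Here the speed tag reaches the consumer as a single chunk even though its value arrived as three separate tokens.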

Volume

You can guide the volume of a TTS generation with a volume tag, which takes a number between 0.5 and 2.0. The default volume is 1.

<volume ratio="0.5"/> I will speak softly.
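When tags are built programmatically, it may help to validate the ratio before it reaches the model. The sketch below uses the ranges stated above (0.6 to 1.5 for speed, 0.5 to 2.0 for volume). The helper names are hypothetical, and treating the bounds as inclusive is an assumption.

```python
def _ratio_tag(name: str, ratio: float, low: float, high: float) -> str:
    """Build a self-closing tag after checking the ratio (bounds assumed inclusive)."""
    if not low <= ratio <= high:
        raise ValueError(f"{name} ratio must be between {low} and {high}, got {ratio}")
    return f'<{name} ratio="{ratio}"/>'


def speed_tag(ratio: float) -> str:
    """Hypothetical helper: speed tag with the documented 0.6 to 1.5 range."""
    return _ratio_tag("speed", ratio, 0.6, 1.5)


def volume_tag(ratio: float) -> str:
    """Hypothetical helper: volume tag with the documented 0.5 to 2.0 range."""
    return _ratio_tag("volume", ratio, 0.5, 2.0)


transcript = volume_tag(0.5) + " I will speak softly."
```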

Emotion Beta

Emotion control is highly experimental, particularly when emotion shifts occur mid-generation. If you need to change the emotion in a transcript, we recommend using separate generation contexts for each emotion. For best results, use Voices tagged as “Emotive”, as emotions may not work reliably with other Voices.

<emotion value="angry" /> I will not allow you to continue this! <emotion value="sad" /> I was hoping for a peaceful resolution.
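To follow the recommendation of one generation context per emotion, a transcript that contains several emotion tags can first be split into (emotion, text) segments. This is a minimal sketch: split_by_emotion is a hypothetical name, the regular expression only handles the self-closing form shown above, and how each segment is then sent in its own context is left to your client code.

```python
import re
from typing import List, Optional, Tuple

# Matches the self-closing form used in the example, e.g. <emotion value="sad" />
EMOTION_TAG = re.compile(r'<emotion\s+value="([^"]+)"\s*/>')


def split_by_emotion(transcript: str) -> List[Tuple[Optional[str], str]]:
    """Split a transcript into (emotion, text) segments, one per emotion tag.

    Text before the first tag is returned with emotion None.
    """
    segments: List[Tuple[Optional[str], str]] = []
    emotion: Optional[str] = None
    pos = 0
    for match in EMOTION_TAG.finditer(transcript):
        text = transcript[pos:match.start()].strip()
        if text:
            segments.append((emotion, text))
        emotion = match.group(1)
        pos = match.end()
    tail = transcript[pos:].strip()
    if tail:
        segments.append((emotion, tail))
    return segments


segments = split_by_emotion(
    '<emotion value="angry" /> I will not allow you to continue this! '
    '<emotion value="sad" /> I was hoping for a peaceful resolution.'
)
```

Each segment can then be prefixed with its own emotion tag and generated separately.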

Pauses and breaks

To insert breaks (or pauses) in generated speech, use a break tag with one attribute, time. For example, <break time="1s" />. You can specify the time in seconds (s) or milliseconds (ms). For accounting purposes, each break tag counts as 1 character and does not need to be separated from adjacent text by a space. To save credits, you can remove the spaces around break tags.

Hello, my name is Sonic.<break time="1s"/>Nice to meet you.
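Removing the spaces around break tags can be automated before a transcript is submitted. The sketch below assumes break tags are written in the self-closing form shown above. The name tighten_breaks is hypothetical.

```python
import re

# A break tag plus any whitespace on either side of it.
BREAK_TAG = re.compile(r'\s*(<break\s+time="[^"]*"\s*/>)\s*')


def tighten_breaks(text: str) -> str:
    """Drop whitespace immediately before and after each break tag."""
    return BREAK_TAG.sub(r"\1", text)


tightened = tighten_breaks(
    'Hello, my name is Sonic. <break time="1s" /> Nice to meet you.'
)
```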

Spelling out numbers and letters

To spell out input text, you can wrap it in <spell> tags. This is particularly useful for pronouncing long numbers or identifiers, such as credit card numbers, phone numbers, or unique IDs.

My name is Bob, spelled <spell>Bob</spell>, my account number is <spell>ABC-123</spell>, my phone number is <spell>(123) 456-7890</spell>, and my credit card is <spell>1234-5678-9012-3456</spell>.

If you want to spell out numbers or identifiers with planned pauses between their parts (e.g. pausing between the area code of a phone number and the rest of that number), you can combine <break> and <spell> tags. Each of these tags counts as 1 character and does not need to be separated from adjacent text by a space. To save credits, you can remove the spaces around break and spell tags.

My phone number is <spell>(123)</spell><break time="200ms"/><spell>4712177</spell> and my credit card number is <spell>1234</spell><break time="200ms"/><spell>5678</spell><break time="200ms"/><spell>6347</spell><break time="200ms"/><spell>4537</spell>.
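The pattern above, spelled groups joined by short breaks, is easy to generate from a list of groups. This is a minimal sketch: spell_groups and its pause_ms parameter are hypothetical, and the 200 ms default simply mirrors the example.

```python
from typing import Sequence


def spell_groups(groups: Sequence[str], pause_ms: int = 200) -> str:
    """Wrap each group in <spell> tags and join them with break tags (no spaces)."""
    pause = f'<break time="{pause_ms}ms"/>'
    return pause.join(f"<spell>{group}</spell>" for group in groups)


phone = spell_groups(["(123)", "4712177"])
card = spell_groups(["1234", "5678", "6347", "4537"])
transcript = f"My phone number is {phone} and my credit card number is {card}."
```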
