Tags for volume, speed, and emotion are in beta and subject to change in the
future.
Sonic-3 supports tags modeled on SSML (Speech Synthesis Markup Language) to control generated speech.
Speed
Note that if you’re streaming input token by token, you’ll need to buffer each speed or volume tag, including its whole value, and send it as a single input.
For example, passing 1, ., and 0 as separate inputs will result in the tags being read out loud instead of applied.
You can guide the speed of a TTS generation with a speed tag, which takes a scalar between 0.6 and 1.5.
This value is roughly a multiplier on the default speed. For example, 1.5 will generate audio at roughly 1.5x the
default speed.
<speed ratio="1.5"/> I like to speak quickly because it makes me sound smart.
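The buffering requirement for streamed input can be handled with a small re-chunking step. The following is a minimal sketch, not an official client: `buffer_tags` is a hypothetical helper name, and it assumes that any `<` in the stream starts a tag that ends at the next `>`. It holds back text from an unclosed `<` until the closing `>` arrives, so a tag such as `<speed ratio="1.0"/>` is always forwarded in one piece.

```python
from typing import Iterable, Iterator


def buffer_tags(chunks: Iterable[str]) -> Iterator[str]:
    """Re-chunk a token stream so that no <...> tag is split across chunks.

    Hypothetical helper: assumes every "<" opens a tag closed by the next ">".
    """
    pending = ""
    for chunk in chunks:
        pending += chunk
        # Look for the last "<" that has not yet been closed by a ">".
        open_idx = pending.rfind("<")
        if open_idx != -1 and pending.find(">", open_idx) == -1:
            # Emit the text before the unfinished tag and keep the rest.
            ready, pending = pending[:open_idx], pending[open_idx:]
        else:
            ready, pending = pending, ""
        if ready:
            yield ready
    if pending:
        # The stream ended inside an unterminated tag; pass it through as-is.
        yield pending


# Tokens as an LLM might produce them, with the ratio value split up.
tokens = ['<speed ratio="', "1", ".", "0", '"/>', " Hello", " there"]
safe_chunks = list(buffer_tags(tokens))
# safe_chunks == ['<speed ratio="1.0"/>', ' Hello', ' there']
```

A literal `<` in ordinary prose (for example "3 < 4") would be held back until the end of the stream under this assumption, so a production version may want a stricter check for known tag names.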
Volume
You can guide the volume of a TTS generation with a volume tag, which takes a number between 0.5
and 2.0. The default volume is 1.
<volume ratio="0.5"/> I will speak softly.
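Because both tags accept only a limited range, it can help to validate the ratio before building the tag. The sketch below uses hypothetical helper names (`speed_tag`, `volume_tag`) and assumes the documented bounds are inclusive, which the examples with 1.5 and 0.5 suggest but the text does not state outright.

```python
def _ratio_tag(name: str, ratio: float, low: float, high: float) -> str:
    """Build a self-closing tag such as <speed ratio="1.5"/> after a range check."""
    # Assumption: the documented bounds are inclusive.
    if not low <= ratio <= high:
        raise ValueError(f"{name} ratio must be between {low} and {high}, got {ratio}")
    return f'<{name} ratio="{ratio}"/>'


def speed_tag(ratio: float) -> str:
    """Speed tag; the documented range is 0.6 to 1.5."""
    return _ratio_tag("speed", ratio, 0.6, 1.5)


def volume_tag(ratio: float) -> str:
    """Volume tag; the documented range is 0.5 to 2.0."""
    return _ratio_tag("volume", ratio, 0.5, 2.0)


# Prefix a transcript with the tags before sending it for synthesis.
transcript = f"{volume_tag(0.5)} I will speak softly."
```

Building the tag as one string also means it is sent in a single piece, which sidesteps the token-splitting problem described for streamed input.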
Emotion (Beta)
Emotion control is highly experimental, particularly when emotion shifts occur
mid-generation. For best results, use voices tagged as “Emotive”, as emotions
may not work reliably with other voice types.
<emotion value="angry" /> I will not allow you to continue this! <emotion value="sad" /> I was hoping for a peaceful resolution.
Pauses and breaks
To insert breaks (or pauses) in generated speech, use a break tag with a single attribute, time. For
example, <break time="1s" />. You can specify the time in seconds (s) or milliseconds (ms).
For accounting purposes, these tags are considered 1 character and do not need to be separated from adjacent text with a
space. To save credits, you can remove the spaces around break tags.
Hello, my name is Sonic.<break time="1s"/>Nice to meet you.
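If pause lengths are computed in code, a small formatter keeps the time attribute consistent. This is only a sketch: `break_tag` is a hypothetical helper, and it assumes durations are whole, positive numbers of milliseconds. It uses the seconds unit when the duration is a whole number of seconds and milliseconds otherwise.

```python
def break_tag(milliseconds: int) -> str:
    """Format a pause as a break tag, using "s" or "ms" as the unit."""
    # Assumption: only positive, whole-millisecond durations are meaningful.
    if milliseconds <= 0:
        raise ValueError("pause duration must be positive")
    if milliseconds % 1000 == 0:
        return f'<break time="{milliseconds // 1000}s"/>'
    return f'<break time="{milliseconds}ms"/>'


# No spaces around the tag, since it does not need them.
text = "Hello, my name is Sonic." + break_tag(1000) + "Nice to meet you."
```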
Spelling out numbers and letters
To spell out input text, you can wrap it in <spell> tags.
This is particularly useful for pronouncing long numbers or identifiers, such as credit card numbers, phone numbers, or unique IDs.
My name is Bob, spelled <spell>Bob</spell>, my account number is <spell>ABC-123</spell>, my phone number is <spell>(123) 456-7890</spell>, and my credit card is <spell>1234-5678-9012-3456</spell>.
If you want to spell out numbers or identifiers with planned pauses between groups (e.g. pausing between the area code of a phone number and the rest of that number), you can combine <break> and <spell> tags. As with break tags, spell tags are considered 1 character and do not need to be separated from adjacent text with a space. To save credits, you can remove the spaces around break and spell tags.
My phone number is <spell>(123)</spell><break time="200ms"/><spell>4712177</spell> and my credit card number is <spell>1234</spell><break time="200ms"/><spell>5678</spell><break time="200ms"/><spell>6347</spell><break time="200ms"/><spell>4537</spell>.
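The grouped pattern above can be generated rather than typed by hand. The following sketch uses a hypothetical helper, `spell_groups`, which wraps each group in spell tags and joins the groups with a fixed break; the 200 ms default simply mirrors the example and is not a recommended value.

```python
from typing import Sequence


def spell_groups(groups: Sequence[str], pause_ms: int = 200) -> str:
    """Wrap each group in <spell> tags, separated by a break of pause_ms."""
    pause = f'<break time="{pause_ms}ms"/>'
    # No spaces are added, since the tags do not need them.
    return pause.join(f"<spell>{group}</spell>" for group in groups)


card = spell_groups(["1234", "5678", "6347", "4537"])
sentence = f"My credit card number is {card}."
```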