Tags for volume, speed, and emotion are in beta and subject to change.
Speed
You can guide the speed of a TTS generation with a speed tag, which takes a scalar between 0.6 and 1.5.
This value is roughly a multiplier on the default speed. For example, 1.5 will generate audio at roughly 1.5x the
default speed.
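As a rough illustration of the multiplier rule, the sketch below estimates how long a clip would run at a given speed. The function name and the 12-second baseline are hypothetical, and because the value is only roughly a multiplier, the result is an estimate rather than a guarantee.

```python
SPEED_MIN, SPEED_MAX = 0.6, 1.5  # range stated in the text above


def estimated_duration(default_seconds: float, speed: float) -> float:
    """Approximate clip length at `speed`, treating speed as a plain multiplier."""
    if not SPEED_MIN <= speed <= SPEED_MAX:
        raise ValueError(
            f"speed must be between {SPEED_MIN} and {SPEED_MAX}, got {speed}"
        )
    # Speaking 1.5x faster means the same text takes 1/1.5 of the time.
    return default_seconds / speed


# A clip that takes 12 seconds at the default speed (hypothetical baseline):
print(estimated_duration(12.0, 1.5))  # 8.0
print(round(estimated_duration(12.0, 0.6), 3))  # 20.0
```

Rejecting out-of-range values up front is a design choice for this sketch; how the service itself treats them is not stated here.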
Volume
You can guide the volume of a TTS generation with a volume tag, which takes a number between 0.5
and 2.0. The default volume is 1.
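If volume values come from user input, it may help to normalize them before building a request. The helper below is a minimal sketch with a hypothetical name: it falls back to the default of 1 when no value is given and clamps anything else into the documented range. Clamping is an assumption of this sketch, not documented service behavior.

```python
from typing import Optional

VOLUME_MIN, VOLUME_MAX, VOLUME_DEFAULT = 0.5, 2.0, 1.0  # values from the text above


def normalize_volume(requested: Optional[float] = None) -> float:
    """Return a volume inside the documented range, defaulting to 1."""
    if requested is None:
        return VOLUME_DEFAULT
    # Clamp instead of raising, so a slider overshoot still yields a usable value.
    return max(VOLUME_MIN, min(VOLUME_MAX, float(requested)))


print(normalize_volume())      # 1.0
print(normalize_volume(3))     # 2.0
print(normalize_volume(0.1))   # 0.5
print(normalize_volume(1.25))  # 1.25
```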
Emotion (Beta)
Emotion control is highly experimental, particularly when emotion shifts occur
mid-generation. For best results, use voices tagged as “Emotive”, as emotions
may not work reliably with other voice types.
Pauses and breaks
To insert breaks (or pauses) in generated speech, use <break> tags with one attribute, time. For
example, <break time="1s" />. You can specify the time in seconds (s) or milliseconds (ms).
For accounting purposes, these tags are considered 1 character and do not need to be separated from adjacent text by a
space. To save credits, you can remove the spaces around break tags.
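A small helper can keep break tags consistent when transcripts are assembled in code. The sketch below uses a hypothetical function name and takes a whole number of milliseconds, emitting seconds when the value divides evenly. It assumes only the tag format shown in the example above; whether fractional or zero durations are accepted is not stated, so the helper avoids both.

```python
def break_tag(milliseconds: int) -> str:
    """Build a break tag in the format shown above, using s or ms units."""
    if milliseconds <= 0:
        raise ValueError("pause length must be a positive number of milliseconds")
    if milliseconds % 1000 == 0:
        # Whole seconds read more naturally as "1s" than "1000ms".
        return f'<break time="{milliseconds // 1000}s" />'
    return f'<break time="{milliseconds}ms" />'


# No spaces are needed between the tag and the surrounding text.
transcript = "Hello." + break_tag(500) + "How are you?"
print(break_tag(1000))  # <break time="1s" />
print(transcript)       # Hello.<break time="500ms" />How are you?
```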
Spelling out numbers and letters
To spell out input text, you can wrap it in <spell> tags.
This is particularly useful for pronouncing long numbers or identifiers, such as credit card numbers, phone numbers, or unique IDs.
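For instance, an identifier can be wrapped before it is placed into a transcript. The helper name below is hypothetical, and the closing </spell> tag is an assumption based on the instruction to wrap text in <spell> tags.

```python
def spell(text: str) -> str:
    """Wrap text in spell tags so it is read out character by character."""
    # Assumes a conventional closing tag, </spell>.
    return f"<spell>{text}</spell>"


order_id = "A1B2C3"  # hypothetical identifier
transcript = f"Your confirmation code is {spell(order_id)}."
print(transcript)  # Your confirmation code is <spell>A1B2C3</spell>.
```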
<break> and <spell> tags are each considered 1 character and do not need to be separated from adjacent text by a space. To save credits, you can remove the spaces around break and spell tags.
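Removing those spaces can be automated. The sketch below is one possible approach using Python's re module: it strips whitespace that touches a break tag or a spell tag and reports how many characters were saved. The function name and sample transcript are hypothetical, and the pattern assumes the tag forms <break ... />, <spell>, and </spell>.

```python
import re

# A break or spell tag together with any whitespace directly touching it.
_TAG_WITH_SPACES = re.compile(r"\s*(<break\b[^>]*/>|</?spell>)\s*")


def strip_spaces_around_tags(text: str) -> str:
    """Remove whitespace adjacent to break and spell tags, keeping the tags."""
    return _TAG_WITH_SPACES.sub(r"\1", text)


before = 'Wait. <break time="1s" /> Your code is <spell>1234</spell> today.'
after = strip_spaces_around_tags(before)
print(after)  # Wait.<break time="1s" />Your code is<spell>1234</spell>today.
print(len(before) - len(after))  # 4
```

Only whitespace next to a tag is touched, so ordinary spacing between words is left alone.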