SSML Tags

The volume, speed, and emotion tags are in beta and subject to change.

Sonic supports tags modeled on SSML (Speech Synthesis Markup Language) to control generated speech. The supported tags are speed, volume, emotion, break, and spell.

Speed

Available on sonic-3 and sonic-3.5.

Note that if you're streaming token by token, you'll need to buffer the whole value of a speed or volume tag so that it arrives in a single input. Passing 1, ., and 0 as separate inputs, for example, results in the tag being read aloud.

You can guide the speed of a TTS generation with a speed tag, which takes a scalar between 0.6 and 1.5. This value is roughly a multiplier on the default speed. For example, 1.5 will generate audio at roughly 1.5x the default speed.

<speed ratio="1.5"/> I like to speak quickly because it makes me sound smart.
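The buffering note above can be sketched in Python. This is a minimal illustration, not part of any Sonic SDK: `buffer_tags` is a hypothetical helper that holds back text from an unclosed `<` until the matching `>` arrives, so each tag is forwarded in one piece. It assumes a literal `<` never appears in ordinary transcript text.

```python
from typing import Iterable, Iterator


def buffer_tags(tokens: Iterable[str]) -> Iterator[str]:
    """Yield chunks of text, never splitting a <...> tag across chunks.

    Hypothetical helper: assumes '<' only ever starts a tag.
    """
    pending = ""
    for token in tokens:
        pending += token
        open_idx = pending.rfind("<")
        if open_idx != -1 and pending.find(">", open_idx) == -1:
            # A tag has been opened but not closed yet: emit the text
            # before it and keep the partial tag buffered.
            if open_idx > 0:
                yield pending[:open_idx]
            pending = pending[open_idx:]
        elif pending:
            # No partial tag at the end, so everything is safe to emit.
            yield pending
            pending = ""
    if pending:
        # Flush whatever is left when the stream ends.
        yield pending
```

Each yielded chunk can then be passed on as a single input, whichever streaming client is in use.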

Volume

Available on sonic-3 and sonic-3.5.

You can guide the volume of a TTS generation with a volume tag, which takes a number between 0.5 and 2.0. The default volume is 1.

<volume ratio="0.5"/> I will speak softly.
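As a rough sketch, the documented ranges for speed and volume can be checked before a tag is built. The helper names `speed_tag` and `volume_tag` are hypothetical, and treating both range endpoints as inclusive is an assumption based on the examples above using 1.5 and 0.5.

```python
def _ratio_tag(name: str, ratio: float, low: float, high: float) -> str:
    # Assumption: the documented bounds are inclusive.
    if not low <= ratio <= high:
        raise ValueError(f"{name} ratio {ratio} is outside [{low}, {high}]")
    return f'<{name} ratio="{ratio}"/>'


def speed_tag(ratio: float) -> str:
    """Build a speed tag; the documented range is 0.6 to 1.5."""
    return _ratio_tag("speed", ratio, 0.6, 1.5)


def volume_tag(ratio: float) -> str:
    """Build a volume tag; the documented range is 0.5 to 2.0."""
    return _ratio_tag("volume", ratio, 0.5, 2.0)


transcript = volume_tag(0.5) + " I will speak softly."
```

Failing early on an out-of-range value avoids sending a tag whose behavior the documentation does not describe.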

Emotion (Beta)

Emotion control is highly experimental, particularly when emotion shifts occur mid-generation. If you need to change the emotion in a transcript, we recommend using separate generation contexts for each emotion. For best results, use Voices tagged as “Emotive”, as emotions may not work reliably with other Voices.

<emotion value="angry"/> I will not allow you to continue this! <emotion value="sad"/> I was hoping for a peaceful resolution.
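One way to follow the recommendation of one generation context per emotion is to split a tagged transcript into per-emotion segments first. The sketch below is illustrative only: `split_by_emotion` is a hypothetical helper, and how each segment is then sent in its own context is left out because it depends on the client in use.

```python
import re
from typing import List, Optional, Tuple

# Matches tags written as in the example above: <emotion value="angry"/>
EMOTION_TAG = re.compile(r'<emotion value="([^"]+)"\s*/>')


def split_by_emotion(transcript: str) -> List[Tuple[Optional[str], str]]:
    """Return (emotion, text) pairs; emotion is None before the first tag."""
    segments: List[Tuple[Optional[str], str]] = []
    emotion: Optional[str] = None
    pos = 0
    for match in EMOTION_TAG.finditer(transcript):
        text = transcript[pos:match.start()].strip()
        if text:
            segments.append((emotion, text))
        emotion = match.group(1)
        pos = match.end()
    tail = transcript[pos:].strip()
    if tail:
        segments.append((emotion, tail))
    return segments
```

Each pair can then be turned back into a single-emotion transcript by prepending the tag to its text.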

Pauses and breaks

Punctuation is the first tool for pausing: a comma or period usually produces a natural, well-paced pause in context. Reserve break tags for when you need an explicit silence of a specific duration. A break tag takes one attribute, time, in seconds (s) or milliseconds (ms):

Hello, my name is Sonic.<break time="1s"/>Nice to meet you.

Break tags split the generation, so the model has less surrounding context and the speech can sound less natural. Avoid placing several break tags in quick succession, which can cause the model to hallucinate. Each tag counts as 1 character and doesn’t need surrounding whitespace.
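The following sketch shows one way to build break tags and to estimate a transcript's character count under the rule that a break tag counts as 1 character. Both helper names are hypothetical, and applying the 1-character rule only to break tags is an assumption drawn from the paragraph above.

```python
import re

# Matches tags written as in the example above: <break time="1s"/>
BREAK_TAG = re.compile(r'<break time="[^"]*"\s*/>')


def break_tag(duration_ms: int) -> str:
    """Build a break tag, using seconds when the duration is a whole second."""
    if duration_ms <= 0:
        raise ValueError("duration must be positive")
    if duration_ms % 1000 == 0:
        return f'<break time="{duration_ms // 1000}s"/>'
    return f'<break time="{duration_ms}ms"/>'


def counted_length(transcript: str) -> int:
    """Length of the transcript with each break tag counted as 1 character."""
    return len(BREAK_TAG.sub("_", transcript))


text = "Hello, my name is Sonic." + break_tag(1000) + "Nice to meet you."
```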

Spelling out numbers and letters

To read input out character by character, wrap it in <spell> tags. This is useful for confirmation codes, order IDs, serial numbers, or spelling a name.

My name is Bob, spelled <spell>Bob</spell>, and my confirmation code is <spell>ABC123</spell>.

The model adds a slight pause between runs of letters and digits automatically. To force a longer pause at a specific point, add a space inside the tag:

Your confirmation code is <spell>ABC 123</spell>.

Avoid other punctuation inside <spell> tags, since it may be read aloud (for example, a period is read as “dot”). For phone numbers, credit card numbers, and similar sequences, write them as a plain string and let text normalization handle the grouping and pacing. Reach for a <spell> tag only when you need a strict character-by-character read-out, and don’t chain <spell> and <break> tags.
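The guidance on spell tags can be enforced with a small guard. This is a sketch under assumptions: `spell` is a hypothetical helper, and allowing only letters, digits, and spaces is a conservative reading of the advice to avoid other punctuation.

```python
def spell(text: str) -> str:
    """Wrap text in <spell> tags, rejecting characters that may be read aloud."""
    if not text.strip():
        raise ValueError("nothing to spell")
    for ch in text:
        # Spaces are allowed because they force a longer pause.
        if not (ch.isalnum() or ch == " "):
            raise ValueError(f"unsupported character in spell tag: {ch!r}")
    return f"<spell>{text}</spell>"


message = f"Your confirmation code is {spell('ABC 123')}."
```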
