Sonic 3.5 is designed to sound natural with minimal prompt engineering. In most cases you can pass your transcript as-is and let the model handle normalization, pacing, and expression. The tips below apply across the Sonic family; differences between Sonic 3.5 and Sonic 3 are called out inline.
Recommendations
- Pass natural, well-punctuated text. Full sentences with normal capitalization and punctuation produce the best pacing and intonation. End each transcript with terminal punctuation (`.`, `?`, `!`).
- Pass numbers, dates, times, and common acronyms in conventional written form unless you have a specific reason to override. The list below gives example shapes to put in the transcript (or to instruct your model to output); it is not shorthand for “ignore formatting.” With typical text normalization enabled, Sonic maps these patterns to natural speech for most inputs:
  - Large numbers like `1,234,567`
  - US phone numbers: `(415) 555-1212`
  - Email addresses: `user@example.com`
  - Dates in `MM/DD/YYYY` (or `DD/MM/YYYY` based on locale): `04/20/2025`
  - Times with a space before AM/PM: `7:00 PM`, `7 PM`, `7:00 P.M.`
  - Common acronyms (`NASA`) and initialisms (`USA`)

  `@` reads as `at` (email addresses), and `()` is silent (phone numbers). When an LLM produces this text, see Voice agents (LLM-authored text) below for how normalization, optional bypass settings, and system prompts fit together.
- Match the voice to the language. Each voice has a primary language it works best with. Use the Playground to audition voices for a given language.
- Keep prompts in their natural written form. Heavy preprocessing (stripping punctuation, forcing all caps) generally hurts output quality.
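As a rough illustration of the recommendations above, the sketch below prepares a transcript in conventional written form before it is sent to Sonic. The helper names (`ensure_terminal_punctuation`, `format_count`, `format_us_phone`) are hypothetical and not part of any Cartesia SDK; they only show one way to produce the shapes listed above.

```python
import re

TERMINAL = (".", "?", "!")


def ensure_terminal_punctuation(transcript: str) -> str:
    """Append a period when the transcript does not already end a sentence."""
    text = transcript.strip()
    if not text:
        return text
    # Look past closing quotes or brackets when checking the last character.
    core = text.rstrip("\"')]")
    if core.endswith(TERMINAL):
        return text
    return text + "."


def format_count(n: int) -> str:
    """Render an integer with thousands separators, e.g. 1,234,567."""
    return f"{n:,}"


def format_us_phone(digits: str) -> str:
    """Render ten digits in the (415) 555-1212 layout."""
    d = re.sub(r"\D", "", digits)  # keep digits only
    if len(d) != 10:
        raise ValueError("expected exactly 10 digits")
    return f"({d[:3]}) {d[3:6]}-{d[6:]}"


# Hypothetical usage: assemble a transcript from structured data.
transcript = ensure_terminal_punctuation(
    f"We have {format_count(1234567)} users. Call {format_us_phone('4155551212')}"
)
```

The point of the sketch is that values stay in their written form (`1,234,567`, `(415) 555-1212`) rather than being expanded into words, leaving the spoken rendering to text normalization.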
Controlling pacing and spelling
When you need character-by-character read-out (confirmation codes, order IDs, serial numbers, spelled-out names) or fine-grained pacing, use one of the following:

- Spell tags (recommended). Wrap the string in `<spell>...</spell>`. This is the most reliable option and works for letters, digits, and mixed alphanumerics in all supported languages.
- Space-delimited characters. If you prefer not to use tags, separate characters with single spaces.
- Commas for pauses between groups. Use commas where a human would naturally pause.
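The three options above can be sketched as two small helpers. The function names `spell_tag` and `spaced` are hypothetical, and the fixed-size grouping is an assumption made for illustration; in practice you would place commas wherever a human would naturally pause.

```python
from typing import Optional


def spell_tag(value: str) -> str:
    """Wrap a string in <spell> tags for character-by-character read-out."""
    return f"<spell>{value}</spell>"


def spaced(value: str, group_size: Optional[int] = None) -> str:
    """Separate characters with single spaces.

    When group_size is given, characters are split into fixed-size groups
    and the groups are joined with commas to mark pauses.
    """
    chars = [c for c in value if not c.isspace()]
    if not group_size:
        return " ".join(chars)
    groups = [chars[i:i + group_size] for i in range(0, len(chars), group_size)]
    return ", ".join(" ".join(g) for g in groups)


tagged = spell_tag("ABC123")      # "<spell>ABC123</spell>"
letters = spaced("HELLO")         # "H E L L O"
code = spaced("ABC123", 3)        # "A B C, 1 2 3"
```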
Migrating from Sonic 3? The recommended delimiter format has changed in Sonic 3.5. Use spaces between characters and commas between groups instead of commas between characters and periods between groups. The old format still works on `sonic-3` snapshots but is no longer recommended going forward.

| Scenario | Old (Sonic 3) | New (Sonic 3.5) |
|---|---|---|
| Spell out letters `HELLO` | `H, E, L, L, O` | `H E L L O` |
| Spell out digits `123456` | `1, 2, 3, 4, 5, 6` | `1 2 3 4 5 6` |
| Confirmation code `ABC123` | `A, B, C. 1, 2, 3.` | `A B C, 1 2 3` |
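If you have stored strings in the old delimiter format, a conversion along the following lines may help. This is a minimal sketch with a hypothetical function name (`migrate_spelling`), and it assumes the input is already a spelled-out sequence in the Sonic 3 style; running it over ordinary prose would mangle normal commas and periods.

```python
def migrate_spelling(old: str) -> str:
    """Convert 'A, B, C. 1, 2, 3.' (Sonic 3 style) to 'A B C, 1 2 3'."""
    # Periods separated groups in the old format.
    groups = [g for g in old.split(".") if g.strip()]
    converted = []
    for group in groups:
        # Commas separated individual characters in the old format.
        chars = [c.strip() for c in group.split(",") if c.strip()]
        converted.append(" ".join(chars))
    # Commas now separate groups; spaces separate characters.
    return ", ".join(converted)


new_letters = migrate_spelling("H, E, L, L, O")
new_code = migrate_spelling("A, B, C. 1, 2, 3.")
```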
Voice agents (LLM-authored text)
When a language model writes the transcript (for example, in a voice agent), apply the same spell-tag and spacing rules as in Controlling pacing and spelling. A few extra guidelines:

- What to output. Recommendations lists literal text shapes for Sonic (or for your LLM to emit): `12%`, common phone and email layouts, typical dates, and similar. It is normal to repeat those shapes in your system prompt on purpose so behavior stays predictable; this doc is not telling you to stay vague.
- Normalization and explicitness. When text normalization is enabled (the common default), those conventional forms often read well without spelling everything out in prose (for example, rewriting `12%` as “twelve percent”). Some integrations or vendors expose an option to skip or bypass normalization for latency or control; if yours does, plan for more explicit spoken wording instead. For recurring misreads, add custom pronunciations or a narrow LLM rule before reaching for a long catch-all prompt.
- Prompt size: prefer the smallest system prompt that passes your tests; expand when you change pipeline settings or hit new edge cases.
- Codes and IDs: prefer `<spell>...</spell>` when your client passes tags through to Sonic; otherwise use spaces between characters and commas between groups (Controlling pacing and spelling). NATO phonetics (`Alpha`, `Bravo`) are a valid choice when you want the listener to disambiguate letters clearly (models often handle them well). `<spell>` and the spaced formats remain the most deterministic for Sonic pacing and tag behavior.
- 24-hour times: in some locales, a written 24-hour time (e.g. `14:30`) may be normalized to a more colloquial 12-hour style when spoken; English and Hindi do not behave the same as every other language here, and the stack is still evolving toward options like stricter read-as-written behavior. Validate in your target language and voice if you need speech to match clock digits literally, then adjust the system prompt or custom pronunciations.
- Markdown and machine-shaped text: if the reply is read verbatim, avoid markdown (lists, `#` headers, `**bold**`), raw JSON, emoji, and other symbols or special characters that TTS may speak oddly, unless your client strips or normalizes them before Sonic. Many teams use a single rule covering bullets, `*`, and non-spoken punctuation.
- Streaming: when streaming tokens into TTS, use continuations as in Streaming below.

If your client does not pass `<spell>` or other tags through to Sonic, omit the spell-tag lines and use the spaced-format fallback only.
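
For clients that strip machine-shaped text before Sonic, the cleanup step might look like the sketch below. The function name `strip_markdown_for_tts` and the emoji ranges are assumptions chosen for illustration; the patterns cover only common markdown (headers, bullets, asterisk emphasis, inline code, links) and do not handle raw JSON or every symbol a model might emit.

```python
import re

# Assumed emoji coverage: common pictograph blocks plus the variation selector.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")


def strip_markdown_for_tts(text: str) -> str:
    """Remove common markdown markers and emoji so the text reads as plain speech."""
    # Leading '#' header markers.
    text = re.sub(r"^[ \t]{0,3}#{1,6}[ \t]+", "", text, flags=re.MULTILINE)
    # Bullet and numbered-list markers at the start of a line.
    text = re.sub(r"^[ \t]*(?:[-*+]|\d+\.)[ \t]+", "", text, flags=re.MULTILINE)
    # Asterisk emphasis (*x*, **x**, ***x***); '2 * 3' is left alone.
    text = re.sub(r"(\*{1,3})(?=\S)(.+?)(?<=\S)\1", r"\2", text)
    # Inline code backticks.
    text = re.sub(r"`([^`]*)`", r"\1", text)
    # Links: keep the label, drop the URL.
    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)
    text = EMOJI.sub("", text)
    # Tidy whitespace left behind by the removals.
    text = re.sub(r"[ \t]{2,}", " ", text)
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    return text.strip()


reply = "# Summary\n- **Total:** 12%\n- Call `(415) 555-1212` \U0001F389"
clean = strip_markdown_for_tts(reply)
```

Note that conventional shapes such as `12%` and `(415) 555-1212` pass through unchanged, which keeps them available for text normalization.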