Sonic 3.5 is designed to sound natural with minimal prompt engineering. In most cases you can pass your transcript as-is and let the model handle normalization, pacing, and expression. The tips below apply across the Sonic family; differences between Sonic 3.5 and Sonic 3 are called out inline.
Recommendations
- Pass natural, well-punctuated text. Full sentences with normal capitalization and punctuation produce the best pacing and intonation. End each transcript with terminal punctuation (., ?, !).
- Send complete phrases. Full sentences sound more natural than isolated fragments or single words. Don’t send a number, code, or spell tag on its own; include the surrounding sentence, e.g. Your confirmation code is <spell>ABC123</spell>.
- Use normal casing. Reserve all-caps for acronyms you want read out letter by letter (e.g. USA). Other all-caps words may be misread as initialisms (e.g. NASA). Avoid using capitalization for emphasis or to indicate shouting.
- Pass numbers, currency, dates, and common acronyms in conventional written form. Sonic maps these patterns to natural speech for most inputs:
  - Large numbers like 1,234,567
  - Currency like $19.99
  - US phone numbers: (415) 555-1212
  - Street addresses like 123 Main St
  - Email addresses: user@example.com
  - Dates in MM/DD/YYYY: 04/20/2025
  - Times with a space before AM/PM: 7:00 PM, 7 PM, 7:00 P.M.
  - Common acronyms (NASA) and initialisms (USA)

  Symbols are handled naturally: @ reads as "at" (in email addresses), and () is silent (in US phone numbers). When an LLM writes the transcript, see Voice agents (LLM-authored text).
- Match the voice to the language. Each voice has a primary language it works best with. Use the Playground to audition voices for a given language.
- Keep prompts in their natural written form. Heavy preprocessing (stripping punctuation, forcing casing) generally hurts output quality.
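The terminal-punctuation recommendation above can be enforced with a very light touch that leaves the rest of the transcript alone. The following is a minimal sketch, assuming transcripts arrive as plain Python strings; `ensure_terminal_punctuation` is a hypothetical helper name, not part of any Cartesia SDK.

```python
TERMINAL = (".", "?", "!")

def ensure_terminal_punctuation(transcript: str, default: str = ".") -> str:
    """Append `default` only when the transcript does not already end a sentence."""
    text = transcript.strip()
    if not text:
        return text
    # Look past closing quotes and brackets so 'She said "hi."' is left untouched.
    core = text.rstrip("\"')]")
    if core.endswith(TERMINAL):
        return text
    return text + default

print(ensure_terminal_punctuation("Your order has shipped"))
print(ensure_terminal_punctuation("Your code is <spell>ABC123</spell>"))
```

The helper deliberately avoids changing casing or internal punctuation, in line with the advice against heavy preprocessing.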
Pre-normalization
Sonic 3.5 automatically covers the common cases above for most inputs. If something is still misread, whether because of an unusual input or a bug, you can pre-normalize the text as a fallback: have your LLM write the transcript fully spelled out, the way it would be spoken.
| Written | Spoken (fully normalized) |
|---|---|
| $123.50 | one hundred twenty-three dollars and fifty cents |
| Dr. Smith | Doctor Smith |
| 14:30 | two thirty PM |
Pre-normalizing is a fallback for edge cases. Well-punctuated text in conventional form is read correctly in the large majority of cases.
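If you would rather pre-normalize in code than in the LLM prompt, the three rows of the table above can be reproduced with a few regular expressions. This is only a sketch under stated assumptions: `pre_normalize` and its helpers are hypothetical names, amounts are limited to under $1,000 with two decimal places, times are assumed to be 24-hour unless followed by AM/PM, and "Dr." is assumed to mean "Doctor" (not "Drive") when followed by a capitalized word.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def number_to_words(n: int) -> str:
    """Spell out 0..999; larger values are out of scope for this sketch."""
    if not 0 <= n <= 999:
        raise ValueError("sketch only covers 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"
    hundreds, rest = divmod(n, 100)
    head = f"{ONES[hundreds]} hundred"
    return head if rest == 0 else f"{head} {number_to_words(rest)}"

def _currency(match: re.Match) -> str:
    dollars, cents = int(match.group(1)), int(match.group(2))
    spoken = f"{number_to_words(dollars)} {'dollar' if dollars == 1 else 'dollars'}"
    if cents:
        spoken += f" and {number_to_words(cents)} {'cent' if cents == 1 else 'cents'}"
    return spoken

def _time_24h(match: re.Match) -> str:
    hour, minute = int(match.group(1)), int(match.group(2))
    suffix = "AM" if hour < 12 else "PM"
    hour_words = number_to_words(hour % 12 or 12)
    if minute == 0:
        return f"{hour_words} {suffix}"
    if minute < 10:
        return f"{hour_words} oh {number_to_words(minute)} {suffix}"
    return f"{hour_words} {number_to_words(minute)} {suffix}"

# 24-hour times only: skip anything already followed by AM/PM or A.M./P.M.
TIME_24H = re.compile(r"\b([01]?\d|2[0-3]):([0-5]\d)\b(?!\s*[AaPp]\.?[Mm])")

def pre_normalize(text: str) -> str:
    text = re.sub(r"\$(\d{1,3})\.(\d{2})\b", _currency, text)
    text = TIME_24H.sub(_time_24h, text)
    text = re.sub(r"\bDr\.(?=\s+[A-Z])", "Doctor", text)
    return text

print(pre_normalize("Dr. Smith owes $123.50 by 14:30."))
```

Inputs that fall outside these narrow patterns, such as "$1,234.50" or "7:00 PM", are passed through unchanged so that Sonic's own normalization can handle them.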
Controlling pacing and spelling
When you need character-by-character read-out (confirmation codes, order IDs, serial numbers, spelled-out names) or fine-grained pacing, use one of the following:
- Spell tags (recommended). Wrap the string in <spell>...</spell>. This is the most reliable option and works for letters, digits, and mixed alphanumerics in all supported languages.
  Your confirmation code is <spell>AB12CD</spell>.
- Space-delimited characters. Separate characters with single spaces for a natural spelling pace.
  Your code is A B C 1 2 3.
- Comma-delimited characters. If your use case requires longer pauses, add a comma and a space after each character.
  Your code is A, B, C, 1, 2, 3.
- Semantic grouping. For more natural pacing, use spaces and add commas where a human would naturally pause.
  Your code is A B C, 1 2 3.
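The four formats above are easy to generate from a raw code string. The sketch below uses hypothetical helper names and assumes fixed-size groups for the semantic-grouping case; real codes may group more naturally at other boundaries.

```python
def spell_tag(code: str) -> str:
    """Wrap the code in a spell tag (the recommended option)."""
    return f"<spell>{code}</spell>"

def space_delimited(code: str) -> str:
    """Single spaces between characters for a natural spelling pace."""
    return " ".join(code)

def comma_delimited(code: str) -> str:
    """Comma and space after each character for longer pauses."""
    return ", ".join(code)

def grouped(code: str, size: int = 3) -> str:
    """Spaces within groups of `size` characters, commas between groups."""
    groups = [code[i:i + size] for i in range(0, len(code), size)]
    return ", ".join(" ".join(group) for group in groups)

code = "ABC123"
print(f"Your confirmation code is {spell_tag(code)}.")
print(f"Your code is {space_delimited(code)}.")
print(f"Your code is {comma_delimited(code)}.")
print(f"Your code is {grouped(code)}.")
```

Each helper returns only the spelled fragment, so it can be embedded in a full sentence as the recommendations suggest.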
Migrating from Sonic 3? The recommended delimiter format has changed in Sonic 3.5. Separate characters with spaces or commas, and put a comma between groups. Don’t put periods between characters or mix commas and periods; that format still works on sonic-3 snapshots but is not recommended for Sonic 3.5.
| Scenario | Old (Sonic 3) | New (Sonic 3.5) |
|---|---|---|
| Spell out letters HELLO | H. E. L. L. O. | H E L L O |
| Spell out digits 123456 | 1. 2. 3. 4. 5. 6. | 1 2 3 4 5 6 |
| Confirmation code ABC123 | A, B, C. 1, 2, 3. | A B C, 1 2 3 |
| Slow, digit-by-digit 266AO48 | 2. 6. 6. A. O. 4. 8. | 2, 6, 6, A, O, 4, 8 |
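If you have stored strings in the old period-delimited style, a small converter can rewrite them into the new format. The sketch below is an illustration, not an official migration tool: `migrate_spelled_sequence` is a hypothetical name, and it assumes that when commas are present, periods mark group boundaries, while in comma-free input periods simply separate characters. It operates on the spelled fragment only, not on a whole sentence.

```python
import re

def migrate_spelled_sequence(old: str, slow: bool = False) -> str:
    """Convert a Sonic 3 style period-delimited sequence to the Sonic 3.5 style."""
    body = old.strip().rstrip(".")
    if "," in body:
        # Mixed style: commas separate characters, periods separate groups.
        groups = [re.split(r",\s*", part.strip()) for part in body.split(".")]
    else:
        # Period-only style: every period separates two characters.
        groups = [[char.strip() for char in body.split(".")]]
    groups = [[char for char in group if char] for group in groups]
    groups = [group for group in groups if group]
    if slow:
        # Comma after every character for a slow read-out.
        return ", ".join(char for group in groups for char in group)
    # Spaces within a group, commas between groups.
    return ", ".join(" ".join(group) for group in groups)

print(migrate_spelled_sequence("H. E. L. L. O."))
print(migrate_spelled_sequence("A, B, C. 1, 2, 3."))
print(migrate_spelled_sequence("2. 6. 6. A. O. 4. 8.", slow=True))
```

The `slow` flag is needed because the old format does not say whether a slow or natural pace was intended; the caller has to choose.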
Voice agents (LLM-authored text)
Starter system prompt (v1). This is a baseline you can paste and trim for your product. If your stack does not pass <spell> or other tags through to Sonic, omit the spell-tag lines and use the spaced-format fallback only.
You are a voice agent. Everything you output will be spoken aloud by Cartesia Sonic text-to-speech.
Goals:
- Sound natural: full sentences, normal capitalization, end with . ? or !
- Use conventional written forms and let text normalization speak them: numbers, currency like $19.99, dates, common acronyms, US phone numbers like (415) 555-1212, emails like user@example.com, symbols like 12%.
- Use natural punctuation (commas, periods) for pauses. Avoid SSML break tags except for a deliberate fixed-duration silence, and never place several in quick succession. To read something slowly — a confirmation code, a legal disclaimer — use a speed tag or comma-delimited characters rather than break tags.
- For confirmation codes, reference numbers, or mixed IDs: use <spell>...</spell> when supported, else delimit the characters — spaces (A B C 1 2 3) for a natural pace or commas (A, B, C, 1, 2, 3) to slow it down. NATO phonetics (Alpha, Bravo) work when the listener needs to disambiguate letters.
- Avoid markdown, raw JSON, emoji, and stray symbols like brackets, curly braces, and quotation marks in spoken output unless your client strips them; write plain prose.
- For unusual proper nouns or product names that misread, give a short spoken-friendly form or rely on app-level pronunciation settings when available.
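The prompt above asks the LLM to avoid markdown and emoji "unless your client strips them". A client-side filter for that case might look like the sketch below. `strip_for_speech` is a hypothetical helper; the emoji ranges are a rough approximation rather than a complete list, and removing every asterisk, backtick, and tilde is a blunt rule that could also affect text such as "5 * 3". Tags such as <spell> pass through untouched.

```python
import re

def strip_for_speech(text: str) -> str:
    """Remove common markdown and emoji from LLM output before sending it to TTS."""
    # Markdown links: keep the label, drop the URL.
    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)
    # Heading markers, bullets, and list numbers at the start of a line.
    text = re.sub(r"^\s{0,3}(#{1,6}|[-*+]|\d+\.)\s+", "", text, flags=re.MULTILINE)
    # Emphasis, inline-code, and strikethrough markers.
    text = re.sub(r"[*`~]+", "", text)
    # Common emoji and pictograph ranges (approximate, not exhaustive).
    text = re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]", "", text)
    # Collapse the whitespace left behind.
    return re.sub(r"\s+", " ", text).strip()

print(strip_for_speech("**Great news!** Your [order](https://example.com/o) shipped 🎉"))
```

Sentence punctuation is left in place, since it drives pacing and intonation.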
Inserting pauses
Use natural punctuation for pauses — a comma or period usually produces the right pause in context. For an explicit, fixed-duration silence, use a break tag. Break tags split the generation, so they can sound less natural; avoid placing several in quick succession, which can cause hallucinations. Each tag counts as a single character and doesn’t need surrounding whitespace.
Pronunciation
For proper nouns, trademarks, and domain-specific terms — or to disambiguate identical spellings (e.g. Nice, the city, vs. nice, the adjective) — use custom pronunciations.
Streaming
Use continuations when generating chunks of audio that need to sound contiguous (for example, LLM-streamed output). This preserves prosody and voice consistency across chunk boundaries.
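When the text comes from a token stream, it usually helps to buffer tokens into complete sentences before sending each chunk, which also follows the "send complete phrases" recommendation. The sketch below shows only that buffering step, using a hypothetical `sentence_chunks` generator. How each chunk is submitted as a continuation depends on the Cartesia API and is not shown here. The boundary rule (terminal punctuation followed by whitespace) is an assumption and will split too early on abbreviations such as "Dr. Smith".

```python
import re
from typing import Iterable, Iterator

# A sentence boundary: whitespace that follows ., ?, or !
SENTENCE_END = re.compile(r"(?<=[.?!])\s+")

def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed tokens into complete sentences."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        # The last part may still be growing; keep it buffered.
        buffer = parts[-1]
    # Flush whatever remains when the stream ends.
    if buffer.strip():
        yield buffer.strip()

stream = ["Hello", " there.", " How", " are", " you?"]
for chunk in sentence_chunks(stream):
    print(chunk)
```

A sentence is only emitted once the following token confirms the boundary, so the final sentence is flushed when the stream ends.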