Prompting tips - Cartesia Docs

Sonic 3.5 is designed to sound natural with minimal prompt engineering. In most cases you can pass your transcript as-is and let the model handle normalization, pacing, and expression. The tips below apply across the Sonic family; differences between Sonic 3.5 and Sonic 3 are called out inline.

Recommendations

Pass natural, well-punctuated text. Full sentences with normal capitalization and punctuation produce the best pacing and intonation. End each transcript with terminal punctuation (., ?, !).
Send complete phrases. Full sentences sound more natural than isolated fragments or single words. Don’t send a number, code, or spell tag on its own — include the surrounding sentence, e.g. Your confirmation code is <spell>ABC123</spell>.
Use normal casing. Reserve all-caps for initialisms you want read out letter by letter (e.g. USA) and for common acronyms (e.g. NASA). Other all-caps words may be misread as initialisms and spelled out, so avoid using capitalization for emphasis or to indicate shouting.
Pass numbers, currency, dates, and common acronyms in conventional written form. Sonic maps these patterns to natural speech for most inputs:
- Large numbers like 1,234,567
- Currency like $19.99
- US phone numbers: (415) 555-1212
- Street addresses like 123 Main St
- Email addresses: user@example.com
- Dates in MM/DD/YYYY: 04/20/2025
- Times with a space before AM/PM: 7:00 PM, 7 PM, 7:00 P.M.
- Common acronyms (NASA) and initialisms (USA)
Symbols are handled naturally — @ reads as at (email addresses), () is silent (for US phone numbers). When an LLM writes the transcript, see Voice agents (LLM-authored text).
Match the voice to the language. Each voice has a primary language it works best with. Use the Playground to audition voices for a given language.
Keep prompts in their natural written form. Heavy preprocessing (stripping punctuation, forcing casing) generally hurts output quality.
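As a minimal sketch of the lightest possible preprocessing, the helper below only trims whitespace and appends a period when terminal punctuation is missing. The function name `ensure_terminal_punctuation` is hypothetical and not part of any Cartesia SDK; everything else in the transcript is left in its natural written form.

```python
TERMINAL = (".", "?", "!")

def ensure_terminal_punctuation(transcript: str) -> str:
    """Trim surrounding whitespace and append a period if the transcript
    does not already end with ., ?, or !. Nothing else is changed."""
    text = transcript.strip()
    if text and not text.endswith(TERMINAL):
        text += "."
    return text

print(ensure_terminal_punctuation("Your confirmation code is <spell>ABC123</spell>"))
```

Casing, punctuation inside the sentence, and tags are deliberately untouched, in line with the advice against heavy preprocessing.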

Pre-normalization

Sonic 3.5 automatically covers the common cases above for most inputs. If you hit an unusual case or a bug where something is still misread, you may consider pre-normalizing your text as a fallback. Have your LLM write the transcript fully spelled out, the way it would be spoken.

Written	Spoken (fully normalized)
`$123.50`	one hundred twenty-three dollars and fifty cents
`Dr. Smith`	Doctor Smith
`14:30`	two thirty PM

Pre-normalizing is a fallback for edge cases. Well-punctuated text in conventional form is read correctly in the large majority of cases.
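If you do need the fallback and cannot rely on the LLM to spell things out, the three rows in the table above can be approximated in code. This is a rough sketch under stated assumptions, not a full text normalizer: the helper names (`number_words`, `pre_normalize`) are hypothetical, amounts are limited to $0.00 through $999.99, times are assumed to be 24-hour `HH:MM`, and `Dr.` is only expanded when a capitalized word follows.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def number_words(n: int) -> str:
    """Spell out an integer from 0 to 999 (the range this sketch covers)."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only covers 0-999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    return ONES[hundreds] + " hundred" + (" " + number_words(rest) if rest else "")

def _spoken_time(m: re.Match) -> str:
    hour, minute = int(m.group(1)), int(m.group(2))
    suffix = "PM" if hour >= 12 else "AM"
    hour_words = number_words(hour % 12 or 12)
    if minute == 0:
        return f"{hour_words} {suffix}"
    minute_words = ("oh " if minute < 10 else "") + number_words(minute)
    return f"{hour_words} {minute_words} {suffix}"

def _spoken_dollars(m: re.Match) -> str:
    dollars, cents = int(m.group(1)), int(m.group(2))
    spoken = f"{number_words(dollars)} dollar{'' if dollars == 1 else 's'}"
    if cents:
        spoken += f" and {number_words(cents)} cent{'' if cents == 1 else 's'}"
    return spoken

def pre_normalize(text: str) -> str:
    """Rewrite a few written forms the way they would be spoken."""
    # "Dr." followed by a capitalized word is treated as a title.
    text = re.sub(r"\bDr\.(?=\s+[A-Z])", "Doctor", text)
    # 24-hour times; skipped when an AM/PM marker already follows.
    text = re.sub(r"\b([01]?\d|2[0-3]):([0-5]\d)\b(?!\s*[AaPp]\.?[Mm])",
                  _spoken_time, text)
    # Dollar amounts below one thousand, with exactly two decimal places.
    text = re.sub(r"\$(\d{1,3})\.(\d{2})(?!\d)", _spoken_dollars, text)
    return text

print(pre_normalize("Dr. Smith will see you at 14:30. The fee is $123.50."))
```

Inputs outside these narrow patterns are left unchanged, so the model's own normalization still handles them.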

Controlling pacing and spelling

When you need character-by-character read-out (confirmation codes, order IDs, serial numbers, spelled-out names) or fine-grained pacing, use one of the following:

Spell tags (recommended). Wrap the string in <spell>...</spell>. This is the most reliable option and works for letters, digits, and mixed alphanumerics in all supported languages.
```
Your confirmation code is <spell>AB12CD</spell>.
```
Space-delimited characters. Alternatively, separate the characters with single spaces for a natural spelling pace.
```
Your code is A B C 1 2 3.
```
Comma-delimited characters. If your use case requires longer pauses, you can add a comma and a space after each character.
```
Your code is A, B, C, 1, 2, 3.
```
Semantic grouping. For more natural pacing, use spaces and add commas where a human would naturally pause.
```
Your code is A B C, 1 2 3.
```
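The four formats above are simple string transformations, so they can be generated in application code. The sketch below is illustrative only: the function names are hypothetical, and for semantic grouping the caller decides where the groups fall, since that depends on where a human would naturally pause.

```python
def spell_tag(code: str) -> str:
    """Wrap the code in a spell tag (the recommended option)."""
    return f"<spell>{code}</spell>"

def space_delimited(code: str) -> str:
    """Single spaces between characters for a natural spelling pace."""
    return " ".join(code)

def comma_delimited(code: str) -> str:
    """A comma and a space after each character for longer pauses."""
    return ", ".join(code)

def grouped(groups: list[str]) -> str:
    """Spaces inside each group, a comma between groups."""
    return ", ".join(" ".join(group) for group in groups)

print(f"Your confirmation code is {spell_tag('AB12CD')}.")
print(f"Your code is {space_delimited('ABC123')}.")
print(f"Your code is {comma_delimited('ABC123')}.")
print(f"Your code is {grouped(['ABC', '123'])}.")
```

Each call embeds the formatted code in a full sentence rather than sending it on its own.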

Migrating from Sonic 3? The recommended delimiter format has changed in Sonic 3.5. Separate characters with spaces or commas, and put a comma between groups. Don’t put periods between characters or mix commas and periods. The period-based format still works on sonic-3 snapshots but is not recommended for Sonic 3.5.

Scenario	Old (Sonic 3)	New (Sonic 3.5)
Spell out letters `HELLO`	`H. E. L. L. O.`	`H E L L O`
Spell out digits `123456`	`1. 2. 3. 4. 5. 6.`	`1 2 3 4 5 6`
Confirmation code `ABC123`	`A, B, C. 1, 2, 3.`	`A B C, 1 2 3`
Slow, digit-by-digit `266AO48`	`2. 6. 6. A. O. 4. 8.`	`2, 6, 6, A, O, 4, 8`
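If existing prompts or templates already contain the old period-based format, the rows in the table can be converted mechanically. The function below is a sketch based only on the four examples above; `migrate_delimiters` and its `slow` flag are hypothetical names, and it expects just the spelled-out fragment, not a whole sentence.

```python
def migrate_delimiters(old: str, slow: bool = False) -> str:
    """Convert a Sonic 3 style spelled-out fragment to the Sonic 3.5 style.

    - Periods between every character become spaces (or commas if slow=True).
    - Commas inside groups with periods between groups become spaces inside
      groups with commas between groups.
    """
    text = old.strip().rstrip(".").strip()
    if "," in text:
        # Old style: "A, B, C. 1, 2, 3." -> groups are separated by periods.
        groups = [g for g in text.split(".") if g.strip()]
        return ", ".join(
            " ".join(c.strip() for c in g.split(",") if c.strip())
            for g in groups
        )
    # Old style: "H. E. L. L. O." -> one character per period.
    chars = [c.strip() for c in text.split(".") if c.strip()]
    return (", " if slow else " ").join(chars)

print(migrate_delimiters("H. E. L. L. O."))
print(migrate_delimiters("A, B, C. 1, 2, 3."))
print(migrate_delimiters("2. 6. 6. A. O. 4. 8.", slow=True))
```

The `slow` flag exists because the old format used the same notation for both normal and slow read-outs, while the new format distinguishes them.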

Voice agents (LLM-authored text)

Starter system prompt (v1). Use this as a baseline that you can paste and trim for your product. If your stack does not pass <spell> or other tags through to Sonic, omit the spell-tag lines and use the spaced-format fallback only.

You are a voice agent. Everything you output will be spoken aloud by Cartesia Sonic text-to-speech.

Goals:
- Sound natural: full sentences, normal capitalization, end with . ? or !
- Use conventional written forms and let text normalization speak them: numbers, currency like $19.99, dates, common acronyms, US phone numbers like (415) 555-1212, emails like user@example.com, symbols like 12%.
- Use natural punctuation (commas, periods) for pauses. Avoid SSML break tags except for a deliberate fixed-duration silence, and never place several in quick succession. To read something slowly — a confirmation code, a legal disclaimer — use a speed tag or comma-delimited characters rather than break tags.
- For confirmation codes, reference numbers, or mixed IDs: use <spell>...</spell> when supported, else delimit the characters — spaces (A B C 1 2 3) for a natural pace or commas (A, B, C, 1, 2, 3) to slow it down. NATO phonetics (Alpha, Bravo) work when the listener needs to disambiguate letters.
- Avoid markdown, raw JSON, emoji, and stray symbols like brackets, curly braces, and quotation marks in spoken output unless your client strips them; write plain prose.
- For unusual proper nouns or product names that misread, give a short spoken-friendly form or rely on app-level pronunciation settings when available.
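The system prompt asks the LLM to avoid markdown, emoji, and stray symbols unless your client strips them. If you prefer to strip them client-side as a safety net, a small filter along these lines may be enough. This is a sketch with assumptions: `strip_for_speech` is a hypothetical name, the emoji ranges are approximate, and angle-bracket tags such as <spell> are deliberately left alone.

```python
import re

# Approximate emoji and pictograph ranges; not an exhaustive list.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def strip_for_speech(text: str) -> str:
    """Remove markup an LLM may emit that should not be read aloud."""
    # Markdown heading markers at the start of a line.
    text = re.sub(r"^\s{0,3}#{1,6}\s+", "", text, flags=re.MULTILINE)
    # Markdown bullet markers at the start of a line.
    text = re.sub(r"^\s*[-*]\s+", "", text, flags=re.MULTILINE)
    # Emphasis asterisks and inline-code backticks.
    text = re.sub(r"[*`]", "", text)
    # Square brackets, curly braces, straight and curly double quotes.
    text = re.sub(r'[\[\]{}"\u201c\u201d]', "", text)
    text = EMOJI.sub("", text)
    # Collapse runs of spaces left behind by the removals.
    return re.sub(r"[ \t]{2,}", " ", text).strip()

print(strip_for_speech("**Great news!** Your code is <spell>AB12CD</spell>. \U0001F389"))
```

Underscores, apostrophes, and parentheses are kept because they occur in ordinary input such as email addresses, contractions, and US phone numbers.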

Inserting pauses

Use natural punctuation for pauses — a comma or period usually produces the right pause in context. For an explicit, fixed-duration silence, use a break tag. Break tags split the generation, so they can sound less natural; avoid placing several in quick succession, which can cause hallucinations. Each tag counts as a single character and doesn’t need surrounding whitespace.
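To catch break tags placed in quick succession before a transcript is sent, you could run a simple check like the one below. It is a sketch under assumptions that this page does not confirm: it assumes break tags look like `<break time="1s"/>` (check the tag reference for the exact syntax), and the 20-character threshold for "quick succession" is arbitrary. The function name `clustered_breaks` is hypothetical.

```python
import re

# Assumed tag shape: <break ...> or <break .../>. Adjust to the real syntax.
BREAK_TAG = re.compile(r"<break\b[^>]*>")

def clustered_breaks(transcript: str, min_gap: int = 20) -> list[tuple[int, int]]:
    """Return (start, end) spans covering pairs of consecutive break tags
    separated by fewer than min_gap characters of text."""
    matches = list(BREAK_TAG.finditer(transcript))
    return [
        (first.start(), second.end())
        for first, second in zip(matches, matches[1:])
        if second.start() - first.end() < min_gap
    ]

risky = 'Hold on.<break time="1s"/><break time="1s"/>Okay, thanks for waiting.'
print(clustered_breaks(risky))
```

An empty list means no two tags are closer together than the threshold.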

Pronunciation

For proper nouns, trademarks, and domain-specific terms — or to disambiguate identical spellings (e.g. Nice, the city, vs. nice, the adjective) — use custom pronunciations.

Streaming

Use continuations when generating chunks of audio that need to sound contiguous (for example, LLM-streamed output). This preserves prosody and voice consistency across chunk boundaries.
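When the text comes from an LLM token stream, it helps to buffer tokens into complete sentences before sending each chunk, in line with the advice to send complete phrases. The generator below sketches only that buffering step; it makes no API calls, `sentence_chunks` is a hypothetical name, and how each chunk is submitted as a continuation depends on your client.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ?, or ! followed by whitespace. Waiting for the
# whitespace avoids splitting "$19.99" when the stream pauses after "$19.".
SENTENCE_END = re.compile(r"[.?!]\s")

def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed tokens into sentence-sized chunks.

    Whitespace is preserved, so joining the chunks reproduces the input.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = SENTENCE_END.search(buffer)
            if not match:
                break
            yield buffer[:match.end()]
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer  # flush whatever remains when the stream ends

tokens = ["Your total", " is $19", ".99. Shall", " I place", " the order?"]
for chunk in sentence_chunks(tokens):
    print(repr(chunk))
```

The split rule is intentionally naive: abbreviations such as "Dr." or "P.M." followed by a space will also end a chunk, so treat it as a starting point.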
