Prompting tips - Cartesia Docs

Sonic 3.5 is designed to sound natural with minimal prompt engineering. In most cases you can pass your transcript as-is and let the model handle normalization, pacing, and expression. The tips below apply across the Sonic family; differences between Sonic 3.5 and Sonic 3 are called out inline.

Recommendations

Pass natural, well-punctuated text. Full sentences with normal capitalization and punctuation produce the best pacing and intonation. End each transcript with terminal punctuation (., ?, !).
Pass numbers, dates, times, and common acronyms in conventional written form unless you have a specific reason to override. The list below shows example shapes to put in the transcript (or to instruct your model to output); it is not shorthand for “ignore formatting.” With typical text normalization enabled, Sonic maps these patterns to natural speech for most inputs:
- Large numbers like 1,234,567
- US phone numbers: (415) 555-1212
- Email addresses: user@example.com
- Dates in MM/DD/YYYY (or DD/MM/YYYY based on locale): 04/20/2025
- Times with a space before AM/PM: 7:00 PM, 7 PM, 7:00 P.M.
- Common acronyms (NASA) and initialisms (USA)
Symbols are handled naturally: @ is read as "at" in email addresses, and the parentheses in phone numbers are silent. When an LLM produces this text, see Voice agents (LLM-authored text) below for how normalization, optional bypass settings, and system prompts fit together.
Match the voice to the language. Each voice has a primary language it works best with. Use the Playground to audition voices for a given language.
Keep prompts in their natural written form. Heavy preprocessing (stripping punctuation, forcing all caps) generally hurts output quality.
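
The terminal-punctuation recommendation above can be enforced with a small pre-send check. This is a minimal sketch, not part of any Cartesia SDK: the function name `prepare_transcript` is hypothetical, and the choice to append a period (rather than some other mark) is an assumption.

```python
TERMINAL = (".", "?", "!")

def prepare_transcript(text: str) -> str:
    """Collapse stray whitespace and make sure the text ends with . ? or !"""
    cleaned = " ".join(text.split())
    # Ignore trailing quotes or brackets when checking the final mark.
    core = cleaned.rstrip("\"')]”’")
    if core and not core.endswith(TERMINAL):
        cleaned += "."  # assumption: a period is an acceptable default
    return cleaned

print(prepare_transcript("Your order ships on 04/20/2025 at 7:00 PM"))
# prints: Your order ships on 04/20/2025 at 7:00 PM.
```

Note that the helper leaves numbers, dates, and times in their written form, in line with the advice to avoid heavy preprocessing.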

Controlling pacing and spelling

When you need character-by-character read-out (confirmation codes, order IDs, serial numbers, spelled-out names) or fine-grained pacing, use one of the following:

Spell tags (recommended). Wrap the string in <spell>...</spell>. This is the most reliable option and works for letters, digits, and mixed alphanumerics in all supported languages.
```
Your confirmation code is <spell>AB12CD</spell>.
```
Space-delimited characters. If you prefer not to use tags, separate characters with single spaces.
```
Your code is A B C 1 2 3.
```
Commas for pauses between groups. Use commas where a human would naturally pause.
```
Your code is A B C, 1 2 3.
```
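
The three formats above can be generated programmatically when codes come from a database or an order system. The sketch below is illustrative only: the helper names and the default group size of 3 are assumptions, not part of any Cartesia library.

```python
def spell_tag(code: str) -> str:
    """Wrap a code in spell tags for character-by-character read-out."""
    return f"<spell>{code}</spell>"

def spaced(code: str, group_size: int = 3) -> str:
    """Space-delimit characters, with commas between groups of `group_size`."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    chars = [c for c in code if not c.isspace()]
    groups = [chars[i:i + group_size] for i in range(0, len(chars), group_size)]
    return ", ".join(" ".join(group) for group in groups)

def format_code(code: str, tags_supported: bool = True, group_size: int = 3) -> str:
    """Prefer spell tags; fall back to the spaced format when tags are stripped."""
    return spell_tag(code) if tags_supported else spaced(code, group_size)

print(f"Your confirmation code is {format_code('AB12CD')}.")
# prints: Your confirmation code is <spell>AB12CD</spell>.
print(f"Your code is {format_code('ABC123', tags_supported=False)}.")
# prints: Your code is A B C, 1 2 3.
```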

Migrating from Sonic 3? The recommended delimiter format has changed in Sonic 3.5. Use spaces between characters and commas between groups, instead of commas between characters and periods between groups. The old format still works on sonic-3 snapshots but is no longer recommended.

| Scenario | Old (Sonic 3) | New (Sonic 3.5) |
| --- | --- | --- |
| Spell out letters `HELLO` | `H, E, L, L, O` | `H E L L O` |
| Spell out digits `123456` | `1, 2, 3, 4, 5, 6` | `1 2 3 4 5 6` |
| Confirmation code `ABC123` | `A, B, C. 1, 2, 3.` | `A B C, 1 2 3` |
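
If existing templates already contain strings in the old format, the mapping in the table can be applied mechanically. The following is a rough sketch under the assumption that the input is only the spelled-out fragment (not a whole sentence); `migrate_spelling` is a hypothetical helper name.

```python
import re

def migrate_spelling(old: str) -> str:
    """Convert a Sonic 3 style fragment to the Sonic 3.5 style.

    Old: commas between characters, periods between groups.
    New: spaces between characters, commas between groups.
    """
    # Periods separate groups in the old format; drop empty trailing pieces.
    groups = [g for g in re.split(r"\.\s*", old.strip()) if g]
    converted = [
        " ".join(part.strip() for part in group.split(",") if part.strip())
        for group in groups
    ]
    return ", ".join(converted)

print(migrate_spelling("A, B, C. 1, 2, 3."))
# prints: A B C, 1 2 3
```

Run it only on the code fragment itself; applying it to full sentences would also rewrite ordinary commas and periods.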

Voice agents (LLM-authored text)

When a language model writes the transcript (for example a voice agent), apply the same spell-tag and spacing rules as in Controlling pacing and spelling. A few extra guidelines:

What to output. The Recommendations section lists literal text shapes for Sonic (or for your LLM to emit): 12%, common phone and email layouts, typical dates, and similar. Repeating those shapes in your system prompt on purpose is normal and keeps behavior predictable; this guidance is not a request to stay vague.
Normalization and explicitness. When text normalization is enabled (the common default), those conventional forms often read well without being spelled out in prose (for example, 12% does not need to be rewritten as “twelve percent”). Some integrations or vendors expose an option to skip or bypass normalization for latency or control; if yours does, plan for more explicit spoken wording instead. For recurring misreads, add custom pronunciations or a narrow LLM rule before resorting to a long catch-all prompt.
Prompt size: prefer the smallest system prompt that passes your tests; expand when you change pipeline settings or hit new edge cases.
Codes and IDs: prefer <spell>...</spell> when your client passes tags through to Sonic; otherwise use spaces between characters and commas between groups (Controlling pacing and spelling). NATO phonetics (Alpha, Bravo) are a valid choice when you want the listener to disambiguate letters clearly (models often handle them well). <spell> and the spaced formats remain the most deterministic for Sonic pacing and tag behavior.
24-hour times: in some locales, a written 24-hour time (e.g. 14:30) may be normalized to a more colloquial 12-hour style when spoken. English and Hindi do not behave the same as every other language here, and the stack is still evolving toward options like stricter read-as-written behavior. If you need speech to match the clock digits literally, validate in your target language and voice, then adjust the system prompt or custom pronunciations.
Markdown and machine-shaped text: if the reply is read verbatim, avoid markdown (lists, # headers, **bold**), raw JSON, emoji, and other symbols or special characters that TTS may speak oddly, unless your client strips or normalizes them before Sonic. Many teams use a single rule covering bullets, *, and non-spoken punctuation.
Streaming: when streaming tokens into TTS, use continuations as in Streaming below.
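
For clients that strip markdown before sending text to Sonic, a single cleanup pass along the following lines may be enough. This is a blunt sketch rather than a full markdown parser: `strip_for_speech` is a hypothetical name, the emoji ranges are approximate, and it assumes each line of the reply is a complete sentence or list item.

```python
import re

# Rough emoji and pictograph ranges (an approximation, not exhaustive).
EMOJI = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def strip_for_speech(text: str) -> str:
    """Remove common markdown and emoji so the reply reads as plain prose."""
    text = re.sub(r"```.*?```", " ", text, flags=re.DOTALL)  # fenced code blocks
    text = EMOJI.sub("", text)
    lines = []
    for line in text.splitlines():
        line = re.sub(r"^\s{0,3}#{1,6}\s+", "", line)            # "# " headers
        line = re.sub(r"^\s*(?:[-*+]|\d+\.)\s+", "", line)       # list markers
        line = re.sub(r"(\*\*|\*|`)(.+?)\1", r"\2", line)        # bold, italic, inline code
        line = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", line)     # [label](url) -> label
        line = line.strip()
        if line and line[-1] not in ".?!,:;":
            line += "."  # give headers and list items a terminal mark
        if line:
            lines.append(line)
    return " ".join(lines)

reply = "# Order status\n\nYour order has **shipped**:\n- Tracking is `ready`\n- See [details](https://example.com)"
print(strip_for_speech(reply))
# prints: Order status. Your order has shipped: Tracking is ready. See details.
```

Tags such as <spell>...</spell> pass through unchanged, since none of the patterns match angle brackets.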

Starter system prompt (v1). A baseline you can paste and trim for your product. If your stack does not pass <spell> or other tags through to Sonic, omit the spell-tag lines and use the spaced-format fallback only.

You are a voice agent. Everything you output will be spoken aloud by Cartesia Sonic text-to-speech.

Goals:
- Sound natural: full sentences, normal capitalization, end with . ? or !
- Prefer conventional written forms when your pipeline keeps text normalization on: numbers, dates, common acronyms, typical US phones like (415) 555-1212, emails like user@example.com, symbols like 12%. You may still spell amounts or symbols in words in the system prompt if you want that behavior every time.
- For confirmation codes, reference numbers, or mixed IDs: use <spell>...</spell> when supported, else Sonic 3.5 spaced style (A B C, 1 2 3). NATO phonetics are fine when listener clarity matters.
- Avoid markdown, raw JSON, emoji, special characters, and other stray symbols in spoken output unless your client strips them; write plain prose.
- For unusual proper nouns or product names that misread, give a short spoken-friendly form or rely on app-level pronunciation settings when available.
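
The trimming rule for the starter prompt (drop the spell-tag wording when tags are not passed through) can be applied when the prompt is assembled. This is a sketch only: `build_system_prompt` is a hypothetical helper, and the prompt lines are shortened from the starter prompt above for brevity.

```python
def build_system_prompt(spell_tags_supported: bool) -> str:
    """Assemble a trimmed system prompt, choosing the code-reading rule by capability."""
    if spell_tags_supported:
        code_rule = (
            "- For confirmation codes, reference numbers, or mixed IDs: "
            "use <spell>...</spell>."
        )
    else:
        # Fallback for stacks that strip tags before text reaches Sonic.
        code_rule = (
            "- For confirmation codes, reference numbers, or mixed IDs: "
            "use spaces between characters and commas between groups (A B C, 1 2 3)."
        )
    lines = [
        "You are a voice agent. Everything you output will be spoken aloud "
        "by Cartesia Sonic text-to-speech.",
        "",
        "Goals:",
        "- Sound natural: full sentences, normal capitalization, end with . ? or !",
        code_rule,
        "- Avoid markdown, raw JSON, emoji, and stray symbols; write plain prose.",
    ]
    return "\n".join(lines)

print(build_system_prompt(spell_tags_supported=False))
```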

Inserting pauses

Sonic respects natural punctuation like commas and periods. For a longer pause, or a pause at a specific point in the text, use a break tag. Break tags count as a single character and don’t need surrounding whitespace.
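
A break tag can be spliced into a transcript with plain string formatting. The sketch below assumes an SSML-style `<break time="1s"/>` syntax, which this page does not spell out; confirm the exact tag format in the SSML tag reference before relying on it. The helper name `with_pause` is hypothetical.

```python
def with_pause(before: str, after: str, seconds: float = 1.0) -> str:
    """Join two pieces of text with a break tag between them.

    Assumption: the tag is written as <break time="Ns"/>.
    No surrounding whitespace is added, since the tag does not need any.
    """
    tag = f'<break time="{seconds:g}s"/>'
    return f"{before}{tag}{after}"

print(with_pause("Please hold.", "Thanks for waiting.", seconds=1.5))
# prints: Please hold.<break time="1.5s"/>Thanks for waiting.
```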

Pronunciation

For proper nouns, trademarks, and domain-specific terms — or to disambiguate identical spellings (e.g. Nice, the city, vs. nice, the adjective) — use custom pronunciations.

Streaming

Use continuations when generating chunks of audio that need to sound contiguous (for example, LLM-streamed output). This preserves prosody and voice consistency across chunk boundaries.
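
When tokens arrive from an LLM, one common approach is to buffer them into sentence-sized chunks and send each chunk as a continuation. The sketch below covers only the buffering step; how a chunk is marked as a continuation depends on the API and is not shown here. `sentence_chunks` is a hypothetical helper, and the sentence splitter is deliberately naive (abbreviations such as "Dr." will also trigger a split).

```python
import re
from typing import Iterable, Iterator

# Split after . ? or ! when followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.?!])\s+")

def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Buffer streamed tokens and yield complete sentences as they appear."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence + " "  # keep the space so chunks concatenate cleanly
        buffer = parts[-1]
    if buffer.strip():
        yield buffer  # flush whatever remains when the stream ends

tokens = ["Hello", " there.", " How", " are", " you?", " Fine"]
print(list(sentence_chunks(tokens)))
# prints: ['Hello there. ', 'How are you? ', 'Fine']
```

Sending whole sentences rather than individual tokens gives the model more context per chunk, while the continuation mechanism keeps prosody consistent across the boundaries.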
