# Prompting tips

> Get natural-sounding output from Sonic with minimal prompt engineering.

Sonic 3.5 is designed to sound natural with minimal prompt engineering. In most cases you can pass your transcript as-is and let the model handle normalization, pacing, and expression. The tips below apply across the Sonic family; differences between Sonic 3.5 and Sonic 3 are called out inline.

## Recommendations

* **Pass natural, well-punctuated text.** Full sentences with normal capitalization and punctuation produce the best pacing and intonation. End each transcript with terminal punctuation (`.`, `?`, `!`).
* **Pass numbers, dates, times, and common acronyms in conventional written form** unless you have a specific reason to override. The list below shows **example shapes to put in the transcript** (or to instruct your model to output); it is not a license to ignore formatting. With text normalization enabled (the typical setup), Sonic reads these patterns as natural speech for most inputs:

  * Large numbers like `1,234,567`
  * US phone numbers: `(415) 555-1212`
  * Email addresses: `user@example.com`
  * Dates in `MM/DD/YYYY` (or `DD/MM/YYYY` based on locale): `04/20/2025`
  * Times with a space before AM/PM: `7:00 PM`, `7 PM`, `7:00 P.M.`
  * Common acronyms (`NASA`) and initialisms (`USA`)

  Symbols are handled naturally: `@` is read as "at" in email addresses, and the parentheses in phone numbers are silent.

  When an **LLM** produces this text, see [**Voice agents (LLM-authored text)**](#voice-agents-llm-authored-text) below for how normalization, optional bypass settings, and system prompts fit together.
* **Match the voice to the language.** Each voice has a primary language it works best with. Use the [Playground](https://play.cartesia.ai) to audition voices for a given language.
* **Keep prompts in their natural written form.** Heavy preprocessing (stripping punctuation, forcing all caps) generally hurts output quality.
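
As a rough illustration of the recommendations above, the sketch below renders values in the conventional written shapes and makes sure a transcript ends with terminal punctuation before it is sent. The helper names (`format_count`, `format_date_us`, `format_time_12h`, `ensure_terminal_punctuation`) are hypothetical and not part of any Cartesia SDK, and the date helper assumes the US `MM/DD/YYYY` order.

```python
from datetime import date, time


def format_count(n: int) -> str:
    """Render an integer with thousands separators, e.g. 1,234,567."""
    return f"{n:,}"


def format_date_us(d: date) -> str:
    """Render a date as MM/DD/YYYY (assumes a US locale)."""
    return d.strftime("%m/%d/%Y")


def format_time_12h(t: time) -> str:
    """Render a time as H:MM AM/PM, with a space before AM/PM."""
    hour = t.hour % 12 or 12
    suffix = "AM" if t.hour < 12 else "PM"
    return f"{hour}:{t.minute:02d} {suffix}"


def ensure_terminal_punctuation(transcript: str) -> str:
    """Append a period if the transcript does not already end a sentence."""
    text = transcript.strip()
    # Look past closing quotes or brackets when checking the last character.
    core = text.rstrip("\"')]")
    if core and core[-1] not in ".?!":
        text += "."
    return text


transcript = ensure_terminal_punctuation(
    f"We shipped {format_count(1234567)} orders as of "
    f"{format_date_us(date(2025, 4, 20))} at {format_time_12h(time(19, 0))}"
)
print(transcript)
```

These helpers only shape the text; how the patterns are spoken still depends on text normalization being enabled in your pipeline.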

## Controlling pacing and spelling

When you need character-by-character read-out (confirmation codes, order IDs, serial numbers, spelled-out names) or fine-grained pacing, use one of the following:

1. **Spell tags (recommended).** Wrap the string in `<spell>...</spell>`. This is the most reliable option, and it works for letters, digits, and mixed alphanumerics in all supported languages.
   ```
   Your confirmation code is <spell>AB12CD</spell>.
   ```
2. **Space-delimited characters.** If you prefer not to use tags, separate characters with single spaces.
   ```
   Your code is A B C 1 2 3.
   ```
3. **Commas for pauses between groups.** Use commas where a human would naturally pause.
   ```
   Your code is A B C, 1 2 3.
   ```

<Note>
  **Migrating from Sonic 3?** The recommended delimiter format has changed in Sonic 3.5. Use **spaces between characters** and **commas between groups** instead of commas between characters and periods between groups. The old format still works on `sonic-3` snapshots but is no longer recommended.
</Note>

| Scenario                   | Old (Sonic 3)       | New (Sonic 3.5) |
| -------------------------- | ------------------- | --------------- |
| Spell out letters `HELLO`  | `H, E, L, L, O`     | `H E L L O`     |
| Spell out digits `123456`  | `1, 2, 3, 4, 5, 6`  | `1 2 3 4 5 6`   |
| Confirmation code `ABC123` | `A, B, C. 1, 2, 3.` | `A B C, 1 2 3`  |
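
The formats above can be produced programmatically. The following is a minimal sketch, assuming a purely alphanumeric code and assuming that a natural pause falls wherever the code switches between letters and digits. The function names are hypothetical, and the grouping rule is an illustration rather than something Sonic requires.

```python
from itertools import groupby


def spell_tagged(code: str) -> str:
    """Wrap the code in a spell tag (the recommended option)."""
    return f"<spell>{code}</spell>"


def spaced_readout(code: str) -> str:
    """Sonic 3.5 fallback: spaces between characters, commas between groups.

    Groups are runs of letters or runs of digits (an assumption about
    where a human would pause).
    """
    groups = ("".join(chars) for _, chars in groupby(code, key=str.isdigit))
    return ", ".join(" ".join(group) for group in groups)


def format_code(code: str, tags_supported: bool) -> str:
    """Pick the spell tag when the client passes tags through to Sonic."""
    return spell_tagged(code) if tags_supported else spaced_readout(code)


print(f"Your code is {format_code('ABC123', tags_supported=False)}.")
print(f"Your confirmation code is {format_code('AB12CD', tags_supported=True)}.")
```

If your codes contain separators such as hyphens, you would likely want to treat those as group boundaries instead; that case is not handled here.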

## Voice agents (LLM-authored text)

When a **language model** writes the transcript (for example a voice agent), apply the same spell-tag and spacing rules as in [**Controlling pacing and spelling**](#controlling-pacing-and-spelling). A few extra guidelines:

* **What to output.** [**Recommendations**](#recommendations) lists **literal text shapes** for Sonic (or for your LLM to emit): large numbers, common phone and email layouts, typical dates, and similar forms such as `12%`. Deliberately **repeating those shapes in your system prompt** is fine and keeps behavior predictable; this guide is not asking you to keep the prompt vague.
* **Normalization and explicitness.** When **text normalization** is enabled (the common default), those conventional forms often read well **without** spelling everything out in prose (for example, rewriting `12%` as "twelve percent"). Some integrations or vendors expose an option to **skip or bypass normalization** for latency or control. If yours does, plan for **more** explicit spoken wording instead. For recurring misreads, add [custom pronunciations](#pronunciation) or a **narrow** LLM rule before resorting to a long catch-all prompt.
* **Prompt size:** prefer the smallest system prompt that passes your tests; expand when you change pipeline settings or hit new edge cases.
* **Codes and IDs:** prefer `<spell>...</spell>` when your client passes tags through to Sonic; otherwise use spaces between characters and commas between groups ([**Controlling pacing and spelling**](#controlling-pacing-and-spelling)). **NATO phonetics** (`Alpha`, `Bravo`) are a valid choice when you want the **listener** to disambiguate letters clearly (models often handle them well). `<spell>` and the spaced formats remain the most **deterministic** for Sonic pacing and tag behavior.
* **24-hour times:** in some locales, a written 24-hour time (e.g. `14:30`) may be normalized to a more colloquial 12-hour style when spoken. Behavior varies by language (English and Hindi, for example, do not behave like every other language here), and options such as stricter read-as-written behavior are still evolving. If speech must match the clock digits literally, validate in your target language and voice, then adjust the system prompt or [custom pronunciations](#pronunciation).
* **Markdown and machine-shaped text:** if the reply is read verbatim, avoid markdown (lists, `#` headers, `**bold**`), raw **JSON**, **emoji**, and other symbols or special characters that TTS may speak oddly, unless your client strips or normalizes them before Sonic. Many teams use a single rule covering bullets, `*`, and non-spoken punctuation.
* **Streaming:** when streaming tokens into TTS, use continuations as described in [**Streaming**](#streaming).

**Starter system prompt (v1).** Baseline you can paste and trim for your product. If your stack **does not** pass `<spell>` or other tags through to Sonic, omit the spell-tag lines and use the spaced-format fallback only.

```text
You are a voice agent. Everything you output will be spoken aloud by Cartesia Sonic text-to-speech.

Goals:
- Sound natural: full sentences, normal capitalization, end with . ? or !
- Prefer conventional written forms when your pipeline keeps text normalization on: numbers, dates, common acronyms, typical US phones like (415) 555-1212, emails like user@example.com, symbols like 12%. You may still spell amounts or symbols in words in the system prompt if you want that behavior every time.
- For confirmation codes, reference numbers, or mixed IDs: use <spell>...</spell> when supported, else Sonic 3.5 spaced style (A B C, 1 2 3). NATO phonetics are fine when listener clarity matters.
- Avoid markdown, raw JSON, emoji, special characters, and other stray symbols in spoken output unless your client strips them; write plain prose.
- For unusual proper nouns or product names that misread, give a short spoken-friendly form or rely on app-level pronunciation settings when available.
```
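
If your client strips machine-shaped text before it reaches Sonic, the cleanup step might look roughly like the sketch below. It is a best-effort illustration using regular expressions, not a full markdown parser; the function name `strip_for_tts` is hypothetical, and the emoji ranges are an approximation rather than a complete list.

```python
import re

# Emoji and pictograph ranges; a rough approximation, not a complete list.
_EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F\u200D]")


def strip_for_tts(text: str) -> str:
    """Remove common markdown markers and emoji before sending text to TTS."""
    # [label](url) -> label
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)
    # Leading "#" heading markers
    text = re.sub(r"^[ \t]{0,3}#{1,6}[ \t]+", "", text, flags=re.MULTILINE)
    # Leading bullet markers (-, *, +)
    text = re.sub(r"^[ \t]*[-*+][ \t]+", "", text, flags=re.MULTILINE)
    # Paired emphasis and inline-code markers: **x**, __x__, *x*, `x`
    text = re.sub(r"(\*\*|__|\*|`)(.+?)\1", r"\2", text)
    text = _EMOJI.sub("", text)
    # Collapse newlines and repeated spaces into single spaces.
    return " ".join(text.split())


reply = "**Good news!** Your order shipped. 🎉\n- It arrives `Friday`."
print(strip_for_tts(reply))
```

Because lines are joined with spaces, headings or bullets that lack terminal punctuation will run into the following sentence, so prompting the model to write plain prose remains the first line of defense.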

## Inserting pauses

Sonic respects natural punctuation like commas and periods. For a longer pause, or a pause at a specific point, use a [break tag](/build-with-cartesia/capability-guides/ssml-tags#pauses-and-breaks). Break tags count as a single character and don't need surrounding whitespace.

## Pronunciation

For proper nouns, trademarks, and domain-specific terms, or to disambiguate identical spellings (e.g. *Nice*, the city, vs. *nice*, the adjective), use [custom pronunciations](/build-with-cartesia/capability-guides/custom-pronunciations).

## Streaming

Use [continuations](/build-with-cartesia/capability-guides/stream-inputs-using-continuations) when generating chunks of audio that need to sound contiguous (for example, LLM-streamed output). This preserves prosody and voice consistency across chunk boundaries.
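
The sketch below shows one way to turn streamed chunks into a sequence of request payloads that share a context. It makes no network calls. The field names (`context_id`, `continue`, `model_id`, `transcript`, `voice`) are assumptions based on the continuations guide and should be checked against the API reference; other required fields, such as the output format, are omitted, and the helper names and the model and voice identifiers are placeholders.

```python
import uuid
from typing import Iterable, Iterator


def _payload(chunk: str, context_id: str, model_id: str,
             voice_id: str, more: bool) -> dict:
    """Build one request body (field names are assumptions, see above)."""
    return {
        "model_id": model_id,
        "transcript": chunk,
        "voice": {"mode": "id", "id": voice_id},
        "context_id": context_id,
        "continue": more,  # False marks the final chunk of this context
    }


def continuation_requests(chunks: Iterable[str], model_id: str,
                          voice_id: str) -> Iterator[dict]:
    """Yield one payload per chunk, all sharing a single context ID."""
    context_id = str(uuid.uuid4())
    pending = None
    for chunk in chunks:
        # Hold back one chunk so the last one can be flagged as final.
        if pending is not None:
            yield _payload(pending, context_id, model_id, voice_id, more=True)
        pending = chunk
    if pending is not None:
        yield _payload(pending, context_id, model_id, voice_id, more=False)


requests = list(
    continuation_requests(["Hello, ", "how are ", "you?"],
                          model_id="sonic-3", voice_id="example-voice-id")
)
for request in requests:
    print(request["transcript"], request["continue"])
```

Holding back one chunk is a design choice that lets the generator mark the final chunk without knowing the stream length in advance.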
