Sonic 3.5 is a drop-in replacement for Sonic 3 for most customers. Your existing voice IDs, request shape, and prompts work as-is.

Switching the model ID

# Previous
model_id = "sonic-3"

# Current
model_id = "sonic-3.5"
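Because the request shape is unchanged, upgrading an existing request should only require swapping the model ID. The sketch below shows this on a request dictionary. Every field other than `model_id` is an assumption modelled loosely on Cartesia's TTS request body, and `upgrade_to_sonic_3_5` is a hypothetical helper, so check field names against the API reference before relying on them.

```python
import copy

# Hypothetical request body; field names other than "model_id" are assumptions
# and should be checked against the API reference. "YOUR_VOICE_ID" is a placeholder.
SONIC_3_REQUEST = {
    "model_id": "sonic-3",
    "transcript": "Your order has shipped.",
    "voice": {"mode": "id", "id": "YOUR_VOICE_ID"},
    "output_format": {"container": "wav", "encoding": "pcm_s16le", "sample_rate": 44100},
    "language": "en",
}


def upgrade_to_sonic_3_5(request: dict) -> dict:
    """Return a copy of an existing request that targets sonic-3.5 (hypothetical helper)."""
    upgraded = copy.deepcopy(request)  # leave the caller's original request untouched
    upgraded["model_id"] = "sonic-3.5"  # the only field that changes
    return upgraded
```

Copying rather than mutating keeps it easy to send the same transcript to both models side by side while you compare output.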

What’s new in Sonic 3.5

Compared with sonic-3:
  • More natural speech, pacing, and emotional expression, especially on expressive, conversational, and support-style transcripts.
  • Cleaner audio quality across all languages and voices.
  • Dramatically better alphanumeric read-out: confirmation codes, order numbers, phone numbers, IDs, and emails sound noticeably more natural across all supported languages.
  • Step-change improvements in multilingual performance, particularly for Hebrew, Japanese, Spanish, Hindi, German, Korean, and French.
  • English heteronyms like read, bass, and bow now have more accurate pronunciation in context.

What to know before you switch

  • Spell tags work the same way. If you already wrap alphanumerics in <spell>...</spell>, you don’t need to change anything; you’ll just get better-sounding output. If you format alphanumerics with punctuation or spacing (commas, periods, spaces) instead of spell tags, the recommended format has changed; see prompting tips.
  • Speed and volume controls are temporarily disabled on sonic-3.5. If you rely on speed or volume controls (including via SSML), pin to a sonic-3 snapshot for now; see Sonic 3 in Older Models. We expect Sonic 3.5’s more natural pacing to reduce how often you need speed control.
  • Timestamps. End-of-word timestamps used for interruption handling should be unchanged. If you depend on beginning-of-word timestamps, test carefully and reach out if you see regressions.
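If your transcripts are generated programmatically, the spell-tag approach above can be applied before sending text to the model. The sketch below is one possible heuristic, not a recommended rule: it treats any run of four or more uppercase letters or digits that contains at least one digit as a code, and leaves spans that are already wrapped in <spell>...</spell> untouched. The pattern and the function name `wrap_codes_in_spell_tags` are hypothetical.

```python
import re

# Hypothetical heuristic: 4+ uppercase letters/digits with at least one digit
# (e.g. "AB12CD"). Note that plain numbers such as years ("2024") also match.
CODE_PATTERN = re.compile(r"\b(?=[A-Z0-9]*\d)[A-Z0-9]{4,}\b")

# Spans the caller has already wrapped; the capturing group keeps them in split output.
EXISTING_SPELL = re.compile(r"(<spell>.*?</spell>)")


def wrap_codes_in_spell_tags(transcript: str) -> str:
    """Wrap likely alphanumeric codes in <spell>...</spell> tags."""
    parts = EXISTING_SPELL.split(transcript)
    # re.split with a capturing group places already-tagged spans at odd indices,
    # so only the even-indexed (untagged) parts are rewritten.
    for i in range(0, len(parts), 2):
        parts[i] = CODE_PATTERN.sub(lambda m: f"<spell>{m.group(0)}</spell>", parts[i])
    return "".join(parts)


print(wrap_codes_in_spell_tags("Your confirmation code is AB12CD."))
```

Tune the pattern to the identifiers your application actually produces; lowercase IDs and email addresses, for example, are not matched by this sketch.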

Tips for best results

  • Giving the model enough surrounding context, such as whole phrases or sentences rather than isolated fragments, improves naturalness. See our buffering guide for details.
  • Keep prompts in their natural written form. Heavy pre-processing (stripping punctuation, forcing all caps) generally hurts output quality.
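When text arrives in small pieces (for example, streamed from an LLM), one client-side way to give the model more context is to collect fragments into complete sentences before synthesizing them. This is a minimal sketch under that assumption and is not necessarily the approach the buffering guide recommends. `buffer_sentences` is a hypothetical helper, and its rule of splitting on `.`, `!`, or `?` followed by whitespace is deliberately simple (it will, for instance, split after abbreviations like "Dr.").

```python
import re
from typing import Iterable, Iterator

# Hypothetical rule: a sentence ends at ., ! or ? followed by whitespace.
# "sonic-3.5" is not split because no whitespace follows the period.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def buffer_sentences(fragments: Iterable[str]) -> Iterator[str]:
    """Group streamed text fragments into complete sentences before synthesis."""
    buffer = ""
    for fragment in fragments:
        buffer += fragment
        # Emit every complete sentence and keep the unfinished tail buffered.
        *complete, buffer = SENTENCE_END.split(buffer)
        for sentence in complete:
            if sentence.strip():
                yield sentence.strip()
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when the stream ends


for sentence in buffer_sentences(["Hello the", "re. How are", " you? I'm fi", "ne"]):
    print(sentence)
```

Each yielded sentence keeps its punctuation intact, which also follows the tip above about leaving prompts in their natural written form.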