All models in the Sonic TTS family support custom pronunciations in your transcripts. Try out the pronunciation tool on our demo page.
Sonic 3.5 and Sonic 3
Sonic-2 and Sonic-turbo
sonic-3.5 and sonic-3 support custom pronunciation dictionaries, which let you specify how to pronounce a specific word or phrase.
A dictionary is a simple search and replace, which directs the model to use another string in lieu of the text from the transcript. The pronunciation can be either an IPA pronunciation or a “sounds-like” guidance:
[
{
"text": "bayou",
"pronunciation": "<<ˈ|b|ɑ|ˈ|j|u>>"
},
{
"text": "jambalaya",
"pronunciation": "<<ˈ|dʒ|ə|m|ˈ|b|ə|ˈ|l|aɪ|ˈ|ə>>"
},
{
"text": "tchoupitoulas",
"pronunciation": "chop-uh-TOO-liss"
}
]
The legacy alias field is deprecated. Use pronunciation for new dictionary items.
Save JSON like this as a pronunciation dictionary through our API or through our playground.
Once a dictionary is created, use it in any TTS API by passing its id as pronunciation_dict_id.
With the dictionary above, the string I ate some jambalaya on tchoupitoulas street becomes I ate some <<ˈ|dʒ|ə|m|ˈ|b|ə|ˈ|l|aɪ|ˈ|ə>> on chop-uh-TOO-liss street before being handed off to the model.
Case Sensitivity
Dictionary matching is case-sensitive, with one exception: a lowercase entry also matches its sentence-start capitalized form. For example, cat matches both cat and Cat, but not CAT. An entry for CAT only matches CAT.
This applies to multi-word entries too. An entry for green valley matches green valley and Green valley, but not Green Valley.
Use lowercase entries for common words. These match the word both mid-sentence (cat) and at the start of a sentence (Cat), covering the two most common positions.
Use exact capitalization for proper nouns. A term like LaTeX should be entered as LaTeX so it doesn’t collide with a different pronunciation for the common word latex. For multi-word proper nouns, enter the exact casing as it appears in your transcripts, for example Green Valley if the transcript capitalizes both words.
For the best controllability around pronunciation, we recommend using sonic-3.5.
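The substitution and case rules above can be tried out locally with a small Python simulation. This is only a sketch under stated assumptions, not the service's actual implementation: the function name apply_dictionary is hypothetical, and matching on whole words only is an assumption that the text above does not confirm.

```python
import re

def apply_dictionary(transcript, entries):
    """Simulate the documented search-and-replace on a transcript (hypothetical helper)."""
    for entry in entries:
        text = entry["text"]
        if not text:
            continue
        variants = [text]
        # A fully lowercase entry also matches its sentence-start capitalized form.
        if text == text.lower() and text[0].upper() != text[0]:
            variants.append(text[0].upper() + text[1:])
        # Assumption: entries match whole words only, never parts of longer words.
        pattern = (
            r"(?<!\w)(?:"
            + "|".join(re.escape(v) for v in variants)
            + r")(?!\w)"
        )
        # A function replacement keeps the pronunciation string literal.
        transcript = re.sub(pattern, lambda m: entry["pronunciation"], transcript)
    return transcript

entries = [
    {"text": "jambalaya", "pronunciation": "<<ˈ|dʒ|ə|m|ˈ|b|ə|ˈ|l|aɪ|ˈ|ə>>"},
    {"text": "tchoupitoulas", "pronunciation": "chop-uh-TOO-liss"},
]
print(apply_dictionary("I ate some jambalaya on tchoupitoulas street", entries))
```

Running a transcript through a helper like this before sending it can help spot entries that never match because of casing, such as a lowercase green valley entry against a transcript that writes Green Valley.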
Sonic-2 and Sonic-turbo
sonic-2 and sonic-turbo use MFA-style IPA for all languages.
For the best controllability around pronunciation, we recommend using sonic-2.
You can also get custom pronunciations with older Sonic models.
The sonic, sonic-2024-12-12, and sonic-2024-10-19 models use Sonic-flavored IPA phonemes for English.
The sonic and sonic-2024-12-12 models use MFA-style IPA for languages other than English, and the Sonic Preview model uses MFA-style IPA for all languages.
Note that sonic-2024-10-19 does not support custom pronunciations for languages other than English.
We will soon be updating all models to use MFA-style IPA.
Custom words should be wrapped in double angle brackets << >>, with pipe characters | between phonemes and no whitespace.
For example:
Can I get <<x|a|l|a|p|e|ɲ|o>> on that? (MFA-style IPA)
Can I get <<h|ɑː|l|ˈə|p|eɪ|n|y|ˌoʊ|>> on that? (Sonic-flavored IPA)
Each individual word should be wrapped in its own set of angle brackets.
MFA-style IPA
Constructing Pronunciations
We use the IPA phoneset as defined by the Montreal Forced Aligner. Because of the size and complexity of this phoneset, you may find it easier to construct your custom pronunciation starting from existing words with known phonemizations. We suggest the following workflow for constructing a custom pronunciation for a word:
- Go to the MFA pronunciation dictionary index and find the page corresponding to your language. Make sure the phoneset is MFA, and try to download the latest version (for most languages, v3.0 or v3.1).
- This page will give you the full range of acceptable phones for your language under the “phones” section.
- Scroll down to the Installation section and click on the Download from the release page link.
- Scroll to the bottom of the release page and download the .dict file; this is a text file mapping words to their constituent phonemes.
- The first column in the file contains words, and the last column contains space delimited phonemes. Ignore the other columns.
- Look up your word or words that sound similar to your intended pronunciation in the dictionary. Use these pronunciations as a starting point to construct your custom pronunciation.
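The lookup step can be scripted. The Python sketch below reads lines in the layout described above (word first, phonemes last, numeric columns in between) and returns the phonemes for a word. The function names are hypothetical, and treating every purely numeric token after the word as an ignorable column is an assumption about the file layout.

```python
def _is_number(token):
    """Return True if the token parses as a number (one of the ignorable columns)."""
    try:
        float(token)
        return True
    except ValueError:
        return False

def parse_dict_line(line):
    """Split one .dict line into (word, list_of_phonemes)."""
    tokens = line.split()  # handles both tab- and space-delimited files
    word, rest = tokens[0], tokens[1:]
    # Skip the numeric columns that sit between the word and its phonemes.
    start = 0
    while start < len(rest) and _is_number(rest[start]):
        start += 1
    return word, rest[start:]

def lookup(lines, word):
    """Return every pronunciation listed for `word` (a word may appear more than once)."""
    results = []
    for line in lines:
        if not line.strip():
            continue
        entry_word, phonemes = parse_dict_line(line)
        if entry_word == word:
            results.append(phonemes)
    return results

sample = ["cartesian 0.99 0.14 1.0 1.0 kʰ ɑ ɹ tʲ i ʒ ə n"]
print(lookup(sample, "cartesian"))
```

For a real file, pass an open file object as lines, for example with open(path, encoding="utf-8") as f: lookup(f, "cartesian"), where path points at the downloaded .dict file.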
Automatic pronunciation suggestions based on audio samples will be added in a future update. Note that MFA-style IPA does not support stress markers.
Example
Suppose I want to generate the text “This is a generation from Cartesia” and the model is not pronouncing “Cartesia” correctly. I would do the following:
- Go to the MFA pronunciation dictionary index and look for English pronunciation dictionaries. I see that for US English, the most recent version is v3.1.
- I note that the page says that the acceptable phones for US English are:
aj aw b bʲ c cʰ cʷ d dʒ dʲ d̪ ej f fʲ h i iː j k kʰ kʷ l m mʲ m̩ n n̩ ow p pʰ pʲ pʷ s t tʃ tʰ tʲ tʷ t̪ v vʲ w z æ ç ð ŋ ɐ ɑ ɑː ɒ ɒː ɔj ə ɚ ɛ ɝ ɟ ɟʷ ɡ ɡʷ ɪ ɫ ɫ̩ ɱ ɲ ɹ ɾ ɾʲ ɾ̃ ʃ ʉ ʉː ʊ ʎ ʒ ʔ θ
- Download the .dict file from the bottom of the release page.
- Find a word in this dictionary that sounds similar to how I want “Cartesia” to be pronounced. I see this entry in the dictionary:
cartesian 0.99 0.14 1.0 1.0 kʰ ɑ ɹ tʲ i ʒ ə n
- Ignore the middle four numeric columns. I want to cut off the part of the pronunciation that corresponds to “-an” and replace it with an “uh” sound. I know that the MFA phoneme for “uh” is ɐ (if I didn’t know that, I could also look up “uh” in the dictionary). So the pronunciation I want is kʰ ɑ ɹ tʲ i ʒ ɐ.
- Format the phonemes in double angle brackets with pipe characters between phonemes and no whitespace. So my transcript is:
This is a generation from <<kʰ|ɑ|ɹ|tʲ|i|ʒ|ɐ>>.
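The final formatting step is easy to get wrong by hand, so here is a minimal Python sketch of it. The helper names are hypothetical. The phone set is the US English list quoted in the example above, and rejecting anything outside that list (including stress markers, which MFA-style IPA does not support) is a local sanity check, not something the API is known to enforce in this way.

```python
# US English phones as listed in the example above.
US_ENGLISH_PHONES = set(
    "aj aw b bʲ c cʰ cʷ d dʒ dʲ d̪ ej f fʲ h i iː j k kʰ kʷ l m mʲ m̩ n n̩ ow "
    "p pʰ pʲ pʷ s t tʃ tʰ tʲ tʷ t̪ v vʲ w z æ ç ð ŋ ɐ ɑ ɑː ɒ ɒː ɔj ə ɚ ɛ ɝ ɟ ɟʷ "
    "ɡ ɡʷ ɪ ɫ ɫ̩ ɱ ɲ ɹ ɾ ɾʲ ɾ̃ ʃ ʉ ʉː ʊ ʎ ʒ ʔ θ".split()
)

def to_custom_pronunciation(phonemes, allowed=None):
    """Wrap one word's phonemes as <<p1|p2|...>> with no whitespace."""
    if not phonemes:
        raise ValueError("at least one phoneme is required")
    if allowed is not None:
        unknown = [p for p in phonemes if p not in allowed]
        if unknown:
            raise ValueError(f"phones not in the allowed set: {unknown}")
    return "<<" + "|".join(phonemes) + ">>"

def format_words(words, allowed=None):
    """Each word gets its own set of angle brackets; words are joined by spaces."""
    return " ".join(to_custom_pronunciation(w, allowed) for w in words)

cartesia = to_custom_pronunciation("kʰ ɑ ɹ tʲ i ʒ ɐ".split(), US_ENGLISH_PHONES)
print(f"This is a generation from {cartesia}.")
```

Passing the allowed set is optional, which keeps the same helper usable for other languages once you have copied that language's phone list from its MFA dictionary page.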
(Deprecated) Sonic-flavored IPA
Sonic-flavored IPA is only for sonic. Users of our latest models (sonic-2 and sonic-turbo) should use MFA-style IPA.
Here is a pronunciation guide for Sonic-flavored IPA. It follows the English phonology article on Wikipedia for most phonemes, but in spots where our model requires different notation than you may expect, we’ve included a blue <= in the margins.
You can copy/paste some of these uncommon symbols from the original charts here.
Stresses and vowel length markers
Sonic English expects stress markers for primary (ˈ) and secondary (ˌ) stressed syllables, which go directly before the vowel. We also use a marker for vowel length (ː). The model can operate without these markers, but you will get noticeably better robustness and control when you include them.