Improving Speech Quality

Tips and tricks for better generation quality with Sonic

The tricks shown below are specific to the current models (sonic-english and sonic-multilingual) and may change as we improve our modeling and training procedures.

Text-to-speech

Here are a few guidelines to achieve the best performance:

  1. Add punctuation where appropriate and at the end of each transcript whenever possible.
  2. Enter dates in MM/DD/YYYY form, such as 04/20/2023.
  3. To add a pause, insert “-” at the point where the pause should occur.
  4. For the multilingual model, use a voice that matches your desired language for the best results.
  5. Use continuations when generating audio in separate chunks that should sound contiguous.
  6. Use the Custom Pronunciation Guide to insert phonetic transcriptions that ensure correct pronunciation. This is especially useful for uncommon words, such as unique names and chemical compounds, and for words that are spelled the same but pronounced differently, such as the city of Nice and the adjective “nice.”
  7. To make phone numbers sound more natural, spell out the digits as words and use commas to introduce pauses. For example, “650-791-3124” can be input as “six five zero, seven nine one, three one two four.”
  8. To emphasize a question, use double question marks instead of a single one (e.g., “Are you here??” rather than “Are you here?”).
  9. Avoid using quotation marks in your input text unless you intend to refer to a quote.
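
Several of the guidelines above (ending punctuation, MM/DD/YYYY dates, and spelled-out phone numbers) can be applied to a transcript before it is sent for generation. The following is a minimal Python sketch of that preprocessing, assuming you normalize the text yourself. The function names are hypothetical helpers for illustration and are not part of the Sonic API.

```python
import re
from datetime import date

# Hypothetical helpers for illustration; none of these are part of the Sonic API.
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def spell_phone_number(number: str) -> str:
    """Spell each digit as a word and separate digit groups with commas."""
    # Split on any run of non-digits (hyphens, spaces, parentheses).
    groups = [g for g in re.split(r"\D+", number) if g]
    return ", ".join(
        " ".join(DIGIT_WORDS[int(d)] for d in group) for group in groups
    )


def format_date(d: date) -> str:
    """Render a date in MM/DD/YYYY form."""
    return d.strftime("%m/%d/%Y")


def ensure_terminal_punctuation(text: str) -> str:
    """Append a period if the transcript does not already end in . ! or ?"""
    text = text.rstrip()
    if text and text[-1] not in ".!?":
        text += "."
    return text


if __name__ == "__main__":
    transcript = (
        f"Your appointment is on {format_date(date(2023, 4, 20))}. "
        f"Call us at {spell_phone_number('650-791-3124')}"
    )
    print(ensure_terminal_punctuation(transcript))
```

Handling each rule in a small separate function makes it easy to adjust or drop one if the guidance changes with future models.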