Welcome to Cartesia
Our API enables developers to build real-time, multimodal AI experiences that feel natural and responsive.
The Cartesia API currently serves our state-of-the-art multilingual generative voice model, Sonic. Sonic takes text input and streams back ultra-realistic speech in response. It can also clone voices. You can control every aspect of a generation, from speed and emotion to pronunciation and accent.
Sonic is the world’s fastest text-to-speech model. It can stream out the first byte of audio in just 90ms, making it perfect for real-time and conversational experiences. (To put things into perspective, that’s about twice as fast as the blink of an eye.)
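If you want to check a first-byte latency figure like this in your own setup, one approach is to time how long a streaming response takes to yield its first non-empty chunk. The sketch below is illustrative only: `time_to_first_chunk` and `fake_audio_stream` are hypothetical names, and the generator merely simulates a streaming response with a fixed delay. It is not the Cartesia API or SDK. In practice you would pass in whatever iterable of audio bytes your client returns.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_to_first_chunk(stream: Iterable[bytes]) -> Tuple[float, bytes]:
    """Return (seconds until the first non-empty chunk, that chunk)."""
    start = time.perf_counter()
    for chunk in stream:
        if chunk:  # skip empty keep-alive chunks
            return time.perf_counter() - start, chunk
    raise ValueError("stream ended before any audio arrived")


def fake_audio_stream(delay_s: float = 0.05, chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response (assumption, not a real API)."""
    for _ in range(chunks):
        time.sleep(delay_s)  # simulate network and model latency
        yield b"\x00" * 320  # placeholder bytes, not real audio


if __name__ == "__main__":
    latency_s, first = time_to_first_chunk(fake_audio_stream())
    print(f"first audio chunk after {latency_s * 1000:.0f} ms ({len(first)} bytes)")
```

Measured this way, the number includes network round-trip time as well as model latency, so results will vary with your location and connection.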
Sonic is the world’s highest-quality text-to-speech model. So even if speed isn’t your top priority, Sonic is the best choice, whether you want to narrate content, dub videos, or do anything else.
Learn more about available Sonic model variants and their capabilities in the Models section.