Welcome to Cartesia
Our API enables developers to build real-time, multimodal AI experiences that feel natural and responsive.
The Cartesia API currently serves our state-of-the-art multilingual generative voice model family, Sonic. Sonic takes text input and streams back ultra-realistic speech in response. It can also clone voices. You can control every aspect of generation, from speed and emotion to pronunciation and accent.
With Sonic models, you get the world’s highest quality text-to-speech model.
So even if speed isn’t your top priority, Sonic is the best choice—whether you want to narrate content, dub videos, or anything else.
Interested in latency? The Sonic family includes the world’s fastest text-to-speech model, Sonic Turbo.
It can stream out the first byte of audio in just 40ms, making it perfect for real-time and conversational experiences.
To put things into perspective, that’s about four times as fast as the blink of an eye.
Sonic 2.0 also has a model latency of 90ms, making it perfect for high-quality, real-time applications.
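If you want to see how first-byte figures like these translate to your own setup, one approach is to time the arrival of the first chunk of a streaming response. The sketch below is a minimal illustration, assuming the audio arrives as an iterable of byte chunks. `time_to_first_chunk` and `fake_audio_stream` are hypothetical names used here for illustration and are not part of the Cartesia API. The fake stream only simulates a delay so the example runs on its own.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def time_to_first_chunk(chunks: Iterable[bytes]) -> Tuple[Optional[float], int]:
    """Consume a stream; return (seconds until first non-empty chunk, total bytes)."""
    start = time.perf_counter()
    first_chunk_latency = None
    total_bytes = 0
    for chunk in chunks:
        # Record the elapsed time only once, when the first real data arrives.
        if chunk and first_chunk_latency is None:
            first_chunk_latency = time.perf_counter() - start
        total_bytes += len(chunk)
    return first_chunk_latency, total_bytes


def fake_audio_stream(delay_s: float = 0.04, n_chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response (not the Cartesia API)."""
    time.sleep(delay_s)  # simulated wait before the first audio bytes
    for _ in range(n_chunks):
        yield b"\x00" * 320  # placeholder bytes, not real audio


latency, size = time_to_first_chunk(fake_audio_stream())
print(f"first chunk after {latency * 1000:.0f} ms, {size} bytes total")
```

In a real client you would pass the chunk iterator from your streaming request in place of `fake_audio_stream()`. A measurement taken this way includes network and client overhead, so it will typically be higher than the model latency alone.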
Learn more about available Sonic model variants and their capabilities in the Models section.