Welcome to Cartesia
Our API enables developers to build real-time, multimodal AI experiences that feel natural and responsive.
The Cartesia API is the fastest ultra-realistic voice AI platform. Purpose-built for developers, it serves state-of-the-art models for both text-to-speech and speech-to-text, enabling seamless conversational AI experiences.
Sonic Models for Text-to-Speech
Sonic models take text input and stream back ultra-realistic speech in response. They can also clone voices, with full control over pronunciation and accent.
Sonic 2 is the world’s fastest ultra-realistic text-to-speech model. It can stream out the first byte of audio in just 90ms, making it perfect for real-time and conversational experiences as well as dubbing, narration, AI avatars, and more. (To put things into perspective, 90ms is about twice as fast as the blink of an eye.)
If latency is your top priority, Sonic Turbo is faster still, streaming out the first byte of audio in just 40ms.
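To check first-byte latency figures like these in your own environment, you can time how long a streaming response takes to deliver its first non-empty chunk. The sketch below is illustrative only: `time_to_first_chunk` and `fake_audio_stream` are hypothetical helpers, and the fake stream is a stand-in that sleeps for 50ms, not a call to the Cartesia API. In practice you would pass in whatever iterable of audio bytes your client returns, and the measured value will also include network round-trip time.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_to_first_chunk(stream: Iterable[bytes]) -> Tuple[float, bytes]:
    """Return (seconds until the first non-empty chunk, that chunk)."""
    start = time.perf_counter()
    for chunk in stream:
        if chunk:  # skip empty keep-alive chunks
            return time.perf_counter() - start, chunk
    raise ValueError("stream ended before any audio arrived")


def fake_audio_stream(delay_s: float = 0.05) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response (not a real API call)."""
    time.sleep(delay_s)   # simulated model + network delay
    yield b"\x00" * 320   # first audio chunk
    yield b"\x00" * 320   # subsequent audio chunk


latency_s, first_chunk = time_to_first_chunk(fake_audio_stream())
print(f"first chunk after {latency_s * 1000:.0f} ms ({len(first_chunk)} bytes)")
```

Start the timer immediately before sending the request so the measurement covers the full wait the user experiences.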
Learn more about available Sonic model variants and their capabilities in the TTS Models section.
Ink Models for Speech-to-Text
Ink models provide streaming speech-to-text transcription optimized for real-time voice applications.
Ink-Whisper, our debut model, is specifically engineered for conversational AI, handling telephony artifacts, background noise, accents, and proper nouns that typically challenge standard STT systems.
Ink-Whisper uses advanced dynamic chunking to process variable-length audio segments, reducing errors and hallucinations during pauses or audio gaps. At just $0.13/hour, it’s the most affordable streaming STT model available.
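As a rough guide to what the quoted rate means at scale, the sketch below estimates transcription cost from audio duration. It assumes billing is strictly linear in audio duration at $0.13/hour, with no minimums, rounding, or volume tiers; that is an assumption for illustration, not a statement of actual billing rules. `estimate_cost_usd` is a hypothetical helper, not part of any SDK.

```python
INK_WHISPER_USD_PER_HOUR = 0.13  # rate quoted above


def estimate_cost_usd(audio_seconds: float,
                      usd_per_hour: float = INK_WHISPER_USD_PER_HOUR) -> float:
    """Estimate cost, assuming linear per-second billing (an assumption)."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return audio_seconds / 3600 * usd_per_hour


# Example: 1,000 hours of call audio in a month.
monthly_cost = estimate_cost_usd(1000 * 3600)
print(f"${monthly_cost:.2f}")  # prints "$130.00"
```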
Learn more about the Ink model and its capabilities in the STT Models section.