Welcome to Cartesia

The Cartesia API is the fastest, most emotive, ultra-realistic voice AI platform. Purpose-built for developers, it serves state-of-the-art models for both text-to-speech and speech-to-text, enabling seamless conversational AI experiences.

Sonic Models for Text-to-Speech

Sonic models take text input and stream back ultra-realistic speech in response. They can also clone voices, with full control over pronunciation and accent. Sonic 3.5 is the world’s fastest, most emotive, ultra-realistic text-to-speech model. It can stream out the first byte of audio in just 90ms, which makes it well suited to real-time and conversational experiences as well as dubbing, narration, AI avatars, and more. (To put this into perspective, 90ms is about twice as fast as the blink of an eye.) Learn more about the available Sonic model variants and their capabilities in the TTS Models section.

Ink Models for Speech-to-Text

Ink models provide speech-to-text transcription optimized for real-time voice agents. Ink 2 is the world’s fastest and most accurate streaming speech-to-text model with native turn detection. It uses context to decide intelligently when the human user is waiting for the agent to respond and when the agent should wait for the user to finish speaking. Learn more about the Ink model and its capabilities in the STT Models section.

Support

Email

Email us at support@cartesia.ai to get help with integrating Cartesia, your account, or billing.
