At a glance
| Capability | Endpoint | Cost |
|---|---|---|
| Agents | Line | Billed per minute in USD, not credits |
| TTS | /tts/bytes, /tts/sse, /tts/websocket | ~1 credit per character |
| PVC / Fine-tune TTS | /tts/bytes, /tts/sse, /tts/websocket | ~1.5 credits per character |
| STT | /stt, /stt/websocket, /stt/turns/websocket | Depends on endpoint, model, and audio duration |
| PVC Fine-Tuning | /fine-tunes/create | 1 million credits per fine-tune |
| Infill | /infill/bytes | 300 credits + ~1 credit per character |
| Voice changer | /voice-changer/bytes, /voice-changer/sse | 15 credits per second |
Agents
Cartesia’s hosted Line voice agents are billed per minute in US dollars. This does not affect your credit balance.

| Feature | Price per minute | Notes |
|---|---|---|
| Agent calling | $0.06 | Base rate for all voice agent calls |
| Telephony (add-on) | +$0.014 | Additional when using a Cartesia-provided number |
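
As a rough illustration of the per-minute rates above, the sketch below estimates the dollar cost of a call. The function name `estimate_agent_cost_usd` and its parameters are hypothetical, and it assumes fractional minutes are billed proportionally; this page does not say how partial minutes are rounded.

```python
from decimal import Decimal

# Rates copied from the table above (USD per minute).
AGENT_RATE_PER_MIN = Decimal("0.06")
TELEPHONY_RATE_PER_MIN = Decimal("0.014")


def estimate_agent_cost_usd(minutes, cartesia_number=False):
    """Estimate the USD cost of an agent call.

    Assumption: partial minutes are billed proportionally.
    Actual rounding behavior is not stated on this page.
    """
    rate = AGENT_RATE_PER_MIN
    if cartesia_number:
        # Telephony add-on applies when using a Cartesia-provided number.
        rate += TELEPHONY_RATE_PER_MIN
    return Decimal(str(minutes)) * rate


if __name__ == "__main__":
    print(estimate_agent_cost_usd(10))
    print(estimate_agent_cost_usd(10, cartesia_number=True))
```

Using `Decimal` instead of floats keeps the dollar amounts exact, which tends to matter once estimates are summed across many calls.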
Text-to-speech
Standard TTS costs approximately 1 credit per character. The exact number of credits can vary slightly due to transcript pre-processing. This applies to every TTS endpoint: /tts/bytes, /tts/sse, and /tts/websocket.
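
A minimal sketch of a pre-flight estimate for a transcript, based on the approximate per-character rate. The helper `estimate_tts_credits` is hypothetical and not part of any Cartesia SDK. Because pre-processing can shift the exact count, treat the result as an estimate rather than the billed amount. The `credits_per_char` parameter can be set to 1.5 to match the Pro Voice Clone rate listed in the At a glance table.

```python
import math


def estimate_tts_credits(transcript: str, credits_per_char: float = 1.0) -> int:
    """Approximate credits for a TTS request.

    Assumption: cost is len(transcript) * credits_per_char, rounded up.
    Actual billing may differ slightly due to transcript pre-processing.
    """
    return math.ceil(len(transcript) * credits_per_char)


if __name__ == "__main__":
    text = "Hello, world!"
    print(estimate_tts_credits(text))
    print(estimate_tts_credits(text, credits_per_char=1.5))
```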
TTS with a Pro Voice Clone
Generating speech with a Pro Voice Clone costs approximately 1.5 credits per character, 50% more than standard TTS, because it runs on a bespoke model fine-tuned to your data. This does not apply to Instant Voice Clones, which are billed at the standard rate.
Speech-to-text
STT pricing depends on the model and whether you use the batch or realtime endpoint. Silence counts toward the billed audio duration, even if no transcript is produced.

| Endpoint | ink-2 | ink-whisper |
|---|---|---|
| /stt/websocket | 3 credits per second of audio | 1 credit per second of audio |
| /stt/turns/websocket | 3 credits per second of audio | 1 credit per second of audio |
| /stt | Not available yet | 1 credit per 2 seconds of audio |
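
The table above can be expressed as a small lookup, sketched below. The function `estimate_stt_credits` and the dictionary name are hypothetical. The sketch assumes credits scale linearly with audio duration (silence included) and does not model any rounding the billing system may apply.

```python
# Credits per second of audio, keyed by (endpoint, model).
# Values are taken from the table above; "/stt" with ink-whisper is
# 1 credit per 2 seconds, i.e. 0.5 credits per second.
STT_CREDITS_PER_SECOND = {
    ("/stt/websocket", "ink-2"): 3.0,
    ("/stt/websocket", "ink-whisper"): 1.0,
    ("/stt/turns/websocket", "ink-2"): 3.0,
    ("/stt/turns/websocket", "ink-whisper"): 1.0,
    ("/stt", "ink-whisper"): 0.5,
}


def estimate_stt_credits(endpoint: str, model: str, audio_seconds: float) -> float:
    """Approximate credits for transcribing audio_seconds of audio.

    audio_seconds should be the full audio duration, including silence.
    Raises ValueError for combinations the table lists as unavailable.
    """
    try:
        rate = STT_CREDITS_PER_SECOND[(endpoint, model)]
    except KeyError:
        raise ValueError(f"{model} is not listed for {endpoint}") from None
    return audio_seconds * rate


if __name__ == "__main__":
    print(estimate_stt_credits("/stt/websocket", "ink-2", 60))
    print(estimate_stt_credits("/stt", "ink-whisper", 60))
```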
Pro Voice Clone Fine-Tuning
Creating a Pro Voice Clone fine-tunes a model on your data via /fine-tunes/create and costs 1,000,000 credits.
You’re only charged when training succeeds. Pro Voice Clones are pinned to the base model they were trained on, so retraining on a new base model or new data costs another 1,000,000 credits.
Infill
Infill generates audio that bridges two existing clips. Each request costs a fixed 300 credits, plus the standard TTS rate applied to the infill transcript.
Voice changer
Voice changer converts input audio into a target voice. It costs 15 credits per second of input audio on both /voice-changer/bytes and /voice-changer/sse.
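
For budgeting, the infill and voice changer rates on this page reduce to two short formulas, sketched below. Both function names are hypothetical. The infill estimate assumes the standard rate of about 1 credit per character and rounds up; the voice changer estimate assumes partial seconds are billed proportionally and then rounded up, which this page does not confirm.

```python
import math

INFILL_BASE_CREDITS = 300              # fixed cost per infill request
TTS_CREDITS_PER_CHAR = 1.0             # approximate standard TTS rate
VOICE_CHANGER_CREDITS_PER_SECOND = 15  # per second of input audio


def estimate_infill_credits(infill_transcript: str) -> int:
    """Fixed 300 credits plus ~1 credit per character of the infill transcript."""
    per_char = math.ceil(len(infill_transcript) * TTS_CREDITS_PER_CHAR)
    return INFILL_BASE_CREDITS + per_char


def estimate_voice_changer_credits(input_seconds: float) -> int:
    """15 credits per second of input audio (rounding up is an assumption)."""
    return math.ceil(input_seconds * VOICE_CHANGER_CREDITS_PER_SECOND)


if __name__ == "__main__":
    print(estimate_infill_credits("and then"))
    print(estimate_voice_changer_credits(10))
```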