Cartesia’s models are portable enough to run on widely available GPU hardware. The table below shows the recommended per-worker concurrency for our TTS (Sonic) and STT (Ink-2) model workers.

| GPU | Sonic concurrency | Ink-2 concurrency |
|---|---|---|
| A10G | 4 | |
| L40S | 8 | 128 |
| A100 | 8 | |
| H100 (MIG) | 8 | 128 |
| H100 | 16 | 256 |

See Metrics for more details on performance metrics.
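If you template worker configurations, you could encode the concurrency table as a lookup. This is a minimal sketch with hypothetical names (`RECOMMENDED_CONCURRENCY`, `recommended_concurrency`). It treats the Ink-2 cells left empty for A10G and A100 as having no published recommendation rather than guessing a value.

```python
from typing import Optional

# Recommended per-worker concurrency, copied from the table above.
# None marks a cell the table leaves empty (no published recommendation).
RECOMMENDED_CONCURRENCY: dict[str, dict[str, Optional[int]]] = {
    "A10G": {"sonic": 4, "ink-2": None},
    "L40S": {"sonic": 8, "ink-2": 128},
    "A100": {"sonic": 8, "ink-2": None},
    "H100 (MIG)": {"sonic": 8, "ink-2": 128},
    "H100": {"sonic": 16, "ink-2": 256},
}


def recommended_concurrency(gpu: str, model: str) -> int:
    """Return the recommended per-worker concurrency for a GPU/model pair.

    Raises KeyError for an unknown GPU or model, and ValueError when the
    table has no recommendation for that combination.
    """
    value = RECOMMENDED_CONCURRENCY[gpu][model.lower()]
    if value is None:
        raise ValueError(f"no published {model} concurrency for {gpu}")
    return value
```

For example, `recommended_concurrency("H100", "Sonic")` returns 16, while asking for Ink-2 on an A10G raises `ValueError` instead of silently falling back to some default.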

Compatibility Matrix

Kubernetes and tooling

| Component | Tested version |
|---|---|
| Kubernetes (AWS EKS) | 1.31 |
| Kubernetes (GCP GKE) | 1.34 (Stable channel) |

GPU

| Component | Value |
|---|---|
| GPU architecture | Ampere or newer (A10G, A100, L40S, H100, H200) |
| GPU memory | 24 GB minimum per device |
| Worker container OS | Ubuntu 22.04 LTS |
| CUDA | 12.9, bundled in the worker image (no host install required) |
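You can check a host against these GPU requirements with a short script. This is a sketch, not a Cartesia tool. It assumes your driver’s `nvidia-smi` supports the `compute_cap` query field. It maps “Ampere or newer” to CUDA compute capability 8.0 or higher (A10G is 8.6, L40S 8.9, H100 9.0). It reads “24 GB” as decimal gigabytes, because `nvidia-smi` reports memory in MiB and 24 GB cards typically report somewhat less than 24 GiB. `GpuInfo`, `parse_nvidia_smi_csv`, and `meets_requirements` are hypothetical names.

```python
import shutil
import subprocess
from dataclasses import dataclass

# Assumed thresholds derived from the GPU table:
# Ampere or newer -> compute capability >= 8.0; 24 GB read as 24e9 bytes.
MIN_COMPUTE_CAPABILITY = (8, 0)
MIN_MEMORY_MIB = 24e9 / 2**20  # about 22888 MiB


@dataclass
class GpuInfo:
    name: str
    memory_mib: int
    compute_capability: tuple[int, int]


def parse_nvidia_smi_csv(output: str) -> list[GpuInfo]:
    """Parse output of `nvidia-smi --query-gpu=name,memory.total,compute_cap
    --format=csv,noheader,nounits`, one GPU per line."""
    gpus = []
    for line in output.strip().splitlines():
        # rsplit keeps any comma inside the GPU name attached to the name.
        name, memory, cap = (field.strip() for field in line.rsplit(",", 2))
        major, minor = cap.split(".")
        gpus.append(GpuInfo(name, int(memory), (int(major), int(minor))))
    return gpus


def meets_requirements(gpu: GpuInfo) -> bool:
    """True if the GPU satisfies both the architecture and memory minimums."""
    return (gpu.compute_capability >= MIN_COMPUTE_CAPABILITY
            and gpu.memory_mib >= MIN_MEMORY_MIB)


if __name__ == "__main__" and shutil.which("nvidia-smi"):
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total,compute_cap",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    ).stdout
    for gpu in parse_nvidia_smi_csv(out):
        print(gpu.name, "OK" if meets_requirements(gpu) else "below minimum")
```

Run on a GPU host, this prints one line per device; on a machine without `nvidia-smi` it does nothing.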

MIG (Multi-Instance GPU)

| Platform | MIG support |
|---|---|
| GKE | Supported via gpu_partition_size on the node pool |
| EKS | Not configured in Terraform; set up manually with the GPU Operator if needed |
| Docker Compose / Swarm | Supported via the --mig flag and nvidia-smi -L UUIDs (see Docker) |

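On Docker hosts, MIG instances are identified by the UUIDs that `nvidia-smi -L` prints. The sketch below extracts them. It does not show how the `--mig` flag consumes them (see Docker for that). The one-worker-per-slice mapping through the NVIDIA Container Toolkit’s `NVIDIA_VISIBLE_DEVICES` variable is an illustrative assumption, and `mig_uuids` and `per_worker_env` are hypothetical names.

```python
import re

# Matches "(UUID: MIG-...)" entries; GPU-level "(UUID: GPU-...)" entries are skipped.
MIG_UUID = re.compile(r"\(UUID:\s*(MIG-[^)\s]+)\)")


def mig_uuids(nvidia_smi_l_output: str) -> list[str]:
    """Extract MIG device UUIDs from `nvidia-smi -L` output, in listed order."""
    return MIG_UUID.findall(nvidia_smi_l_output)


def per_worker_env(uuids: list[str]) -> list[dict[str, str]]:
    """Hypothetical layout: pin one worker container to each MIG slice."""
    return [{"NVIDIA_VISIBLE_DEVICES": uuid} for uuid in uuids]


sample = """GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-11111111-2222-3333-4444-555555555555)
  MIG 3g.40gb     Device  0: (UUID: MIG-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee)
  MIG 3g.40gb     Device  1: (UUID: MIG-ffffffff-0000-1111-2222-333333333333)
"""

for env in per_worker_env(mig_uuids(sample)):
    print(env)
```

The sample output above is made up for illustration; in practice you would pass the real output of `nvidia-smi -L`.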
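When choosing hardware, consider the tradeoff between latency (time to first audio, TTFA) and throughput: raising concurrency increases throughput per worker but also increases TTFA. The table below shows these metrics at increasing concurrency levels, measured on the GPUs we test on: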
The benchmarks below are for Sonic 3.5 and require release tag sonic-20260503 or later. Updated April 2026.
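
| Concurrency | TTFA P50 (ms) | TTFA P95 (ms) | RTF P50 | RTF P95 | Throughput (chars/s) |
|---|---|---|---|---|---|
| 1 | 50 | 55 | 0.10 | 0.10 | 105 |
| 2 | 50 | 55 | 0.10 | 0.10 | 200 |
| 4 | 80 | 115 | 0.15 | 0.15 | 325 |
| 8 | 120 | 165 | 0.20 | 0.20 | 550 |
| 12 | 125 | 225 | 0.20 | 0.25 | 760 |
| 16 | 195 | 300 | 0.30 | 0.30 | 795 |
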
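One way to read the benchmark table is to fix a latency budget and pick the highest concurrency that stays within it. The sketch below does this against TTFA P95 using the rows exactly as printed. `BenchmarkRow` and `max_concurrency_within` are hypothetical names, and using P95 rather than P50 as the budget is an assumption you may want to change.

```python
from typing import NamedTuple, Optional


class BenchmarkRow(NamedTuple):
    concurrency: int
    ttfa_p50_ms: int
    ttfa_p95_ms: int
    rtf_p50: float
    rtf_p95: float
    throughput_chars_s: int


# Sonic 3.5 rows copied from the table above.
SONIC_35 = [
    BenchmarkRow(1, 50, 55, 0.10, 0.10, 105),
    BenchmarkRow(2, 50, 55, 0.10, 0.10, 200),
    BenchmarkRow(4, 80, 115, 0.15, 0.15, 325),
    BenchmarkRow(8, 120, 165, 0.20, 0.20, 550),
    BenchmarkRow(12, 125, 225, 0.20, 0.25, 760),
    BenchmarkRow(16, 195, 300, 0.30, 0.30, 795),
]


def max_concurrency_within(
    budget_ms: float, rows: list[BenchmarkRow] = SONIC_35
) -> Optional[BenchmarkRow]:
    """Return the highest-concurrency row whose TTFA P95 fits the budget,
    or None if no row does."""
    fitting = [r for r in rows if r.ttfa_p95_ms <= budget_ms]
    return max(fitting, key=lambda r: r.concurrency, default=None)


best = max_concurrency_within(200)
if best is not None:
    print(best.concurrency, best.throughput_chars_s)
```

With a 200 ms P95 budget this selects concurrency 8 at 550 chars/s. Note the diminishing returns at the top of the table: going from 12 to 16 adds about 5% throughput (760 to 795 chars/s) while P95 TTFA rises by 75 ms.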
Use these numbers to set your per-worker concurrency. To meet your application’s scaling requirements, you also need to configure autoscaling behavior; see Autoscaling for details.
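Before configuring autoscaling, a back-of-envelope check is useful: the number of workers needed at peak is the peak number of concurrent streams divided by the per-worker concurrency, rounded up. This sketch is not Cartesia’s autoscaler. `workers_needed` and its `headroom_pct` safety margin are hypothetical, and the function uses integer arithmetic so the rounding is exact.

```python
def workers_needed(peak_streams: int, per_worker_concurrency: int,
                   headroom_pct: int = 20) -> int:
    """Rough replica count for a given peak of concurrent streams.

    headroom_pct is a hypothetical spare-capacity margin (20 means 20%),
    not a Cartesia setting.
    """
    if per_worker_concurrency <= 0:
        raise ValueError("per_worker_concurrency must be positive")
    # Scale streams by (100 + headroom) to stay in integers, then take the
    # ceiling of the division with the negate/floor-divide/negate idiom.
    required = peak_streams * (100 + headroom_pct)
    return -(-required // (100 * per_worker_concurrency))


print(workers_needed(100, 16))
```

For example, 100 concurrent Sonic streams on H100 workers (16 per worker) with 20% headroom gives 8 workers; without headroom it gives 7.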