Cartesia’s models are portable enough to run on widely available GPU hardware.
The table below lists the recommended per-worker concurrency for our TTS (Sonic) and STT (Ink-2) model workers on each GPU.
| GPU | Sonic Concurrency | Ink-2 Concurrency |
|---|---|---|
| A10G | 4 | |
| L40S | 8 | 128 |
| A100 | 8 | |
| H100 (MIG) | 8 | 128 |
| H100 | 16 | 256 |
See Metrics for more details on performance metrics.
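The recommended concurrency values can be turned into a rough capacity estimate. The sketch below is illustrative only: it assumes one Sonic worker per GPU (or per MIG slice), and the `workers_needed` helper and its `headroom` parameter are hypothetical names, not part of any Cartesia tooling.

```python
import math

# Recommended Sonic (TTS) concurrency per worker, copied from the table above.
SONIC_CONCURRENCY = {
    "A10G": 4,
    "L40S": 8,
    "A100": 8,
    "H100 (MIG)": 8,
    "H100": 16,
}


def workers_needed(peak_streams: int, gpu: str, headroom: float = 0.0) -> int:
    """Estimate how many workers of one GPU type cover `peak_streams`.

    `headroom` is a hypothetical safety margin (0.2 means 20% spare capacity).
    Assumes one worker per GPU or MIG slice.
    """
    if peak_streams < 0:
        raise ValueError("peak_streams must be non-negative")
    per_worker = SONIC_CONCURRENCY[gpu]
    return math.ceil(peak_streams * (1.0 + headroom) / per_worker)


print(workers_needed(100, "H100"))       # ceil(100 / 16) = 7
print(workers_needed(100, "L40S", 0.2))  # ceil(120 / 8) = 15
```

Treat the result as a starting point; the benchmark tables further down show how latency changes as concurrency rises.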
Compatibility Matrix
| Component | Tested version |
|---|---|
| Kubernetes (AWS EKS) | 1.31 |
| Kubernetes (GCP GKE) | 1.34 (Stable channel) |
GPU
| Component | Value |
|---|---|
| GPU architecture | Ampere or newer (A10G, A100, L40S, H100, H200) |
| GPU memory | 24 GB minimum per device |
| Worker container OS | Ubuntu 22.04 LTS |
| CUDA | 12.9 — bundled in the worker image, no host install required |
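One way to check a host against the architecture and memory requirements above is to query `nvidia-smi`. The sketch below makes two assumptions: Ampere or newer means compute capability 8.0 or higher, and the 24 GB minimum is read as decimal gigabytes (reported memory is usually slightly below the nominal figure). The helper names are hypothetical, and the `compute_cap` query field needs a reasonably recent NVIDIA driver.

```python
import subprocess

# Assumption: "24 GB minimum" is read as 24 * 10^9 bytes, expressed here in MiB.
MIN_MEMORY_MIB = 24 * 10**9 / 2**20
# Assumption: "Ampere or newer" maps to compute capability >= 8.0.
MIN_COMPUTE_CAP = (8, 0)

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total,compute_cap",
    "--format=csv,noheader,nounits",
]


def parse_gpus(csv_text):
    """Parse nvidia-smi CSV output into a list of dicts, one per GPU."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, memory, cap = [field.strip() for field in line.split(",")]
        major, minor = cap.split(".")
        gpus.append({
            "name": name,
            "memory_mib": int(memory),
            "compute_cap": (int(major), int(minor)),
        })
    return gpus


def meets_requirements(gpu):
    """Return True if the GPU satisfies both assumed thresholds."""
    return (gpu["memory_mib"] >= MIN_MEMORY_MIB
            and gpu["compute_cap"] >= MIN_COMPUTE_CAP)


def query_gpus():
    """Run nvidia-smi on this host and return the parsed GPU list."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpus(result.stdout)
```

On a GPU host, `[(g["name"], meets_requirements(g)) for g in query_gpus()]` gives a quick pass/fail list per device.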
MIG (Multi-Instance GPU)
| Platform | MIG support |
|---|---|
| GKE | Supported via gpu_partition_size on the node pool |
| EKS | Not configured in Terraform — set up manually with the GPU Operator if needed |
| Docker Compose / Swarm | Supported via --mig flag and nvidia-smi -L UUIDs (see Docker) |
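For the Docker path, the MIG device UUIDs come from `nvidia-smi -L`. The sketch below only extracts those UUIDs from the command's text output; how they are passed to the `--mig` flag is covered on the Docker page and is not shown here. The sample output uses placeholder UUIDs, and `mig_uuids` is a hypothetical helper name.

```python
import re

# Matches "(UUID: MIG-...)" entries in `nvidia-smi -L` output.
MIG_UUID = re.compile(r"\(UUID:\s*(MIG-[^)\s]+)\)")


def mig_uuids(nvidia_smi_l_output: str) -> list[str]:
    """Return the MIG device UUIDs found in `nvidia-smi -L` output, in order."""
    return MIG_UUID.findall(nvidia_smi_l_output)


# Placeholder output in the shape `nvidia-smi -L` prints for a MIG-enabled GPU.
SAMPLE = """\
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-11111111-2222-3333-4444-555555555555)
  MIG 3g.40gb     Device  0: (UUID: MIG-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee)
  MIG 3g.40gb     Device  1: (UUID: MIG-ffffffff-0000-1111-2222-333333333333)
"""

for uuid in mig_uuids(SAMPLE):
    print(uuid)
```

The parent GPU's own UUID (prefixed `GPU-`) is deliberately skipped, since only the MIG slices are of interest here.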
When choosing hardware, consider the tradeoff between latency (time to first audio, TTFA) and throughput.
The tables below show these metrics for each GPU we test on:
The benchmarks below are for Sonic 3.5 and require release tag sonic-20260503 or later. Updated April 2026.
H100

| Concurrency | TTFA P50 (ms) | TTFA P95 (ms) | RTF P50 | RTF P95 | Throughput (chars/s) |
|---|---|---|---|---|---|
| 1 | 50 | 55 | 0.10 | 0.10 | 105 |
| 2 | 50 | 55 | 0.10 | 0.10 | 200 |
| 4 | 80 | 115 | 0.15 | 0.15 | 325 |
| 8 | 120 | 165 | 0.20 | 0.20 | 550 |
| 12 | 125 | 225 | 0.20 | 0.25 | 760 |
| 16 | 195 | 300 | 0.30 | 0.30 | 795 |

H100 (MIG)

| Concurrency | TTFA P50 (ms) | TTFA P95 (ms) | RTF P50 | RTF P95 | Throughput (chars/s) |
|---|---|---|---|---|---|
| 1 | 60 | 65 | 0.10 | 0.15 | 125 |
| 2 | 65 | 100 | 0.15 | 0.15 | 230 |
| 4 | 110 | 150 | 0.15 | 0.20 | 385 |
| 8 | 165 | 230 | 0.25 | 0.25 | 575 |
| 12 | 215 | 290 | 0.30 | 0.35 | 730 |
| 16 | 290 | 340 | 0.35 | 0.40 | 780 |

L40S

| Concurrency | TTFA P50 (ms) | TTFA P95 (ms) | RTF P50 | RTF P95 | Throughput (chars/s) |
|---|---|---|---|---|---|
| 1 | 45 | 50 | 0.10 | 0.10 | 100 |
| 2 | 50 | 55 | 0.15 | 0.15 | 180 |
| 4 | 75 | 105 | 0.15 | 0.15 | 330 |
| 8 | 125 | 165 | 0.20 | 0.25 | 485 |

A100

| Concurrency | TTFA P50 (ms) | TTFA P95 (ms) | RTF P50 | RTF P95 | Throughput (chars/s) |
|---|---|---|---|---|---|
| 1 | 60 | 65 | 0.15 | 0.15 | 85 |
| 2 | 70 | 85 | 0.15 | 0.15 | 150 |
| 4 | 100 | 135 | 0.20 | 0.20 | 285 |
| 8 | 145 | 260 | 0.25 | 0.30 | 410 |

A10

| Concurrency | TTFA P50 (ms) | TTFA P95 (ms) | RTF P50 | RTF P95 | Throughput (chars/s) |
|---|---|---|---|---|---|
| 1 | 80 | 85 | 0.15 | 0.20 | 75 |
| 2 | 90 | 155 | 0.20 | 0.20 | 130 |
| 4 | 165 | 240 | 0.25 | 0.30 | 210 |
| 8 | 270 | 355 | 0.40 | 0.45 | 305 |
Use these numbers to set up your per-worker configuration. To handle your application’s scaling requirements, you’ll also need to configure autoscaling behavior. See autoscaling for more details.
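As one way to move from the benchmark numbers to a per-worker concurrency setting, the sketch below picks the highest benchmarked concurrency whose TTFA P95 stays within a latency budget. It uses the rows of the first benchmark table, assuming that table corresponds to H100 (the first GPU listed). The `Row` type, the `max_concurrency` helper, and the budget values are hypothetical.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    concurrency: int
    ttfa_p95_ms: int
    throughput_chars_s: int


# Concurrency, TTFA P95, and throughput from the first benchmark table
# (assumed to be the H100 results).
H100 = [
    Row(1, 55, 105),
    Row(2, 55, 200),
    Row(4, 115, 325),
    Row(8, 165, 550),
    Row(12, 225, 760),
    Row(16, 300, 795),
]


def max_concurrency(rows, ttfa_p95_budget_ms: int) -> Optional[Row]:
    """Return the highest-concurrency row within the TTFA P95 budget, or None."""
    within_budget = [r for r in rows if r.ttfa_p95_ms <= ttfa_p95_budget_ms]
    if not within_budget:
        return None
    return max(within_budget, key=lambda r: r.concurrency)


choice = max_concurrency(H100, ttfa_p95_budget_ms=200)
print(choice.concurrency, choice.throughput_chars_s)  # 8 550
```

A tighter budget trades throughput for latency: with a 120 ms budget the same data yields concurrency 4 at 325 chars/s.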