Hardware Selection

Cartesia’s models are portable enough to run on widely available GPU hardware. The table below shows the recommended per-worker concurrency for our TTS (Sonic) and STT (Ink-2) model workers.

GPU	Sonic Concurrency	Ink-2 Concurrency
A10G	4
L40S	8	128
A100	8
H100 (MIG)	8	128
H100	16	256

See Metrics for more details on performance metrics.
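
As a rough illustration of how the recommended concurrency feeds into capacity planning, the sketch below estimates how many Sonic workers a given number of simultaneous streams would need. The function name `workers_needed` and the 20% headroom default are assumptions made for this example, not part of the Cartesia tooling; only the concurrency figures come from the table above.

```python
# Recommended Sonic concurrency per worker, copied from the table above.
SONIC_CONCURRENCY = {
    "A10G": 4,
    "L40S": 8,
    "A100": 8,
    "H100 (MIG)": 8,
    "H100": 16,
}


def workers_needed(peak_streams: int, gpu: str, headroom_pct: int = 20) -> int:
    """Estimate the worker count for `peak_streams` concurrent Sonic streams.

    `headroom_pct` is a hypothetical safety margin (an assumption of this
    sketch); integer arithmetic keeps the rounding exact.
    """
    if peak_streams < 0 or headroom_pct < 0:
        raise ValueError("peak_streams and headroom_pct must be non-negative")
    per_worker = SONIC_CONCURRENCY[gpu]
    # Scale the target by (100 + headroom)% and divide by per-worker capacity,
    # rounding up so that a partial worker counts as a whole one.
    scaled_target = peak_streams * (100 + headroom_pct)
    workers = -(-scaled_target // (100 * per_worker))
    return max(1, workers)


if __name__ == "__main__":
    for gpu in SONIC_CONCURRENCY:
        print(gpu, workers_needed(100, gpu))
```

Treat the result as a starting point only: real traffic is bursty, and the benchmark tables later on this page show how latency changes as concurrency rises.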

Compatibility Matrix

Kubernetes and tooling

Component	Tested version
Kubernetes (AWS EKS)	`1.31`
Kubernetes (GCP GKE)	`1.34` (Stable channel)

GPU

Component	Value
GPU architecture	Ampere or newer (A10G, A100, L40S, H100, H200)
GPU memory	24 GB minimum per device
Worker container OS	Ubuntu 22.04 LTS
CUDA	`12.9` — bundled in the worker image, no host install required

MIG (Multi-Instance GPU)

Platform	MIG support
GKE	Supported via `gpu_partition_size` on the node pool
EKS	Not configured in Terraform — set up manually with the GPU Operator if needed
Docker Compose / Swarm	Supported via `--mig` flag and `nvidia-smi -L` UUIDs (see Docker)
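
For the Docker path, the MIG device UUIDs come from the output of `nvidia-smi -L`. The sketch below shows one way to pull them out of that output with a regular expression. The helper names and the sample text are made up for illustration, and how the UUIDs are then passed to `--mig` is covered in the Docker guide rather than assumed here.

```python
import re
import subprocess

# Matches MIG UUIDs as printed by `nvidia-smi -L`, e.g. "(UUID: MIG-1234...)".
# The slash is allowed because older drivers print "MIG-GPU-<uuid>/<gi>/<ci>".
MIG_UUID = re.compile(r"\(UUID:\s*(MIG-[0-9A-Za-z/-]+)\)")


def parse_mig_uuids(listing: str) -> list[str]:
    """Return every MIG device UUID found in `nvidia-smi -L` output."""
    return MIG_UUID.findall(listing)


def read_mig_uuids() -> list[str]:
    """Run `nvidia-smi -L` on this host and parse its output."""
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
    )
    return parse_mig_uuids(result.stdout)


# Made-up sample output for illustration; real UUIDs will differ.
SAMPLE = """\
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-11111111-2222-3333-4444-555555555555)
  MIG 3g.40gb     Device  0: (UUID: MIG-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee)
  MIG 3g.40gb     Device  1: (UUID: MIG-ffffffff-0000-1111-2222-333333333333)
"""

if __name__ == "__main__":
    for uuid in parse_mig_uuids(SAMPLE):
        print(uuid)
```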

When choosing hardware, consider the tradeoff between latency (time to first audio, TTFA) and throughput. The tables below show these metrics for each GPU we test on:

The benchmarks below are for Sonic 3.5 and require release tag sonic-20260503 or later. Updated April 2026.

H100

Concurrency	TTFA P50 (ms)	TTFA P95 (ms)	RTF P50	RTF P95	Throughput (chars/s)
1	50	55	0.10	0.10	105
2	50	55	0.10	0.10	200
4	80	115	0.15	0.15	325
8	120	165	0.20	0.20	550
12	125	225	0.20	0.25	760
16	195	300	0.30	0.30	795

H100 (MIG)

Concurrency	TTFA P50 (ms)	TTFA P95 (ms)	RTF P50	RTF P95	Throughput (chars/s)
1	60	65	0.10	0.15	125
2	65	100	0.15	0.15	230
4	110	150	0.15	0.20	385
8	165	230	0.25	0.25	575
12	215	290	0.30	0.35	730
16	290	340	0.35	0.40	780

L40S

Concurrency	TTFA P50 (ms)	TTFA P95 (ms)	RTF P50	RTF P95	Throughput (chars/s)
1	45	50	0.10	0.10	100
2	50	55	0.15	0.15	180
4	75	105	0.15	0.15	330
8	125	165	0.20	0.25	485

A100

Concurrency	TTFA P50 (ms)	TTFA P95 (ms)	RTF P50	RTF P95	Throughput (chars/s)
1	60	65	0.15	0.15	85
2	70	85	0.15	0.15	150
4	100	135	0.20	0.20	285
8	145	260	0.25	0.30	410

A10

Concurrency	TTFA P50 (ms)	TTFA P95 (ms)	RTF P50	RTF P95	Throughput (chars/s)
1	80	85	0.15	0.20	75
2	90	155	0.20	0.20	130
4	165	240	0.25	0.30	210
8	270	355	0.40	0.45	305

Use these figures to set up your per-worker configuration. To handle your application’s scaling requirements, you’ll also need to configure autoscaling behavior. See autoscaling for more details.
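
One way to turn the benchmark tables into a per-worker setting is to pick the highest concurrency whose TTFA P95 still fits your latency budget. The sketch below does this for the first benchmark table above, which we take to be the H100 results based on the tab order. The `BenchRow` type, the function name, and the example budgets are assumptions made for this example.

```python
from typing import NamedTuple, Optional, Sequence


class BenchRow(NamedTuple):
    concurrency: int
    ttfa_p50_ms: int
    ttfa_p95_ms: int
    throughput_chars_per_s: int


# Values copied from the first benchmark table (assumed to be H100).
H100_BENCH = [
    BenchRow(1, 50, 55, 105),
    BenchRow(2, 50, 55, 200),
    BenchRow(4, 80, 115, 325),
    BenchRow(8, 120, 165, 550),
    BenchRow(12, 125, 225, 760),
    BenchRow(16, 195, 300, 795),
]


def max_concurrency_within(
    rows: Sequence[BenchRow], ttfa_p95_budget_ms: int
) -> Optional[BenchRow]:
    """Return the highest-concurrency row whose TTFA P95 fits the budget.

    Returns None when no benchmarked concurrency meets the budget.
    """
    fitting = [row for row in rows if row.ttfa_p95_ms <= ttfa_p95_budget_ms]
    if not fitting:
        return None
    return max(fitting, key=lambda row: row.concurrency)


if __name__ == "__main__":
    for budget_ms in (100, 200, 300):
        print(budget_ms, max_concurrency_within(H100_BENCH, budget_ms))
```

With a 200 ms TTFA P95 budget this selects concurrency 8 (550 chars/s). Raising the budget to 300 ms allows concurrency 16, at the cost of higher latency for every stream on that worker.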
