Managed buffering
Stream LLM tokens directly to Cartesia and let the API decide when to start generating speech. This is the same approach used in Cartesia’s managed voice agents platform. Set `max_buffer_delay_ms` to a value greater than 0 (the default is 3000ms) and stream text token by token. The API buffers the incoming text and begins generating speech once it has enough context for natural-sounding audio or once `max_buffer_delay_ms` elapses, whichever comes first. This produces results similar to sentence-level aggregation while still optimizing for latency.
When to use managed buffering:
- You’re streaming LLM output token by token
- You want natural-sounding speech without building buffering logic
- You want a simple integration with good defaults
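For illustration, here is a minimal sketch of managed buffering over a raw WebSocket. The endpoint URL, auth scheme, and message fields other than `max_buffer_delay_ms` and `continue` (such as `transcript`, `context_id`, and `voice`) are assumptions modeled on Cartesia’s TTS WebSocket protocol, and the model and voice IDs are placeholders.

```python
import json
import os

import websockets  # pip install websockets


async def speak_managed(tokens):
    """Forward LLM tokens as they arrive; the API decides when to speak."""
    # Endpoint and auth scheme are assumptions for illustration.
    url = f"wss://api.cartesia.ai/tts/websocket?api_key={os.environ['CARTESIA_API_KEY']}"
    async with websockets.connect(url) as ws:
        base = {
            "model_id": "YOUR_MODEL_ID",       # placeholder
            "voice": {"id": "YOUR_VOICE_ID"},  # placeholder
            "context_id": "turn-1",            # groups chunks into one utterance
            "max_buffer_delay_ms": 3000,       # > 0 enables managed buffering
        }
        async for token in tokens:
            # Send each token immediately; no client-side aggregation.
            await ws.send(json.dumps({**base, "transcript": token, "continue": True}))
        # Turn complete: don't leave the API waiting out the buffer delay.
        await ws.send(json.dumps({**base, "transcript": "", "continue": False}))
```

In a real agent, `tokens` would be the async token stream from your LLM client; audio arrives as messages on the same socket (handling omitted here).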
Custom buffering
Handle buffering yourself and send complete phrases or sentences to Cartesia. Set `max_buffer_delay_ms` to 0 so the API generates speech immediately from whatever you provide.
- Full sentences produce the best prosody but add latency while you wait for the sentence to complete.
- Partial sentences reduce latency but may result in less natural speech at chunk boundaries.
When to use custom buffering:
- You need precise control over when speech generation starts
- You have your own sentence detection or text aggregation logic
- You’re optimizing for a specific latency target
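A matching sketch of custom buffering, reusing the hypothetical `ws` connection, `base` message fields, and `json` import from the managed-buffering example above: a naive punctuation check aggregates tokens into sentences client-side, and each complete sentence goes out with `max_buffer_delay_ms` set to 0.

```python
SENTENCE_ENDINGS = (".", "?", "!")


async def speak_custom(tokens, ws, base):
    """Aggregate tokens client-side; send only complete sentences."""
    base = {**base, "max_buffer_delay_ms": 0}  # 0 = no server-side buffering
    buffer = ""
    async for token in tokens:
        buffer += token
        # Naive sentence detection: real aggregation logic should also
        # handle abbreviations, decimals, ellipses, and the like.
        if buffer.rstrip().endswith(SENTENCE_ENDINGS):
            await ws.send(json.dumps({**base, "transcript": buffer, "continue": True}))
            buffer = ""
    # Flush any trailing partial sentence and close the turn.
    await ws.send(json.dumps({**base, "transcript": buffer, "continue": False}))
```

Swapping the punctuation check for a phrase-level splitter moves you along the latency/prosody trade-off described above.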
Avoid the middle ground
A common mistake is to aggregate text client-side into sentences or phrases and use the default `max_buffer_delay_ms` of 3000ms. This can cause unnecessary latency: after receiving a complete sentence, the API may wait up to 3000ms for additional input before generating speech.
Pick one approach:
- Managed buffering: Stream tokens with `max_buffer_delay_ms > 0` and let Cartesia handle aggregation.
- Custom buffering: Aggregate text yourself and set `max_buffer_delay_ms = 0`.
Configuration reference
`max_buffer_delay_ms`: Maximum time in milliseconds the API waits for additional input before generating speech from buffered text.
- Range: 0–5000ms
- Default: 3000ms
- Set to `0` for custom buffering (no server-side buffering)
- Set to a value `> 0` for managed buffering
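Side by side, using the hypothetical message shape from the sketches above:

```python
# Managed buffering: raw tokens in; the server waits up to
# max_buffer_delay_ms (default 3000, maximum 5000) for more input.
managed = {"transcript": "Hel", "continue": True, "max_buffer_delay_ms": 3000}

# Custom buffering: complete sentences in; generation starts immediately.
custom = {"transcript": "Hello there.", "continue": True, "max_buffer_delay_ms": 0}
```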
Tips for best results
- End sentences with punctuation. Without closing punctuation (`.`, `?`, `!`), the model may treat text as incomplete and wait for the buffer delay to elapse before generating. See streaming inputs with continuations for more details.
- Signal when input is done. When a turn is complete, use `continue: false` (WebSocket) or `no_more_inputs()` (SDK) so the model doesn’t wait for more text.
- Test with realistic input patterns. Buffering behavior depends on how text arrives, so test with actual LLM output rather than pre-written text.
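To make the first two tips concrete with the hypothetical `ws` and `base` from the sketches above (the two `transcript` sends are alternatives, not a sequence):

```python
async def finish_turn(ws, base):
    # Alternative 1: no closing punctuation, so the model may hold the text
    # until max_buffer_delay_ms elapses before generating.
    await ws.send(json.dumps({**base, "transcript": "Thanks for calling", "continue": True}))

    # Alternative 2: closing punctuation lets the model treat the text as
    # complete and start generating sooner.
    await ws.send(json.dumps({**base, "transcript": "Thanks for calling.", "continue": True}))

    # Turn complete: "continue": False tells the API not to wait for more
    # text (the SDK equivalent this guide names is no_more_inputs()).
    await ws.send(json.dumps({**base, "transcript": "", "continue": False}))
```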