Cartesia supports two buffering modes for streaming TTS: managed buffering and custom buffering. The right choice depends on how much control you need over the prosody-latency tradeoff.
Start with managed buffering. It produces natural-sounding speech with minimal integration effort. Switch to custom buffering only if you need fine-grained control.

Managed buffering

Stream LLM tokens directly to Cartesia and let the API decide when to start generating speech. This is the same approach used in Cartesia’s managed voice agents platform. Set max_buffer_delay_ms to a value greater than 0 (the default is 3000ms) and stream text token by token.
{
  "model_id": "sonic-3",
  "transcript": "Hello",
  "voice": {
    "mode": "id",
    "id": "a0e99841-438c-4a64-b679-ae501e7d6091"
  },
  "context_id": "my-context",
  "continue": true,
  "max_buffer_delay_ms": 3000
}
The API buffers incoming text until it has enough context to produce high-quality speech, or until max_buffer_delay_ms elapses, whichever comes first. This produces results similar to sentence-level aggregation while still optimizing for latency.
When to use managed buffering:
  • You’re streaming LLM output token by token
  • You want natural-sounding speech without building buffering logic
  • You want a simple integration with good defaults
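With managed buffering, the client simply forwards each LLM token as its own WebSocket message on a shared context_id. A minimal sketch of building those messages (the token_message helper and the token loop are illustrative; only the JSON fields shown in the request above come from the API):

```python
import json

# Illustrative values mirroring the request example above.
VOICE = {"mode": "id", "id": "a0e99841-438c-4a64-b679-ae501e7d6091"}

def token_message(token: str, context_id: str) -> str:
    """One WebSocket message per LLM token; continue=True signals
    that more text will follow on the same context."""
    return json.dumps({
        "model_id": "sonic-3",
        "transcript": token,
        "voice": VOICE,
        "context_id": context_id,
        "continue": True,
        "max_buffer_delay_ms": 3000,  # managed buffering: the API decides when to speak
    })

# Stream tokens as they arrive from the LLM.
for tok in ["Hello", ",", " my", " name", " is", " Sonic", "."]:
    payload = token_message(tok, "my-context")
    # ws.send(payload)  # send over your open Cartesia WebSocket
# After the last token, send a final message with continue: false.
```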

Custom buffering

Handle buffering yourself and send complete phrases or sentences to Cartesia. Set max_buffer_delay_ms to 0 so the API generates speech immediately from whatever you provide.
{
  "model_id": "sonic-3",
  "transcript": "Hello, my name is Sonic.",
  "voice": {
    "mode": "id",
    "id": "a0e99841-438c-4a64-b679-ae501e7d6091"
  },
  "context_id": "my-context",
  "continue": true,
  "max_buffer_delay_ms": 0
}
With custom buffering, you control the prosody-latency tradeoff directly:
  • Full sentences produce the best prosody but add latency while you wait for the sentence to complete.
  • Partial sentences reduce latency but may result in less natural speech at chunk boundaries.
When to use custom buffering:
  • You need precise control over when speech generation starts
  • You have your own sentence detection or text aggregation logic
  • You’re optimizing for a specific latency target
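A minimal client-side aggregator for custom buffering might flush on sentence-ending punctuation. This is a sketch, not part of the Cartesia API; real sentence detection also needs to handle decimals, abbreviations, and similar edge cases:

```python
SENTENCE_END = (".", "?", "!")

def aggregate(tokens):
    """Group a token stream into complete sentences, yielding each
    sentence as soon as its closing punctuation arrives. Any trailing
    text without punctuation is yielded at the end of the stream."""
    buf = ""
    for tok in tokens:
        buf += tok
        if buf.rstrip().endswith(SENTENCE_END):
            yield buf.strip()
            buf = ""
    if buf.strip():
        yield buf.strip()

# Each yielded sentence becomes one request with max_buffer_delay_ms = 0.
sentences = list(aggregate(["Hello", ",", " my", " name", " is", " Sonic", ".", " Hi", "!"]))
# sentences == ["Hello, my name is Sonic.", "Hi!"]
```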

Avoid the middle ground

A common mistake is to aggregate text client-side into sentences or phrases while keeping the default max_buffer_delay_ms of 3000ms. This can cause unnecessary latency: after receiving a complete sentence, the API may wait up to 3000ms for additional input before generating speech.
Pick one approach:
  • Managed buffering: Stream tokens with max_buffer_delay_ms > 0 and let Cartesia handle aggregation.
  • Custom buffering: Aggregate text yourself and set max_buffer_delay_ms = 0.

Configuration reference

max_buffer_delay_ms (number)
Maximum time in milliseconds the API waits for additional input before generating speech from buffered text.
  • Range: 0–5000ms
  • Default: 3000ms
  • Set to 0 for custom buffering (no server-side buffering)
  • Set to > 0 for managed buffering
If you use speed or volume SSML tags with managed buffering, make sure decimal values are not split across tokens: submitting 1.0 as the three separate tokens 1, ., 0 will cause parsing errors.
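One way to guard against this when forwarding raw LLM tokens is to coalesce any adjacent token pair that would split a number. The coalesce_decimals helper below is a sketch, not part of the API; over-merging two tokens is harmless, since it only means they are sent in the same message:

```python
def coalesce_decimals(tokens):
    """Merge adjacent tokens so a decimal like 1.0 is never split
    across messages: ["1", ".", "0"] becomes ["1.0"]."""
    out = []
    for tok in tokens:
        if tok and out and out[-1] and (
            (out[-1][-1].isdigit() and tok[0] == ".")   # "1" + "."
            or (out[-1][-1] == "." and tok[0].isdigit())  # "1." + "0"
        ):
            out[-1] += tok
        else:
            out.append(tok)
    return out

coalesce_decimals(["speed to ", "1", ".", "0"])
# → ["speed to ", "1.0"]
```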

Tips for best results

  • End sentences with punctuation. Without closing punctuation (., ?, !), the model may treat text as incomplete and wait for the buffer delay to elapse before generating. See streaming inputs with continuations for more details.
  • Signal when input is done. When a turn is complete, use continue: false (WebSocket) or no_more_inputs() (SDK) so the model doesn’t wait for more text.
  • Test with realistic input patterns. Buffering behavior depends on how text arrives—test with actual LLM output rather than pre-written text.
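Over the raw WebSocket, signaling the end of a turn can look like the sketch below. The end_of_turn helper is ours, and sending an empty transcript as the closing message is an assumption; check the WebSocket reference for the exact shape of the final message:

```python
import json

def end_of_turn(context_id: str) -> str:
    """Final message for a context: continue=False tells the model no
    more text is coming, so it should not wait out the buffer delay.
    (Assumption: an empty transcript is used to close the context.)"""
    return json.dumps({
        "context_id": context_id,
        "transcript": "",
        "continue": False,
    })

# ws.send(end_of_turn("my-context"))  # after the last text chunk
```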