Streaming Responses
Streaming responses enable real-time, chunk-by-chunk delivery of agent responses, providing lower latency and more natural conversational flow.
Pattern Overview
This pattern enables:
- Low Latency: Start speaking as soon as the first chunk is available
- Natural Flow: Stream responses as they’re generated by LLMs
- Interruption Support: Each chunk can be individually interrupted
- Progressive Processing: Handle large responses incrementally
Key Components
Events
- `AgentResponse`: Individual response chunks with content
- Chunk Types: Text chunks, audio chunks, or other media types
- Progressive Delivery: Multiple events for a single logical response
Nodes
- Async Generators: Use `yield` to emit response chunks
- LLM Integration: Stream directly from LLM API responses
- Chunk Processing: Transform and validate each chunk
Routes
- `stream()`: Process async generators that yield multiple values
- `broadcast()`: Send each chunk immediately to output
- Interruption: Handle cancellation between chunks
Basic Streaming Example
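The sketch below shows the general shape of a streaming node, assuming an OpenAI-style async client and a minimal `AgentResponse` dataclass standing in for the framework's event type; the actual event class and node signature may differ in your setup.

```python
# Minimal streaming node sketch. `AgentResponse` and the node signature are
# assumptions based on the component descriptions above, not a specific SDK API.
from dataclasses import dataclass
from typing import AsyncIterator

from openai import AsyncOpenAI  # assumes the official openai package

client = AsyncOpenAI()

@dataclass
class AgentResponse:
    content: str

async def streaming_node(user_text: str) -> AsyncIterator[AgentResponse]:
    """Yield one AgentResponse per chunk as the LLM produces tokens."""
    stream = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_text}],
        stream=True,
    )
    async for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:  # skip empty keep-alive deltas
            yield AgentResponse(content=delta)
```

In the route layer, this generator would be consumed by `stream()` and each chunk forwarded with `broadcast()`, per the component descriptions above.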
Advanced Streaming Patterns
Chunked Processing with Validation
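As a sketch of chunk validation (the names and thresholds here are illustrative, not framework APIs), a wrapper generator can drop empty chunks and split oversized ones before they reach the output:

```python
from typing import AsyncIterator

async def validated_stream(
    chunks: AsyncIterator[str],
    max_chunk_len: int = 500,
) -> AsyncIterator[str]:
    """Drop empty chunks and split oversized ones before forwarding."""
    async for chunk in chunks:
        if not chunk or not chunk.strip():
            continue  # skip empty or whitespace-only chunks
        # split oversized chunks so downstream consumers stay responsive
        for start in range(0, len(chunk), max_chunk_len):
            yield chunk[start:start + max_chunk_len]
```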
Multi-Stage Streaming Pipeline
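A multi-stage pipeline composes async generators end to end, so each chunk flows through every stage without buffering the whole response; the stage names below are placeholders for whatever transforms an application needs:

```python
from typing import AsyncIterator

async def normalize_stage(chunks: AsyncIterator[str]) -> AsyncIterator[str]:
    """Collapse runs of whitespace inside each chunk."""
    async for chunk in chunks:
        yield " ".join(chunk.split())

async def punctuate_stage(chunks: AsyncIterator[str]) -> AsyncIterator[str]:
    """Ensure a trailing space so chunks join cleanly when spoken."""
    async for chunk in chunks:
        yield chunk if chunk.endswith((" ", "\n")) else chunk + " "

async def pipeline(raw: AsyncIterator[str]) -> AsyncIterator[str]:
    # Stages compose like Unix pipes: raw -> normalize -> punctuate.
    async for chunk in punctuate_stage(normalize_stage(raw)):
        yield chunk
```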
Buffered Streaming
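Buffered streaming holds tokens until a sentence boundary, which often sounds better when the chunks feed a TTS engine; this sketch uses a simple punctuation check rather than a full sentence splitter:

```python
from typing import AsyncIterator

SENTENCE_ENDINGS = (".", "!", "?")

async def sentence_buffered(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Group raw tokens into sentence-sized chunks before yielding."""
    buffer = ""
    async for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_ENDINGS):
            yield buffer
            buffer = ""
    if buffer:  # flush whatever is left when the stream ends
        yield buffer
```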
Streaming with Tool Integration
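Tool calls can be interleaved with streaming by pausing text output while the tool runs, then streaming its result. The marker-based detection below is deliberately simplified, a stand-in for a real function-calling protocol rather than any specific API:

```python
from typing import AsyncIterator, Awaitable, Callable

async def stream_with_tools(
    chunks: AsyncIterator[str],
    run_tool: Callable[[str], Awaitable[str]],
    tool_marker: str = "[TOOL:",
) -> AsyncIterator[str]:
    """Forward text chunks; when a tool marker appears, run the tool and
    stream its result before continuing."""
    async for chunk in chunks:
        if chunk.startswith(tool_marker):
            tool_name = chunk[len(tool_marker):].rstrip("]")
            yield f"Let me check {tool_name} for you. "  # filler while the tool runs
            result = await run_tool(tool_name)
            yield result + " "
        else:
            yield chunk
```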
Interruption-Aware Streaming
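Because the consumer may cancel the generator between chunks, interruption handling belongs inside the generator itself; a minimal sketch:

```python
import asyncio
from typing import AsyncIterator

async def interruptible_stream(chunks: AsyncIterator[str]) -> AsyncIterator[str]:
    """Yield chunks until the consumer cancels, then clean up and re-raise."""
    sent = 0
    try:
        async for chunk in chunks:
            yield chunk
            sent += 1
    except asyncio.CancelledError:
        # The user interrupted mid-response. Do only cheap, non-blocking
        # cleanup here, then re-raise so cancellation reaches the route.
        print(f"stream interrupted after {sent} chunks")
        raise
```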
Performance Optimizations
Parallel Streaming
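One form of parallel streaming is to start generating every section of a long response at once while still delivering sections in order, so later sections are usually ready before they are needed; `section_jobs` here is a hypothetical list of coroutine factories:

```python
import asyncio
from typing import Any, AsyncIterator, Callable, Coroutine, Sequence

async def parallel_sections(
    section_jobs: Sequence[Callable[[], Coroutine[Any, Any, str]]],
) -> AsyncIterator[str]:
    """Start every section generating at once but deliver them in order,
    so later sections are usually ready by the time earlier ones finish."""
    tasks = [asyncio.create_task(job()) for job in section_jobs]
    try:
        for task in tasks:
            yield await task  # waits only if this section is not done yet
    finally:
        for task in tasks:
            task.cancel()  # drop pending work if the consumer stops early
```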
Cached Streaming
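Cached streaming replays previously generated chunks instead of calling the LLM again, while preserving a stream-like pacing; the in-memory dict below is a stand-in for whatever cache the application actually uses:

```python
import asyncio
from typing import AsyncIterator, Callable, Dict, List

_cache: Dict[str, List[str]] = {}

async def cached_stream(
    prompt: str,
    generate: Callable[[str], AsyncIterator[str]],
    replay_delay: float = 0.02,
) -> AsyncIterator[str]:
    """Stream from cache when possible; otherwise stream live and record chunks."""
    if prompt in _cache:
        for chunk in _cache[prompt]:
            yield chunk
            await asyncio.sleep(replay_delay)  # keep pacing similar to a live stream
        return
    recorded: List[str] = []
    async for chunk in generate(prompt):
        recorded.append(chunk)
        yield chunk
    _cache[prompt] = recorded  # cache only fully delivered responses
```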
Best Practices
- Small Chunks: Send manageable chunk sizes for smooth streaming
- Buffer Management: Use appropriate buffering for sentence boundaries
- Error Handling: Handle stream cancellation gracefully
- Resource Cleanup: Always clean up streaming resources on interruption
- Progress Tracking: Monitor streaming progress for debugging
- Rate Limiting: Consider rate limits for high-frequency streaming
- Memory Management: Clear large responses after streaming completes
Common Use Cases
- Conversational Agents: Real-time chat responses
- Content Generation: Long-form content with immediate feedback
- Live Translation: Streaming translation of ongoing speech
- Code Generation: Progressive code output with syntax validation
- Data Analysis: Streaming analysis results as they’re computed
- Multi-modal Responses: Streaming text while preparing audio/images
Troubleshooting
Choppy Streaming
- Increase buffer size for smoother delivery
- Check network latency between LLM API and application
- Monitor CPU usage during chunk processing
Memory Issues
- Implement response chunk limits
- Clear large context periodically
- Monitor memory usage during long streams
Interruption Problems
- Ensure proper `CancelledError` handling in async generators
- Test interruption at different streaming stages
- Verify resource cleanup in interrupt handlers
This pattern is essential for creating responsive, natural-feeling voice agents that provide immediate feedback to users while generating comprehensive responses.