Building a Real-Time Voice Pipeline
Latency kills conversations. When we took on a healthcare client's voice interface project, the requirement was clear: under 200ms round-trip from speech input to the first audio of the AI's response.
The Challenge
Most voice AI solutions add 500ms+ of latency by chaining STT → LLM → TTS as a sequence of separate hops: each stage waits for the previous one to finish and adds its own buffering, model loading, and network overhead. For a clinical triage assistant, this felt like talking through a wall.
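To make that concrete, here is a minimal sketch of the conventional sequential flow. The transcribe, generate, and synthesize helpers and their latencies are purely illustrative stand-ins, not any real vendor's API; the point is simply that every stage blocks on the one before it, so the per-hop costs add up.

```python
import asyncio
import time

# Illustrative stand-ins for hosted STT, LLM, and TTS services; the sleep
# durations model per-hop processing plus network overhead, not real numbers.
async def transcribe(audio: bytes) -> str:
    await asyncio.sleep(0.20)   # STT waits for the full utterance
    return "transcribed text"

async def generate(transcript: str) -> str:
    await asyncio.sleep(0.25)   # LLM waits for the full transcript
    return "generated reply"

async def synthesize(text: str) -> bytes:
    await asyncio.sleep(0.15)   # TTS waits for the full response text
    return b"\x00" * 16000

async def respond(audio: bytes) -> bytes:
    # Each stage blocks on the previous one, so latencies stack (~600ms here).
    transcript = await transcribe(audio)
    reply = await generate(transcript)
    return await synthesize(reply)

if __name__ == "__main__":
    start = time.perf_counter()
    asyncio.run(respond(b"\x00" * 32000))
    print(f"round trip: {time.perf_counter() - start:.2f}s")   # ~0.60s
```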
Our Approach
We collapsed the pipeline. Instead of three separate services run back to back, we built a unified streaming architecture (a code sketch follows this list):
- Streaming STT — partial transcripts sent every 50ms as audio arrives
- Early LLM engagement — model begins generating on partial input
- Streaming TTS — audio chunks generated and sent before full response is ready
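A rough sketch of what that looks like in code, assuming stt, llm, and tts are async-generator functions (hypothetical interfaces, not our production API): each stage is chained onto the partial output of the previous one, so the first audio chunk can leave before the transcript or the LLM response is complete.

```python
from typing import AsyncIterator, Callable

StreamFn = Callable[[AsyncIterator], AsyncIterator]

async def voice_pipeline(
    mic_frames: AsyncIterator[bytes],   # raw audio frames as they arrive off WebRTC
    stt: StreamFn,                      # audio frames -> partial transcripts (~every 50ms)
    llm: StreamFn,                      # partial transcripts -> response tokens
    tts: StreamFn,                      # response tokens -> audio chunks
) -> AsyncIterator[bytes]:
    # Chain the three streams: nothing waits for a "final" transcript or a
    # complete response, so the stages overlap instead of adding up.
    partial_transcripts = stt(mic_frames)
    response_tokens = llm(partial_transcripts)
    async for chunk in tts(response_tokens):
        yield chunk                     # ship each audio chunk out immediately
```

The per-stage work (partial decoding, generating on incomplete input, chunked synthesis) lives inside the stt, llm, and tts generators; the orchestration layer only wires the streams together.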
Results
The system achieves 140ms average round-trip latency in production. Clinicians report the experience feels like talking to a person, not a machine.
Tech Stack
- WebRTC for low-latency audio transport
- Custom WebSocket protocol for streaming (illustrative frame formats below)
- Fine-tuned Whisper model for medical terminology
- Edge-deployed TTS with sub-100ms first-byte times
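To give a feel for the custom WebSocket layer, here is what the framing might look like; the message types and field names are hypothetical stand-ins, not the production schema. Partial transcripts and TTS audio chunks share one connection, each tagged so the client can interleave them.

```python
import base64
import json
import time

# Hypothetical frame builders for the streaming protocol (illustrative field
# names only). Audio is base64-encoded here for readability; in practice it
# would travel as binary WebSocket frames to avoid the encoding overhead.

def transcript_frame(text: str, final: bool) -> str:
    """Server -> client: a partial (or final) STT transcript, emitted ~every 50ms."""
    return json.dumps({
        "type": "transcript",
        "text": text,
        "final": final,
        "ts_ms": int(time.time() * 1000),
    })

def audio_frame(pcm: bytes, seq: int) -> str:
    """Server -> client: one TTS audio chunk, sent before the full response exists."""
    return json.dumps({
        "type": "audio",
        "seq": seq,                                   # lets the client queue chunks in order
        "pcm_b64": base64.b64encode(pcm).decode("ascii"),
    })
```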