Building a Real-Time Voice Pipeline

Latency kills conversations. When we took on a healthcare client's voice interface project, the requirement was clear: under 200ms round-trip, from speech input to the first audio of the AI's response.

The Challenge

Most voice AI solutions add 500ms or more of latency through a sequential STT → LLM → TTS pipeline: each stage waits for the previous one to finish, and each hop introduces its own buffering, model startup, and network overhead. For a clinical triage assistant, that felt like talking through a wall.
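
To see where that half second goes, here is a back-of-the-envelope budget for a serial pipeline. The stage timings are illustrative assumptions for intuition, not measurements from this project:

    # Rough latency budget for a serial STT -> LLM -> TTS pipeline.
    # All numbers are assumptions for illustration, not measurements.
    stages_ms = {
        "audio buffering": 100,   # STT often waits on a VAD/utterance window
        "STT inference": 150,
        "LLM first token": 200,
        "TTS first audio": 100,
        "network hops": 60,       # ~20ms per service-to-service hop, x3
    }
    total = sum(stages_ms.values())
    print(f"serial pipeline: ~{total}ms before the caller hears anything")
    # -> serial pipeline: ~610ms before the caller hears anything

The key observation is that these numbers add up serially because nothing downstream starts until everything upstream finishes.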

Our Approach

We collapsed the pipeline. Instead of three separate services, we built a unified streaming architecture:

  1. Streaming STT — partial transcripts sent every 50ms as audio arrives
  2. Early LLM engagement — model begins generating on partial input
  3. Streaming TTS — audio chunks generated and sent before full response is ready
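
Here is a minimal asyncio sketch of how the three stages overlap. The stt/llm/tts functions below are toy stand-ins, not our production services; the point is the shape of the pipeline, where each stage consumes its upstream's stream and starts emitting before that stream is complete.

    import asyncio

    async def stream_stt(audio_chunks):
        # Yield a growing partial transcript as audio arrives
        # (stand-in for a real streaming STT engine).
        words = []
        async for chunk in audio_chunks:
            words.append(chunk)
            yield " ".join(words)

    async def stream_llm(partials):
        # Begin generating on partial input instead of waiting
        # for the full utterance (toy heuristic: act after 3 words).
        async for text in partials:
            if len(text.split()) >= 3:
                yield f"Response to: {text!r}"
                return

    async def stream_tts(tokens):
        # Emit audio chunks as soon as text arrives
        # (stand-in for a real streaming TTS engine).
        async for token in tokens:
            for word in token.split():
                yield f"<audio:{word}>"

    async def mic():
        # Simulated microphone: one chunk per ~50ms of audio.
        for word in ["patient", "reports", "chest", "pain"]:
            await asyncio.sleep(0.05)
            yield word

    async def main():
        # Response audio starts before the utterance has even ended.
        async for audio in stream_tts(stream_llm(stream_stt(mic()))):
            print("play", audio)

    asyncio.run(main())

Because each stage is an async generator pulling from the one above it, backpressure falls out naturally: a slow consumer simply pulls less often, and no stage buffers more than it has to.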

Results

The system achieves 140ms average round-trip latency in production, comfortably under the 200ms target. Clinicians report that the experience feels like talking to a person, not a machine.

Tech Stack

  • WebRTC for low-latency audio transport
  • Custom WebSocket protocol for streaming
  • Fine-tuned Whisper model for medical terminology
  • Edge-deployed TTS with sub-100ms first-byte times
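
For a sense of what the streaming protocol carries, here is an illustrative frame shape. The field names and JSON encoding are assumptions for readability, not the actual wire format; a production protocol like this would typically use a compact binary encoding:

    import json
    import time

    # Illustrative frame for multiplexing pipeline streams over one
    # WebSocket connection. Field names are hypothetical.
    def frame(kind, seq, payload):
        return json.dumps({
            "kind": kind,      # e.g. "audio", "partial_transcript", "tts_chunk"
            "seq": seq,        # per-stream sequence number for ordering/dedup
            "ts": time.time_ns() // 1_000_000,  # sender timestamp (ms)
            "payload": payload,
        })

    print(frame("partial_transcript", 12, "patient reports chest"))

Tagging every frame with a kind and sequence number lets partial transcripts, LLM tokens, and TTS audio share a single connection without extra round-trips to set up per-stream channels.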