Building a Real-Time Voice Pipeline

Latency kills conversations. When we took on a healthcare client's voice interface project, the requirement was clear: under 200ms round-trip, from speech input to the first audio of the AI's response.

The Challenge

Most voice AI solutions add 500ms or more of latency through a sequential STT → LLM → TTS pipeline: each stage waits for the previous one to finish, and each hop introduces its own buffering, model startup, and network overhead. For a clinical triage assistant, that felt like talking through a wall.
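
To see where that half second goes, here is a back-of-the-envelope budget for a serial pipeline. The stage timings are illustrative assumptions for intuition, not measurements from this project:

    # Rough latency budget for a serial STT -> LLM -> TTS pipeline.
    # All numbers are assumptions for illustration, not measurements.
    stages_ms = {
        "audio buffering": 100,   # STT often waits on a VAD/utterance window
        "STT inference": 150,
        "LLM first token": 200,
        "TTS first audio": 100,
        "network hops": 60,       # ~20ms per service-to-service hop, x3
    }
    total = sum(stages_ms.values())
    print(f"serial pipeline: ~{total}ms before the caller hears anything")
    # -> serial pipeline: ~610ms before the caller hears anything

The key observation is that these numbers add up serially because nothing downstream starts until everything upstream finishes.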

Our Approach

We collapsed the pipeline. Instead of three separate services, we built a unified streaming architecture:

  1. Streaming STT — partial transcripts sent every 50ms as audio arrives
  2. Early LLM engagement — model begins generating on partial input
  3. Streaming TTS — audio chunks generated and sent before full response is ready
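
Here is a minimal asyncio sketch of how the three stages overlap. The stt/llm/tts functions below are toy stand-ins, not our production services; the point is the shape of the pipeline, where each stage consumes its upstream's stream and starts emitting before that stream is complete.

    import asyncio

    async def stream_stt(audio_chunks):
        # Yield a growing partial transcript as audio arrives
        # (stand-in for a real streaming STT engine).
        words = []
        async for chunk in audio_chunks:
            words.append(chunk)
            yield " ".join(words)

    async def stream_llm(partials):
        # Begin generating on partial input instead of waiting
        # for the full utterance (toy heuristic: act after 3 words).
        async for text in partials:
            if len(text.split()) >= 3:
                yield f"Response to: {text!r}"
                return

    async def stream_tts(tokens):
        # Emit audio chunks as soon as text arrives
        # (stand-in for a real streaming TTS engine).
        async for token in tokens:
            for word in token.split():
                yield f"<audio:{word}>"

    async def mic():
        # Simulated microphone: one chunk per ~50ms of audio.
        for word in ["patient", "reports", "chest", "pain"]:
            await asyncio.sleep(0.05)
            yield word

    async def main():
        # Response audio starts before the utterance has even ended.
        async for audio in stream_tts(stream_llm(stream_stt(mic()))):
            print("play", audio)

    asyncio.run(main())

Because each stage is an async generator pulling from the one above it, backpressure falls out naturally: a slow consumer simply pulls less often, and no stage buffers more than it has to.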

Results

The system achieves 140ms average round-trip latency in production, comfortably under the 200ms target. Clinicians report that the experience feels like talking to a person, not a machine.

Tech Stack

  • WebRTC for low-latency audio transport
  • Custom WebSocket protocol for streaming
  • Fine-tuned Whisper model for medical terminology
  • Edge-deployed TTS with sub-100ms first-byte times
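
For a sense of what the streaming protocol carries, here is an illustrative frame shape. The field names and JSON encoding are assumptions for readability, not the actual wire format; a production protocol like this would typically use a compact binary encoding:

    import json
    import time

    # Illustrative frame for multiplexing pipeline streams over one
    # WebSocket connection. Field names are hypothetical.
    def frame(kind, seq, payload):
        return json.dumps({
            "kind": kind,      # e.g. "audio", "partial_transcript", "tts_chunk"
            "seq": seq,        # per-stream sequence number for ordering/dedup
            "ts": time.time_ns() // 1_000_000,  # sender timestamp (ms)
            "payload": payload,
        })

    print(frame("partial_transcript", 12, "patient reports chest"))

Tagging every frame with a kind and sequence number lets partial transcripts, LLM tokens, and TTS audio share a single connection without extra round-trips to set up per-stream channels.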