Building an AI-Powered Log Analysis Pipeline for Operations Teams
Every operations team faces the same problem: too many logs, too little signal. Sifting through gigabytes of log output to find the root cause of an outage is the kind of work that burns out good engineers.
We built a tool that lets the logs speak for themselves, with sub-agents doing the investigative work.
The Challenge
Log analysis tools fall into two camps: ELK-style aggregators that give you dashboards but no answers, and manual grep-based workflows that require you to already know what you're looking for. Neither helps when you have an unknown unknown — a pattern you didn't think to search for.
We needed something that could:
- Ingest logs at scale and store them in a way that captures meaning, not just keywords
- Search semantically — find logs that are conceptually related to a question, not just text-matched
- Summarize intelligently — reduce thousands of events into a readable narrative
- Orchestrate investigation — root cause analysis requires multi-step reasoning: retrieval, pattern detection, causal linking, and reporting. A single LLM prompt can't do this reliably.
- Run locally — operations teams often work in air-gapped or sensitive environments where cloud APIs aren't an option
Our Approach
Semantic ingestion pipeline. Logs flow into ChromaDB as vector embeddings, not raw text. Each log entry is chunked and embedded so that queries find conceptually related events — a search for "database timeout spikes" surfaces relevant slow queries, connection pool exhaustion logs, and replication lag entries, even if none of those exact phrases appear.
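The chunk-embed-query flow can be sketched in plain Python. This is a toy stand-in, not the tool's actual code: the bag-of-words "embedding" replaces a real embedding model (which is what actually captures synonymy, e.g. matching "database" to "db"), and the in-memory store replaces a ChromaDB collection. Names like chunk_log and SemanticStore are invented for the sketch.

```python
import math
from collections import Counter

def chunk_log(text: str, size: int = 4, overlap: int = 1) -> list[str]:
    """Split a log stream into overlapping windows of `size` lines."""
    lines = text.splitlines()
    step = size - overlap
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), step)]

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'. A real pipeline calls an embedding
    model here, which is what makes retrieval semantic rather than lexical."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class SemanticStore:
    """In-memory stand-in for a ChromaDB collection."""
    def __init__(self) -> None:
        self.docs: list[tuple[str, Counter]] = []

    def add(self, chunks: list[str]) -> None:
        self.docs.extend((c, embed(c)) for c in chunks)

    def query(self, question: str, n_results: int = 3) -> list[str]:
        q = embed(question)
        ranked = sorted(self.docs, key=lambda d: cosine(q, d[1]), reverse=True)
        return [doc for doc, _ in ranked[:n_results]]

store = SemanticStore()
store.add(chunk_log(
    "ERROR db: connection pool exhausted\n"
    "WARN replica-2: replication lag 45s\n"
    "INFO api: health check ok\n"
    "ERROR db: query timeout after 30s",
    size=2, overlap=0,
))
print(store.query("database timeout spikes", n_results=1))
```

The overlap between chunks keeps events that straddle a window boundary retrievable from either side.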
Multi-LLM backend. The tool supports three LLM backends interchangeably: OpenAI, Anthropic, and local open-weight models via Ollama. In connected environments, GPT-4 or Claude handles reasoning. In air-gapped or cost-sensitive setups, local models via Ollama provide the same pipeline with no external API calls. The abstraction is a single config flag — swap the backend without touching the logic.
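The single-flag abstraction looks roughly like the sketch below. The interface, class names, and config key are assumptions for illustration; the complete bodies are stubs where the real vendor SDKs (openai, anthropic) or a local Ollama HTTP call would go.

```python
from abc import ABC, abstractmethod

class LLMBackend(ABC):
    """Common interface so pipeline logic never touches a vendor SDK directly."""
    @abstractmethod
    def complete(self, prompt: str) -> str: ...

class OpenAIBackend(LLMBackend):
    def complete(self, prompt: str) -> str:
        # Real code would call the OpenAI chat completions API here.
        return f"[openai] {prompt}"

class AnthropicBackend(LLMBackend):
    def complete(self, prompt: str) -> str:
        # Real code would call the Anthropic messages API here.
        return f"[anthropic] {prompt}"

class OllamaBackend(LLMBackend):
    def complete(self, prompt: str) -> str:
        # Real code would POST to a local Ollama server -- no external calls.
        return f"[ollama] {prompt}"

BACKENDS = {"openai": OpenAIBackend, "anthropic": AnthropicBackend, "ollama": OllamaBackend}

def make_backend(config: dict) -> LLMBackend:
    """The single config flag: swap the backend without touching the pipeline."""
    return BACKENDS[config["llm_backend"]]()

llm = make_backend({"llm_backend": "ollama"})
print(llm.complete("Why did the connection pool exhaust?"))
```

Because every backend satisfies the same interface, the sub-agents downstream are indifferent to which one the flag selects.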
CrewAI agent orchestration. This is where the tool does more than answer "what happened?" — CrewAI coordinates a team of specialized sub-agents, each with a defined role:
- Log retriever — queries ChromaDB for semantically related events based on the investigation scope
- Pattern detector — analyzes the evidence set for anomalies, spikes, error cascades, and temporal correlations
- Root cause analyzer — constructs a causal chain from the correlated events and formulates the most likely explanation
- Summarizer — produces a human-readable incident report with citations linking each conclusion back to specific log entries
Each sub-agent operates with its own context window and assigned responsibility. CrewAI manages the handoffs: the retriever's output becomes the pattern detector's input, whose findings feed the root cause analyzer, whose chain gets formatted by the summarizer. The engineer sees a complete investigative trail, not just a one-shot answer.
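The handoff chain described above can be illustrated in plain Python. This mirrors the pattern, not CrewAI's actual API: each stage is a stub function standing in for an LLM-backed agent, and the function names and report format are invented for the sketch.

```python
def retrieve(scope: str) -> list[str]:
    # Stand-in for the log retriever querying ChromaDB for the scope.
    return ["12:01 ERROR db: pool exhausted", "12:01 WARN replica: lag 45s"]

def detect_patterns(events: list[str]) -> list[str]:
    # Stand-in for the pattern detector (anomalies, spikes, correlations).
    return [f"correlated burst: {len(events)} events within the same minute"]

def analyze_root_cause(findings: list[str]) -> str:
    # Stand-in for the root cause analyzer building a causal chain.
    return "Likely cause: connection pool exhaustion triggered by replication lag"

def summarize(cause: str, events: list[str]) -> str:
    # Stand-in for the summarizer; citations link conclusions to log entries.
    citations = "\n".join(f"  [{i}] {e}" for i, e in enumerate(events))
    return f"{cause}\nEvidence:\n{citations}"

def investigate(scope: str) -> str:
    """Each stage's output is the next stage's input -- the managed handoff."""
    events = retrieve(scope)
    findings = detect_patterns(events)
    cause = analyze_root_cause(findings)
    return summarize(cause, events)

print(investigate("database timeouts around 12:00"))
```

The value of the structure is in the trail: every conclusion in the final report traces back through the intermediate stages to specific retrieved entries.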
Docker Compose local setup. The entire stack — ChromaDB, CrewAI engine, LLM backends, API layer, and frontend — runs from a single docker compose up. No cloud dependencies required for the Ollama variant.
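A compose file for such a stack might look like the fragment below. This is an illustrative sketch, not the project's actual file: service names, build paths, ports, and the environment variable are assumptions (chromadb/chroma and ollama/ollama are the real published images).

```yaml
# Illustrative compose sketch -- service names, paths, and ports are assumptions.
services:
  chromadb:
    image: chromadb/chroma
    ports: ["8000:8000"]
  ollama:
    image: ollama/ollama
    ports: ["11434:11434"]
  api:
    build: ./api            # CrewAI engine + API layer
    environment:
      LLM_BACKEND: ollama   # the single backend config flag
    depends_on: [chromadb, ollama]
  frontend:
    build: ./frontend
    ports: ["3000:3000"]
```

With the backend flag set to ollama, nothing in this stack reaches outside the host, which is what makes the air-gapped variant possible.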
Results
The prototype demonstrates a complete AI-assisted log investigation workflow using a fraction of the infrastructure of traditional observability stacks. Semantic retrieval surfaces conceptually related events that keyword search would miss. The multi-LLM backend provides flexibility across cloud and local environments. The CrewAI sub-agent architecture turns root cause analysis from a single-prompt guess into a structured, multi-step investigation with traceable conclusions.
Tech Stack
- ChromaDB — semantic vector storage for log embeddings
- CrewAI — agent orchestration framework for multi-step investigation
- OpenAI + Anthropic + Ollama — multi-LLM backend (cloud and local)
- Docker Compose — single-command local deployment
- RAG pipeline — log ingestion, embedding, semantic retrieval, sub-agent reasoning