The Problem
Network Operations Centers in telecom are drowning in documentation. Every node, every protocol, every alarm code has a manual. When an alert fires at 2 AM, the engineer on duty needs answers in seconds — not a search through 400-page PDFs.
That’s the problem I set out to solve with TelecomOps RAG.
Why RAG Over Fine-tuning
The first architectural decision: RAG or fine-tuning?
Fine-tuning would have encoded knowledge into the model weights. That sounds appealing until you realize telecom documentation changes constantly — new firmware, updated RAN specs, revised NOC runbooks. A fine-tuned model becomes stale the moment the documentation changes.
RAG, by contrast, separates the knowledge from the model. You update the vector store, not the weights. For a domain with living documentation, this is the only sane choice.
The Architecture
User Query
↓
Query Embedding (sentence-transformers)
↓
ChromaDB — Cosine Similarity Search
↓
Top-k Chunks (k=5 by default)
↓
Prompt Construction (system + context + query)
↓
Mistral 7B (via HuggingFace Spaces)
↓
Streamed Response
The entire pipeline runs on FastAPI, with the frontend consuming a streaming SSE endpoint.
The Chunking Strategy
This is where most RAG implementations fail. I spent three weeks on chunking alone.
The naive approach — split on 512 tokens with 50-token overlap — destroyed the semantic structure of telecom documents. An alarm code would be split mid-description. A command sequence would lose its header.
What worked:
Semantic section detection. Telecom manuals use predictable heading patterns. I wrote a parser that respects document structure — each chunk maps to one complete conceptual unit (a procedure, an alarm, a configuration parameter).
Metadata tagging. Every chunk carries: document name, section path, equipment type, and protocol family. This metadata becomes part of the retrieval context.
Hybrid size limits. Natural sections can be 200 or 800 tokens. I allowed this variance rather than enforcing artificial uniformity. The 31K chunks in the production index range from 120 to 650 tokens.
Choosing Mistral 7B
I evaluated three models: Llama 2 7B, Mistral 7B, and Phi-2.
Mistral won on two dimensions:
- Instruction following. NOC queries are precise (“what is the RACH failure threshold for Ericsson ENB firmware 21.Q3?”). Mistral’s instruction tuning handles these better than Llama 2 at the same scale.
- Inference speed. On the HuggingFace Spaces T4 GPU, Mistral 7B streams tokens fast enough for interactive use. Latency from query to first token: ~1.8 seconds.
The Streaming Implementation
Users expect real-time responses. A 10-second wait before seeing any output felt unacceptable.
FastAPI’s StreamingResponse with server-sent events solved this:
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
async def stream_response(query: str, context: str):
async for token in llm.astream(build_prompt(query, context)):
yield f"data: {token}\n\n"
@app.post("/query")
async def query_endpoint(req: QueryRequest):
chunks = retriever.retrieve(req.query)
context = format_context(chunks)
return StreamingResponse(
stream_response(req.query, context),
media_type="text/event-stream"
)
The frontend consumes the SSE stream and renders tokens progressively. The perceived latency drops dramatically even though total generation time is the same.
What I’d Do Differently
Re-ranking. The top-5 retrieval by cosine similarity is often good but not always right. Adding a cross-encoder re-ranker (a small BERT-based model that re-scores the shortlisted chunks) would improve answer quality on ambiguous queries.
Query expansion. When an engineer types “RACH failure,” they might mean several things. A query rewriter that generates 3-5 paraphrased versions before retrieval would widen the net without hurting precision.
Feedback loop. There’s no signal for when the system gives a wrong answer. Thumbs up/down on responses, logged and analyzed weekly, would be the fastest path to improvement.
Results
The system handles ~200 queries per day from a NOC team of 12 engineers. Based on operator feedback:
- 78% of queries get a satisfactory answer without follow-up
- Average time from query to actionable answer: 4.2 seconds
- Documentation lookup time (pre-RAG baseline): estimated 8-12 minutes per query
The 7B parameter model running on consumer-grade GPU hardware — not a proprietary API — handles this load at near-zero marginal cost per query.
The full system is live at jocular-torte-262c74.netlify.app. The source is on GitHub.