Building Production RAG Systems for Telecom NOC

The Problem

Network Operations Centers in telecom are drowning in documentation. Every node, every protocol, every alarm code has a manual. When an alert fires at 2 AM, the engineer on duty needs answers in seconds — not a search through 400-page PDFs.

That’s the problem I set out to solve with TelecomOps RAG.

Why RAG Over Fine-tuning

The first architectural decision: RAG or fine-tuning?

Fine-tuning would have encoded knowledge into the model weights. That sounds appealing until you realize telecom documentation changes constantly — new firmware, updated RAN specs, revised NOC runbooks. A fine-tuned model becomes stale the moment the documentation changes.

RAG, by contrast, separates the knowledge from the model. You update the vector store, not the weights. For a domain with living documentation, this is the only sane choice.

The Architecture

User Query
    ↓
Query Embedding (sentence-transformers)
    ↓
ChromaDB — Cosine Similarity Search
    ↓
Top-k Chunks (k=5 by default)
    ↓
Prompt Construction (system + context + query)
    ↓
Mistral 7B (via HuggingFace Spaces)
    ↓
Streamed Response

The entire pipeline runs on FastAPI, with the frontend consuming a streaming SSE endpoint.

The Chunking Strategy

This is where most RAG implementations fail. I spent three weeks on chunking alone.

The naive approach — split on 512 tokens with 50-token overlap — destroyed the semantic structure of telecom documents. An alarm code would be split mid-description. A command sequence would lose its header.

What worked:

Semantic section detection. Telecom manuals use predictable heading patterns. I wrote a parser that respects document structure — each chunk maps to one complete conceptual unit (a procedure, an alarm, a configuration parameter).

Metadata tagging. Every chunk carries: document name, section path, equipment type, and protocol family. This metadata becomes part of the retrieval context.

Hybrid size limits. Natural sections can be 200 or 800 tokens. I allowed this variance rather than enforcing artificial uniformity. The 31K chunks in the production index range from 120 to 650 tokens.

Choosing Mistral 7B

I evaluated three models: Llama 2 7B, Mistral 7B, and Phi-2.

Mistral won on two dimensions:

Instruction following. NOC queries are precise (“what is the RACH failure threshold for Ericsson ENB firmware 21.Q3?”). Mistral’s instruction tuning handles these better than Llama 2 at the same scale.
Inference speed. On the HuggingFace Spaces T4 GPU, Mistral 7B streams tokens fast enough for interactive use. Latency from query to first token: ~1.8 seconds.

The Streaming Implementation

Users expect real-time responses. A 10-second wait before seeing any output felt unacceptable.

FastAPI’s StreamingResponse with server-sent events solved this:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

async def stream_response(query: str, context: str):
    async for token in llm.astream(build_prompt(query, context)):
        yield f"data: {token}\n\n"

@app.post("/query")
async def query_endpoint(req: QueryRequest):
    chunks = retriever.retrieve(req.query)
    context = format_context(chunks)
    return StreamingResponse(
        stream_response(req.query, context),
        media_type="text/event-stream"
    )

The frontend consumes the SSE stream and renders tokens progressively. The perceived latency drops dramatically even though total generation time is the same.

What I’d Do Differently

Re-ranking. The top-5 retrieval by cosine similarity is often good but not always right. Adding a cross-encoder re-ranker (a small BERT-based model that re-scores the shortlisted chunks) would improve answer quality on ambiguous queries.

Query expansion. When an engineer types “RACH failure,” they might mean several things. A query rewriter that generates 3-5 paraphrased versions before retrieval would widen the net without hurting precision.

Feedback loop. There’s no signal for when the system gives a wrong answer. Thumbs up/down on responses, logged and analyzed weekly, would be the fastest path to improvement.

Results

The system handles ~200 queries per day from a NOC team of 12 engineers. Based on operator feedback:

78% of queries get a satisfactory answer without follow-up
Average time from query to actionable answer: 4.2 seconds
Documentation lookup time (pre-RAG baseline): estimated 8-12 minutes per query

The 7B parameter model running on consumer-grade GPU hardware — not a proprietary API — handles this load at near-zero marginal cost per query.

The full system is live at jocular-torte-262c74.netlify.app. The source is on GitHub.