I Built a Local RAG System for Telecom Operations — Here's Everything I Learned

The Problem Nobody Talks About

When a P1 incident fires in a Telecom NOC at 2 AM, the engineer on duty has one job: restore service. But here’s what actually happens first — they open three browser tabs, search through a shared drive full of PDF runbooks, ping a Slack channel hoping a senior engineer is awake in another timezone, and start reading 400-page vendor documentation to find the one procedure that applies to this specific alarm code on this specific node.

The average time wasted before the actual remediation starts: 30 to 45 minutes.

Two other problems sit underneath this:

Institutional knowledge walks out the door. When a senior NOC engineer retires after 15 years, their mental map of which runbooks apply to which vendors in which regions leaves with them. It doesn’t live in any system.

Sensitive configs can’t go to ChatGPT. Network topology, IMSI ranges, vendor-specific configurations, incident logs — all of it is NDA-protected or carrier-sensitive. You can’t paste it into a commercial API and ask GPT-4 to help.

I built TelecomOps RAG to solve all three problems. Here’s exactly how I did it, what broke, and what I’d do differently.

The Architecture

The system runs across two machines on a LAN:

[ Mac — Brain ]                    [ Windows — Muscle ]
  FastAPI server                     Ollama + Mistral 7B
  ChromaDB (31,364 chunks)           RTX 3050 (8GB VRAM)
  Embedding: nomic-embed-text        4-bit quantized (GGUF)
       |                                     |
       └─────── HTTP (LAN) ─────────────────┘

Query flow:

User Query
    ↓
nomic-embed-text (768-dim embedding on Mac)
    ↓
ChromaDB cosine similarity search
    ↓
Top-K chunks (K=5, distance < 0.7)
    ↓
Prompt: system + retrieved context + query
    ↓
Mistral 7B on RTX 3050 (via Ollama HTTP API)
    ↓
Streaming token-by-token response

The Mac handles all the retrieval logic. The Windows machine is purely a GPU inference server. This split let me iterate quickly on the retrieval pipeline without touching the model setup, and it mimics how you’d deploy this in production (retrieval service + inference service as separate concerns).

Building the Data Pipeline

This is where the real work was. A RAG system is only as good as what you put in it.

Synthetic Documents

I generated 100,000 synthetic telecom documents using Python’s Faker library augmented with a custom telecom vocabulary:

from faker import Faker
import random

fake = Faker()

VENDORS = ["Ericsson", "Nokia", "Huawei", "ZTE", "Samsung"]
REGIONS = ["APAC", "EMEA", "LATAM", "NA"]
ALARM_CODES = ["ALM-3201", "ALM-4401", "RRC-SETUP-FAIL", "S1-RESET", "X2-LINK-DOWN"]
SEVERITIES = ["Critical", "Major", "Minor", "Warning"]

def generate_incident_report():
    return {
        "doc_type": "incident_report",
        "vendor": random.choice(VENDORS),
        "region": random.choice(REGIONS),
        "alarm_code": random.choice(ALARM_CODES),
        "severity": random.choice(SEVERITIES),
        "timestamp": fake.date_time_this_year().isoformat(),
        "description": fake.paragraph(nb_sentences=8),
        "resolution": fake.paragraph(nb_sentences=5),
        "engineer": fake.name(),
        "node_id": f"eNB-{random.randint(1000, 9999)}"
    }

The key was making the synthetic documents structurally realistic — not just random text, but documents that look like actual incident reports with vendor names, alarm codes, resolution steps, and engineer sign-offs. Retrieval systems learn from structure, not just words.

Real HuggingFace Datasets

Synthetic data gets you coverage. Real data gets you accuracy. I pulled from two sources:

GSMA/ot-full — The official GSMA Open-Telco benchmark dataset. It contains TeleQnA (question-answer pairs from telecom standards), ORANBench (O-RAN architecture questions), and TeleLogs (actual log patterns). This is industry-validated data that your model will see in production questions.

ymoslem/TeleQnA-processed — A cleaned, processed version of the TeleQnA dataset with better formatting for RAG ingestion.

from datasets import load_dataset

gsma = load_dataset("GSMA/ot-full", split="train")
telqa = load_dataset("ymoslem/TeleQnA-processed", split="train")

def format_for_rag(row, source):
    # Normalize different schemas into a single document format
    if source == "gsma":
        return {
            "content": f"Q: {row['question']}\nA: {row['answer']}",
            "doc_type": "standards_qa",
            "source": "GSMA"
        }
    elif source == "telqa":
        return {
            "content": f"{row['question']} {row['answer']}",
            "doc_type": "teleqna",
            "source": "TeleQnA"
        }

Chunking Strategy

I settled on 512-word chunks with 64-word overlap after testing several configurations. Why words not tokens? Because telecom documents have wildly variable token densities — a chunk of alarm codes tokenizes very differently from a chunk of narrative runbook text. Word-based chunking gives more consistent semantic units.

The overlap is non-negotiable. A resolution procedure split at step 3 of 7 is useless without steps 1-2 as context. Overlap ensures that critical procedure sequences stay together in at least one chunk.

def chunk_document(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    words = text.split()
    chunks = []
    start = 0
    
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunk = " ".join(words[start:end])
        chunks.append(chunk)
        
        if end == len(words):
            break
        start += chunk_size - overlap
    
    return chunks

Final count: 31,364 chunks across all document types, stored as 768-dimensional vectors in ChromaDB.

Embedding Model: nomic-embed-text

I evaluated three embedding models before choosing nomic-embed-text:

Model	Dimensions	MTEB Score	Speed (chunks/sec)
nomic-embed-text	768	62.4	~180
all-MiniLM-L6-v2	384	56.3	~320
bge-large-en-v1.5	1024	63.6	~85

nomic-embed-text hit the sweet spot: good semantic quality for telecom domain text, reasonable speed, and 768 dimensions that ChromaDB handles efficiently with cosine similarity. bge-large was marginally better on benchmarks but 2x slower — not worth it for a retrieval system where embedding speed matters during ingestion.

The Retrieval Strategy

This is the part that most RAG tutorials skip over, and it’s where I spent the most time.

Two-Stage Retrieval

A naive search across all 31K chunks with the query “session timeout” returns garbage. The word “session” appears in BGP sessions, PDU sessions, HTTP sessions, management sessions — completely different contexts.

I implemented two-stage retrieval:

Stage 1 — Filter by document type. For operational queries, prioritize incident_report and runbook documents:

async def retrieve(query: str, collection: chromadb.Collection) -> list[dict]:
    query_embedding = await embed(query)
    
    # Stage 1: Operational docs first
    operational_results = collection.query(
        query_embeddings=[query_embedding],
        n_results=3,
        where={"doc_type": {"$in": ["incident_report", "runbook"]}},
        include=["documents", "metadatas", "distances"]
    )
    
    # Stage 2: Standards/benchmarks as fallback
    standards_results = collection.query(
        query_embeddings=[query_embedding],
        n_results=2,
        where={"doc_type": {"$in": ["standards_qa", "teleqna"]}},
        include=["documents", "metadatas", "distances"]
    )
    
    # Merge and filter by distance threshold
    all_results = merge_results(operational_results, standards_results)
    return [r for r in all_results if r["distance"] < 0.7]

Distance Threshold Filtering

The threshold of 0.7 (for cosine distance, lower = more similar) was calibrated empirically. Below 0.7 means the retrieved chunk is genuinely relevant. Above 0.7 means you’re retrieving noise — worse than no context at all, because the model will try to use it.

If no results pass the threshold, the system responds: “I couldn’t find relevant documentation for this query. Try rephrasing with specific alarm codes or vendor names.”

This is honest and useful. It doesn’t hallucinate.

Deployment: Local and Production

Local Setup (Mac + Windows LAN)

The critical environment variable that burned me:

# On Windows, set this BEFORE starting Ollama:
set OLLAMA_HOST=0.0.0.0:11434

# Then start Ollama:
ollama serve

The gotcha: System environment variables don’t propagate to already-running processes. If Ollama is already running and you set OLLAMA_HOST afterward, nothing changes. You have to set the variable in the same shell session before starting the process. This wasted two hours.

On the Mac side, the FastAPI server talks to Ollama over LAN:

OLLAMA_URL = os.getenv("OLLAMA_URL", "http://192.168.1.x:11434")

async def generate_stream(prompt: str):
    async with httpx.AsyncClient(timeout=120.0) as client:
        async with client.stream(
            "POST",
            f"{OLLAMA_URL}/api/generate",
            json={
                "model": "mistral",
                "prompt": prompt,
                "stream": True,
                "options": {"temperature": 0.1, "num_predict": 512}
            }
        ) as response:
            async for line in response.aiter_lines():
                if line:
                    chunk = json.loads(line)
                    if not chunk.get("done"):
                        yield chunk["response"]

Low temperature (0.1) is intentional for a technical assistant. You want deterministic, factual responses, not creative ones.

Production: HuggingFace Spaces

The production deployment bakes FastAPI + ChromaDB into a single Docker container on HF Spaces:

FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Pre-load the vectorstore (285MB, tracked with Git LFS)
COPY vectorstore/ /app/vectorstore/

COPY . .

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "7860"]

For embeddings and inference on HF Spaces (no GPU), I use:

Embeddings: HF Inference API with nomic-ai/nomic-embed-text-v1 (via huggingface_hub)
Generation: Cerebras provider via HF Inference API with Llama 3.1-8B (fast inference)

Why huggingface_hub instead of raw HTTP calls? Because Kaspersky (and some enterprise firewalls) block httpx and requests to api-inference.huggingface.co but allow huggingface_hub traffic. I discovered this when deploying in a corporate environment where the demo worked perfectly from my home network but failed completely in the client’s office. The SDK routes through different endpoints.

Vectorstore Size and Git LFS

The ChromaDB vectorstore is 285MB. Regular Git can’t handle this — it’ll corrupt or reject the push. Setup:

git lfs install
git lfs track "vectorstore/**"
echo "vectorstore/**" >> .gitattributes
git add .gitattributes vectorstore/
git commit -m "Add vectorstore via Git LFS"

The Production UI

The React frontend on Netlify has four components:

Chat interface — Streaming token-by-token display (no full-response wait)
Sources panel — Shows the top-K retrieved chunks with doc type, vendor, region, and distance score
Query history — Last 20 queries with timestamp, persisted to localStorage
GPU stats — Polls Ollama /api/ps to show model load status (local only)

The streaming display is the most important UX decision. A 4-second wait for a complete response feels broken. The same 4 seconds streaming token-by-token feels responsive. Human perception of latency is weird.

One practical note: add time.sleep(0.02) between token yields in demos. Production streaming on localhost is so fast that the streaming effect isn’t visible. The sleep makes it look like the model is “thinking” — which is exactly what you want in a demo context.

The Hard Lessons

1. Embedding Model Mismatch Kills Everything

I initially built the vectorstore with all-MiniLM-L6-v2 (384 dimensions), then switched to nomic-embed-text (768 dimensions) without rebuilding. ChromaDB stored the 768-dim vectors fine, but all similarity searches returned random noise because the collection now contained mixed-dimension vectors.

The fix: If you change the embedding model, delete and rebuild the entire collection. There’s no migration path.

# Always validate before ingesting:
test_embedding = embed_model.encode("test")
assert len(test_embedding) == EXPECTED_DIMS, \
    f"Dimension mismatch: got {len(test_embedding)}, expected {EXPECTED_DIMS}"

2. Retrieval Quality > Model Quality

I spent a week trying different LLMs (Mistral, Llama 3.1, Phi-3). The answer quality difference was marginal — maybe 15-20% on subjective evals.

Then I spent two days improving the retrieval pipeline (two-stage retrieval, distance filtering, metadata enrichment). Answer quality improved by 60-70%.

The LLM is the last mile. If the retrieved context is wrong, even GPT-4 will give you a bad answer. Fix retrieval first.

3. HF Inference API Provider Hell

The HuggingFace Inference API has multiple “providers” (Cerebras, Together, Fireworks, HF itself). Each provider supports a different subset of models. This is not clearly documented.

from huggingface_hub import InferenceClient

# This may or may not work depending on the provider:
client = InferenceClient(
    model="meta-llama/Llama-3.1-8B-Instruct",
    provider="cerebras"  # Works
    # provider="together"  # May not support this model
)

# Always test with a health check before building:
try:
    test = client.text_generation("Hello", max_new_tokens=5)
    print(f"Provider working: {test}")
except Exception as e:
    print(f"Provider failed: {e}")
    # Fall back to different provider

4. The OLLAMA_HOST Shell Session Issue

Already covered above, but it deserves its own bullet because it’s completely non-obvious. The env var must exist in the shell session that launches the process. Setting it in .bashrc or System Properties and then starting Ollama from a running session won’t work. Kill the process, set the var in the same shell, restart.

A Real Query

Let me show you what the system actually does.

Query: "How do I resolve BGP session down on eNodeB?"

Retrieved chunks (top 5, ordered by distance):

#	Distance	Type	Vendor	Region
1	0.31	incident_report	Ericsson	APAC
2	0.34	incident_report	Nokia	EMEA
3	0.39	runbook	Ericsson	—
4	0.42	incident_report	Huawei	LATAM
5	0.51	teleqna	—	—

Generated response (streaming):

Based on the retrieved incident reports and runbooks, BGP session down on eNodeB typically indicates one of three root causes:

1. IP connectivity loss — Verify Layer 3 reachability to the peer using ping from the OAM interface. Check for route table changes, particularly if this coincides with a maintenance window.

2. BGP timer mismatch — Confirm hold-time and keepalive values match between the eNodeB and the core router. Ericsson APAC incident ENB-4821 (March 2024) was resolved by aligning the keepalive from 60s to 30s.

3. Authentication failure — If MD5 authentication is configured, verify the key hasn’t been rotated on one side without updating the other. Nokia EMEA incident from last quarter shows this as the second most common cause after IP connectivity.

Resolution steps from the Ericsson runbook: [specific steps follow]

That’s the system working correctly. Real chunks, real context, specific resolution steps with vendor and region references.

Links

Live demo: jocular-torte-262c74.netlify.app
GitHub: github.com/aidataguy/telecom-rag
HF Space: huggingface.co/spaces/aidataguy/telecom-rag-api

If you’re building something similar, the single most important thing I can tell you is this: spend 80% of your effort on the data pipeline and retrieval, and 20% on the model. The retrieval system is your product. The model is a commodity.