🤖 How to Evaluate a RAG Pipeline with Node.js (with Real Examples!)

RAG (Retrieval-Augmented Generation) is the backbone of AI-powered apps like document Q&A, chatbots, customer support copilots, and even legal research assistants.

But here’s the truth 👉 just building a RAG pipeline is not enough.

You need to evaluate it properly. Otherwise, you risk deploying a chatbot that:
❌ misses key information,
❌ hallucinates facts, or
❌ costs you a fortune in tokens.

In this blog, we’ll go deep into how to evaluate a RAG pipeline in Node.js, with real PDF examples, code snippets, and all the metrics that matter.

🔹 What is a RAG Pipeline?

A RAG pipeline has two main components:

Retriever → Finds the most relevant chunks from your knowledge base (PDFs, docs, wikis).
Generator (LLM) → Uses those chunks to craft a final answer.

Think of it like a student answering exam questions:

The retriever is the student searching their notes 📚
The generator is how well they explain the answer ✍️

If retrieval fails → wrong notes.
If generation fails → the explanation is nonsense.

That’s why evaluation is critical. ✅

🔹 Real-Life Example: Smartwatch PDF RAG

Imagine you uploaded a 50-page smartwatch user manual into your RAG system.

You want it to answer questions like:

“⏱ How do I reset the smartwatch?”
“🔋 What’s the battery life?”

Here’s how we build an evaluation dataset for testing:

[
  {
    "query": "How do I reset the smartwatch?",
    "reference_answer": "Hold the power button for 10 seconds.",
    "relevant_chunks": [
      "To reset the smartwatch, hold the power button for 10 seconds."
    ]
  },
  {
    "query": "What is the battery life of the smartwatch?",
    "reference_answer": "Battery lasts up to 48 hours with normal use.",
    "relevant_chunks": [
      "Battery life lasts up to 48 hours with normal use."
    ]
  }
]
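
Before computing any metric, it can help to sanity-check the dataset itself. Here is a minimal sketch, assuming the three field names shown above; `validateEvalDataset` is a hypothetical helper written for this post, not part of any library:

```javascript
// Hypothetical helper: returns a list of problems found in the eval dataset.
// An empty list means every item has the fields used in this post.
function validateEvalDataset(dataset) {
  if (!Array.isArray(dataset)) return ["dataset must be an array"];
  const problems = [];
  dataset.forEach((item, i) => {
    if (item === null || typeof item !== "object") {
      problems.push(`item ${i}: must be an object`);
      return;
    }
    if (typeof item.query !== "string" || item.query.trim() === "") {
      problems.push(`item ${i}: "query" must be a non-empty string`);
    }
    if (typeof item.reference_answer !== "string" || item.reference_answer.trim() === "") {
      problems.push(`item ${i}: "reference_answer" must be a non-empty string`);
    }
    if (!Array.isArray(item.relevant_chunks) || item.relevant_chunks.length === 0) {
      problems.push(`item ${i}: "relevant_chunks" must be a non-empty array`);
    }
  });
  return problems;
}

const sampleDataset = [
  {
    query: "How do I reset the smartwatch?",
    reference_answer: "Hold the power button for 10 seconds.",
    relevant_chunks: ["To reset the smartwatch, hold the power button for 10 seconds."],
  },
];

console.log(validateEvalDataset(sampleDataset)); // prints []
```

Catching an empty `relevant_chunks` array early matters because recall divides by its length.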

Now, we can check if our pipeline:

Retrieves the right passages ✅
Generates correct, fluent answers ✅
Runs within acceptable latency & cost ✅

🔹 The Metrics That Matter 🎯

We evaluate RAG across three categories:

1️⃣ Retrieval Metrics 🔍

a) Precision@K

Definition: % of the top-K retrieved documents that are actually relevant.

Example:

Query: “How do I reset the smartwatch?”

Here are the retrieved top 3 chunks:

“To reset the smartwatch, hold the power button for 10 seconds.” ✅

“Battery lasts 48 hours.” ❌

“To connect to Bluetooth, go to settings.” ❌

Precision@3 = 1 relevant / 3 retrieved = 33%

b) Recall@K

Definition: % of all relevant documents that appear in top-K retrieved documents.

Example:

Relevant chunks in PDF: 2 chunks (reset instructions + troubleshooting reset tips)

Retrieved top 3: only the main reset instruction is retrieved

Recall@3 = 1 retrieved relevant / 2 total relevant = 50%
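
The two worked examples above can be reproduced in a few lines. This is a small sketch that uses made-up chunk IDs instead of real text; the function names `precisionAtKIds` and `recallAtKIds` are just for illustration:

```javascript
// Chunk IDs are invented for this example.
const retrievedIds = ["reset-steps", "battery-life", "bluetooth-setup"];
const relevantIds = new Set(["reset-steps", "reset-troubleshooting"]);

// Share of the top-K results that are relevant.
function precisionAtKIds(retrieved, relevant, k) {
  const hits = retrieved.slice(0, k).filter(id => relevant.has(id)).length;
  return hits / k;
}

// Share of all relevant chunks that show up in the top-K results.
function recallAtKIds(retrieved, relevant, k) {
  if (relevant.size === 0) return 0;
  const topK = new Set(retrieved.slice(0, k));
  const found = [...relevant].filter(id => topK.has(id)).length;
  return found / relevant.size;
}

console.log(precisionAtKIds(retrievedIds, relevantIds, 3).toFixed(2)); // 0.33
console.log(recallAtKIds(retrievedIds, relevantIds, 3).toFixed(2));    // 0.50
```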

c) MRR (Mean Reciprocal Rank)

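Definition: How early the first relevant document appears in the retrieval list, averaged over all queries.
Example:
First relevant doc is in position 2 (out of 3)
Reciprocal rank for this query = 1 / 2 = 0.5 (MRR is the mean of this value across all queries)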
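
The example above covers a single query; the "mean" in MRR comes from averaging that value over the whole dataset. A minimal sketch, assuming you have already recorded the 1-based rank of the first relevant chunk for each query (the ranks below are made up):

```javascript
// firstRelevantRanks[i] is the 1-based position of the first relevant chunk
// for query i, or null when nothing relevant was retrieved.
function meanReciprocalRankOverQueries(firstRelevantRanks) {
  if (firstRelevantRanks.length === 0) return 0;
  const total = firstRelevantRanks.reduce(
    (sum, rank) => sum + (rank ? 1 / rank : 0),
    0
  );
  return total / firstRelevantRanks.length;
}

// Three queries: first hit at rank 2, at rank 1, and no hit at all.
console.log(meanReciprocalRankOverQueries([2, 1, null])); // (0.5 + 1 + 0) / 3 = 0.5
```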

d) nDCG (Normalized Discounted Cumulative Gain)

Definition: Rewards higher-ranked relevant documents more than lower-ranked ones.
Example:
Relevance scores of retrieved docs (top 3): [1, 1, 0]

Doc1: relevant → 1

Doc2: relevant → 1

Doc3: irrelevant → 0

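DCG = 1/log2(1+1) + 1/log2(2+1) + 0/log2(3+1) = 1 + 0.63 + 0 ≈ 1.63
The ideal ordering is also [1, 1, 0], so ideal DCG ≈ 1.63 and nDCG = 1.63 / 1.63 = 1 (the ranking is already optimal).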
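
To see how ranking order changes the score, here is a small sketch of the same calculation. It assumes binary relevance (0 or 1) and uses the relevance value directly as the gain, which matches the formula in this example:

```javascript
// Discounted cumulative gain: each relevance score is divided by log2(position + 1).
function dcgFromScores(relevances) {
  return relevances.reduce((sum, rel, i) => sum + rel / Math.log2(i + 2), 0);
}

// nDCG compares the actual ordering with the best possible ordering of the same scores.
function ndcgFromScores(relevances) {
  const ideal = dcgFromScores([...relevances].sort((a, b) => b - a));
  return ideal === 0 ? 0 : dcgFromScores(relevances) / ideal;
}

console.log(ndcgFromScores([1, 1, 0]).toFixed(3)); // 1.000 (relevant docs ranked first)
console.log(ndcgFromScores([0, 1, 1]).toFixed(3)); // 0.693 (same docs, worse order)
```

Both lists contain two relevant chunks, so Precision@3 is identical; only nDCG shows that the second ranking is worse.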

2️⃣ Generation Metrics ✍️

a) Faithfulness

Definition: Answer is grounded in retrieved chunks (no hallucination).
Example:
Retrieved chunk: “Hold the power button for 10 seconds to reset.”
Generated answer: “Press both buttons for 30 seconds” ❌
Faithfulness score: 0 / Not faithful

b) Correctness

Definition: Matches the reference answer from the PDF.
Example:
Reference answer: “Hold the power button for 10 seconds.”
Generated answer: “Hold the power button for 10 seconds.” ✅
Correctness = 1 / Correct

c) Relevance

Definition: Does the answer address the user’s query directly?
Example:
Query: “How do I reset the smartwatch?”
Generated answer: “The battery lasts 48 hours.” ❌
Relevance = 0 / Not relevant

d) Fluency

Definition: Is the answer readable, coherent, and grammatically correct?
Example:
Generated answer: “Hold button for 10 second reset device” ❌ (grammatically incorrect)
Fluency score: 0.3 / 1
Correct version: “Hold the power button for 10 seconds to reset the device.” ✅

e) Conciseness

Definition: Is the answer appropriately short & clear?
Example:
Overly verbose: “To reset your smartwatch, first you need to find the power button, then press it and hold it for a sufficient amount of time, which is generally 10 seconds, and then release it.” ❌
Concise: “Hold the power button for 10 seconds to reset.” ✅

f) Semantic Similarity

Definition: Cosine similarity between generated answer and reference answer embeddings.
Example:
Generated: “Press the power button for ten seconds to restart the smartwatch.”
Reference: “Hold the power button for 10 seconds.”
Cosine similarity = 0.95 → high semantic overlap ✅
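
Cosine similarity itself is just a bit of vector math. The sketch below uses toy 3-dimensional vectors so you can check it by hand; real embeddings have hundreds or thousands of dimensions, and the vectors here are not produced by any model:

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0;
  let magA = 0;
  let magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  if (magA === 0 || magB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

console.log(cosineSimilarity([1, 2, 3], [2, 4, 6]).toFixed(3)); // 1.000 (same direction)
console.log(cosineSimilarity([1, 0, 0], [0, 1, 0]).toFixed(3)); // 0.000 (orthogonal)
```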

3️⃣ System Metrics ⚡

a) Latency

Definition: Time taken per query.
Example:
Retrieval: 120 ms
Generation: 800 ms
Total latency = 920 ms ≈ 1 second per query
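
A single measurement can be misleading, so it is common to look at percentiles over many queries. Here is a rough sketch: `timed` is a hypothetical wrapper around any async call, the percentile uses the simple nearest-rank method, and the latency values are made up. It assumes a Node.js version where `performance` is available as a global:

```javascript
// Hypothetical wrapper: runs an async function and reports how long it took.
async function timed(fn) {
  const start = performance.now();
  const result = await fn();
  return { result, ms: performance.now() - start };
}

// Nearest-rank percentile over recorded latencies (in ms).
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

const latenciesMs = [920, 880, 1010, 950, 2400]; // made-up measurements
console.log(percentile(latenciesMs, 50)); // 950
console.log(percentile(latenciesMs, 95)); // 2400
```

The p95 value exposes the slow outlier that an average would partly hide.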

b) Throughput

Definition: Number of queries the system can handle per second.
Example:
If latency is 1s per query and queries run one at a time → throughput = 1 query/sec
With batching / async retrieval → can improve to 5–10 queries/sec
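
One way to measure this is to push a batch of queries through the pipeline with a fixed number in flight and divide by the elapsed time. The sketch below is only an illustration: `measureThroughput` is a hypothetical helper, and `fakePipeline` stands in for a real RAG call:

```javascript
// Runs every query through `runQuery` with at most `concurrency` in flight,
// then reports completed queries per second.
async function measureThroughput(queries, runQuery, concurrency) {
  const start = Date.now();
  let next = 0;
  async function worker() {
    while (next < queries.length) {
      const query = queries[next++];
      await runQuery(query);
    }
  }
  await Promise.all(Array.from({ length: concurrency }, () => worker()));
  const seconds = (Date.now() - start) / 1000;
  return { completed: queries.length, seconds, qps: queries.length / seconds };
}

// Stand-in for a real pipeline call: resolves after 50 ms.
const fakePipeline = () => new Promise(resolve => setTimeout(resolve, 50));

measureThroughput(Array(20).fill("How do I reset the smartwatch?"), fakePipeline, 5)
  .then(stats => console.log(`${stats.qps.toFixed(1)} queries/sec`));
```

With a real API, the achievable concurrency is usually capped by provider rate limits, so treat the numbers from a local stub as an upper bound.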

c) Cost per query

Definition: Token usage × API cost
Example:
Retrieval embeddings: 50 tokens
Generation: 300 tokens
Cost: $0.0004 per 1000 tokens → 350 tokens × $0.0004 / 1000 = $0.00014/query
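
The arithmetic is easy to wrap in a helper. This sketch assumes one flat price per 1,000 tokens, as in the example; many providers charge different rates for input and output tokens, so treat it as a simplification:

```javascript
// Flat-rate cost estimate: total tokens divided by 1000, times the price per 1000 tokens.
function costPerQuery(tokenCounts, pricePer1kTokensUsd) {
  const totalTokens = tokenCounts.reduce((sum, n) => sum + n, 0);
  return (totalTokens / 1000) * pricePer1kTokensUsd;
}

const cost = costPerQuery([50, 300], 0.0004); // embedding tokens + generation tokens
console.log(cost.toFixed(5));                 // 0.00014
console.log((cost * 100000).toFixed(2));      // 14.00 (USD for 100,000 queries)
```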

d) Context usage

Definition: Tokens used for retrieved docs + query + answer
Example:
Query tokens: 10
Retrieved chunk tokens: 500
Generated answer tokens: 50
Total context length: 560 tokens

🔹 Node.js Code: Evaluating a RAG Pipeline

Let’s roll up our sleeves 🧑‍💻

1. Setup

npm install langchain @langchain/openai hnswlib-node

(fs is built into Node.js, so it doesn't need to be installed.)

2. Directory Structure

project/
├─ vector_store/          # pre-built embeddings from PDFs
├─ eval_dataset.json      # dataset of queries, reference answers, relevant chunks
├─ eval_rag.js            # evaluation script

3. Example `eval_dataset.json`

[
  {
    "query": "How do I reset the smartwatch?",
    "reference_answer": "Hold the power button for 10 seconds.",
    "relevant_chunks": [
      "To reset the smartwatch, hold the power button for 10 seconds."
    ]
  },
  {
    "query": "What is the battery life of the smartwatch?",
    "reference_answer": "Battery lasts up to 48 hours with normal use.",
    "relevant_chunks": [
      "Battery life lasts up to 48 hours with normal use."
    ]
  }
]

4. `eval_rag.js` — Full Evaluation Script

import fs from "fs";
import { OpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { HNSWLib } from "langchain/vectorstores/hnswlib";
import { RetrievalQAChain } from "langchain/chains";

// --- Load vector store ---
const vectorStore = await HNSWLib.load("./vector_store", new OpenAIEmbeddings());
const retriever = vectorStore.asRetriever();
// --- Setup LLM ---
const llm = new OpenAI({ modelName: "gpt-4o-mini", temperature: 0 });
const ragChain = RetrievalQAChain.fromLLM(llm, retriever);
// --- Load dataset ---
const dataset = JSON.parse(fs.readFileSync("./eval_dataset.json", "utf-8"));
// --- Metrics Functions ---
// Retrieval metrics
function precisionAtK(retrieved, relevant, k) {
  const topK = retrieved.slice(0, k);
  const hits = topK.filter(doc => relevant.some(r => doc.pageContent.includes(r))).length;
  return hits / k;
}
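function recallAtK(retrieved, relevant, k) {
  if (relevant.length === 0) return 0;
  const topK = retrieved.slice(0, k);
  // Count each relevant chunk once, even if several retrieved docs contain it
  const found = relevant.filter(r => topK.some(doc => doc.pageContent.includes(r))).length;
  return found / relevant.length;
}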
function f1AtK(retrieved, relevant, k) {
  const p = precisionAtK(retrieved, relevant, k);
  const r = recallAtK(retrieved, relevant, k);
  return p + r === 0 ? 0 : (2 * p * r) / (p + r);
}
function meanReciprocalRank(retrieved, relevant) {
  for (let i = 0; i < retrieved.length; i++) {
    if (relevant.some(r => retrieved[i].pageContent.includes(r))) {
      return 1 / (i + 1);
    }
  }
  return 0;
}
function dcg(scores) {
  return scores.reduce((sum, rel, i) => sum + (Math.pow(2, rel) - 1) / Math.log2(i + 2), 0);
}
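function ndcg(retrieved, relevant) {
  const relScores = retrieved.map(doc => relevant.some(r => doc.pageContent.includes(r)) ? 1 : 0);
  const ideal = [...relScores].sort((a, b) => b - a);
  const idealDcg = dcg(ideal);
  // No relevant doc retrieved: return 0 instead of NaN (0 / 0)
  return idealDcg === 0 ? 0 : dcg(relScores) / idealDcg;
}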
// Generation metrics
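function checkCorrectness(answer, reference) {
  // Case- and punctuation-insensitive substring match; paraphrased answers will still fail this check
  const normalize = s => s.toLowerCase().replace(/[^\p{L}\p{N}\s]/gu, "").replace(/\s+/g, " ").trim();
  return normalize(answer).includes(normalize(reference));
}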
// Optional: Semantic similarity (cosine) using embeddings
async function semanticSimilarity(text1, text2) {
  const embeddings = new OpenAIEmbeddings();
  const vec1 = await embeddings.embedQuery(text1);
  const vec2 = await embeddings.embedQuery(text2);
  const dot = vec1.reduce((sum, i, idx) => sum + i * vec2[idx], 0);
  const mag1 = Math.sqrt(vec1.reduce((sum, i) => sum + i*i, 0));
  const mag2 = Math.sqrt(vec2.reduce((sum, i) => sum + i*i, 0));
  return dot / (mag1 * mag2); // cosine similarity
}
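// LLM-as-judge for Correctness, Faithfulness, Fluency, Relevance
// `context` (the retrieved chunks) is optional, but faithfulness can only be judged against it
async function evaluateWithLLM(query, answer, reference, context = "") {
  const evalPrompt = `
  You are an evaluator.
  Question: ${query}
  Retrieved Context: ${context || "(not provided)"}
  Generated Answer: ${answer}
  Reference Answer: ${reference}
  Score the answer from 0 to 1 on each of:
  - correctness (agrees with the reference answer)
  - faithfulness (supported by the retrieved context)
  - fluency
  - relevance (addresses the question)
  Respond with only a JSON object using those four keys.
  `;
  const raw = await llm.call(evalPrompt);
  // Models sometimes wrap JSON in prose or code fences, so parse only the {...} part
  const match = raw.match(/\{[\s\S]*\}/);
  return match ? JSON.parse(match[0]) : null;
}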
// --- Evaluation Loop ---
for (const item of dataset) {
  console.log(`\nQ: ${item.query}`);
  const start = Date.now();
  // Run RAG pipeline
  const res = await ragChain.call({ query: item.query });
  const latency = Date.now() - start;
  console.log(`Answer: ${res.text}`);
  console.log(`Latency (ms): ${latency}`);
  // --- Retrieval metrics ---
  const retrievedDocs = await retriever.getRelevantDocuments(item.query);
  console.log("Precision@3:", precisionAtK(retrievedDocs, item.relevant_chunks, 3));
  console.log("Recall@3:", recallAtK(retrievedDocs, item.relevant_chunks, 3));
  console.log("F1@3:", f1AtK(retrievedDocs, item.relevant_chunks, 3));
  console.log("MRR:", meanReciprocalRank(retrievedDocs, item.relevant_chunks));
  console.log("nDCG:", ndcg(retrievedDocs, item.relevant_chunks));
  // --- Generation metrics ---
  console.log("Correctness:", checkCorrectness(res.text, item.reference_answer) ? "✅" : "❌");
  const sim = await semanticSimilarity(res.text, item.reference_answer);
  console.log("Semantic Similarity:", sim.toFixed(3));
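  // Pass the retrieved text along so the judge can compare the answer against it
  const context = retrievedDocs.map(doc => doc.pageContent).join("\n");
  const llmEval = await evaluateWithLLM(item.query, res.text, item.reference_answer, context);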
  console.log("LLM Evaluation:", llmEval);
}

✅ What This Script Does

Retrieval Metrics: Precision@K, Recall@K, F1@K, MRR, nDCG
Generation Metrics: Correctness, Semantic similarity, LLM evaluation (Faithfulness, Relevance, Fluency)
System Metrics: Latency (you can extend for Throughput & Token usage)
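
The script prints numbers per query. To compare two runs (say, before and after changing the chunk size), averages across the dataset are easier to read. A minimal sketch, assuming you collect each query's metrics into a plain object; `averageMetrics` and the metric names are hypothetical:

```javascript
// Hypothetical helper: averages every numeric metric across per-query results.
// Non-numeric and NaN values are skipped.
function averageMetrics(rows) {
  const totals = {};
  const counts = {};
  for (const row of rows) {
    for (const [name, value] of Object.entries(row)) {
      if (typeof value !== "number" || Number.isNaN(value)) continue;
      totals[name] = (totals[name] ?? 0) + value;
      counts[name] = (counts[name] ?? 0) + 1;
    }
  }
  return Object.fromEntries(
    Object.keys(totals).map(name => [name, totals[name] / counts[name]])
  );
}

// Made-up per-query results.
const perQuery = [
  { precisionAt3: 1, recallAt3: 0.5, latencyMs: 920 },
  { precisionAt3: 0, recallAt3: 1, latencyMs: 1080 },
];

console.log(averageMetrics(perQuery));
// { precisionAt3: 0.5, recallAt3: 0.75, latencyMs: 1000 }
```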

This gives you a starting evaluation framework for PDF RAG pipelines that you can extend with more queries and metrics.

🔹 Key Takeaways

✅ Retrieval → Are we fetching the right knowledge? (Precision, Recall, nDCG)
✅ Generation → Are answers faithful, correct, and fluent?
✅ System → Is it fast & cost-efficient for production?

RAG evaluation isn’t just about correctness — it’s about balancing accuracy, speed, and cost for real-world use.

🚀 Final Thoughts

When you deploy a RAG pipeline for your business docs, support center, or research papers, evaluation will help you:

Spot hallucinations before users do 🔍
Optimize retrievers & embeddings 📊
Keep costs under control 💸
Build trustworthy AI assistants 🤝

So don’t just build RAG. Evaluate it. Improve it. Scale it.
