🤖 How to Evaluate a RAG Pipeline with Node.js (with Real Examples!)

RAG (Retrieval-Augmented Generation) is the backbone of AI-powered apps like document Q&A, chatbots, customer support copilots, and even legal research assistants.

But here’s the truth 👉 just building a RAG pipeline is not enough.

You need to evaluate it properly. Otherwise, you risk deploying a chatbot that:
❌ misses key information,
❌ hallucinates facts, or
❌ costs you a fortune in tokens.

In this post, we’ll go deep into how to evaluate a RAG pipeline in Node.js, with real PDF examples, code snippets, and all the metrics that matter.

🔹 What is a RAG Pipeline?

A RAG pipeline has two main components:

  1. Retriever → Finds the most relevant chunks from your knowledge base (PDFs, docs, wikis).

  2. Generator (LLM) → Uses those chunks to craft a final answer.

Think of it like a student answering exam questions:

  • The retriever is the student searching their notes 📚

  • The generator is how well they explain the answer ✍️

If retrieval fails → wrong notes.
If generation fails → the explanation is nonsense.

That’s why evaluation is critical. ✅
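
To make the two stages concrete, here is a toy sketch. It is only an illustration under loud assumptions: `retrieve` is a hypothetical keyword-overlap retriever (a real pipeline would use embeddings and a vector store), and `generate` is a hypothetical placeholder that just builds the prompt a real LLM would receive.

```javascript
// Toy knowledge base: in a real pipeline these would be chunks extracted from PDFs.
const chunks = [
  "To reset the smartwatch, hold the power button for 10 seconds.",
  "Battery life lasts up to 48 hours with normal use.",
  "To connect to Bluetooth, go to settings.",
];

// Lowercase a string and return its set of word tokens.
function tokenize(text) {
  return new Set(text.toLowerCase().match(/[a-z0-9]+/g) ?? []);
}

// Hypothetical retriever: rank chunks by how many query words they share.
function retrieve(query, k = 1) {
  const queryTokens = tokenize(query);
  return chunks
    .map((chunk) => {
      const overlap = [...tokenize(chunk)].filter((t) => queryTokens.has(t)).length;
      return { chunk, overlap };
    })
    .sort((a, b) => b.overlap - a.overlap)
    .slice(0, k)
    .map((entry) => entry.chunk);
}

// Hypothetical generator: a real system would send this prompt to an LLM.
function generate(query, context) {
  return `Question: ${query}\nContext: ${context.join(" ")}`;
}

const question = "How do I reset the smartwatch?";
console.log(generate(question, retrieve(question)));
```

If `retrieve` returns the wrong chunk, even a perfect generator has nothing useful to work with, which is why the two stages are evaluated separately.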

🔹 Real-Life Example: Smartwatch PDF RAG

Imagine you uploaded a 50-page smartwatch user manual into your RAG system.

You want it to answer questions like:

  • “⏱ How do I reset the smartwatch?”

  • “🔋 What’s the battery life?”

Here’s how we build an evaluation dataset for testing:

[
  {
    "query": "How do I reset the smartwatch?",
    "reference_answer": "Hold the power button for 10 seconds.",
    "relevant_chunks": [
      "To reset the smartwatch, hold the power button for 10 seconds."
    ]
  },
  {
    "query": "What is the battery life of the smartwatch?",
    "reference_answer": "Battery lasts up to 48 hours with normal use.",
    "relevant_chunks": [
      "Battery life lasts up to 48 hours with normal use."
    ]
  }
]
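
Before running any metrics, it can help to sanity-check the dataset shape. The sketch below is one possible check, not part of any library: `validateDataset` is a hypothetical helper that only assumes the three fields used in the JSON above.

```javascript
// Return a list of human-readable problems; an empty list means the dataset looks usable.
function validateDataset(dataset) {
  if (!Array.isArray(dataset)) return ["dataset must be an array"];
  const problems = [];
  dataset.forEach((item, i) => {
    if (item === null || typeof item !== "object") {
      problems.push(`item ${i}: must be an object`);
      return;
    }
    if (typeof item.query !== "string" || item.query.trim() === "") {
      problems.push(`item ${i}: "query" must be a non-empty string`);
    }
    if (typeof item.reference_answer !== "string" || item.reference_answer.trim() === "") {
      problems.push(`item ${i}: "reference_answer" must be a non-empty string`);
    }
    if (!Array.isArray(item.relevant_chunks) || item.relevant_chunks.length === 0) {
      problems.push(`item ${i}: "relevant_chunks" must be a non-empty array`);
    }
  });
  return problems;
}

const sample = [
  {
    query: "How do I reset the smartwatch?",
    reference_answer: "Hold the power button for 10 seconds.",
    relevant_chunks: ["To reset the smartwatch, hold the power button for 10 seconds."],
  },
  // Deliberately broken entry: no reference answer and no relevant chunks.
  { query: "What is the battery life of the smartwatch?", relevant_chunks: [] },
];

console.log(validateDataset(sample));
```

The second entry is reported twice, once for the missing reference answer and once for the empty chunk list.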

Now, we can check if our pipeline:

  • Retrieves the right passages ✅

  • Generates correct, fluent answers ✅

  • Runs within acceptable latency & cost ✅

🔹 The Metrics That Matter 🎯

We evaluate RAG across three categories:

1️⃣ Retrieval Metrics 🔍

a) Precision@K

Definition: the share of the top-K retrieved chunks that are actually relevant.

  • Example:

Query: “How do I reset the smartwatch?”

  • Here are the retrieved top 3 chunks:

“To reset the smartwatch, hold the power button for 10 seconds.” ✅

“Battery lasts 48 hours.” ❌

“To connect to Bluetooth, go to settings.” ❌

  • Precision@3 = 1 relevant / 3 retrieved ≈ 33%
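
The same calculation as a small sketch. It assumes relevance is decided by exact string match against a labelled list, which is a simplification; real chunk boundaries rarely line up that neatly.

```javascript
// Precision@K: fraction of the top-K retrieved chunks that are labelled relevant.
function precisionAtK(retrieved, relevant, k) {
  const relevantSet = new Set(relevant);
  const hits = retrieved.slice(0, k).filter((chunk) => relevantSet.has(chunk)).length;
  return hits / k;
}

const retrieved = [
  "To reset the smartwatch, hold the power button for 10 seconds.",
  "Battery lasts 48 hours.",
  "To connect to Bluetooth, go to settings.",
];
const relevant = ["To reset the smartwatch, hold the power button for 10 seconds."];

console.log(precisionAtK(retrieved, relevant, 3).toFixed(2)); // prints 0.33
```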

b) Recall@K

Definition: the share of all relevant documents that appear in the top-K retrieved documents.

  • Example:

Relevant chunks in PDF: 2 chunks (reset instructions + troubleshooting reset tips)

Retrieved top 3: only the main reset instruction is retrieved

  • Recall@3 = 1 retrieved relevant / 2 total relevant = 50%
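
A minimal sketch of that calculation. The chunk IDs are hypothetical labels invented for this example; the point is that recall divides by the number of relevant chunks, not by K.

```javascript
// Recall@K: fraction of all labelled relevant chunks that appear in the top-K results.
function recallAtK(retrieved, relevant, k) {
  if (relevant.length === 0) return 0; // nothing to find, so define recall as 0
  const topK = new Set(retrieved.slice(0, k));
  const found = relevant.filter((chunk) => topK.has(chunk)).length;
  return found / relevant.length;
}

// Hypothetical chunk IDs standing in for the two relevant passages in the PDF.
const relevantIds = ["reset-instructions", "reset-troubleshooting"];
const retrievedIds = ["reset-instructions", "battery-life", "bluetooth-setup"];

console.log(recallAtK(retrievedIds, relevantIds, 3)); // prints 0.5
```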

c) MRR (Mean Reciprocal Rank)

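  • Definition: The reciprocal of the rank at which the first relevant document appears, averaged over all queries. The earlier the first hit, the higher the score.

  • Example:

  • First relevant doc is in position 2 (out of 3)

  • Reciprocal rank for this query = 1 / 2 = 0.5; MRR is the mean of this value across the whole evaluation set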
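
Here is a short sketch that separates the per-query reciprocal rank from the mean over queries. The boolean arrays are made-up relevance flags for three hypothetical queries.

```javascript
// Reciprocal rank for one query: 1 / (1-based rank of the first relevant result), or 0 if none.
function reciprocalRank(relevanceFlags) {
  const firstHit = relevanceFlags.indexOf(true);
  return firstHit === -1 ? 0 : 1 / (firstHit + 1);
}

// MRR: the mean of the reciprocal ranks over all queries.
function meanReciprocalRank(flagsPerQuery) {
  if (flagsPerQuery.length === 0) return 0;
  const total = flagsPerQuery.reduce((sum, flags) => sum + reciprocalRank(flags), 0);
  return total / flagsPerQuery.length;
}

// Each inner array marks, in rank order, whether a retrieved chunk was relevant.
const flagsPerQuery = [
  [false, true, false], // first hit at rank 2 -> 0.5
  [true, false, false], // first hit at rank 1 -> 1
  [false, false, false], // no hit -> 0
];

console.log(meanReciprocalRank(flagsPerQuery)); // prints 0.5
```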

d) nDCG (Normalized Discounted Cumulative Gain)

  • Definition: Rewards higher-ranked relevant documents more than lower-ranked ones.

  • Example:

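  • Relevance scores of retrieved docs (top 3): [1, 1, 0]

Doc1: relevant → 1

Doc2: relevant → 1

Doc3: irrelevant → 0

  • DCG = 1/log2(1+1) + 1/log2(2+1) + 0/log2(3+1) ≈ 1 + 0.631 + 0 = 1.631

  • Ideal DCG (the same docs in the best possible order, [1, 1, 0]) ≈ 1.631

  • nDCG = DCG / ideal DCG = 1 (perfect, because the relevant docs are already ranked first)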
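
The worked example can be checked with a few lines of code. This sketch assumes the common exponential gain (2^rel - 1), which equals the plain relevance score when scores are 0 or 1, and it builds the ideal ranking only from the retrieved list, so it does not penalise relevant docs that were never retrieved.

```javascript
// DCG with gain (2^rel - 1) and a log2(rank + 1) discount (ranks are 1-based).
function dcg(scores) {
  return scores.reduce((sum, rel, i) => sum + (2 ** rel - 1) / Math.log2(i + 2), 0);
}

// nDCG: DCG of the actual ranking divided by the DCG of the best possible ordering.
function ndcg(scores) {
  const ideal = [...scores].sort((a, b) => b - a);
  const idealDcg = dcg(ideal);
  return idealDcg === 0 ? 0 : dcg(scores) / idealDcg; // guard against 0 / 0
}

console.log(ndcg([1, 1, 0])); // prints 1 (already optimally ranked)
console.log(ndcg([0, 1, 1]).toFixed(3)); // prints 0.693 (relevant docs ranked too low)
```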

2️⃣ Generation Metrics ✍️

a) Faithfulness

  • Definition: Answer is grounded in retrieved chunks (no hallucination).

  • Example:

  • Retrieved chunk: “Hold the power button for 10 seconds to reset.”

  • Generated answer: “Press both buttons for 30 seconds” ❌

  • Faithfulness score: 0 / Not faithful
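
Faithfulness is usually scored by a human or an LLM judge, but a cheap lexical check can flag obvious problems first. The sketch below is a rough heuristic, not a standard metric: `lexicalSupport` is a hypothetical helper that measures what fraction of the answer's words also occur in the retrieved context. A low score is a warning sign; a high score does not prove the answer is faithful.

```javascript
// Split text into lowercase word tokens.
function tokens(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Fraction of answer tokens that also occur in the retrieved context (0 to 1).
function lexicalSupport(answer, context) {
  const contextTokens = new Set(tokens(context));
  const answerTokens = tokens(answer);
  if (answerTokens.length === 0) return 0;
  const supported = answerTokens.filter((t) => contextTokens.has(t)).length;
  return supported / answerTokens.length;
}

const chunk = "Hold the power button for 10 seconds to reset.";
// Only "for" and "seconds" appear in the chunk: 2 of 6 tokens.
console.log(lexicalSupport("Press both buttons for 30 seconds", chunk).toFixed(2)); // prints 0.33
// Every token appears in the chunk.
console.log(lexicalSupport("Hold the power button for 10 seconds.", chunk)); // prints 1
```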

b) Correctness

  • Definition: Matches the reference answer from the PDF.

  • Example:

  • Reference answer: “Hold the power button for 10 seconds.”

  • Generated answer: “Hold the power button for 10 seconds.” ✅

  • Correctness = 1 / Correct
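
Exact matches like this are rare in practice, so a softer score is often useful. The sketch below shows two common options, normalised exact match and token-level F1 (in the style of extractive QA benchmarks). Both helpers are illustrative names, not part of any library used in this post.

```javascript
// Lowercase, strip punctuation, and split into word tokens.
function normalize(text) {
  return text.toLowerCase().replace(/[^a-z0-9\s]/g, "").split(/\s+/).filter(Boolean);
}

// Exact match after normalisation (ignores case and punctuation).
function exactMatch(answer, reference) {
  return normalize(answer).join(" ") === normalize(reference).join(" ");
}

// Token-level F1: harmonic mean of token precision and recall against the reference.
function tokenF1(answer, reference) {
  const answerTokens = normalize(answer);
  const remaining = normalize(reference);
  const referenceLength = remaining.length;
  let overlap = 0;
  for (const token of answerTokens) {
    const idx = remaining.indexOf(token);
    if (idx !== -1) {
      overlap += 1;
      remaining.splice(idx, 1); // consume the token so repeats are not double-counted
    }
  }
  if (overlap === 0) return 0;
  const precision = overlap / answerTokens.length;
  const recall = overlap / referenceLength;
  return (2 * precision * recall) / (precision + recall);
}

const reference = "Hold the power button for 10 seconds.";
console.log(exactMatch("hold the power button for 10 seconds", reference)); // prints true
// Precision 7/9, recall 7/7, so F1 is about 0.875.
console.log(tokenF1("Hold the power button for 10 seconds to reset.", reference));
```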

c) Relevance

  • Definition: Does the answer address the user’s query directly?

  • Example:

  • Query: “How do I reset the smartwatch?”

  • Generated answer: “The battery lasts 48 hours.” ❌

  • Relevance = 0 / Not relevant

d) Fluency

  • Definition: Is the answer readable, coherent, and grammatically correct?

  • Example:

  • Generated answer: “Hold button for 10 second reset device” ❌ (grammatically incorrect)

  • Fluency score: 0.3 / 1

  • Correct version: “Hold the power button for 10 seconds to reset the device.” ✅

e) Conciseness

  • Definition: Is the answer appropriately short & clear?

  • Example:

  • Overly verbose: “To reset your smartwatch, first you need to find the power button, then press it and hold it for a sufficient amount of time, which is generally 10 seconds, and then release it.” ❌

  • Concise: “Hold the power button for 10 seconds to reset.” ✅

f) Semantic Similarity

  • Definition: Cosine similarity between generated answer and reference answer embeddings.

  • Example:

  • Generated: “Press the power button for ten seconds to restart the smartwatch.”

  • Reference: “Hold the power button for 10 seconds.”

  • Cosine similarity ≈ 0.95 (illustrative; the actual value depends on the embedding model) → high semantic overlap ✅
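
The cosine formula itself is independent of the embedding model. This sketch uses tiny hand-written vectors in place of real embeddings, purely to show the arithmetic.

```javascript
// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector has no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy 2-dimensional vectors standing in for real embeddings.
console.log(cosineSimilarity([3, 4], [3, 4])); // prints 1 (same direction)
console.log(cosineSimilarity([1, 0], [0, 1])); // prints 0 (orthogonal)
console.log(cosineSimilarity([1, 0], [1, 1]).toFixed(3)); // prints 0.707
```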

3️⃣ System Metrics ⚡

a) Latency

  • Definition: Time taken per query.

  • Example:

  • Retrieval: 120 ms

  • Generation: 800 ms

  • Total latency = 920 ms ≈ 1 second per query
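
A single measurement hides outliers, so it is worth collecting latencies over many queries and looking at percentiles. In this sketch, `timed` is a hypothetical wrapper you could put around the retrieval and generation calls separately, and the sample latencies are made-up numbers for illustration.

```javascript
// Time an async function and return its result alongside the elapsed milliseconds.
async function timed(fn) {
  const start = performance.now();
  const result = await fn();
  return { result, ms: performance.now() - start };
}

// Nearest-rank percentile of a list of latencies (in ms).
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Made-up per-query latencies in milliseconds, including one slow outlier.
const samples = [920, 850, 1100, 990, 2400, 870, 910, 1005, 960, 930];
console.log("p50:", percentile(samples, 50)); // prints p50: 930
console.log("p95:", percentile(samples, 95)); // prints p95: 2400
```

The median looks healthy here while p95 exposes the 2.4 second outlier, which is the number users on a bad day actually experience.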

b) Throughput

  • Definition: Number of queries the system can handle per second.

  • Example:

  • If latency is 1s per query and queries run one at a time → throughput = 1 query/sec

  • With batching / async retrieval → can improve to 5–10 queries/sec
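
The effect of concurrency can be measured with a small harness. Everything here is a sketch under stated assumptions: `fakeQuery` is a hypothetical stand-in that just waits, and real gains depend on API rate limits and how much of the latency is I/O.

```javascript
// Hypothetical stand-in for one RAG query that takes `ms` milliseconds.
function fakeQuery(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Run `total` queries with at most `concurrency` in flight and return queries per second.
async function measureThroughput(total, concurrency, runQuery) {
  let started = 0;
  async function worker() {
    while (started < total) {
      started += 1; // claim a query synchronously before awaiting
      await runQuery();
    }
  }
  const begin = performance.now();
  await Promise.all(Array.from({ length: concurrency }, () => worker()));
  const seconds = (performance.now() - begin) / 1000;
  return total / seconds;
}

async function main() {
  const sequential = await measureThroughput(20, 1, () => fakeQuery(20));
  const concurrent = await measureThroughput(20, 5, () => fakeQuery(20));
  console.log(`1 in flight: ${sequential.toFixed(1)} queries/sec`);
  console.log(`5 in flight: ${concurrent.toFixed(1)} queries/sec`);
}

main();
```

To measure a real pipeline, pass a function that calls your chain in place of `fakeQuery`.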

c) Cost per query

  • Definition: Token usage × API price per token

  • Example:

  • Retrieval embeddings: 50 tokens

  • Generation: 300 tokens

  • Cost: $0.0004 per 1,000 tokens → 350 tokens × $0.0004 / 1,000 = $0.00014/query
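
The arithmetic as a tiny sketch. It follows the example's simplification of one flat price for all tokens; real providers usually charge different rates for embeddings, input tokens, and output tokens, so treat `pricePer1kTokens` as a placeholder.

```javascript
// Cost in dollars for a given token count at a price quoted per 1,000 tokens.
function costPerQuery({ embeddingTokens, generationTokens, pricePer1kTokens }) {
  const totalTokens = embeddingTokens + generationTokens;
  return (totalTokens / 1000) * pricePer1kTokens;
}

const cost = costPerQuery({
  embeddingTokens: 50,
  generationTokens: 300,
  pricePer1kTokens: 0.0004, // flat illustrative rate from the example above
});

console.log(cost.toFixed(5)); // prints 0.00014
console.log((cost * 100000).toFixed(2)); // prints 14.00 (dollars per 100,000 queries)
```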

d) Context usage

  • Definition: Tokens used for retrieved docs + query + answer

  • Example:

  • Query tokens: 10

  • Retrieved chunk tokens: 500

  • Generated answer tokens: 50

  • Total context length: 560 tokens
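
Tracking this total matters because it has to fit inside the model's context window. A minimal sketch, where the 4,096-token limit is a hypothetical value chosen only for the example:

```javascript
// Sum the token counts that share the context window and compare against a limit.
function contextUsage({ queryTokens, chunkTokens, answerTokens }, maxContextTokens) {
  const total = queryTokens + chunkTokens + answerTokens;
  return {
    total,
    fits: total <= maxContextTokens,
    utilization: total / maxContextTokens,
  };
}

// 4096 is a hypothetical context limit; check the documentation for your model.
const usage = contextUsage({ queryTokens: 10, chunkTokens: 500, answerTokens: 50 }, 4096);
console.log(usage.total, usage.fits); // prints 560 true
```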

🔹 Node.js Code: Evaluating a RAG Pipeline

Let’s roll up our sleeves 🧑‍💻

1. Setup

npm install langchain @langchain/openai @langchain/community hnswlib-node

2. Directory Structure

project/
├─ vector_store/          # pre-built embeddings from PDFs
├─ eval_dataset.json      # dataset of queries, reference answers, relevant chunks
└─ eval_rag.js            # evaluation script

3. Example eval_dataset.json

[
  {
    "query": "How do I reset the smartwatch?",
    "reference_answer": "Hold the power button for 10 seconds.",
    "relevant_chunks": [
      "To reset the smartwatch, hold the power button for 10 seconds."
    ]
  },
  {
    "query": "What is the battery life of the smartwatch?",
    "reference_answer": "Battery lasts up to 48 hours with normal use.",
    "relevant_chunks": [
      "Battery life lasts up to 48 hours with normal use."
    ]
  }
]

4. eval_rag.js: Full Evaluation Script

import fs from "fs";
import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
// HNSWLib ships in @langchain/community (npm install @langchain/community hnswlib-node)
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";
import { RetrievalQAChain } from "langchain/chains";

// Run as an ES module (e.g. "type": "module" in package.json) so top-level await works.
// Import paths and method names vary between LangChain.js releases; adjust them to your installed version.
const K = 3; // cutoff used for the @K metrics

// --- Load vector store ---
const embeddings = new OpenAIEmbeddings();
const vectorStore = await HNSWLib.load("./vector_store", embeddings);
const retriever = vectorStore.asRetriever({ k: K });
// --- Setup LLM (gpt-4o-mini is a chat model, so use the chat wrapper) ---
const llm = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });
const ragChain = RetrievalQAChain.fromLLM(llm, retriever);
// --- Load dataset ---
const dataset = JSON.parse(fs.readFileSync("./eval_dataset.json", "utf-8"));
// --- Metrics Functions ---
// A retrieved doc counts as relevant if it contains a labelled chunk verbatim
const isRelevant = (doc, relevant) => relevant.some(r => doc.pageContent.includes(r));
// Retrieval metrics
function precisionAtK(retrieved, relevant, k) {
  const hits = retrieved.slice(0, k).filter(doc => isRelevant(doc, relevant)).length;
  return hits / k;
}
function recallAtK(retrieved, relevant, k) {
  const topK = retrieved.slice(0, k);
  // Count each labelled chunk once so duplicate hits cannot push recall above 1
  const found = relevant.filter(r => topK.some(doc => doc.pageContent.includes(r))).length;
  return relevant.length === 0 ? 0 : found / relevant.length;
}
function f1AtK(retrieved, relevant, k) {
  const p = precisionAtK(retrieved, relevant, k);
  const r = recallAtK(retrieved, relevant, k);
  return p + r === 0 ? 0 : (2 * p * r) / (p + r);
}
// Reciprocal rank for a single query; averaged over the dataset below to get MRR
function reciprocalRank(retrieved, relevant) {
  const idx = retrieved.findIndex(doc => isRelevant(doc, relevant));
  return idx === -1 ? 0 : 1 / (idx + 1);
}
function dcg(scores) {
  return scores.reduce((sum, rel, i) => sum + (Math.pow(2, rel) - 1) / Math.log2(i + 2), 0);
}
function ndcg(retrieved, relevant) {
  const relScores = retrieved.map(doc => (isRelevant(doc, relevant) ? 1 : 0));
  const ideal = [...relScores].sort((a, b) => b - a);
  const idealDcg = dcg(ideal);
  // Return 0 instead of NaN when nothing relevant was retrieved
  return idealDcg === 0 ? 0 : dcg(relScores) / idealDcg;
}
// Generation metrics
// Crude check: does the answer contain the reference text, ignoring case and punctuation?
const normalize = s => s.toLowerCase().replace(/[^a-z0-9\s]/g, "").replace(/\s+/g, " ").trim();
function checkCorrectness(answer, reference) {
  return normalize(answer).includes(normalize(reference));
}
// Optional: Semantic similarity (cosine) using embeddings
async function semanticSimilarity(text1, text2) {
  const vec1 = await embeddings.embedQuery(text1);
  const vec2 = await embeddings.embedQuery(text2);
  const dot = vec1.reduce((sum, v, idx) => sum + v * vec2[idx], 0);
  const mag1 = Math.sqrt(vec1.reduce((sum, v) => sum + v * v, 0));
  const mag2 = Math.sqrt(vec2.reduce((sum, v) => sum + v * v, 0));
  return dot / (mag1 * mag2); // cosine similarity
}
// LLM-as-judge for Correctness, Faithfulness, Fluency, Relevance
async function evaluateWithLLM(query, answer, reference, context) {
  const evalPrompt = `
  You are an evaluator.
  Question: ${query}
  Retrieved Context: ${context}
  Generated Answer: ${answer}
  Reference Answer: ${reference}
  Score the answer from 0 to 1 on:
  - correctness (agrees with the reference answer)
  - faithfulness (supported by the retrieved context)
  - fluency
  - relevance (addresses the question)
  Respond with a JSON object only, using those four lowercase keys.
  `;
  const response = await llm.invoke(evalPrompt);
  // Models sometimes wrap JSON in a Markdown code fence, so strip it before parsing
  const raw = String(response.content).replace(/`{3}(?:json)?/g, "").trim();
  try {
    return JSON.parse(raw);
  } catch {
    return { error: "Judge did not return valid JSON", raw };
  }
}
// --- Evaluation Loop ---
const reciprocalRanks = [];
for (const item of dataset) {
  console.log(`\nQ: ${item.query}`);
  // Run RAG pipeline and time the full retrieve + generate round trip
  const start = Date.now();
  const res = await ragChain.invoke({ query: item.query });
  const latency = Date.now() - start;
  console.log(`Answer: ${res.text}`);
  console.log(`Latency (ms): ${latency}`);
  // --- Retrieval metrics (retrieval is re-run here to inspect the ranked docs) ---
  const retrievedDocs = await retriever.invoke(item.query);
  console.log(`Precision@${K}:`, precisionAtK(retrievedDocs, item.relevant_chunks, K));
  console.log(`Recall@${K}:`, recallAtK(retrievedDocs, item.relevant_chunks, K));
  console.log(`F1@${K}:`, f1AtK(retrievedDocs, item.relevant_chunks, K));
  const rr = reciprocalRank(retrievedDocs, item.relevant_chunks);
  reciprocalRanks.push(rr);
  console.log("Reciprocal rank:", rr);
  console.log("nDCG:", ndcg(retrievedDocs, item.relevant_chunks));
  // --- Generation metrics ---
  console.log("Correctness:", checkCorrectness(res.text, item.reference_answer) ? "✅" : "❌");
  const sim = await semanticSimilarity(res.text, item.reference_answer);
  console.log("Semantic Similarity:", sim.toFixed(3));
  const context = retrievedDocs.map(doc => doc.pageContent).join("\n");
  const llmEval = await evaluateWithLLM(item.query, res.text, item.reference_answer, context);
  console.log("LLM Evaluation:", llmEval);
}
// MRR is the mean of the per-query reciprocal ranks
const mrr = reciprocalRanks.reduce((sum, rr) => sum + rr, 0) / (reciprocalRanks.length || 1);
console.log(`\nMRR over ${reciprocalRanks.length} queries:`, mrr.toFixed(3));

✅ What This Script Does

  1. Retrieval Metrics: Precision@K, Recall@K, F1@K, MRR, nDCG

  2. Generation Metrics: Correctness, Semantic similarity, LLM evaluation (Faithfulness, Relevance, Fluency)

  3. System Metrics: Latency (you can extend for Throughput & Token usage)

This gives you a solid starting framework for evaluating PDF RAG pipelines.

🔹 Key Takeaways

✅ Retrieval → Are we fetching the right knowledge? (Precision, Recall, nDCG)
✅ Generation → Are answers faithful, correct, and fluent?
✅ System → Is it fast & cost-efficient for production?

RAG evaluation isn’t just about correctness: it’s about balancing accuracy, speed, and cost for real-world use.

🚀 Final Thoughts

When you deploy a RAG pipeline for your business docs, support center, or research papers, evaluation will help you:

  • Spot hallucinations before users do 🔍

  • Optimize retrievers & embeddings 📊

  • Keep costs under control 💸

  • Build trustworthy AI assistants 🤝

So don’t just build RAG. Evaluate it. Improve it. Scale it.