🤖 How to Evaluate a RAG Pipeline with Node.js (with Real Examples!)
RAG (Retrieval-Augmented Generation) is the backbone of AI-powered apps like document Q&A, chatbots, customer support copilots, and even legal research assistants.
But here's the truth: just building a RAG pipeline is not enough.
You need to evaluate it properly. Otherwise, you risk deploying a chatbot that:
❌ misses key information,
❌ hallucinates facts, or
❌ costs you a fortune in tokens.
In this blog, we'll go deep into how to evaluate a RAG pipeline in Node.js, with real PDF examples, code snippets, and all the metrics that matter.
🔹 What is a RAG Pipeline?
A RAG pipeline has two main components:
Retriever: Finds the most relevant chunks from your knowledge base (PDFs, docs, wikis).
Generator (LLM): Uses those chunks to craft a final answer.
Think of it like a student answering exam questions:
The retriever is the student searching their notes
The generator is how well they explain the answer ✍️
If retrieval fails → wrong notes.
If generation fails → the explanation is nonsense.
That's why evaluation is critical. ✅
🔹 Real-Life Example: Smartwatch PDF RAG
Imagine you uploaded a 50-page smartwatch user manual into your RAG system.
You want it to answer questions like:
"⏱ How do I reset the smartwatch?"
"🔋 What's the battery life?"
Here's how we build an evaluation dataset for testing:
[
  {
    "query": "How do I reset the smartwatch?",
    "reference_answer": "Hold the power button for 10 seconds.",
    "relevant_chunks": [
      "To reset the smartwatch, hold the power button for 10 seconds."
    ]
  },
  {
    "query": "What is the battery life of the smartwatch?",
    "reference_answer": "Battery lasts up to 48 hours with normal use.",
    "relevant_chunks": [
      "Battery life lasts up to 48 hours with normal use."
    ]
  }
]
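Before running any metrics, it can help to sanity-check the dataset file. The sketch below assumes the three field names shown above; `validateEvalDataset` is a hypothetical helper written for this post, not part of any library.

```javascript
// Hypothetical helper: returns a list of problems found in an eval dataset.
function validateEvalDataset(items) {
  if (!Array.isArray(items)) return ["dataset must be an array"];
  const problems = [];
  items.forEach((item, i) => {
    if (item === null || typeof item !== "object") {
      problems.push(`item ${i}: must be an object`);
      return;
    }
    if (typeof item.query !== "string" || item.query.trim() === "") {
      problems.push(`item ${i}: missing query`);
    }
    if (typeof item.reference_answer !== "string" || item.reference_answer.trim() === "") {
      problems.push(`item ${i}: missing reference_answer`);
    }
    if (!Array.isArray(item.relevant_chunks) || item.relevant_chunks.length === 0) {
      problems.push(`item ${i}: relevant_chunks must be a non-empty array`);
    }
  });
  return problems;
}

// Second item is deliberately incomplete to show what gets reported.
const sampleDataset = [
  {
    query: "How do I reset the smartwatch?",
    reference_answer: "Hold the power button for 10 seconds.",
    relevant_chunks: ["To reset the smartwatch, hold the power button for 10 seconds."],
  },
  { query: "What is the battery life of the smartwatch?", reference_answer: "", relevant_chunks: [] },
];
console.log(validateEvalDataset(sampleDataset));
```

An empty array means every entry has a query, a reference answer, and at least one labeled chunk.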
Now, we can check if our pipeline:
Retrieves the right passages ✅
Generates correct, fluent answers ✅
Runs within acceptable latency & cost ✅
🔹 The Metrics That Matter 🎯
We evaluate RAG across three categories:
1️⃣ Retrieval Metrics
a) Precision@K
Definition: % of the top-K retrieved chunks that are actually relevant.
- Example:
Query: "How do I reset the smartwatch?"
- Here are the retrieved top 3 chunks:
"To reset the smartwatch, hold the power button for 10 seconds." ✅
"Battery lasts 48 hours." ❌
"To connect to Bluetooth, go to settings." ❌
- Precision@3 = 1 relevant / 3 retrieved ≈ 33%
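The same calculation can be sketched in a few lines. This version assumes each chunk has a stable ID (the IDs below are made up for illustration) rather than matching on text.

```javascript
// Precision@K over chunk IDs: relevant hits in the top K, divided by K.
function precisionAtKById(retrievedIds, relevantIds, k) {
  const relevant = new Set(relevantIds);
  const hits = retrievedIds.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / k;
}

// Illustrative IDs for the three retrieved chunks in the example.
const retrievedForReset = ["reset", "battery", "bluetooth"];
console.log(precisionAtKById(retrievedForReset, ["reset"], 3).toFixed(2)); // 1 hit out of 3
```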
b) Recall@K
Definition: % of all relevant documents that appear in top-K retrieved documents.
- Example:
Relevant chunks in PDF: 2 chunks (reset instructions + troubleshooting reset tips)
Retrieved top 3: only the main reset instruction is retrieved
- Recall@3 = 1 retrieved relevant / 2 total relevant = 50%
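As a rough sketch of the recall example, again assuming made-up chunk IDs: the function counts how many of the labeled relevant chunks show up in the top K.

```javascript
// Recall@K over chunk IDs: labeled relevant chunks found in the top K,
// divided by the total number of labeled relevant chunks.
function recallAtKById(retrievedIds, relevantIds, k) {
  if (relevantIds.length === 0) return 0;
  const topK = new Set(retrievedIds.slice(0, k));
  const found = relevantIds.filter((id) => topK.has(id)).length;
  return found / relevantIds.length;
}

const retrievedTop3 = ["reset-instructions", "battery", "bluetooth"];
const relevantForReset = ["reset-instructions", "reset-troubleshooting"];
console.log(recallAtKById(retrievedTop3, relevantForReset, 3)); // 1 of 2 relevant chunks found
```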
c) MRR (Mean Reciprocal Rank)
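Definition: For each query, take the reciprocal of the rank of the first relevant document; MRR is the mean of that value across all queries.
Example:
First relevant doc is in position 2 (out of 3)
Reciprocal rank = 1 / 2 = 0.5 (with a single query, MRR = 0.5)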
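To make the "mean" part concrete, here is a small sketch that averages reciprocal ranks over several queries. The input format (one array of true/false relevance flags per query) is an assumption made for this example.

```javascript
// Reciprocal rank for one query: 1 / position of the first relevant result.
function reciprocalRank(relevanceFlags) {
  const index = relevanceFlags.indexOf(true);
  return index === -1 ? 0 : 1 / (index + 1);
}

// MRR: the mean of the reciprocal ranks over all queries.
function meanReciprocalRankOverQueries(perQueryFlags) {
  if (perQueryFlags.length === 0) return 0;
  const total = perQueryFlags.reduce((sum, flags) => sum + reciprocalRank(flags), 0);
  return total / perQueryFlags.length;
}

// Query 1: first relevant doc at rank 2. Query 2: first relevant doc at rank 1.
const rankedRuns = [
  [false, true, false],
  [true, false, false],
];
console.log(meanReciprocalRankOverQueries(rankedRuns)); // (0.5 + 1) / 2
```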
d) nDCG (Normalized Discounted Cumulative Gain)
Definition: Rewards higher-ranked relevant documents more than lower-ranked ones.
Example:
Relevance scores of retrieved docs (top 3):
[1, 1, 0]
Doc1: relevant → 1
Doc2: relevant → 1
Doc3: irrelevant → 0
- DCG = 1/log2(1+1) + 1/log2(2+1) + 0/log2(3+1) = 1 + 0.631 + 0 ≈ 1.631
- Ideal DCG (relevant docs ranked first) ≈ 1.631, so nDCG = 1.631 / 1.631 = 1 (the ranking is already optimal)
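A quick sketch shows how the score drops when the same documents come back in a worse order. It assumes binary gains (1 = relevant, 0 = not) and that every relevant chunk is somewhere in the retrieved list, so the ideal order is just the same gains sorted descending.

```javascript
// DCG with the gain / log2(rank + 1) discount used in the example above.
function dcgFromGains(gains) {
  return gains.reduce((sum, gain, i) => sum + gain / Math.log2(i + 2), 0);
}

// nDCG: DCG of the actual order divided by DCG of the best possible order.
function ndcgFromGains(gains) {
  const ideal = [...gains].sort((a, b) => b - a);
  const idealDcg = dcgFromGains(ideal);
  return idealDcg === 0 ? 0 : dcgFromGains(gains) / idealDcg;
}

console.log(ndcgFromGains([1, 1, 0])); // relevant docs first: 1
console.log(ndcgFromGains([0, 1, 1]).toFixed(3)); // relevant docs pushed down: 0.693
```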
2️⃣ Generation Metrics ✍️
a) Faithfulness
Definition: Answer is grounded in retrieved chunks (no hallucination).
Example:
Retrieved chunk: "Hold the power button for 10 seconds to reset."
Generated answer: "Press both buttons for 30 seconds" ❌
Faithfulness score: 0 / Not faithful
b) Correctness
Definition: Matches the reference answer from the PDF.
Example:
Reference answer: "Hold the power button for 10 seconds."
Generated answer: "Hold the power button for 10 seconds." ✅
Correctness = 1 / Correct
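Exact matches are rare in practice, so a softer score is often useful. The sketch below uses token-level F1, a common choice in extractive QA evaluation; it is swapped in here as an illustration and is not the check used later in this post. The normalization is an assumption that only suits English text.

```javascript
// Lowercase, strip punctuation, split on whitespace (English-only assumption).
function tokenize(text) {
  return text.toLowerCase().replace(/[^a-z0-9\s]/g, " ").split(/\s+/).filter(Boolean);
}

// Token-level F1 between a generated answer and the reference answer.
function tokenF1(answer, reference) {
  const answerTokens = tokenize(answer);
  const referenceTokens = tokenize(reference);
  if (answerTokens.length === 0 || referenceTokens.length === 0) return 0;

  // Count shared tokens, respecting how many times each one appears.
  const remaining = new Map();
  for (const token of referenceTokens) remaining.set(token, (remaining.get(token) || 0) + 1);
  let overlap = 0;
  for (const token of answerTokens) {
    const count = remaining.get(token) || 0;
    if (count > 0) {
      overlap += 1;
      remaining.set(token, count - 1);
    }
  }
  if (overlap === 0) return 0;

  const precision = overlap / answerTokens.length;
  const recall = overlap / referenceTokens.length;
  return (2 * precision * recall) / (precision + recall);
}

const referenceAnswer = "Hold the power button for 10 seconds.";
console.log(tokenF1("Hold the power button for 10 seconds.", referenceAnswer)); // identical tokens: 1
console.log(tokenF1("Press both buttons for 30 seconds", referenceAnswer).toFixed(3)); // only "for" and "seconds" overlap
```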
c) Relevance
Definition: Does the answer address the user's query directly?
Example:
Query: "How do I reset the smartwatch?"
Generated answer: "The battery lasts 48 hours." ❌
Relevance = 0 / Not relevant
d) Fluency
Definition: Is the answer readable, coherent, and grammatically correct?
Example:
Generated answer: "Hold button for 10 second reset device" ❌ (grammatically incorrect)
Fluency score: 0.3 / 1
Correct version: "Hold the power button for 10 seconds to reset the device." ✅
e) Conciseness
Definition: Is the answer appropriately short & clear?
Example:
Overly verbose: "To reset your smartwatch, first you need to find the power button, then press it and hold it for a sufficient amount of time, which is generally 10 seconds, and then release it." ❌
Concise: "Hold the power button for 10 seconds to reset." ✅
f) Semantic Similarity
Definition: Cosine similarity between generated answer and reference answer embeddings.
Example:
Generated: "Press the power button for ten seconds to restart the smartwatch."
Reference: "Hold the power button for 10 seconds."
Cosine similarity = 0.95 → high semantic overlap ✅
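The cosine calculation itself is independent of any embedding provider. Here is a minimal sketch on toy vectors; real embeddings have hundreds or thousands of dimensions, and the numbers below are invented purely to show the arithmetic.

```javascript
// Cosine similarity: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 2, 3], [2, 4, 6]).toFixed(3)); // same direction, so close to 1
console.log(cosineSimilarity([1, 0], [0, 1])); // orthogonal vectors: 0
```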
3️⃣ System Metrics ⚡
a) Latency
Definition: Time taken per query.
Example:
Retrieval: 120 ms
Generation: 800 ms
Total latency = 120 ms + 800 ms = 920 ms, or roughly 1 second per query
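A single timing says little about tail behavior, so it is usually worth summarizing many runs. The sketch below computes the mean plus nearest-rank p50 and p95; the sample latencies are invented for illustration.

```javascript
// Nearest-rank percentile on an ascending array.
function percentile(sortedValues, p) {
  const rank = Math.ceil((p / 100) * sortedValues.length);
  return sortedValues[Math.max(0, rank - 1)];
}

// Summarize per-query latencies (in ms) collected during an eval run.
function summarizeLatencies(latenciesMs) {
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const mean = sorted.reduce((sum, value) => sum + value, 0) / sorted.length;
  return { mean, p50: percentile(sorted, 50), p95: percentile(sorted, 95) };
}

// Made-up samples: four typical queries and one slow outlier.
const latencySamples = [920, 880, 1010, 950, 2400];
console.log(summarizeLatencies(latencySamples)); // { mean: 1232, p50: 950, p95: 2400 }
```

The outlier barely moves the median but dominates p95, which is why both are worth tracking.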
b) Throughput
Definition: Number of queries the system can handle per second.
Example:
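If each query takes 1 s and queries run one at a time → throughput = 1 query/sec
With batching or concurrent (async) requests → throughput can rise, for example to 5-10 queries/sec, depending on API rate limits and hardware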
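As a back-of-the-envelope sketch, throughput can be estimated from latency and the number of requests in flight. This assumes latency stays constant as concurrency grows, which rarely holds exactly once rate limits or CPU contention kick in, so treat the result as an upper bound.

```javascript
// Estimated queries per second = requests in flight / seconds per request.
function estimateThroughput(latencyMs, concurrency = 1) {
  return (concurrency * 1000) / latencyMs;
}

console.log(estimateThroughput(1000)); // one request at a time: 1 query/sec
console.log(estimateThroughput(1000, 5)); // five concurrent requests: 5 queries/sec
```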
c) Cost per query
Definition: Token usage × API price per token
Example:
Retrieval embeddings: 50 tokens
Generation: 300 tokens
Cost: $0.0004 per 1000 tokens → Total = 350 × $0.0004 / 1000 = $0.00014/query
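The arithmetic is easy to script. With 350 tokens at $0.0004 per 1,000 tokens, the total works out to $0.00014 per query. The sketch below uses one flat price, as the example does; real providers often price input, output, and embedding tokens differently, so a production version would need one rate per token type.

```javascript
// Cost of one query given token counts and a flat price per 1,000 tokens.
function costPerQuery(tokenCounts, pricePer1kTokens) {
  const totalTokens = tokenCounts.reduce((sum, n) => sum + n, 0);
  return (totalTokens / 1000) * pricePer1kTokens;
}

// 50 embedding tokens + 300 generation tokens at $0.0004 per 1,000 tokens.
const queryCost = costPerQuery([50, 300], 0.0004);
console.log(`$${queryCost.toFixed(5)} per query`);
console.log(`$${(queryCost * 100000).toFixed(2)} per 100,000 queries`);
```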
d) Context usage
Definition: Tokens used for retrieved docs + query + answer
Example:
Query tokens: 10
Retrieved chunk tokens: 500
Generated answer tokens: 50
Total context length: 560 tokens
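Tracking this total is mostly useful for checking how close a query gets to the model's context window. In the sketch below, the 4,096-token window is an assumed illustrative value, not the limit of any particular model.

```javascript
// Sum the token counts for one query and compare against a context window.
function contextUsage({ queryTokens, chunkTokens, answerTokens }, contextWindow) {
  const total = queryTokens + chunkTokens + answerTokens;
  return { total, fractionUsed: total / contextWindow, fits: total <= contextWindow };
}

// Numbers from the example above; the window size is an assumption.
const usage = contextUsage({ queryTokens: 10, chunkTokens: 500, answerTokens: 50 }, 4096);
console.log(usage.total, usage.fits, (usage.fractionUsed * 100).toFixed(1) + "%");
```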
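🔹 Node.js Code: Evaluating a RAG Pipeline
Let's roll up our sleeves 🧑‍💻
1. Setup
npm install langchain @langchain/openai hnswlib-node
The script uses ES module imports and top-level await, so set "type": "module" in package.json (or name the file eval_rag.mjs). It also expects your OpenAI API key in the OPENAI_API_KEY environment variable.
2. Directory Structure
project/
├─ vector_store/ # pre-built embeddings from PDFs
├─ eval_dataset.json # dataset of queries, reference answers, relevant chunks
└─ eval_rag.js # evaluation script
3. Example eval_dataset.json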
[
  {
    "query": "How do I reset the smartwatch?",
    "reference_answer": "Hold the power button for 10 seconds.",
    "relevant_chunks": [
      "To reset the smartwatch, hold the power button for 10 seconds."
    ]
  },
  {
    "query": "What is the battery life of the smartwatch?",
    "reference_answer": "Battery lasts up to 48 hours with normal use.",
    "relevant_chunks": [
      "Battery life lasts up to 48 hours with normal use."
    ]
  }
]
4. eval_rag.js - Full Evaluation Script
// Uses ES module imports and top-level await: set "type": "module" in package.json or rename to eval_rag.mjs.
import fs from "fs";
import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
// Newer LangChain releases export HNSWLib from "@langchain/community/vectorstores/hnswlib" instead.
import { HNSWLib } from "langchain/vectorstores/hnswlib";
import { RetrievalQAChain } from "langchain/chains";

// --- Load vector store ---
const vectorStore = await HNSWLib.load("./vector_store", new OpenAIEmbeddings());
const retriever = vectorStore.asRetriever();

// --- Setup LLM ---
// gpt-4o-mini is a chat model, so it needs the ChatOpenAI wrapper rather than the completions-style OpenAI class.
const llm = new ChatOpenAI({ modelName: "gpt-4o-mini", temperature: 0 });
const ragChain = RetrievalQAChain.fromLLM(llm, retriever);

// --- Load dataset ---
const dataset = JSON.parse(fs.readFileSync("./eval_dataset.json", "utf-8"));

// --- Metrics Functions ---
// Retrieval metrics
// A retrieved doc counts as relevant if it contains one of the labeled chunks verbatim.
function isRelevant(doc, relevant) {
  return relevant.some(r => doc.pageContent.includes(r));
}

function precisionAtK(retrieved, relevant, k) {
  const topK = retrieved.slice(0, k);
  const hits = topK.filter(doc => isRelevant(doc, relevant)).length;
  return hits / k;
}

function recallAtK(retrieved, relevant, k) {
  const topK = retrieved.slice(0, k);
  // Count labeled chunks that were found, so duplicate hits cannot push recall above 1.
  const found = relevant.filter(r => topK.some(doc => doc.pageContent.includes(r))).length;
  return relevant.length === 0 ? 0 : found / relevant.length;
}

function f1AtK(retrieved, relevant, k) {
  const p = precisionAtK(retrieved, relevant, k);
  const r = recallAtK(retrieved, relevant, k);
  return p + r === 0 ? 0 : (2 * p * r) / (p + r);
}

// Reciprocal rank for a single query; average it across queries to get MRR.
function meanReciprocalRank(retrieved, relevant) {
  for (let i = 0; i < retrieved.length; i++) {
    if (isRelevant(retrieved[i], relevant)) {
      return 1 / (i + 1);
    }
  }
  return 0;
}

function dcg(scores) {
  return scores.reduce((sum, rel, i) => sum + (Math.pow(2, rel) - 1) / Math.log2(i + 2), 0);
}

function ndcg(retrieved, relevant) {
  const relScores = retrieved.map(doc => (isRelevant(doc, relevant) ? 1 : 0));
  // Ideal ranking: every labeled chunk (up to the list length) sits at the top.
  const idealCount = Math.min(relevant.length, retrieved.length);
  const ideal = relScores.map((_, i) => (i < idealCount ? 1 : 0));
  const idealDcg = dcg(ideal);
  // Guard against 0/0 when nothing is relevant; cap at 1 if several docs contain the same chunk.
  return idealDcg === 0 ? 0 : Math.min(1, dcg(relScores) / idealDcg);
}

// Generation metrics
// Strict check: passes only if the reference appears verbatim (case-insensitive) in the answer.
function checkCorrectness(answer, reference) {
  return answer.toLowerCase().includes(reference.toLowerCase());
}

// Optional: Semantic similarity (cosine) using embeddings
async function semanticSimilarity(text1, text2) {
  const embeddings = new OpenAIEmbeddings();
  const vec1 = await embeddings.embedQuery(text1);
  const vec2 = await embeddings.embedQuery(text2);
  const dot = vec1.reduce((sum, v, idx) => sum + v * vec2[idx], 0);
  const mag1 = Math.sqrt(vec1.reduce((sum, v) => sum + v * v, 0));
  const mag2 = Math.sqrt(vec2.reduce((sum, v) => sum + v * v, 0));
  return dot / (mag1 * mag2); // cosine similarity
}

// LLM-as-judge for Correctness, Faithfulness, Fluency, Relevance
// Faithfulness can only be judged against the retrieved text, so pass it in as `context`.
async function evaluateWithLLM(query, answer, reference, context = "") {
  const evalPrompt = `
You are an evaluator.
Question: ${query}
Retrieved Context: ${context || "(not provided)"}
Generated Answer: ${answer}
Reference Answer: ${reference}
Score the answer from 0 to 1 on:
- Correctness (agrees with the reference answer)
- Faithfulness (supported by the retrieved context)
- Fluency
- Relevance (addresses the question)
Respond with only a JSON object with the keys "correctness", "faithfulness", "fluency", "relevance".
`;
  const response = await llm.invoke(evalPrompt);
  const raw = typeof response === "string" ? response : String(response.content);
  // Models sometimes wrap JSON in a Markdown code fence; keep only the outermost {...}.
  const match = raw.match(/\{[\s\S]*\}/);
  return match ? JSON.parse(match[0]) : { error: "Judge did not return JSON", raw };
}

// --- Evaluation Loop ---
for (const item of dataset) {
  console.log(`\nQ: ${item.query}`);
  const start = Date.now();
  // Run RAG pipeline
  const res = await ragChain.call({ query: item.query });
  const latency = Date.now() - start;
  console.log(`Answer: ${res.text}`);
  console.log(`Latency (ms): ${latency}`);

  // --- Retrieval metrics ---
  // The chain retrieves internally; query the retriever again to get the docs for scoring.
  const retrievedDocs = await retriever.getRelevantDocuments(item.query);
  console.log("Precision@3:", precisionAtK(retrievedDocs, item.relevant_chunks, 3));
  console.log("Recall@3:", recallAtK(retrievedDocs, item.relevant_chunks, 3));
  console.log("F1@3:", f1AtK(retrievedDocs, item.relevant_chunks, 3));
  console.log("MRR:", meanReciprocalRank(retrievedDocs, item.relevant_chunks));
  console.log("nDCG:", ndcg(retrievedDocs, item.relevant_chunks));

  // --- Generation metrics ---
  console.log("Correctness:", checkCorrectness(res.text, item.reference_answer) ? "✅" : "❌");
  const sim = await semanticSimilarity(res.text, item.reference_answer);
  console.log("Semantic Similarity:", sim.toFixed(3));
  const context = retrievedDocs.map(doc => doc.pageContent).join("\n");
  const llmEval = await evaluateWithLLM(item.query, res.text, item.reference_answer, context);
  console.log("LLM Evaluation:", llmEval);
}
What This Script Does
Retrieval Metrics: Precision@K, Recall@K, F1@K (the harmonic mean of precision and recall), MRR, nDCG
Generation Metrics: Correctness, Semantic similarity, LLM evaluation (Faithfulness, Relevance, Fluency)
System Metrics: Latency (you can extend for Throughput & Token usage)
This gives you a working starting point for evaluating PDF RAG pipelines.
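Per-query logs are hard to compare across runs, so one natural extension is to collect each query's numbers and average them. The sketch below assumes you push one plain object of numeric metrics per query into an array; `averageMetrics` and the metric names are hypothetical, and the sample values are invented.

```javascript
// Average every numeric metric across per-query result rows.
function averageMetrics(rows) {
  const totals = {};
  const counts = {};
  for (const row of rows) {
    for (const [name, value] of Object.entries(row)) {
      if (typeof value !== "number" || Number.isNaN(value)) continue; // skip non-numeric fields
      totals[name] = (totals[name] || 0) + value;
      counts[name] = (counts[name] || 0) + 1;
    }
  }
  const means = {};
  for (const name of Object.keys(totals)) {
    means[name] = totals[name] / counts[name];
  }
  return means;
}

// Invented results for two queries.
const perQueryResults = [
  { query: "reset", precisionAt3: 1 / 3, recallAt3: 0.5, latencyMs: 920 },
  { query: "battery", precisionAt3: 2 / 3, recallAt3: 1, latencyMs: 1080 },
];
console.log(averageMetrics(perQueryResults));
```

Saving these averages alongside the retriever and prompt settings makes it easier to tell whether a change actually helped.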
🔹 Key Takeaways
✅ Retrieval → Are we fetching the right knowledge? (Precision, Recall, nDCG)
✅ Generation → Are answers faithful, correct, and fluent?
✅ System → Is it fast & cost-efficient for production?
RAG evaluation isn't just about correctness; it's about balancing accuracy, speed, and cost for real-world use.
Final Thoughts
When you deploy a RAG pipeline for your business docs, support center, or research papers, evaluation will help you:
Spot hallucinations before users do
Optimize retrievers & embeddings
Keep costs under control 💸
Build trustworthy AI assistants 🤖
So don't just build RAG. Evaluate it. Improve it. Scale it.