> 🌐 **Translation**: This article was translated from [Korean](https://beomanro.com/?p=353).
We’ve finally reached the last day of the RAG series. Having covered embeddings, vector databases, and search optimization so far, today we’ll explore RAG evaluation and monitoring strategies essential for production operations. Cost optimization and building a reliable monitoring system are the keys to successful production deployment.
TL;DR
- RAG evaluation measures three core metrics: accuracy, relevance, and faithfulness
- Cost optimization relies on embedding caching and smart model selection as key strategies
- Production deployment requires robust error handling and monitoring as essential components
- Monitoring systems automate RAG evaluation and ensure ongoing quality management
- Logging and metrics secure operational stability after production deployment
💡 I plan to use this system as a chatbot to help new hires land more softly as they take on new responsibilities.
The goal is to assist new employee onboarding based on internal company documents. For this purpose, production deployment, cost optimization, and continuous monitoring were extremely important.
1. RAG Evaluation Metrics
1.1 Core Evaluation Metrics
Here are the three core metrics for measuring RAG system quality:
interface RAGEvaluationMetrics {
// 1. Accuracy: Does the answer match the question?
accuracy: number;
// 2. Relevance: Are the retrieved documents related to the question?
relevance: number;
// 3. Faithfulness: Is the answer faithful to the document content?
faithfulness: number;
}
// RAG evaluation score calculation
function calculateRAGScore(metrics: RAGEvaluationMetrics): number {
const weights = {
accuracy: 0.4,
relevance: 0.3,
faithfulness: 0.3,
};
return (
metrics.accuracy * weights.accuracy +
metrics.relevance * weights.relevance +
metrics.faithfulness * weights.faithfulness
);
}
1.2 RAG Evaluation Implementation
interface EvaluationCase {
question: string;
expectedAnswer: string;
relevantDocIds: string[];
}
class RAGEvaluator {
private rag: RAGGenerator;
private testCases: EvaluationCase[];
constructor(rag: RAGGenerator, testCases: EvaluationCase[]) {
this.rag = rag;
this.testCases = testCases;
}
async evaluate(): Promise<RAGEvaluationMetrics> {
const results = await Promise.all(
this.testCases.map(tc => this.evaluateCase(tc))
);
// Calculate average scores
return {
accuracy: this.average(results.map(r => r.accuracy)),
relevance: this.average(results.map(r => r.relevance)),
faithfulness: this.average(results.map(r => r.faithfulness)),
};
}
private async evaluateCase(testCase: EvaluationCase) {
// In practice, run retrieval first and pass the retrieved documents here;
// an empty array means no citations, so wire in your retriever before
// evaluating relevance
const result = await this.rag.generate(testCase.question, []);
return {
accuracy: this.scoreAccuracy(result.content, testCase.expectedAnswer),
relevance: this.scoreRelevance(result.citations, testCase.relevantDocIds),
faithfulness: this.scoreFaithfulness(result.content, result.citations),
};
}
private scoreAccuracy(answer: string, expected: string): number {
// Word-overlap (Jaccard) similarity; in practice, use embedding comparison
const commonWords = this.getCommonWords(answer, expected);
const totalWords = new Set([
...answer.toLowerCase().split(/\s+/),
...expected.toLowerCase().split(/\s+/),
]).size;
return totalWords > 0 ? commonWords / totalWords : 0;
}
private scoreRelevance(citations: Citation[], expectedDocIds: string[]): number {
if (expectedDocIds.length === 0) return 1;
const citedIds = citations.map(c => c.source);
const matches = citedIds.filter(id => expectedDocIds.includes(id));
return matches.length / expectedDocIds.length;
}
private scoreFaithfulness(answer: string, citations: Citation[]): number {
// Check if the answer has citations
if (citations.length === 0) return 0;
const hasCitations = answer.includes('[Document');
return hasCitations ? 1 : 0.5;
}
private getCommonWords(a: string, b: string): number {
const wordsA = new Set(a.toLowerCase().split(/\s+/));
const wordsB = new Set(b.toLowerCase().split(/\s+/));
return [...wordsA].filter(w => wordsB.has(w)).length;
}
private average(nums: number[]): number {
return nums.reduce((a, b) => a + b, 0) / nums.length;
}
}
1.3 Building Test Datasets
You must build test datasets before production deployment:
// Test case examples
const testCases: EvaluationCase[] = [
{
question: 'How do I request time off?',
expectedAnswer: 'Submit your request through the leave request menu in the HR system.',
relevantDocIds: ['hr-policy.md', 'leave-guide.md'],
},
{
question: 'What is the new hire training schedule?',
expectedAnswer: 'Orientation is held during your first week.',
relevantDocIds: ['onboarding-guide.md'],
},
{
question: 'What is the code review process?',
expectedAnswer: 'After creating a PR, you need at least one approval.',
relevantDocIds: ['dev-guide.md', 'code-review.md'],
},
];
// Run RAG evaluation
async function runEvaluation() {
const rag = new RAGGenerator({ anthropicApiKey: process.env.ANTHROPIC_API_KEY! });
const evaluator = new RAGEvaluator(rag, testCases);
const metrics = await evaluator.evaluate();
console.log('=== RAG Evaluation Results ===');
console.log(`Accuracy: ${(metrics.accuracy * 100).toFixed(1)}%`);
console.log(`Relevance: ${(metrics.relevance * 100).toFixed(1)}%`);
console.log(`Faithfulness: ${(metrics.faithfulness * 100).toFixed(1)}%`);
console.log(`Overall Score: ${(calculateRAGScore(metrics) * 100).toFixed(1)}%`);
}
2. Cost Optimization
Cost optimization in production deployment is crucial for sustainable service operations. You need to track costs through monitoring and identify optimization opportunities.
2.1 Understanding Cost Structure
To optimize costs, you must first understand the system’s cost structure:
interface CostBreakdown {
embedding: number; // Embedding API costs
vectorStorage: number; // Vector DB storage costs
llmGeneration: number; // LLM answer generation costs
total: number;
}
function estimateMonthlyCost(
documentsCount: number,
queriesPerDay: number
): CostBreakdown {
// Voyage AI embeddings: $0.10 / 1M tokens
const avgTokensPerDoc = 500;
const embeddingCost = (documentsCount * avgTokensPerDoc / 1_000_000) * 0.10;
// Supabase: Free tier or $25/month
const vectorStorageCost = documentsCount > 500_000 ? 25 : 0;
// Claude Sonnet: $3 / 1M input, $15 / 1M output
const avgInputTokens = 2000; // Including context
const avgOutputTokens = 500;
const monthlyQueries = queriesPerDay * 30;
const llmCost =
(monthlyQueries * avgInputTokens / 1_000_000) * 3 +
(monthlyQueries * avgOutputTokens / 1_000_000) * 15;
return {
embedding: embeddingCost,
vectorStorage: vectorStorageCost,
llmGeneration: llmCost,
total: embeddingCost + vectorStorageCost + llmCost,
};
}
2.2 Embedding Caching
The key to cost optimization is embedding caching:
import { createHash } from 'crypto';
interface CacheEntry {
embedding: number[];
timestamp: number;
ttl: number;
}
class EmbeddingCache {
private cache: Map<string, CacheEntry> = new Map();
private maxSize: number;
private defaultTTL: number;
constructor(maxSize = 10000, defaultTTL = 86400000) { // 24 hours in ms
this.maxSize = maxSize;
this.defaultTTL = defaultTTL;
}
private getKey(text: string): string {
// Hash the full text; a truncated base64 prefix would collide for long
// inputs that share the same beginning
return createHash('sha256').update(text).digest('hex');
}
get(text: string): number[] | null {
const key = this.getKey(text);
const entry = this.cache.get(key);
if (!entry) return null;
// Check TTL
if (Date.now() - entry.timestamp > entry.ttl) {
this.cache.delete(key);
return null;
}
return entry.embedding;
}
set(text: string, embedding: number[], ttl = this.defaultTTL): void {
// Enforce cache size limit
if (this.cache.size >= this.maxSize) {
this.evictOldest();
}
const key = this.getKey(text);
this.cache.set(key, {
embedding,
timestamp: Date.now(),
ttl,
});
}
private evictOldest(): void {
let oldestKey: string | null = null;
let oldestTime = Infinity;
for (const [key, entry] of this.cache) {
if (entry.timestamp < oldestTime) {
oldestTime = entry.timestamp;
oldestKey = key;
}
}
if (oldestKey) {
this.cache.delete(oldestKey);
}
}
getStats() {
return {
size: this.cache.size,
maxSize: this.maxSize,
hitRate: 0, // Requires hit/miss tracking in actual implementation
};
}
}
// Embedding function with caching applied
class CachedEmbedder {
private cache: EmbeddingCache;
private embedder: (text: string) => Promise<number[]>;
private hits = 0;
private misses = 0;
constructor(
embedder: (text: string) => Promise<number[]>,
cacheSize = 10000
) {
this.cache = new EmbeddingCache(cacheSize);
this.embedder = embedder;
}
async embed(text: string): Promise<number[]> {
// Check cache
const cached = this.cache.get(text);
if (cached) {
this.hits++;
return cached;
}
// Cache miss - API call
this.misses++;
const embedding = await this.embedder(text);
this.cache.set(text, embedding);
return embedding;
}
getHitRate(): number {
const total = this.hits + this.misses;
return total > 0 ? this.hits / total : 0;
}
}
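A minimal demonstration of the caching effect, using a stubbed embedder in place of the real API client (the counter stands in for billable requests):

```typescript
let apiCalls = 0;
const cache = new Map<string, number[]>();

// Stub embedder: in production this would be the Voyage/OpenAI API call
const fakeEmbedder = async (text: string): Promise<number[]> => {
  apiCalls++;
  return [text.length]; // dummy vector
};

async function embedCached(text: string): Promise<number[]> {
  const hit = cache.get(text);
  if (hit) return hit; // cache hit: no API charge
  const vector = await fakeEmbedder(text);
  cache.set(text, vector);
  return vector;
}

async function demo(): Promise<number> {
  await embedCached('How do I request time off?');
  await embedCached('How do I request time off?'); // duplicate → served from cache
  await embedCached('What is the code review process?');
  return apiCalls;
}

const demoRun = demo();
demoRun.then(calls => console.log(`API calls: ${calls} for 3 embeds`)); // 2
```

Repeated questions are common in an onboarding chatbot ("How do I request time off?" will arrive many times), so hit rates well above 50% are realistic.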
2.3 Model Selection Strategy
Here’s a model selection strategy for cost optimization:
interface ModelConfig {
name: string;
inputCostPer1M: number;
outputCostPer1M: number;
maxTokens: number;
speed: 'fast' | 'medium' | 'slow';
quality: 'high' | 'medium' | 'low';
}
const MODELS: Record<string, ModelConfig> = {
'claude-3-haiku': {
name: 'claude-3-haiku-20240307',
inputCostPer1M: 0.25,
outputCostPer1M: 1.25,
maxTokens: 200000,
speed: 'fast',
quality: 'medium',
},
'claude-sonnet-4': {
name: 'claude-sonnet-4-20250514',
inputCostPer1M: 3,
outputCostPer1M: 15,
maxTokens: 200000,
speed: 'medium',
quality: 'high',
},
'claude-opus-4': {
name: 'claude-opus-4-20250514',
inputCostPer1M: 15,
outputCostPer1M: 75,
maxTokens: 200000,
speed: 'slow',
quality: 'high',
},
};
// Select model based on use case
function selectModel(useCase: 'simple' | 'complex' | 'critical'): ModelConfig {
switch (useCase) {
case 'simple':
// Simple questions: Haiku (cost optimized)
return MODELS['claude-3-haiku'];
case 'complex':
// Complex analysis: Sonnet (balanced)
return MODELS['claude-sonnet-4'];
case 'critical':
// Critical decisions: Opus (highest quality)
return MODELS['claude-opus-4'];
}
}
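With the per-token prices above and the traffic assumptions from section 2.1 (2,000 input / 500 output tokens per query, both hypothetical averages), the per-query cost gap between models is easy to quantify:

```typescript
interface ModelPricing { inputCostPer1M: number; outputCostPer1M: number; }

function costPerQuery(m: ModelPricing, inputTokens = 2000, outputTokens = 500): number {
  return (inputTokens / 1_000_000) * m.inputCostPer1M +
         (outputTokens / 1_000_000) * m.outputCostPer1M;
}

// Prices from the MODELS table above
const haiku  = { inputCostPer1M: 0.25, outputCostPer1M: 1.25 };
const sonnet = { inputCostPer1M: 3,    outputCostPer1M: 15 };
const opus   = { inputCostPer1M: 15,   outputCostPer1M: 75 };

console.log(`Haiku:  $${costPerQuery(haiku).toFixed(6)}/query`);  // $0.001125
console.log(`Sonnet: $${costPerQuery(sonnet).toFixed(6)}/query`); // $0.013500
console.log(`Opus:   $${costPerQuery(opus).toFixed(6)}/query`);   // $0.067500
```

Sonnet costs 12x Haiku per query and Opus 5x Sonnet, so routing even a fraction of simple questions to Haiku moves the monthly bill noticeably.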
3. Production Deployment
Production deployment is the critical step of transitioning from development to a live service. Stable operation requires an API server, robust error handling, and a monitoring system.
3.1 Building the API Server
Here’s an Express-based API server for production deployment:
import express from 'express';
import rateLimit from 'express-rate-limit';
import helmet from 'helmet';
const app = express();
// Security middleware
app.use(helmet());
app.use(express.json({ limit: '10kb' }));
// Rate limiting (for cost optimization and security)
const limiter = rateLimit({
windowMs: 60 * 1000, // 1 minute
max: 20, // 20 requests per minute
message: { error: 'Too many requests, please try again later.' },
});
app.use('/api/', limiter);
// RAG instance
const rag = new RAGGenerator({
anthropicApiKey: process.env.ANTHROPIC_API_KEY!,
model: 'claude-sonnet-4-20250514',
});
// Metrics collection
const metrics = {
totalRequests: 0,
successfulRequests: 0,
failedRequests: 0,
avgResponseTime: 0,
responseTimes: [] as number[],
};
// RAG query endpoint
app.post('/api/query', async (req, res) => {
const startTime = Date.now();
metrics.totalRequests++;
try {
const { question } = req.body;
if (!question || typeof question !== 'string') {
return res.status(400).json({ error: 'Invalid question' });
}
// Search (in practice, from vector DB)
const documents = await searchDocuments(question);
// Generate answer
const answer = await rag.generate(question, documents);
// Update metrics (cap the history so memory doesn't grow unbounded)
const responseTime = Date.now() - startTime;
metrics.successfulRequests++;
metrics.responseTimes.push(responseTime);
if (metrics.responseTimes.length > 1000) metrics.responseTimes.shift();
metrics.avgResponseTime = average(metrics.responseTimes);
res.json({
answer: answer.content,
citations: answer.citations,
metadata: {
responseTime,
documentsUsed: answer.metadata.documentsUsed,
},
});
} catch (error) {
metrics.failedRequests++;
console.error('RAG query error:', error);
res.status(500).json({
error: 'Internal server error',
message: error instanceof Error ? error.message : 'Unknown error',
});
}
});
// Health check
app.get('/api/health', (req, res) => {
res.json({
status: 'healthy',
uptime: process.uptime(),
metrics: {
totalRequests: metrics.totalRequests,
successRate: metrics.totalRequests > 0
? (metrics.successfulRequests / metrics.totalRequests * 100).toFixed(1) + '%'
: 'N/A',
avgResponseTime: metrics.avgResponseTime.toFixed(0) + 'ms',
},
});
});
// Metrics endpoint
app.get('/api/metrics', (req, res) => {
res.json(metrics);
});
const PORT = process.env.PORT || 3000;
app.listen(PORT, () => {
console.log(`RAG API server running on port ${PORT}`);
});
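One addition worth making before going live (not shown above) is graceful shutdown: on SIGTERM, stop accepting new connections and let in-flight requests finish instead of dropping them. A sketch using Node's built-in `http` module so it runs dependency-free; the same pattern applies to the Express server above:

```typescript
import { createServer } from 'http';

const server = createServer((_req, res) => {
  res.writeHead(200, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify({ status: 'healthy' }));
});

// Port 0 lets the OS pick a free port; use process.env.PORT in production
server.listen(0, () => console.log('server started'));

// Stop accepting new connections, let in-flight requests finish, then exit
process.on('SIGTERM', () => {
  console.log('SIGTERM received, draining connections...');
  server.close(() => {
    console.log('all connections closed, exiting');
    process.exit(0);
  });
});
```

Container orchestrators such as Kubernetes send SIGTERM before SIGKILL, so handling it cleanly avoids failed requests during every deploy.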
3.2 Error Handling
Robust error handling is essential in production deployment:
// Custom error class
class RAGError extends Error {
constructor(
message: string,
public code: string,
public statusCode: number = 500,
public retryable: boolean = false
) {
super(message);
this.name = 'RAGError';
}
}
// Error type definitions
const RAGErrors = {
EMBEDDING_FAILED: (msg: string) =>
new RAGError(msg, 'EMBEDDING_FAILED', 503, true),
SEARCH_FAILED: (msg: string) =>
new RAGError(msg, 'SEARCH_FAILED', 503, true),
GENERATION_FAILED: (msg: string) =>
new RAGError(msg, 'GENERATION_FAILED', 503, true),
RATE_LIMITED: () =>
new RAGError('Rate limit exceeded', 'RATE_LIMITED', 429, true),
INVALID_INPUT: (msg: string) =>
new RAGError(msg, 'INVALID_INPUT', 400, false),
};
// Retry logic (sleep helper included; no wait after the final attempt)
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));
async function withRetry<T>(
fn: () => Promise<T>,
maxRetries: number = 3,
delay: number = 1000
): Promise<T> {
let lastError: Error | null = null;
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
return await fn();
} catch (error) {
lastError = error as Error;
// Non-retryable errors fail immediately
if (error instanceof RAGError && !error.retryable) {
throw error;
}
if (attempt < maxRetries) {
// Linear backoff, doubled for rate limits
const waitTime = error instanceof RAGError && error.code === 'RATE_LIMITED'
? delay * attempt * 2
: delay * attempt;
console.warn(`Attempt ${attempt} failed, retrying in ${waitTime}ms...`);
await sleep(waitTime);
}
}
}
throw lastError;
}
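A quick check of the retry behavior with a function that fails twice before succeeding (delays shortened so it runs fast; this is a simplified sketch of the flow without the RAGError type checks):

```typescript
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

// Simplified retry loop, just to demonstrate the control flow
async function withRetrySketch<T>(fn: () => Promise<T>, maxRetries = 3, delay = 10): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt < maxRetries) await sleep(delay * attempt); // linear backoff
    }
  }
  throw lastError;
}

let attempts = 0;
const flaky = async (): Promise<string> => {
  attempts++;
  if (attempts < 3) throw new Error('transient failure');
  return 'ok';
};

const retryRun = withRetrySketch(flaky, 3, 10);
retryRun.then(result => console.log(`${result} after ${attempts} attempts`)); // "ok after 3 attempts"
```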
// Safe RAG query
async function safeQuery(
rag: RAGGenerator,
question: string,
documents: Document[]
): Promise<FormattedAnswer | null> {
try {
return await withRetry(
() => rag.generate(question, documents),
3,
1000
);
} catch (error) {
if (error instanceof RAGError) {
console.error(`RAG Error [${error.code}]: ${error.message}`);
} else {
console.error('Unexpected error:', error);
}
return null;
}
}
3.3 Monitoring and Logging
Monitoring is the cornerstone of system stability after production deployment:
import winston from 'winston';
// Logger configuration
const logger = winston.createLogger({
level: 'info',
format: winston.format.combine(
winston.format.timestamp(),
winston.format.json()
),
transports: [
new winston.transports.File({ filename: 'error.log', level: 'error' }),
new winston.transports.File({ filename: 'combined.log' }),
],
});
// Console output for non-production environments
if (process.env.NODE_ENV !== 'production') {
logger.add(new winston.transports.Console({
format: winston.format.simple(),
}));
}
// RAG request logging
interface RAGLogEntry {
requestId: string;
question: string;
documentsRetrieved: number;
responseTime: number;
success: boolean;
error?: string;
citations: number;
}
function logRAGRequest(entry: RAGLogEntry) {
logger.info('RAG Request', entry);
}
// Metrics collector
class MetricsCollector {
private metrics = {
requests: {
total: 0,
success: 0,
failed: 0,
},
latency: {
p50: 0,
p95: 0,
p99: 0,
},
cache: {
hits: 0,
misses: 0,
},
};
private latencies: number[] = [];
recordRequest(success: boolean, latency: number) {
this.metrics.requests.total++;
if (success) {
this.metrics.requests.success++;
} else {
this.metrics.requests.failed++;
}
this.latencies.push(latency);
this.updateLatencyPercentiles();
}
recordCacheHit(hit: boolean) {
if (hit) {
this.metrics.cache.hits++;
} else {
this.metrics.cache.misses++;
}
}
private updateLatencyPercentiles() {
const sorted = [...this.latencies].sort((a, b) => a - b);
const len = sorted.length;
this.metrics.latency.p50 = sorted[Math.floor(len * 0.5)] || 0;
this.metrics.latency.p95 = sorted[Math.floor(len * 0.95)] || 0;
this.metrics.latency.p99 = sorted[Math.floor(len * 0.99)] || 0;
}
getMetrics() {
return {
...this.metrics,
successRate: this.metrics.requests.total > 0
? this.metrics.requests.success / this.metrics.requests.total
: 0,
cacheHitRate: (this.metrics.cache.hits + this.metrics.cache.misses) > 0
? this.metrics.cache.hits / (this.metrics.cache.hits + this.metrics.cache.misses)
: 0,
};
}
}
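The nearest-rank indexing used in updateLatencyPercentiles can be sanity-checked with synthetic latencies:

```typescript
// Same indexing as updateLatencyPercentiles above
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  return sorted[Math.floor(sorted.length * p)] ?? 0;
}

// 100 synthetic latencies: 10ms, 20ms, ..., 1000ms
const latencies = Array.from({ length: 100 }, (_, i) => (i + 1) * 10);

console.log(percentile(latencies, 0.5));  // 510
console.log(percentile(latencies, 0.95)); // 960
console.log(percentile(latencies, 0.99)); // 1000
```

Note that `Math.floor(len * p)` lands one element above the textbook median for even-length arrays; that bias is negligible for monitoring purposes.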
// Alert system
interface Alert {
level: 'warning' | 'critical';
message: string;
timestamp: Date;
}
class AlertManager {
private alerts: Alert[] = [];
private thresholds = {
errorRate: 0.1, // 10% error rate
latencyP95: 5000, // 5 seconds
cacheHitRate: 0.5, // Below 50%
};
checkMetrics(metrics: ReturnType<MetricsCollector['getMetrics']>) {
// Check error rate
const errorRate = 1 - metrics.successRate;
if (errorRate > this.thresholds.errorRate) {
this.addAlert('critical', `High error rate: ${(errorRate * 100).toFixed(1)}%`);
}
// Check latency
if (metrics.latency.p95 > this.thresholds.latencyP95) {
this.addAlert('warning', `High latency P95: ${metrics.latency.p95}ms`);
}
// Check cache hit rate
if (metrics.cacheHitRate < this.thresholds.cacheHitRate) {
this.addAlert('warning', `Low cache hit rate: ${(metrics.cacheHitRate * 100).toFixed(1)}%`);
}
}
private addAlert(level: Alert['level'], message: string) {
const alert = { level, message, timestamp: new Date() };
this.alerts.push(alert);
logger.warn('Alert triggered', alert);
// In practice, send notifications via Slack, PagerDuty, etc.
this.sendNotification(alert);
}
private sendNotification(alert: Alert) {
// Send notification via Slack webhook, email, etc.
console.log(`[ALERT] ${alert.level.toUpperCase()}: ${alert.message}`);
}
getAlerts(): Alert[] {
return this.alerts;
}
}
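The sendNotification stub above can be fleshed out with a real webhook call. A sketch assuming a hypothetical SLACK_WEBHOOK_URL environment variable; the payload builder is separated out so it can be tested without a network call:

```typescript
interface AlertInfo {
  level: 'warning' | 'critical';
  message: string;
}

// Pure payload builder: easy to unit-test
function buildSlackPayload(alert: AlertInfo): { text: string } {
  const emoji = alert.level === 'critical' ? ':rotating_light:' : ':warning:';
  return { text: `${emoji} [${alert.level.toUpperCase()}] ${alert.message}` };
}

async function sendToSlack(alert: AlertInfo): Promise<void> {
  const url = process.env.SLACK_WEBHOOK_URL; // hypothetical env var
  if (!url) return; // webhook not configured; skip silently
  await fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildSlackPayload(alert)),
  });
}

console.log(buildSlackPayload({ level: 'critical', message: 'High error rate: 12.0%' }).text);
```

Slack incoming webhooks accept a JSON body with a `text` field; swap in PagerDuty or email with the same pattern.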
4. Complete Production RAG System
4.1 Integrated Implementation
Here’s a complete RAG system for production deployment:
interface ProductionRAGConfig {
anthropicApiKey: string;
embeddingApiKey: string;
vectorDbUrl: string;
cacheSize?: number;
maxRetries?: number;
model?: string;
}
class ProductionRAG {
private rag: RAGGenerator;
private embedder: CachedEmbedder;
private metrics: MetricsCollector;
private alertManager: AlertManager;
constructor(config: ProductionRAGConfig) {
this.rag = new RAGGenerator({
anthropicApiKey: config.anthropicApiKey,
model: config.model || 'claude-sonnet-4-20250514',
});
this.embedder = new CachedEmbedder(
async (text) => this.callEmbeddingAPI(text, config.embeddingApiKey),
config.cacheSize || 10000
);
this.metrics = new MetricsCollector();
this.alertManager = new AlertManager();
}
async query(question: string): Promise<FormattedAnswer | null> {
const startTime = Date.now();
const requestId = this.generateRequestId();
try {
// 1. Embed question (with caching)
const queryEmbedding = await this.embedder.embed(question);
// 2. Vector search
const documents = await this.vectorSearch(queryEmbedding);
// 3. Generate answer (with retry logic)
const answer = await withRetry(
() => this.rag.generate(question, documents),
3,
1000
);
// 4. Record metrics
const latency = Date.now() - startTime;
this.metrics.recordRequest(true, latency);
logRAGRequest({
requestId,
question,
documentsRetrieved: documents.length,
responseTime: latency,
success: true,
citations: answer.citations.length,
});
return answer;
} catch (error) {
const latency = Date.now() - startTime;
this.metrics.recordRequest(false, latency);
logRAGRequest({
requestId,
question,
documentsRetrieved: 0,
responseTime: latency,
success: false,
error: error instanceof Error ? error.message : 'Unknown error',
citations: 0,
});
// Check alerts
this.alertManager.checkMetrics(this.metrics.getMetrics());
return null;
}
}
getMetrics() {
return this.metrics.getMetrics();
}
private async callEmbeddingAPI(text: string, apiKey: string): Promise<number[]> {
// Voyage AI embedding API call
const response = await fetch('https://api.voyageai.com/v1/embeddings', {
method: 'POST',
headers: {
'Authorization': `Bearer ${apiKey}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({
model: 'voyage-3',
input: text,
}),
});
if (!response.ok) {
throw RAGErrors.EMBEDDING_FAILED(`Voyage API returned ${response.status}`);
}
const data = await response.json();
return data.data[0].embedding;
}
private async vectorSearch(embedding: number[]): Promise<Document[]> {
// In practice, search from Supabase/Pinecone, etc.
return [];
}
private generateRequestId(): string {
return `req_${Date.now()}_${Math.random().toString(36).slice(2, 9)}`;
}
}
4.2 Usage Example
// Initialize production RAG system
const productionRAG = new ProductionRAG({
anthropicApiKey: process.env.ANTHROPIC_API_KEY!,
embeddingApiKey: process.env.VOYAGE_API_KEY!,
vectorDbUrl: process.env.SUPABASE_URL!,
cacheSize: 10000,
model: 'claude-sonnet-4-20250514',
});
// Use in API server
app.post('/api/query', async (req, res) => {
const { question } = req.body;
const answer = await productionRAG.query(question);
if (answer) {
res.json({ success: true, answer });
} else {
res.status(500).json({ success: false, error: 'Failed to generate answer' });
}
});
// Periodic monitoring
setInterval(() => {
const metrics = productionRAG.getMetrics();
console.log('=== RAG System Metrics ===');
console.log(`Success Rate: ${(metrics.successRate * 100).toFixed(1)}%`);
console.log(`Cache Hit Rate: ${(metrics.cacheHitRate * 100).toFixed(1)}%`);
console.log(`Latency P95: ${metrics.latency.p95}ms`);
}, 60000); // Every minute
Wrapping Up
Throughout this series, we’ve covered the complete process of building a RAG system:
- Day 1: RAG concepts and architecture
- Day 2: Document processing and chunking strategies
- Day 3: Embeddings and vector databases
- Day 4: Search optimization and reranking
- Day 5: Claude integration and answer generation
- Day 6: Production deployment and cost optimization
The most important aspects of production deployment are:
- RAG evaluation for continuous quality management
- Cost optimization for sustainable operations
- Monitoring for reliable service
You’re now ready to build your own RAG system and deploy it to production!
📚 Series Index
RAG (6/6)
- Day 1: RAG Concepts and Architecture
- Day 2: Document Processing and Chunking
- Day 3: Embeddings and Vector Database
- Day 4: Search Optimization and Reranking
- Day 5: Claude Integration and Answer Generation
- 📍 Day 6: Production Deployment and Optimization (Current)
🔗 GitHub Repository