Production Patterns
Threshold tuning, uncertain hit strategies, per-category overrides, TTL management, invalidation, and built-in Prometheus metrics.
Pattern 1: Threshold Tuning
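The defaultThreshold controls the trade-off between hit rate and answer quality. It is a cosine distance (0–2 scale): a lower value demands closer matches, which means fewer hits but less risk of returning a wrong answer.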
| Threshold | Hit Rate | Quality Risk | Best For |
|---|---|---|---|
| 0.05 | Low (~20%) | Very low | Medical, legal, financial |
| 0.10 | Medium (~35%) | Low | High-accuracy chatbots |
| 0.15 | Balanced (~50%) | Low | General purpose - start here |
| 0.20 | High (~65%) | Medium | FAQ bots, support |
| 0.30+ | Very high (~75%+) | High | Not recommended unless monitored |
To find the right threshold, run a sample of your real queries through the cache and inspect the similarity scores:
import { SemanticCache } from '@betterdb/semantic-cache';

// Use a deliberately loose (high) threshold to cast a wide net and collect score data.
// `client` and `embedFn` are assumed to be configured elsewhere.
const evalCache = new SemanticCache({
  client,
  embedFn,
  defaultThreshold: 0.50, // catch everything for analysis
});
await evalCache.initialize();

// Seed the cache with a reference prompt so the checks below have a candidate to match.
// The response text is a placeholder.
await evalCache.store('What is Valkey?', 'Valkey is an open-source, in-memory key-value store.');

const testPrompts = [
  'Can you explain Valkey?', // should match → expect low score
  'How do I set a key in Redis?', // different intent → expect high score
  'Valkey vs Redis comparison', // related but different → borderline
];
for (const prompt of testPrompts) {
  const result = await evalCache.check(prompt);
  if (result.similarity !== undefined) {
    console.log(`score=${result.similarity.toFixed(4)} prompt="${prompt}"`);
  }
}
// score=0.0821 prompt="Can you explain Valkey?"
// score=0.3142 prompt="How do I set a key in Redis?"
// score=0.2205 prompt="Valkey vs Redis comparison"
Set your threshold between the highest acceptable hit and the lowest unacceptable hit.
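Once each sampled score has been judged by hand, the rule above reduces to a few lines of arithmetic. The following is a minimal sketch: `LabeledScore` and `suggestThreshold` are hypothetical names, not part of `@betterdb/semantic-cache`, and taking the midpoint of the gap is just one reasonable choice.

```typescript
// Hypothetical helper, not part of @betterdb/semantic-cache.
interface LabeledScore {
  score: number;       // cosine distance reported by check()
  acceptable: boolean; // manual judgment: would the cached answer have been fine?
}

function suggestThreshold(samples: LabeledScore[]): number | null {
  const good = samples.filter((s) => s.acceptable).map((s) => s.score);
  const bad = samples.filter((s) => !s.acceptable).map((s) => s.score);
  if (good.length === 0 || bad.length === 0) return null; // need both groups

  const highestGood = Math.max(...good);
  const lowestBad = Math.min(...bad);
  if (highestGood >= lowestBad) return null; // groups overlap: no clean cut

  // Midpoint of the gap between the two groups
  return (highestGood + lowestBad) / 2;
}

// Scores taken from the sample output above, with manual labels added
const suggested = suggestThreshold([
  { score: 0.0821, acceptable: true },
  { score: 0.2205, acceptable: false },
  { score: 0.3142, acceptable: false },
]);
console.log(suggested?.toFixed(4)); // prints 0.1513
```

A `null` result means the acceptable and unacceptable scores overlap. In that case no single threshold separates them, and per-category thresholds or a stricter default are worth considering.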
Pattern 2: Handling Uncertain Hits
When a match lands in the uncertainty band, close to the threshold, cache.check() still reports a hit but sets confidence: 'uncertain'. Three strategies:
Accept and monitor - return the cached response but track it separately via the result: 'uncertain_hit' Prometheus label. Review periodically.
const result = await cache.check(prompt);
if (result.hit) {
  if (result.confidence === 'uncertain') {
    metrics.increment('cache.uncertain_hit'); // your own counter
  }
  return result.response!;
}
Fall back to LLM - treat uncertain hits as misses. Use the fresh LLM response to overwrite the cache entry.
const result = await cache.check(prompt);
if (result.hit && result.confidence === 'high') {
  return result.response!;
}
// Miss or uncertain - call LLM
const response = await callLlm(prompt);
await cache.store(prompt, response); // overwrites uncertain entry
return response;
Prompt for feedback - in user-facing applications, show the cached response but collect a signal.
const result = await cache.check(prompt);
if (result.hit && result.confidence === 'uncertain') {
  return {
    response: result.response,
    showFeedback: true, // render thumbs up/down in the UI
  };
}
A high rate of uncertain hits (visible in semantic_cache_requests_total{result="uncertain_hit"}) indicates the threshold may be too loose.
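The three strategies can sit behind a single decision function so the policy is switchable by configuration. This is a sketch only: `CheckResult` models just the fields used in the snippets above, and `UncertainStrategy`, `Action`, and `decideAction` are hypothetical names rather than library API.

```typescript
// Minimal shape of a check() result, limited to the fields used above (assumed).
interface CheckResult {
  hit: boolean;
  confidence?: 'high' | 'uncertain';
  response?: string;
}

type UncertainStrategy = 'accept' | 'fallback' | 'feedback';
type Action = 'serve' | 'serve_with_feedback' | 'call_llm';

function decideAction(result: CheckResult, strategy: UncertainStrategy): Action {
  if (!result.hit) return 'call_llm';                    // plain miss
  if (result.confidence !== 'uncertain') return 'serve'; // confident hit
  switch (strategy) {
    case 'accept':
      return 'serve';               // serve it, count it separately
    case 'fallback':
      return 'call_llm';            // treat as a miss and refresh the entry
    case 'feedback':
      return 'serve_with_feedback'; // serve it, ask the user for a signal
  }
}

const action = decideAction({ hit: true, confidence: 'uncertain', response: 'cached' }, 'fallback');
console.log(action); // prints call_llm
```

Keeping the policy in one place makes it easy to start with 'accept' while collecting data, then move to 'fallback' for categories where the uncertain hits turn out to be wrong.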
Pattern 3: Per-Category Thresholds
Different query categories can have different accuracy requirements. Use categoryThresholds to override the default threshold per category:
const cache = new SemanticCache({
  client,
  embedFn,
  defaultThreshold: 0.15,
  categoryThresholds: {
    'medical': 0.05, // very strict - health information must be accurate
    'faq': 0.25, // relaxed - FAQ answers are safe to generalize
    'support': 0.20, // moderate - support answers are usually reusable
  },
});
// Pass category on each call
const result = await cache.check(prompt, { category: 'medical' });
await cache.store(prompt, response, { category: 'medical' });
The category is stored as a TAG field in Valkey and is also emitted as a label on all Prometheus metrics.
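Conceptually, the override works as a lookup with a fallback to the default. The sketch below illustrates that behavior for reasoning about which threshold applies to a call. It is not the library's implementation, and `ThresholdConfig` and `effectiveThreshold` are hypothetical names.

```typescript
// Illustrative only: mirrors the two constructor options shown above.
interface ThresholdConfig {
  defaultThreshold: number;
  categoryThresholds?: Record<string, number>;
}

function effectiveThreshold(config: ThresholdConfig, category?: string): number {
  if (category !== undefined) {
    const override = config.categoryThresholds?.[category];
    if (override !== undefined) return override; // category-specific value wins
  }
  return config.defaultThreshold; // no category, or no override for it
}

const thresholds: ThresholdConfig = {
  defaultThreshold: 0.15,
  categoryThresholds: { medical: 0.05, faq: 0.25, support: 0.2 },
};

console.log(effectiveThreshold(thresholds, 'medical')); // prints 0.05
console.log(effectiveThreshold(thresholds, 'billing')); // prints 0.15
console.log(effectiveThreshold(thresholds));            // prints 0.15
```

Under this model an unlisted category, or a call without a category, is matched with the default threshold, so a misspelled category name would silently lose its override.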
Pattern 4: TTL Strategies
// Fixed TTL - simple, applied at store time
await cache.store(prompt, response, { ttl: 3600 }); // 1 hour
// Use defaultTtl in the constructor for a global default
const cache = new SemanticCache({
  client, embedFn,
  defaultTtl: 86400, // 24 hours
});
// Per-category TTL (set at store time, overrides default)
await cache.store(prompt, response, {
  category: 'real-time',
  ttl: 300, // 5 minutes for time-sensitive data
});
| Category | Recommended TTL (seconds) | Reason |
|---|---|---|
| General facts | 86400 (24h) | Stable information |
| Product info | 3600 (1h) | Changes occasionally |
| Real-time data | 300 (5min) | Prices, weather, status |
| Conversation | 1800 (30min) | Session-scoped |
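Because the TTL is passed at store time, one way to apply the table consistently is to keep the values in a single map and derive the store options from the category. This is a sketch under assumptions: the category names other than 'real-time' are made up for illustration, and `ttlFor` and `storeOptions` are hypothetical helpers.

```typescript
// TTLs in seconds, taken from the table above. Category names are illustrative.
const TTL_SECONDS: Record<string, number> = {
  'general': 86400,     // 24h - stable information
  'product': 3600,      // 1h - changes occasionally
  'real-time': 300,     // 5min - prices, weather, status
  'conversation': 1800, // 30min - session-scoped
};

const FALLBACK_TTL_SECONDS = 86400; // used for categories not listed above

function ttlFor(category: string): number {
  return TTL_SECONDS[category] ?? FALLBACK_TTL_SECONDS;
}

// Builds the options object passed as the third argument to store()
function storeOptions(category: string): { category: string; ttl: number } {
  return { category, ttl: ttlFor(category) };
}

console.log(storeOptions('real-time')); // prints { category: 'real-time', ttl: 300 }
```

A call would then look like `await cache.store(prompt, response, storeOptions('real-time'))`, which keeps TTL policy out of individual call sites.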
Pattern 5: Cache Invalidation
// Invalidate a specific entry using the key returned by store()
const key = await cache.store('What is the price of product X?', response);
await cache.invalidate(`@key:{${key}}`);
// Invalidate all entries for a category
await cache.invalidate('@category:{pricing}');
// Entries for a model can be invalidated the same way if you store the model as a custom field
// Use cache.flush() to drop everything and rebuild from scratch
await cache.flush();
await cache.initialize(); // rebuild the index
Note: invalidate() accepts any valkey-search filter expression. See the Valkey search query syntax for full filter options.
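When filter values come from variables rather than literals, building the expression through a helper avoids malformed queries. The sketch below assumes RediSearch-style TAG escaping, where punctuation and whitespace inside the braces are escaped with a backslash. Whether valkey-search applies exactly the same rules should be verified against its documentation. `tagFilter` is a hypothetical helper.

```typescript
// Hypothetical helper. Assumes RediSearch-style escaping inside TAG filters.
function tagFilter(field: string, value: string): string {
  // Escape every character that is not an ASCII letter, digit, or underscore
  const escaped = value.replace(/[^A-Za-z0-9_]/g, '\\$&');
  return `@${field}:{${escaped}}`;
}

console.log(tagFilter('category', 'pricing'));   // prints @category:{pricing}
console.log(tagFilter('category', 'real-time')); // prints @category:{real\-time}
```

The result can be passed straight to `cache.invalidate()`. Note that this simple character class also escapes non-ASCII letters, which may or may not be what the search engine expects.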
Pattern 6: Prometheus Metrics
All metrics are prefixed with semantic_cache_ by default (configurable via telemetry.metricsPrefix).
| Metric | Type | Labels | Description |
|---|---|---|---|
| semantic_cache_requests_total | Counter | cache_name, result, category | Total requests. result: hit, miss, uncertain_hit |
| semantic_cache_similarity_score | Histogram | cache_name, category | Cosine distance scores for lookups with candidates |
| semantic_cache_operation_duration_seconds | Histogram | cache_name, operation | Duration per operation (check, store, invalidate, initialize) |
| semantic_cache_embedding_duration_seconds | Histogram | cache_name | Embedding function call duration |
Expose them via a /metrics endpoint:
import { register } from 'prom-client';
import express from 'express';
const app = express();
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
Key dashboards to build:
# Hit rate over time (sum both sides; an unaggregated division only pairs series with identical labels)
sum(rate(semantic_cache_requests_total{result="hit"}[5m]))
/ sum(rate(semantic_cache_requests_total[5m]))
# P95 check latency
histogram_quantile(0.95, sum by (le) (rate(semantic_cache_operation_duration_seconds_bucket{operation="check"}[5m])))
# Uncertain hit fraction
sum(rate(semantic_cache_requests_total{result="uncertain_hit"}[5m]))
/ sum(rate(semantic_cache_requests_total[5m]))
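The two ratio queries boil down to simple arithmetic on counter increases. The following sketch reproduces that arithmetic on two snapshots of the per-result counts, which can help when sanity-checking a dashboard. The `Counts` type and `fractions` function are hypothetical and unrelated to prom-client.

```typescript
type ResultLabel = 'hit' | 'miss' | 'uncertain_hit';
type Counts = Record<ResultLabel, number>;

// Computes hit rate and uncertain-hit fraction over the window between two snapshots.
function fractions(before: Counts, after: Counts): { hitRate: number; uncertainFraction: number } {
  const hit = after.hit - before.hit;
  const miss = after.miss - before.miss;
  const uncertain = after.uncertain_hit - before.uncertain_hit;
  const total = hit + miss + uncertain;
  if (total <= 0) return { hitRate: 0, uncertainFraction: 0 }; // no traffic in the window
  return { hitRate: hit / total, uncertainFraction: uncertain / total };
}

const windowStats = fractions(
  { hit: 100, miss: 100, uncertain_hit: 0 },
  { hit: 150, miss: 140, uncertain_hit: 10 },
);
console.log(windowStats); // prints { hitRate: 0.5, uncertainFraction: 0.1 }
```

As in the queries, uncertain hits count toward the denominator but not toward the hit rate, so a cache that serves many uncertain hits will show a lower hit rate than its users actually experience.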
Production Checklist
| Area | Recommendation |
|---|---|
| Threshold | Start at 0.15, tune with real query samples |
| Uncertain hits | Track the uncertain_hit label; adjust threshold if > 10% |
| TTL | Set defaultTtl; override per-category for time-sensitive data |
| Memory | Set maxmemory-policy allkeys-lru in Valkey config |
| Invalidation | Use category TAGs at write time for targeted invalidation |
| Metrics | Expose /metrics and alert on hit rate drop > 20% |
| Cluster | Avoid flush() in cluster mode - SCAN only covers one node |
| Streaming | Accumulate full response before calling store() |
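For the streaming item in the checklist, the idea is to forward chunks to the client as they arrive while buffering them, and to write to the cache only after the stream has finished. The sketch below is a minimal version under assumptions: `Storable` is a structural stand-in for the cache (only a `store(prompt, response)` method is required), and `streamAndStore` is a hypothetical helper.

```typescript
// Structural stand-in for the cache: anything with store(prompt, response).
interface Storable {
  store(prompt: string, response: string): Promise<unknown>;
}

async function streamAndStore(
  cache: Storable,
  prompt: string,
  chunks: AsyncIterable<string>,
  onChunk: (chunk: string) => void,
): Promise<string> {
  let full = '';
  for await (const chunk of chunks) {
    onChunk(chunk); // forward to the client immediately
    full += chunk;  // keep a copy for the cache
  }
  // Reached only if the stream completed without throwing,
  // so a truncated response is never cached.
  await cache.store(prompt, full);
  return full;
}
```

If the stream throws midway, the error propagates and `store()` is never called, which is the behavior the checklist item is after.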