AdvancedTypeScript~20 min

Production Patterns

Threshold tuning, uncertain hit strategies, per-category overrides, TTL management, invalidation, and built-in Prometheus metrics.

Pattern 1: Threshold Tuning

The defaultThreshold controls the trade-off between hit rate and answer quality. It is a cosine distance (0–2 scale).

Threshold	Hit Rate	Quality Risk	Best For
`0.05`	Low (~20%)	Very low	Medical, legal, financial
`0.10`	Medium (~35%)	Low	High-accuracy chatbots
`0.15`	Balanced (~50%)	Low	General purpose - start here
`0.20`	High (~65%)	Medium	FAQ bots, support
`0.30+`	Very high (~75%+)	High	Not recommended unless monitored

To find the right threshold, run a sample of your real queries through the cache and inspect the similarity scores:

import { SemanticCache } from '@betterdb/semantic-cache';

// Use a low threshold (wide net) to collect score data
const evalCache = new SemanticCache({
  client,
  embedFn,
  defaultThreshold: 0.50, // catch everything for analysis
});
await evalCache.initialize();

const testPrompts = [
  'What is Valkey?',
  'Can you explain Valkey?',     // should match → expect low score
  'How do I set a key in Redis?', // different intent → expect high score
  'Valkey vs Redis comparison',   // related but different → borderline
];

for (const prompt of testPrompts) {
  const result = await evalCache.check(prompt);
  if (result.similarity !== undefined) {
    console.log(`score=${result.similarity.toFixed(4)}  prompt="${prompt}"`);
  }
}
// score=0.0821  prompt="Can you explain Valkey?"
// score=0.3142  prompt="How do I set a key in Redis?"
// score=0.2205  prompt="Valkey vs Redis comparison"

Set your threshold between the highest acceptable hit and the lowest unacceptable hit.

Pattern 2: Handling Uncertain Hits

When a prompt falls in the uncertainty band - slightly above the threshold - cache.check() returns confidence: 'uncertain'. Three strategies:

Accept and monitor - return the cached response but track it separately via the result: 'uncertain_hit' Prometheus label. Review periodically.

const result = await cache.check(prompt);

if (result.hit) {
  if (result.confidence === 'uncertain') {
    metrics.increment('cache.uncertain_hit'); // your own counter
  }
  return result.response!;
}

Fall back to LLM - treat uncertain hits as misses. Use the fresh LLM response to overwrite the cache entry.

const result = await cache.check(prompt);

if (result.hit && result.confidence === 'high') {
  return result.response!;
}

// Miss or uncertain - call LLM
const response = await callLlm(prompt);
await cache.store(prompt, response); // overwrites uncertain entry
return response;

Prompt for feedback - in user-facing applications, show the cached response but collect a signal.

if (result.hit && result.confidence === 'uncertain') {
  return {
    response: result.response,
    showFeedback: true, // render thumbs up/down in the UI
  };
}

A high rate of uncertain hits (visible in semantic_cache_requests_total{result="uncertain_hit"}) indicates the threshold may be too loose.

Pattern 3: Per-Category Thresholds

Different query categories can have different accuracy requirements. Use categoryThresholds to override the default per-category:

const cache = new SemanticCache({
  client,
  embedFn,
  defaultThreshold: 0.15,
  categoryThresholds: {
    'medical': 0.05,    // very strict - health information must be accurate
    'faq':     0.25,    // relaxed - FAQ answers are safe to generalize
    'support': 0.20,    // moderate - support answers are usually reusable
  },
});

// Pass category on each call
const result = await cache.check(prompt, { category: 'medical' });
await cache.store(prompt, response, { category: 'medical' });

The category is stored as a TAG field in Valkey and is also emitted as a label on all Prometheus metrics.

Pattern 4: TTL Strategies

// Fixed TTL - simple, applied at store time
await cache.store(prompt, response, { ttl: 3600 }); // 1 hour

// Use defaultTtl in the constructor for a global default
const cache = new SemanticCache({
  client, embedFn,
  defaultTtl: 86400, // 24 hours
});

// Per-category TTL (set at store time, overrides default)
await cache.store(prompt, response, {
  category: 'real-time',
  ttl: 300, // 5 minutes for time-sensitive data
});

Category	Recommended TTL	Reason
General facts	86400 (24h)	Stable information
Product info	3600 (1h)	Changes occasionally
Real-time data	300 (5min)	Prices, weather, status
Conversation	1800 (30min)	Session-scoped

Pattern 5: Cache Invalidation

// Invalidate a specific entry by its exact prompt
const key = await cache.store('What is the price of product X?', response);
await cache.invalidate(`@key:{${key}}`);

// Invalidate all entries for a category
await cache.invalidate('@category:{pricing}');

// Invalidate all entries for a model (if you store model as a custom field)
// Use cache.flush() to drop everything and rebuild from scratch
await cache.flush();
await cache.initialize(); // rebuild the index

Note: invalidate() accepts any valkey-search filter expression. See the Valkey search query syntax for full filter options.

Pattern 6: Prometheus Metrics

All metrics are prefixed with semantic_cache_ by default (configurable via telemetry.metricsPrefix).

Metric	Type	Labels	Description
`semantic_cache_requests_total`	Counter	`cache_name`, `result`, `category`	Total requests. `result`: `hit`, `miss`, `uncertain_hit`
`semantic_cache_similarity_score`	Histogram	`cache_name`, `category`	Cosine distance scores for lookups with candidates
`semantic_cache_operation_duration_seconds`	Histogram	`cache_name`, `operation`	Duration per operation (`check`, `store`, `invalidate`, `initialize`)
`semantic_cache_embedding_duration_seconds`	Histogram	`cache_name`	Embedding function call duration

Expose them via a /metrics endpoint:

import { register } from 'prom-client';
import express from 'express';

const app = express();

app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

Key dashboards to build:

# Hit rate over time
rate(semantic_cache_requests_total{result="hit"}[5m])
  / rate(semantic_cache_requests_total[5m])

# P95 check latency
histogram_quantile(0.95, rate(semantic_cache_operation_duration_seconds_bucket{operation="check"}[5m]))

# Uncertain hit fraction
rate(semantic_cache_requests_total{result="uncertain_hit"}[5m])
  / rate(semantic_cache_requests_total[5m])

Production Checklist

Area	Recommendation
Threshold	Start at `0.15`, tune with real query samples
Uncertain hits	Track the `uncertain_hit` label; adjust threshold if > 10%
TTL	Set `defaultTtl`; override per-category for time-sensitive data
Memory	Set `maxmemory-policy allkeys-lru` in Valkey config
Invalidation	Use category TAGs at write time for targeted invalidation
Metrics	Expose `/metrics` and alert on hit rate drop > 20%
Cluster	Avoid `flush()` in cluster mode - SCAN only covers one node
Streaming	Accumulate full response before calling `store()`

← Previous02 - Adapters