feedstock

Cache Freshness

Multi-signal cache freshness evaluation using sitemap, HTTP headers, and content hashing.

Instead of naive TTL-based cache invalidation, Feedstock can combine multiple noisy signals to decide whether cached content is stale. The approach is based on the paper "Scalable Crawling with Noisy Change-Indicating Signals".

The Problem

A simple TTL cache makes a binary decision from age alone: with a 24-hour TTL, an entry is fresh if it is under 24 hours old and stale otherwise. This misses cases like:

  • Page updated 6 hours after caching (the sitemap says so): TTL says "fresh", but the cached copy is stale
  • Page is 25 hours old but the ETag confirms no change: TTL says "stale", but the cached copy is still fresh
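
Both failure cases can be reproduced with a minimal TTL check. The sketch below is illustrative only: `isFreshByTtl` and its parameters are hypothetical names, not part of Feedstock.

```typescript
const HOUR_MS = 3_600_000;

// Naive TTL check (hypothetical helper): fresh if and only if the entry is younger than ttlMs.
function isFreshByTtl(cachedAt: number, now: number, ttlMs: number = 24 * HOUR_MS): boolean {
  return now - cachedAt < ttlMs;
}

const fixedNow = Date.UTC(2024, 0, 15, 12, 0, 0); // fixed clock for a reproducible example

// Case 1: cached 8 hours ago, page changed 6 hours after caching.
// The TTL check reports "fresh" even though the cached copy is out of date.
const ttlCase1 = isFreshByTtl(fixedNow - 8 * HOUR_MS, fixedNow); // true

// Case 2: cached 25 hours ago, but a matching ETag shows nothing changed.
// The TTL check reports "stale" and forces an unnecessary refetch.
const ttlCase2 = isFreshByTtl(fixedNow - 25 * HOUR_MS, fixedNow); // false
```

In neither case does the TTL check have access to the information (sitemap, ETag) that would have produced the right answer.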

Multi-Signal Evaluation

import { CacheFreshnessEvaluator } from "feedstock";

const evaluator = new CacheFreshnessEvaluator();

const result = evaluator.evaluate(
  { url: "https://example.com", cachedAt: Date.now() - 3600_000, etag: '"abc"' },
  {
    cachedAt: Date.now() - 3600_000,
    etag: '"abc"',           // same ETag — content unchanged
    cachedEtag: '"abc"',
    sitemapLastmod: new Date(Date.now() - 7200_000).toISOString(),
    cacheControl: "max-age=7200",
  },
);

console.log(result.isStale);         // false
console.log(result.score);           // 0.18 (low = fresh)
console.log(result.recommendation);  // "use_cache"

Signals

Four signal types are evaluated, each with a configurable trust weight:

Time Decay

How old is the cached entry? Staleness increases linearly with age.
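
A linear decay can be sketched as the ratio of cache age to the maximum age, clamped to 1. This is an assumption about the shape of the signal, not Feedstock's actual code; `timeDecayStaleness` is a hypothetical name, and the 24-hour default mirrors the `maxAgeMs` default shown under Configuration.

```typescript
// Hypothetical helper: staleness grows linearly from 0 (just cached) to 1 (at maxAgeMs).
function timeDecayStaleness(cachedAt: number, now: number, maxAgeMs: number = 86_400_000): number {
  const ageMs = Math.max(0, now - cachedAt); // guard against clock skew
  return Math.min(ageMs / maxAgeMs, 1);      // clamp once the entry exceeds the max age
}

const decayNow = 1_700_000_000_000; // arbitrary fixed timestamp

// One hour old with a 24-hour max age: 3600 / 86400, roughly 4% of max.
const decayOneHour = timeDecayStaleness(decayNow - 3_600_000, decayNow);

// Two days old: clamped to 1.
const decayTwoDays = timeDecayStaleness(decayNow - 172_800_000, decayNow);
```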

HTTP Headers

  • ETag: If new ETag differs from cached → stale (confidence 0.95)
  • Last-Modified: If newer than cached → stale (confidence 0.8)
  • Cache-Control: If max-age expired → stale (confidence 0.9). If no-cache → stale (confidence 0.7)
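
The header rules above can be sketched as a function that emits one signal per header present. The confidence values come from the list above; everything else (`headerSignals`, the `HeaderInput` shape, signal names other than `http_etag`, and reusing the same confidence for a "fresh" vote) is an assumption made for illustration.

```typescript
interface HeaderSignal {
  name: string;
  stale: boolean;
  confidence: number;
}

// Hypothetical input shape, not Feedstock's actual type.
interface HeaderInput {
  cachedAt: number;      // ms since epoch when the entry was stored
  now: number;           // current time in ms since epoch
  cachedEtag?: string;   // ETag stored with the cached entry
  etag?: string;         // ETag from the latest response
  lastModified?: string; // Last-Modified from the latest response (HTTP date)
  cacheControl?: string; // Cache-Control captured with the cached response
}

function headerSignals(input: HeaderInput): HeaderSignal[] {
  const signals: HeaderSignal[] = [];

  // ETag: a differing value means the representation changed.
  if (input.etag !== undefined && input.cachedEtag !== undefined) {
    signals.push({ name: "http_etag", stale: input.etag !== input.cachedEtag, confidence: 0.95 });
  }

  // Last-Modified: stale if the resource changed after it was cached.
  if (input.lastModified !== undefined) {
    const modifiedAt = Date.parse(input.lastModified);
    if (!Number.isNaN(modifiedAt)) {
      signals.push({ name: "http_last_modified", stale: modifiedAt > input.cachedAt, confidence: 0.8 });
    }
  }

  // Cache-Control: no-cache always votes stale; otherwise compare age with max-age.
  if (input.cacheControl !== undefined) {
    const directives = input.cacheControl.toLowerCase();
    const maxAge = /max-age=(\d+)/.exec(directives);
    if (/\bno-cache\b/.test(directives)) {
      signals.push({ name: "http_cache_control", stale: true, confidence: 0.7 });
    } else if (maxAge) {
      const expired = input.now - input.cachedAt > Number(maxAge[1]) * 1000;
      signals.push({ name: "http_cache_control", stale: expired, confidence: 0.9 });
    }
  }
  return signals;
}

// Usage: one hour after caching, same ETag, older Last-Modified, max-age not yet expired.
const headerCachedAt = Date.UTC(2024, 0, 15, 12, 0, 0);
const headerVotes = headerSignals({
  cachedAt: headerCachedAt,
  now: headerCachedAt + 3_600_000,
  cachedEtag: '"abc"',
  etag: '"abc"',
  lastModified: "Mon, 15 Jan 2024 10:00:00 GMT",
  cacheControl: "max-age=7200",
});
// All three signals vote "fresh" in this scenario.
```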

Sitemap

  • lastmod: If newer than cache timestamp → stale (confidence 0.7)
  • changefreq: "always"/"hourly" biases toward stale. "never"/"yearly" biases toward fresh.
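
A sketch of the two sitemap rules follows. The 0.7 confidence for `lastmod` is taken from the list above; the 0.3 confidence for `changefreq`, the `sitemap_changefreq` name, and the `sitemapSignals` function itself are assumptions, since the page only says that `changefreq` "biases" the result.

```typescript
type ChangeFreq = "always" | "hourly" | "daily" | "weekly" | "monthly" | "yearly" | "never";

interface SitemapSignal {
  name: string;
  stale: boolean;
  confidence: number;
}

// Hypothetical helper: turns sitemap metadata for one URL into freshness votes.
function sitemapSignals(cachedAt: number, lastmod?: string, changefreq?: ChangeFreq): SitemapSignal[] {
  const votes: SitemapSignal[] = [];

  // lastmod: stale if the sitemap reports a change after the entry was cached.
  if (lastmod !== undefined) {
    const modifiedAt = Date.parse(lastmod);
    if (!Number.isNaN(modifiedAt)) {
      votes.push({ name: "sitemap_lastmod", stale: modifiedAt > cachedAt, confidence: 0.7 });
    }
  }

  // changefreq is only a hint, so it gets a low, made-up confidence of 0.3 here.
  if (changefreq === "always" || changefreq === "hourly") {
    votes.push({ name: "sitemap_changefreq", stale: true, confidence: 0.3 });
  } else if (changefreq === "never" || changefreq === "yearly") {
    votes.push({ name: "sitemap_changefreq", stale: false, confidence: 0.3 });
  }
  return votes;
}

// Cached on 10 Jan, sitemap reports a change on 15 Jan and an hourly change frequency.
const sitemapVotes = sitemapSignals(Date.UTC(2024, 0, 10), "2024-01-15", "hourly");
```

Middle values such as "daily" or "weekly" produce no `changefreq` vote in this sketch, which leaves the decision to the other signals.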

Content Hash

Direct comparison of content hashes. Most reliable signal when available.
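
As a minimal illustration of hash comparison, the sketch below uses a small FNV-1a hash so that it runs without dependencies. FNV-1a is a stand-in chosen for brevity, not the hash Feedstock uses (the page does not say which one it is); a cryptographic hash such as SHA-256 would be the more usual choice. The names `fnv1a` and `contentHashSignal` and the confidence of 1.0 are assumptions.

```typescript
// FNV-1a (32-bit) computed over UTF-16 code units; matches the standard
// byte-wise FNV-1a for ASCII input.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime, keep unsigned 32 bits
  }
  return hash;
}

// Hypothetical helper: compares the stored hash with the hash of freshly fetched content.
function contentHashSignal(cachedHash: number, freshBody: string) {
  return { name: "content_hash", stale: fnv1a(freshBody) !== cachedHash, confidence: 1.0 };
}

const storedHash = fnv1a("<h1>Hello</h1>");
const sameBody = contentHashSignal(storedHash, "<h1>Hello</h1>");    // stale: false
const editedBody = contentHashSignal(storedHash, "<h1>Hello!</h1>"); // stale: true
```

The catch is that this signal needs the fresh content in hand, which is why it is only available after a fetch has already happened.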

Recommendations

Based on the combined staleness score:

| Score | Recommendation | Action |
| --- | --- | --- |
| < 0.3 | use_cache | Serve cached content directly |
| 0.3 – 0.7 | revalidate | Send conditional GET (If-None-Match / If-Modified-Since) |
| > 0.7 | refetch | Full re-crawl |
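
The score bands translate directly into a small mapping function. The sketch assumes that the boundary values 0.3 and 0.7 fall into the `revalidate` band, which the table suggests but does not state outright; `recommend` and `conditionalHeaders` are hypothetical helper names.

```typescript
type Recommendation = "use_cache" | "revalidate" | "refetch";

// Map a combined staleness score to one of the three recommendations.
function recommend(score: number): Recommendation {
  if (score < 0.3) return "use_cache";
  if (score > 0.7) return "refetch";
  return "revalidate"; // 0.3 to 0.7 inclusive (assumed)
}

// Request headers for a conditional GET. A server that supports validators
// answers 304 Not Modified when the cached copy is still current.
function conditionalHeaders(etag?: string, lastModified?: string): Record<string, string> {
  const headers: Record<string, string> = {};
  if (etag) headers["If-None-Match"] = etag;
  if (lastModified) headers["If-Modified-Since"] = lastModified;
  return headers;
}

const lowScore = recommend(0.18);  // "use_cache"
const midScore = recommend(0.5);   // "revalidate"
const highScore = recommend(0.9);  // "refetch"
const revalidationHeaders = conditionalHeaders('"abc"', "Mon, 15 Jan 2024 10:00:00 GMT");
```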

Configuration

const evaluator = new CacheFreshnessEvaluator({
  maxAgeMs: 86400_000,        // max age before forced refresh (default: 24h)
  sitemapWeight: 0.6,         // trust weight for sitemap signals
  httpHeaderWeight: 0.8,      // trust weight for HTTP headers
  contentHashWeight: 1.0,     // trust weight for content hash
  timeDecayWeight: 0.4,       // trust weight for time-based decay
  staleThreshold: 0.5,        // score above which content is stale
});
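
This page does not spell out how the weights and confidences are combined into a single score. To give a feel for what a trust weight does, here is one plausible scheme, a confidence-weighted vote. It is an assumption for illustration, not Feedstock's formula, and it will not necessarily reproduce the scores shown elsewhere on this page. `WeightedSignal` and `combineSignals` are hypothetical names.

```typescript
interface WeightedSignal {
  stale: boolean;
  confidence: number; // how sure the signal is about its own vote
  weight: number;     // how much this signal type is trusted (from configuration)
}

// Each signal contributes weight * confidence to its side of the vote;
// the score is the share of that mass voting "stale".
function combineSignals(signals: WeightedSignal[]): number {
  let staleMass = 0;
  let totalMass = 0;
  for (const s of signals) {
    const mass = s.weight * s.confidence;
    totalMass += mass;
    if (s.stale) staleMass += mass;
  }
  return totalMass === 0 ? 0 : staleMass / totalMass;
}

// A changed ETag (weight 0.8) disagrees with an older sitemap lastmod (weight 0.6).
const combinedScore = combineSignals([
  { stale: true, confidence: 0.95, weight: 0.8 },
  { stale: false, confidence: 0.7, weight: 0.6 },
]);
const combinedIsStale = combinedScore > 0.5; // compare against staleThreshold
```

Under this scheme, raising `httpHeaderWeight` relative to `sitemapWeight` lets the ETag win disagreements more decisively, and setting a weight to 0 silences that signal type entirely.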

Sitemap Parsing

Built-in sitemap parser for fetching freshness signals:

import { parseSitemap, fetchSitemap, buildSitemapIndex } from "feedstock";

// Parse XML string
const parsedEntries = parseSitemap(xml);
// [{ loc: "https://...", lastmod: "2024-01-15", changefreq: "weekly", priority: 0.8 }]

// Fetch from a site
const entries = await fetchSitemap("https://example.com");

// Build lookup for fast URL → entry access
const index = buildSitemapIndex(entries);
const entry = index.get("https://example.com/page");
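
To show the data shape involved, here is a dependency-free sketch of parsing and indexing. It is not the library's implementation: `parseSitemapSketch`, `childText`, and `indexByLoc` are hypothetical stand-ins, and the regex-based extraction ignores CDATA, XML entities, and sitemap index files, all of which a real XML parser would handle.

```typescript
interface SitemapEntry {
  loc: string;
  lastmod?: string;
  changefreq?: string;
  priority?: number;
}

// Extract the trimmed text of a child element (for example <loc>) from one <url> block.
function childText(block: string, tag: string): string | undefined {
  const match = new RegExp(`<${tag}>\\s*([^<]*?)\\s*</${tag}>`).exec(block);
  return match ? match[1] : undefined;
}

function parseSitemapSketch(xml: string): SitemapEntry[] {
  const found: SitemapEntry[] = [];
  const urlBlock = /<url>([\s\S]*?)<\/url>/g;
  let m: RegExpExecArray | null;
  while ((m = urlBlock.exec(xml)) !== null) {
    const loc = childText(m[1], "loc");
    if (loc === undefined) continue; // <loc> is required; skip malformed blocks
    const priority = childText(m[1], "priority");
    found.push({
      loc,
      lastmod: childText(m[1], "lastmod"),
      changefreq: childText(m[1], "changefreq"),
      priority: priority === undefined ? undefined : Number(priority),
    });
  }
  return found;
}

// Build a URL-to-entry lookup so each cache check is a single Map access.
function indexByLoc(list: SitemapEntry[]): Map<string, SitemapEntry> {
  return new Map(list.map((e): [string, SitemapEntry] => [e.loc, e]));
}

const sampleXml = `
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url><loc>https://example.com/other</loc></url>
</urlset>`;

const sketchIndex = indexByLoc(parseSitemapSketch(sampleXml));
const pageEntry = sketchIndex.get("https://example.com/page");
```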

Inspecting Signals

Every evaluation includes the individual signal breakdown:

for (const signal of result.signals) {
  console.log(`${signal.name}: stale=${signal.stale}, confidence=${signal.confidence}, reason=${signal.reason}`);
}
// time_decay: stale=false, confidence=0.52, reason="Cache age 3600s (4% of max)"
// http_etag: stale=false, confidence=0.95, reason="ETag matches cached value"
// sitemap_lastmod: stale=false, confidence=0.70, reason="Sitemap lastmod is before cache time"
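
The breakdown is also handy for logging why an entry was judged stale. The sketch below picks the most confident signal that voted "stale". The field names follow the loop above; the `SignalReport` type and `strongestStaleSignal` helper are hypothetical.

```typescript
interface SignalReport {
  name: string;
  stale: boolean;
  confidence: number;
  reason: string;
}

// Return the highest-confidence signal voting "stale", or undefined if none do.
function strongestStaleSignal(reports: SignalReport[]): SignalReport | undefined {
  return reports
    .filter((s) => s.stale) // filter() returns a copy, so the input is not reordered
    .sort((a, b) => b.confidence - a.confidence)[0];
}

const sampleReports: SignalReport[] = [
  { name: "time_decay", stale: false, confidence: 0.52, reason: "Cache age 3600s (4% of max)" },
  { name: "http_etag", stale: true, confidence: 0.95, reason: "ETag differs from cached value" },
  { name: "sitemap_lastmod", stale: true, confidence: 0.7, reason: "Sitemap lastmod is after cache time" },
];

const culprit = strongestStaleSignal(sampleReports);
// culprit is the http_etag report, the most confident "stale" vote
```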