feedstock

Changelog

Release history for feedstock

v0.3.0

Bun-native performance

  • Gzip-compressed cache — cached HTML is now compressed with Bun.gzipSync() before storage, cutting cache DB size by 5-10x. Backward compatible with uncompressed legacy data.
  • Bun.hash() content hashing — replaced SHA-256 with wyhash via Bun.hash() for ~10-50x faster change detection. Non-cryptographic, ideal for cache invalidation.
  • Bun.sleep() — replaced setTimeout promise pattern in 5 locations (retry, rate limiter, anti-bot, CDP readiness polling) with native Bun.sleep().
  • Bun.file() / Bun.write() — storage state persistence now uses Bun-native file I/O instead of node:fs. loadStorageState and getStorageStatePath are now async.
  • Bun.env — config loader uses Bun.env instead of process.env.
  • MonitorDashboard — new Bun.serve()-powered live monitoring server. HTTP JSON API (/stats, /health) and WebSocket (/ws) for real-time crawl stats.
  • 349 tests (24 new), 148 dogfood checks passing.
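The backward-compatible read path for the compressed cache can be sketched as follows. This is a portable sketch using node:zlib rather than Bun.gzipSync(), and the gzip magic-byte check is an assumption about how legacy uncompressed entries are detected:

```typescript
import { gzipSync, gunzipSync } from "node:zlib";

// Write path: always store gzip-compressed bytes.
function encodeCacheEntry(html: string): Buffer {
  return gzipSync(Buffer.from(html, "utf8"));
}

// Read path: gzip streams begin with the magic bytes 0x1f 0x8b, so
// legacy uncompressed entries can be detected and passed through as-is.
function decodeCacheEntry(stored: Buffer): string {
  const isGzip = stored.length >= 2 && stored[0] === 0x1f && stored[1] === 0x8b;
  return isGzip ? gunzipSync(stored).toString("utf8") : stored.toString("utf8");
}
```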

v0.2.0

Agent-browser patterns

Patterns inspired by vercel-labs/agent-browser:

New browser backends

  • Generic CDP backend — { kind: "cdp", wsUrl: "..." } connects to any cloud browser provider (Browserbase, Browserless, etc.) via chromium.connectOverCDP()
  • Connection resilience — BrowserManager.start() retries transient errors (ECONNREFUSED, ETIMEDOUT, WebSocket errors) with exponential backoff (default 3 retries, 500ms base, 30s cap)
  • Lightpanda CDP readiness polling — polls /json/version endpoint before connecting, preventing race conditions on slow starts
  • Failed start cleanup — kills leaked Lightpanda processes and partial browser connections between retry attempts
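The retry-with-exponential-backoff behavior can be sketched as a small helper. This is a minimal sketch, not BrowserManager's internals; the injectable sleep parameter exists only to make the sketch testable:

```typescript
// Delays grow as baseMs * 2^attempt, capped at capMs, matching the
// defaults stated above (3 retries, 500ms base, 30s cap).
async function withBackoff<T>(
  fn: () => Promise<T>,
  opts = { retries: 3, baseMs: 500, capMs: 30_000 },
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms)),
): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 0; attempt <= opts.retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (attempt === opts.retries) break;
      await sleep(Math.min(opts.baseMs * 2 ** attempt, opts.capMs));
    }
  }
  throw lastErr;
}
```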

New extraction & detection

  • AccessibilityExtractionStrategy — extracts semantic content (headings, links, buttons, inputs, images) using the static Cheerio snapshot builder. Filterable by role, works without a live browser.
  • detectCursorInteractiveElements() — finds hidden interactive elements via page.evaluate() (cursor:pointer, onclick, tabindex, contenteditable, ARIA roles)

Incremental crawling

  • Content hashing in cache — contentHash() and cache.hasChanged() for detecting unchanged pages
  • Schema migration — existing cache databases automatically gain the content_hash column via ALTER TABLE
  • Crawler integration — crawler.crawl() now stores content hash on every cache write
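Change detection via content hashing can be sketched like this. The release uses the non-cryptographic Bun.hash() (wyhash); sha256 from node:crypto is a portable stand-in here, and the in-memory map stands in for the SQLite cache's content_hash column:

```typescript
import { createHash } from "node:crypto";

// A fingerprint of the page body for cheap equality comparison.
function contentHash(html: string): string {
  return createHash("sha256").update(html).digest("hex");
}

// In-memory stand-in for the cache's stored content_hash column.
const storedHashes = new Map<string, string>();

function hasChanged(url: string, html: string): boolean {
  const next = contentHash(html);
  const prev = storedHashes.get(url);
  storedHashes.set(url, next); // each cache write stores the hash alongside the page
  return prev !== next;
}
```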

Resource blocking profiles

  • Named profiles — "fast" (images/fonts/media), "minimal" (everything except HTML/JS), "media-only" (images/video/audio)
  • Custom config — { patterns: ["**/*.woff2"], resourceTypes: ["font"] }
  • Backward compatible — blockResources: true maps to "fast" profile
  • Environment variable — FEEDSTOCK_BLOCK_RESOURCES=fast
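Profile resolution might look roughly like this sketch; the exact resource-type list behind each profile name is an assumption based on the descriptions above, not the library's actual tables:

```typescript
// Assumed mapping from profile names to Playwright resource types; the
// "minimal" list is illustrative ("everything except HTML/JS").
const PROFILES: Record<string, string[]> = {
  fast: ["image", "font", "media"],
  minimal: ["image", "font", "media", "stylesheet", "xhr", "fetch"],
  "media-only": ["image", "media"],
};

// blockResources: true maps to the "fast" profile for back-compat.
function shouldBlock(profile: string | true, resourceType: string): boolean {
  const name = profile === true ? "fast" : profile;
  return (PROFILES[name] ?? []).includes(resourceType);
}
```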

Layered configuration

  • feedstock.json — project config file (searches cwd and parent directories)
  • FEEDSTOCK_* environment variables — FEEDSTOCK_CDP_URL, FEEDSTOCK_HEADLESS, FEEDSTOCK_PROXY, FEEDSTOCK_BLOCK_RESOURCES, FEEDSTOCK_PAGE_TIMEOUT, and more
  • loadConfig() — merges all layers with precedence: programmatic > env vars > project file > defaults
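The precedence rules reduce to a spread-based merge, sketched here with later layers winning (the key names are illustrative):

```typescript
type ConfigLayer = Record<string, unknown>;

// Later spreads win: programmatic > env vars > project file > defaults.
function mergeConfig(layers: {
  defaults: ConfigLayer;
  projectFile?: ConfigLayer;
  env?: ConfigLayer;
  programmatic?: ConfigLayer;
}): ConfigLayer {
  return {
    ...layers.defaults,
    ...layers.projectFile,
    ...layers.env,
    ...layers.programmatic,
  };
}
```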

Benchmarking

  • Benchmark suite (benchmarks/bench.ts) — 6 scenarios: cache write (100/1000), read, hasChanged, contentHash (10kb/100kb)
  • Warmup iterations, p50/stddev/min/max stats, --json output, name filtering

Testing

  • 325 tests (92 new), including regression suite for config defaults, cache operations, extraction strategies, snapshot structure, and model factories
  • Schema migration tested against old-format SQLite databases
  • Resource blocker tested with mock BrowserContext

v0.1.6

Auto-escalation on anti-bot blocks

  • Auto-escalate on 403/429/503 — when FetchEngine gets a blocked response (Cloudflare challenge, "Access Denied", rate limit), automatically escalates to PlaywrightEngine with stealth. No manual withRetry() needed.
  • autoEscalateOnBlock config option on EngineManager (default: true). Disable with { autoEscalateOnBlock: false }.
  • SPA detection scoped to 2xx responses — prevents false escalation on error pages with short bodies.
  • 7 new unit tests for escalation logic (403, 429, 503, disable flag, fallback without browser, SPA + block sequence).
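The escalation decision can be sketched as a predicate over the HTTP response; the body markers shown here are illustrative, not the library's exact list:

```typescript
const BLOCK_STATUSES = new Set([403, 429, 503]);

// Escalate from FetchEngine to a browser engine when the response
// looks blocked, unless auto-escalation is disabled.
function shouldEscalate(status: number, body: string, autoEscalateOnBlock = true): boolean {
  if (!autoEscalateOnBlock) return false;
  if (BLOCK_STATUSES.has(status)) return true;
  return /access denied|cf-challenge|just a moment/i.test(body);
}
```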

v0.1.5

Built-in stealth mode

  • stealth: true on BrowserConfig — one flag enables random user-agents (9 realistic browser UAs), navigator.webdriver override, Chrome runtime spoofing, plugin/language spoofing via addInitScript
  • simulateUser: true on CrawlerRunConfig — automatically runs random mouse movements and scrolling after navigation. No manual hooks needed.
  • Stealth is applied at context creation, so every page in the session gets it automatically

v0.1.4

Production hardening

  • Input validation — crawl(), crawlMany(), deepCrawl() validate URLs upfront with friendly error messages
  • FetchEngine retry — retries transient network errors (ECONNRESET, ETIMEDOUT) up to 2 times with backoff
  • Session limits — BrowserManager caps at 20 concurrent sessions with LRU eviction
  • Cache pruning — cache.pruneOlderThan(ms) and cache.size for TTL-based cleanup
  • Graceful shutdown — SIGINT/SIGTERM handlers auto-close browser on process exit
  • User-agent rotation — UserAgentRotator with 9 realistic browser user-agents, getRandomUserAgent()
  • Static interactive detection — detectInteractiveElementsStatic() works without browser for processHtml
  • processHtml wired — detectInteractiveElements config option now works on raw HTML
  • prepublishOnly — typecheck + lint + unit tests run before every npm publish
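The session cap with LRU eviction can be sketched with a Map, which iterates in insertion order. This is a minimal sketch, not BrowserManager's actual implementation:

```typescript
class SessionPool<T> {
  private sessions = new Map<string, T>();
  constructor(private maxSessions = 20) {}

  get(id: string): T | undefined {
    const s = this.sessions.get(id);
    if (s !== undefined) {
      // Re-insert to mark as most recently used.
      this.sessions.delete(id);
      this.sessions.set(id, s);
    }
    return s;
  }

  set(id: string, session: T): void {
    if (this.sessions.size >= this.maxSessions && !this.sessions.has(id)) {
      // First key in insertion order is the least recently used.
      const lru = this.sessions.keys().next().value as string;
      this.sessions.delete(lru);
    }
    this.sessions.set(id, session);
  }

  get size(): number {
    return this.sessions.size;
  }
}
```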

v0.1.3

Performance optimizations and Playwright-native features

Performance

  • HTMLRewriter streaming extraction — links, media, and metadata extracted via Bun's native streaming parser instead of Cheerio DOM. Zero DOM allocation for the extraction path.
  • scrapeAll() single-pass — reduced Cheerio parses from 4x to 1x per page
  • BM25 frequency maps — O(1) term lookup via pre-computed frequency maps and Sets
  • Static cheerio import — removed dynamic require() in snapshot module
  • Batch pathToRegex — single regex replace instead of char-by-char loop (-10 lines)
  • Simplified engine selection — removed dead code in selectEngines (-6 lines)
  • LCS hash-based default — lowered DP threshold to 10K for faster diffs on large pages
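The frequency-map idea behind the BM25 speedup: precompute term counts once per document so each query-term lookup is O(1) instead of a rescan. A minimal sketch (the tokenizer is illustrative):

```typescript
function termFrequencies(text: string): Map<string, number> {
  const freq = new Map<string, number>();
  // Lowercase alphanumeric tokens; real BM25 would also track doc length.
  for (const term of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    freq.set(term, (freq.get(term) ?? 0) + 1);
  }
  return freq;
}
```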

New Features

  • blockResources — abort images, CSS, fonts, and media during browser crawls via context.route(). Can cut page load 50-80%.
  • navigationWaitUntil — configurable navigation strategy: "commit" (fastest), "domcontentloaded" (default), "load", "networkidle"
  • extractInPage() — extract links/media/metadata directly inside the browser via page.evaluate(), eliminating HTML serialization round-trip
  • extractAllStreaming() — Bun HTMLRewriter-based extraction as a standalone function
  • Service worker blocking — serviceWorkers: 'block' on every browser context ensures route interception works on all sites

v0.1.2

Change tracking and dogfood validation

  • Change tracking — ChangeTracker detects new/changed/unchanged/removed pages between crawl runs using SHA-256 content hashing and LCS-based text diffing
  • Snapshot management: listSnapshots(), deleteSnapshot(), pruneOlderThan()
  • Text diffs with addition/deletion counts and grouped chunks
  • Configurable: diff markdown vs HTML, max diff chunks, custom DB path
  • 115 dogfood checks against real websites (example.com, Hacker News, Wikipedia)
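The new/changed/unchanged/removed classification can be sketched as a comparison of url-to-hash maps from two runs (a simplification of what ChangeTracker does; the diffing itself is omitted):

```typescript
// Compare url → contentHash maps from the previous and current run.
function classifyPages(prev: Map<string, string>, next: Map<string, string>) {
  const pages = {
    new: [] as string[],
    changed: [] as string[],
    unchanged: [] as string[],
    removed: [] as string[],
  };
  for (const [url, hash] of next) {
    if (!prev.has(url)) pages.new.push(url);
    else if (prev.get(url) !== hash) pages.changed.push(url);
    else pages.unchanged.push(url);
  }
  for (const url of prev.keys()) {
    if (!next.has(url)) pages.removed.push(url);
  }
  return pages;
}
```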

v0.1.1

Agent-browser features and fetch-first engine system

Engine System

  • Fetch-first architecture — FetchEngine (HTTP) tried before PlaywrightEngine (browser)
  • Auto-escalation: detects SPA shells (React, Next.js, Nuxt) and switches to browser
  • EngineManager with quality-scored engine selection and fallback chain
  • likelyNeedsJavaScript() heuristic for SPA detection
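A SPA-shell heuristic along the lines of likelyNeedsJavaScript() might look like this; the markers and the text-length threshold are illustrative assumptions, not the library's exact logic:

```typescript
// A framework mount point plus very little rendered text suggests a
// SPA shell that needs a real browser to render.
function likelyNeedsJavaScript(html: string): boolean {
  const markers = ['id="root"', 'id="__next"', 'id="__nuxt"', "window.__NUXT__"];
  const hasMarker = markers.some((m) => html.includes(m));
  const visibleText = html
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<[^>]*>/g, "")
    .trim();
  return hasMarker && visibleText.length < 200;
}
```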

Accessibility Snapshots

  • buildStaticSnapshot() — Cheerio-based semantic tree extraction (works with any engine)
  • takeSnapshot() — CDP-based Accessibility.getFullAXTree for browser-precise trees
  • Node categorization: interactive (button, link, input), content (heading, paragraph, img), structural (filtered)
  • @e ref system for deterministic element identification
  • New snapshot field on CrawlResult, enabled via config.snapshot = true

Rich Metadata (50+ fields)

  • Full Open Graph (12 fields), Twitter Card (7), Dublin Core (7), Article (5)
  • JSON-LD parsing, favicons, RSS/Atom feeds, alternate hreflang links
  • Charset, viewport, theme-color, robots, referrer, generator
  • Null values auto-stripped for cleaner output

Filter Denial Reasons

  • applyWithReason() on every filter returns { allowed, reason, filter }
  • Specific reasons: "Domain X is blocked", "Matched exclude pattern: Y", "File extension .pdf is blocked"
  • FilterChain.getDenials() and getDenialsByFilter() for aggregate tracking
  • Fully backward compatible — apply() still returns boolean
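The verdict shape can be illustrated with a domain filter. The interface fields follow the description above; the function itself is a sketch, not the library's DomainFilter:

```typescript
interface FilterVerdict {
  allowed: boolean;
  reason?: string;
  filter?: string;
}

// Example filter: a domain blocklist that reports why a URL was denied.
function domainFilterWithReason(url: string, blockedDomains: string[]): FilterVerdict {
  const host = new URL(url).hostname;
  if (blockedDomains.includes(host)) {
    return { allowed: false, reason: `Domain ${host} is blocked`, filter: "DomainFilter" };
  }
  return { allowed: true };
}
```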

Browser Utilities

  • Interactive element detection — detectInteractiveElements() finds all clickable elements via single JS evaluation (cursor:pointer, onclick, tabindex, ARIA roles)
  • Iframe content inlining — extractIframeContent() + inlineIframeContent()
  • Storage state persistence — saveStorageState() / loadStorageState() for cookies + localStorage

AI-Friendly Errors

  • toFriendlyError() converts 20+ error patterns (DNS, timeout, SSL, element interaction, browser crashes) into actionable messages
  • withFriendlyErrors() wrapper for any async operation
  • Auto-applied in crawler.crawl() error handler

Testing

  • 277 unit/integration tests (up from 191)
  • 115 dogfood checks against real websites
  • Battle tests: engine fallback, redirects, timeouts, 404s, cache modes, screenshots, custom JS, network capture

v0.1.0

Initial release — full-featured web crawler for TypeScript/Bun

Core Crawling

  • WebCrawler with crawl(), crawlMany(), processHtml() methods
  • BrowserConfig and CrawlerRunConfig with typed defaults
  • Concurrent crawling with configurable concurrency limit
  • Context manager pattern with start()/close() lifecycle

Engine System

  • Fetch-first architecture: tries lightweight HTTP fetch before launching a browser
  • FetchEngine (quality 5) — simple HTTP, no browser overhead
  • PlaywrightEngine (quality 50) — full Chromium/Firefox/WebKit
  • Auto-escalation: detects SPA shells (React, Next.js, Nuxt) and switches to browser
  • EngineManager with quality-scored engine selection

Browser Backends

  • Playwright — Chromium, Firefox, WebKit
  • Lightpanda — local mode via @lightpanda/browser, cloud mode via CDP WebSocket

Content Processing

  • HTML cleaning via Cheerio (strips scripts, styles, noise tags)
  • Link extraction with internal/external classification
  • Media extraction (images, videos, audio) with quality scoring
  • Rich metadata extraction — 50+ fields: Open Graph, Twitter Cards, Dublin Core, JSON-LD, article tags, favicons, feeds, alternates
  • Markdown generation via Turndown with citation support
  • Accessibility tree snapshots — compact semantic page representation with @e refs

Extraction Strategies

  • CSS selector extraction — map selectors to JSON fields
  • Regex extraction — pattern matching with named capture groups
  • XPath extraction — XPath-to-CSS conversion
  • Table extraction — structured headers, rows, captions

Deep Crawling

  • BFS (Breadth-First Search) — level-by-level with concurrent batching
  • DFS (Depth-First Search) — single-path depth exploration
  • BestFirst — score-based priority queue using composite scorers
  • Streaming mode via deepCrawlStream() async generator
  • maxDepth, maxPages, concurrency controls
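Level-by-level BFS with maxDepth/maxPages limits can be sketched like this (sequential for clarity, while the real strategy batches each level concurrently; getLinks stands in for crawling a page and extracting its links):

```typescript
async function bfsCrawl(
  start: string,
  getLinks: (url: string) => Promise<string[]>,
  { maxDepth = 2, maxPages = 50 } = {},
): Promise<string[]> {
  const visited = new Set<string>([start]);
  let frontier = [start];
  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    const nextLevel: string[] = [];
    for (const url of frontier) {
      for (const link of await getLinks(url)) {
        if (!visited.has(link) && visited.size < maxPages) {
          visited.add(link);
          nextLevel.push(link);
        }
      }
    }
    frontier = nextLevel; // advance one level at a time
  }
  return [...visited];
}
```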

URL Filtering

  • URLPatternFilter — glob/regex include/exclude patterns
  • DomainFilter — whitelist/blacklist domains
  • ContentTypeFilter — extension-based filtering
  • MaxDepthFilter — depth limit per URL
  • FilterChain — composable, short-circuit evaluation
  • Denial reasons — track why each URL was rejected with getDenials() / getDenialsByFilter()

URL Scoring

  • KeywordRelevanceScorer — match keywords in URL and anchor text
  • PathDepthScorer — shallower paths score higher
  • FreshnessScorer — URLs with recent dates score higher
  • DomainAuthorityScorer — preferred domains score highest
  • CompositeScorer — weighted averaging of multiple scorers
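Weighted-average composition can be sketched as follows, assuming each scorer returns a value in [0, 1]:

```typescript
type Scorer = (url: string) => number;

// Combine scorers as a weighted average of their outputs.
function compositeScorer(parts: Array<{ scorer: Scorer; weight: number }>): Scorer {
  const totalWeight = parts.reduce((sum, p) => sum + p.weight, 0);
  return (url) =>
    parts.reduce((sum, p) => sum + p.scorer(url) * p.weight, 0) / totalWeight;
}
```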

Caching

  • SQLite-based cache via bun:sqlite with WAL mode
  • 5 cache modes: Enabled, Disabled, ReadOnly, WriteOnly, Bypass
  • CacheValidator — HTTP HEAD requests with ETag/Last-Modified
  • Batch insert via setMany() (atomic transactions)

Rate Limiting & Compliance

  • Per-domain rate limiter with exponential backoff on 429/503
  • Gradual recovery on success
  • Configurable jitter, max delay, backoff/recovery factors
  • Robots.txt parser — User-agent matching, Allow/Disallow, Crawl-delay, Sitemap discovery, wildcard patterns

Anti-Bot

  • isBlocked() — detects Cloudflare challenges, CAPTCHAs, 403/429/503 blocks
  • applyStealthMode() — overrides navigator.webdriver, plugins, languages
  • simulateUser() — random mouse movements and scrolling
  • withRetry() — automatic retry with escalating delays

Content Filtering

  • PruningContentFilter — rule-based boilerplate removal
  • BM25ContentFilter — relevance-based filtering by query

Chunking

  • RegexChunking — split by patterns (default: paragraphs)
  • SlidingWindowChunking — word-count windows with overlap
  • FixedSizeChunking — character-count chunks with overlap
  • IdentityChunking — no splitting
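Sliding-window chunking reduces to fixed word windows advancing by (window - overlap) words; a minimal sketch of the idea:

```typescript
function slidingWindowChunks(text: string, windowSize = 100, overlap = 20): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const step = Math.max(1, windowSize - overlap); // guard against overlap >= window
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += step) {
    chunks.push(words.slice(i, i + windowSize).join(" "));
    if (i + windowSize >= words.length) break; // last window reached the end
  }
  return chunks;
}
```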

Browser Utilities

  • Interactive element detection — finds all clickable elements including cursor:pointer, onclick, tabindex
  • Iframe content inlining — extracts and merges iframe content into parent HTML
  • Storage state persistence — save/load cookies and localStorage between sessions
  • Hooks — onPageCreated, beforeGoto, afterGoto, onExecutionStarted, beforeReturnHtml

Infrastructure

  • Proxy rotation — round-robin with health tracking and auto-recovery
  • URL seeder — sitemap discovery via robots.txt chain
  • Crawler monitor — real-time stats (pages/sec, success rates, data volume)
  • AI-friendly errors — converts 20+ error patterns into actionable messages
  • Logging — ConsoleLogger with level filtering, SilentLogger, pluggable Logger interface

Developer Experience

  • Native TypeScript execution via Bun (no build step)
  • 260 tests via bun:test
  • Biome for linting and formatting
  • GitHub Actions CI
  • Apache-2.0 license