Changelog
Release history for feedstock
v0.3.0
Bun-native performance
- Gzip-compressed cache — cached HTML is now compressed with `Bun.gzipSync()` before storage, cutting cache DB size by 5-10x. Backward compatible with uncompressed legacy data.
- `Bun.hash()` content hashing — replaced SHA-256 with wyhash via `Bun.hash()` for ~10-50x faster change detection. Non-cryptographic, ideal for cache invalidation.
- `Bun.sleep()` — replaced the `setTimeout` promise pattern in 5 locations (retry, rate limiter, anti-bot, CDP readiness polling) with native `Bun.sleep()`.
- `Bun.file()` / `Bun.write()` — storage state persistence now uses Bun-native file I/O instead of `node:fs`. `loadStorageState` and `getStorageStatePath` are now async.
- `Bun.env` — config loader uses `Bun.env` instead of `process.env`.
- `MonitorDashboard` — new `Bun.serve()`-powered live monitoring server. HTTP JSON API (`/stats`, `/health`) and WebSocket (`/ws`) for real-time crawl stats.
- 349 tests (24 new), 148 dogfood checks passing.
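The backward-compatible read path can be sketched as follows. This is not feedstock's actual implementation; it uses `node:zlib` (also available under Bun) for portability, and relies on the fact that every gzip stream starts with the magic bytes `0x1f 0x8b`, which is enough to tell compressed rows from legacy plain-text ones.

```typescript
import { gzipSync, gunzipSync } from "node:zlib";

// Compress HTML before it goes into the cache row.
function compressHtml(html: string): Buffer {
  return gzipSync(Buffer.from(html, "utf8"));
}

// Read a cache row that may predate compression: gzip data always
// begins with 0x1f 0x8b, so legacy plain-text rows pass through unchanged.
function readCachedHtml(stored: Buffer): string {
  const isGzip = stored.length > 2 && stored[0] === 0x1f && stored[1] === 0x8b;
  return isGzip ? gunzipSync(stored).toString("utf8") : stored.toString("utf8");
}
```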
v0.2.0
Agent-browser patterns
Patterns inspired by vercel-labs/agent-browser:
New browser backends
- Generic CDP backend — `{ kind: "cdp", wsUrl: "..." }` connects to any cloud browser provider (Browserbase, Browserless, etc.) via `chromium.connectOverCDP()`
- Connection resilience — `BrowserManager.start()` retries transient errors (ECONNREFUSED, ETIMEDOUT, WebSocket errors) with exponential backoff (default 3 retries, 500ms base, 30s cap)
- Lightpanda CDP readiness polling — polls the `/json/version` endpoint before connecting, preventing race conditions on slow starts
- Failed start cleanup — kills leaked Lightpanda processes and partial browser connections between retry attempts
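The retry schedule described above (3 retries, 500ms base, 30s cap, transient errors only) can be sketched like this. The function names and the transient-error markers are illustrative, not feedstock's internals:

```typescript
// Exponential backoff: base delay doubles per attempt, capped.
function backoffDelay(attempt: number, baseMs = 500, capMs = 30_000): number {
  return Math.min(baseMs * 2 ** attempt, capMs);
}

// Illustrative transient-error markers; permanent errors fail fast.
const TRANSIENT = ["ECONNREFUSED", "ETIMEDOUT", "WebSocket"];

async function startWithRetry<T>(
  connect: () => Promise<T>,
  retries = 3,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await connect();
    } catch (err) {
      lastError = err;
      if (!TRANSIENT.some((t) => String(err).includes(t))) throw err;
      if (attempt < retries) {
        await new Promise((r) => setTimeout(r, backoffDelay(attempt)));
      }
    }
  }
  throw lastError;
}
```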
New extraction & detection
- `AccessibilityExtractionStrategy` — extracts semantic content (headings, links, buttons, inputs, images) using the static Cheerio snapshot builder. Filterable by role, works without a live browser.
- `detectCursorInteractiveElements()` — finds hidden interactive elements via `page.evaluate()` (cursor:pointer, onclick, tabindex, contenteditable, ARIA roles)
Incremental crawling
- Content hashing in cache — `contentHash()` and `cache.hasChanged()` for detecting unchanged pages
- Schema migration — existing cache databases automatically gain the `content_hash` column via `ALTER TABLE`
- Crawler integration — `crawler.crawl()` now stores a content hash on every cache write
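The change-detection idea is simple: compare the stored hash against the hash of freshly fetched HTML. A minimal sketch, using FNV-1a as a stand-in for the library's actual hash function (the comparison logic, not the hash, is the point):

```typescript
// FNV-1a, a stand-in for the real content hash.
function contentHashDemo(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

// The cache answers "has this page changed?" by comparing hashes,
// updating the stored hash on every write.
const storedHashes = new Map<string, string>();

function hasChangedDemo(url: string, freshHtml: string): boolean {
  const fresh = contentHashDemo(freshHtml);
  const prior = storedHashes.get(url);
  storedHashes.set(url, fresh);
  return prior !== undefined && prior !== fresh;
}
```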
Resource blocking profiles
- Named profiles — `"fast"` (images/fonts/media), `"minimal"` (everything except HTML/JS), `"media-only"` (images/video/audio)
- Custom config — `{ patterns: ["**/*.woff2"], resourceTypes: ["font"] }`
- Backward compatible — `blockResources: true` maps to the `"fast"` profile
- Environment variable — `FEEDSTOCK_BLOCK_RESOURCES=fast`
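A sketch of how the profile names might resolve to blocked resource types. The mapping is inferred from the descriptions above; feedstock's internal representation may differ:

```typescript
type ResourceType = "image" | "font" | "media" | "stylesheet" | "script" | "document";

// Inferred from the profile descriptions above (assumption, not the
// library's actual table).
const PROFILES: Record<string, ResourceType[]> = {
  fast: ["image", "font", "media"],
  minimal: ["image", "font", "media", "stylesheet"], // everything except HTML/JS
  "media-only": ["image", "media"],
};

// blockResources: true maps to the "fast" profile for backward compat.
function resolveBlockedTypes(option: boolean | string): ResourceType[] {
  if (option === false) return [];
  const profile = option === true ? "fast" : option;
  return PROFILES[profile] ?? [];
}
```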
Layered configuration
- `feedstock.json` project config file (searches cwd and parent directories)
- `FEEDSTOCK_*` environment variables — `FEEDSTOCK_CDP_URL`, `FEEDSTOCK_HEADLESS`, `FEEDSTOCK_PROXY`, `FEEDSTOCK_BLOCK_RESOURCES`, `FEEDSTOCK_PAGE_TIMEOUT`, and more
- `loadConfig()` — merges all layers with precedence: programmatic > env vars > project file > defaults
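The layering boils down to merging objects from lowest to highest precedence, where `undefined` never overwrites a set value. A minimal sketch (the option names are examples, not feedstock's full config surface):

```typescript
interface CrawlOptions {
  headless?: boolean;
  pageTimeout?: number;
  blockResources?: string;
}

// Later layers win; undefined values never clobber earlier ones.
function mergeLayers(...layers: CrawlOptions[]): CrawlOptions {
  const merged: CrawlOptions = {};
  for (const layer of layers) {
    for (const [key, value] of Object.entries(layer)) {
      if (value !== undefined) (merged as Record<string, unknown>)[key] = value;
    }
  }
  return merged;
}

// Lowest to highest: defaults < project file < env vars < programmatic.
const config = mergeLayers(
  { headless: true, pageTimeout: 30_000 }, // defaults
  { pageTimeout: 10_000 },                 // feedstock.json
  { blockResources: "fast" },              // FEEDSTOCK_* env vars
  { headless: false },                     // programmatic
);
```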
Benchmarking
- Benchmark suite (`benchmarks/bench.ts`) — 6 scenarios: cache write (100/1000), read, hasChanged, contentHash (10kb/100kb)
- Warmup iterations, p50/stddev/min/max stats, `--json` output, name filtering
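The reported statistics reduce to a few lines over sorted timing samples. A sketch (using population standard deviation; the suite's exact estimator is not specified here):

```typescript
// p50 / stddev / min / max over timing samples (ms).
function summarize(samples: number[]) {
  const sorted = [...samples].sort((a, b) => a - b);
  const mean = samples.reduce((sum, x) => sum + x, 0) / samples.length;
  const variance =
    samples.reduce((sum, x) => sum + (x - mean) ** 2, 0) / samples.length;
  return {
    p50: sorted[Math.floor(sorted.length / 2)],
    stddev: Math.sqrt(variance),
    min: sorted[0],
    max: sorted[sorted.length - 1],
  };
}
```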
Testing
- 325 tests (92 new), including regression suite for config defaults, cache operations, extraction strategies, snapshot structure, and model factories
- Schema migration tested against old-format SQLite databases
- Resource blocker tested with mock BrowserContext
v0.1.6
Auto-escalation on anti-bot blocks
- Auto-escalate on 403/429/503 — when FetchEngine gets a blocked response (Cloudflare challenge, "Access Denied", rate limit), it automatically escalates to PlaywrightEngine with stealth. No manual `withRetry()` needed.
- `autoEscalateOnBlock` config option on `EngineManager` (default: true). Disable with `{ autoEscalateOnBlock: false }`.
- SPA detection scoped to 2xx responses — prevents false escalation on error pages with short bodies.
- 7 new unit tests for escalation logic (403, 429, 503, disable flag, fallback without browser, SPA + block sequence).
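The escalation decision can be sketched as a predicate over status code and body. The marker list below is illustrative; feedstock's actual pattern set is not shown here:

```typescript
const BLOCK_STATUSES = new Set([403, 429, 503]);
// Illustrative anti-bot markers, not the library's real list.
const BLOCK_MARKERS = ["cf-challenge", "access denied", "just a moment"];

// A fetch result escalates to the browser engine when the status is a
// known block code or the body carries an anti-bot marker.
function shouldEscalate(status: number, body: string): boolean {
  if (BLOCK_STATUSES.has(status)) return true;
  const lower = body.toLowerCase();
  return BLOCK_MARKERS.some((marker) => lower.includes(marker));
}
```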
v0.1.5
Built-in stealth mode
- `stealth: true` on `BrowserConfig` — one flag enables random user-agents (9 realistic browser UAs), `navigator.webdriver` override, Chrome runtime spoofing, and plugin/language spoofing via `addInitScript`
- `simulateUser: true` on `CrawlerRunConfig` — automatically runs random mouse movements and scrolling after navigation. No manual hooks needed.
- Stealth is applied at context creation, so every page in the session gets it automatically
v0.1.4
Production hardening
- Input validation — `crawl()`, `crawlMany()`, `deepCrawl()` validate URLs upfront with friendly error messages
- FetchEngine retry — retries transient network errors (ECONNRESET, ETIMEDOUT) up to 2 times with backoff
- Session limits — `BrowserManager` caps at 20 concurrent sessions with LRU eviction
- Cache pruning — `cache.pruneOlderThan(ms)` and `cache.size` for TTL-based cleanup
- Graceful shutdown — SIGINT/SIGTERM handlers auto-close the browser on process exit
- User-agent rotation — `UserAgentRotator` with 9 realistic browser user-agents, `getRandomUserAgent()`
- Static interactive detection — `detectInteractiveElementsStatic()` works without a browser for `processHtml`
- `processHtml` wired — the `detectInteractiveElements` config option now works on raw HTML
- prepublishOnly — typecheck + lint + unit tests run before every `npm publish`
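The session cap with LRU eviction can be sketched with a `Map`, which preserves insertion order: re-inserting an entry on every access makes the first key the least recently used. This is an illustration of the technique, not `BrowserManager`'s actual code:

```typescript
class SessionPool<T> {
  private sessions = new Map<string, T>();
  constructor(private maxSessions = 20) {}

  get(id: string): T | undefined {
    const session = this.sessions.get(id);
    if (session !== undefined) {
      this.sessions.delete(id); // refresh recency
      this.sessions.set(id, session);
    }
    return session;
  }

  add(id: string, session: T): void {
    if (this.sessions.size >= this.maxSessions) {
      const oldest = this.sessions.keys().next().value as string;
      this.sessions.delete(oldest); // evict least recently used
    }
    this.sessions.set(id, session);
  }

  get size(): number {
    return this.sessions.size;
  }
}
```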
v0.1.3
Performance optimizations and Playwright-native features
Performance
- HTMLRewriter streaming extraction — links, media, and metadata extracted via Bun's native streaming parser instead of Cheerio DOM. Zero DOM allocation for the extraction path.
- `scrapeAll()` single-pass — reduced Cheerio parses from 4x to 1x per page
- BM25 frequency maps — O(1) term lookup via pre-computed frequency maps and Sets
- Static cheerio import — removed dynamic `require()` in the snapshot module
- Batch `pathToRegex` — single regex replace instead of a char-by-char loop (-10 lines)
- Simplified engine selection — removed dead code in `selectEngines` (-6 lines)
- LCS hash-based default — lowered the DP threshold to 10K for faster diffs on large pages
New Features
- `blockResources` — abort images, CSS, fonts, and media during browser crawls via `context.route()`. Can cut page load time by 50-80%.
- `navigationWaitUntil` — configurable navigation strategy: `"commit"` (fastest), `"domcontentloaded"` (default), `"load"`, `"networkidle"`
- `extractInPage()` — extract links/media/metadata directly inside the browser via `page.evaluate()`, eliminating the HTML serialization round-trip
- `extractAllStreaming()` — Bun HTMLRewriter-based extraction as a standalone function
- Service worker blocking — `serviceWorkers: 'block'` on every browser context ensures route interception works on all sites
v0.1.2
Change tracking and dogfood validation
- Change tracking — `ChangeTracker` detects new/changed/unchanged/removed pages between crawl runs using SHA-256 content hashing and LCS-based text diffing
- Snapshot management: `listSnapshots()`, `deleteSnapshot()`, `pruneOlderThan()`
- Text diffs with addition/deletion counts and grouped chunks
- Configurable: diff markdown vs HTML, max diff chunks, custom DB path
- 115 dogfood checks against real websites (example.com, Hacker News, Wikipedia)
v0.1.1
Agent-browser features and fetch-first engine system
Engine System
- Fetch-first architecture — `FetchEngine` (HTTP) tried before `PlaywrightEngine` (browser)
- Auto-escalation: detects SPA shells (React, Next.js, Nuxt) and switches to the browser
- `EngineManager` with quality-scored engine selection and fallback chain
- `likelyNeedsJavaScript()` heuristic for SPA detection
Accessibility Snapshots
- `buildStaticSnapshot()` — Cheerio-based semantic tree extraction (works with any engine)
- `takeSnapshot()` — CDP-based `Accessibility.getFullAXTree` for browser-precise trees
- Node categorization: interactive (button, link, input), content (heading, paragraph, img), structural (filtered)
- `@eref` system for deterministic element identification
- New `snapshot` field on `CrawlResult`, enabled via `config.snapshot = true`
Rich Metadata (50+ fields)
- Full Open Graph (12 fields), Twitter Card (7), Dublin Core (7), Article (5)
- JSON-LD parsing, favicons, RSS/Atom feeds, alternate hreflang links
- Charset, viewport, theme-color, robots, referrer, generator
- Null values auto-stripped for cleaner output
Filter Denial Reasons
- `applyWithReason()` on every filter returns `{ allowed, reason, filter }`
- Specific reasons: "Domain X is blocked", "Matched exclude pattern: Y", "File extension .pdf is blocked"
- `FilterChain.getDenials()` and `getDenialsByFilter()` for aggregate tracking
- Fully backward compatible — `apply()` still returns a boolean
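The dual API can be sketched as a filter whose boolean `apply()` is a thin wrapper over the reason-carrying method. The class below is a hypothetical illustration, not feedstock's `DomainFilter`:

```typescript
interface FilterVerdict {
  allowed: boolean;
  reason?: string;
  filter?: string;
}

// Hypothetical filter showing both APIs side by side.
class DomainFilterDemo {
  readonly name = "DomainFilter";
  constructor(private blocked: Set<string>) {}

  applyWithReason(url: string): FilterVerdict {
    const host = new URL(url).hostname;
    if (this.blocked.has(host)) {
      return { allowed: false, reason: `Domain ${host} is blocked`, filter: this.name };
    }
    return { allowed: true };
  }

  // Backward-compatible boolean API on top of the detailed one.
  apply(url: string): boolean {
    return this.applyWithReason(url).allowed;
  }
}
```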
Browser Utilities
- Interactive element detection — `detectInteractiveElements()` finds all clickable elements via a single JS evaluation (cursor:pointer, onclick, tabindex, ARIA roles)
- Iframe content inlining — `extractIframeContent()` + `inlineIframeContent()`
- Storage state persistence — `saveStorageState()` / `loadStorageState()` for cookies + localStorage
AI-Friendly Errors
- `toFriendlyError()` converts 20+ error patterns (DNS, timeout, SSL, element interaction, browser crashes) into actionable messages
- `withFriendlyErrors()` wrapper for any async operation
- Auto-applied in the `crawler.crawl()` error handler
Testing
- 277 unit/integration tests (up from 191)
- 115 dogfood checks against real websites
- Battle tests: engine fallback, redirects, timeouts, 404s, cache modes, screenshots, custom JS, network capture
v0.1.0
Initial release — full-featured web crawler for TypeScript/Bun
Core Crawling
- `WebCrawler` with `crawl()`, `crawlMany()`, `processHtml()` methods
- `BrowserConfig` and `CrawlerRunConfig` with typed defaults
- Concurrent crawling with a configurable concurrency limit
- Context manager pattern with `start()` / `close()` lifecycle
Engine System
- Fetch-first architecture: tries lightweight HTTP fetch before launching a browser
- `FetchEngine` (quality 5) — simple HTTP, no browser overhead
- `PlaywrightEngine` (quality 50) — full Chromium/Firefox/WebKit
- Auto-escalation: detects SPA shells (React, Next.js, Nuxt) and switches to the browser
- `EngineManager` with quality-scored engine selection
Browser Backends
- Playwright — Chromium, Firefox, WebKit
- Lightpanda — local mode via `@lightpanda/browser`, cloud mode via CDP WebSocket
Content Processing
- HTML cleaning via Cheerio (strips scripts, styles, noise tags)
- Link extraction with internal/external classification
- Media extraction (images, videos, audio) with quality scoring
- Rich metadata extraction — 50+ fields: Open Graph, Twitter Cards, Dublin Core, JSON-LD, article tags, favicons, feeds, alternates
- Markdown generation via Turndown with citation support
- Accessibility tree snapshots — compact semantic page representation with `@eref`s
Extraction Strategies
- CSS selector extraction — map selectors to JSON fields
- Regex extraction — pattern matching with named capture groups
- XPath extraction — XPath-to-CSS conversion
- Table extraction — structured headers, rows, captions
Deep Crawling
- BFS (Breadth-First Search) — level-by-level with concurrent batching
- DFS (Depth-First Search) — single-path depth exploration
- BestFirst — score-based priority queue using composite scorers
- Streaming mode via `deepCrawlStream()` async generator
- `maxDepth`, `maxPages`, `concurrency` controls
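The BFS strategy with `maxDepth`/`maxPages` limits can be sketched over a plain link graph, with `fetchLinks` standing in for a real crawl step. A simplified illustration of the technique, not the library's implementation:

```typescript
// Level-by-level BFS with depth and page-count limits.
function bfsCrawl(
  start: string,
  fetchLinks: (url: string) => string[],
  { maxDepth = 2, maxPages = 50 } = {},
): string[] {
  const visited = new Set<string>([start]);
  let frontier = [start];
  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      for (const link of fetchLinks(url)) {
        if (visited.size >= maxPages) return [...visited];
        if (!visited.has(link)) {
          visited.add(link);
          next.push(link);
        }
      }
    }
    frontier = next; // advance one level at a time
  }
  return [...visited];
}
```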
URL Filtering
- `URLPatternFilter` — glob/regex include/exclude patterns
- `DomainFilter` — whitelist/blacklist domains
- `ContentTypeFilter` — extension-based filtering
- `MaxDepthFilter` — depth limit per URL
- `FilterChain` — composable, short-circuit evaluation
- Denial reasons — track why each URL was rejected with `getDenials()` / `getDenialsByFilter()`
URL Scoring
- `KeywordRelevanceScorer` — match keywords in URL and anchor text
- `PathDepthScorer` — shallower paths score higher
- `FreshnessScorer` — URLs with recent dates score higher
- `DomainAuthorityScorer` — preferred domains score highest
- `CompositeScorer` — weighted averaging of multiple scorers
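Weighted averaging of scorers can be sketched as follows. The component scorer formulas are made up for illustration; only the composite-weighting idea mirrors the description above:

```typescript
type Scorer = (url: string) => number;

// Weighted average of individual scorers.
function compositeScore(
  url: string,
  parts: Array<{ scorer: Scorer; weight: number }>,
): number {
  const totalWeight = parts.reduce((sum, p) => sum + p.weight, 0);
  const weighted = parts.reduce((sum, p) => sum + p.scorer(url) * p.weight, 0);
  return totalWeight === 0 ? 0 : weighted / totalWeight;
}

// Illustrative component scorers (not the library's formulas).
const keywordScorer: Scorer = (url) => (url.includes("blog") ? 1 : 0);
const pathDepthScorer: Scorer = (url) =>
  1 / (new URL(url).pathname.split("/").filter(Boolean).length + 1);
```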
Caching
- SQLite-based cache via `bun:sqlite` with WAL mode
- 5 cache modes: Enabled, Disabled, ReadOnly, WriteOnly, Bypass
- `CacheValidator` — HTTP HEAD requests with ETag/Last-Modified
- Batch insert via `setMany()` (atomic transactions)
Rate Limiting & Compliance
- Per-domain rate limiter with exponential backoff on 429/503
- Gradual recovery on success
- Configurable jitter, max delay, backoff/recovery factors
- Robots.txt parser — User-agent matching, Allow/Disallow, Crawl-delay, Sitemap discovery, wildcard patterns
Anti-Bot
- `isBlocked()` — detects Cloudflare challenges, CAPTCHAs, 403/429/503 blocks
- `applyStealthMode()` — overrides navigator.webdriver, plugins, languages
- `simulateUser()` — random mouse movements and scrolling
- `withRetry()` — automatic retry with escalating delays
Content Filtering
- `PruningContentFilter` — rule-based boilerplate removal
- `BM25ContentFilter` — relevance-based filtering by query
Chunking
- `RegexChunking` — split by patterns (default: paragraphs)
- `SlidingWindowChunking` — word-count windows with overlap
- `FixedSizeChunking` — character-count chunks with overlap
- `IdentityChunking` — no splitting
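The sliding-window strategy can be sketched in a few lines: step through the word list by `windowSize - overlap` words at a time. An illustration of the technique, not `SlidingWindowChunking` itself:

```typescript
// Word-count windows with overlap.
function slidingWindowChunks(text: string, windowSize: number, overlap: number): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const step = windowSize - overlap;
  if (step <= 0) throw new Error("overlap must be smaller than windowSize");
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + windowSize).join(" "));
    if (start + windowSize >= words.length) break; // last window reached the end
  }
  return chunks;
}
```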
Browser Utilities
- Interactive element detection — finds all clickable elements including cursor:pointer, onclick, tabindex
- Iframe content inlining — extracts and merges iframe content into parent HTML
- Storage state persistence — save/load cookies and localStorage between sessions
- Hooks — onPageCreated, beforeGoto, afterGoto, onExecutionStarted, beforeReturnHtml
Infrastructure
- Proxy rotation — round-robin with health tracking and auto-recovery
- URL seeder — sitemap discovery via robots.txt chain
- Crawler monitor — real-time stats (pages/sec, success rates, data volume)
- AI-friendly errors — converts 20+ error patterns into actionable messages
- Logging — ConsoleLogger with level filtering, SilentLogger, pluggable Logger interface
Developer Experience
- Native TypeScript execution via Bun (no build step)
- 260 tests via `bun:test`
- Biome for linting and formatting
- GitHub Actions CI
- Apache-2.0 license