feedstock

Changelog

Release history for feedstock

v0.3.0

Bun-native performance

  • Gzip-compressed cache — cached HTML is now compressed with Bun.gzipSync() before storage, cutting cache DB size by 5-10x. Backward compatible with uncompressed legacy data.
  • Bun.hash() content hashing — replaced SHA-256 with wyhash via Bun.hash() for ~10-50x faster change detection. Non-cryptographic, ideal for cache invalidation.
  • Bun.sleep() — replaced setTimeout promise pattern in 5 locations (retry, rate limiter, anti-bot, CDP readiness polling) with native Bun.sleep().
  • Bun.file() / Bun.write() — storage state persistence now uses Bun-native file I/O instead of node:fs. loadStorageState and getStorageStatePath are now async.
  • Bun.env — config loader uses Bun.env instead of process.env.
  • MonitorDashboard — new Bun.serve()-powered live monitoring server. HTTP JSON API (/stats, /health) and WebSocket (/ws) for real-time crawl stats.
  • 349 tests (24 new), 148 dogfood checks passing.
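The backward-compatible read path for the compressed cache can be sketched as follows. This is a portable sketch using node:zlib rather than Bun.gzipSync(), and the gzip magic-byte check is an assumption about how legacy uncompressed entries are detected:

```typescript
import { gzipSync, gunzipSync } from "node:zlib";

// Write path: always store gzip-compressed bytes.
function encodeCacheEntry(html: string): Buffer {
  return gzipSync(Buffer.from(html, "utf8"));
}

// Read path: gzip streams begin with the magic bytes 0x1f 0x8b, so
// legacy uncompressed entries can be detected and passed through as-is.
function decodeCacheEntry(stored: Buffer): string {
  const isGzip = stored.length >= 2 && stored[0] === 0x1f && stored[1] === 0x8b;
  return isGzip ? gunzipSync(stored).toString("utf8") : stored.toString("utf8");
}
```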

v0.2.0

Agent-browser patterns

Patterns inspired by vercel-labs/agent-browser:

New browser backends

  • Generic CDP backend — { kind: "cdp", wsUrl: "..." } connects to any cloud browser provider (Browserbase, Browserless, etc.) via chromium.connectOverCDP()
  • Connection resilience — BrowserManager.start() retries transient errors (ECONNREFUSED, ETIMEDOUT, WebSocket errors) with exponential backoff (default 3 retries, 500ms base, 30s cap)
  • Lightpanda CDP readiness polling — polls /json/version endpoint before connecting, preventing race conditions on slow starts
  • Failed start cleanup — kills leaked Lightpanda processes and partial browser connections between retry attempts
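The retry-with-exponential-backoff behavior can be sketched as a small helper. This is a minimal sketch, not BrowserManager's internals; the injectable sleep parameter exists only to make the sketch testable:

```typescript
// Delays grow as baseMs * 2^attempt, capped at capMs, matching the
// defaults stated above (3 retries, 500ms base, 30s cap).
async function withBackoff<T>(
  fn: () => Promise<T>,
  opts = { retries: 3, baseMs: 500, capMs: 30_000 },
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms)),
): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 0; attempt <= opts.retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (attempt === opts.retries) break;
      await sleep(Math.min(opts.baseMs * 2 ** attempt, opts.capMs));
    }
  }
  throw lastErr;
}
```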

New extraction & detection

  • AccessibilityExtractionStrategy — extracts semantic content (headings, links, buttons, inputs, images) using the static Cheerio snapshot builder. Filterable by role, works without a live browser.
  • detectCursorInteractiveElements() — finds hidden interactive elements via page.evaluate() (cursor:pointer, onclick, tabindex, contenteditable, ARIA roles)

Incremental crawling

  • Content hashing in cache — contentHash() and cache.hasChanged() for detecting unchanged pages
  • Schema migration — existing cache databases automatically gain the content_hash column via ALTER TABLE
  • Crawler integration — crawler.crawl() now stores content hash on every cache write
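Change detection via content hashing can be sketched like this. The release uses the non-cryptographic Bun.hash() (wyhash); sha256 from node:crypto is a portable stand-in here, and the in-memory map stands in for the SQLite cache's content_hash column:

```typescript
import { createHash } from "node:crypto";

// A fingerprint of the page body for cheap equality comparison.
function contentHash(html: string): string {
  return createHash("sha256").update(html).digest("hex");
}

// In-memory stand-in for the cache's stored content_hash column.
const storedHashes = new Map<string, string>();

function hasChanged(url: string, html: string): boolean {
  const next = contentHash(html);
  const prev = storedHashes.get(url);
  storedHashes.set(url, next); // each cache write stores the hash alongside the page
  return prev !== next;
}
```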

Resource blocking profiles

  • Named profiles — "fast" (images/fonts/media), "minimal" (everything except HTML/JS), "media-only" (images/video/audio)
  • Custom config — { patterns: ["**/*.woff2"], resourceTypes: ["font"] }
  • Backward compatible — blockResources: true maps to "fast" profile
  • Environment variable — FEEDSTOCK_BLOCK_RESOURCES=fast
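Profile resolution might look roughly like this sketch; the exact resource-type list behind each profile name is an assumption based on the descriptions above, not the library's actual tables:

```typescript
// Assumed mapping from profile names to Playwright resource types; the
// "minimal" list is illustrative ("everything except HTML/JS").
const PROFILES: Record<string, string[]> = {
  fast: ["image", "font", "media"],
  minimal: ["image", "font", "media", "stylesheet", "xhr", "fetch"],
  "media-only": ["image", "media"],
};

// blockResources: true maps to the "fast" profile for back-compat.
function shouldBlock(profile: string | true, resourceType: string): boolean {
  const name = profile === true ? "fast" : profile;
  return (PROFILES[name] ?? []).includes(resourceType);
}
```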

Layered configuration

  • feedstock.json — project config file (searches cwd and parent directories)
  • FEEDSTOCK_* environment variables — FEEDSTOCK_CDP_URL, FEEDSTOCK_HEADLESS, FEEDSTOCK_PROXY, FEEDSTOCK_BLOCK_RESOURCES, FEEDSTOCK_PAGE_TIMEOUT, and more
  • loadConfig() — merges all layers with precedence: programmatic > env vars > project file > defaults
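The precedence rules reduce to a spread-based merge, sketched here with later layers winning (the key names are illustrative):

```typescript
type ConfigLayer = Record<string, unknown>;

// Later spreads win: programmatic > env vars > project file > defaults.
function mergeConfig(layers: {
  defaults: ConfigLayer;
  projectFile?: ConfigLayer;
  env?: ConfigLayer;
  programmatic?: ConfigLayer;
}): ConfigLayer {
  return {
    ...layers.defaults,
    ...layers.projectFile,
    ...layers.env,
    ...layers.programmatic,
  };
}
```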

Benchmarking

  • Benchmark suite (benchmarks/bench.ts) — 6 scenarios: cache write (100/1000), read, hasChanged, contentHash (10kb/100kb)
  • Warmup iterations, p50/stddev/min/max stats, --json output, name filtering

Testing

  • 325 tests (92 new), including regression suite for config defaults, cache operations, extraction strategies, snapshot structure, and model factories
  • Schema migration tested against old-format SQLite databases
  • Resource blocker tested with mock BrowserContext

v0.1.6

Auto-escalation on anti-bot blocks

  • Auto-escalate on 403/429/503 — when FetchEngine gets a blocked response (Cloudflare challenge, "Access Denied", rate limit), automatically escalates to PlaywrightEngine with stealth. No manual withRetry() needed.
  • autoEscalateOnBlock config option on EngineManager (default: true). Disable with { autoEscalateOnBlock: false }.
  • SPA detection scoped to 2xx responses — prevents false escalation on error pages with short bodies.
  • 7 new unit tests for escalation logic (403, 429, 503, disable flag, fallback without browser, SPA + block sequence).
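The escalation decision can be sketched as a predicate over the HTTP response; the body markers shown here are illustrative, not the library's exact list:

```typescript
const BLOCK_STATUSES = new Set([403, 429, 503]);

// Escalate from FetchEngine to a browser engine when the response
// looks blocked, unless auto-escalation is disabled.
function shouldEscalate(status: number, body: string, autoEscalateOnBlock = true): boolean {
  if (!autoEscalateOnBlock) return false;
  if (BLOCK_STATUSES.has(status)) return true;
  return /access denied|cf-challenge|just a moment/i.test(body);
}
```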

v0.1.5

Built-in stealth mode

  • stealth: true on BrowserConfig — one flag enables random user-agents (9 realistic browser UAs), navigator.webdriver override, Chrome runtime spoofing, plugin/language spoofing via addInitScript
  • simulateUser: true on CrawlerRunConfig — automatically runs random mouse movements and scrolling after navigation. No manual hooks needed.
  • Stealth is applied at context creation, so every page in the session gets it automatically

v0.1.4

Production hardening

  • Input validation — crawl(), crawlMany(), deepCrawl() validate URLs upfront with friendly error messages
  • FetchEngine retry — retries transient network errors (ECONNRESET, ETIMEDOUT) up to 2 times with backoff
  • Session limits — BrowserManager caps at 20 concurrent sessions with LRU eviction
  • Cache pruning — cache.pruneOlderThan(ms) and cache.size for TTL-based cleanup
  • Graceful shutdown — SIGINT/SIGTERM handlers auto-close browser on process exit
  • User-agent rotation — UserAgentRotator with 9 realistic browser user-agents, getRandomUserAgent()
  • Static interactive detection — detectInteractiveElementsStatic() works without browser for processHtml
  • processHtml wired — detectInteractiveElements config option now works on raw HTML
  • prepublishOnly — typecheck + lint + unit tests run before every npm publish
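The session cap with LRU eviction can be sketched with a Map, which iterates in insertion order. This is a minimal sketch, not BrowserManager's actual implementation:

```typescript
class SessionPool<T> {
  private sessions = new Map<string, T>();
  constructor(private maxSessions = 20) {}

  get(id: string): T | undefined {
    const s = this.sessions.get(id);
    if (s !== undefined) {
      // Re-insert to mark as most recently used.
      this.sessions.delete(id);
      this.sessions.set(id, s);
    }
    return s;
  }

  set(id: string, session: T): void {
    if (this.sessions.size >= this.maxSessions && !this.sessions.has(id)) {
      // First key in insertion order is the least recently used.
      const lru = this.sessions.keys().next().value as string;
      this.sessions.delete(lru);
    }
    this.sessions.set(id, session);
  }

  get size(): number {
    return this.sessions.size;
  }
}
```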

v0.1.3

Performance optimizations and Playwright-native features

Performance

  • HTMLRewriter streaming extraction — links, media, and metadata extracted via Bun's native streaming parser instead of Cheerio DOM. Zero DOM allocation for the extraction path.
  • scrapeAll() single-pass — reduced Cheerio parses from 4x to 1x per page
  • BM25 frequency maps — O(1) term lookup via pre-computed frequency maps and Sets
  • Static cheerio import — removed dynamic require() in snapshot module
  • Batch pathToRegex — single regex replace instead of char-by-char loop (-10 lines)
  • Simplified engine selection — removed dead code in selectEngines (-6 lines)
  • LCS hash-based default — lowered DP threshold to 10K for faster diffs on large pages
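The frequency-map idea behind the BM25 speedup: precompute term counts once per document so each query-term lookup is O(1) instead of a rescan. A minimal sketch (the tokenizer is illustrative):

```typescript
function termFrequencies(text: string): Map<string, number> {
  const freq = new Map<string, number>();
  // Lowercase alphanumeric tokens; real BM25 would also track doc length.
  for (const term of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    freq.set(term, (freq.get(term) ?? 0) + 1);
  }
  return freq;
}
```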

New Features

  • blockResources — abort images, CSS, fonts, and media during browser crawls via context.route(). Can cut page load 50-80%.
  • navigationWaitUntil — configurable navigation strategy: "commit" (fastest), "domcontentloaded" (default), "load", "networkidle"
  • extractInPage() — extract links/media/metadata directly inside the browser via page.evaluate(), eliminating HTML serialization round-trip
  • extractAllStreaming() — Bun HTMLRewriter-based extraction as a standalone function
  • Service worker blocking — serviceWorkers: 'block' on every browser context ensures route interception works on all sites

v0.1.2

Change tracking and dogfood validation

  • Change tracking — ChangeTracker detects new/changed/unchanged/removed pages between crawl runs using SHA-256 content hashing and LCS-based text diffing
  • Snapshot management: listSnapshots(), deleteSnapshot(), pruneOlderThan()
  • Text diffs with addition/deletion counts and grouped chunks
  • Configurable: diff markdown vs HTML, max diff chunks, custom DB path
  • 115 dogfood checks against real websites (example.com, Hacker News, Wikipedia)
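The new/changed/unchanged/removed classification can be sketched as a comparison of url-to-hash maps from two runs (a simplification of what ChangeTracker does; the diffing itself is omitted):

```typescript
// Compare url → contentHash maps from the previous and current run.
function classifyPages(prev: Map<string, string>, next: Map<string, string>) {
  const pages = {
    new: [] as string[],
    changed: [] as string[],
    unchanged: [] as string[],
    removed: [] as string[],
  };
  for (const [url, hash] of next) {
    if (!prev.has(url)) pages.new.push(url);
    else if (prev.get(url) !== hash) pages.changed.push(url);
    else pages.unchanged.push(url);
  }
  for (const url of prev.keys()) {
    if (!next.has(url)) pages.removed.push(url);
  }
  return pages;
}
```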

v0.1.1

Agent-browser features and fetch-first engine system

Engine System

  • Fetch-first architecture — FetchEngine (HTTP) tried before PlaywrightEngine (browser)
  • Auto-escalation: detects SPA shells (React, Next.js, Nuxt) and switches to browser
  • EngineManager with quality-scored engine selection and fallback chain
  • likelyNeedsJavaScript() heuristic for SPA detection
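A SPA-shell heuristic along the lines of likelyNeedsJavaScript() might look like this; the markers and the text-length threshold are illustrative assumptions, not the library's exact logic:

```typescript
// A framework mount point plus very little rendered text suggests a
// SPA shell that needs a real browser to render.
function likelyNeedsJavaScript(html: string): boolean {
  const markers = ['id="root"', 'id="__next"', 'id="__nuxt"', "window.__NUXT__"];
  const hasMarker = markers.some((m) => html.includes(m));
  const visibleText = html
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<[^>]*>/g, "")
    .trim();
  return hasMarker && visibleText.length < 200;
}
```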

Accessibility Snapshots

  • buildStaticSnapshot() — Cheerio-based semantic tree extraction (works with any engine)
  • takeSnapshot() — CDP-based Accessibility.getFullAXTree for browser-precise trees
  • Node categorization: interactive (button, link, input), content (heading, paragraph, img), structural (filtered)
  • @e ref system for deterministic element identification
  • New snapshot field on CrawlResult, enabled via config.snapshot = true

Rich Metadata (50+ fields)

  • Full Open Graph (12 fields), Twitter Card (7), Dublin Core (7), Article (5)
  • JSON-LD parsing, favicons, RSS/Atom feeds, alternate hreflang links
  • Charset, viewport, theme-color, robots, referrer, generator
  • Null values auto-stripped for cleaner output

Filter Denial Reasons

  • applyWithReason() on every filter returns { allowed, reason, filter }
  • Specific reasons: "Domain X is blocked", "Matched exclude pattern: Y", "File extension .pdf is blocked"
  • FilterChain.getDenials() and getDenialsByFilter() for aggregate tracking
  • Fully backward compatible — apply() still returns boolean
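The verdict shape can be illustrated with a domain filter. The interface fields follow the description above; the function itself is a sketch, not the library's DomainFilter:

```typescript
interface FilterVerdict {
  allowed: boolean;
  reason?: string;
  filter?: string;
}

// Example filter: a domain blocklist that reports why a URL was denied.
function domainFilterWithReason(url: string, blockedDomains: string[]): FilterVerdict {
  const host = new URL(url).hostname;
  if (blockedDomains.includes(host)) {
    return { allowed: false, reason: `Domain ${host} is blocked`, filter: "DomainFilter" };
  }
  return { allowed: true };
}
```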

Browser Utilities

  • Interactive element detection — detectInteractiveElements() finds all clickable elements via single JS evaluation (cursor:pointer, onclick, tabindex, ARIA roles)
  • Iframe content inlining — extractIframeContent() + inlineIframeContent()
  • Storage state persistence — saveStorageState() / loadStorageState() for cookies + localStorage

AI-Friendly Errors

  • toFriendlyError() converts 20+ error patterns (DNS, timeout, SSL, element interaction, browser crashes) into actionable messages
  • withFriendlyErrors() wrapper for any async operation
  • Auto-applied in crawler.crawl() error handler

Testing

  • 277 unit/integration tests (up from 191)
  • 115 dogfood checks against real websites
  • Battle tests: engine fallback, redirects, timeouts, 404s, cache modes, screenshots, custom JS, network capture

v0.1.0

Initial release — full-featured web crawler for TypeScript/Bun

Core Crawling

  • WebCrawler with crawl(), crawlMany(), processHtml() methods
  • BrowserConfig and CrawlerRunConfig with typed defaults
  • Concurrent crawling with configurable concurrency limit
  • Context manager pattern with start()/close() lifecycle

Engine System

  • Fetch-first architecture: tries lightweight HTTP fetch before launching a browser
  • FetchEngine (quality 5) — simple HTTP, no browser overhead
  • PlaywrightEngine (quality 50) — full Chromium/Firefox/WebKit
  • Auto-escalation: detects SPA shells (React, Next.js, Nuxt) and switches to browser
  • EngineManager with quality-scored engine selection

Browser Backends

  • Playwright — Chromium, Firefox, WebKit
  • Lightpanda — local mode via @lightpanda/browser, cloud mode via CDP WebSocket

Content Processing

  • HTML cleaning via Cheerio (strips scripts, styles, noise tags)
  • Link extraction with internal/external classification
  • Media extraction (images, videos, audio) with quality scoring
  • Rich metadata extraction — 50+ fields: Open Graph, Twitter Cards, Dublin Core, JSON-LD, article tags, favicons, feeds, alternates
  • Markdown generation via Turndown with citation support
  • Accessibility tree snapshots — compact semantic page representation with @e refs

Extraction Strategies

  • CSS selector extraction — map selectors to JSON fields
  • Regex extraction — pattern matching with named capture groups
  • XPath extraction — XPath-to-CSS conversion
  • Table extraction — structured headers, rows, captions

Deep Crawling

  • BFS (Breadth-First Search) — level-by-level with concurrent batching
  • DFS (Depth-First Search) — single-path depth exploration
  • BestFirst — score-based priority queue using composite scorers
  • Streaming mode via deepCrawlStream() async generator
  • maxDepth, maxPages, concurrency controls
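Level-by-level BFS with maxDepth/maxPages limits can be sketched like this (sequential for clarity, while the real strategy batches each level concurrently; getLinks stands in for crawling a page and extracting its links):

```typescript
async function bfsCrawl(
  start: string,
  getLinks: (url: string) => Promise<string[]>,
  { maxDepth = 2, maxPages = 50 } = {},
): Promise<string[]> {
  const visited = new Set<string>([start]);
  let frontier = [start];
  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    const nextLevel: string[] = [];
    for (const url of frontier) {
      for (const link of await getLinks(url)) {
        if (!visited.has(link) && visited.size < maxPages) {
          visited.add(link);
          nextLevel.push(link);
        }
      }
    }
    frontier = nextLevel; // advance one level at a time
  }
  return [...visited];
}
```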

URL Filtering

  • URLPatternFilter — glob/regex include/exclude patterns
  • DomainFilter — whitelist/blacklist domains
  • ContentTypeFilter — extension-based filtering
  • MaxDepthFilter — depth limit per URL
  • FilterChain — composable, short-circuit evaluation
  • Denial reasons — track why each URL was rejected with getDenials() / getDenialsByFilter()

URL Scoring

  • KeywordRelevanceScorer — match keywords in URL and anchor text
  • PathDepthScorer — shallower paths score higher
  • FreshnessScorer — URLs with recent dates score higher
  • DomainAuthorityScorer — preferred domains score highest
  • CompositeScorer — weighted averaging of multiple scorers
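Weighted-average composition can be sketched as follows, assuming each scorer returns a value in [0, 1]:

```typescript
type Scorer = (url: string) => number;

// Combine scorers as a weighted average of their outputs.
function compositeScorer(parts: Array<{ scorer: Scorer; weight: number }>): Scorer {
  const totalWeight = parts.reduce((sum, p) => sum + p.weight, 0);
  return (url) =>
    parts.reduce((sum, p) => sum + p.scorer(url) * p.weight, 0) / totalWeight;
}
```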

Caching

  • SQLite-based cache via bun:sqlite with WAL mode
  • 5 cache modes: Enabled, Disabled, ReadOnly, WriteOnly, Bypass
  • CacheValidator — HTTP HEAD requests with ETag/Last-Modified
  • Batch insert via setMany() (atomic transactions)

Rate Limiting & Compliance

  • Per-domain rate limiter with exponential backoff on 429/503
  • Gradual recovery on success
  • Configurable jitter, max delay, backoff/recovery factors
  • Robots.txt parser — User-agent matching, Allow/Disallow, Crawl-delay, Sitemap discovery, wildcard patterns

Anti-Bot

  • isBlocked() — detects Cloudflare challenges, CAPTCHAs, 403/429/503 blocks
  • applyStealthMode() — overrides navigator.webdriver, plugins, languages
  • simulateUser() — random mouse movements and scrolling
  • withRetry() — automatic retry with escalating delays

Content Filtering

  • PruningContentFilter — rule-based boilerplate removal
  • BM25ContentFilter — relevance-based filtering by query

Chunking

  • RegexChunking — split by patterns (default: paragraphs)
  • SlidingWindowChunking — word-count windows with overlap
  • FixedSizeChunking — character-count chunks with overlap
  • IdentityChunking — no splitting
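Sliding-window chunking reduces to fixed word windows advancing by (window - overlap) words; a minimal sketch of the idea:

```typescript
function slidingWindowChunks(text: string, windowSize = 100, overlap = 20): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const step = Math.max(1, windowSize - overlap); // guard against overlap >= window
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += step) {
    chunks.push(words.slice(i, i + windowSize).join(" "));
    if (i + windowSize >= words.length) break; // last window reached the end
  }
  return chunks;
}
```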

Browser Utilities

  • Interactive element detection — finds all clickable elements including cursor:pointer, onclick, tabindex
  • Iframe content inlining — extracts and merges iframe content into parent HTML
  • Storage state persistence — save/load cookies and localStorage between sessions
  • Hooks — onPageCreated, beforeGoto, afterGoto, onExecutionStarted, beforeReturnHtml

Infrastructure

  • Proxy rotation — round-robin with health tracking and auto-recovery
  • URL seeder — sitemap discovery via robots.txt chain
  • Crawler monitor — real-time stats (pages/sec, success rates, data volume)
  • AI-friendly errors — converts 20+ error patterns into actionable messages
  • Logging — ConsoleLogger with level filtering, SilentLogger, pluggable Logger interface

Developer Experience

  • Native TypeScript execution via Bun (no build step)
  • 260 tests via bun:test
  • Biome for linting and formatting
  • GitHub Actions CI
  • Apache-2.0 license