DOM Downsampling

DOM downsampling preprocesses HTML to shrink it, typically by 30-85% depending on the page, while preserving semantic content. Inspired by the D2Snap paper, it applies six processing passes that strip noise before the HTML reaches the scraping/extraction pipeline.

Quick Start

import { DomDownsampler } from "feedstock";

const downsampler = new DomDownsampler();
// rawHtml is the page's HTML as a string (for example, the body of a fetched page)
const cleanHtml = downsampler.downsample(rawHtml);
// 30-85% smaller depending on page complexity

What It Removes

Boilerplate: <script>, <style>, <noscript>, <svg>, and <iframe> elements, plus HTML comments
Non-semantic attributes: data-*, inline event handlers such as onclick, and tracking attributes are dropped. Attributes such as href, src, alt, id, class, role, and aria-label are kept
Single-child container chains: <div><div><div><p>text</p></div></div></div> becomes <p>text</p>
Empty nodes: elements with no text and no meaningful children are removed (<img>, <input>, <br>, and <hr> are preserved)
Excessive whitespace: runs of whitespace are collapsed to single spaces
Long text nodes: very long text blocks can optionally be truncated
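
To make the passes listed above more concrete, here is a minimal sketch of attribute filtering, whitespace collapsing, empty-node removal, and container collapsing on a tiny in-memory tree. It is not feedstock's implementation: the SketchNode types, the helper names, the CONTAINER_TAGS set, and the rule that only attribute-free containers are collapsed are all assumptions made for illustration.

```typescript
// Illustrative tree types for this sketch only; these are not feedstock's types.
type TextNode = { kind: "text"; value: string };
type ElementNode = {
  kind: "element";
  tag: string;
  attrs: Record<string, string>;
  children: SketchNode[];
};
type SketchNode = TextNode | ElementNode;

// Sets taken from the examples in the list above; CONTAINER_TAGS is an assumption.
const KEPT_ATTRS = new Set(["href", "src", "alt", "id", "class", "role", "aria-label"]);
const VOID_TAGS = new Set(["img", "input", "br", "hr"]);
const CONTAINER_TAGS = new Set(["div", "span", "section"]);

// Small constructors so the example tree below stays readable.
const textNode = (value: string): TextNode => ({ kind: "text", value });
const elem = (
  tag: string,
  attrs: Record<string, string>,
  children: SketchNode[],
): ElementNode => ({ kind: "element", tag, attrs, children });

// Returns the simplified node, or null when the node should be dropped.
function simplify(node: SketchNode): SketchNode | null {
  if (node.kind === "text") {
    // Collapse whitespace runs; drop text that is whitespace only.
    const value = node.value.replace(/\s+/g, " ");
    return value.trim() === "" ? null : { kind: "text", value };
  }

  // Keep only allow-listed attributes.
  const attrs: Record<string, string> = {};
  for (const name of Object.keys(node.attrs)) {
    const value = node.attrs[name];
    if (KEPT_ATTRS.has(name) && value !== undefined) attrs[name] = value;
  }

  // Simplify children first so emptiness is decided bottom-up.
  const children = node.children
    .map(simplify)
    .filter((child): child is SketchNode => child !== null);

  // Remove empty elements, except void elements such as <img>.
  if (children.length === 0 && !VOID_TAGS.has(node.tag)) return null;

  // Collapse an attribute-free container that wraps exactly one element.
  const only = children.length === 1 ? children[0] : undefined;
  if (
    only !== undefined &&
    only.kind === "element" &&
    CONTAINER_TAGS.has(node.tag) &&
    Object.keys(attrs).length === 0
  ) {
    return only;
  }
  return { kind: "element", tag: node.tag, attrs, children };
}

// Serializes the sketch tree back to an HTML-like string (no escaping).
function render(node: SketchNode): string {
  if (node.kind === "text") return node.value;
  const { tag, attrs, children } = node;
  const attrText = Object.keys(attrs)
    .map((name) => ` ${name}="${attrs[name]}"`)
    .join("");
  if (VOID_TAGS.has(tag)) return `<${tag}${attrText}>`;
  return `<${tag}${attrText}>${children.map(render).join("")}</${tag}>`;
}

const collapsed = simplify(
  elem("div", {}, [
    elem("div", { "data-track": "x" }, [
      elem("div", {}, [elem("p", { onclick: "f()" }, [textNode("hello   world")])]),
    ]),
  ]),
);
console.log(collapsed ? render(collapsed) : "(removed)"); // <p>hello world</p>
```

Processing children before their parent lets one traversal handle nested chains: by the time a container is inspected, any empty descendants are already gone and inner wrappers are already collapsed.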

Configuration

import { createDomDownsamplingConfig, DomDownsampler } from "feedstock";

const config = createDomDownsamplingConfig({
  maxTextLength: 1000,       // truncate text nodes beyond this (0 = no truncation)
  collapseContainers: true,  // collapse single-child chains (default: true)
  removeEmptyNodes: true,    // remove empty elements (default: true)
  preserveAttributes: [      // attributes to keep (has sensible defaults)
    "href", "src", "alt", "title", "role", "aria-label",
    "id", "class", "type", "name", "value", "action", "method",
  ],
});

const downsampler = new DomDownsampler(config);
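
The maxTextLength option is easiest to reason about with a small sketch of the truncation rule, including the "0 = no truncation" case. The truncateText helper below is hypothetical and not part of feedstock's API; whether the library appends a marker after the cut is not stated here, so the "..." suffix is an assumption.

```typescript
// Hypothetical helper for illustration; not part of feedstock's API.
function truncateText(value: string, maxTextLength: number): string {
  // 0 (or a negative value) disables truncation, mirroring the option's comment.
  if (maxTextLength <= 0 || value.length <= maxTextLength) return value;
  // Assumption: mark the cut with "..." so readers can tell text was shortened.
  return value.slice(0, maxTextLength) + "...";
}

const longText = "a".repeat(1200);
console.log(truncateText(longText, 1000).length); // 1003: 1000 characters plus the marker
console.log(truncateText(longText, 0).length); // 1200: truncation disabled
```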

With Crawler Config

// Assumes `crawler` is an already-initialized feedstock crawler instance.
const result = await crawler.crawl("https://example.com", {
  domDownsampling: {
    enabled: true,
    maxTextLength: 500,
  },
});

Real-World Results

Page	Before	After	Reduction
Wikipedia article	250 KB	172 KB	31%
Hacker News	34 KB	24 KB	29%
Typical SPA (inline scripts/styles)	2.7 KB	0.4 KB	84%

Clean sites like Wikipedia see a moderate reduction of around 30%. Pages dominated by inline scripts, tracking pixels, and style blocks can see reductions of 80% or more.
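
The Reduction column is the share of the original size that was removed. A minimal sketch for computing it on your own pages follows; reductionPercent is a hypothetical helper rather than a feedstock export, and it assumes both sizes are given in the same unit.

```typescript
// Hypothetical helper: percentage of the original size removed by downsampling.
function reductionPercent(beforeSize: number, afterSize: number): number {
  if (beforeSize <= 0) throw new RangeError("beforeSize must be positive");
  return Math.round((1 - afterSize / beforeSize) * 100);
}

console.log(reductionPercent(250, 172)); // 31 (Wikipedia article row)
console.log(reductionPercent(34, 24)); // 29 (Hacker News row)
```

Because the sizes in the table are rounded, recomputing a row from them can differ from the listed percentage by a point.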

When to Use

Before feeding HTML to extraction strategies (reduces processing time)
Before markdown generation (cleaner output)
When crawling at scale (less memory per page)
Before LLM processing (fewer tokens)