Content Filters

Content filters post-process scraped text: they strip noise and low-quality or irrelevant blocks so that only the relevant content remains.

PruningContentFilter

Rule-based filter that removes short blocks, boilerplate, and low-quality patterns.

import { PruningContentFilter } from "feedstock";

const filter = new PruningContentFilter({
  minWords: 5,  // blocks with fewer than this many words are removed
});

const cleaned = filter.filter(rawContent);

The filter also removes blocks that match common boilerplate patterns, such as:

  • "Share", "Tweet", "Subscribe", "Sign up"
  • "Copyright", "All rights reserved"
  • "Advertisement", "Sponsored"
  • "Loading", "Please wait"
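To make the two rules concrete, here is a minimal standalone sketch of rule-based pruning. It is not feedstock's implementation: the function name `pruneBlocks`, the pattern list, the choice to anchor patterns at the start of a block, and the assumption that blocks are separated by blank lines are all illustrative.

```typescript
// Hypothetical sketch of rule-based pruning; not feedstock's actual code.
// Patterns are anchored at the start of a block (an assumption of this sketch).
const BOILERPLATE_PATTERNS: RegExp[] = [
  /^(share|tweet|subscribe|sign up)\b/i,
  /^(copyright|all rights reserved)\b/i,
  /^(advertisement|sponsored)\b/i,
  /^(loading|please wait)\b/i,
];

// Count whitespace-separated words in a block.
function countWords(block: string): number {
  const trimmed = block.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Drop blocks that are too short or that start with a boilerplate phrase.
function pruneBlocks(content: string, minWords = 5): string {
  return content
    .split(/\n{2,}/) // assumed block separator: one or more blank lines
    .map((block) => block.trim())
    .filter((block) => countWords(block) >= minWords)
    .filter((block) => !BOILERPLATE_PATTERNS.some((p) => p.test(block)))
    .join("\n\n");
}

const rawPage = [
  "Share",
  "Copyright 2024 Example Corp. All rights reserved.",
  "This paragraph has more than five words in it.",
  "Sign up for our weekly newsletter today",
].join("\n\n");

const prunedPage = pruneBlocks(rawPage);
```

Pattern rules like these trade recall for simplicity: a legitimate paragraph that happens to begin with "Share" would also be dropped, so a real filter may need tighter patterns.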

BM25ContentFilter

Relevance-based filter using BM25 scoring. Keeps blocks that are relevant to a search query.

import { BM25ContentFilter } from "feedstock";

const filter = new BM25ContentFilter({
  k1: 1.5,         // term frequency saturation
  b: 0.75,         // document length normalization
  threshold: 0.1,  // minimum relevance score (0-1)
});

const relevant = filter.filter(content, "TypeScript web crawler");

Returns only content blocks that score above the threshold for the given query. Falls back to the full content if nothing matches.
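The sketch below shows one way such a filter could work, using the standard Okapi BM25 formula with each block treated as a document. It is an illustration under assumptions, not feedstock's code: the function name `bm25Filter`, the tokenizer, the blank-line block splitting, and the normalization of scores by the top-scoring block (to get a 0-1 range) are all choices made for this example.

```typescript
// Hypothetical sketch of BM25-based block filtering; not feedstock's actual code.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

function bm25Filter(
  content: string,
  query: string,
  k1 = 1.5,
  b = 0.75,
  threshold = 0.1,
): string {
  const blocks = content
    .split(/\n{2,}/) // assumed block separator: one or more blank lines
    .map((s) => s.trim())
    .filter((s) => s !== "");
  const docs = blocks.map(tokenize);
  const queryTerms = Array.from(new Set(tokenize(query)));
  const n = docs.length;
  if (n === 0 || queryTerms.length === 0) return content;

  const avgLen = docs.reduce((sum, d) => sum + d.length, 0) / n;

  const scores = docs.map((doc) => {
    let score = 0;
    for (const term of queryTerms) {
      const tf = doc.filter((t) => t === term).length;
      if (tf === 0) continue;
      // Document frequency: number of blocks containing the term.
      const df = docs.filter((d) => d.includes(term)).length;
      const idf = Math.log(1 + (n - df + 0.5) / (df + 0.5));
      score += (idf * tf * (k1 + 1)) / (tf + k1 * (1 - b + (b * doc.length) / avgLen));
    }
    return score;
  });

  // Normalize by the best score so the threshold can be read on a 0-1 scale
  // (an assumption of this sketch).
  const maxScore = Math.max(...scores);
  if (maxScore <= 0) return content; // nothing matched: fall back to full content

  const kept = blocks.filter((_, i) => scores[i] / maxScore > threshold);
  return kept.length > 0 ? kept.join("\n\n") : content;
}

const page = [
  "TypeScript is a typed superset of JavaScript.",
  "Our crawler fetches web pages in parallel.",
  "Cookies help us improve your experience.",
].join("\n\n");

const relevantBlocks = bm25Filter(page, "TypeScript web crawler");
const noMatch = bm25Filter(page, "quantum physics");
```

Here the first two blocks each contain at least one query term and are kept, the cookie notice scores zero and is dropped, and the unrelated query returns the page unchanged.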

Custom Filter

To write your own filter, extend ContentFilterStrategy and implement its filter method:

import { ContentFilterStrategy } from "feedstock";

class LanguageFilter extends ContentFilterStrategy {
  filter(content: string, _query?: string): string {
    // Keep only blocks made up of ASCII letters, whitespace, and basic
    // punctuation (a rough heuristic for "English-looking" text)
    return content.split("\n\n")
      .filter(block => /^[a-zA-Z\s.,!?]+$/.test(block))
      .join("\n\n");
  }
}
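
To see what this heuristic keeps and drops, the following self-contained sketch runs the same filter on a small sample. The abstract class here is a local stand-in so the snippet runs without feedstock installed; its exact shape is an assumption, and in real code you would import ContentFilterStrategy from the library instead.

```typescript
// Local stand-in for the library's base class (its shape is assumed here).
abstract class ContentFilterStrategy {
  abstract filter(content: string, query?: string): string;
}

class LanguageFilter extends ContentFilterStrategy {
  filter(content: string, _query?: string): string {
    // Keep blocks consisting only of ASCII letters, whitespace, and . , ! ?
    return content
      .split("\n\n")
      .filter((block) => /^[a-zA-Z\s.,!?]+$/.test(block))
      .join("\n\n");
  }
}

const sample = [
  "Hello world, this is plain English.",
  "Привет, мир!",
  "Released in 2024.",
].join("\n\n");

const englishOnly = new LanguageFilter().filter(sample);
// englishOnly === "Hello world, this is plain English."
```

Note that the third block is dropped even though it is English, because the pattern does not allow digits. A production filter would likely need a broader character set or a proper language detector.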