Extraction Strategies

Extract structured data from crawled pages using CSS selectors or regex.

Extraction strategies transform cleaned HTML into structured data. Feedstock ships with CSS selector and regex strategies, plus a base class for custom implementations.

How It Works

Set extractionStrategy in your crawl config:

const result = await crawler.crawl("https://example.com/products", {
  extractionStrategy: {
    type: "css",
    params: {
      name: "products",
      baseSelector: ".product",
      fields: [
        { name: "title", selector: "h2", type: "text" },
        { name: "price", selector: ".price", type: "text" },
      ],
    },
  },
});

const items = JSON.parse(result.extractedContent!);
// [{ index: 0, content: '{"title":"Widget","price":"$9.99"}', metadata: {...} }]
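The regex strategy is configured the same way. A hedged sketch follows, assuming the regex params take a name and a pattern; the exact param names here are assumptions, so check your feedstock version for the real shape:

```typescript
// Hypothetical regex config: "pattern" and its meaning are assumptions,
// not a confirmed feedstock API.
const result = await crawler.crawl("https://example.com/products", {
  extractionStrategy: {
    type: "regex",
    params: {
      name: "prices",
      pattern: "\\$\\d+\\.\\d{2}", // match dollar amounts like $9.99
    },
  },
});
```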

Available Strategies

No-Op Strategy

The NoExtractionStrategy returns the HTML unchanged as a single item. This is the default when no strategy is configured.

import { NoExtractionStrategy } from "feedstock";

const strategy = new NoExtractionStrategy();
const items = await strategy.extract(url, html);
// [{ index: 0, content: html }]

Custom Strategies

Extend ExtractionStrategy to build your own:

import { ExtractionStrategy, type ExtractedItem } from "feedstock";

class JsonApiExtractor extends ExtractionStrategy {
  async extract(url: string, html: string): Promise<ExtractedItem[]> {
    // Pull embedded JSON-LD <script> blocks out of the page.
    const scripts = html.match(/<script type="application\/ld\+json">(.*?)<\/script>/gs);
    return (scripts ?? []).map((s, i) => ({
      index: i,
      content: s.replace(/<\/?script[^>]*>/g, "").trim(),
    }));
  }
}
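The regex logic inside extract can be exercised on its own. A minimal, self-contained sketch (no feedstock imports) showing what the extractor yields for a page with an embedded JSON-LD block:

```typescript
// Sample page containing one JSON-LD <script> block.
const html = `<html><head>
<script type="application/ld+json">{"@type":"Product","name":"Widget"}</script>
</head><body></body></html>`;

// Same matching logic as JsonApiExtractor.extract: grab each JSON-LD
// script element, then strip the surrounding tags.
const scripts = html.match(/<script type="application\/ld\+json">(.*?)<\/script>/gs);
const items = (scripts ?? []).map((s, i) => ({
  index: i,
  content: s.replace(/<\/?script[^>]*>/g, "").trim(),
}));

console.log(items[0].content); // → {"@type":"Product","name":"Widget"}
```

Note that with the g flag, match returns full matches rather than capture groups, which is why the tag-stripping replace is needed afterward.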
