feedstock

Regex Extraction

Extract data from HTML using regular expression patterns.

RegexExtractionStrategy applies one or more regex patterns to the HTML content and returns every match together with its named and positional capture groups.

Basic Usage

const result = await crawler.crawl("https://example.com", {
  extractionStrategy: {
    type: "regex",
    params: {
      patterns: [/\$\d+\.\d{2}/g],
    },
  },
});

const prices = JSON.parse(result.extractedContent!);
// [{ index: 0, content: "$9.99", metadata: { fullMatch: "$9.99", groups: {}, captures: [] } }]

Named Capture Groups

import { RegexExtractionStrategy } from "feedstock";

const strategy = new RegexExtractionStrategy([
  /(?<currency>\$|EUR|GBP)(?<amount>\d+(?:\.\d{2})?)/g,
]);

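// Sample inputs for illustration; in practice these come from a crawl.
const url = "https://example.com";
const html = "<span>$9.99</span>";

const items = await strategy.extract(url, html);
// items[0].metadata.groups = { currency: "$", amount: "9.99" }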

Multiple Patterns

const strategy = new RegexExtractionStrategy([
  /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z]{2,}\b/gi,  // emails
  /\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g,                       // phone numbers
  /https?:\/\/[^\s<>"]+/g,                                  // URLs
]);
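To check what each of these patterns picks up before wiring them into a crawl, they can be run against a sample string with the standard String.prototype.match. This is a minimal sketch that does not call feedstock at all; the sample text and the variable names are made up for illustration.

```typescript
// Hypothetical sample text, not taken from a real page.
const sample =
  'Contact jane.doe@example.com or 555-123-4567, see <a href="https://example.com/pricing">pricing</a>.';

// The same three patterns as in the snippet above.
const emailPattern = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z]{2,}\b/gi;
const phonePattern = /\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g;
const urlPattern = /https?:\/\/[^\s<>"]+/g;

// With the g flag, match() returns every full match, or null when there are none.
const emails: string[] = sample.match(emailPattern) ?? [];
const phones: string[] = sample.match(phonePattern) ?? [];
const urls: string[] = sample.match(urlPattern) ?? [];

console.log(emails); // ["jane.doe@example.com"]
console.log(phones); // ["555-123-4567"]
console.log(urls);   // ["https://example.com/pricing"]
```

The URL pattern stops at the closing quote of the href attribute because `"` is excluded from its character class, which is what keeps it usable on raw HTML.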

Result Structure

Each match is returned as an object with this shape:

{
  index: number;           // Sequential index
  content: string;         // Full match text
  metadata: {
    fullMatch: string;     // Same as content
    groups: Record<string, string>;  // Named capture groups
    captures: string[];    // Positional captures
  }
}
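To make the shape concrete, here is a rough sketch of how a match from the standard String.prototype.matchAll could be mapped onto it. The helper toMatchItems and the interface name RegexMatchItem are hypothetical, and this is not feedstock's actual implementation; in particular, how the library fills captures for unmatched or named groups is an assumption here.

```typescript
// Shape copied from the result structure described in this section.
interface RegexMatchItem {
  index: number;
  content: string;
  metadata: {
    fullMatch: string;
    groups: Record<string, string>;
    captures: string[];
  };
}

// Hypothetical helper, NOT feedstock's implementation: builds items in that
// shape using the standard String.prototype.matchAll.
function toMatchItems(text: string, patterns: RegExp[]): RegexMatchItem[] {
  const items: RegexMatchItem[] = [];
  for (const pattern of patterns) {
    // matchAll throws a TypeError for non-global patterns, so add "g" if missing.
    const flags = pattern.flags.includes("g") ? pattern.flags : pattern.flags + "g";
    const globalPattern = new RegExp(pattern.source, flags);
    for (const match of text.matchAll(globalPattern)) {
      items.push({
        index: items.length, // sequential across all patterns
        content: match[0],
        metadata: {
          fullMatch: match[0],
          groups: { ...(match.groups ?? {}) },
          // Assumption: captures hold every capture group in order,
          // with unmatched optional groups stored as empty strings.
          captures: match.slice(1).map((c) => c ?? ""),
        },
      });
    }
  }
  return items;
}

const demoItems = toMatchItems("Sale: $9.99, was $12.50", [/\$(\d+)\.(\d{2})/g]);
console.log(demoItems[1]);
// { index: 1, content: "$12.50", metadata: { fullMatch: "$12.50", groups: {}, captures: ["12", "50"] } }
```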

Patterns should use the g (global) flag to find all matches. Without it, only the first match is returned.
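The effect of the g flag can be seen with the standard JavaScript string methods alone. The sketch below illustrates the language-level behavior only; it does not show how feedstock itself handles a non-global pattern internally, and the sample string is made up.

```typescript
// Hypothetical sample text.
const priceText = "Items: $9.99, $12.50, $3.00";

// Without g: match() stops at the first match.
const firstOnly = priceText.match(/\$\d+\.\d{2}/);
console.log(firstOnly?.[0]); // "$9.99"

// With g: match() returns every full match.
const allPrices: string[] = priceText.match(/\$\d+\.\d{2}/g) ?? [];
console.log(allPrices); // ["$9.99", "$12.50", "$3.00"]

// matchAll() rejects non-global patterns outright with a TypeError.
let threwTypeError = false;
try {
  priceText.matchAll(/\$\d+\.\d{2}/);
} catch (err) {
  threwTypeError = err instanceof TypeError;
}
console.log(threwTypeError); // true
```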
