Scrape a Protected E-Commerce Site
Bypass bot detection with stealth mode, proxy rotation, retry logic, session persistence, and consent popup removal.
This guide walks through scraping a protected e-commerce site that uses bot detection, CAPTCHAs, cookie consent banners, and IP-based rate limiting.
The Goal
Reliably extract product data from a site that blocks automated browsers. We will combine stealth mode, rotating proxies, retry logic, session persistence for logged-in scraping, and consent popup dismissal into a single working pipeline.
Step 1: Enable Stealth Mode
Stealth mode patches common browser fingerprinting signals: it overrides navigator.webdriver, spoofs the Chrome runtime object, randomizes plugins, and sets realistic language headers.
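To make the idea concrete, here is a sketch of the kind of patch stealth mode injects. This is illustrative only — the library's real init scripts are internal — and `fakeNavigator` stands in for the page's `navigator` object, which does not exist in Node:

```typescript
// Headless Chromium exposes navigator.webdriver === true, a common
// bot-detection signal. Stealth-style patches redefine the getter so
// detection scripts see undefined, as in a regular browser.
// `fakeNavigator` is a stand-in for the real page `navigator`.
const fakeNavigator = { webdriver: true } as { webdriver?: boolean };

Object.defineProperty(fakeNavigator, "webdriver", {
  get: () => undefined, // hide the automation flag
});

console.log(fakeNavigator.webdriver); // undefined
```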
import { WebCrawler, createBrowserConfig } from "feedstock";
const crawler = new WebCrawler({
config: createBrowserConfig({
stealth: true,
headless: true,
}),
});

Stealth mode works by injecting init scripts before any page JavaScript runs. It is only effective with the Playwright backend -- the fetch engine does not execute JavaScript, so stealth patches have no effect there.
Step 2: Configure Proxy Rotation
Use ProxyRotationStrategy to round-robin through a pool of proxies. It tracks failures per proxy and automatically marks unhealthy ones, recovering them after a configurable interval.
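Internally, such a strategy can be approximated with round-robin selection plus a per-proxy failure counter. The sketch below is a simplified illustration, not the library's actual implementation (it omits the recovery interval for brevity):

```typescript
// Simplified sketch of round-robin proxy rotation with health tracking.
// Names and logic are illustrative; recovery-after-interval is omitted.
interface Proxy { server: string; }

class SimpleRotation {
  private index = 0;
  private failures = new Map<string, number>();

  constructor(private pool: Proxy[], private maxFailures = 3) {}

  getProxy(): Proxy {
    // Round-robin, skipping proxies past the failure threshold.
    for (let i = 0; i < this.pool.length; i++) {
      const candidate = this.pool[this.index % this.pool.length];
      this.index++;
      if ((this.failures.get(candidate.server) ?? 0) < this.maxFailures) {
        return candidate;
      }
    }
    throw new Error("no healthy proxies available");
  }

  reportResult(proxy: Proxy, success: boolean): void {
    // Success resets the counter; failure increments it.
    const count = this.failures.get(proxy.server) ?? 0;
    this.failures.set(proxy.server, success ? 0 : count + 1);
  }
}
```

The library's strategy adds recovery on top of this: after `recoveryInterval` elapses, an unhealthy proxy becomes eligible again.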
import { ProxyRotationStrategy } from "feedstock";
const proxies = new ProxyRotationStrategy(
[
{ server: "http://proxy-us.example.com:8080", username: "user", password: "pass" },
{ server: "http://proxy-eu.example.com:8080", username: "user", password: "pass" },
{ server: "http://proxy-ap.example.com:8080", username: "user", password: "pass" },
],
{
maxFailures: 3, // mark unhealthy after 3 consecutive failures
recoveryInterval: 60_000, // retry unhealthy proxy after 60 seconds
},
);

Before each request, call proxies.getProxy() to get the next healthy proxy and pass it to a fresh crawler instance (since the proxy is set at browser launch):
const proxy = proxies.getProxy();
const crawler = new WebCrawler({
config: createBrowserConfig({
stealth: true,
proxy,
}),
});

After each crawl, report the result so the rotation strategy can track health:
const result = await crawler.crawl(url, config);
proxies.reportResult(proxy, result.success);

Step 3: Retry on Blocks
The withRetry helper re-runs a crawl when the response looks like a block page. The isBlocked function checks for common indicators: HTTP 403/429/503 status codes combined with text like "access denied", "captcha", or "checking your browser".
import { withRetry, isBlocked, CacheMode } from "feedstock";
const { result, retries } = await withRetry(
() =>
crawler.crawl("https://store.example.com/products", {
cacheMode: CacheMode.Bypass,
simulateUser: true,
}),
(res) => isBlocked(res.html, res.statusCode ?? 0),
{
maxRetries: 3,
retryDelay: 2000, // doubles each attempt: 2s, 4s, 8s
},
);
if (retries > 0) {
console.log(`Succeeded after ${retries} retries`);
}

Setting simulateUser: true adds random mouse movements and scrolling before content is captured. This can help bypass behavioral fingerprinting but adds 300-800ms of overhead per page.
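The block-detection and backoff behavior described above can be sketched as follows. This is a simplified illustration with hypothetical names (`looksBlocked`, `retryOnBlock`); the library's actual heuristics may differ:

```typescript
// Simplified sketch of block detection plus retry with exponential backoff.
const BLOCK_STATUSES = new Set([403, 429, 503]);
const BLOCK_PHRASES = ["access denied", "captcha", "checking your browser"];

function looksBlocked(html: string, statusCode: number): boolean {
  // A block needs both a suspicious status and block-page text.
  if (!BLOCK_STATUSES.has(statusCode)) return false;
  const lower = html.toLowerCase();
  return BLOCK_PHRASES.some((phrase) => lower.includes(phrase));
}

async function retryOnBlock<T>(
  run: () => Promise<T>,
  shouldRetry: (result: T) => boolean,
  maxRetries: number,
  baseDelayMs: number,
): Promise<{ result: T; retries: number }> {
  let retries = 0;
  let result = await run();
  while (shouldRetry(result) && retries < maxRetries) {
    // Exponential backoff: baseDelayMs, then 2x, 4x, ...
    await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** retries));
    retries++;
    result = await run();
  }
  return { result, retries };
}
```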
Step 4: Persist Login Sessions
For sites that require authentication, use the storage state utilities to save and restore cookies and localStorage between runs.
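Conceptually, the round trip is just serializing browser state to JSON on disk. The sketch below is a hypothetical simplification -- the real utilities pull cookies and localStorage from a live browser context:

```typescript
import { existsSync, readFileSync, writeFileSync } from "node:fs";

// Hypothetical shape of a persisted storage state.
interface StorageState {
  cookies: Array<{ name: string; value: string; domain: string }>;
  localStorage: Record<string, string>;
}

function saveState(path: string, state: StorageState): void {
  writeFileSync(path, JSON.stringify(state, null, 2));
}

function loadState(path: string): StorageState | null {
  // Missing file means no saved session yet.
  return existsSync(path) ? JSON.parse(readFileSync(path, "utf8")) : null;
}
```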
First, log in once with a visible browser:
import {
WebCrawler,
createBrowserConfig,
saveStorageState,
loadStorageState,
applyStorageState,
} from "feedstock";
// One-time login: run headful so you can interact
const loginCrawler = new WebCrawler({
config: createBrowserConfig({ headless: false, stealth: true }),
});
// Navigate to the login page -- after this, log in manually in the browser window
await loginCrawler.crawl("https://store.example.com/login", {
waitFor: { kind: "selector", value: ".account-dashboard" },
sessionId: "login-session",
});

The saveStorageState and loadStorageState functions persist cookies to ~/.feedstock/storage/state.json by default. You can also provide a custom path:
// Save after login
await saveStorageState(browserContext, "./session/store-auth.json");
// On subsequent runs, check for saved state
const state = loadStorageState("./session/store-auth.json");
if (state) {
await applyStorageState(browserContext, state);
}

For automated login flows (no manual interaction), use the jsCode option:
const result = await crawler.crawl("https://store.example.com/login", {
jsCode: `
document.querySelector('#email').value = 'user@example.com';
document.querySelector('#password').value = 'secret';
document.querySelector('form').submit();
`,
waitFor: { kind: "selector", value: ".account-dashboard" },
sessionId: "auth-session",
});

Step 5: Remove Consent Popups
Set removeConsentPopups: true to automatically dismiss cookie consent banners and overlay dialogs before scraping content:
const result = await crawler.crawl("https://store.example.com/products", {
removeConsentPopups: true,
removeOverlayElements: true,
});

The removeOverlayElements option also strips fixed-position overlays (newsletter modals, interstitials) from the DOM before content extraction.
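At its core, consent removal is selector matching plus element removal. The sketch below is illustrative -- the selectors and the `dismissConsentPopups` helper are hypothetical, not the library's own:

```typescript
// Hypothetical selectors for common consent banners.
const CONSENT_SELECTORS = [
  '[id*="cookie-banner"]',
  '[class*="consent"]',
  '[class*="gdpr"]',
];

// Minimal document interface so the sketch stays runtime-agnostic.
interface RemovableElement { remove(): void; }
interface QueryableDoc { querySelectorAll(selector: string): RemovableElement[]; }

function dismissConsentPopups(doc: QueryableDoc): number {
  let removed = 0;
  for (const selector of CONSENT_SELECTORS) {
    for (const el of doc.querySelectorAll(selector)) {
      el.remove();
      removed++;
    }
  }
  return removed; // number of elements stripped
}
```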
Full Working Example
Putting it all together -- a complete script that scrapes product pages from a protected e-commerce site:
import {
WebCrawler,
createBrowserConfig,
CacheMode,
ProxyRotationStrategy,
withRetry,
isBlocked,
loadStorageState,
applyStorageState,
} from "feedstock";
// -- Proxy pool --
const proxies = new ProxyRotationStrategy([
{ server: "http://proxy-us.example.com:8080", username: "u", password: "p" },
{ server: "http://proxy-eu.example.com:8080", username: "u", password: "p" },
]);
// -- Product schema --
const schema = {
name: "products",
baseSelector: ".product-card",
fields: [
{ name: "title", selector: ".product-name", type: "text" as const },
{ name: "price", selector: ".price", type: "text" as const },
{ name: "image", selector: "img", type: "attribute" as const, attribute: "src" },
],
};
// -- URLs to scrape --
const urls = Array.from({ length: 10 }, (_, i) =>
`https://store.example.com/products?page=${i + 1}`
);
// -- Crawl each page --
const allProducts = [];
for (const url of urls) {
const proxy = proxies.getProxy();
const crawler = new WebCrawler({
config: createBrowserConfig({
stealth: true,
headless: true,
proxy,
}),
});
const { result, retries } = await withRetry(
() =>
crawler.crawl(url, {
cacheMode: CacheMode.Bypass,
simulateUser: true,
removeConsentPopups: true,
removeOverlayElements: true,
blockResources: "fast",
extractionStrategy: { type: "css", params: schema },
waitFor: { kind: "selector", value: ".product-card" },
}),
(res) => isBlocked(res.html, res.statusCode ?? 0),
{ maxRetries: 3, retryDelay: 2000 },
);
// Count any retry as a proxy failure so proxies that hit blocks rotate out
proxies.reportResult(proxy, result.success && retries === 0);
if (result.extractedContent) {
const items = JSON.parse(result.extractedContent)
.map((item: { content: string }) => JSON.parse(item.content));
allProducts.push(...items);
}
await crawler.close();
}
console.log(`Extracted ${allProducts.length} products from ${urls.length} pages`);

For high-volume scraping, create one crawler per proxy rather than per URL. This avoids the overhead of launching a new browser for every page. Rotate which crawler handles each request using the proxy rotation strategy.
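That crawler-per-proxy pattern can be sketched with a pool keyed by proxy server. This is a generic sketch -- `makeCrawler` stands in for the WebCrawler construction shown above:

```typescript
// Reuse one crawler instance per proxy server instead of launching a
// fresh browser for every URL. `makeCrawler` is a stand-in factory.
interface ProxyInfo { server: string; }

class CrawlerPool<C> {
  private pool = new Map<string, C>();

  constructor(private makeCrawler: (proxy: ProxyInfo) => C) {}

  // Return the existing crawler for this proxy, creating it on first use.
  acquire(proxy: ProxyInfo): C {
    let crawler = this.pool.get(proxy.server);
    if (!crawler) {
      crawler = this.makeCrawler(proxy);
      this.pool.set(proxy.server, crawler);
    }
    return crawler;
  }

  get size(): number {
    return this.pool.size;
  }
}
```

Each URL then calls acquire(proxies.getProxy()) and reuses whichever browser is already running behind that proxy.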