Robots.txt
Parse and respect robots.txt directives.
The RobotsParser fetches, parses, and caches robots.txt files. It supports Allow, Disallow, Crawl-delay, Sitemap, wildcard patterns, and end-of-URL anchors.
Usage
```typescript
import { RobotsParser } from "feedstock";

const parser = new RobotsParser("feedstock"); // user-agent name

// Fetch and parse robots.txt for a URL's origin
const directives = await parser.fetch("https://example.com/page");

// Check if a specific URL is allowed
if (parser.isAllowed("https://example.com/admin", directives)) {
  // OK to crawl
}

// Access crawl delay
if (directives.crawlDelay) {
  rateLimiter.setDelay("https://example.com/", directives.crawlDelay * 1000);
}

// Discover sitemaps
console.log(directives.sitemaps);
// ["https://example.com/sitemap.xml"]
```

With Deep Crawling
```typescript
const results = await crawler.deepCrawl(
  "https://example.com",
  {},
  {
    robotsParser: new RobotsParser("my-crawler"),
  },
);
```

The deep crawl strategies automatically check robots.txt before crawling each discovered URL.
Parsing Rules
The parser follows the Robots Exclusion Protocol:
- User-agent matching — matches your bot name, falls back to `*`
- Allow/Disallow — longest match wins (more specific rules take priority)
- Wildcards — `*` matches any sequence, `$` anchors to the end of the URL
- Crawl-delay — per-agent delay in seconds
- Sitemap — sitemap URLs (regardless of user-agent section)
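The longest-match and wildcard rules above can be sketched in a few lines of standalone TypeScript. This is an illustrative model, not the library's actual implementation; the `Rule`, `patternToRegExp`, and `isPathAllowed` names are made up for the example.

```typescript
interface Rule {
  allow: boolean;
  pattern: string; // e.g. "/admin", "/*.pdf$"
}

// Convert a robots.txt pattern to a RegExp: escape regex metacharacters,
// then translate `*` to `.*` and a trailing `$` to an end-of-URL anchor.
function patternToRegExp(pattern: string): RegExp {
  const anchored = pattern.endsWith("$");
  const body = (anchored ? pattern.slice(0, -1) : pattern)
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&")
    .replace(/\*/g, ".*");
  return new RegExp("^" + body + (anchored ? "$" : ""));
}

// A path is allowed unless a matching rule says otherwise; when several
// rules match, the one with the longest pattern takes priority.
function isPathAllowed(path: string, rules: Rule[]): boolean {
  let best: Rule | undefined;
  for (const rule of rules) {
    if (patternToRegExp(rule.pattern).test(path)) {
      if (!best || rule.pattern.length > best.pattern.length) best = rule;
    }
  }
  return best ? best.allow : true;
}

const rules: Rule[] = [
  { allow: false, pattern: "/admin" },
  { allow: true, pattern: "/admin/public" },
];

console.log(isPathAllowed("/admin/settings", rules)); // false
console.log(isPathAllowed("/admin/public/page", rules)); // true
```

Note how `/admin/public` (12 characters) beats `/admin` (6 characters), which is why the more specific `Allow` wins.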
Example robots.txt
```
User-agent: *
Disallow: /private/
Disallow: /admin
Allow: /admin/public
Crawl-delay: 2

User-agent: feedstock
Disallow: /secret/
Allow: /secret/public/
Crawl-delay: 1

Sitemap: https://example.com/sitemap.xml
```

Caching
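One subtlety in the example above: under the Robots Exclusion Protocol, a crawler obeys only the most specific group that matches its user-agent. A bot named `feedstock` follows its own group (crawl delay 1, `/secret/` rules) and ignores the `*` group entirely. The sketch below models this with a simplified exact-name-or-`*` lookup; `selectGroup` and `Groups` are hypothetical names, and real matchers do case-insensitive token matching rather than strict equality.

```typescript
type Groups = Map<string, string[]>; // agent name -> directive lines

// Pick the group for an agent: exact (case-insensitive) name match wins,
// otherwise fall back to the `*` group. Only one group ever applies.
function selectGroup(agent: string, groups: Groups): string[] {
  const key = [...groups.keys()].find(
    (k) => k.toLowerCase() === agent.toLowerCase(),
  );
  return groups.get(key ?? "*") ?? [];
}

const groups: Groups = new Map([
  ["*", ["Disallow: /private/", "Disallow: /admin", "Crawl-delay: 2"]],
  ["feedstock", ["Disallow: /secret/", "Crawl-delay: 1"]],
]);

console.log(selectGroup("feedstock", groups)); // feedstock group only
console.log(selectGroup("otherbot", groups)); // falls back to `*`
```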
Results are cached per-origin after the first fetch. Call parser.clearCache() to reset.