Scraping 7,000+ Recipes Without Being a Menace
How RecipeScrape scrapes 7 recipe sites ethically with robots.txt compliance, rate limiting, exponential backoff, and zero login-required content.
Web scraping has a bad reputation, and often deservedly so. But scraping public recipe data for a non-commercial API doesn't have to be rude. RecipeScrape was built with a simple philosophy: take only what's public, be polite about it, and always attribute the source.
The Ethical Ruleset
Every spider in the project enforces the same rules, baked into the BaseScraper class and its utility layer:
1. robots.txt is law. Before a spider fetches a single URL, it checks robots.txt for the target domain. Disallowed paths are silently skipped. No arguments, no workarounds.
2. A respectful User-Agent. Every request identifies itself as RecipeScrapeBot/1.0 (+https://recipescrape.dev/bot). If a site operator wants to block us or reach out, they can.
3. Rate limiting with teeth. Minimum 1.5-second delay between requests to the same domain, plus random jitter (0–1.5s). Max 2 concurrent requests per domain via asyncio.Semaphore(2). This keeps the load on any single site well below what a human browsing with a few open tabs would generate.
4. Exponential backoff on 429. If a server responds with HTTP 429 (Too Many Requests), the scraper waits and retries with an exponentially growing delay, up to 5 retries, using the tenacity library. No hammering.
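The four rules above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's actual utility layer: the function names are hypothetical, and the backoff base (2 s) and cap (60 s) are assumed values the post does not give. The post says tenacity drives the real retries; backoff_delay only shows the shape of an exponential schedule.

```python
import random
from urllib.robotparser import RobotFileParser

# Values taken from the rules above.
USER_AGENT = "RecipeScrapeBot/1.0 (+https://recipescrape.dev/bot)"
MIN_DELAY = 1.5    # seconds between requests to one domain
MAX_JITTER = 1.5   # extra random delay, 0 to 1.5 s
MAX_RETRIES = 5    # retry cap on HTTP 429


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str) -> bool:
    """Rule 1: a URL is fetched only if robots.txt permits it for our agent."""
    return parser.can_fetch(USER_AGENT, url)


def polite_delay() -> float:
    """Rule 3: fixed minimum delay plus uniform random jitter."""
    return MIN_DELAY + random.uniform(0.0, MAX_JITTER)


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Rule 4: wait before retry number `attempt` (1-based).

    `base` and `cap` are assumptions made for this sketch.
    """
    return min(cap, base * 2 ** (attempt - 1))


if __name__ == "__main__":
    robots = build_robots("User-agent: *\nDisallow: /private/\n")
    print(is_allowed(robots, "https://example.com/recipes/1"))
    print(is_allowed(robots, "https://example.com/private/x"))
    print([backoff_delay(n) for n in range(1, MAX_RETRIES + 1)])
```

With these assumed parameters the waits double on each retry, so five retries stay within about a minute of total waiting. Rule 2 is just a matter of sending USER_AGENT as the User-Agent header on every request.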
Why Not Just Use an API?
Most recipe sites don't have public APIs, and the ones that do are usually gated behind expensive licensing. The recipe-scrapers Python library already handles structured parsing of schema.org/Recipe JSON-LD markup (the application/ld+json script blocks embedded in each page), the same structured data that Google uses for rich search results. We're not parsing HTML soup; we're reading the data the site already publishes in machine-readable form.
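To make "machine-readable" concrete, here is a minimal standard-library sketch of pulling a schema.org Recipe object out of a page's JSON-LD. It is not how recipe-scrapers is implemented, and the project relies on that library rather than code like this; JsonLdCollector and find_recipe are hypothetical names used only for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content can arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def find_recipe(html):
    """Return the first JSON-LD object typed as Recipe, or None."""
    collector = JsonLdCollector()
    collector.feed(html)
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # malformed block: ignore it
        # JSON-LD may be one object, a list of objects, or wrapped in @graph.
        if isinstance(data, dict):
            candidates = data.get("@graph", [data])
        elif isinstance(data, list):
            candidates = data
        else:
            continue
        for item in candidates:
            if not isinstance(item, dict):
                continue
            types = item.get("@type", [])
            if isinstance(types, str):
                types = [types]
            if "Recipe" in types:
                return item
    return None
```

Calling find_recipe(page_html) on a page that publishes Recipe markup returns a plain dict with keys such as name and recipeIngredient; a page without it returns None, and nothing is inferred from the visible HTML.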
Attribution Is Non-Negotiable
Every recipe in the database carries its source_url and source_site. These fields are returned in every API response and displayed prominently on every recipe page. The API docs say it clearly: "Attribution is non-negotiable."
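One way to make attribution hard to forget is to refuse to build a record without it. The sketch below is an assumption about how that could look, not the project's schema: only the field names source_url and source_site come from the text above, while RecipeRecord, make_record, the title field, and deriving source_site from the hostname are all hypothetical.

```python
from dataclasses import asdict, dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class RecipeRecord:
    title: str
    source_url: str   # field name from the post
    source_site: str  # field name from the post


def make_record(title: str, source_url: str) -> RecipeRecord:
    """Build a record, rejecting any recipe that lacks a usable source URL.

    Deriving source_site from the hostname is an assumption of this sketch.
    """
    host = urlparse(source_url).hostname
    if not host:
        raise ValueError("a recipe without a source URL cannot be stored")
    return RecipeRecord(title, source_url, host.removeprefix("www."))


if __name__ == "__main__":
    record = make_record("Pancakes", "https://www.example.com/recipe/pancakes")
    print(asdict(record))  # both attribution fields travel with the payload
```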
The Spider Architecture
Each target site (allrecipes, foodnetwork, simplyrecipes, etc.) gets its own spider subclass. Spiders only need to implement one method: get_recipe_urls(), which crawls sitemaps or category pages to discover individual recipe URLs. The actual parsing is handled by recipe-scrapers, which extracts structured data from the page's embedded JSON-LD.
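The subclassing pattern described above might look roughly like this. It is a sketch under assumptions: the post names BaseScraper and get_recipe_urls() but not their signatures, so the synchronous method, the site_name attribute, the urls_from_sitemap helper, and ExampleSpider are hypothetical. The spider here takes sitemap XML as a string to keep the example free of network calls.

```python
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod

# Namespace used by standard sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


class BaseScraper(ABC):
    """Shared behaviour lives in the base class; subclasses only discover URLs."""

    site_name = ""

    @abstractmethod
    def get_recipe_urls(self):
        """Return the recipe page URLs this spider wants to scrape."""


def urls_from_sitemap(xml_text, must_contain):
    """Extract <loc> entries from sitemap XML, keeping those that look like recipes."""
    root = ET.fromstring(xml_text)
    locs = (el.text.strip() for el in root.findall(".//sm:loc", SITEMAP_NS) if el.text)
    return [url for url in locs if must_contain in url]


class ExampleSpider(BaseScraper):
    """Hypothetical spider for a site whose recipe URLs contain /recipe/."""

    site_name = "example"

    def __init__(self, sitemap_xml):
        self.sitemap_xml = sitemap_xml

    def get_recipe_urls(self):
        return urls_from_sitemap(self.sitemap_xml, "/recipe/")
```

Because get_recipe_urls is abstract, instantiating BaseScraper directly raises TypeError, which catches a spider that forgot to implement discovery before it ever runs.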
The orchestrator, ScrapeEngine, runs all spiders with a shared semaphore, skips URLs that are already in the database, tracks progress per site, and writes a ScrapeRun record as an audit trail for every job.
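A stripped-down version of that orchestration could look like the following. Only the names ScrapeEngine and ScrapeRun come from the post; the constructor arguments, the ScrapeRun fields, and the injected fetch coroutine are assumptions made so the sketch runs without a database or network.

```python
import asyncio
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ScrapeRun:
    """Audit record for one job (field names are assumptions)."""
    started_at: datetime
    fetched: dict = field(default_factory=dict)  # site name -> pages fetched
    skipped_duplicates: int = 0


class ScrapeEngine:
    def __init__(self, known_urls, fetch, max_concurrent=2):
        self.known_urls = set(known_urls)  # URLs already stored
        self.fetch = fetch                 # async callable taking a URL
        self.max_concurrent = max_concurrent

    async def _scrape_one(self, semaphore, site, url, record):
        async with semaphore:              # shared concurrency limit
            await self.fetch(url)
        record.fetched[site] = record.fetched.get(site, 0) + 1

    async def run(self, urls_by_site):
        # Create the semaphore inside the running event loop.
        semaphore = asyncio.Semaphore(self.max_concurrent)
        record = ScrapeRun(started_at=datetime.now(timezone.utc))
        tasks = []
        for site, urls in urls_by_site.items():
            for url in urls:
                if url in self.known_urls:
                    record.skipped_duplicates += 1
                    continue
                self.known_urls.add(url)   # also dedupes within this run
                tasks.append(self._scrape_one(semaphore, site, url, record))
        await asyncio.gather(*tasks)
        return record


async def fake_fetch(url):
    """Stand-in for a real HTTP request."""
    await asyncio.sleep(0)


if __name__ == "__main__":
    engine = ScrapeEngine({"https://a.example/r/1"}, fake_fetch)
    result = asyncio.run(engine.run({"a": ["https://a.example/r/1", "https://a.example/r/2"]}))
    print(result.fetched, result.skipped_duplicates)
```

Passing fetch in as a parameter keeps the engine testable: the same class can be driven by a fake coroutine in tests and by the real rate-limited client in production.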
The Result
7,166 recipes across 7 sites, collected over multiple scheduled runs. Zero blocks, zero complaints, and a fully open API that any developer can use to build cooking apps.