Normalizing Recipe Data Across 7 Different Websites
How RecipeScrape transforms messy, inconsistent recipe data from 7 different sources into clean, structured JSON — every single day.
Every recipe site publishes structured data using schema.org/Recipe markup — but "structured" is a generous term. One site puts cook time in PT30M format, another uses "30 minutes", and a third just leaves it blank. Some tag cuisines; others don't. Nutrition data might be a complete breakdown or just a calorie count.
RecipeScrape's normalizer layer handles these inconsistencies so the API always returns predictable shapes.
The Normalizer Pipeline
After recipe-scrapers extracts raw data from a page's JSON-LD, every recipe passes through a normalizer that applies the same rules regardless of source:
Timings. All duration fields (prep, cook, total) are parsed from ISO 8601 (PT30M), human-readable strings ("1 hour 15 minutes"), or raw numbers. The normalizer converts everything to integer minutes. If a value can't be parsed, it's null — never a string.
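The timing rule can be sketched in a few lines of Python. This is a minimal illustration, not RecipeScrape's actual code: the function name parse_minutes is hypothetical, raw numbers are assumed to already be in minutes, and the ISO 8601 handling covers only days, hours, and minutes.

```python
import math
import re
from typing import Optional, Union

# ISO 8601 durations such as PT30M or PT1H15M. Seconds are accepted but ignored.
_ISO_RE = re.compile(
    r"^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:\d+(?:\.\d+)?S)?)?$",
    re.IGNORECASE,
)
# Human-readable pieces such as "1 hour", "15 minutes", "45 min".
_HUMAN_RE = re.compile(
    r"(\d+(?:\.\d+)?)\s*(hours?|hrs?|h|minutes?|mins?|m)\b", re.IGNORECASE
)


def parse_minutes(value: Union[str, int, float, None]) -> Optional[int]:
    """Return a duration as integer minutes, or None if it cannot be parsed."""
    if value is None or isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        # Assumption: raw numbers are already expressed in minutes.
        if not math.isfinite(value) or value < 0:
            return None
        return int(round(value))
    if not isinstance(value, str):
        return None

    text = value.strip()
    if text.isdigit():
        return int(text)

    iso = _ISO_RE.match(text)
    if iso and any(g is not None for g in iso.groups()):
        days, hours, minutes = (int(g or 0) for g in iso.groups())
        return days * 1440 + hours * 60 + minutes

    total = 0.0
    found = False
    for amount, unit in _HUMAN_RE.findall(text):
        found = True
        total += float(amount) * (60 if unit.lower().startswith("h") else 1)
    return int(round(total)) if found else None
```

Under these assumptions, parse_minutes("PT30M") and parse_minutes("30 minutes") both give 30, while an unparseable value such as "a while" gives None instead of passing the string through.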
Servings. Yield fields arrive as "4 servings", 4, or "4-6". The normalizer extracts a single integer, taking the lower bound when the value is a range. Range information is preserved for future use, but the API always returns a single number.
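As a rough sketch of the servings rule, the function below returns the single number the API would expose along with the optional range. The name parse_servings and the tuple return shape are assumptions made for illustration; how RecipeScrape stores the range internally is not described here.

```python
import re
from typing import Optional, Tuple, Union

# A number, optionally followed by "-", an en dash, or "to" and a second number.
_YIELD_RE = re.compile(r"(\d+)(?:\s*(?:-|–|to)\s*(\d+))?")

Range = Tuple[int, int]


def parse_servings(
    value: Union[str, int, float, None]
) -> Tuple[Optional[int], Optional[Range]]:
    """Return (servings, range): the lower bound, and the (low, high) pair if any."""
    if value is None or isinstance(value, bool):
        return None, None
    if isinstance(value, (int, float)):
        return int(value), None

    match = _YIELD_RE.search(str(value))
    if not match:
        return None, None

    first = int(match.group(1))
    if match.group(2) is None:
        return first, None

    second = int(match.group(2))
    low, high = min(first, second), max(first, second)
    return low, (low, high)
```

With this sketch, "4 servings" and 4 both yield (4, None), and "4-6" yields (4, (4, 6)), so the range survives without changing the single number returned.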
Nutrition. The schema allows 20+ nutrition fields but most recipes only include 5–8. The normalizer fills missing fields as null rather than zero — there's a big difference between "zero fat" and "unknown fat."
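The difference between zero and unknown is easy to show in code. The sketch below is illustrative only: normalize_nutrition is a hypothetical name, the field list is a small subset of schema.org NutritionInformation properties, and converting values like "250 kcal" to a float is an assumption about the output format.

```python
import re
from typing import Any, Dict, Mapping, Optional

# Illustrative subset of schema.org NutritionInformation properties.
NUTRITION_FIELDS = (
    "calories",
    "proteinContent",
    "fatContent",
    "carbohydrateContent",
    "sodiumContent",
    "fiberContent",
    "sugarContent",
)

_NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")


def _to_number(value: Any) -> Optional[float]:
    """Pull the first number out of values like '12 g'; None if there is none."""
    if value is None or isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return float(value)
    match = _NUMBER_RE.search(str(value))
    return float(match.group()) if match else None


def normalize_nutrition(raw: Mapping[str, Any]) -> Dict[str, Optional[float]]:
    """Return every known field; absent or unparseable values become None, not 0."""
    return {field: _to_number(raw.get(field)) for field in NUTRITION_FIELDS}
```

Given {"calories": "250 kcal", "fatContent": "0 g"}, this returns 250.0 for calories, 0.0 for fatContent (a known zero), and None for proteinContent (unknown), with every key always present.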
Ingredients. Raw ingredient strings like "1 cup all-purpose flour, sifted" are parsed into {name, quantity, unit, notes} using a heuristic parser. The quantity becomes a float, the unit maps to a standard set (cup, tbsp, g, ml, etc.), and anything in parentheses or after a comma goes into notes.
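A heuristic ingredient parser along these lines might look like the following. It is a simplified sketch rather than the parser RecipeScrape ships: parse_ingredient and UNIT_ALIASES are hypothetical names, the unit table is deliberately tiny, and it only understands a leading quantity written as an integer, decimal, or simple fraction.

```python
import re
from fractions import Fraction
from typing import Any, Dict

# Hypothetical alias table mapping spellings to a standard unit set.
UNIT_ALIASES = {
    "cup": "cup", "cups": "cup",
    "tablespoon": "tbsp", "tablespoons": "tbsp", "tbsp": "tbsp",
    "teaspoon": "tsp", "teaspoons": "tsp", "tsp": "tsp",
    "gram": "g", "grams": "g", "g": "g",
    "milliliter": "ml", "milliliters": "ml", "ml": "ml",
}

# Leading quantity: "1 1/2", "1/2", "2", or "2.5".
_QTY_RE = re.compile(r"^\s*(\d+\s+\d+/\d+|\d+/\d+|\d+(?:\.\d+)?)\s*")
_PAREN_RE = re.compile(r"\(([^)]*)\)")


def parse_ingredient(raw: str) -> Dict[str, Any]:
    """Split a raw ingredient string into name, quantity, unit, and notes."""
    # Parenthesised text goes to notes first, so commas inside it are not split on.
    notes = [n.strip() for n in _PAREN_RE.findall(raw) if n.strip()]
    text = _PAREN_RE.sub(" ", raw)

    # Everything after the first comma is also treated as a note.
    text, _, tail = text.partition(",")
    if tail.strip():
        notes.append(tail.strip())

    quantity = None
    match = _QTY_RE.match(text)
    if match:
        # "1 1/2" becomes Fraction(1) + Fraction(1, 2) = 1.5
        quantity = float(sum(Fraction(part) for part in match.group(1).split()))
        text = text[match.end():]

    words = text.split()
    unit = None
    if words and words[0].lower().rstrip(".") in UNIT_ALIASES:
        unit = UNIT_ALIASES[words[0].lower().rstrip(".")]
        words = words[1:]

    return {
        "name": " ".join(words) or None,
        "quantity": quantity,
        "unit": unit,
        "notes": "; ".join(notes) or None,
    }
```

For "1 cup all-purpose flour, sifted" this produces the name "all-purpose flour", quantity 1.0, unit "cup", and notes "sifted". Anything the heuristics cannot recognise stays null, consistent with the rest of the pipeline.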
Source Profiles
Each source gets a YAML profile that tells the normalizer about site-specific quirks:
allrecipes:
  nutrition_mapping:
    calories: "calories"
    protein: "proteinContent"
  time_format: "mixed"          # ISO 8601 and human-readable
  cuisine_confidence: "high"    # uses a cuisine taxonomy
recipetineats:
  nutrition_mapping:
    calories: "calories"
    protein: "proteinContent"
    sodium: "sodiumContent"
  time_format: "iso8601"
  cuisine_confidence: "medium"  # free-text cuisine field
Handling Missing Data
Not every recipe has every field. Dessert recipes rarely have a cuisine tag. Some authors skip prep time. The rule is simple: missing fields are null, never empty strings or sentinel values like -1. This lets API consumers write clean code:
const prepTime = recipe.prepTimeMinutes ?? "Not specified";
The Result
After normalization, a recipe from Food Network and a recipe from Simply Recipes look identical in structure. The sourceSite field is the only hint of origin — and the sourceUrl field links back for attribution. Developers building on the API never need to write a single site-specific parser.