Web scraping at scale: the complete field guide to reliable data pipelines
Zenith Automate | March 12, 2026 · 16 min read

Starting a scraper is easy. Keeping one reliable in production is the real work. A deep, practical guide to the four stages of a data pipeline, anti-bot handling, headless browsers, proxies, parsing that survives change, validation, scheduling, monitoring, scaling, the legal questions, and the mistakes that quietly poison your data.
Almost anyone can write a script that pulls data off a web page once. You open the dev tools, find the right selector, fetch the HTML, and print the result. Ten minutes of work and it feels like magic.
Then you try to run it every morning, across a hundred thousand records, behind a login, on a site that quietly changes its layout every few weeks, while another team makes purchasing decisions based on the numbers you produce. The magic becomes a maintenance problem, and the maintenance problem becomes a trust problem the first time a wrong number slips through.
The gap between that first ten-minute script and a pipeline a business can actually run on is enormous, and it is where essentially all of the real engineering lives. This guide walks the whole distance. It is the same playbook behind the production scrapers I run, like the pharmacy price monitor that cut purchasing costs by 25% and the Gemini.pl tracker covering 100k+ products.
Key takeaways
- A production scraper is four stages: fetch, parse, validate, deliver. Each fails in its own way and needs its own defences.
- Reliability beats cleverness. A scraper that is wrong 5% of the time is worse than useless, because you can't tell which 5% is wrong.
- The hard parts are rarely the fetch. They are anti-bot handling, parsing that survives layout drift, and validation that catches bad data before it spreads.
- The real cost of a scraper is maintenance, because the sources you depend on never stop changing. Budget for it or watch it rot.
Why scraping is easy to start and hard to keep
A one-off scrape and a production pipeline look similar for about ten minutes and then diverge completely. The difference is everything that turns "it worked when I ran it" into "it works every morning, at scale, and tells me when it doesn't."
The one-off script assumes the happy path: the page loads, the structure is what you expect, the data is clean, and you are watching when it runs. Production assumes none of that. Pages time out, get blocked, or return a logged-out error shell. Structures drift. Data arrives malformed. And nobody is watching, which means the pipeline has to watch itself.
Simplicity is the ultimate sophistication.
The most reliable scrapers I have built are also the most boring. No clever tricks, just disciplined handling of every way the real world deviates from the demo. That discipline is the whole craft.
The four stages of a real pipeline
Every reliable scraper, no matter the language or the site, is built from the same four stages. Naming them is useful, because each one fails in its own way, needs its own defences, and should be testable on its own.
1. Fetch. Get the raw data off the source, handling authentication, pagination, anti-bot protection, rate limits, and retries without getting blocked or banned.
2. Parse. Turn messy HTML, JSON, or XML into structured rows, with selectors and a schema that survive small layout changes instead of shattering on them.
3. Validate. Reject obviously-bad data before anyone sees it. Is the price in a sane range? Did the page actually load? Is the record still the thing you think it is?
4. Deliver. Write clean, validated, deduplicated data where it is needed: Google Sheets, Excel, a database, or an API, on a reliable schedule.
Keep these stages separate in your code. When something breaks at 9am, you want to know in seconds whether the fetch failed, the parse drifted, or the data is simply wrong. A pipeline with blurred boundaries forces you to debug all three at once; a pipeline with clean boundaries tells you exactly where to look.
Stage one: fetching without getting blocked
The interesting data is rarely on a static public page. It is behind a login, spread across paginated results, rendered by JavaScript, and often protected by systems specifically designed to stop automated access. Handling all of that politely and reliably is most of the fetch stage, and most of the difficulty.
HTTP requests versus a real browser
The first decision is how you fetch at all. A plain HTTP request (with a library like httpx or requests) is fast, cheap, and uses almost no memory. If the data you need is in the initial HTML or available from a JSON endpoint the page calls, this is almost always the right choice.
But many modern sites render their content with JavaScript after the page loads, or defend themselves in ways a bare HTTP client cannot satisfy. For those, a headless browser like Playwright renders the page exactly as a real user's browser would, executing the JavaScript and producing the final DOM. It is far heavier (it launches a real Chromium), so the rule is simple: use HTTP requests by default, and reach for a headless browser only where the site genuinely forces you to.
Sessions, cookies, and staying logged in
Authenticated scraping is the norm, not the exception. The trick is to treat the session like a real one: log in once, persist the cookies, reuse them across requests, and refresh the session before it expires rather than discovering it has expired halfway through a run.
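A minimal sketch of the "log in once, persist, refresh early" pattern, using only the standard library. The file layout, the five-minute refresh margin, and the `login` callable (assumed to return a cookie dict and an expiry timestamp) are all illustrative assumptions, not part of any particular site's API.

```python
import json
import time
from pathlib import Path
from typing import Callable, Optional, Tuple

REFRESH_MARGIN_S = 300  # assumption: treat a session as expired 5 minutes early


def save_session(path: Path, cookies: dict, expires_at: float) -> None:
    """Persist cookies and their expiry (Unix seconds) after a successful login."""
    path.write_text(json.dumps({"cookies": cookies, "expires_at": expires_at}))


def load_session(path: Path, now: Optional[float] = None) -> Optional[dict]:
    """Return saved cookies, or None if they are missing, unreadable, or about to expire."""
    now = time.time() if now is None else now
    try:
        state = json.loads(path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return None
    if state["expires_at"] - now < REFRESH_MARGIN_S:
        return None  # too close to expiry: log in again before the run starts
    return state["cookies"]


def get_cookies(path: Path, login: Callable[[], Tuple[dict, float]]) -> dict:
    """Reuse a saved session when possible; otherwise call login() and persist the result."""
    cookies = load_session(path)
    if cookies is None:
        cookies, expires_at = login()  # hypothetical callable supplied by the caller
        save_session(path, cookies, expires_at)
    return cookies
```

Checking the expiry before the run starts, rather than reacting to a failure in the middle of it, is what keeps a long crawl from ending half logged out.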
Anti-bot defences and how to be a good citizen
Sites defend themselves with rate limiting, IP reputation checks, browser fingerprinting, and CAPTCHAs. Getting past these is partly technical and partly behavioural, and the behavioural part matters more than people think.
- Pace yourself. Hammering a server with hundreds of requests a second is both rude and the single fastest way to get blocked. Add delays, add jitter so requests are not perfectly periodic, and respect any rate limits the site signals.
- Back off on trouble. When you see a 429, a 503, or a sudden block, slow down or pause rather than retrying instantly in a tight loop. Exponential backoff with a cap is the standard, and it works.
- Rotate where genuinely needed. Proxy rotation and varied request patterns help at real scale, but they are a tool, not a default. Reach for them when a specific site needs it, not as a reflex.
- Look like a real client. A realistic user agent and the headers a browser actually sends go a long way. With a headless browser you get most of this for free.
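The "back off on trouble" advice can be sketched as capped exponential backoff with jitter. This is an illustration under stated assumptions: `RetryableError` is a hypothetical exception the caller's own fetch function raises for 429 or 503 responses, and the base delay, cap, and attempt count are placeholder values.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical: raised by the caller's fetch function for 429/503-style responses."""


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Capped exponential backoff with full jitter.

    The upper bound doubles with each attempt until it reaches `cap`;
    the actual delay is drawn uniformly from [0, upper bound].
    """
    return random.uniform(0, min(cap, base * 2 ** attempt))


def fetch_with_backoff(fetch, url: str, max_attempts: int = 5, sleep=time.sleep):
    """Call fetch(url), pausing longer after each retryable failure."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except RetryableError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: fail loudly rather than loop forever
            sleep(backoff_delay(attempt))
```

The jitter matters as much as the exponent: without it, several workers that were blocked at the same moment all retry at the same moment too.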
Pagination, retries, and idempotency
Real datasets are paginated, and pagination is where naive scrapers silently lose data. Walk every page, handle the "next" link or the page parameter explicitly, and detect the end condition rather than guessing a page count. Wrap each fetch in a retry with backoff so a single transient failure does not kill a run. And make the whole thing idempotent: running it twice should not create duplicate or corrupted data, because at some point you will run it twice.
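A sketch of pagination with an explicit end condition instead of a guessed page count. The `fetch_page` callable and the `{"items": [...], "next": ...}` response shape are assumptions for illustration; real sources expose the cursor as a link, a page parameter, or a token.

```python
from typing import Callable, Iterator, Optional


def crawl_pages(
    fetch_page: Callable[[Optional[str]], dict],
    max_pages: int = 10_000,
) -> Iterator[dict]:
    """Yield every item by following the source's own 'next' cursor.

    fetch_page(cursor) is a hypothetical callable returning
    {"items": [...], "next": <cursor or None>}.
    """
    cursor: Optional[str] = None
    seen = set()
    for _ in range(max_pages):
        page = fetch_page(cursor)
        yield from page["items"]
        cursor = page.get("next")
        if cursor is None:
            return  # explicit end condition: the source says there is no next page
        if cursor in seen:
            raise RuntimeError(f"pagination loop detected at cursor {cursor!r}")
        seen.add(cursor)
    raise RuntimeError(f"gave up after {max_pages} pages without reaching the end")
```

Both failure branches raise rather than return quietly, so a truncated crawl can never pass for a complete one.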
Stage two: parsing that survives change
Sites change their markup. That is not an edge case, it is the permanent background condition of the work. The goal of the parse stage is to extract the data you need in a way that bends when the layout shifts instead of shattering.
Prefer stable anchors over brittle ones
Some ways of locating data on a page are far more durable than others:
- Stable: semantic structure, ARIA roles, data-* attributes, visible text labels you can anchor to ("find the value next to the cell that says 'Price'"), and any structured data the site already publishes.
- Brittle: deep CSS paths, nth-child chains, auto-generated class names, and absolute XPath expressions. These break the moment a designer adds a wrapping <div>.
A scraper built on stable anchors survives most redesigns untouched. A scraper built on div > div > div:nth-child(3) > span breaks weekly.
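To make "find the value next to the cell that says 'Price'" concrete, here is a small sketch using only the standard library's html.parser. It assumes the label and value sit in adjacent table cells; the class and function names are illustrative, and a production parser would more likely use a dedicated HTML library.

```python
from html.parser import HTMLParser
from typing import Optional


class CellCollector(HTMLParser):
    """Collect the visible text of every <th>/<td> cell, in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def value_next_to(html: str, label: str) -> Optional[str]:
    """Return the text of the cell that follows the cell whose text equals `label`."""
    collector = CellCollector()
    collector.feed(html)
    collector.close()
    cells = [" ".join(c.split()) for c in collector.cells]  # collapse whitespace
    for i, text in enumerate(cells[:-1]):
        if text.casefold() == label.casefold():
            return cells[i + 1]
    return None
```

Because the lookup depends on the label text rather than the markup around it, extra wrappers and renamed classes inside the cells do not affect the result.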
Use the structured data the site already gives you
Many sites embed clean, machine-readable data right in the page: JSON-LD blocks, microdata, Open Graph tags, or a JSON API the front-end consumes. When that exists, parse it rather than scraping the rendered HTML. It is more stable, more complete, and far less likely to change than the visual layout.
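As an illustration of reading embedded structured data instead of the rendered layout, the sketch below pulls JSON-LD blocks out of a page with the standard library only. The function name is an assumption, and which fields a given site actually publishes has to be checked per source.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._in_jsonld = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data


def extract_jsonld(html: str) -> list:
    """Return every JSON-LD object in the page, skipping blocks that fail to parse."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    found = []
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # a malformed block should not sink the whole page
        found.extend(data if isinstance(data, list) else [data])
    return found
```

The returned dictionaries can then go straight into the normalisation step, with the HTML selectors kept only as a fallback.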
Normalise into one schema early
This is the step that quietly creates most of the value, and the one most people skip. When you scrape the same kind of thing from three different sources, "which source is cheapest?" only has an answer if all three are mapped onto the same fields, with the same units, the same date formats, and the same product identity. That normalisation layer is where scattered scraped fragments become a dataset you can actually reason about.
```python
# Parse, then normalise onto one shared schema, immediately.
from dataclasses import dataclass


@dataclass
class Product:
    sku: str          # a stable identity, mapped across sources
    name: str
    price_cents: int  # always minor units, never floats for money
    currency: str
    in_stock: bool
    source: str
    scraped_at: str   # ISO 8601, with timezone


# canonical_sku, to_cents, parse_stock and now_iso are source-specific
# helpers defined elsewhere in the pipeline.
def normalize(raw: dict, source: str) -> Product:
    return Product(
        sku=canonical_sku(raw),              # the hard, valuable part
        name=raw["title"].strip(),
        price_cents=to_cents(raw["price"]),  # parse "12,99 zł" -> 1299
        currency=raw.get("currency", "PLN"),
        in_stock=parse_stock(raw["availability"]),
        source=source,
        scraped_at=now_iso(),
    )
```
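The to_cents helper that normalize calls is not shown here. One way it might look, as a sketch rather than the actual implementation: it assumes that a final comma or dot followed by one or two digits is the decimal separator, and that every other separator is a thousands grouping.

```python
import re


def to_cents(text: str) -> int:
    """Parse a displayed price such as "12,99 zł" or "1 299,00" into integer minor units."""
    # Keep only digits and separators, then drop stray leading/trailing punctuation.
    digits = re.sub(r"[^\d,.]", "", text).strip(".,")
    if not re.search(r"\d", digits):
        raise ValueError(f"no price found in {text!r}")
    # Assumption: a final separator followed by 1-2 digits marks the decimal part.
    match = re.fullmatch(r"(.*?)[.,](\d{1,2})", digits)
    if match:
        whole, frac = match.group(1), match.group(2)
    else:
        whole, frac = digits, ""
    whole = re.sub(r"[.,]", "", whole) or "0"  # remaining separators are grouping
    return int(whole) * 100 + int(frac.ljust(2, "0") if frac else "0")
```

The parse goes from string to integer without ever passing through a float, so the stored value is exactly what the page displayed. Sources with three decimal places or other conventions would need their own rule.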
Stage three: validation, the part that separates a demo from a tool
This is the stage that separates a weekend project from something a business runs on, and it is the one beginners skip entirely. Before a single row is stored, it gets checked. Is the price within a plausible range? Did the page actually return content, or an empty shell? Is the record still the thing you think it is, or did the structure shift so you are now reading the wrong field?
The reason this matters so much is that bad data is invisible until it costs you. A scraper that is right 95% of the time sounds fine, until you realise you have no way of knowing which 5% is wrong, so you cannot trust any of it. One mispriced product in a purchasing decision can cost more than the entire scraper.
What to validate
- Range and sanity checks. Prices within plausible bounds, counts non-negative, dates not in the future, strings not empty where they must not be.
- Schema validation. Every field is the right type and shape. A library like pydantic turns this into a few lines and rejects malformed records at the boundary.
- Page-level checks. Did you actually get a product page, or a login wall, a 404 dressed as a 200, or a "we are doing maintenance" placeholder? Check for the markers of a real result before trusting any of it.
- Volume checks. If yesterday returned 98,000 rows and today returns 12, something is broken upstream even if every individual row is valid. Compare against the recent norm.
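The checks above can be sketched in plain Python. Everything specific here is an assumption to be tuned per source: the price bounds, the block-page phrases, the 50% volume threshold, and the simplified `Row` type standing in for the full schema.

```python
from dataclasses import dataclass
from statistics import median

# Assumed bounds and markers; tune them per source.
MIN_PRICE_CENTS, MAX_PRICE_CENTS = 1, 1_000_000
BLOCK_MARKERS = ("log in to continue", "access denied", "maintenance")


@dataclass
class Row:
    sku: str
    name: str
    price_cents: int


def row_problems(row: Row) -> list:
    """Range and sanity checks for one record; an empty list means the row passes."""
    problems = []
    if not row.sku.strip():
        problems.append("empty sku")
    if not row.name.strip():
        problems.append("empty name")
    if not MIN_PRICE_CENTS <= row.price_cents <= MAX_PRICE_CENTS:
        problems.append(f"price out of range: {row.price_cents}")
    return problems


def looks_like_product_page(html: str, required_marker: str) -> bool:
    """Page-level check: the expected marker is present and no block-page phrase is."""
    text = html.lower()
    return required_marker.lower() in text and not any(m in text for m in BLOCK_MARKERS)


def volume_ok(today: int, recent_counts: list, min_ratio: float = 0.5) -> bool:
    """Volume check: today's row count must reach min_ratio of the recent median."""
    if not recent_counts:
        return True  # no baseline yet
    return today >= min_ratio * median(recent_counts)
```

Returning a list of problems instead of a bare boolean gives the monitoring stage something concrete to report when rejections spike.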
Without data, you're just another person with an opinion. But with bad data, you're worse off than both.
Stage four: delivery, scheduling, and storage
Once data is clean and validated, delivery should be almost boring, and that is the goal. A scheduled run writes the results to wherever the humans actually work: a Google Sheet they already open every morning, an Excel export, a database your app reads, or an API another system calls.
The disciplines here are quiet but real:
- Deliver on a predictable rhythm. Cron is perfectly good for most cases; a task queue earns its complexity only when you need it. Match the schedule to how the data is used: daily for purchasing, hourly for fast-moving listings, on-demand for ad-hoc questions.
- Store history, not just the latest state. A dated record of every run is what makes trend analysis and "when did this change?" possible. It is also your audit trail when someone questions a number.
- Use the right type for money and identity. Store money as integer minor units, never floats. Keep a stable identity for each record so you can track it over time and deduplicate reliably.
- Make output match the workflow. The best pipeline is one nobody has to think about, because the right numbers are simply there, in the format they expect, when they start work.
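A sketch of "store history, not just the latest state" using SQLite from the standard library. The table name, columns, and function names are illustrative assumptions; the same shape works in any database that supports a composite primary key.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS price_history (
    source      TEXT    NOT NULL,
    sku         TEXT    NOT NULL,
    run_date    TEXT    NOT NULL,  -- ISO 8601 date of the run
    price_cents INTEGER NOT NULL,  -- integer minor units, never floats
    PRIMARY KEY (source, sku, run_date)
)
"""


def append_snapshot(conn: sqlite3.Connection, source: str, run_date: str, rows: list) -> None:
    """Store one dated snapshot of (sku, price_cents) pairs.

    Re-running the same date replaces that date's rows instead of duplicating them.
    """
    conn.execute(SCHEMA)
    with conn:  # one transaction: the whole run lands or none of it does
        conn.executemany(
            "INSERT OR REPLACE INTO price_history (source, sku, run_date, price_cents) "
            "VALUES (?, ?, ?, ?)",
            [(source, sku, run_date, price_cents) for sku, price_cents in rows],
        )


def price_changes(conn: sqlite3.Connection, source: str, sku: str) -> list:
    """Answer 'when did this change?': runs where the price differs from the previous run."""
    history = conn.execute(
        "SELECT run_date, price_cents FROM price_history "
        "WHERE source = ? AND sku = ? ORDER BY run_date",
        (source, sku),
    ).fetchall()
    return [cur for prev, cur in zip(history, history[1:]) if cur[1] != prev[1]]
```

The composite key of source, SKU, and run date is what makes delivery idempotent: an accidental second run on the same day leaves the history unchanged.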
Monitoring: the part everyone skips and everyone regrets
Here is the uncomfortable truth: a scraper is never finished, because the sources it depends on never stop changing. A site redesigns, an API tweaks a field, a login form adds a step. Without monitoring, you find out weeks later, when someone notices the numbers have looked wrong for a while and nobody can say since when.
So treat monitoring as a first-class part of the system, not an afterthought bolted on later:
- Alert on failure and on silence. A run that crashes is easy to catch. A run that "succeeds" but returns far too few records, or a sudden spike in validation rejections, is the dangerous one. Alert on both.
- Watch the shape of the data over time, not just whether the job exited zero. A canary record you know the expected value of is a cheap, powerful early warning.
- Make sure a broken source pages a human within a day, not a quarter. The cost of a broken scraper is not the fix, it is every decision made on stale or wrong data in the meantime.
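The canary idea can be sketched in a few lines. The assumption is that you pick records whose fields rarely change (a product name, a currency) and record their expected values by hand; the function name and data shapes are illustrative.

```python
def canary_alerts(rows: dict, canaries: dict) -> list:
    """Compare known records against what the run produced.

    rows:     {sku: {field: value}} as produced by today's run
    canaries: {sku: {field: expected_value}} maintained by hand
    Returns one message per missing canary or mismatched field.
    """
    alerts = []
    for sku, expected in canaries.items():
        actual = rows.get(sku)
        if actual is None:
            alerts.append(f"canary {sku} missing from run")
            continue
        for field, want in expected.items():
            if actual.get(field) != want:
                alerts.append(
                    f"canary {sku}: {field} is {actual.get(field)!r}, expected {want!r}"
                )
    return alerts
```

A canary catches the failure that per-row validation cannot: a selector that has drifted onto the wrong element still yields a non-empty, well-typed string.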
This is exactly why I treat scraping as an ongoing support and maintenance relationship rather than a one-off delivery. The Run phase is not padding, it is what keeps the data trustworthy month after month.
Scaling: from one site to millions of records
Most scrapers start small and grow. Scaling well is mostly about doing less, not doing more:
- Crawl incrementally. Re-fetching everything every run is wasteful and rude. Track what you have seen and fetch only what changed where the source lets you.
- Add concurrency carefully. Parallel requests speed things up until they get you blocked. Concurrency with a sane limit and politeness beats raw parallelism every time.
- Separate the stages at scale. A queue between fetch and parse lets each scale independently and makes retries clean. Reach for this when volume demands it, not before.
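"Concurrency with a sane limit" can be sketched with asyncio and a semaphore. The `fetch` coroutine is supplied by the caller and is hypothetical here, and the limit and per-request pause are placeholder values to be set per site.

```python
import asyncio


async def fetch_all(urls, fetch, max_concurrency: int = 4, delay_s: float = 0.0):
    """Fetch every URL with at most max_concurrency requests in flight.

    Results come back in the same order as `urls`.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url):
        async with semaphore:
            result = await fetch(url)
            await asyncio.sleep(delay_s)  # politeness pause while still holding the slot
            return result

    return await asyncio.gather(*(fetch_one(u) for u in urls))
```

Holding the semaphore slot during the pause caps the request rate as well as the number of simultaneous connections.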
The legal and ethical questions, briefly
Scraping sits in a real legal and ethical context, and it is worth being deliberate about it. The honest, practical position I work from:
- Favour public, non-personal data. Publicly available information is the safest ground. Personal data brings GDPR and similar regimes into play and needs a lawful basis and care.
- Respect the site. Honour reasonable rate limits, do not degrade the service for real users, and pay attention to a site's terms where they apply to your use.
- Scope it together up front. For anything sensitive or borderline, the right move is to decide the boundaries before you build, not after. That is part of what the audit is for.
This is not legal advice, and the specifics depend on your jurisdiction and use case. But "is this public, am I being polite, and is there a lawful basis" is the right starting question every time.
The mistakes that quietly poison your data
After enough production scrapers, the same failures show up again and again. Avoiding these puts you ahead of most:
- Silent failure. Writing error pages, logged-out shells, or partial results as if they were real data. Always verify you got a real result first.
- Floats for money. 0.1 + 0.2 is not 0.3. Store money as integer minor units and convert only at display.
- No history. Overwriting yesterday's data with today's, so you can never answer "when did this change?" or recover from a bad run.
- Brittle selectors. Building on auto-generated classes and deep paths that break on the next redesign.
- No volume check. Trusting a run that returned 12 rows because each of the 12 was valid.
- Treating it as one-off. Shipping a scraper and walking away, then being surprised when it is dead three months later.
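The floats-for-money point is easy to verify for yourself. A short demonstration, with `format_cents` as an illustrative helper for non-negative amounts:

```python
# Binary floating point cannot represent 0.1 or 0.2 exactly.
float_total = 0.1 + 0.2
print(float_total == 0.3)  # False
print(float_total)         # 0.30000000000000004

# Integer minor units: exact arithmetic, converted only for display.
cents_total = 10 + 20
print(cents_total == 30)   # True


def format_cents(cents: int) -> str:
    """Render non-negative integer minor units as a decimal string for display."""
    return f"{cents // 100}.{cents % 100:02d}"


print(format_cents(cents_total))  # 0.30
```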
A reliable run, end to end
Putting the stages together, a production run looks like this. Notice how much of it is about not trusting the happy path.
```python
# The shape of a reliable daily run: stages honest and separate, failures loud.
def daily_run(source):
    session = authenticate(source)                      # fetch: login + anti-bot
    raw = crawl(session, source.catalog)                # fetch: paginate fully, retry
    parsed = [parse(p) for p in raw]                    # parse: extract
    rows = [normalize(p, source.name) for p in parsed]  # parse: one schema
    clean = [r for r in rows if validate(r)]            # validate: reject bad data

    if len(clean) < source.min_expected:                # monitor: volume sanity check
        alert(f"{source.name}: only {len(clean)} rows, expected ~{source.min_expected}")
        return                                          # do NOT deliver a broken run

    rejected = len(rows) - len(clean)
    if rejected > len(rows) * 0.05:                     # monitor: rejection spike
        alert(f"{source.name}: {rejected} rows failed validation, investigate")

    store_history(clean, source)                        # deliver: dated history
    export(clean, source.destination)                   # deliver: Sheets / DB / API
```
Where to go from here
If you are checking prices, stock, listings, or competitor data by hand today, there is almost certainly a version of this that runs while you sleep, and pays for itself quickly. The hard part is not the first script. It is everything that makes it reliable, and that is exactly the part I take on.
See how this fits into a wider engagement in how every project runs, understand the commercials on the pricing page, read more on when to reach for code versus no-code, or tell me what you want to collect and I will come back with specifics within 24 hours.

