
Scraped 300 pages successfully. Site updated robots.txt at page 187 and blocked me.

DEV Community · by Nico Reyes · April 3, 2026 · 2 min read

Building a price tracker for electronics. Target: 300 product pages across an ecommerce site. Tested first 20 pages, everything worked. Ran the full scraper overnight.

Woke up to find 187 products scraped, then nothing. Zero errors in my logs.

What happened

The site admin updated their robots.txt while I was sleeping. Added Disallow: /products/* between page 187 and 188. My scraper checks robots.txt once at startup, then runs. By page 188, their server started returning 403 Forbidden.
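In hindsight, the "zero errors in my logs" part was my own fault: the loop treated every response as a success, so the 403s never surfaced. A minimal sketch of the status check I was missing (the function name is mine, not from the original scraper):

```python
def check_response(status_code, url):
    """Return True if the page is usable; raise on a hard block."""
    if status_code == 403:
        # The server is refusing us -- likely a robots.txt or IP block.
        raise PermissionError(f"403 Forbidden on {url}: stop and re-check robots.txt")
    if status_code != 200:
        # Soft failures still get logged instead of vanishing.
        print(f"warning: {status_code} on {url}, skipping")
        return False
    return True
```

With this in the loop, the run would have died loudly at page 188 instead of silently finishing.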

Fun times.

The mess I made

First attempt: Just scraped the remaining 113 pages ignoring robots.txt.

Got IP banned within 15 minutes. Smart.

Second attempt: Added 5-second delays between requests.

Still banned. Slower this time, but same result.

Third attempt: Residential proxies.

This worked but cost $40 for what should've been free data.

What I changed

```python
import time

import requests
from urllib.robotparser import RobotFileParser


class RobotChecker:
    def __init__(self, base_url):
        self.base_url = base_url
        self.last_check = 0
        self.cache_duration = 300  # 5 minutes
        self.parser = RobotFileParser()

    def can_fetch(self, url):
        # Refresh robots.txt every 5 minutes instead of once at startup
        if time.time() - self.last_check > self.cache_duration:
            self.parser.set_url(f"{self.base_url}/robots.txt")
            self.parser.read()
            self.last_check = time.time()
        return self.parser.can_fetch("*", url)


# In the scraper loop
robot = RobotChecker("https://example.com")
for page in pages:
    if not robot.can_fetch(page):
        print(f"Robots.txt changed, stopping at {page}")
        break
    # scrape page
```
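One thing I noticed while rewriting this: robots.txt can also request a minimum delay between hits via a Crawl-delay directive, and Python's RobotFileParser exposes it. Honoring it might have avoided the throttle bans from attempts one and two. A small sketch, parsing an inline example instead of fetching a live file:

```python
from urllib.robotparser import RobotFileParser

# Parse an inline robots.txt rather than hitting a real site.
parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Crawl-delay: 10",
    "Disallow: /checkout/",
])

print(parser.crawl_delay("*"))                   # 10 (seconds between requests)
print(parser.can_fetch("*", "/products/tv-55"))  # True
print(parser.can_fetch("*", "/checkout/cart"))   # False
```

crawl_delay() returns None when the directive is absent, so fall back to your own default delay in that case.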


Checking robots.txt every 5 minutes catches changes before I get banned. It also saves proxy costs when a site decides to block partway through a run.

Platform quirks

Some ecommerce platforms update robots.txt dynamically when traffic spikes. Shopify stores do this sometimes. Big sites like Amazon rarely change theirs; smaller ones panic and lock everything down.

If your scraper runs longer than 10 minutes, periodic robot checks matter. Most tutorials skip this because test runs finish fast.

Still annoying when sites block you halfway through.
