
Scraped 300 pages successfully. Site updated robots.txt at page 187 and blocked me.

DEV Community · by Nico Reyes · April 3, 2026 · 2 min read

Building a price tracker for electronics. Target: 300 product pages across an ecommerce site. Tested first 20 pages, everything worked. Ran the full scraper overnight.

Woke up to find 187 products scraped, then nothing. Zero errors in my logs.

What happened

The site admin updated their robots.txt while I was sleeping. Added Disallow: /products/* between page 187 and 188. My scraper checks robots.txt once at startup, then runs. By page 188, their server started returning 403 Forbidden.
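In hindsight, the "zero errors in my logs" part was my own fault: the loop treated every response as a success, so the 403s never surfaced. A minimal sketch of the status check I was missing (the function name is mine, not from the original scraper):

```python
def check_response(status_code, url):
    """Return True if the page is usable; raise on a hard block."""
    if status_code == 403:
        # The server is refusing us -- likely a robots.txt or IP block.
        raise PermissionError(f"403 Forbidden on {url}: stop and re-check robots.txt")
    if status_code != 200:
        # Soft failures still get logged instead of vanishing.
        print(f"warning: {status_code} on {url}, skipping")
        return False
    return True
```

With this in the loop, the run would have died loudly at page 188 instead of silently finishing.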

Fun times.

The mess I made

First attempt: Just scraped the remaining 113 pages ignoring robots.txt.

Got IP banned within 15 minutes. Smart.

Second attempt: Added 5-second delays between requests.

Still banned. Slower this time, but same result.

Third attempt: Residential proxies.

This worked but cost $40 for what should've been free data.

What I changed

```python
import time

import requests
from urllib.robotparser import RobotFileParser


class RobotChecker:
    def __init__(self, base_url):
        self.base_url = base_url
        self.last_check = 0
        self.cache_duration = 300  # 5 minutes
        self.parser = RobotFileParser()

    def can_fetch(self, url):
        # Refresh robots.txt every 5 minutes instead of once at startup
        if time.time() - self.last_check > self.cache_duration:
            self.parser.set_url(f"{self.base_url}/robots.txt")
            self.parser.read()
            self.last_check = time.time()
        return self.parser.can_fetch("*", url)


# In the scraper loop
robot = RobotChecker("https://example.com")
for page in pages:
    if not robot.can_fetch(page):
        print(f"Robots.txt changed, stopping at {page}")
        break
    # scrape page
```
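One thing I noticed while rewriting this: robots.txt can also request a minimum delay between hits via a Crawl-delay directive, and Python's RobotFileParser exposes it. Honoring it might have avoided the throttle bans from attempts one and two. A small sketch, parsing an inline example instead of fetching a live file:

```python
from urllib.robotparser import RobotFileParser

# Parse an inline robots.txt rather than hitting a real site.
parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Crawl-delay: 10",
    "Disallow: /checkout/",
])

print(parser.crawl_delay("*"))                   # 10 (seconds between requests)
print(parser.can_fetch("*", "/products/tv-55"))  # True
print(parser.can_fetch("*", "/checkout/cart"))   # False
```

crawl_delay() returns None when the directive is absent, so fall back to your own default delay in that case.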


Checking robots.txt every 5 minutes catches changes before I get banned. It also saves proxy costs when a site decides to block partway through a run.

Platform quirks

Some ecommerce platforms update robots.txt dynamically when traffic spikes. Shopify stores do this sometimes. Big sites like Amazon rarely change theirs; smaller ones panic and lock everything down.

If your scraper runs longer than 10 minutes, periodic robot checks matter. Most tutorials skip this because test runs finish fast.

Still annoying when sites block you halfway through.
