
How to Debug 403 and 429 Blocks in Web Scrapers (and Fix Them with Proxies)

Most scrapers fail for two boring reasons: you send a request that looks wrong, or you send too many. The server then answers with 403, 429, or a soft block page. You can fix this fast if you treat it like a debug task, not a guess.

This guide shows a repeatable way to find the real cause. It also shows where proxies help, and where they make things worse. The steps fit the hands-on style you see on Our Code World: short checks, small code, and clear outputs.

Start by proving what kind of block you hit

Do not add proxies first. Capture one failing request and make it easy to replay. Save the URL, method, headers, and body.
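
One way to make a failing request replayable is to write its parts to a small JSON file. The sketch below uses only the standard library. The helper names (capture_request, save_capture, load_capture) and the file layout are illustrative assumptions, not part of any library.

```python
import json

def capture_request(method, url, headers=None, body=None):
    """Bundle the parts of one request into a JSON-safe dict (hypothetical helper)."""
    if isinstance(body, bytes):
        # JSON cannot hold raw bytes, so decode for storage.
        body = body.decode("utf-8", "replace")
    return {
        "method": method.upper(),
        "url": url,
        "headers": dict(headers or {}),
        "body": body,
    }

def save_capture(capture, path):
    """Write the capture to disk so it can be replayed later."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(capture, fh, indent=2, sort_keys=True)

def load_capture(path):
    """Read a capture back as a dict."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)

# With requests, the sent request is available as resp.request:
#   req = resp.request
#   save_capture(capture_request(req.method, req.url, req.headers, req.body),
#                "failing_request.json")
# Replay it later with:
#   cap = load_capture("failing_request.json")
#   session.request(cap["method"], cap["url"],
#                   headers=cap["headers"], data=cap["body"])
```

Keeping the capture in a file means you can replay the same request after each change and compare results one variable at a time.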

Check the response shape

Look at status code, headers, and body size. Many “403” pages return HTML with a vendor banner. Many “429” pages return a short text body and a Retry-After header.

Log these fields per request: status, content-type, content-length, and final URL. Add the first 200 bytes of the body. This single log often tells you if you got a real page or a block page.

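import hashlib

def short_fingerprint(resp):
    """Return a compact tuple that identifies the shape of a response."""
    ct = resp.headers.get("content-type", "")
    cl = resp.headers.get("content-length", "")

    # Hash the first 200 characters of the body so that identical
    # block pages produce the same short signature.
    head = (resp.text or "")[:200].encode("utf-8", "ignore")
    sig = hashlib.sha1(head).hexdigest()[:10]

    # resp.url is the final URL after any redirects.
    return resp.status_code, ct, cl, resp.url, sig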

Replay with curl to rule out your client

Some blocks key on TLS, HTTP/2, or header order. A curl replay gives you a second client with a known stack. If curl works and your code fails, you likely hit a client fingerprint issue.

curl -i "https://example.com/path" \
  -H "User-Agent: Mozilla/5.0" \
  -H "Accept-Language: en-US,en;q=0.9"

Fix request identity before you scale up

Many sites expect a normal browser flow. Your scraper must look consistent across requests. Keep headers stable and keep cookies when the site uses them.

Use a session, and stop random header spam

Random headers often reduce trust. Pick a real browser set and keep it. Use a cookie jar so you do not start “fresh” on each hit.

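import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# One session keeps cookies and reuses connections across requests.
session = requests.Session()

# A fixed, browser-like header set, sent on every request.
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
})

# Retry transient failures with exponential backoff.
# urllib3 honors a Retry-After header on 429 and 503 by default.
# The allowed_methods argument needs urllib3 1.26 or newer.
retry = Retry(
    total=5,
    backoff_factor=0.8,
    status_forcelist=[429, 500, 502, 503, 504],
    allowed_methods=["GET", "HEAD"],
)

# Apply the same retry policy to both schemes.
adapter = HTTPAdapter(max_retries=retry)
session.mount("https://", adapter)
session.mount("http://", adapter)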

Respect pacing with backoff you can explain

Rate limits exist to protect systems. HTTP 429 means “Too Many Requests” (RFC 6585), so treat it as a signal, not a bug. When the response carries a Retry-After header, wait at least that long. Backoff with jitter cuts repeat hits and helps you learn the real limit.

Keep a per-host cap, even when you use many IPs. Many blocks key on account, cookie, or TLS traits, not just IP.
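
A minimal sketch of both ideas, using only the standard library. The names backoff_delay and HostPacer are hypothetical, and the default values (1 second base, 60 second cap, 1 second per-host interval) are assumptions you should tune per target. The delay uses the "full jitter" variant of exponential backoff, and a numeric Retry-After value takes priority when one is present.

```python
import random
import time
from urllib.parse import urlsplit

def backoff_delay(attempt, base=1.0, cap=60.0, retry_after=None, rng=random):
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # Numeric Retry-After (seconds) from the server wins.
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # The HTTP-date form is not handled in this sketch.
    # Full jitter: a random wait between 0 and the capped exponential ceiling.
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)

class HostPacer:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self._next_ok = {}  # host -> earliest time the next request may start

    def wait(self, url):
        """Sleep if needed before a request to `url`; return the delay used."""
        host = urlsplit(url).hostname or ""
        now = self.clock()
        ready_at = self._next_ok.get(host, now)
        delay = max(0.0, ready_at - now)
        if delay:
            self.sleep(delay)
        self._next_ok[host] = max(now, ready_at) + self.min_interval
        return delay

# Usage sketch:
#   pacer = HostPacer(min_interval=2.0)
#   pacer.wait(url)            # before every request, proxied or not
#   resp = session.get(url)
#   if resp.status_code == 429:
#       time.sleep(backoff_delay(attempt, retry_after=resp.headers.get("Retry-After")))
```

The pacer keys on the host name rather than the exit IP, so the cap still holds when requests leave through many proxies.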

Pick the right proxy type for the job

Proxies help when a site enforces geo rules, per-IP caps, or ISP-based trust. They do not fix broken headers or bad pacing. They also add cost and new failure modes.

Match proxy to target behavior

Datacenter IPs work for many public pages and bulk fetches. They also show up in block lists more often. Residential IPs look closer to real users, but you must handle slower speed and higher churn.

Use mobile proxies when the site ties access to carrier ranges, or when the same page works on phones but fails on desktop IPs. They also help with some app-like flows that expect mobile networks.

Route by host, not by “all traffic”

Do not proxy everything. Split traffic by domain and risk. Send only blocked hosts through proxies, and keep the rest direct for speed and cost control.

Implement simple routing in code. Keep a map of host to proxy pool name. This makes tests clean and keeps proxy use traceable.
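
The host-to-pool map can be as small as two dictionaries and one lookup function. Everything named below is a placeholder: the pool names, the proxy URLs, the hosts, and the proxies_for helper are assumptions for illustration only.

```python
from urllib.parse import urlsplit

# Hypothetical pool names mapped to hypothetical proxy endpoints.
PROXY_POOLS = {
    "residential": "http://user:pass@residential.proxy.example:8000",
    "mobile": "http://user:pass@mobile.proxy.example:8000",
}

# Only hosts that are known to block direct traffic get a pool.
HOST_TO_POOL = {
    "shop.example.com": "residential",
    "m.example.org": "mobile",
}

def proxies_for(url):
    """Return (pool_name, proxies_dict) for a URL, or (None, None) for direct."""
    host = (urlsplit(url).hostname or "").lower()
    pool = HOST_TO_POOL.get(host)
    if pool is None:
        return None, None
    proxy = PROXY_POOLS[pool]
    # This dict shape is what the `proxies` argument in requests expects.
    return pool, {"http": proxy, "https": proxy}

# Usage sketch:
#   pool, proxies = proxies_for(url)
#   resp = session.get(url, proxies=proxies)
#   log.info("fetched %s via %s", url, pool or "direct")
```

Returning the pool name alongside the proxy settings lets you log which route each request took, which keeps proxy use traceable.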

Build a “block report” so fixes stay fixed

A scraper that runs daily needs proof when blocks rise. Add a small report with counts per status code, plus the top fingerprints from earlier. This helps you spot a new block page fast.

Track these signals per run

Count 200, 301, 403, 404, 429, and 5xx. Track median response size and median fetch time. Sudden size drops often mean you started saving a short block page.
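
A per-run report can be built from a list of small records. This sketch assumes each record is a dict with status, size (bytes), seconds, and fingerprint keys. That record shape and the block_report name are illustrative choices, not a fixed format.

```python
from collections import Counter
from statistics import median

def status_bucket(code):
    """Group all 5xx codes together; keep other codes as-is."""
    return "5xx" if 500 <= code <= 599 else str(code)

def block_report(records):
    """Summarize one run: counts per status, medians, top block fingerprints."""
    records = list(records)
    by_status = Counter(status_bucket(r["status"]) for r in records)
    # Only count fingerprints for responses that look like blocks.
    blocked = Counter(
        r["fingerprint"] for r in records if r["status"] in (403, 429)
    )
    sizes = [r["size"] for r in records]
    times = [r["seconds"] for r in records]
    return {
        "total": len(records),
        "by_status": dict(by_status),
        "median_size": median(sizes) if sizes else None,
        "median_seconds": median(times) if times else None,
        "top_block_fingerprints": blocked.most_common(3),
    }
```

Comparing median_size between runs is a cheap way to catch the case where requests still return 200 but the body is now a short block page.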

Keep samples of failed HTML for each fingerprint. Store them with a hash key and a timestamp. This makes vendor swaps easy to spot.
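
A sample store along these lines needs only the standard library. The sketch keeps one HTML file per fingerprint plus a small JSON record with first-seen and last-seen timestamps. The save_block_sample name and the on-disk layout are assumptions. The hash key reuses the recipe from earlier in this guide: SHA-1 over the first 200 characters, truncated to 10 hex digits.

```python
import hashlib
import json
import time
from pathlib import Path

def save_block_sample(html, status, out_dir, now=None):
    """Store one sample per fingerprint and update its metadata."""
    now = time.time() if now is None else now
    key = hashlib.sha1(html[:200].encode("utf-8", "ignore")).hexdigest()[:10]

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    meta_path = out / f"{key}.json"

    if meta_path.exists():
        # Known fingerprint: keep the original sample, update the counters.
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
    else:
        # New fingerprint: save the HTML once.
        (out / f"{key}.html").write_text(html, encoding="utf-8")
        meta = {"key": key, "status": status, "first_seen": now, "count": 0}

    meta["last_seen"] = now
    meta["count"] += 1
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return meta
```

A new key with a recent first_seen value is the signal to look for: it usually means the block page changed.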

Stay compliant and reduce risk

Read the site terms and follow robots.txt rules where they apply to your use case. Avoid logins you do not own, and do not bypass paywalls. Keep a contact email in your user agent when you run at scale.
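
Python ships a robots.txt parser in the standard library, so a pre-flight check takes a few lines. In this sketch the user agent string, the contact address, and the robots.txt content are made-up examples. In a real run you would fetch the file from the target host and cache it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user agent with a contact address.
USER_AGENT = "example-scraper/1.0 (+mailto:ops@example.com)"

def build_robot_rules(robots_txt):
    """Parse robots.txt text that was fetched (and cached) elsewhere."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

# Example rules, inlined here so the sketch runs without network access.
rules = build_robot_rules(
    "User-agent: *\n"
    "Crawl-delay: 5\n"
    "Disallow: /private/\n"
)

allowed = rules.can_fetch(USER_AGENT, "https://example.com/products")
blocked = rules.can_fetch(USER_AGENT, "https://example.com/private/data")
delay = rules.crawl_delay(USER_AGENT)  # None when the file sets no delay
```

Checking can_fetch before each new path, and feeding crawl_delay into your pacing, keeps the scraper within what the site has published.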

Use least data. Only fetch fields you need, and cache stable pages. Fewer hits mean fewer blocks, lower cost, and less risk.
