Proxy hygiene for scraping: a quickref for fast checks, leaks, and rotation

Scrapers fail for plain reasons. A proxy times out. DNS leaks. TLS breaks. A pool mixes data center and home IPs. You then chase bugs in code that works fine.

This page reads like a quickref. It gives short tests you can copy into a shell. It also shows what to log so you can fix issues fast.

What to test before you crawl

Test each proxy on four axes: reach, route, trust, and speed. Reach means it can open TCP and pass auth. Route means your traffic exits on the IP you expect. Trust means it supports modern TLS and does not mangle headers. Speed means the proxy fits your crawl budget.

Know your pool math

Pool size limits your safe rate. IPv4 has 2^32 addresses, or 4,294,967,296 total. You never get close to that in a pool. A /24 block holds 256 addresses, and real inventory often offers far fewer usable IPs.

Rate math stays simple. The pool's safe total rate is its size times the per-IP rate a site tolerates, so requests per IP per site must stay low. A small pool forces longer gaps and more cache use.
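
The rate math above can be sketched as three small shell helpers. This is a minimal sketch, not a fixed rule: the function names are made up for this page, and the per-IP cap is a number you pick per site.

```shell
# Hypothetical helpers. The per-IP cap is your own estimate for a given site.

# max_pool_rate POOL_SIZE PER_IP_PER_MIN
# Prints the highest total requests per minute the pool can carry.
max_pool_rate() {
  echo $(( $1 * $2 ))
}

# min_pool_size TARGET_PER_MIN PER_IP_PER_MIN
# Prints the smallest pool that meets the target (ceiling division).
min_pool_size() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# gap_seconds PER_IP_PER_MIN
# Prints the minimum whole seconds between two requests on one IP.
gap_seconds() {
  echo $(( (60 + $1 - 1) / $1 ))
}
```

With a cap of 6 requests per minute per IP, max_pool_rate 256 6 prints 1536, min_pool_size 1000 6 prints 167, and gap_seconds 6 prints 10.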

Fast smoke tests with curl

Start with one command per proxy. Keep it boring and repeatable. Use a stable endpoint you control, if you can. If you cannot, use a public site with care.

# HTTP proxy, basic reach test

curl -sS --proxy http://USER:PASS@HOST:PORT https://example.com/ -o /dev/null -w "code=%{http_code} connect=%{time_connect} total=%{time_total}\n"

# SOCKS5 proxy

curl -sS --proxy socks5h://USER:PASS@HOST:PORT https://example.com/ -o /dev/null -w "code=%{http_code} connect=%{time_connect} total=%{time_total}\n"

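Track three numbers: HTTP code, connect time, and total time. If connect time spikes, the path to the proxy is slow or the proxy is overloaded. If total time spikes while connect time stays flat, the exit link or the site slows you down. Auth failures show up as HTTP 407, not as slow connects.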

If you validate many IPs, run a proxy tester. Keep the same target URL and the same timeout across runs.
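
If you have no tester at hand, a loop over a list file covers the basics. The sketch below assumes a file with one proxy URL per line. The function names and the file layout are assumptions for this example. The output repeats the proxy URL, so treat it as sensitive if the URLs carry credentials.

```shell
# check_proxy PROXY_URL TARGET_URL
# Prints one line: proxy, HTTP code, connect time, total time.
# Falls back to "000 0 0" if curl prints nothing at all.
check_proxy() {
  local result
  result=$(curl -s --connect-timeout 5 --max-time 10 \
    --proxy "$1" "$2" -o /dev/null \
    -w '%{http_code} %{time_connect} %{time_total}' \
    < /dev/null 2> /dev/null) || true
  printf '%s %s\n' "$1" "${result:-000 0 0}"
}

# check_pool LIST_FILE TARGET_URL
# Runs check_proxy for each line, skipping blanks and "#" comments.
# Same target and same timeouts for every proxy, so runs stay comparable.
check_pool() {
  local proxy
  while IFS= read -r proxy || [ -n "$proxy" ]; do
    case "$proxy" in
      ''|'#'*) continue ;;
    esac
    check_proxy "$proxy" "$2"
  done < "$1"
}
```

Call it as check_pool proxies.txt https://example.com/ and sort the output by the last column to find slow nodes.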

Pin timeouts so tests stay honest

Default timeouts hide pain. Pin them so bad nodes fail fast. A test that hangs ruins your queue and your logs.

# fail fast: 5s connect, 10s total

curl -sS --connect-timeout 5 --max-time 10 --proxy http://HOST:PORT https://example.com/ -o /dev/null

Catch leaks and mixed routes

Leaks break geo rules and get you flagged. The most common leak comes from DNS. Use socks5h so curl sends DNS lookups through the proxy. If you use socks5 without the h, curl resolves names on your local host, and your own resolver sees every target.
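
One way to guard against this is to rewrite the scheme before the URL reaches curl. The helper below is a small sketch with a made-up name. It relies on curl's socks5h:// and socks4a:// schemes, which ask the proxy to resolve names, and leaves every other URL untouched.

```shell
# remote_dns_scheme PROXY_URL
# Prints the URL with a SOCKS scheme that resolves names on the proxy.
remote_dns_scheme() {
  case "$1" in
    socks5://*) printf 'socks5h://%s\n' "${1#socks5://}" ;;
    socks4://*) printf 'socks4a://%s\n' "${1#socks4://}" ;;
    *)          printf '%s\n' "$1" ;;
  esac
}
```

For example, remote_dns_scheme socks5://USER:PASS@HOST:1080 prints socks5h://USER:PASS@HOST:1080.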

Header drift also matters. Some proxies add or change headers. Watch for Via, X-Forwarded-For, and odd Accept-Encoding changes. You can log request headers you send, then diff them at the edge.

# show response headers

curl -sS -D - --proxy http://HOST:PORT https://example.com/ -o /dev/null
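
To see what the proxy adds to your request, you need the headers as the target received them. The sketch below assumes you run an echo endpoint that returns those headers one per line; that endpoint and the function name are assumptions, not part of any tool. Over HTTPS through a CONNECT tunnel the proxy cannot touch headers unless it terminates TLS, so this check matters most for plain HTTP.

```shell
# proxy_headers
# Reads "Name: value" lines on stdin and prints the ones a proxy
# typically adds. Returns non-zero when none are found.
proxy_headers() {
  grep -iE '^(via|forwarded|x-forwarded-for|x-forwarded-proto|x-real-ip):'
}
```

Pipe the echo endpoint's body into it, for example curl -s --proxy "$PXY" "$ECHO_URL" | proxy_headers, where ECHO_URL points at your own endpoint. Any output means the proxy announces itself or leaks your address.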

Mixed routing shows up as split traits. Your IP geo says one place, but RTT says another. Your TLS handshake fails on some sites, but works on others. Treat that as a pool quality issue, not an app bug.

TLS and HTTP quirks that trigger blocks

Many sites key on TLS and HTTP shape. Old ciphers, missing SNI, and odd ALPN picks stand out. Keep your client stack up to date. A current curl and OpenSSL build helps more than proxy churn.

Test TLS reach on the proxy path. If you see handshake errors, swap the node. Do not patch around it with retries.

# quick TLS check (direct)

openssl s_client -servername example.com -connect example.com:443 < /dev/null

# if your proxy tool supports CONNECT tests, run them too
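
One such CONNECT test uses the -proxy option of openssl s_client, which tunnels the handshake through an HTTP proxy. The sketch below assumes an unauthenticated proxy given as HOST:PORT and assumes your build prints a summary line that starts with "New, ". The function name is made up for this page. For proxies that need credentials, check openssl s_client -help on your build.

```shell
# tls_via_proxy PROXY_HOST:PORT TARGET_HOST
# Prints the negotiated protocol and cipher summary line.
# Returns non-zero if no handshake completed through the proxy.
tls_via_proxy() {
  openssl s_client -proxy "$1" -servername "$2" -connect "$2:443" \
    < /dev/null 2> /dev/null | grep -m 1 -E '^New, (TLS|SSL)'
}
```

A failed handshake reports (NONE) for protocol and cipher, which the pattern rejects, so the exit status tells you whether to swap the node.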

Rotation rules that match real sites

Rotate for a reason, not by habit. Use sticky sessions for carts, logins, and flows with tokens. Rotate on hard fails, risk flags, and per-site caps. Rotate less on sites that allow high fetch rates and cache well.

Pick a key for stickiness. Use account id, cookie jar id, or task id. Keep that key stable for the full flow, then release it.
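
A stable key can map to a proxy with a plain hash. The sketch below uses the POSIX cksum tool and a modulo over the proxy list. It is one possible scheme, not the only one, and the function name is an assumption.

```shell
# pick_proxy KEY PROXY...
# Prints the proxy that KEY maps to. The same key and the same list
# always give the same proxy. Returns 1 if no proxies are given.
pick_proxy() {
  local key crc
  key=$1
  shift
  [ "$#" -gt 0 ] || return 1
  crc=$(printf '%s' "$key" | cksum)
  crc=${crc%% *}            # keep the checksum, drop the byte count
  shift $(( crc % $# ))     # skip ahead to the chosen proxy
  printf '%s\n' "$1"
}
```

Note the trade-off: when the list changes size, most keys move to a new proxy. Keep the list fixed for the life of a flow, or store the chosen proxy next to the key.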

Watch error codes. HTTP 407 means the proxy rejected your credentials. HTTP 401 means the target wants auth. HTTP 403 often means a block at the site. HTTP 429 means you hit a rate cap and need a slower pace or more IPs.
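
A small lookup keeps these reactions in one place. The sketch below is one possible mapping. The action names are invented for this example, and it also covers 407, the status proxies use for failed proxy auth, and the 000 that curl reports when it gets no HTTP response at all.

```shell
# classify_status HTTP_CODE
# Prints a short action tag for a status code taken from curl's %{http_code}.
classify_status() {
  case "$1" in
    000)     echo no_response ;;   # timeout, refused connect, or failed handshake
    2??|3??) echo ok ;;
    401)     echo target_auth ;;   # the site wants credentials
    403)     echo blocked ;;       # likely a block at the site: rotate
    407)     echo proxy_auth ;;    # the proxy rejected your credentials
    429)     echo slow_down ;;     # rate cap: slow the pace or add IPs
    5??)     echo retry_later ;;
    *)       echo inspect ;;
  esac
}
```

For example, classify_status 429 prints slow_down and classify_status 000 prints no_response.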

Logs you need for fast triage

Log each fetch with a small set of fields. Include proxy id, exit IP, target host, status code, bytes, connect time, and total time. Add a short tag for the crawl job and the retry count.
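
The fields above fit on one key=value line per fetch. The sketch below shows one possible layout. The function name, field names, and field order are assumptions, and a UTC timestamp is added in front.

```shell
# log_fetch JOB PROXY_ID EXIT_IP HOST STATUS BYTES CONNECT_S TOTAL_S RETRY
# Prints one key=value line with a UTC timestamp in front.
# Pass exactly nine arguments, in the order shown.
log_fetch() {
  printf 'ts=%s job=%s proxy=%s exit_ip=%s host=%s status=%s bytes=%s connect=%s total=%s retry=%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$@"
}
```

A call such as log_fetch news-01 px-17 203.0.113.9 example.com 200 51234 0.082 0.431 0 gives a line you can grep by proxy or by status. Log a proxy id, not the proxy URL, so credentials stay out of the logs.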

Keep raw pages only when you need them. Most ops work needs metrics, not full HTML. If you must store pages, set a short TTL and strip personal data.

Pasteable preflight block

Run this before a new crawl or a new proxy batch. It keeps your checks in one spot. It also fits well in a CI job or a cron task.

# set once

export PXY="http://USER:PASS@HOST:PORT"

export URL="https://example.com/"

# reach and speed

curl -sS --connect-timeout 5 --max-time 10 --proxy "$PXY" "$URL" -o /dev/null -w "code=%{http_code} connect=%{time_connect} total=%{time_total}\n"

# header drift

curl -sS -D - --proxy "$PXY" "$URL" -o /dev/null | head -n 20

# DNS via proxy (SOCKS only)

# use socks5h:// in PXY when you need remote DNS
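
For a CI job or a cron task, the check needs an exit status a runner can act on. The sketch below wraps the reach test in a function that fails on anything other than a 2xx or 3xx code. The function name and the pass rule are assumptions; tighten them to fit your crawl.

```shell
# preflight PROXY_URL TARGET_URL [MAX_TOTAL_SECONDS]
# Prints "code=... total=..." and returns 0 only for a 2xx or 3xx code.
preflight() {
  local out code total limit
  limit=${3:-10}
  out=$(curl -s --connect-timeout 5 --max-time "$limit" \
    --proxy "$1" "$2" -o /dev/null \
    -w '%{http_code} %{time_total}' < /dev/null) || true
  code=${out%% *}    # text before the first space
  total=${out#* }    # text after the first space
  echo "code=${code:-000} total=${total:-0}"
  case "$code" in
    2??|3??) return 0 ;;
    *)       return 1 ;;
  esac
}
```

In a pipeline, preflight "$PXY" "$URL" || exit 1 stops the crawl before a bad proxy batch reaches the queue.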
