crawler: cut sleep intervals 10x for Pro API, 2x for oshwhub detail

Calibrated against ladder probes on 2026-04-29. Findings in
docs/sources/probe_rate_limit_results.md.

  SLEEP_PRO     5.0 -> 0.5  (pro.lceda.cn API)
  SLEEP_BETWEEN 2.0 -> 1.0  (oshwhub detail/listing)
  SLEEP_SOURCE  5.0 unchanged (lceda.cn Std endpoints — not yet probed)
  SLEEP_PRO_CDN 0.2 unchanged (modules.lceda.cn — already optimized)
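
A minimal sketch of how the four intervals above could be resolved per
request host. This is an illustration only: the SLEEP_BY_HOST table, the
sleep_for() helper, and the fall-back to the most conservative interval for
unknown hosts are assumptions, not the crawler's actual lookup code. The
hostnames and values are the ones listed above.

```python
from urllib.parse import urlsplit

# Hypothetical lookup table; values mirror the constants in this commit.
SLEEP_BY_HOST = {
    "pro.lceda.cn": 0.5,      # SLEEP_PRO
    "oshwhub.com": 1.0,       # SLEEP_BETWEEN
    "lceda.cn": 5.0,          # SLEEP_SOURCE (not yet probed)
    "modules.lceda.cn": 0.2,  # SLEEP_PRO_CDN
}

# Assumption: an unprobed host gets the slowest known interval.
DEFAULT_SLEEP = max(SLEEP_BY_HOST.values())


def sleep_for(url: str) -> float:
    """Return the sleep interval (seconds) to apply after fetching url."""
    host = (urlsplit(url).hostname or "").lower()
    return SLEEP_BY_HOST.get(host, DEFAULT_SLEEP)
```

Falling back to the slowest interval keeps a new, unprobed endpoint on the
safe side until a ladder run justifies a lower value.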

The original 5s interval for the Pro API was set out of caution because
Pro requires a logged-in cookie. An empirical sustained-burst probe (25
distinct UUIDs at 0.5s sleep, no recovery pause) returned 0/25 errors,
median latency 410ms, p90 932ms. The "Pro is rate-sensitive" assumption
was wrong: the server tolerates a 0.5s sleep (QPS up to 2) cleanly.

oshwhub detail HTML pages slow from p90 6.4s at a 1.0s sleep to p90 15s
at 0.5s as the server queue backs up, so 1.0s is kept as the setting that
still leaves headroom.
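
The figures quoted above (error count, median, p90) can be derived from raw
probe samples roughly as follows. This is a sketch under assumptions: the
percentile() and summarize() helpers are hypothetical names, and the
nearest-rank percentile definition is one common choice, not necessarily
the one the probe script uses.

```python
from statistics import median
from typing import Dict, Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with >= pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = -(-len(ordered) * pct // 100)  # integer ceil(n * pct / 100)
    return ordered[max(rank, 1) - 1]


def summarize(statuses: Sequence[int], latencies_ms: Sequence[float]) -> Dict[str, float]:
    """Summarize one probe tier: non-200 count, median and p90 latency."""
    return {
        "errors": sum(1 for s in statuses if s != 200),
        "total": len(statuses),
        "median_ms": median(latencies_ms),
        "p90_ms": percentile(latencies_ms, 90),
    }
```

With 25 samples, nearest-rank p90 is the 23rd-smallest latency, so a couple
of slow outliers do not move it.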

Net effect on batch-50 estimate: ~1.5h -> ~30min.

scripts/probe_rate_limit.py: rate-limit ladder probe tool. Reusable
for new endpoints (the Std source endpoints still need a probe). Designed
for safety: a 30s recovery pause between tiers, low repetition counts on
authenticated hosts, and bail-out on the first non-200 response.
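
A minimal sketch of the ladder idea described above, not the contents of
scripts/probe_rate_limit.py. The ladder_probe() name, its parameters, and
the injected fetch/sleep/clock callables are all assumptions made so the
example is self-contained and testable without network access.

```python
import time
from typing import Callable, Dict, List, Sequence


def ladder_probe(
    fetch: Callable[[], int],          # performs one request, returns HTTP status
    tiers: Sequence[float],            # sleep intervals, slowest first
    reps: int = 5,                     # requests per tier (keep low on auth hosts)
    recovery: float = 30.0,            # pause between tiers
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[float, List[float]]:
    """Walk down the tiers; return per-tier latencies, stopping at the first non-200."""
    results: Dict[float, List[float]] = {}
    for index, delay in enumerate(tiers):
        if index > 0:
            sleep(recovery)            # let the server recover before a faster tier
        latencies: List[float] = []
        results[delay] = latencies
        for _ in range(reps):
            start = clock()
            status = fetch()
            latencies.append(clock() - start)
            if status != 200:
                return results         # bail: do not keep hammering the host
            sleep(delay)
    return results
```

Tiers that were never reached are absent from the result, which makes a
bail-out easy to spot when comparing against the requested ladder.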

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 00:45:34 +08:00
parent 3c00edf6db
commit cb868988b9
3 changed files with 358 additions and 9 deletions


@@ -44,15 +44,24 @@ BROWSER_UA = (
     "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
     "Chrome/147.0.0.0 Safari/537.36"
 )
-SLEEP_BETWEEN = 2.0  # seconds between detail-page / file fetches
-SLEEP_SOURCE = 5.0   # source fetch is sensitive: QPS ≤ 0.2, per the logged-in-session spirit of CLAUDE.md
-SLEEP_PRO = 5.0      # Pro API host (pro.lceda.cn): rate-sensitive, keep at QPS ≤ 0.2
-# CDN host (modules.lceda.cn) only serves AES-encrypted history blobs.
-# HAR analysis (proexportNew2.har 2026-04-29) shows the editor fires these
-# blobs back-to-back without throttling, so the CDN can clearly take it.
-# Walltime for chain replay is dominated by this loop on multi-hundred-history
-# projects (X86 board: chain ≈ 700 → ~1h at 5s/req → a few min at 0.2s/req).
-SLEEP_PRO_CDN = 0.2
+# Per-host rate limits, calibrated against ladder probes (scripts/probe_rate_limit.py)
+# on 2026-04-29. See data/state/probe_rate_limit_results.md for the methodology.
+SLEEP_BETWEEN = 1.0  # oshwhub.com detail/listing. Ladder probe: 0.5s clean,
+                     # 1.0s leaves headroom (detail HTML p90 hits 6s at 1.0s,
+                     # 15s at 0.5s due to a server-queue soft limit).
+SLEEP_SOURCE = 5.0   # lceda.cn Std source endpoints: NOT yet probed; keep
+                     # conservative. Drop only after a dedicated ladder run.
+SLEEP_PRO = 0.5      # pro.lceda.cn API host: sustained-burst probe (25
+                     # distinct UUIDs at 0.5s) showed 0/25 errors, median
+                     # latency 410ms. 10x faster than the original 5.0s.
+                     # Originally set high out of caution because Pro requires
+                     # a logged-in cookie; empirically the Pro API tolerates
+                     # QPS=2 cleanly. CDN blob loop uses SLEEP_PRO_CDN below.
+SLEEP_PRO_CDN = 0.2  # modules.lceda.cn: CDN serving AES-encrypted EPRO2
+                     # history blobs. The editor fires these back-to-back per
+                     # HAR analysis. Chain replay walltime is dominated by this
+                     # loop on big projects (X86 board: ~1h at 5s/req →
+                     # ~3 min at 0.2s/req).
 # ---------------------------------------------------------------------------