crawler: drop sleep rates 10x for Pro API, 2x for oshwhub detail
Calibrated against ladder probes on 2026-04-29. Findings in
docs/sources/probe_rate_limit_results.md.

  SLEEP_PRO      5.0 -> 0.5    (pro.lceda.cn API)
  SLEEP_BETWEEN  2.0 -> 1.0    (oshwhub detail/listing)
  SLEEP_SOURCE   5.0 unchanged (lceda.cn Std endpoints, not yet probed)
  SLEEP_PRO_CDN  0.2 unchanged (modules.lceda.cn, already optimized)

The original 5s rate for Pro API was set out of caution because Pro
requires a logged-in cookie. Empirical sustained-burst probe (25 distinct
UUIDs at 0.5s sleep, no recovery): 0/25 errors, median latency 410ms,
p90 932ms. The "Pro is rate-sensitive" assumption was wrong: the server
tolerates QPS=2 cleanly.

oshwhub detail HTML pages slowed from p90 6.4s at 1.0s sleep to p90 15s
at 0.5s as the server queue backs up. 1.0s is the headroom-safe water mark.

Net effect on batch-50 estimate: ~1.5h -> ~30min.

scripts/probe_rate_limit.py: rate-limit ladder probe tool. Reusable for
new endpoints (Std source still owes a probe). Designed for safety: 30s
tier recovery, low rep counts on auth hosts, bail on first non-200.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
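The safety properties listed for the probe tool (tier recovery, bail on first non-200, per-tier latency summary) can be illustrated with a minimal sketch. This is not the contents of scripts/probe_rate_limit.py, which is not shown in this commit view; the names `ladder_probe`, `p90`, and the `fetch` callback returning `(status, latency_ms)` are hypothetical, and the nearest-rank percentile is only one of several common definitions.

```python
import statistics
import time
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical callback type: takes a URL, returns (HTTP status, latency in ms).
Fetch = Callable[[str], Tuple[int, float]]


def p90(samples: Sequence[float]) -> float:
    """Nearest-rank 90th percentile; samples must be non-empty."""
    ordered = sorted(samples)
    rank = (9 * len(ordered) + 9) // 10  # integer ceil(0.9 * n)
    return ordered[rank - 1]


def ladder_probe(
    fetch: Fetch,
    urls: Sequence[str],
    tiers: Sequence[float],
    recovery: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict]:
    """Probe each sleep tier in order (slowest first is the cautious choice).

    Pauses `recovery` seconds between tiers and stops the whole ladder
    on the first non-200 response.
    """
    results: List[Dict] = []
    for i, tier in enumerate(tiers):
        if i > 0:
            sleep(recovery)  # let the server settle before the next tier
        latencies: List[float] = []
        aborted = False
        for j, url in enumerate(urls):
            if j > 0:
                sleep(tier)  # the inter-request sleep under test
            status, latency_ms = fetch(url)
            if status != 200:
                aborted = True  # bail on first non-200
                break
            latencies.append(latency_ms)
        results.append({
            "sleep": tier,
            "ok": len(latencies),
            "aborted": aborted,
            "median_ms": statistics.median(latencies) if latencies else None,
            "p90_ms": p90(latencies) if latencies else None,
        })
        if aborted:
            break  # do not try faster tiers after a failure
    return results
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. Note that the achieved request rate is below 1/sleep, because each request's own latency adds to the interval between requests.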
@@ -44,15 +44,24 @@ BROWSER_UA = (
     "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
     "Chrome/147.0.0.0 Safari/537.36"
 )
-SLEEP_BETWEEN = 2.0  # seconds between detail-page / file fetches
-SLEEP_SOURCE = 5.0  # source fetch is sensitive — QPS ≤ 0.2 per CLAUDE.md登录态 spirit
-SLEEP_PRO = 5.0  # Pro API host (pro.lceda.cn): rate-sensitive, keep at QPS ≤ 0.2
-# CDN host (modules.lceda.cn) only serves AES-encrypted history blobs.
-# HAR analysis (proexportNew2.har 2026-04-29) shows the editor fires these
-# blobs back-to-back without throttling — the CDN can clearly take it.
-# Walltime for chain replay is dominated by this loop on multi-hundred-history
-# projects (X86 board: chain ≈ 700 → ~1h at 5s/req → ~few min at 0.2s/req).
-SLEEP_PRO_CDN = 0.2
+# Per-host rate limits — calibrated against ladder probes (scripts/probe_rate_limit.py)
+# on 2026-04-29. See data/state/probe_rate_limit_results.md for the methodology.
+SLEEP_BETWEEN = 1.0  # oshwhub.com detail/listing — ladder probe: 0.5s clean,
+                     # 1.0s leaves headroom (detail HTML p90 hits 6s at 1.0s,
+                     # 15s at 0.5s due to server-queue softlimit).
+SLEEP_SOURCE = 5.0   # lceda.cn Std source endpoints — NOT yet probed; keep
+                     # conservative. Drop only after a dedicated ladder run.
+SLEEP_PRO = 0.5      # pro.lceda.cn API host — sustained burst probe (25
+                     # distinct UUIDs at 0.5s) showed 0/25 errors, median
+                     # latency 410ms. 10x faster than the original 5.0s.
+                     # Originally set high out of caution because Pro requires
+                     # logged-in cookie; empirically Pro API tolerates QPS=2
+                     # cleanly. CDN blob loop uses SLEEP_PRO_CDN below.
+SLEEP_PRO_CDN = 0.2  # modules.lceda.cn — CDN serving AES-encrypted EPRO2
+                     # history blobs. The editor fires these back-to-back per
+                     # HAR analysis. Chain replay walltime dominated by this
+                     # loop on big projects (X86 board: ~1h at 5s/req →
+                     # ~3 min at 0.2s/req).
 
 
 # ---------------------------------------------------------------------------
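To make the per-host split concrete, here is a minimal sketch of how a crawler could pick the sleep for a URL from its hostname. The four constants and the host names come from this commit; the `HOST_SLEEP` table, `sleep_for`, and `max_qps` are hypothetical helpers, and how the crawler actually applies the constants is not shown in this hunk. Exact hostnames (for example a `www.` prefix) are an assumption.

```python
from urllib.parse import urlsplit

# Values as set by this commit.
SLEEP_BETWEEN = 1.0   # oshwhub.com detail/listing
SLEEP_SOURCE = 5.0    # lceda.cn Std source endpoints (not yet probed)
SLEEP_PRO = 0.5       # pro.lceda.cn API host
SLEEP_PRO_CDN = 0.2   # modules.lceda.cn CDN blobs

# Hypothetical lookup table; the real crawler may wire these differently.
HOST_SLEEP = {
    "oshwhub.com": SLEEP_BETWEEN,
    "lceda.cn": SLEEP_SOURCE,
    "pro.lceda.cn": SLEEP_PRO,
    "modules.lceda.cn": SLEEP_PRO_CDN,
}


def sleep_for(url: str) -> float:
    """Return the inter-request sleep for a URL's host.

    Unknown hosts fall back to the most conservative (largest) value.
    """
    host = urlsplit(url).hostname or ""
    return HOST_SLEEP.get(host, max(HOST_SLEEP.values()))


def max_qps(sleep_s: float) -> float:
    """Upper bound on request rate; real QPS is lower once latency is added."""
    return 1.0 / sleep_s
```

Falling back to the largest sleep for an unlisted host keeps a new endpoint conservative until it has its own probe run. As a rough check of the CDN comment, a chain of about 700 blobs at 0.2s costs about 140s of sleep alone, with request latency on top.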