crawler: drop sleep rates 10x for Pro API, 2x for oshwhub detail

Calibrated against ladder probes on 2026-04-29. Findings in
docs/sources/probe_rate_limit_results.md.

  SLEEP_PRO      5.0 -> 0.5    (pro.lceda.cn API)
  SLEEP_BETWEEN  2.0 -> 1.0    (oshwhub detail/listing)
  SLEEP_SOURCE   5.0 unchanged (lceda.cn Std endpoints; not yet probed)
  SLEEP_PRO_CDN  0.2 unchanged (modules.lceda.cn; already optimized)

The original 5s rate for Pro API was set out of caution because Pro
requires a logged-in cookie. Empirical sustained-burst probe (25 distinct
UUIDs at 0.5s sleep, no recovery): 0/25 errors, median latency 410ms,
p90 932ms. The "Pro is rate-sensitive" assumption was wrong: the server
tolerates QPS=2 cleanly.

oshwhub detail HTML pages slowed from p90 6.4s at 1.0s sleep to p90 15s
at 0.5s because the server queue backs up. 1.0s is the headroom-safe
water mark.

Net effect on batch-50 estimate: ~1.5h -> ~30min.

scripts/probe_rate_limit.py: rate-limit ladder probe tool. Reusable for
new endpoints (Std source still owes a probe). Designed for safety: 30s
tier recovery, low rep counts on auth hosts, bail on first non-200.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
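
For readers without the script at hand, the ladder-probe idea described in the message can be sketched roughly as follows. This is an illustration only, not the contents of scripts/probe_rate_limit.py: the names ladder_probe, p90, fetch and sleeper are hypothetical, and the fetch callable (returning an HTTP status and a latency in seconds) is an assumed interface. It walks the sleep tiers from slowest to fastest, pauses for a recovery period between tiers, and bails on the first non-200 response.

```python
import math
import statistics
import time
from typing import Callable, Iterable, Optional


def p90(values: list[float]) -> Optional[float]:
    """Nearest-rank 90th percentile; None for an empty sample."""
    if not values:
        return None
    ordered = sorted(values)
    return ordered[math.ceil(0.9 * len(ordered)) - 1]


def ladder_probe(
    fetch: Callable[[str], tuple[int, float]],  # hypothetical: url -> (status, latency_s)
    urls: Iterable[str],                        # distinct URLs, consumed once each
    sleeps: Iterable[float],                    # sleep tiers to test, in any order
    reps: int = 5,                              # requests per tier (keep low on auth hosts)
    recovery: float = 30.0,                     # pause between tiers
    sleeper: Callable[[float], None] = time.sleep,
) -> list[dict]:
    """Probe tiers from slowest to fastest; stop at the first non-200."""
    url_iter = iter(urls)
    results = []
    for tier, sleep_s in enumerate(sorted(sleeps, reverse=True)):
        if tier > 0:
            sleeper(recovery)  # let the server recover before a faster tier
        latencies: list[float] = []
        failed = False
        for _ in range(reps):
            url = next(url_iter, None)
            if url is None:
                break  # ran out of distinct URLs
            status, latency = fetch(url)
            if status != 200:
                failed = True  # bail immediately, do not keep hammering
                break
            latencies.append(latency)
            sleeper(sleep_s)
        results.append({
            "sleep": sleep_s,
            "ok": len(latencies),
            "failed": failed,
            "median": statistics.median(latencies) if latencies else None,
            "p90": p90(latencies),
        })
        if failed:
            break
    return results
```

In real use, fetch would wrap whatever HTTP client the crawler already uses and time each request. Injecting sleeper keeps the sketch testable without actually waiting.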
2026-04-29 00:45:34 +08:00
parent 3c00edf6db
commit cb868988b9
3 changed files with 358 additions and 9 deletions
--- a/crawlers/oshwhub/crawler.py
+++ b/crawlers/oshwhub/crawler.py
@@ -44,15 +44,24 @@ BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/147.0.0.0 Safari/537.36"
 )
-SLEEP_BETWEEN = 2.0  # seconds between detail-page / file fetches
-SLEEP_SOURCE = 5.0   # source fetch is sensitive; QPS ≤ 0.2 per the CLAUDE.md login-state spirit
-SLEEP_PRO = 5.0      # Pro API host (pro.lceda.cn): rate-sensitive, keep at QPS ≤ 0.2
-# CDN host (modules.lceda.cn) only serves AES-encrypted history blobs.
-# HAR analysis (proexportNew2.har 2026-04-29) shows the editor fires these
-# blobs back-to-back without throttling — the CDN can clearly take it.
-# Walltime for chain replay is dominated by this loop on multi-hundred-history
-# projects (X86 board: chain ≈ 700 → ~1h at 5s/req → ~few min at 0.2s/req).
-SLEEP_PRO_CDN = 0.2
+# Per-host rate limits — calibrated against ladder probes (scripts/probe_rate_limit.py)
+# on 2026-04-29. See data/state/probe_rate_limit_results.md for the methodology.
+SLEEP_BETWEEN = 1.0  # oshwhub.com detail/listing — ladder probe: 0.5s clean,
+                     # 1.0s leaves headroom (detail HTML p90 hits 6s at 1.0s,
+                     # 15s at 0.5s due to server-queue softlimit).
+SLEEP_SOURCE = 5.0   # lceda.cn Std source endpoints — NOT yet probed; keep
+                     # conservative. Drop only after a dedicated ladder run.
+SLEEP_PRO = 0.5      # pro.lceda.cn API host — sustained burst probe (25
+                     # distinct UUIDs at 0.5s) showed 0/25 errors, median
+                     # latency 410ms. 10x faster than the original 5.0s.
+                     # Originally set high out of caution because Pro requires
+                     # logged-in cookie; empirically Pro API tolerates QPS=2
+                     # cleanly. CDN blob loop uses SLEEP_PRO_CDN below.
+SLEEP_PRO_CDN = 0.2  # modules.lceda.cn — CDN serving AES-encrypted EPRO2
+                     # history blobs. The editor fires these back-to-back per
+                     # HAR analysis. Chain replay walltime dominated by this
+                     # loop on big projects (X86 board: ~1h at 5s/req →
+                     # ~3 min at 0.2s/req).


 # ---------------------------------------------------------------------------