crawler: drop sleep rates 10x for Pro API, 2x for oshwhub detail

Calibrated against ladder probes on 2026-04-29. Findings in
docs/sources/probe_rate_limit_results.md.

  SLEEP_PRO      5.0 -> 0.5    (pro.lceda.cn API)
  SLEEP_BETWEEN  2.0 -> 1.0    (oshwhub detail/listing)
  SLEEP_SOURCE   5.0 unchanged (lceda.cn Std endpoints, not yet probed)
  SLEEP_PRO_CDN  0.2 unchanged (modules.lceda.cn, already optimized)

The original 5s rate for Pro API was set out of caution because Pro
requires a logged-in cookie. Empirical sustained-burst probe (25 distinct
UUIDs at 0.5s sleep, no recovery): 0/25 errors, median latency 410ms,
p90 932ms. The "Pro is rate-sensitive" assumption was wrong: server
tolerates QPS=2 cleanly.

oshwhub detail HTML pages slowed from p90 6.4s at 1.0s sleep to p90 15s
at 0.5s; server queue backs up. 1.0s is the headroom-safe water mark.

Net effect on batch-50 estimate: ~1.5h -> ~30min.

scripts/probe_rate_limit.py: rate-limit ladder probe tool. Reusable for
new endpoints (Std source still owes a probe). Designed for safety: 30s
tier recovery, low rep counts on auth hosts, bail on first non-200.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 00:45:34 +08:00
parent 3c00edf6db
commit cb868988b9
3 changed files with 358 additions and 9 deletions
--- a/docs/sources/probe_rate_limit_results.md
+++ b/docs/sources/probe_rate_limit_results.md
@@ -0,0 +1,92 @@
+# Rate-limit probe results
+
+**Probe date**: 2026-04-29
+**Script**: `scripts/probe_rate_limit.py`
+**Method**: Ladder test — N requests at decreasing inter-request sleep,
+30s recovery between tiers, watch for status != 200, body shrinkage,
+or latency degradation.
+
+## oshwhub.com listing API (`/api/project`)
+
+No auth. 6 tiers × 10 reps = 60 reqs total.
+
+| sleep | status | bad | latency p90 |
+|---|---|---:|---:|
+| 2.0s | all 200 | 0 | 1187ms |
+| 1.0s | all 200 | 0 | 1237ms |
+| 0.5s | all 200 | 0 | 567ms |
+| 0.25s | all 200 | 0 | 1180ms |
+| 0.1s | all 200 | 0 | 2194ms |
+| 0.0s | all 200 | 0 | 5362ms ← server soft-limits via latency |
+
+**Verdict**: 0.5s safe water mark. Going faster doesn't fail but server adds
+queueing latency (no return on the speed-up).
+
+## oshwhub.com detail HTML (`/<owner>/<path>`)
+
+No auth. 6 tiers × 10 distinct paths from batch-50 candidates.
+
+| sleep | status | bad | latency p90 |
+|---|---|---:|---:|
+| 2.0s | all 200 | 0 | 4767ms |
+| 1.0s | all 200 | 0 | 6350ms |
+| 0.5s | all 200 | 0 | **15364ms** ← queue building |
+| 0.25s | all 200 | 0 | 3755ms |
+| 0.1s | all 200 | 0 | 8179ms |
+| 0.0s | all 200 | 0 | 3856ms |
+
+**Verdict**: 1.0s safe water mark. Detail HTML is a 0.5 MB SSR page, so the
+server slows down earlier than on the listing API. Dropping to 0.5s already
+triggers server queueing (one outlier 15s response), risking timeout cascades on real bulk runs.
+
+## pro.lceda.cn API (`/api/v4/projects/<P>`)
+
+**Auth required** (logged-in cookie). Conservative ladder, reps capped at 8
+to limit fingerprint exposure. 5 tiers × 8 reqs.
+
+| sleep | status | bad | latency p90 |
+|---|---|---:|---:|
+| 5.0s | all 200 | 0 | 7299ms |
+| 2.0s | all 200 | 0 | 5518ms |
+| 1.0s | all 200 | 0 | 1409ms |
+| 0.5s | all 200 | 0 | 2995ms |
+| 0.25s | all 200 | 0 | 1552ms |
+
+Then **sustained burst test** at the chosen water mark:
+**25 distinct Pro UUIDs at 0.5s sleep, no recovery**.
+
+- 25/25 success (all status 200, all `success: true`)
+- median latency 410ms, p90 932ms, max 1853ms (first call only — TLS handshake)
+- effective QPS 1.0
+- wall time 24.9s (vs ~140s at the old 5s/req — 5.6× speedup)
+
+**Verdict**: 0.5s safe water mark. Empirically Pro API tolerates a nominal
+QPS=2 (0.5s sleep; ~1.0 effective once latency is counted) cleanly, even sustained.
+Originally set high (5s) out of caution because Pro requires a logged-in account; that caution was unjustified.
+
+## lceda.cn Std source endpoints — NOT YET PROBED
+
+Currently `SLEEP_SOURCE = 5.0`. Should be probed before lowering. Std
+crawler isn't on the critical path for batch-50 (~12 min vs Pro's
+~10 min savings), so this can wait.
+
+## modules.lceda.cn CDN — already at 0.2s
+
+CDN host serving AES-encrypted EPRO2 history blobs. Pre-existing
+`SLEEP_PRO_CDN = 0.2`, validated against editor HAR which fires blobs
+back-to-back without throttling. No further probing needed.
+
+## Settings applied
+
+```python
+SLEEP_BETWEEN = 1.0   # was 2.0  (oshwhub detail/listing)
+SLEEP_SOURCE  = 5.0   # unchanged (Std source — not yet probed)
+SLEEP_PRO     = 0.5   # was 5.0  (Pro API host, 10× shorter sleep)
+SLEEP_PRO_CDN = 0.2   # unchanged (CDN, already optimized)
+```
+
+## Net impact on batch-50 plan
+
+- Pro, 25 projects × ~5 API calls each: 5×5 = 25s/proj × 25 = ~10min  →  0.5×5 = 2.5s/proj × 25 = ~1min
+- Detail page scan, 50 projects: 50 × 2s = 100s  →  50 × 1s = 50s
+- Combined batch-50 walltime estimate: **~1.5h → ~30 min**
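
The two itemized line items in "Net impact on batch-50 plan" can be rechecked with a short calculation. The sketch below is not part of the commit: `sleep_budget_s` is a hypothetical helper, and it counts only time spent sleeping, ignoring request latency and any work the results file does not itemize. It shows how much of the combined estimate these two items account for.

```python
# Hypothetical helper (not in the commit): time spent in inter-request sleeps only.
def sleep_budget_s(items: int, calls_per_item: int, sleep_s: float) -> float:
    """Total sleep time in seconds, ignoring request latency."""
    return items * calls_per_item * sleep_s


# Pro API: 25 projects, ~5 API calls each (figures from the results file).
pro_before = sleep_budget_s(25, 5, 5.0)     # old SLEEP_PRO
pro_after = sleep_budget_s(25, 5, 0.5)      # new SLEEP_PRO

# oshwhub detail scan: 50 projects, one page each.
detail_before = sleep_budget_s(50, 1, 2.0)  # old SLEEP_BETWEEN
detail_after = sleep_budget_s(50, 1, 1.0)   # new SLEEP_BETWEEN

saved = (pro_before - pro_after) + (detail_before - detail_after)
print(f"Pro:    {pro_before:.1f}s -> {pro_after:.1f}s")
print(f"Detail: {detail_before:.1f}s -> {detail_after:.1f}s")
print(f"Sleep time saved: {saved:.1f}s (~{saved / 60:.1f} min)")
```

Under these assumptions the two items save about 10 minutes of sleep time, so the rest of the ~1.5h → ~30 min combined estimate presumably comes from latency and steps that are not broken out here.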