crawler: drop sleep rates 10x for Pro API, 2x for oshwhub detail

Calibrated against ladder probes on 2026-04-29. Findings in
docs/sources/probe_rate_limit_results.md.

  SLEEP_PRO      5.0 -> 0.5  (pro.lceda.cn API)
  SLEEP_BETWEEN  2.0 -> 1.0  (oshwhub detail/listing)
  SLEEP_SOURCE   5.0 unchanged (lceda.cn Std endpoints, not yet probed)
  SLEEP_PRO_CDN  0.2 unchanged (modules.lceda.cn, already optimized)

The original 5s rate for the Pro API was set out of caution because Pro
requires a logged-in cookie. Empirical sustained-burst probe (25 distinct
UUIDs at 0.5s sleep, no recovery): 0/25 errors, median latency 410ms,
p90 932ms. The "Pro is rate-sensitive" assumption was wrong: the server
tolerates a 0.5s sleep cleanly (nominal QPS <= 2, measured effective
QPS about 1.0).

oshwhub detail HTML pages slowed from p90 6.4s at 1.0s sleep to p90 15s
at 0.5s as the server queue backs up. 1.0s is the headroom-safe water mark.

Net effect on batch-50 estimate: ~1.5h -> ~30min.

scripts/probe_rate_limit.py: rate-limit ladder probe tool. Reusable for
new endpoints (Std source still owes a probe). Designed for safety: 30s
recovery between tiers, low rep counts on auth hosts, and the ladder
stops after the first tier that records a non-200 or an exception.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 00:45:34 +08:00
parent 3c00edf6db
commit cb868988b9
3 changed files with 358 additions and 9 deletions
--- a/crawlers/oshwhub/crawler.py
+++ b/crawlers/oshwhub/crawler.py
@@ -44,15 +44,24 @@ BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/147.0.0.0 Safari/537.36"
 )
-SLEEP_BETWEEN = 2.0  # seconds between detail-page / file fetches
-SLEEP_SOURCE = 5.0   # source fetch is sensitive - QPS ≤ 0.2 per CLAUDE.md logged-in-session spirit
-SLEEP_PRO = 5.0      # Pro API host (pro.lceda.cn): rate-sensitive, keep at QPS ≤ 0.2
-# CDN host (modules.lceda.cn) only serves AES-encrypted history blobs.
-# HAR analysis (proexportNew2.har 2026-04-29) shows the editor fires these
-# blobs back-to-back without throttling - the CDN can clearly take it.
-# Walltime for chain replay is dominated by this loop on multi-hundred-history
-# projects (X86 board: chain ≈ 700 → ~1h at 5s/req → ~few min at 0.2s/req).
-SLEEP_PRO_CDN = 0.2
+# Per-host rate limits, calibrated against ladder probes (scripts/probe_rate_limit.py)
+# on 2026-04-29. See docs/sources/probe_rate_limit_results.md for the methodology.
+SLEEP_BETWEEN = 1.0  # oshwhub.com detail/listing. Ladder probe: listing is clean
+                     # at 0.5s, but detail HTML p90 hits 6s at 1.0s and 15s at
+                     # 0.5s as the server queue backs up, so 1.0s leaves headroom.
+SLEEP_SOURCE = 5.0   # lceda.cn Std source endpoints. NOT yet probed; keep
+                     # conservative. Drop only after a dedicated ladder run.
+SLEEP_PRO = 0.5      # pro.lceda.cn API host. Sustained burst probe (25
+                     # distinct UUIDs at 0.5s) showed 0/25 errors, median
+                     # latency 410ms. 10x shorter than the original 5.0s,
+                     # which was set high out of caution because Pro requires
+                     # a logged-in cookie; empirically the Pro API tolerates
+                     # 0.5s cleanly. CDN blob loop uses SLEEP_PRO_CDN below.
+SLEEP_PRO_CDN = 0.2  # modules.lceda.cn: CDN serving AES-encrypted EPRO2
+                     # history blobs. The editor fires these back-to-back per
+                     # HAR analysis. Chain replay walltime is dominated by
+                     # this loop on big projects (X86 board: ~1h at 5s/req →
+                     # ~3 min at 0.2s/req).
 # ---------------------------------------------------------------------------
--- a/docs/sources/probe_rate_limit_results.md
+++ b/docs/sources/probe_rate_limit_results.md
@@ -0,0 +1,92 @@
 # Rate-limit probe results
 **Probe date**: 2026-04-29
 **Script**: `scripts/probe_rate_limit.py`
 **Method**: Ladder test — N requests at decreasing inter-request sleep,
 30s recovery between tiers, watch for status != 200, body shrinkage,
 or latency degradation.
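
The ladder method can be sketched in a few lines. This is a minimal illustration, not the code in `scripts/probe_rate_limit.py`: `run_ladder`, `fake_fetch`, and the injectable `pause` argument are hypothetical names introduced here so the example runs without touching any real host.

```python
from __future__ import annotations

import time
from typing import Callable


def run_ladder(
    fetch: Callable[[], int],
    sleeps: list[float],
    reps: int,
    recovery_s: float = 30.0,
    pause: Callable[[float], None] = time.sleep,
) -> float | None:
    """Return the shortest sleep whose tier was fully clean, else None."""
    safe = None
    for tier, sleep_s in enumerate(sleeps):
        bad = 0
        for i in range(reps):
            if fetch() != 200:
                bad += 1
            if i + 1 < reps:
                pause(sleep_s)
        if bad:
            break  # the previous clean tier stays the safe water mark
        safe = sleep_s
        if tier + 1 < len(sleeps):
            pause(recovery_s)  # let the server settle before the next tier
    return safe


# Fake endpoint for illustration: calls 1-6 succeed, later calls return 429.
calls = {"n": 0}


def fake_fetch() -> int:
    calls["n"] += 1
    return 200 if calls["n"] <= 6 else 429


# Sleeping is stubbed out so the demo finishes instantly.
safe_sleep = run_ladder(fake_fetch, [2.0, 1.0, 0.5], reps=3, pause=lambda s: None)
print(safe_sleep)  # 1.0
```

The third tier fails in this fake run, so the second tier (1.0s) is reported as the safe water mark.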
 ## oshwhub.com listing API (`/api/project`)
 No auth. 6 tiers × 10 reps = 60 reqs total.
 | sleep | status | bad | latency p90 |
 |---|---|---:|---:|
 | 2.0s | all 200 | 0 | 1187ms |
 | 1.0s | all 200 | 0 | 1237ms |
 | 0.5s | all 200 | 0 | 567ms |
 | 0.25s | all 200 | 0 | 1180ms |
 | 0.1s | all 200 | 0 | 2194ms |
 | 0.0s | all 200 | 0 | 5362ms ← server soft-limits via latency |
 **Verdict**: 0.5s safe water mark. Going faster doesn't fail but server adds
 queueing latency (no return on the speed-up).
 ## oshwhub.com detail HTML (`/<owner>/<path>`)
 No auth. 6 tiers × 10 distinct paths from batch-50 candidates.
 | sleep | status | bad | latency p90 |
 |---|---|---:|---:|
 | 2.0s | all 200 | 0 | 4767ms |
 | 1.0s | all 200 | 0 | 6350ms |
 | 0.5s | all 200 | 0 | **15364ms** ← queue building |
 | 0.25s | all 200 | 0 | 3755ms |
 | 0.1s | all 200 | 0 | 8179ms |
 | 0.0s | all 200 | 0 | 3856ms |
 **Verdict**: 1.0s safe water mark. Detail HTML is 0.5 MB of server-side
 rendered content, so the server slows down earlier than it does for the
 listing API. Going to 0.5s already triggers server queueing (one outlier
 15s response), which risks timeout cascades on real bulk runs.
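
With only 10 samples per tier, how p90 is indexed matters: `sorted(x)[int(n * 0.9)]` returns the largest sample when n = 10, so a single slow response can set the whole tier's figure. The sketch below compares that indexing with a nearest-rank p90. The function names and the latency values are made up for illustration.

```python
def p90_floor_index(samples: list[float]) -> float:
    """p90 via sorted(x)[int(n * 0.9)]; picks the max when n <= 10."""
    s = sorted(samples)
    return s[int(len(s) * 0.9)]


def p90_nearest_rank(samples: list[float]) -> float:
    """Nearest-rank p90: the value at rank ceil(0.9 * n), 1-based."""
    s = sorted(samples)
    rank = (9 * len(s) + 9) // 10  # ceil(0.9 * n) in integer math
    return s[rank - 1]


# Hypothetical tier: nine ordinary responses plus one 15s outlier (ms).
latencies_ms = [900, 950, 1000, 1100, 1200, 1300, 1400, 1500, 1600, 15000]

print(p90_floor_index(latencies_ms))   # 15000
print(p90_nearest_rank(latencies_ms))  # 1600
```

Reading the tables with this in mind, a jump in a single tier is better treated as "at least one slow response" than as a shift of the whole distribution.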
 ## pro.lceda.cn API (`/api/v4/projects/<P>`)
 **Auth required** (logged-in cookie). Conservative ladder, reps capped at 8
 to limit fingerprint exposure. 5 tiers × 8 reqs.
 | sleep | status | bad | latency p90 |
 |---|---|---:|---:|
 | 5.0s | all 200 | 0 | 7299ms |
 | 2.0s | all 200 | 0 | 5518ms |
 | 1.0s | all 200 | 0 | 1409ms |
 | 0.5s | all 200 | 0 | 2995ms |
 | 0.25s | all 200 | 0 | 1552ms |
 Then **sustained burst test** at the chosen water mark:
 **25 distinct Pro UUIDs at 0.5s sleep, no recovery**.
 - 25/25 success (all status 200, all `success: true`)
 - median latency 410ms, p90 932ms, max 1853ms (first call only — TLS handshake)
 - effective QPS 1.0
 - wall time 24.9s (vs ~140s at the old 5s/req — 5.6× speedup)
 **Verdict**: 0.5s safe water mark. Empirically the Pro API tolerates a 0.5s
 sleep cleanly, even sustained (nominal ceiling QPS=2; measured effective
 QPS 1.0 once request latency is included). It was originally set high (5s)
 out of caution because Pro requires a logged-in account; that caution was
 unjustified.
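
The burst numbers can be cross-checked with a little arithmetic. This sketch assumes the 24.9s wall time consists only of the 24 inter-request sleeps plus the 25 request round trips; the variable names are illustrative.

```python
n_requests = 25
sleep_s = 0.5
wall_s = 24.9

nominal_qps = 1 / sleep_s            # ceiling if requests took zero time
effective_qps = n_requests / wall_s  # what the server actually saw

# Sleep happens between requests, so there are n - 1 sleeps in the run.
sleep_total_s = (n_requests - 1) * sleep_s
implied_mean_latency_s = (wall_s - sleep_total_s) / n_requests

print(f"nominal QPS   {nominal_qps:.1f}")
print(f"effective QPS {effective_qps:.2f}")
print(f"implied mean latency {implied_mean_latency_s * 1000:.0f}ms")
```

Under that assumption the implied mean latency falls between the reported median (410ms) and p90 (932ms), which is what a right-skewed latency distribution would give.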
 ## lceda.cn Std source endpoints — NOT YET PROBED
 Currently `SLEEP_SOURCE = 5.0`. Should be probed before lowering. Std
 crawler isn't on the critical path for batch-50 (~12 min vs Pro's
 ~10 min savings), so this can wait.
 ## modules.lceda.cn CDN — already at 0.2s
 CDN host serving AES-encrypted EPRO2 history blobs. Pre-existing
 `SLEEP_PRO_CDN = 0.2`, validated against editor HAR which fires blobs
 back-to-back without throttling. No further probing needed.
 ## Settings applied
 ```python
 SLEEP_BETWEEN = 1.0   # was 2.0  (oshwhub detail/listing)
 SLEEP_SOURCE  = 5.0   # unchanged (Std source — not yet probed)
 SLEEP_PRO     = 0.5   # was 5.0  (Pro API host, 10× speedup)
 SLEEP_PRO_CDN = 0.2   # unchanged (CDN, already optimized)
 ```
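
One way to keep these values in a single place is a host-to-sleep lookup. This is only a sketch: `HOST_SLEEP`, `DEFAULT_SLEEP`, and `sleep_for` are hypothetical names, the crawler itself uses the four constants above, and treating `lceda.cn` as the exact Std hostname is an assumption.

```python
from urllib.parse import urlsplit

# Values mirror the settings above; the dict itself is hypothetical.
HOST_SLEEP = {
    "oshwhub.com": 1.0,       # SLEEP_BETWEEN
    "lceda.cn": 5.0,          # SLEEP_SOURCE (assumed hostname, not yet probed)
    "pro.lceda.cn": 0.5,      # SLEEP_PRO
    "modules.lceda.cn": 0.2,  # SLEEP_PRO_CDN
}
DEFAULT_SLEEP = 5.0  # unknown hosts get the most conservative value


def sleep_for(url: str) -> float:
    """Return the inter-request sleep for the host in `url`."""
    host = urlsplit(url).hostname or ""
    return HOST_SLEEP.get(host, DEFAULT_SLEEP)


print(sleep_for("https://oshwhub.com/api/project"))   # 1.0
print(sleep_for("https://pro.lceda.cn/api/v4/projects/abc"))  # 0.5
print(sleep_for("https://example.org/"))              # 5.0
```

Falling back to the slowest setting means a newly added host stays conservative until it has had its own ladder run.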
 ## Net impact on batch-50 plan
 - Pro, 25 projects × ~5 API calls each: 5×5 = 25s/proj × 25 = ~10min  →  0.5×5 = 2.5s/proj × 25 = ~1min
 - Detail page scan, 50 projects: 50 × 2s = 100s  →  50 × 1s = 50s
 - Combined batch-50 walltime estimate: **~1.5h → ~30 min**
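
The two itemized lines above can be reproduced as follows. The sketch counts sleep time only (no request latency and no other pipeline stages), so it covers just these two items and not the full combined estimate; `sleep_seconds` is a hypothetical helper.

```python
def sleep_seconds(items: int, calls_per_item: int, sleep_s: float) -> float:
    """Total time spent sleeping for a stage, ignoring request latency."""
    return items * calls_per_item * sleep_s


pro_old = sleep_seconds(25, 5, 5.0)     # 25 Pro projects, ~5 API calls each
pro_new = sleep_seconds(25, 5, 0.5)
detail_old = sleep_seconds(50, 1, 2.0)  # 50 detail pages, one fetch each
detail_new = sleep_seconds(50, 1, 1.0)

saved = (pro_old + detail_old) - (pro_new + detail_new)

print(f"Pro:    {pro_old:.0f}s -> {pro_new:.1f}s")
print(f"Detail: {detail_old:.0f}s -> {detail_new:.0f}s")
print(f"Sleep time saved: {saved:.1f}s (~{saved / 60:.0f} min)")
```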
--- a/scripts/probe_rate_limit.py
+++ b/scripts/probe_rate_limit.py
@@ -0,0 +1,248 @@
 """Rate-limit ladder probe — find each host's actual ceiling.
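 Send requests to the target endpoint at progressively shorter intervals,
 monitoring status code / body size / exceptions.
 If any tier sees a 429 / 403 / 5xx / abnormal connection close, stop at that
 tier and treat the previous tier as the safe water mark.
 Design principles
 - No conclusions from a single sample: each tier needs at least 8-10 requests before judging
 - Insert a 30s recovery period between tiers so rate limiting triggered by one tier does not contaminate the next
 - Read-only endpoints (GET); nothing is modified
 - For the Pro API, use UUIDs from the candidate list that we would fetch anyway, so no fingerprint exposure is wasted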
 Usage:
    uv run python scripts/probe_rate_limit.py --host oshwhub
    uv run python scripts/probe_rate_limit.py --host detail
    uv run python scripts/probe_rate_limit.py --host pro     # cookie required
 """
 from __future__ import annotations
 import argparse
 import json
 import statistics
 import sys
 import time
 from pathlib import Path
 import httpx
 sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
 from crawlers.oshwhub.crawler import (  # noqa: E402
    BROWSER_UA,
    PRO_API,
    PRO_COOKIE_PATH_DEFAULT,
    PRO_EDITOR_VERSION,
    UA,
    make_client,
    make_pro_source_client,
 )
 def ladder_oshwhub_listing(reps: int = 10) -> None:
    """oshwhub.com/api/project — listing API, no auth."""
    client = make_client()
    sched = [2.0, 1.0, 0.5, 0.25, 0.1, 0.0]
    for sleep in sched:
        if not _run_one_tier(
            client,
            "GET",
            "https://oshwhub.com/api/project",
            params={"page": 1, "pageSize": 30, "origin": "pro"},
            sleep=sleep,
            reps=reps,
            tier_name=f"listing@{sleep}s",
        ):
            break
 def ladder_oshwhub_detail(reps: int = 10) -> None:
    """oshwhub.com/<owner>/<path> — detail HTML pages.
    Use the 50 candidate paths so the test exercises real targets.
    """
    candidates = [
        json.loads(ln)
        for ln in open("data/state/oshwhub_batch50_candidates.jsonl")
    ]
    client = make_client()
    # Start polite, ramp aggressive
    sched = [2.0, 1.0, 0.5, 0.25, 0.1, 0.0]
    for sleep in sched:
        # Take the first `reps` candidate paths: distinct within a tier, but the
        # same set is reused for every tier (no rotation between tiers).
        paths = [c["path"] for c in candidates[:reps]]
        if not _run_paths_tier(
            client,
            paths,
            sleep=sleep,
            tier_name=f"detail@{sleep}s",
        ):
            break
 def ladder_pro_api(reps: int = 8) -> None:
    """pro.lceda.cn/api/v4/projects/<P> — auth required.
    Probes the project-meta endpoint with logged-in cookie. We cap reps
    lower since this is the most precious host (account ban risk).
    Conservative ladder; bail aggressively on any non-200.
    """
    candidates = [
        json.loads(ln)
        for ln in open("data/state/oshwhub_batch50_candidates.jsonl")
    ]
    pro_uuids = [c["uuid"] for c in candidates if c.get("origin") == "pro"]
    if len(pro_uuids) < reps:
        print(f"only {len(pro_uuids)} Pro UUIDs available (need {reps})", file=sys.stderr)
        return
    client = make_pro_source_client()
    # Conservative ladder for Pro: start 5s, halve down, stop early on trouble
    sched = [5.0, 2.0, 1.0, 0.5, 0.25]
    for sleep in sched:
        if not _run_pro_tier(client, pro_uuids[:reps], sleep=sleep, tier=f"pro@{sleep}s"):
            print(f"\n  STOP at {sleep}s — previous tier is safe water-mark.")
            break
 def _run_one_tier(
    client: httpx.Client,
    method: str,
    url: str,
    *,
    sleep: float,
    reps: int,
    tier_name: str,
    params: dict | None = None,
 ) -> bool:
    print(f"\n=== {tier_name}  ({reps} reqs at {sleep}s interval) ===")
    statuses, sizes, latencies = [], [], []
    bad = 0
    for i in range(reps):
        t0 = time.perf_counter()
        try:
            r = client.request(method, url, params=params)
            sz = len(r.content)
            statuses.append(r.status_code)
            sizes.append(sz)
            latencies.append(time.perf_counter() - t0)
            ok = (r.status_code == 200) and sz > 0
            if not ok:
                bad += 1
                print(f"  [{i+1}] !! status={r.status_code} sz={sz}", flush=True)
        except Exception as e:  # noqa: BLE001
            bad += 1
            statuses.append(-1)
            print(f"  [{i+1}] EXC {type(e).__name__}: {e}", flush=True)
        if i + 1 < reps:
            time.sleep(sleep)
    _summary(statuses, sizes, latencies, bad)
    if bad:
        print(f"  -> tier FAILED ({bad}/{reps} bad). Stopping ladder.")
        return False
    if sleep > 0:
        print(f"  recovery sleep 30s before next tier...")
        time.sleep(30)
    return True
 def _run_paths_tier(
    client: httpx.Client, paths: list[str], *, sleep: float, tier_name: str
 ) -> bool:
    print(f"\n=== {tier_name}  ({len(paths)} pages at {sleep}s interval) ===")
    statuses, sizes, latencies = [], [], []
    bad = 0
    for i, p in enumerate(paths):
        url = f"https://oshwhub.com/{p}"
        t0 = time.perf_counter()
        try:
            r = client.get(url)
            sz = len(r.content)
            statuses.append(r.status_code)
            sizes.append(sz)
            latencies.append(time.perf_counter() - t0)
            ok = (r.status_code == 200) and sz > 5000  # detail pages should be sizable
            if not ok:
                bad += 1
                print(f"  [{i+1}] !! status={r.status_code} sz={sz} url={url[:80]}",
                      flush=True)
        except Exception as e:  # noqa: BLE001
            bad += 1
            statuses.append(-1)
            print(f"  [{i+1}] EXC {type(e).__name__}: {e}", flush=True)
        if i + 1 < len(paths):
            time.sleep(sleep)
    _summary(statuses, sizes, latencies, bad)
    if bad:
        print(f"  -> tier FAILED. Stopping ladder.")
        return False
    if sleep > 0:
        print("  recovery sleep 30s before next tier...")
        time.sleep(30)
    return True
 def _run_pro_tier(client: httpx.Client, uuids: list[str], *, sleep: float, tier: str) -> bool:
    print(f"\n=== {tier}  ({len(uuids)} project meta calls at {sleep}s) ===")
    statuses, sizes, latencies = [], [], []
    bad = 0
    for i, u in enumerate(uuids):
        url = f"{PRO_API}/projects/{u}"
        t0 = time.perf_counter()
        try:
            r = client.get(url, headers={"path": u})
            sz = len(r.content)
            statuses.append(r.status_code)
            sizes.append(sz)
            latencies.append(time.perf_counter() - t0)
            try:
                j = r.json()
                ok = r.status_code == 200 and j.get("success", False)
            except Exception:
                ok = False
            if not ok:
                bad += 1
                print(f"  [{i+1}] !! status={r.status_code} sz={sz} body[:200]={r.text[:200]!r}",
                      flush=True)
        except Exception as e:  # noqa: BLE001
            bad += 1
            statuses.append(-1)
            print(f"  [{i+1}] EXC {type(e).__name__}: {e}", flush=True)
        if i + 1 < len(uuids):
            time.sleep(sleep)
    _summary(statuses, sizes, latencies, bad)
    if bad:
        return False
    if sleep > 0:
        print("  recovery sleep 30s before next tier...")
        time.sleep(30)
    return True
 def _summary(statuses, sizes, latencies, bad) -> None:
    if not statuses:
        return
    by_code: dict[int, int] = {}
    for s in statuses:
        by_code[s] = by_code.get(s, 0) + 1
    if latencies:
        med = statistics.median(latencies)
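        # Nearest-rank p90. Indexing with int(n * 0.9) returns the max for n <= 10.
        rank = (9 * len(latencies) + 9) // 10  # ceil(0.9 * n) in integer math
        p90 = sorted(latencies)[rank - 1]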
        print(f"  status: {by_code}  bad={bad}  latency med={med * 1000:.0f}ms p90={p90 * 1000:.0f}ms")
    else:
        print(f"  status: {by_code}  bad={bad}")
    if sizes:
        print(f"  size:   median={statistics.median(sizes)} min={min(sizes)} max={max(sizes)}")
 def main() -> int:
    ap = argparse.ArgumentParser()
    ap.add_argument("--host", choices=["oshwhub", "detail", "pro"], required=True)
    ap.add_argument("--reps", type=int, default=10)
    args = ap.parse_args()
    if args.host == "oshwhub":
        ladder_oshwhub_listing(reps=args.reps)
    elif args.host == "detail":
        ladder_oshwhub_detail(reps=args.reps)
    elif args.host == "pro":
        ladder_pro_api(reps=min(args.reps, 8))  # cap pro reps
    return 0
 if __name__ == "__main__":
    raise SystemExit(main())