docs: consolidate rate-limit probe results into a proper benchmark report

The doc had been growing incrementally as each host got probed; reshape it as a polished benchmark with a TL;DR on top, a methodology section (including safety constraints + caveats), per-host detailed tables, final crawler settings, a batch-50 walltime breakdown, and a reproduce recipe.

Five hosts fully covered:

  pro.lceda.cn API     5.0s -> 0.5s  (10×)
  lceda.cn doc         5.0s -> 0.5s  (10×)
  oshwhub detail       2.0s -> 1.0s  ( 2×)
  oshwhub listing      2.0s -> 1.0s  ( 2×)
  modules.lceda CDN    0.2s          (already optimized)

Net effect on batch-50 plan: sleep total ~32min -> ~3min, walltime ~2h -> ~10-15min.

Key finding: the original 5s/req on Pro was set out of "logged-in account is precious" caution with zero empirical evidence. Sustained burst probe (25 distinct UUIDs at 0.5s, no recovery) showed 0/25 errors and median latency 410ms; the caution was unjustified.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 00:57:35 +08:00
parent c474f8ad83
commit ed713fa557
2 changed files with 193 additions and 64 deletions
--- a/docs/sources/probe_rate_limit_results.md
+++ b/docs/sources/probe_rate_limit_results.md
@@ -1,48 +1,66 @@
-# Rate-limit probe results
+# oshwhub / lceda rate-limit benchmark
-**Probe date**: 2026-04-29
+**Date**: 2026-04-29
-**Script**: `scripts/probe_rate_limit.py`
+**Run with**: `scripts/probe_rate_limit.py` (ladder probe)
-**Method**: Ladder test — N requests at decreasing inter-request sleep,
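+**Goal**: measure each host's actual rate-limit threshold, replace the crawler's "5s sleep picked by gut feeling" with values backed by measurement, and open up headroom for speeding up later scale-ups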
 30s recovery between tiers, watch for status != 200, body shrinkage,
 or latency degradation.
-## oshwhub.com listing API (`/api/project`)
+---
-No auth. 6 tiers × 10 reps = 60 reqs total.
+## TL;DR
-| sleep | status | bad | latency p90 |
+Each of the 5 hosts / endpoints was ladder-probed. **Overall speedup: 5-10×**; the old settings were overly conservative across the board.
 |---|---|---:|---:|
 | 2.0s | all 200 | 0 | 1187ms |
 | 1.0s | all 200 | 0 | 1237ms |
 | 0.5s | all 200 | 0 | 567ms |
 | 0.25s | all 200 | 0 | 1180ms |
 | 0.1s | all 200 | 0 | 2194ms |
 | 0.0s | all 200 | 0 | 5362ms ← server soft-limits via latency |
-**Verdict**: 0.5s safe water mark. Going faster doesn't fail but server adds
+| Host / Endpoint | Old sleep | New sleep | Speedup | Probe finding |
-queueing latency (no return on the speed-up).
+|---|---:|---:|---:|---|
 | `pro.lceda.cn/api/v4/...` (Pro authenticated API) | 5.0s | **0.5s** | **10×** | 25 distinct UUIDs back-to-back, 0/25 bad |
 | `lceda.cn/api/documents/...` (Std doc) | 5.0s | **0.5s** | **10×** | 5-tier ladder, 45/45 returned 200 |
 | `oshwhub.com/<owner>/<path>` (detail HTML) | 2.0s | **1.0s** | **2×** | at 0.5s the server queue builds up, p90=15s |
 | `oshwhub.com/api/project` (listing) | 2.0s | **1.0s** | **2×** | soft-limits via latency; going faster gains nothing |
 | `modules.lceda.cn/...` (CDN, AES blobs) | 0.2s | 0.2s | n/a | already optimized (commit `1e06ba6`); HAR confirms the CDN takes back-to-back requests |
-## oshwhub.com detail HTML (`/<owner>/<path>`)
+**Net effect on the batch-50 plan**: total sleep time ~32 min → ~3 min (roughly 10× overall).
-No auth. 6 tiers × 10 distinct paths from batch-50 candidates.
+---
-| sleep | status | bad | latency p90 |
+## Methodology
 |---|---|---:|---:|
 | 2.0s | all 200 | 0 | 4767ms |
 | 1.0s | all 200 | 0 | 6350ms |
 | 0.5s | all 200 | 0 | **15364ms** ← queue building |
 | 0.25s | all 200 | 0 | 3755ms |
 | 0.1s | all 200 | 0 | 8179ms |
 | 0.0s | all 200 | 0 | 3856ms |
-**Verdict**: 1.0s safe water mark. Detail HTML is 0.5 MB SSR, server
+### Test setup
 slowdown earlier than listing API. Going to 0.5s already triggers server
 queue (one outlier 15s response), risk of timeout cascades on real bulk runs.
-## pro.lceda.cn API (`/api/v4/projects/<P>`)
+```
 HTTP client (httpx)
  └─ N reqs to target endpoint at constant interval `sleep`
       └─ record: status code / response body size / wall latency
            └─ FAIL criteria:
                 • status != 200
                 • response body shrinks below threshold (soft-block)
                 • exception (connection close / timeout)
 ```
-**Auth required** (logged-in cookie). Conservative ladder, reps capped at 8
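+Each host is **run through a ladder** `[5.0, 2.0, 1.0, 0.5, 0.25, 0.1, 0.0]` (the exact tiers are adjusted per host), with a **30s recovery period** between tiers so that throttling triggered by one tier does not contaminate the next.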
-to limit fingerprint exposure. 5 tiers × 8 reqs.
+
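The ladder procedure and its fail criteria can be sketched roughly as follows. This is a minimal illustration, not the contents of `scripts/probe_rate_limit.py`: `run_ladder`, `TierResult`, the `fetch` callable (assumed to return a status code and body size), and the `min_body` threshold are all hypothetical names introduced here.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class TierResult:
    sleep_s: float
    latencies_ms: List[float] = field(default_factory=list)
    bad: int = 0


def run_ladder(
    fetch: Callable[[], Tuple[int, int]],  # hypothetical: returns (status, body_bytes)
    tiers: Sequence[float],
    reps: int,
    min_body: int = 1,           # assumed soft-block threshold, in bytes
    recovery_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[List[TierResult], Optional[float]]:
    """Walk the tiers in order; stop at the first bad response.

    Returns the per-tier results and the last tier that finished cleanly
    (the "safe water mark"), or None if the first tier already failed.
    """
    results: List[TierResult] = []
    safe: Optional[float] = None
    for i, tier in enumerate(tiers):
        res = TierResult(tier)
        for _ in range(reps):
            t0 = time.perf_counter()
            try:
                status, size = fetch()
                ok = status == 200 and size >= min_body
            except Exception:
                ok = False  # connection close / timeout counts as a failure
            res.latencies_ms.append((time.perf_counter() - t0) * 1000)
            if not ok:
                res.bad += 1
                break       # bail on first bad response
            sleep(tier)
        results.append(res)
        if res.bad:
            return results, safe
        safe = tier
        if i < len(tiers) - 1:
            sleep(recovery_s)  # let any throttle state decay before the next tier
    return results, safe
```

Passing `fetch` and `sleep` in as callables keeps the sketch independent of any particular HTTP client and lets it be exercised without touching the network.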
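 ### Safety constraints
 - **Read-only endpoints**: plain GETs, nothing is modified
 - **Authenticated host capped at reps=8**: pro.lceda.cn uses a logged-in cookie, so exposure is kept low to avoid fingerprinting
 - **Bail on first non-200**: any failure in a tier stops the run immediately, and the previous tier is taken as the safe water mark
 - **Realistic load**: the Pro probe uses real UUIDs from the batch-50 candidate list (these will be crawled anyway); the Std probe uses real doc UUIDs from already-crawled projects
 ### Limitations
 1. **Single IP, single account**: the conclusions describe rate limiting as seen from this IP/account and do not generalize to distributed crawling
 2. **Latency-based throttling is invisible**: the server may return no client-visible errors while delaying internally, and long-term accumulation could enter an account-ban window; the probe does not run for hours, so it cannot catch this kind of slow throttling
 3. **Payload-size interference**: Std doc endpoint latency depends heavily on body size (4 KB to 4.5 MB), so large latency swings do not necessarily mean the server is throttling
 4. **30s recovery may be too short**: if the server uses a sliding-window limiter (5-10 minute window), 30s between tiers is not enough; that said, no tier produced an error in practice, so this concern can be set aside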
 ---
 ## Detailed data
 ### 1. `pro.lceda.cn/api/v4/projects/<P>`: Pro authenticated API
 **Auth**: login cookie (stored in `~/.secrets/pro-lceda-cookie-header.txt`)
 **reps**: 8 per tier (conservative, to limit fingerprint exposure)
 **Ladder**: `[5.0, 2.0, 1.0, 0.5, 0.25]s`
 | sleep | status | bad | latency p90 |
 |---|---|---:|---:|
@@ -52,54 +70,139 @@ to limit fingerprint exposure. 5 tiers × 8 reqs.
 | 0.5s | all 200 | 0 | 2995ms |
 | 0.25s | all 200 | 0 | 1552ms |
-Then **sustained burst test** at the chosen water mark:
+#### Follow-up test: sustained burst @ 0.5s
 **25 distinct Pro UUIDs at 0.5s sleep, no recovery**.
- 25/25 success (all status 200, all `success: true`)
+After the 5-tier ladder passed, one more **realistic load simulation** was run: 25 distinct Pro UUIDs back-to-back with **no recovery period**.
 - median latency 410ms, p90 932ms, max 1853ms (first call only — TLS handshake)
 - effective QPS 1.0
 - wall time 24.9s (vs ~140s at the old 5s/req — 5.6× speedup)
-**Verdict**: 0.5s safe water mark. Empirically Pro API tolerates QPS=2
+```
-cleanly, even sustained. Originally set high (5s) out of caution because
+25/25  status 200, success: true
-Pro requires a logged-in account — that caution was unjustified.
+latency: median 410ms, p90 932ms, max 1853ms (first call, TLS handshake)
 wall: 24.9s for 25 reqs (effective QPS 1.0)
 ```
-## lceda.cn Std doc endpoint (`/api/documents/<uuid>`)
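+**Verdict**: the Pro API holds up under sustained back-to-back requests at a 0.5s sleep (a nominal ceiling of QPS=2; the measured effective rate was about 1.0 once response latency is included). The old 5s was **overly conservative**: it was set out of the worry that "Pro needs a login, and a banned account hurts the most", with no measurements behind it.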
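The burst figures above are simple summary statistics, and the sketch below shows one way to derive them. It is illustrative only: `summarize` is a hypothetical helper, and the nearest-rank percentile is an assumption, since the report does not say which percentile method the probe script uses.

```python
import statistics
from typing import Dict, Sequence


def summarize(latencies_ms: Sequence[float], wall_s: float) -> Dict[str, float]:
    """Median, nearest-rank p90, max, and effective QPS for one run."""
    ordered = sorted(latencies_ms)
    n = len(ordered)
    # Nearest-rank p90: ceil(0.9 * n), done in integer arithmetic.
    rank = max(1, (9 * n + 9) // 10)
    return {
        "median_ms": statistics.median(ordered),
        "p90_ms": ordered[rank - 1],
        "max_ms": ordered[-1],
        # Effective QPS counts sleep *and* response latency.
        "effective_qps": n / wall_s,
    }


# Plugging in the reported totals for the 25-request burst:
reqs, wall_s, old_wall_s = 25, 24.9, 140.0
effective_qps = reqs / wall_s   # about 1.0, below the nominal 1 / 0.5s = 2
speedup = old_wall_s / wall_s   # about 5.6 versus the old 5s/req run
```

The gap between the nominal 2 requests per second and the effective rate of about 1.0 comes from each request also paying its own response latency before the next sleep starts.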
-No auth (Std is anonymous-readable, browser UA + Referer only).
+### 2. `lceda.cn/api/documents/<uuid>` — Std doc endpoint
-5 tiers × 9 distinct doc UUIDs from already-crawled Std projects.
+
 **Auth**: none (Std is anonymously readable; needs a browser UA + Referer, see `docs/sources/easyeda_std_source.md §3`)
 **reps**: 9 per tier (distinct doc UUIDs taken from the manifests of already-crawled Std projects)
 **Ladder**: `[5.0, 2.0, 1.0, 0.5, 0.25]s`
 | sleep | status | bad | latency med | latency p90 | body median |
 |---|---|---:|---:|---:|---:|
 | 5.0s | all 200 | 0 | 1124ms | 3846ms | 31 KB |
 | 2.0s | all 200 | 0 | 2634ms | 7626ms | 495 KB |
-| 1.0s | all 200 | 0 | 1781ms | **19834ms** (one 4.5 MB doc) | 918 KB |
+| 1.0s | all 200 | 0 | 1781ms | **19834ms** (one 4.5 MB doc) | 918 KB |
 | 0.5s | all 200 | 0 | 666ms | 891ms | 748 KB |
 | 0.25s | all 200 | 0 | 416ms | 1384ms | 251 KB |
-**Verdict**: 0.5s safe water mark. Latency variance is dominated by
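+**Verdict**: all 5 tiers returned 200. Latency depends heavily on body size, not on server backpressure: the p90 of 19s at the 1.0s tier was stretched by a single 4.5 MB doc, not by a throttle. **0.5s is safe**, the same water mark as the Pro API.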
 **payload size** (Std docs span 4 KB to 4.5 MB) — not server backpressure.
 The 19s p90 at the 1.0s tier was one giant doc, not a throttle. Same
 posture as Pro API.
-## modules.lceda.cn CDN — already at 0.2s
+### 3. `oshwhub.com/<owner>/<path>`: detail HTML page
-CDN host serving AES-encrypted EPRO2 history blobs. Pre-existing
+**Auth**: none
-`SLEEP_PRO_CDN = 0.2`, validated against editor HAR which fires blobs
+**reps**: 10 per tier (distinct paths from the batch-50 candidate list)
-back-to-back without throttling. No further probing needed.
+**Ladder**: `[2.0, 1.0, 0.5, 0.25, 0.1, 0.0]s`
-## Settings applied
+| sleep | status | bad | latency p90 |
 |---|---|---:|---:|
 | 2.0s | all 200 | 0 | 4767ms |
 | 1.0s | all 200 | 0 | 6350ms |
 | 0.5s | all 200 | 0 | **15364ms** ← server queue building up |
 | 0.25s | all 200 | 0 | 3755ms |
 | 0.1s | all 200 | 0 | 8179ms |
 | 0.0s | all 200 | 0 | 3856ms |
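 **Verdict**: no tier produced an error, but **p90 jumps to 15s at 0.5s**. A single stall that long would turn into timeout cascades in a real bulk run. The detail page is SSR HTML (median body 0.5 MB), so the server starts queueing earlier than the listing API does. **1.0s is the water mark that leaves headroom**, twice as fast as the previous 2s.
 ### 4. `oshwhub.com/api/project`: listing API
 **Auth**: none
 **reps**: 10 per tier
 **Ladder**: `[2.0, 1.0, 0.5, 0.25, 0.1, 0.0]s`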
 | sleep | status | bad | latency p90 |
 |---|---|---:|---:|
 | 2.0s | all 200 | 0 | 1187ms |
 | 1.0s | all 200 | 0 | 1237ms |
 | 0.5s | all 200 | 0 | 567ms |
 | 0.25s | all 200 | 0 | 1180ms |
 | 0.1s | all 200 | 0 | 2194ms |
 | 0.0s | all 200 | 0 | **5362ms** ← soft-limiting via latency |
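 **Verdict**: the listing API never errors but soft-limits via latency: at 0s sleep p90 climbs to about 5.4s, meaning the server is queueing requests and slowing them down. **0.5s is the point of diminishing returns** (going faster gains nothing). In the code it shares `SLEEP_BETWEEN=1.0s` with the detail page to leave headroom.
 ### 5. `modules.lceda.cn/...`: CDN
 **Auth**: none (the path header is supplied by the caller, but the CDN does not validate it)
 **Not probed separately**: commit `1e06ba6` already lowered this from 5.0s to 0.2s, based on a HAR capture: when the editor loads a project it hits this CDN back-to-back (no gaps). A CDN is designed to absorb load in the first place.
 **Verdict**: keep 0.2s (already an aggressive tier). Pushing lower yields little additional gain and would waste TCP connections.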
 ---
 ## Final settings applied to the crawler
 ```python
 # crawlers/oshwhub/crawler.py
 SLEEP_BETWEEN = 1.0   # was 2.0  (oshwhub detail/listing)
-SLEEP_SOURCE  = 0.5   # was 5.0  (Std doc endpoint, 10× speedup)
+SLEEP_SOURCE  = 0.5   # was 5.0  (lceda.cn Std doc, 10× speedup)
-SLEEP_PRO     = 0.5   # was 5.0  (Pro API host, 10× speedup)
+SLEEP_PRO     = 0.5   # was 5.0  (pro.lceda.cn API, 10× speedup)
-SLEEP_PRO_CDN = 0.2   # unchanged (CDN, already optimized)
+SLEEP_PRO_CDN = 0.2   # unchanged (modules.lceda.cn CDN, already optimized)
 ```
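How the crawler maps a request to one of these constants is not shown in this diff. As a hedged sketch, a lookup keyed on hostname could look like the following; `sleep_for` and `_SLEEP_BY_HOST` are hypothetical names, and the fallback to the slowest setting for unknown hosts is an assumption, not something the report specifies.

```python
from urllib.parse import urlsplit

# Values copied from the settings above.
SLEEP_BETWEEN = 1.0   # oshwhub detail / listing
SLEEP_SOURCE = 0.5    # lceda.cn Std doc
SLEEP_PRO = 0.5       # pro.lceda.cn API
SLEEP_PRO_CDN = 0.2   # modules.lceda.cn CDN

# Hypothetical host -> sleep table; the real crawler may dispatch differently.
_SLEEP_BY_HOST = {
    "oshwhub.com": SLEEP_BETWEEN,
    "lceda.cn": SLEEP_SOURCE,
    "pro.lceda.cn": SLEEP_PRO,
    "modules.lceda.cn": SLEEP_PRO_CDN,
}


def sleep_for(url: str) -> float:
    """Return the inter-request sleep for a URL's host.

    Unprobed hosts fall back to the slowest configured value, on the
    assumption that an unmeasured host should not get the fast path.
    """
    host = urlsplit(url).hostname or ""
    return _SLEEP_BY_HOST.get(host, max(_SLEEP_BY_HOST.values()))
```

Keeping the table in one place would make it harder for an unprobed host to silently inherit an aggressive sleep value.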
-## Net impact on batch-50 plan
+---
- Pro 25 projects × ~5 API calls each: 5×5 = 25s/proj × 25 = ~10min  →  0.5×5 = 2.5s/proj × 25 = ~1min
+## Net effect: batch-50 walltime breakdown
- Std 25 projects × ~10 doc calls each: 5×10 = 50s/proj × 25 = ~21min  →  0.5×10 = 5s/proj × 25 = ~2min
+
- Detail page scan, 50 projects: 50 × 2s = 100s  →  50 × 1s = 50s
+By step, following `docs/plans/oshwhub_batch50.md`:
- Combined batch-50 walltime estimate: **~2h → ~10 min** (excluding actual download bytes)
+
 | Stage | Old sleep total | New sleep total | Saved |
 |---|---:|---:|---:|
 | Detail-page license scan (50 projects) | 100s | 50s | -50s |
 | Pro, 25 projects (~5 API calls/proj) | ~10 min | ~1 min | **-9 min** |
 | Std, 25 projects (~10 docs/proj) | ~21 min | ~2 min | **-19 min** |
 | chain replay (CDN, already optimized) | unchanged | unchanged | n/a |
 | **Total sleep time** | **~32 min** | **~3 min** | **~29 min** |
 > Actual wall-clock time also includes downloading the bytes (not the bottleneck).
 > Order of magnitude for the whole batch: **~2h → ~10-15 min**.
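The sleep budget can be recomputed from the per-stage request counts. This is a worked-arithmetic sketch using only the counts and sleep values quoted in the report; `PLAN` and the variable names are illustrative, and the CDN chain-replay stage is left out because its sleep is unchanged.

```python
# (stage, number of requests, old sleep in s, new sleep in s)
PLAN = [
    ("detail-page license scan", 50, 2.0, 1.0),
    ("Pro API (25 projects x ~5 calls)", 25 * 5, 5.0, 0.5),
    ("Std doc (25 projects x ~10 docs)", 25 * 10, 5.0, 0.5),
]

old_total = sum(n * old for _stage, n, old, _new in PLAN)   # seconds
new_total = sum(n * new for _stage, n, _old, new in PLAN)   # seconds

for stage, n, old, new in PLAN:
    print(f"{stage}: {n * old:.0f}s -> {n * new:.1f}s")
print(f"total: {old_total / 60:.1f} min -> {new_total / 60:.1f} min")
```

Unrounded, the totals come to 1975s (about 33 min) and 237.5s (just under 4 min), the same order as the table, which rounds each row before summing.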
 ---
 ## How to reproduce
 ```bash
 # 1. Install dependencies (first run only)
 uv sync
 # 2. Build the Std doc UUID pool (if no Std project has been crawled yet, crawl one first)
 uv run python -c "
 import json, glob
 out = []
 for p in glob.glob('data/raw/oshwhub/*/source/manifest.json'):
     m = json.load(open(p))
     if 'upstream_version_documents' not in m: continue
     out.extend(vd['uuid'] for vd in m['upstream_version_documents'])
     if len(out) >= 50: break
 with open('/tmp/std_doc_uuids.json', 'w') as f: json.dump(out, f)
 "
 # 3. Run the probe for each host (pick the hosts you need)
 PYTHONUNBUFFERED=1 uv run python -u scripts/probe_rate_limit.py --host oshwhub
 PYTHONUNBUFFERED=1 uv run python -u scripts/probe_rate_limit.py --host detail
 PYTHONUNBUFFERED=1 uv run python -u scripts/probe_rate_limit.py --host std-doc
 PYTHONUNBUFFERED=1 uv run python -u scripts/probe_rate_limit.py --host pro     # requires the Pro cookie
 ```
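 The output prints the status / latency distribution tier by tier; any bad response stops the run immediately, and the previous tier is the safe water mark.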
 ---
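 ## Possible follow-ups
 - **Std attachment endpoint (`image.lceda.cn`)** has not been probed separately; it currently uses `SLEEP_BETWEEN=1.0s`. It behaves like a CDN, so it may also tolerate the 0.2s tier.
 - **A more aggressive water mark (0.25s)**: every probe passed at that tier, but staying at 0.5s keeps headroom for batch-500 / batch-2k scale. Tighten it once the crawl actually reaches that scale.
 - **Long-run risk-control stress test**: every experiment so far is a single short burst. Would running at 0.5s for 12 hours straight trigger a different rate-limiting policy? That needs a long-running test.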
--- a/log.md
+++ b/log.md
@@ -4,6 +4,32 @@
 ---
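 ## 2026-04-29 03:30  rate-limit benchmark consolidated into a formal report
 **Claude session**
 Consolidated the scattered rate-limit ladder probe results into `docs/sources/probe_rate_limit_results.md`, upgrading it from ad-hoc incremental notes to a formal benchmark document.
 **All 5 hosts covered**:
 | Host | Old sleep | New sleep | Speedup |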
 |---|---:|---:|---:|
 | `pro.lceda.cn/api/v4/...` | 5.0s | 0.5s | 10× |
 | `lceda.cn/api/documents/...` | 5.0s | 0.5s | 10× |
 | `oshwhub.com/<owner>/<path>` | 2.0s | 1.0s | 2× |
 | `oshwhub.com/api/project` | 2.0s | 1.0s | 2× |
 | `modules.lceda.cn/...` | 0.2s | 0.2s | n/a (already optimized) |
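 **Key finding**: the original 5s/req was set purely out of the worry that "Pro needs a login, and a banned account hurts the most", with no measurements behind it. In the Pro API probe, 25 distinct UUIDs sent back-to-back gave 0/25 bad with a median latency of 410ms, so a 0.5s sleep (nominal QPS=2) holds up fine. Same story for the Std doc endpoint.
 **Net effect on batch-50**: total sleep time 32 min → 3 min (roughly 10×); estimated walltime for the whole batch ~2h → ~10-15 min.
 Report structure: TL;DR summary table → methodology (including safety constraints + limitations) → detailed data for each of the 5 hosts → final settings → reproduce guide → follow-ups.
 Next step: go straight to Step 1 of the batch-50 plan (detail-page license scan).
 ---
 ## 2026-04-29 03:00  Finished exporting 5 Pro projects; found and fixed two --all crash paths
 **Claude session**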