FacereDataset/docs/sources/probe_rate_limit_results.md
Knowit ed713fa557 docs: consolidate rate-limit probe results into a proper benchmark report
The doc had been growing incrementally as each host got probed; reshape
it as a polished benchmark with TL;DR top, methodology section
(including safety constraints + caveats), per-host detailed tables,
final crawler settings, batch-50 walltime breakdown, and a reproduce
recipe.

Five hosts fully covered:
  pro.lceda.cn API   5.0s -> 0.5s  (10×)
  lceda.cn doc       5.0s -> 0.5s  (10×)
  oshwhub detail     2.0s -> 1.0s  ( 2×)
  oshwhub listing    2.0s -> 1.0s  ( 2×)
  modules.lceda CDN  0.2s          (already optimized)

Net effect on batch-50 plan: sleep total ~32min -> ~3min, walltime
~2h -> ~10-15min.

Key finding: the original 5s/req on Pro was set out of "logged-in
account is precious" caution with zero empirical evidence. Sustained
burst probe (25 distinct UUIDs at 0.5s, no recovery) showed 0/25 errors
and median latency 410ms — the caution was unjustified.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 00:57:35 +08:00

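# oshwhub / lceda rate-limit benchmark

**Date**: 2026-04-29
**Run with**: `scripts/probe_rate_limit.py` (ladder probe)
**Goal**: measure each host's real rate-limit threshold, replace the crawler's gut-feel 5s sleeps with measured values, and create headroom for scaling up later crawls.

---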
## TL;DR
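Four of the 5 hosts / endpoints were tested with a ladder probe; the CDN setting relies on earlier HAR evidence. **Overall speedup: 5-10×**. The old settings were overly conservative across the board.

| Host / Endpoint | Old sleep | New sleep | Speedup | Probe finding |
|---|---:|---:|---:|---|
| `pro.lceda.cn/api/v4/...` (authenticated Pro API) | 5.0s | **0.5s** | **10×** | 25 distinct UUIDs back to back, 0/25 bad |
| `lceda.cn/api/documents/...` (Std doc) | 5.0s | **0.5s** | **10×** | 5-tier ladder, 45/45 returned 200 |
| `oshwhub.com/<owner>/<path>` (detail HTML) | 2.0s | **1.0s** | **2×** | at 0.5s the server queue backs up, p90 = 15s |
| `oshwhub.com/api/project` (listing) | 2.0s | **1.0s** | **2×** | soft latency throttling; going faster gains nothing |
| `modules.lceda.cn/...` (CDN, AES blob) | 0.2s | 0.2s | - | already optimized (commit `1e06ba6`); HAR confirms the CDN tolerates back-to-back requests |

**Net effect on the batch-50 plan**: total sleep time ~32 min → ~3 min (about 10× overall).

---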
## Methodology

### Test setup

```
HTTP client (httpx)
└─ N reqs to target endpoint at constant interval `sleep`
   ├─ record: status code / response body size / wall latency
   └─ FAIL criteria:
      • status != 200
      • response body shrinks below threshold (soft-block)
      • exception (connection close / timeout)
```

Each host runs a **ladder** of `[5.0, 2.0, 1.0, 0.5, 0.25, 0.1, 0.0]` seconds (the exact tiers are adjusted per host), with a **30s recovery period** between tiers so that throttling triggered by one tier does not contaminate the next.
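
A single tier of the setup above can be sketched roughly as follows. This is a minimal illustration, not the contents of `scripts/probe_rate_limit.py`: the names `Sample`, `probe_tier`, and `MIN_BODY_BYTES` are hypothetical, the soft-block threshold value is made up, and the HTTP call is passed in as a `fetch` callable (the real probe uses httpx) so the sketch needs only the standard library.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

# Hypothetical soft-block threshold: a 200 response with a body smaller
# than this is treated as a failure.
MIN_BODY_BYTES = 512


@dataclass
class Sample:
    status: int          # HTTP status, or 0 when the request raised
    body_bytes: int
    latency_ms: float

    @property
    def bad(self) -> bool:
        # FAIL criteria: non-200, shrunken body, or exception (status 0).
        return self.status != 200 or self.body_bytes < MIN_BODY_BYTES


def probe_tier(
    fetch: Callable[[str], Tuple[int, bytes]],
    urls: Iterable[str],
    sleep_s: float,
    pause: Callable[[float], None] = time.sleep,
) -> List[Sample]:
    """Request each URL in turn, pausing `sleep_s` between requests."""
    samples: List[Sample] = []
    for i, url in enumerate(urls):
        if i:
            pause(sleep_s)  # constant interval between consecutive requests
        start = time.perf_counter()
        try:
            status, body = fetch(url)
        except Exception:
            # Connection close / timeout: recorded as status 0, counted as bad.
            status, body = 0, b""
        latency_ms = (time.perf_counter() - start) * 1000.0
        samples.append(Sample(status, len(body), latency_ms))
    return samples
```

Passing `pause` as a parameter is only there so the loop can be exercised without real waiting.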
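### Safety constraints

- **Read-only endpoints**: plain GET requests that modify nothing.
- **Authenticated host capped at reps=8**: pro.lceda.cn uses a logged-in cookie, so the request count stays low to avoid triggering fingerprinting.
- **Bail on first non-200**: any failing tier stops the run immediately, and the previous tier is taken as the safe level.
- **Real workload**: the Pro probe uses real UUIDs from the batch-50 candidate list (they have to be crawled anyway); the Std probe uses real doc UUIDs from already-crawled projects.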
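
The bail-out rule and the recovery gap between tiers can be sketched as a small driver. This is an assumption about the control flow, not the script's actual code: `run_ladder` is a hypothetical name, and `probe_tier` here stands for any callable that runs one tier and returns the number of bad responses.

```python
import time
from typing import Callable, Optional, Sequence


def run_ladder(
    probe_tier: Callable[[float], int],
    ladder: Sequence[float],
    recovery_s: float = 30.0,
    pause: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Walk the ladder from slowest to fastest sleep.

    Returns the last sleep value that produced zero bad responses,
    or None if even the first tier failed.
    """
    safe: Optional[float] = None
    for i, sleep_s in enumerate(ladder):
        if i:
            pause(recovery_s)  # recovery period between tiers
        if probe_tier(sleep_s) > 0:
            break              # bail on the first failing tier
        safe = sleep_s         # this tier passed; remember it
    return safe
```

For example, `run_ladder(my_probe, [5.0, 2.0, 1.0, 0.5, 0.25])` would return `0.5` if the 0.25s tier is the first one to produce a bad response.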
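### Limitations

1. **Single IP, single account**: the results describe throttling as seen from this IP and account, and do not generalize to distributed crawling.
2. **Latency-based throttling is invisible**: the server may return no client-visible errors while delaying requests internally, and sustained load could accumulate toward an account-ban window. The probes do not run for hours, so they cannot catch this kind of slow throttling.
3. **Payload size interference**: latency on the Std doc endpoint depends heavily on body size (4 KB to 4.5 MB), so large latency swings do not necessarily mean the server is throttling.
4. **The 30s recovery period may be too short**: if the server uses a sliding-window limiter (5-10 minute window), 30s between tiers is not enough to reset it. In practice no tier produced an error, so this concern can be set aside.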
---
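## Detailed results

### 1. `pro.lceda.cn/api/v4/projects/<P>`: authenticated Pro API

**Auth**: login cookie (`~/.secrets/pro-lceda-cookie-header.txt`)
**reps**: 8 per tier (conservative, to limit fingerprinting)
**Ladder**: `[5.0, 2.0, 1.0, 0.5, 0.25]s`

| sleep | status | bad | latency p90 |
|---|---|---:|---:|
| 5.0s | all 200 | 0 | 7299ms |
| 2.0s | all 200 | 0 | 5518ms |
| 1.0s | all 200 | 0 | 1409ms |
| 0.5s | all 200 | 0 | 2995ms |
| 0.25s | all 200 | 0 | 1552ms |

#### Follow-up: sustained burst @ 0.5s

After the 5-tier ladder passed, one more **realistic load simulation** was run: 25 distinct Pro UUIDs back to back, with **no recovery period**.

```
25/25 status 200, success: true
latency: median 410ms, p90 932ms, max 1853ms (first request, TLS handshake)
wall: 24.9s for 25 reqs (effective QPS 1.0)
```

**Conclusion**: the Pro API withstands sustained requests at a 0.5s sleep (nominally 2 requests/s, about 1 QPS effective once response latency is included). The original 5s was **overly conservative**: it came from the worry that "Pro requires login, and a banned account hurts the most", not from any measurement.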
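
The gap between the 0.5s sleep and the effective rate of about 1 QPS follows from simple arithmetic, sketched below. It assumes the probe is sequential, that is, each request waits for its response and then sleeps before the next one; the function names are hypothetical.

```python
def effective_qps(n_requests: int, wall_s: float) -> float:
    """Requests completed per second of wall-clock time."""
    return n_requests / wall_s


def expected_wall_s(n_requests: int, sleep_s: float, latency_s: float) -> float:
    """Wall time for a sequential loop: every request costs its latency,
    and a sleep is inserted between consecutive requests."""
    return n_requests * latency_s + (n_requests - 1) * sleep_s


# Figures from the burst log above: 25 requests, 24.9s wall, median 410ms.
measured = effective_qps(25, 24.9)
estimate = expected_wall_s(25, sleep_s=0.5, latency_s=0.410)
print(f"measured QPS: {measured:.2f}, estimated wall from median: {estimate:.2f}s")
```

The estimate from the median comes out somewhat below the measured 24.9s, which is expected because slow requests such as the 1853ms first one pull the mean latency above the median.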
### 2. `lceda.cn/api/documents/<uuid>`: Std doc endpoint

**Auth**: Std is anonymously readable but needs a browser UA + Referer (see `docs/sources/easyeda_std_source.md` §3)
**reps**: 9 per tier (distinct doc UUIDs from the manifests of already-crawled Std projects)
**Ladder**: `[5.0, 2.0, 1.0, 0.5, 0.25]s`

| sleep | status | bad | latency median | latency p90 | body median |
|---|---|---:|---:|---:|---:|
| 5.0s | all 200 | 0 | 1124ms | 3846ms | 31 KB |
| 2.0s | all 200 | 0 | 2634ms | 7626ms | 495 KB |
| 1.0s | all 200 | 0 | 1781ms | **19834ms** (one 4.5 MB doc) | 918 KB |
| 0.5s | all 200 | 0 | 666ms | 891ms | 748 KB |
| 0.25s | all 200 | 0 | 416ms | 1384ms | 251 KB |

**Conclusion**: all 5 tiers returned 200. Latency depends heavily on body size rather than on server backpressure: the p90 of about 19.8s at the 1.0s tier was stretched by a single 4.5 MB doc, not by throttling. **0.5s is safe**, the same level as the Pro API.
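
With only 9 samples per tier, one slow request is enough to set the p90, which is why a single large doc dominates the 1.0s row. The sketch below illustrates this with the nearest-rank percentile definition. That definition is an assumption (the probe script's percentile method is not stated), and the sample list is illustrative: only the median and the largest value are taken from the table.

```python
import math
from typing import Sequence


def percentile_nearest_rank(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative latencies in ms for one 9-rep tier. The middle value (1781)
# and the outlier (19834) match the 1.0s row; the rest are made up.
latencies_ms = [900, 1100, 1300, 1500, 1781, 2000, 2200, 2400, 19834]

p50 = percentile_nearest_rank(latencies_ms, 50)
p90 = percentile_nearest_rank(latencies_ms, 90)
print(f"median={p50}ms p90={p90}ms max={max(latencies_ms)}ms")
```

Under this definition the p90 of 9 samples is the maximum itself, so comparing medians across tiers is the more robust signal when body sizes vary this much.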
### 3. `oshwhub.com/<owner>/<path>`: detail HTML page

**Auth**: none
**reps**: 10 per tier (distinct paths from the batch-50 candidate list)
**Ladder**: `[2.0, 1.0, 0.5, 0.25, 0.1, 0.0]s`

| sleep | status | bad | latency p90 |
|---|---|---:|---:|
| 2.0s | all 200 | 0 | 4767ms |
| 1.0s | all 200 | 0 | 6350ms |
| 0.5s | all 200 | 0 | **15364ms** ← server queue backs up |
| 0.25s | all 200 | 0 | 3755ms |
| 0.1s | all 200 | 0 | 8179ms |
| 0.0s | all 200 | 0 | 3856ms |

**Conclusion**: no tier produced an error, but **p90 jumped to about 15s at 0.5s**. A stall of that size means a real batch run would hit cascading timeouts. The detail page is SSR HTML (median body 0.5 MB), so the server starts queueing earlier than the listing API does. **1.0s is the level with safe headroom**, twice as fast as the previous 2s.
### 4. `oshwhub.com/api/project`: listing API

**Auth**: none
**reps**: 10 per tier
**Ladder**: `[2.0, 1.0, 0.5, 0.25, 0.1, 0.0]s`

| sleep | status | bad | latency p90 |
|---|---|---:|---:|
| 2.0s | all 200 | 0 | 1187ms |
| 1.0s | all 200 | 0 | 1237ms |
| 0.5s | all 200 | 0 | 567ms |
| 0.25s | all 200 | 0 | 1180ms |
| 0.1s | all 200 | 0 | 2194ms |
| 0.0s | all 200 | 0 | **5362ms** ← soft latency throttling |

**Conclusion**: the listing API returns no errors but throttles softly through latency. At 0s sleep, p90 jumps to about 5.4s, which indicates the server is queueing requests and slowing them down. **0.5s is the cost-benefit knee** (going faster gains nothing). In the code it shares `SLEEP_BETWEEN=1.0s` with the detail page to leave headroom.
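
One way to see the knee in the table above is to add the sleep to the p90 latency and compare the resulting worst-case time per request across tiers. Using `sleep + p90` as the cost is an assumption made for this sketch, not necessarily the criterion used for the report; the p90 values are the ones from the table.

```python
from typing import Dict

# sleep (s) -> latency p90 (ms), copied from the listing API table.
LISTING_P90_MS: Dict[float, int] = {
    2.0: 1187,
    1.0: 1237,
    0.5: 567,
    0.25: 1180,
    0.1: 2194,
    0.0: 5362,
}


def cycle_cost_ms(sleep_s: float, p90_ms: float) -> float:
    """Worst-case time spent per request: the sleep plus the p90 latency."""
    return sleep_s * 1000.0 + p90_ms


costs = {s: cycle_cost_ms(s, p90) for s, p90 in LISTING_P90_MS.items()}
best_sleep = min(costs, key=costs.get)

for sleep_s in sorted(costs, reverse=True):
    print(f"sleep={sleep_s:>4}s  cycle={costs[sleep_s]:.0f}ms")
print(f"lowest cycle cost at sleep={best_sleep}s")
```

Below the 0.5s tier the added latency outweighs the shorter sleep, which matches the observation that going faster gains nothing.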
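### 5. `modules.lceda.cn/...`: CDN

**Auth**: the path header is supplied by the caller, but the CDN does not validate it
**Not probed separately**: commit `1e06ba6` had already lowered this from 5.0s to 0.2s, based on HAR captures showing that the editor hits this CDN back to back, with no gap, when loading a project. A CDN is designed to absorb load.

**Conclusion**: keep 0.2s (already an aggressive tier). Going lower brings little benefit and would waste TCP connections.

---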
## Final settings applied to the crawler

```python
# crawlers/oshwhub/crawler.py
SLEEP_BETWEEN = 1.0  # was 2.0 (oshwhub detail/listing)
SLEEP_SOURCE = 0.5   # was 5.0 (lceda.cn Std doc, 10× faster)
SLEEP_PRO = 0.5      # was 5.0 (pro.lceda.cn API, 10× faster)
SLEEP_PRO_CDN = 0.2  # unchanged (modules.lceda.cn CDN, already optimized)
```
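
How the four constants map onto hosts can be sketched as a lookup keyed on the URL's hostname. The `sleep_for` helper and the `_HOST_SLEEP` table are hypothetical and only illustrate the mapping; the crawler may well select the sleep at each call site instead. Unknown hosts fall back to `SLEEP_BETWEEN`, the slowest of the four values.

```python
from urllib.parse import urlsplit

SLEEP_BETWEEN = 1.0
SLEEP_SOURCE = 0.5
SLEEP_PRO = 0.5
SLEEP_PRO_CDN = 0.2

# Hypothetical host -> sleep table. Exact hostname match, so "lceda.cn"
# does not accidentally cover "pro.lceda.cn" or "modules.lceda.cn".
_HOST_SLEEP = {
    "oshwhub.com": SLEEP_BETWEEN,
    "lceda.cn": SLEEP_SOURCE,
    "pro.lceda.cn": SLEEP_PRO,
    "modules.lceda.cn": SLEEP_PRO_CDN,
}


def sleep_for(url: str) -> float:
    """Return the sleep (in seconds) to apply before the next request to url."""
    host = urlsplit(url).hostname or ""
    return _HOST_SLEEP.get(host, SLEEP_BETWEEN)
```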
---
## Net effect: batch-50 plan walltime breakdown

Per step of `docs/plans/oshwhub_batch50.md`:

| Stage | Old sleep total | New sleep total | Saved |
|---|---:|---:|---:|
| Detail-page license scan (50 projects) | 100s | 50s | -50s |
| Pro, 25 projects (~5 API calls/project) | ~10 min | ~1 min | **-9 min** |
| Std, 25 projects (~10 docs/project) | ~21 min | ~2 min | **-19 min** |
| chain replay (CDN, already optimized) | unchanged | unchanged | - |
| **Total sleep time** | **~32 min** | **~3 min** | **~29 min** |

> Actual wall-clock time also includes byte download time (not the bottleneck).
> **Whole batch: ~2h → ~10-15 min**, an order-of-magnitude drop.
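
The sleep totals in the table can be reproduced from request counts and per-request sleeps. The sketch below assumes exactly 5 API calls per Pro project and 10 docs per Std project, whereas the table marks both as approximate, so the results (roughly 33 min before and 4 min after) land near the rounded totals rather than exactly on them.

```python
# (stage, request count, old sleep in s, new sleep in s)
STAGES = [
    ("detail-page license scan", 50, 2.0, 1.0),
    ("Pro API", 25 * 5, 5.0, 0.5),     # assumes exactly 5 calls per project
    ("Std docs", 25 * 10, 5.0, 0.5),   # assumes exactly 10 docs per project
]


def sleep_totals(stages):
    """Return (old_total_s, new_total_s) summed over all stages."""
    old_total = sum(n * old_s for _, n, old_s, _ in stages)
    new_total = sum(n * new_s for _, n, _, new_s in stages)
    return old_total, new_total


old_total, new_total = sleep_totals(STAGES)
print(
    f"old: {old_total / 60:.1f} min, "
    f"new: {new_total / 60:.1f} min, "
    f"ratio: {old_total / new_total:.1f}x"
)
```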
---
## Reproducing

```bash
# 1. Install dependencies (first run only)
uv sync
# 2. Build the Std doc UUID pool (if no Std project has been crawled yet, crawl some first)
uv run python -c "
import glob, json
out = []
for p in glob.glob('data/raw/oshwhub/*/source/manifest.json'):
    with open(p) as f:
        m = json.load(f)
    if 'upstream_version_documents' not in m:
        continue
    out.extend(vd['uuid'] for vd in m['upstream_version_documents'])
    if len(out) >= 50:
        break
with open('/tmp/std_doc_uuids.json', 'w') as f:
    json.dump(out, f)
"
# 3. Run the probe for each host (pick the ones you need)
PYTHONUNBUFFERED=1 uv run python -u scripts/probe_rate_limit.py --host oshwhub
PYTHONUNBUFFERED=1 uv run python -u scripts/probe_rate_limit.py --host detail
PYTHONUNBUFFERED=1 uv run python -u scripts/probe_rate_limit.py --host std-doc
PYTHONUNBUFFERED=1 uv run python -u scripts/probe_rate_limit.py --host pro # requires the Pro cookie
```

The output prints the status / latency distribution tier by tier. Any bad response stops the run immediately, and the previous tier is the safe level.

---
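## Possible follow-ups

- **Std attachment endpoint (`image.lceda.cn`)**: not probed separately. It currently uses `SLEEP_BETWEEN=1.0s`; since it behaves like a CDN, it can probably also drop to the 0.2s tier.
- **A more aggressive level (0.25s)**: it passed in every probe, but 0.5s is kept as headroom for batch-500 / batch-2k scale. Push lower once the crawl actually reaches that scale.
- **Long-run risk-control test**: every experiment so far is a short burst. Would running at 0.5s for 12h straight trigger a different throttling policy? That needs a long-running test.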