crawler: split sleep policy by host — chain blob fetches drop 5s -> 0.2s

The Pro modern fetcher (_fetch_pro_modern) walks a per-history blob loop
on modules.lceda.cn (a CDN-flavored host serving AES-encrypted EPRO2
streams). We were sleeping 5s between every blob, the same interval we
use for the rate-sensitive pro.lceda.cn API host. HAR analysis
(proexportNew2.har) shows the editor fires these blobs back-to-back
without throttling, so 0.2s is plenty.

Total sleep time saved grows linearly with chain length (4.8s per blob):
  ESP-VoCat  (chain=12):               80s sleep -> 22s sleep  (-72%)
  220V power (chain=28):              160s sleep -> 26s sleep  (-84%)
  X86 board  (chain~700, projection):  ~1h       -> ~3min
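
The figures above are consistent with a simple budget: a fixed 4 x 5s of
API-host sleeps plus one sleep per blob in the chain. A minimal sketch of
that arithmetic follows. The fixed term is inferred from the "4x 5s in
modern fetcher" count, and the names API_SLEEPS, OLD_BLOB_S, NEW_BLOB_S
and total_sleep are illustrative, not taken from the crawler:

```python
# Assumed model: total sleep = fixed API-host sleeps + one sleep per blob.
API_SLEEPS = 4      # assumption: the 4 API-host sleeps in the modern fetcher
API_SLEEP_S = 5.0   # SLEEP_PRO
OLD_BLOB_S = 5.0    # per-blob sleep before this commit
NEW_BLOB_S = 0.2    # SLEEP_PRO_CDN


def total_sleep(chain_len: int, blob_sleep_s: float) -> float:
    """Seconds spent sleeping for one project with chain_len history blobs."""
    return API_SLEEPS * API_SLEEP_S + chain_len * blob_sleep_s


for name, chain in [("ESP-VoCat", 12), ("220V power", 28), ("X86 board", 700)]:
    old = total_sleep(chain, OLD_BLOB_S)
    new = total_sleep(chain, NEW_BLOB_S)
    print(f"{name}: {old:.0f}s -> {new:.0f}s ({(new - old) / old:+.0%})")
```

The fixed 20s term is why short chains save proportionally less than
long ones.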

Verified by re-fetching ESP-VoCat + 220V power: byte-identical output
across all per-doc .epro2 files (sha256 match); only the fetched_at
timestamp differs in manifest.json. Two manifest files are re-stamped as
proof of the validation runs.
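
A check of this kind can be sketched as below, assuming the old and new
fetch outputs sit in two directory trees. The helper names (sha256_of,
epro2_digests, mismatched) are hypothetical and are not part of the
crawler:

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex sha256 of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def epro2_digests(root: Path) -> dict[str, str]:
    """Map each .epro2 path (relative to root) to its sha256."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*.epro2"))
    }


def mismatched(before: Path, after: Path) -> list[str]:
    """Relative paths that are missing on one side or differ in content."""
    old, new = epro2_digests(before), epro2_digests(after)
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))
```

An empty list from mismatched() corresponds to the "sha256 match" result
reported above; manifest.json is deliberately left out of the comparison
because fetched_at is expected to change.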

API host sleeps (4x 5s in modern fetcher, 7x 5s in legacy fetcher) are
unchanged — those go to pro.lceda.cn /api/ which still wants polite
QPS<=0.2.
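
For illustration only, the same host split could be expressed as a lookup
keyed on hostname. This is not what the diff does (it passes
SLEEP_PRO_CDN directly at the one call site); SLEEP_BY_HOST and sleep_for
are hypothetical names, and falling back to the slow interval for unknown
hosts is an assumption:

```python
from urllib.parse import urlsplit

SLEEP_PRO = 5.0      # API host: keep at QPS <= 0.2
SLEEP_PRO_CDN = 0.2  # CDN host serving history blobs

# Hypothetical table; the real code picks the constant at the call site.
SLEEP_BY_HOST = {
    "pro.lceda.cn": SLEEP_PRO,
    "modules.lceda.cn": SLEEP_PRO_CDN,
}


def sleep_for(url: str) -> float:
    """Sleep interval for a URL; unknown hosts get the conservative value."""
    host = urlsplit(url).hostname or ""
    return SLEEP_BY_HOST.get(host, SLEEP_PRO)
```

Defaulting to SLEEP_PRO keeps any host that has not been checked against
a HAR capture on the polite rate.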

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 00:09:19 +08:00
parent ff5553fb06
commit 1e06ba6582
3 changed files with 11 additions and 4 deletions


@@ -45,7 +45,13 @@ BROWSER_UA = (
)
SLEEP_BETWEEN = 2.0 # seconds between detail-page / file fetches
SLEEP_SOURCE = 5.0 # source fetch is sensitive — QPS ≤ 0.2 per CLAUDE.md 登录态 (logged-in session) spirit
-SLEEP_PRO = 5.0 # Pro is logged-in; same QPS ≤ 0.2 per docs/sources/easyeda_pro_source.md §4.1
+SLEEP_PRO = 5.0 # Pro API host (pro.lceda.cn): rate-sensitive, keep at QPS ≤ 0.2
+# CDN host (modules.lceda.cn) only serves AES-encrypted history blobs.
+# HAR analysis (proexportNew2.har 2026-04-29) shows the editor fires these
+# blobs back-to-back without throttling — the CDN can clearly take it.
+# Walltime for chain replay is dominated by this loop on multi-hundred-history
+# projects (X86 board: chain ≈ 700 → ~1h at 5s/req → ~few min at 0.2s/req).
+SLEEP_PRO_CDN = 0.2
# ---------------------------------------------------------------------------
@@ -549,7 +555,8 @@ def _fetch_pro_modern(
}
if cur_doc and cur_doc in docs:
docs[cur_doc]["lines"].append(ln)
-time.sleep(sleep)
+# CDN host, not the rate-sensitive API host — see SLEEP_PRO_CDN comment.
+time.sleep(SLEEP_PRO_CDN)
# 6. write per-doc .epro2 + manifest
doc_metas: list[dict] = []