The doc had been growing incrementally as each host got probed; reshape
it as a polished benchmark with TL;DR top, methodology section
(including safety constraints + caveats), per-host detailed tables,
final crawler settings, batch-50 walltime breakdown, and a reproduce
recipe.
Five hosts fully covered:
pro.lceda.cn API 5.0s -> 0.5s (10×)
lceda.cn doc 5.0s -> 0.5s (10×)
oshwhub detail 2.0s -> 1.0s ( 2×)
oshwhub listing 2.0s -> 1.0s ( 2×)
modules.lceda CDN 0.2s (already optimized)
Net effect on batch-50 plan: sleep total ~32min -> ~3min, walltime
~2h -> ~10-15min.
Key finding: the original 5s/req on Pro was set out of "logged-in
account is precious" caution with zero empirical evidence. Sustained
burst probe (25 distinct UUIDs at 0.5s, no recovery) showed 0/25 errors
and median latency 410ms — the caution was unjustified.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ladder probe lceda.cn/api/documents/<uuid>: 5 tiers (5/2/1/0.5/0.25s)
× 9 distinct Std doc UUIDs = 45 reqs total, all 200/success. Latency
variance is dominated by payload size (Std docs span 4 KB to 4.5 MB)
not server backpressure. Same posture as Pro API.
Net effect on batch-50 estimate: Std 25 项 × 10 doc calls saved ~19
min wall time (21min sleep -> 2min sleep). Combined plan now projects
~2h -> ~10min walltime exclusive of download bytes.
scripts/probe_rate_limit.py: --host std-doc tier added. Reads doc UUIDs
from /tmp/std_doc_uuids.json (assembled by caller from any source/manifest.json
upstream_version_documents lists). Reusable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Calibrated against ladder probes on 2026-04-29. Findings in
docs/sources/probe_rate_limit_results.md.
SLEEP_PRO 5.0 -> 0.5 (pro.lceda.cn API)
SLEEP_BETWEEN 2.0 -> 1.0 (oshwhub detail/listing)
SLEEP_SOURCE 5.0 unchanged (lceda.cn Std endpoints — not yet probed)
SLEEP_PRO_CDN 0.2 unchanged (modules.lceda.cn — already optimized)
The original 5s rate for Pro API was set out of caution because Pro
requires a logged-in cookie. Empirical sustained-burst probe (25
distinct UUIDs at 0.5s sleep, no recovery): 0/25 errors, median
latency 410ms, p90 932ms. The "Pro is rate-sensitive" assumption was
wrong — server tolerates QPS=2 cleanly.
oshwhub detail HTML pages slowed from p90 6.4s at 1.0s sleep to
p90 15s at 0.5s — server queue backs up. 1.0s is the headroom-safe
water mark.
Net effect on batch-50 estimate: ~1.5h -> ~30min.
scripts/probe_rate_limit.py: rate-limit ladder probe tool. Reusable
for new endpoints (Std source still owes a probe). Designed for safety:
30s tier recovery, low rep counts on auth hosts, bail on first non-200.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Colleague-facing explainer at docs/sources/pro_crawl_vs_export.md.
Addresses the "I see 278 .epro2 files but my browser only downloaded
one" confusion: web download is a ZIP container (extension is a UX
choice, not a format), our crawl produces per-doc message streams.
Both carry equivalent EPRO2 data; only real gap is IMAGE/ binary
previews which we don't fetch yet.
Why per-doc and not ZIP: the ZIP path has no public endpoint —
three HARs confirm the export button fires zero HTTP requests, it's
pure client-side JSZip on data already loaded by the editor. Our
crawler hits the same chain endpoints the editor uses internally,
which delivers per-doc streams.
Log entry references the 278 vs 266 doc-count delta for ESP-VoCat
(we walk full history chain, web export is a current snapshot).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Probed listing API and learned: total field is exposed (Pro=21,202 / Std=12,493),
pageSize accepts >=1000 (full corpus = 35 requests / 71s), sort param is silently
ignored. Dump all listings via scripts/dump_listing_index.py to local jsonl so
downstream batch-selection no longer hits the API.
Why: needed quantitative anchors before scaling Pro batch beyond top-5. License
is detail-page only (~19h serial scan), so we want to filter on grade/like
*locally* first to shortlist before paying that cost. Quality-tier counts now
known: A-tier (grade>=3 & like>=10) = 2,806 across both origins.
- scripts/dump_listing_index.py: one-shot scraper, polite QPS, streams to jsonl
- docs/sources/oshwhub_listing_full.md: human-readable report with growth
trends, quality tiers, owner concentration, and storage-budget anchors
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>