Knowit 183f82a3be crawler: drop SLEEP_SOURCE 5.0 -> 0.5 (Std doc endpoint probe)

Ladder probe lceda.cn/api/documents/<uuid>: 5 tiers (5/2/1/0.5/0.25s)
× 9 distinct Std doc UUIDs = 45 reqs total, all 200/success. Latency
variance is dominated by payload size (Std docs span 4 KB to 4.5 MB),
not by server backpressure. Same posture as Pro API.

Net effect on batch-50 estimate: Std 25 projects × 10 doc calls saves ~19
min wall time (21min sleep -> 2min sleep). Combined plan now projects
~2h -> ~10min walltime exclusive of download bytes.

scripts/probe_rate_limit.py: --host std-doc tier added. Reads doc UUIDs
from /tmp/std_doc_uuids.json (assembled by caller from any source/manifest.json
upstream_version_documents lists). Reusable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-29 00:54:46 +08:00

Rate-limit probe results

Probe date: 2026-04-29

Script: scripts/probe_rate_limit.py

Method: Ladder test. Send N requests per tier at decreasing inter-request sleep, with 30s recovery between tiers, and watch for status != 200, body shrinkage, or latency degradation.
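The ladder method can be sketched roughly as follows. This is a minimal illustration, not the contents of scripts/probe_rate_limit.py: `run_ladder`, `p90`, and the injected `fetch` callable are hypothetical names, and the nearest-rank percentile is an assumption about how p90 is computed.

```python
import math
import time
from typing import Callable, Dict, List, Tuple


def p90(values: List[float]) -> float:
    """Nearest-rank 90th percentile (assumed definition, not taken from the script)."""
    ordered = sorted(values)
    rank = math.ceil(9 * len(ordered) / 10)
    return ordered[rank - 1]


def run_ladder(
    fetch: Callable[[int], Tuple[int, bytes]],
    tiers: List[float],
    reps: int,
    recovery: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[Dict[str, object]]:
    """Run `reps` requests per tier, sleeping `tier` seconds after each request.

    `fetch(i)` is a hypothetical callable that performs request number `i`
    and returns (HTTP status, response body).
    """
    results = []
    for index, tier in enumerate(tiers):
        statuses, latencies_ms, sizes = [], [], []
        for i in range(reps):
            start = clock()
            status, body = fetch(i)
            latencies_ms.append((clock() - start) * 1000.0)
            statuses.append(status)
            sizes.append(len(body))
            sleep(tier)
        results.append({
            "sleep": tier,
            "all_200": all(s == 200 for s in statuses),
            "latency_p90_ms": p90(latencies_ms),
            "min_body_bytes": min(sizes),  # a sudden drop hints at body shrinkage
        })
        if index < len(tiers) - 1:
            sleep(recovery)  # let the server recover before the next tier
    return results
```

Injecting `fetch`, `sleep`, and `clock` keeps the loop testable without touching the network; a real run would pass a function that issues the HTTP request for the i-th URL.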

oshwhub.com listing API (`/api/project`)

No auth. 6 tiers × 10 reps = 60 reqs total.

sleep	status	latency p90
2.0s	all 200	1187ms
1.0s	all 200	1237ms
0.5s	all 200	567ms
0.25s	all 200	1180ms
0.1s	all 200	2194ms
0.0s	all 200	5362ms ← server soft-limits via latency

Verdict: 0.5s is the safe water mark. Going faster does not fail, but the server adds queueing latency, so the extra speed buys nothing.

oshwhub.com detail HTML (`/<owner>/<path>`)

No auth. 6 tiers × 10 distinct paths from batch-50 candidates.

sleep	status	latency p90
2.0s	all 200	4767ms
1.0s	all 200	6350ms
0.5s	all 200	15364ms ← queue building
0.25s	all 200	3755ms
0.1s	all 200	8179ms
0.0s	all 200	3856ms

Verdict: 1.0s is the safe water mark. The detail HTML is a 0.5 MB server-side-rendered (SSR) page, so the server slows down earlier than it does for the listing API. Dropping to 0.5s already triggers server-side queueing (one outlier response took 15s), which risks timeout cascades on real bulk runs.

pro.lceda.cn API (`/api/v4/projects/<P>`)

Auth required (logged-in cookie). Conservative ladder, reps capped at 8 to limit fingerprint exposure. 5 tiers × 8 reqs.

sleep	status	latency p90
5.0s	all 200	7299ms
2.0s	all 200	5518ms
1.0s	all 200	1409ms
0.5s	all 200	2995ms
0.25s	all 200	1552ms

Then sustained burst test at the chosen water mark: 25 distinct Pro UUIDs at 0.5s sleep, no recovery.

25/25 success (all status 200, all success: true)
median latency 410ms, p90 932ms, max 1853ms (first call only — TLS handshake)
effective QPS 1.0
wall time 24.9s (vs ~140s at the old 5s/req — 5.6× speedup)

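Verdict: 0.5s is the safe water mark. Empirically the Pro API tolerates a sustained 0.5s inter-request sleep cleanly (nominally 2 QPS, ~1.0 QPS effective once request latency is included). The sleep was originally set high (5s) out of caution because Pro requires a logged-in account; the probe shows that caution was unjustified.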
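The burst-test figures can be cross-checked with a few lines of arithmetic. A 0.5s sleep caps the request rate at 2 per second; the measured rate comes out lower because each request's own latency is added on top of the sleep. The variable names below are illustrative only.

```python
# Figures taken from the burst-test summary above.
requests_sent = 25
wall_time_s = 24.9
old_wall_time_s = 140.0   # approximate wall time at the old 5s/req setting
sleep_s = 0.5

nominal_qps = 1 / sleep_s                      # upper bound set by the sleep alone
effective_qps = requests_sent / wall_time_s    # what was actually achieved
speedup = old_wall_time_s / wall_time_s

print(f"nominal {nominal_qps:.1f} QPS, effective {effective_qps:.1f} QPS, "
      f"speedup {speedup:.1f}x")
```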

lceda.cn Std doc endpoint (`/api/documents/<uuid>`)

No auth (Std is anonymous-readable, browser UA + Referer only). 5 tiers × 9 distinct doc UUIDs from already-crawled Std projects.

sleep	status	latency med	latency p90	body median
5.0s	all 200	1124ms	3846ms	31 KB
2.0s	all 200	2634ms	7626ms	495 KB
1.0s	all 200	1781ms	19834ms (one 4.5 MB doc)	918 KB
0.5s	all 200	666ms	891ms	748 KB
0.25s	all 200	416ms	1384ms	251 KB

Verdict: 0.5s is the safe water mark. Latency variance is dominated by payload size (Std docs span 4 KB to 4.5 MB), not server backpressure. The 19.8s p90 at the 1.0s tier came from one 4.5 MB doc, not a throttle. Same posture as the Pro API.
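One way to check a "payload size, not backpressure" reading is to correlate per-request body size with latency. The sketch below assumes per-request `(body_bytes, latency_ms)` pairs are available, which this document does not include; `looks_payload_dominated` and its 0.8 threshold are hypothetical choices, and a high correlation is suggestive rather than proof.

```python
import math
from typing import List, Tuple


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def looks_payload_dominated(
    samples: List[Tuple[int, float]], threshold: float = 0.8
) -> bool:
    """True when latency tracks body size closely (threshold is an arbitrary assumption)."""
    sizes = [float(size) for size, _ in samples]
    latencies = [float(latency) for _, latency in samples]
    return pearson(sizes, latencies) >= threshold
```

If latency rose at a faster tier while body sizes stayed flat, the correlation would drop, which would point toward throttling instead.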

modules.lceda.cn CDN — already at 0.2s

CDN host serving AES-encrypted EPRO2 history blobs. Pre-existing SLEEP_PRO_CDN = 0.2, validated against editor HAR which fires blobs back-to-back without throttling. No further probing needed.

Settings applied

SLEEP_BETWEEN = 1.0   # was 2.0  (oshwhub detail/listing)
SLEEP_SOURCE  = 0.5   # was 5.0  (Std doc endpoint, 10× shorter sleep)
SLEEP_PRO     = 0.5   # was 5.0  (Pro API host, 10× shorter sleep)
SLEEP_PRO_CDN = 0.2   # unchanged (CDN, already optimized)
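As an illustration of how these constants could be applied per host, here is a minimal lookup sketch. The host-to-constant mapping follows the comments above; `HOST_SLEEP` and `sleep_for` are hypothetical names, and how the crawler actually selects a sleep is not shown in this document.

```python
from urllib.parse import urlsplit

SLEEP_BETWEEN = 1.0
SLEEP_SOURCE = 0.5
SLEEP_PRO = 0.5
SLEEP_PRO_CDN = 0.2

# Hypothetical mapping from host name to inter-request sleep (seconds).
HOST_SLEEP = {
    "oshwhub.com": SLEEP_BETWEEN,
    "lceda.cn": SLEEP_SOURCE,
    "pro.lceda.cn": SLEEP_PRO,
    "modules.lceda.cn": SLEEP_PRO_CDN,
}


def sleep_for(url: str) -> float:
    """Return the sleep for a URL's host, falling back to the slowest setting."""
    host = urlsplit(url).hostname or ""
    return HOST_SLEEP.get(host, max(HOST_SLEEP.values()))
```

Falling back to the most conservative value means an unprobed host is never hit faster than any probed one.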

Net impact on batch-50 plan

Pro, 25 projects × ~5 API calls each: 5s × 5 = 25s/proj × 25 = ~10min → 0.5s × 5 = 2.5s/proj × 25 = ~1min
Std, 25 projects × ~10 doc calls each: 5s × 10 = 50s/proj × 25 = ~21min → 0.5s × 10 = 5s/proj × 25 = ~2min
Detail page scan, 50 projects: 50 × 2s = 100s → 50 × 1s = 50s
Combined batch-50 walltime estimate: ~2h → ~10 min (excluding actual download bytes)
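The three itemized sleep budgets above can be reproduced with a short calculation. This covers only the sleep time listed per item; the combined walltime estimate is not itemized in this document, so the sketch does not attempt to reproduce it. `total_sleep_s` is an illustrative helper name.

```python
def total_sleep_s(projects: int, calls_per_project: int, sleep_s: float) -> float:
    """Total time spent sleeping for one class of requests."""
    return projects * calls_per_project * sleep_s


# Old settings: 5s for Pro and Std, 2s between detail pages.
before = {
    "pro": total_sleep_s(25, 5, 5.0),
    "std": total_sleep_s(25, 10, 5.0),
    "detail": total_sleep_s(50, 1, 2.0),
}
# New settings: 0.5s for Pro and Std, 1s between detail pages.
after = {
    "pro": total_sleep_s(25, 5, 0.5),
    "std": total_sleep_s(25, 10, 0.5),
    "detail": total_sleep_s(50, 1, 1.0),
}

for name in before:
    print(f"{name}: {before[name] / 60:.1f} min -> {after[name] / 60:.1f} min")
```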
