crawler: drop SLEEP_SOURCE 5.0 -> 0.5 (Std doc endpoint probe)

Ladder probe of lceda.cn/api/documents/<uuid>: 5 tiers (5/2/1/0.5/0.25s) × 9 distinct Std doc UUIDs = 45 requests total, all 200/success. Latency variance is dominated by payload size (Std docs span 4 KB to 4.5 MB), not server backpressure. Same posture as Pro API.

Net effect on batch-50 estimate: Std 25 projects × 10 doc calls saves ~19 min wall time (21 min sleep -> 2 min sleep). Combined plan now projects ~2h -> ~10 min walltime, exclusive of download bytes.

scripts/probe_rate_limit.py: added --host std-doc tier. Reads doc UUIDs from /tmp/std_doc_uuids.json (assembled by the caller from any source/manifest.json upstream_version_documents lists). Reusable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 00:54:46 +08:00
parent 8b857428e3
commit 183f82a3be
3 changed files with 86 additions and 9 deletions
--- a/docs/sources/probe_rate_limit_results.md
+++ b/docs/sources/probe_rate_limit_results.md
@@ -64,11 +64,23 @@ Then **sustained burst test** at the chosen water mark:
 cleanly, even sustained. Originally set high (5s) out of caution because
 Pro requires a logged-in account — that caution was unjustified.

-## lceda.cn Std source endpoints — NOT YET PROBED
+## lceda.cn Std doc endpoint (`/api/documents/<uuid>`)

-Currently `SLEEP_SOURCE = 5.0`. Should be probed before lowering. Std
-crawler isn't on the critical path for batch-50 (~12 min vs Pro's
-~10 min savings), so this can wait.
+No auth (Std is anonymous-readable, browser UA + Referer only).
+5 tiers × 9 distinct doc UUIDs from already-crawled Std projects.
+
+| sleep | status | bad | latency med | latency p90 | body median |
+|---|---|---:|---:|---:|---:|
+| 5.0s | all 200 | 0 | 1124ms | 3846ms | 31 KB |
+| 2.0s | all 200 | 0 | 2634ms | 7626ms | 495 KB |
+| 1.0s | all 200 | 0 | 1781ms | **19834ms** (one 4.5 MB doc) | 918 KB |
+| 0.5s | all 200 | 0 | 666ms | 891ms | 748 KB |
+| 0.25s | all 200 | 0 | 416ms | 1384ms | 251 KB |
+
+**Verdict**: 0.5s safe water mark. Latency variance is dominated by
+**payload size** (Std docs span 4 KB to 4.5 MB) — not server backpressure.
+The 19s p90 at the 1.0s tier was one giant doc, not a throttle. Same
+posture as Pro API.

 ## modules.lceda.cn CDN — already at 0.2s

@@ -80,7 +92,7 @@ back-to-back without throttling. No further probing needed.

 ```python
 SLEEP_BETWEEN = 1.0   # was 2.0  (oshwhub detail/listing)
-SLEEP_SOURCE  = 5.0   # unchanged (Std source — not yet probed)
+SLEEP_SOURCE  = 0.5   # was 5.0  (Std doc endpoint, 10× speedup)
 SLEEP_PRO     = 0.5   # was 5.0  (Pro API host, 10× speedup)
 SLEEP_PRO_CDN = 0.2   # unchanged (CDN, already optimized)
 ```
@@ -88,5 +100,6 @@ SLEEP_PRO_CDN = 0.2   # unchanged (CDN, already optimized)
 ## Net impact on batch-50 plan

 - Pro 25 projects × ~5 API calls each: 5×5 = 25s/proj × 25 = ~10min  →  0.5×5 = 2.5s/proj × 25 = ~1min
+- Std 25 projects × ~10 doc calls each: 5×10 = 50s/proj × 25 = ~21min  →  0.5×10 = 5s/proj × 25 = ~2min
 - Detail page scan 50 projects: 50 × 2s = 100s  →  50 × 1s = 50s
- Combined batch-50 walltime estimate: **~1.5h → ~30 min**
+- Combined batch-50 walltime estimate: **~2h → ~10 min** (excluding actual download bytes)
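
The per-row sleep arithmetic in the "Net impact on batch-50 plan" list can be checked with a short script. This is a minimal sketch, not code from the repository: `sleep_seconds`, `ROWS`, and `summarize` are hypothetical names, and the call counts are the approximate figures quoted in the list. It covers sleep time only, so the three rows are not expected to add up to the combined walltime estimate, which presumably includes work that is not itemized here.

```python
# Hypothetical helper names; the numbers mirror the figures quoted in the list above.

def sleep_seconds(projects: int, calls_per_project: int, sleep_s: float) -> float:
    """Total time spent sleeping between requests, ignoring transfer time."""
    return projects * calls_per_project * sleep_s


ROWS = [
    # (label, projects, calls per project, old sleep in s, new sleep in s)
    ("Pro API", 25, 5, 5.0, 0.5),
    ("Std doc", 25, 10, 5.0, 0.5),
    ("Detail scan", 50, 1, 2.0, 1.0),
]


def summarize(rows=ROWS):
    """Map each label to (sleep seconds before, sleep seconds after)."""
    return {
        label: (sleep_seconds(n, calls, old), sleep_seconds(n, calls, new))
        for label, n, calls, old, new in rows
    }


if __name__ == "__main__":
    for label, (before, after) in summarize().items():
        print(f"{label:12s} {before / 60:5.1f} min -> {after / 60:5.1f} min")
```

For the Std row this gives 1250 s before and 125 s after, which matches the ~21 min and ~2 min figures once rounded.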