crawler: drop SLEEP_SOURCE 5.0 -> 0.5 (Std doc endpoint probe)

Ladder probe lceda.cn/api/documents/<uuid>: 5 tiers (5/2/1/0.5/0.25s)
× 9 distinct Std doc UUIDs = 45 reqs total, all 200/success. Latency
variance is dominated by payload size (Std docs span 4 KB to 4.5 MB),
not server backpressure. Same posture as Pro API.

Net effect on batch-50 estimate: 25 Std projects × 10 doc calls saves ~19
min of wall time (21 min sleep -> 2 min sleep). Combined plan now projects
~2h -> ~10 min wall time, excluding download bytes.

scripts/probe_rate_limit.py: --host std-doc tier added. Reads doc UUIDs
from /tmp/std_doc_uuids.json (assembled by caller from any source/manifest.json
upstream_version_documents lists). Reusable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 00:54:46 +08:00
parent 8b857428e3
commit 183f82a3be
3 changed files with 86 additions and 9 deletions

@@ -64,11 +64,23 @@ Then **sustained burst test** at the chosen water mark:
cleanly, even sustained. Originally set high (5s) out of caution because
Pro requires a logged-in account — that caution was unjustified.
-## lceda.cn Std source endpoints — NOT YET PROBED
+## lceda.cn Std doc endpoint (`/api/documents/<uuid>`)
-Currently `SLEEP_SOURCE = 5.0`. Should be probed before lowering. Std
-crawler isn't on the critical path for batch-50 (~12 min vs Pro's
-~10 min savings), so this can wait.
+No auth (Std is anonymous-readable, browser UA + Referer only).
+5 tiers × 9 distinct doc UUIDs from already-crawled Std projects.
+| sleep | status | bad | latency med | latency p90 | body median |
+|---|---|---:|---:|---:|---:|
+| 5.0s | all 200 | 0 | 1124ms | 3846ms | 31 KB |
+| 2.0s | all 200 | 0 | 2634ms | 7626ms | 495 KB |
+| 1.0s | all 200 | 0 | 1781ms | **19834ms** (one 4.5 MB doc) | 918 KB |
+| 0.5s | all 200 | 0 | 666ms | 891ms | 748 KB |
+| 0.25s | all 200 | 0 | 416ms | 1384ms | 251 KB |
+**Verdict**: 0.5s safe water mark. Latency variance is dominated by
+**payload size** (Std docs span 4 KB to 4.5 MB) — not server backpressure.
+The 19s p90 at the 1.0s tier was one giant doc, not a throttle. Same
+posture as Pro API.
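
As a hedged illustration of why one large document can set the p90 for a whole tier: with only 9 samples, a nearest-rank 90th percentile is simply the slowest request. The sketch below assumes nearest-rank percentiles; the method `scripts/probe_rate_limit.py` actually uses is not shown in this commit, `summarize` is a hypothetical helper, and the sample latencies are made up for illustration.

```python
import math
import statistics


def summarize(latencies_ms: list[float]) -> dict[str, float]:
    """Median and nearest-rank p90 for one sleep tier (hypothetical helper)."""
    ordered = sorted(latencies_ms)
    # Nearest-rank: the smallest value with at least 90% of samples at or below it.
    rank = math.ceil(0.9 * len(ordered))
    return {"median": statistics.median(ordered), "p90": ordered[rank - 1]}


# Illustrative data only: eight ordinary requests plus one very slow one.
sample = [400, 450, 500, 550, 600, 650, 700, 750, 19834]
print(summarize(sample))
```

Under that assumption the median stays near the typical request while the p90 equals the single outlier, which matches the reading that the 1.0s tier's p90 reflects payload size rather than throttling.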
## modules.lceda.cn CDN — already at 0.2s
@@ -80,7 +92,7 @@ back-to-back without throttling. No further probing needed.
```python
SLEEP_BETWEEN = 1.0 # was 2.0 (oshwhub detail/listing)
-SLEEP_SOURCE = 5.0 # unchanged (Std source — not yet probed)
+SLEEP_SOURCE = 0.5 # was 5.0 (Std doc endpoint, 10× speedup)
SLEEP_PRO = 0.5 # was 5.0 (Pro API host, 10× speedup)
SLEEP_PRO_CDN = 0.2 # unchanged (CDN, already optimized)
```
@@ -88,5 +100,6 @@ SLEEP_PRO_CDN = 0.2 # unchanged (CDN, already optimized)
## Net impact on batch-50 plan
- 25 Pro projects × ~5 API calls each: 5×5 = 25s/proj × 25 = ~10min → 0.5×5 = 2.5s/proj × 25 = ~1min
+- 25 Std projects × ~10 doc calls each: 5×10 = 50s/proj × 25 = ~21min → 0.5×10 = 5s/proj × 25 = ~2min
- Detail page scan, 50 projects: 50 × 2s = 100s → 50 × 1s = 50s
-- Combined batch-50 walltime estimate: **~1.5h → ~30 min**
+- Combined batch-50 walltime estimate: **~2h → ~10 min** (excluding actual download bytes)
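
The per-line arithmetic above can be reproduced with a short sketch. It covers sleep time only. The names `sleep_seconds`, `PLAN`, and `totals` are illustrative and not part of the crawler, and the call counts are the approximate figures from the bullets.

```python
def sleep_seconds(projects: int, calls_per_project: int, sleep_s: float) -> float:
    """Total time spent sleeping for one class of request."""
    return projects * calls_per_project * sleep_s


# (projects, calls per project, old sleep, new sleep), taken from the bullets above.
PLAN = {
    "pro_api": (25, 5, 5.0, 0.5),
    "std_doc": (25, 10, 5.0, 0.5),
    "detail_page": (50, 1, 2.0, 1.0),
}


def totals(plan):
    """Return (before, after) dicts of sleep seconds per request class."""
    before = {k: sleep_seconds(p, c, old) for k, (p, c, old, _new) in plan.items()}
    after = {k: sleep_seconds(p, c, new) for k, (p, c, _old, new) in plan.items()}
    return before, after


before, after = totals(PLAN)
for name in PLAN:
    print(f"{name}: {before[name] / 60:.1f} min -> {after[name] / 60:.1f} min")
```

These three sleep terms do not by themselves add up to the combined estimate, so the sketch does not try to reproduce that figure.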