crawler: drop SLEEP_SOURCE 5.0 -> 0.5 (Std doc endpoint probe)
Ladder probe lceda.cn/api/documents/<uuid>: 5 tiers (5/2/1/0.5/0.25s) × 9
distinct Std doc UUIDs = 45 reqs total, all 200/success. Latency variance
is dominated by payload size (Std docs span 4 KB to 4.5 MB), not server
backpressure. Same posture as Pro API.

Net effect on batch-50 estimate: Std 25 projects × 10 doc calls saves
~19 min wall time (21min sleep -> 2min sleep). Combined plan now projects
~2h -> ~10min walltime exclusive of download bytes.

scripts/probe_rate_limit.py: --host std-doc tier added. Reads doc UUIDs
from /tmp/std_doc_uuids.json (assembled by caller from any
source/manifest.json upstream_version_documents lists). Reusable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
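The ladder described in the commit message could be sketched roughly as follows. This is not the contents of `scripts/probe_rate_limit.py`: `run_ladder`, `summarize`, `p90`, and the injected `fetch` callable are hypothetical names, and the sketch assumes every tier reuses the same UUID list, which the message does not state. `fetch` is expected to return a `(status, body)` pair, so the HTTP client and the UA/Referer headers stay outside the sketch.

```python
import statistics
import time

# Sleep tiers from the commit message, slowest first (seconds).
TIERS = (5.0, 2.0, 1.0, 0.5, 0.25)


def p90(values):
    """Nearest-rank 90th percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = (9 * len(ordered) + 9) // 10  # integer ceil(0.9 * n)
    return ordered[rank - 1]


def run_ladder(fetch, uuids, tiers=TIERS, sleep=time.sleep):
    """Fetch every UUID once per tier, sleeping `tier` seconds after each call."""
    results = {}
    for tier in tiers:
        rows = []
        for uuid in uuids:
            start = time.perf_counter()
            status, body = fetch(uuid)  # hypothetical callable, see lead-in
            elapsed_ms = (time.perf_counter() - start) * 1000
            rows.append({"status": status, "latency_ms": elapsed_ms, "bytes": len(body)})
            sleep(tier)
        results[tier] = rows
    return results


def summarize(rows):
    """Reduce one tier's rows to the columns used in the results table."""
    latencies = [r["latency_ms"] for r in rows]
    sizes = [r["bytes"] for r in rows]
    return {
        "bad": sum(1 for r in rows if r["status"] != 200),
        "latency_med_ms": statistics.median(latencies),
        "latency_p90_ms": p90(latencies),
        "body_median_bytes": statistics.median(sizes),
    }
```

Reporting body size next to latency is what lets a slow tier be attributed to a large document rather than to throttling; a throttle would more likely show up as non-200 statuses or as latency rising while body size stays flat.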
@@ -64,11 +64,23 @@ Then **sustained burst test** at the chosen water mark:
 cleanly, even sustained. Originally set high (5s) out of caution because
 Pro requires a logged-in account; that caution was unjustified.

-## lceda.cn Std source endpoints: NOT YET PROBED
+## lceda.cn Std doc endpoint (`/api/documents/<uuid>`)

-Currently `SLEEP_SOURCE = 5.0`. Should be probed before lowering. Std
-crawler isn't on the critical path for batch-50 (~12 min vs Pro's
-~10 min savings), so this can wait.
+No auth (Std is anonymous-readable, browser UA + Referer only).
+5 tiers × 9 distinct doc UUIDs from already-crawled Std projects.
+
+| sleep | status | bad | latency med | latency p90 | body median |
+|---|---|---:|---:|---:|---:|
+| 5.0s | all 200 | 0 | 1124ms | 3846ms | 31 KB |
+| 2.0s | all 200 | 0 | 2634ms | 7626ms | 495 KB |
+| 1.0s | all 200 | 0 | 1781ms | **19834ms** (one 4.5 MB doc) | 918 KB |
+| 0.5s | all 200 | 0 | 666ms | 891ms | 748 KB |
+| 0.25s | all 200 | 0 | 416ms | 1384ms | 251 KB |
+
+**Verdict**: 0.5s safe water mark. Latency variance is dominated by
+**payload size** (Std docs span 4 KB to 4.5 MB), not server backpressure.
+The 19s p90 at the 1.0s tier was one giant doc, not a throttle. Same
+posture as Pro API.

 ## modules.lceda.cn CDN: already at 0.2s

@@ -80,7 +92,7 @@ back-to-back without throttling. No further probing needed.

 ```python
 SLEEP_BETWEEN = 1.0 # was 2.0 (oshwhub detail/listing)
-SLEEP_SOURCE = 5.0 # unchanged (Std source, not yet probed)
+SLEEP_SOURCE = 0.5 # was 5.0 (Std doc endpoint, 10× speedup)
 SLEEP_PRO = 0.5 # was 5.0 (Pro API host, 10× speedup)
 SLEEP_PRO_CDN = 0.2 # unchanged (CDN, already optimized)
 ```
@@ -88,5 +100,6 @@ SLEEP_PRO_CDN = 0.2 # unchanged (CDN, already optimized)
 ## Net impact on batch-50 plan

 - Pro 25 projects × ~5 API calls each: 5×5 = 25s/proj × 25 = ~10min → 0.5×5 = 2.5s/proj × 25 = ~1min
+- Std 25 projects × ~10 doc calls each: 5×10 = 50s/proj × 25 = ~21min → 0.5×10 = 5s/proj × 25 = ~2min
 - Detail page scan 50 projects: 50 × 2s = 100s → 50 × 1s = 50s
-- Combined batch-50 walltime estimate: **~1.5h → ~30 min**
+- Combined batch-50 walltime estimate: **~2h → ~10 min** (excluding actual download bytes)
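
The sleep arithmetic in the bullets above can be checked with a few lines of Python. This is a sketch rather than part of the crawler: `PLAN`, `sleep_minutes`, and `totals` are hypothetical names, and the call counts are the approximate per-project figures quoted in the list. It models sleep time only, so it does not reproduce the combined walltime estimate, which presumably also covers request latency and other work.

```python
# Sleep-only model: projects × calls per project × sleep seconds.
# Each entry: (projects, calls per project, old sleep s, new sleep s),
# copied from the bullet list; the dictionary itself is illustrative.
PLAN = {
    "Pro API": (25, 5, 5.0, 0.5),
    "Std doc": (25, 10, 5.0, 0.5),
    "Detail scan": (50, 1, 2.0, 1.0),
}


def sleep_minutes(projects, calls, sleep_s):
    """Total time spent sleeping, in minutes, for one row of the plan."""
    return projects * calls * sleep_s / 60


def totals(plan):
    """Return (old, new) sleep-only totals in minutes across all rows."""
    old = sum(sleep_minutes(n, c, o) for n, c, o, _ in plan.values())
    new = sum(sleep_minutes(n, c, s) for n, c, _, s in plan.values())
    return old, new


if __name__ == "__main__":
    for name, (n, calls, old_s, new_s) in PLAN.items():
        before = sleep_minutes(n, calls, old_s)
        after = sleep_minutes(n, calls, new_s)
        print(f"{name}: {before:.1f} min -> {after:.1f} min")
    print("sleep-only total: %.1f min -> %.1f min" % totals(PLAN))
```

Under these assumptions the Std row comes to roughly 21 minutes of sleep before the change and roughly 2 minutes after, which agrees with the bullet.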