Both _run_backfill_source and _run_backfill_pro_source now honor
--concurrency N (the default of 1 keeps the current sequential behavior).
The shared helpers _run_backfill_concurrent and _discover_backfill_targets
are factored out; the two paths had drifted but were structurally the same.
Thread safety:
- httpx.Client (the sync client) is thread-safe per the httpx docs, so
sharing one client across threads is correct
- Per-project file writes (metadata.json + source/*) don't conflict
since each thread owns one project dir
- Oversize state file is shared; serialized via a Lock around
_record_oversize
- Print is wrapped in a Lock for readable progress
Expected speedup on dev1 (Guangzhou): batch-200 Pro, 100 projects,
sequential ~14 min -> concurrency 5 ~3-4 min. Std should see a similar
2-3x. The server-side limit is unlikely to bite at this scale (probe
showed Pro QPS=2 sustained clean; concurrency 5 puts the effective rate
around 4-5 req/s).
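The dispatch pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual _run_backfill_concurrent: the names run_concurrent, backfill_one, record_oversize, and oversize_rows are hypothetical stand-ins, and the in-memory list stands in for the shared oversize state file.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed

print_lock = threading.Lock()     # keeps progress lines from interleaving
oversize_lock = threading.Lock()  # serializes writes to the shared state
oversize_rows = []                # stand-in for the shared oversize state file


def record_oversize(project_id, size_mb):
    # Hypothetical stand-in for _record_oversize: one writer at a time.
    with oversize_lock:
        oversize_rows.append({"project": project_id, "size_mb": size_mb})


def backfill_one(project_id):
    # Placeholder per-project work; each call owns its own project dir.
    if project_id.endswith("big"):
        record_oversize(project_id, 500)
        return "oversize"
    return "ok"


def run_concurrent(targets, worker, concurrency=1):
    # concurrency=1 runs the targets one at a time, in submission order.
    results = {}
    with ThreadPoolExecutor(max_workers=max(1, concurrency)) as pool:
        futures = {pool.submit(worker, t): t for t in targets}
        for fut in as_completed(futures):
            target = futures[fut]
            try:
                results[target] = fut.result()
            except Exception as exc:  # one failed project must not stop the batch
                results[target] = f"error: {exc}"
            # Workers that print their own progress would take the same lock.
            with print_lock:
                print(f"[{len(results)}/{len(targets)}] {target}: {results[target]}")
    return results


results = run_concurrent(["p1", "p2", "p3-big"], backfill_one, concurrency=5)
```

Collecting results in the submitting thread keeps the results dict single-writer, so only the state that workers touch needs a lock.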
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pro 2.x project metadata's boards[] can reference sch/pcb UUIDs that
the project owner has since deprecated or deleted (e.g. a board named
"主控板V1(废弃)", "Main control board V1 (deprecated)"). Such UUIDs are
gone from ticket.schematics / ticket.pcbs but still listed in boards[].
Requesting them from schematic/lists or documents/lists returns 401 and
aborts the whole project.
Filter both lists against the authoritative ticket dict before posting.
Verified on 7f7565ef11 (Super Dial 电机旋钮屏): boards[] lists 4 boards
but the schematics dict has only 3 sch entries. The deprecated 8bc59f was
the one returning 401 and is now skipped.
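A minimal sketch of the filtering step, assuming each boards[] entry carries "schematic" and "pcb" keys and that ticket["schematics"] / ticket["pcbs"] are dicts keyed by UUID. Those field names and the function name filter_live_uuids are assumptions for illustration; the real payload shape may differ.

```python
def filter_live_uuids(boards, ticket):
    """Keep only board UUIDs that still exist in the authoritative ticket."""
    live_sch = set(ticket.get("schematics", {}))
    live_pcb = set(ticket.get("pcbs", {}))
    sch, pcb, skipped = [], [], []
    for board in boards:
        for key, live, keep in (("schematic", live_sch, sch), ("pcb", live_pcb, pcb)):
            uuid = board.get(key)
            if not uuid:
                continue
            if uuid in live:
                if uuid not in keep:  # preserve order, drop duplicates
                    keep.append(uuid)
            else:
                skipped.append(uuid)  # deprecated or deleted upstream
    return sch, pcb, skipped


# Illustrative data: four boards, one pointing at a deleted schematic.
ticket = {"schematics": {"s1": {}, "s2": {}, "s3": {}}, "pcbs": {"p1": {}, "p2": {}}}
boards = [
    {"schematic": "s1", "pcb": "p1"},
    {"schematic": "s2", "pcb": "p2"},
    {"schematic": "s3"},
    {"schematic": "s-deprecated"},
]
sch, pcb, skipped = filter_live_uuids(boards, ticket)
```

Returning the skipped UUIDs separately makes it easy to log which references were dropped instead of failing silently.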
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three crawler ergonomics for batch operations:
  --no-cover       Skip cover image download. For scan-only modes
                   (license/meta scrape) this saves ~1.3s/project and
                   avoids slow-CDN hangs.
  --concurrency N  ThreadPoolExecutor wrapping the per-project loop.
                   Default 1 = serial (current behavior). Anonymous
                   endpoints tolerate 5+ comfortably; output uses a print
                   lock so interleaved progress stays readable.
                   fetch_cover is plumbed through crawl_one.
Drop cross-host sleep #1 in crawl_one, between the detail HTML fetch
(oshwhub.com) and the cover image fetch (image.lceda.cn). These are
different hosts, so the sleep was unnecessary. Saves ~1s/project. Sleep #2
(post-cover, before the next iteration) stays because it gates the next
oshwhub.com hit.
download_to gains a max_seconds wall-clock budget (default 60s; cover
downloads use 15s). This defends against pathologically slow CDN
connections: one project's cover was observed at 10 KB/s on image.lceda.cn
and would otherwise have hung for 6+ min on a 3.6 MB file. The httpx
timeout applies to each chunk read rather than to the whole response, so
streaming downloads need an external wall-clock guard.
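The wall-clock guard can be sketched as below. This is not the actual download_to; copy_with_budget and DownloadTimeout are hypothetical names, and the chunk source is any iterable of bytes so the idea can be shown without a network call.

```python
import io
import time


class DownloadTimeout(Exception):
    """Raised when a streaming download exceeds its wall-clock budget."""


def copy_with_budget(chunks, out, max_seconds=60.0, clock=time.monotonic):
    """Write chunks to out, aborting once the total elapsed time is spent.

    The per-read timeout still bounds how long a single chunk may take;
    this deadline bounds the download as a whole.
    """
    deadline = clock() + max_seconds
    written = 0
    for chunk in chunks:
        if clock() > deadline:
            raise DownloadTimeout(f"exceeded {max_seconds}s after {written} bytes")
        out.write(chunk)
        written += len(chunk)
    return written


buf = io.BytesIO()
total = copy_with_budget([b"ab", b"cd"], buf, max_seconds=15)
```

With httpx, the chunk iterable would typically be response.iter_bytes() inside a client.stream("GET", url) context. A caller should also delete the partial file when DownloadTimeout is raised, so aborted runs do not leave orphan covers behind.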
batch-50 Step 1 (license/meta scrape) shipped:
- 50/50 candidates have metadata.json + license recorded
- License distribution: GPL 3.0 32, Public Domain 6, NC variants 8,
  CERN-OHL 1, MIT 1, CC BY 3.0 1
- Forge-friendly (non-NC): 41/50 (82%)
- Declared attachments: 180 files / 2.36 GB (median 18 MB/proj, max 304 MB)
- Walltime: 3min 26s for 28 projects at concurrency=5 (server-side
  HTML render bound, not sleep-bound)
One orphan partial cover (a670e60a...) cleaned up — leftover from the
first aborted run before the timeout fix landed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ladder probe lceda.cn/api/documents/<uuid>: 5 tiers (5/2/1/0.5/0.25s)
× 9 distinct Std doc UUIDs = 45 reqs total, all 200/success. Latency
variance is dominated by payload size (Std docs span 4 KB to 4.5 MB)
not server backpressure. Same posture as Pro API.
Net effect on batch-50 estimate: Std 25 projects × 10 doc calls saves
~19 min of wall time (21 min sleep -> 2 min sleep). The combined plan now
projects ~2h -> ~10min walltime, exclusive of download bytes.
scripts/probe_rate_limit.py: --host std-doc tier added. Reads doc UUIDs
from /tmp/std_doc_uuids.json (assembled by caller from any source/manifest.json
upstream_version_documents lists). Reusable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Calibrated against ladder probes on 2026-04-29. Findings in
docs/sources/probe_rate_limit_results.md.
  SLEEP_PRO      5.0 -> 0.5     (pro.lceda.cn API)
  SLEEP_BETWEEN  2.0 -> 1.0     (oshwhub detail/listing)
  SLEEP_SOURCE   5.0 unchanged  (lceda.cn Std endpoints, not yet probed)
  SLEEP_PRO_CDN  0.2 unchanged  (modules.lceda.cn, already optimized)
The original 5s rate for Pro API was set out of caution because Pro
requires a logged-in cookie. Empirical sustained-burst probe (25
distinct UUIDs at 0.5s sleep, no recovery): 0/25 errors, median
latency 410ms, p90 932ms. The "Pro is rate-sensitive" assumption was
wrong — server tolerates QPS=2 cleanly.
oshwhub detail HTML pages slowed from p90 6.4s at 1.0s sleep to
p90 15s at 0.5s, which suggests the server queue backs up. 1.0s is the
lowest setting that still leaves headroom.
Net effect on batch-50 estimate: ~1.5h -> ~30min.
scripts/probe_rate_limit.py: rate-limit ladder probe tool. Reusable
for new endpoints (Std source still owes a probe). Designed for safety:
30s tier recovery, low rep counts on auth hosts, bail on first non-200.
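The ladder shape described above (descending sleep tiers, a recovery pause between tiers, bail on the first non-200) can be sketched as follows. This is not the contents of scripts/probe_rate_limit.py; ladder_probe and its parameters are hypothetical, and fetch is injected so the sketch runs without touching a real host.

```python
import time


def ladder_probe(fetch, targets, tiers=(5, 2, 1, 0.5, 0.25), recovery=30.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Probe each tier in turn; stop at the first non-200 status.

    fetch(target) must return an HTTP status code.
    """
    results = []
    for index, tier in enumerate(tiers):
        latencies = []
        for target in targets:
            start = clock()
            status = fetch(target)
            latencies.append(clock() - start)
            if status != 200:
                # Bail immediately rather than keep pressing a struggling host.
                results.append({"tier": tier, "ok": False, "status": status,
                                "latencies": latencies})
                return results
            sleep(tier)
        results.append({"tier": tier, "ok": True, "latencies": latencies})
        if index < len(tiers) - 1:
            sleep(recovery)  # let the server recover before the faster tier
    return results


# Dry run with a fake fetch and a recording sleep, so nothing actually waits.
naps = []
report = ladder_probe(lambda t: 200, ["u1", "u2"], tiers=(1, 0.5),
                      sleep=naps.append)
```

Using distinct targets per request, as the probes above did, avoids measuring a cache hit instead of the rate limit.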
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two CLI gates needed before scaling Pro batch beyond top-5:
  --skip-ext mp4,qt,mov (attachment filter)
    Skips video extensions in attachment download. Phase 1 measurements
    showed mp4+qt occupy ~54% of attachment storage. The entry is still
    recorded in metadata.json with skipped:ext:<token> so we can re-fetch
    later if the policy changes. Honors both the server-declared `ext` and
    the filename suffix, case-insensitively.
  --max-source-mb N (Pro source size cap)
    Trips inside the chain replay loop on the encrypted-blob total. On
    trip: raise ProjectOversizeError, wipe the partial source/, append a
    row to data/state/oshwhub_pro_oversize.jsonl. Lets us shortlist 50+
    Pro projects without one X86-board-class outlier (~500 MB) blowing the
    LFS budget. Std and Pro 2.x legacy are not capped (both <2 MB in
    sample).
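The --skip-ext matching rule described above might look roughly like this. The function names parse_skip_ext and skip_reason are hypothetical, and the sketch assumes that a match on either the declared ext or the filename suffix is enough to skip; the real precedence between the two may differ.

```python
import os


def parse_skip_ext(raw):
    """Turn 'mp4,qt,mov' into a normalized token set, ignoring empty tokens."""
    tokens = set()
    for tok in raw.split(","):
        tok = tok.strip().lower().lstrip(".")
        if tok:
            tokens.add(tok)
    return tokens


def skip_reason(declared_ext, filename, skip_tokens):
    """Return 'skipped:ext:<token>' if the attachment should be skipped."""
    suffix = os.path.splitext(filename or "")[1]
    for candidate in (declared_ext or "", suffix):
        ext = candidate.strip().lower().lstrip(".")
        if ext and ext in skip_tokens:
            return f"skipped:ext:{ext}"
    return None


tokens = parse_skip_ext("mp4,qt,mov")
reason = skip_reason("MP4", "clip.bin", tokens)
```

Returning the reason string rather than a boolean lets the caller store it directly in the metadata.json entry, which is what makes a later re-fetch possible.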
Verified:
- cap=0 trips on first blob (1.2 MB), source/ wiped, state recorded
- cap=100 runs full ESP-VoCat (7.5 MB plain, 278 docs)
- skip-ext microtest: 8/8 cases (case-insensitive, declared/suffix
fallback, empty-token edge cases)
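The --max-source-mb trip path (raise, wipe, record) can be sketched as below. Only ProjectOversizeError is a name from this change; replay_chain, the blob file naming, and the JSONL row fields are assumptions for illustration.

```python
import json
import shutil
import tempfile
from pathlib import Path


class ProjectOversizeError(Exception):
    """Raised when the running blob total exceeds the configured cap."""


def replay_chain(blobs, source_dir, state_file, project_id, max_source_mb=None):
    source_dir = Path(source_dir)
    source_dir.mkdir(parents=True, exist_ok=True)
    cap = None if max_source_mb is None else max_source_mb * 1024 * 1024
    total = 0
    try:
        for index, blob in enumerate(blobs):
            total += len(blob)
            if cap is not None and total > cap:
                raise ProjectOversizeError(f"{project_id}: {total} bytes > {cap}")
            (source_dir / f"{index:04d}.bin").write_bytes(blob)
    except ProjectOversizeError:
        # Wipe the partial source/ and leave one row of evidence behind.
        shutil.rmtree(source_dir, ignore_errors=True)
        row = {"project": project_id, "bytes_seen": total, "cap_mb": max_source_mb}
        with open(state_file, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(row) + "\n")
        raise
    return total


# cap=0 trips on the first blob, mirroring the first verification case.
workdir = Path(tempfile.mkdtemp())
state = workdir / "oversize.jsonl"
try:
    replay_chain([b"x" * 1024], workdir / "source", state, "demo", max_source_mb=0)
except ProjectOversizeError:
    pass
```

Checking the running total before writing each blob means a tripped project never leaves more than the already-written blobs to clean up.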
Plan + frozen candidate list for the next 50 projects:
- docs/plans/oshwhub_batch50.md
- data/state/oshwhub_batch50_candidates.jsonl (gitignore exception added)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Pro modern fetch_pro_modern walks a per-history blob loop on
modules.lceda.cn (CDN-flavored host serving AES-encrypted EPRO2 streams).
We were sleeping 5s between every blob — same rate we use for the
rate-sensitive pro.lceda.cn API host. HAR analysis (proexportNew2.har)
shows the editor fires these blobs back-to-back without throttling, so
0.2s is plenty.
Walltime drops linearly with chain length:
ESP-VoCat (chain=12): 80s sleep -> 22s sleep (-72%)
220V power (chain=28): 160s sleep -> 26s sleep (-84%)
X86 board (chain~700, projection): ~1h -> ~3min
Verified by re-fetching ESP-VoCat + 220V power: byte-identical output
across all per-doc .epro2 files (sha256 match), only fetched_at
timestamp differs in manifest.json. Two manifest files re-stamped as
proof of the validation runs.
API host sleeps (4x 5s in modern fetcher, 7x 5s in legacy fetcher) are
unchanged — those go to pro.lceda.cn /api/ which still wants polite
QPS<=0.2.
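The sleep figures above appear consistent with a simple model: one blob sleep per chain entry plus the four unchanged 5s API sleeps in the modern fetcher. The sketch below checks that arithmetic; the model itself is an inference from the numbers in this message, not something read out of the code, and sleep_budget is a hypothetical helper.

```python
def sleep_budget(chain_len, blob_sleep, api_calls=4, api_sleep=5.0):
    """Total seconds spent sleeping for one project under the assumed model."""
    return chain_len * blob_sleep + api_calls * api_sleep


for name, chain in [("ESP-VoCat", 12), ("220V power", 28), ("X86 board", 700)]:
    before = sleep_budget(chain, 5.0)
    after = sleep_budget(chain, 0.2)
    saved = 100 * (1 - after / before)
    print(f"{name}: {before:.0f}s -> {after:.1f}s ({saved:.0f}% less sleep)")
```

Under this model the fixed API sleeps set a floor of 20s per project, which is why short chains save proportionally less than long ones.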
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>