Topic-pull from local listing index (`name OR introduction` contains
飞控, i.e. "flight controller"). 77 Std hits in oshwhub_listing_full.jsonl,
minus 2 already crawled = 75 attempted; 74 OK + 1 hard fail
(`m1_mh743_ada_v4`, listing entry missing `count.like`).
dev1 walltime: Step 1 ~12s, Step 4 ~80s (concurrency=5).
License mix: 65% GPL 3.0, 11% Public Domain, 11% MIT, 4% CC variants.
3 partial dirs (Step 1 KeyError on missing `count.like`) dropped; they
will be re-fetched after a follow-up crawler patch makes count-field
access defensive against listing-index outliers.
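A minimal sketch of what defensive count access could look like. It assumes the listing entry nests the value as entry["count"]["like"]; the helper name get_count is hypothetical and not taken from the crawler.

```python
from collections.abc import Mapping


def get_count(entry, field, default=0):
    """Return entry["count"][field] as an int, or `default` if absent or malformed.

    Assumption: counts live in a nested mapping under the "count" key.
    """
    counts = entry.get("count")
    if not isinstance(counts, Mapping):
        # "count" is missing entirely or is not a mapping (e.g. None).
        return default
    try:
        return int(counts.get(field, default))
    except (TypeError, ValueError):
        # Value present but not numeric (e.g. None or an empty string).
        return default
```

With this shape, an entry like `m1_mh743_ada_v4` would yield the default instead of raising KeyError, so Step 1 could record the project and move on.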
Source backfill 74/74 OK, total +46 MB.
Doubles down on what worked in batch-50:
- dev1 (Guangzhou) is primary execution host
- Owner cap=2 for diversity
- --max-source-mb 200 to defend against X86-class outliers
- Pro 2.x deprecated-board fix is already in (commit c3cac97)
- SSH transport for dev1 -> gitea (commit 8220c99)
Candidate pool:
200 picks from A-tier (grade>=3 & like>=10) minus already-crawled 65
Remaining A-tier corpus is 2,741 (Pro 1326 + Std 1415)
173 unique authors, like median 258, grade dist 4:118 / 3:82
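The selection rule above (A-tier filter, minus already-crawled, owner cap) might be sketched as follows. The field names id, owner, grade and like, and the sort by like count, are assumptions for illustration; the real listing schema and ordering may differ.

```python
from collections import Counter


def pick_candidates(entries, crawled_ids, limit=200, owner_cap=2,
                    min_grade=3, min_like=10):
    """Pick up to `limit` A-tier entries, at most `owner_cap` per owner.

    Assumption: each entry is a dict with "id", "owner", "grade", "like".
    """
    eligible = [
        e for e in entries
        if e["grade"] >= min_grade
        and e["like"] >= min_like
        and e["id"] not in crawled_ids
    ]
    picked = []
    per_owner = Counter()
    # Assumed ordering: most-liked first, so the cap keeps each owner's top projects.
    for entry in sorted(eligible, key=lambda e: e["like"], reverse=True):
        if per_owner[entry["owner"]] >= owner_cap:
            continue
        picked.append(entry)
        per_owner[entry["owner"]] += 1
        if len(picked) == limit:
            break
    return picked
```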
Estimated walltime ~25-35 min on dev1 for Step 1-4 (no attachments).
LFS increment ~2.5 GB (source only) or +10 GB if Step 5 attachments
included. Either way well within Gitea's 200 GB migration threshold.
Step 5 (attachment download) deferred — not on the critical path for
EPRO2/Std → KiCad work, can revisit when license-filtered Forge
projection demands it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pro 2.x legacy. 6 docs (3 sch sheets + 3 PCBs), 0.2 MB plain. The
deprecated 主控板V1 ("Main Control Board V1") sch/pcb pair is correctly
skipped (filtered via ticket.schematics/pcbs keys; see crawler commit c3cac97).
batch-50 success rate is now 50/50 (was 49/50).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run on dev1 (Guangzhou) for the latency advantage. Walltime 3 min 41 s vs
Singapore-estimated 1-2h (up to ~30x speedup, mostly from image.lceda.cn
RTT going from 263ms to 2.6ms).
Pro 25: 24 ok + 1 fail (Super Dial 7f7565ef11 — Pro 2.x legacy
schematic/lists 401, separate cookie-perm issue)
611 docs, 31 MB total
Std 25: 25 ok, 97 docs, 74 MB total
Combined: 49/50 success, 708 docs, 105 MB new disk usage
--max-source-mb 200 cap was not tripped; the 25 Pro candidates are all
under 10 MB, so the 481 MB X86-board outlier from the original sample
was not representative.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three crawler ergonomics improvements for batch operations:
--no-cover
Skip cover image download. For scan-only modes (license/meta scrape)
this drops ~1.3s/project and avoids slow-CDN hangs.
--concurrency N
ThreadPoolExecutor wrapping the per-project loop. Default 1 = serial
(current behavior). Anonymous endpoints tolerate 5+ comfortably; output
uses a print lock for readable interleaved progress. fetch_cover is
plumbed through crawl_one.
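For reference, the concurrency pattern described above might look roughly like this. The crawl_one signature (project id plus a fetch_cover keyword) and the helper names log and crawl_all are assumptions, not the crawler's actual code.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed

_print_lock = threading.Lock()


def log(message):
    """Print one whole line at a time so worker output does not interleave mid-line."""
    with _print_lock:
        print(message, flush=True)


def crawl_all(project_ids, crawl_one, concurrency=1, fetch_cover=True):
    """Run crawl_one over all ids; map each id to its result or its exception."""
    results = {}
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = {
            pool.submit(crawl_one, pid, fetch_cover=fetch_cover): pid
            for pid in project_ids
        }
        for future in as_completed(futures):
            pid = futures[future]
            try:
                results[pid] = future.result()
                log(f"[ok]   {pid}")
            except Exception as exc:  # one failed project must not abort the batch
                results[pid] = exc
                log(f"[fail] {pid}: {exc}")
    return results
```

With concurrency=1 the pool has a single worker, so projects still run one at a time, which matches the serial default.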
Drop cross-host sleep #1: in crawl_one between detail HTML (oshwhub.com)
and cover image (image.lceda.cn). Different hosts — sleep was unnecessary.
Saves ~1s/project. Sleep #2 (post-cover, before next iteration) stays — it
gates the next oshwhub.com hit.
download_to gains a max_seconds wall-clock budget (default 60s; cover uses 15s).
Defends against pathologically slow CDN connections: we observed 10 KB/s
on image.lceda.cn for one project, which would have hung 6+ min on a 3.6 MB
cover. httpx's default timeout resets per chunk, so streaming
downloads need an external wall-clock guard.
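A sketch of the wall-clock guard idea, written against a plain iterable of byte chunks so it runs without the network. In the crawler the chunks would presumably come from an httpx streaming response (e.g. response.iter_bytes()); the names write_with_wall_budget and DownloadTimeout are hypothetical.

```python
import os
import time


class DownloadTimeout(Exception):
    """Raised when a streaming download exceeds its total wall-clock budget."""


def write_with_wall_budget(chunks, dest_path, max_seconds=60.0, clock=time.monotonic):
    """Write chunks to dest_path, aborting once max_seconds of wall time has passed.

    The deadline is checked between chunks, so it bounds the total transfer
    time even when every individual read stays under the per-read timeout.
    """
    deadline = clock() + max_seconds
    written = 0
    try:
        with open(dest_path, "wb") as fh:
            for chunk in chunks:
                if clock() > deadline:
                    raise DownloadTimeout(
                        f"{dest_path}: exceeded {max_seconds}s after {written} bytes"
                    )
                fh.write(chunk)
                written += len(chunk)
    except DownloadTimeout:
        # Remove the partial file so no orphan is left behind.
        os.remove(dest_path)
        raise
    return written
```

Because the check only fires between chunks, the worst case is roughly max_seconds plus one per-read timeout, not exactly max_seconds.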
batch-50 Step 1 (license/meta scrape) shipped:
50/50 candidates have metadata.json + license recorded
License distribution: GPL 3.0 32, Public Domain 6, NC variants 8,
CERN-OHL 1, MIT 1, CC BY 3.0 1
Forge-friendly (non-NC): 41/50 (82%)
Declared attachments: 180 files / 2.36 GB (median 18 MB/proj, max 304 MB)
Walltime: 3min 26s for 28 projects at concurrency=5 (server-side
HTML render bound, not sleep-bound)
One orphan partial cover (a670e60a...) cleaned up — leftover from the
first aborted run before the timeout fix landed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two CLI gates needed before scaling Pro batch beyond top-5:
--skip-ext mp4,qt,mov (attachment filter)
Skips video extensions in attachment download. Phase 1 measurements
showed mp4+qt occupy ~54% of attachment storage. Entry still recorded
in metadata.json with skipped:ext:<token> so we can re-fetch later if
the policy changes. Honors both server-declared `ext` and filename
suffix, case-insensitively.
--max-source-mb N (Pro source size cap)
Trips inside the chain replay loop on encrypted-blob total. On trip:
raise ProjectOversizeError, wipe partial source/, append a row to
data/state/oshwhub_pro_oversize.jsonl. Lets us shortlist 50+ Pro
projects without one X86-board-class outlier (~500 MB) blowing the
LFS budget. Std and Pro 2.x legacy are not capped (both <2 MB in
sample).
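The trip behavior of --max-source-mb could be sketched like this. Only ProjectOversizeError and the wipe-then-record sequence come from the description above; the function name replay_chain, the (name, bytes) blob shape, the row fields, and the 1 MB = 1024*1024 bytes convention are assumptions.

```python
import json
import shutil
from pathlib import Path


class ProjectOversizeError(Exception):
    """Raised when the running encrypted-blob total exceeds the cap."""


def replay_chain(project_id, blobs, source_dir, state_path, max_source_mb=None):
    """Write blobs into source_dir, enforcing an optional cap on total size.

    `blobs` is assumed to be an iterable of (filename, bytes) pairs.
    """
    source_dir = Path(source_dir)
    state_path = Path(state_path)
    source_dir.mkdir(parents=True, exist_ok=True)
    cap_bytes = None if max_source_mb is None else max_source_mb * 1024 * 1024
    total = 0
    try:
        for name, data in blobs:
            total += len(data)
            if cap_bytes is not None and total > cap_bytes:
                raise ProjectOversizeError(
                    f"{project_id}: {total} bytes exceeds {max_source_mb} MB cap"
                )
            (source_dir / name).write_bytes(data)
    except ProjectOversizeError:
        # Wipe the partial source/ tree, then record the trip for later triage.
        shutil.rmtree(source_dir, ignore_errors=True)
        state_path.parent.mkdir(parents=True, exist_ok=True)
        row = {"project_id": project_id, "bytes_seen": total, "cap_mb": max_source_mb}
        with state_path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(row) + "\n")
        raise
    return total
```

Checking the total before writing each blob means cap=0 trips on the first blob and nothing oversized ever lands on disk.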
Verified:
- cap=0 trips on first blob (1.2 MB), source/ wiped, state recorded
- cap=100 runs full ESP-VoCat (7.5 MB plain, 278 docs)
- skip-ext microtest: 8/8 cases (case-insensitive, declared/suffix
fallback, empty-token edge cases)
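The --skip-ext matching rules (declared ext first, filename suffix as fallback, case-insensitive, empty tokens ignored) can be illustrated with a small helper. The function name skip_reason and its arguments are hypothetical; only the skipped:ext:<token> marker format is taken from the description above.

```python
def skip_reason(filename, declared_ext, skip_exts):
    """Return "skipped:ext:<token>" if the attachment should be skipped, else None.

    `skip_exts` is the raw comma-separated flag value, e.g. "mp4,qt,mov".
    """
    tokens = {t.strip().lower().lstrip(".") for t in skip_exts.split(",")}
    tokens.discard("")  # tolerate "mp4,,mov" or an empty flag value

    candidates = []
    if declared_ext:
        candidates.append(declared_ext.strip().lower().lstrip("."))
    if "." in filename:
        candidates.append(filename.rsplit(".", 1)[1].lower())

    for ext in candidates:
        if ext in tokens:
            return f"skipped:ext:{ext}"
    return None
```

Returning the marker string rather than a bare boolean lets the caller store it directly in metadata.json, so a later policy change can find and re-fetch the skipped entries.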
Plan + frozen candidate list for the next 50 projects:
- docs/plans/oshwhub_batch50.md
- data/state/oshwhub_batch50_candidates.jsonl (gitignore exception added)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Pro modern fetch_pro_modern walks a per-history blob loop on
modules.lceda.cn (CDN-flavored host serving AES-encrypted EPRO2 streams).
We were sleeping 5s between every blob — same rate we use for the
rate-sensitive pro.lceda.cn API host. HAR analysis (proexportNew2.har)
shows the editor fires these blobs back-to-back without throttling, so
0.2s is plenty.
Sleep time saved grows linearly with chain length (4.8s per blob):
ESP-VoCat (chain=12): 80s sleep -> 22s sleep (-72%)
220V power (chain=28): 160s sleep -> 26s sleep (-84%)
X86 board (chain~700, projection): ~1h -> ~3min
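The figures above are consistent with a simple model: one blob sleep per chain entry plus the four unchanged 5s API sleeps in the modern fetcher. A small worked check, assuming that decomposition holds (the function name is illustrative):

```python
def sleep_budget_s(chain_len, blob_sleep_s, api_calls=4, api_sleep_s=5.0):
    """Total sleep time for one Pro modern fetch under the assumed model."""
    return chain_len * blob_sleep_s + api_calls * api_sleep_s


for name, chain in [("ESP-VoCat", 12), ("220V power", 28), ("X86 board", 700)]:
    before = sleep_budget_s(chain, 5.0)
    after = sleep_budget_s(chain, 0.2)
    saved_pct = 100 * (before - after) / before
    print(f"{name}: {before:.0f}s -> {after:.1f}s (-{saved_pct:.0f}%)")
```

Under this model the X86 board projection works out to about 3520s (roughly 59 min) before and 160s (under 3 min) after.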
Verified by re-fetching ESP-VoCat + 220V power: byte-identical output
across all per-doc .epro2 files (sha256 match), only fetched_at
timestamp differs in manifest.json. Two manifest files re-stamped as
proof of the validation runs.
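A byte-identity check of this kind could be scripted as below. This is a sketch, not the script actually used; the helper names and the two-directory layout are assumptions. It ignores manifest.json by only hashing *.epro2 files.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large blobs are never read fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def digest_tree(root, pattern="*.epro2"):
    """Map each matching file's relative path to its sha256 hex digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_file(p)
        for p in sorted(root.rglob(pattern))
    }


def mismatches(dir_a, dir_b, pattern="*.epro2"):
    """Return relative paths that are missing on one side or differ in content."""
    a = digest_tree(dir_a, pattern)
    b = digest_tree(dir_b, pattern)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

An empty list from mismatches(old_dir, new_dir) corresponds to the "sha256 match" result reported above.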
API host sleeps (4x 5s in modern fetcher, 7x 5s in legacy fetcher) are
unchanged — those go to pro.lceda.cn /api/ which still wants polite
QPS<=0.2.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit added the dump script + report, but the actual jsonl was caught
by the data/state/* gitignore rule. Add a targeted exception so the snapshot travels
with the repo: anyone who clones can do local filtering without re-hitting the API.
The data is regenerable (scripts/dump_listing_index.py is one-shot, ~1 min), but
pinning a dated snapshot lets us reason about "the state of the corpus on
2026-04-28" reproducibly. Future re-dumps overwrite the same path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>