Doubles down on what worked in batch-50:
- dev1 (Guangzhou) is primary execution host
- Owner cap=2 for diversity
- --max-source-mb 200 to defend against X86-class outliers
- Pro 2.x deprecated-board fix is already in (commit c3cac97)
- SSH transport for dev1 -> gitea (commit 8220c99)
Candidate pool:
200 picks from A-tier (grade>=3 & like>=10) minus already-crawled 65
Remaining A-tier corpus is 2,741 (Pro 1326 + Std 1415)
173 unique authors, like median 258, grade dist 4:118 / 3:82
Estimated walltime ~25-35 min on dev1 for Step 1-4 (no attachments).
LFS increment ~2.5 GB (source only) or +10 GB if Step 5 attachments
included. Either way well within Gitea's 200 GB migration threshold.
Step 5 (attachment download) deferred — not on the critical path for
EPRO2/Std → KiCad work, can revisit when license-filtered Forge
projection demands it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two CLI gates needed before scaling Pro batch beyond top-5:
--skip-ext mp4,qt,mov (attachment filter)
Skips video extensions in attachment download. Phase 1 measurements
showed mp4+qt occupy ~54% of attachment storage. Entry still recorded
in metadata.json with skipped:ext:<token> so we can re-fetch later if
the policy changes. Honors both server-declared `ext` and filename
suffix, case-insensitively.
--max-source-mb N (Pro source size cap)
Trips inside the chain replay loop on encrypted-blob total. On trip:
raise ProjectOversizeError, wipe partial source/, append a row to
data/state/oshwhub_pro_oversize.jsonl. Lets us shortlist 50+ Pro
projects without one X86-board-class outlier (~500 MB) blowing the
LFS budget. Std and Pro 2.x legacy are not capped (both <2 MB in
sample).
Verified:
- cap=0 trips on first blob (1.2 MB), source/ wiped, state recorded
- cap=100 runs full ESP-VoCat (7.5 MB plain, 278 docs)
- skip-ext microtest: 8/8 cases (case-insensitive, declared/suffix
fallback, empty-token edge cases)
Plan + frozen candidate list for the next 50 projects:
- docs/plans/oshwhub_batch50.md
- data/state/oshwhub_batch50_candidates.jsonl (gitignore exception added)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>