plan: batch-200 expansion (100 Pro + 100 Std)

Doubles down on what worked in batch-50:
  - dev1 (Guangzhou) is primary execution host
  - Owner cap=2 for diversity
  - --max-source-mb 200 to defend against X86-class outliers
  - Pro 2.x deprecated-board fix is already in (commit c3cac97)
  - SSH transport for dev1 -> gitea (commit 8220c99)

Candidate pool:
  200 picks from A-tier (grade>=3 & like>=10), minus the 65 already crawled
  Remaining A-tier corpus is 2,741 (Pro 1,326 + Std 1,415)
  173 unique authors, like median 258, grade dist 4:118 / 3:82
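
The selection rule above (A-tier filter, skip already-crawled projects, owner cap of 2) can be sketched as follows. This is a minimal illustration, not the repository's actual script: the field names `id`, `owner`, `grade`, and `like`, the function name `pick_candidates`, and the ranking order (grade first, then likes) are all assumptions, and the 100 Pro / 100 Std split is not modelled.

```python
from collections import defaultdict


def pick_candidates(listing, crawled_ids, n=200, owner_cap=2,
                    min_grade=3, min_like=10):
    """Pick up to n A-tier projects with at most owner_cap per owner.

    `listing` is an iterable of dicts with hypothetical keys
    "id", "owner", "grade", and "like".
    """
    per_owner = defaultdict(int)
    picks = []
    # Assumed ranking: higher grade first, then more likes.
    ranked = sorted(listing, key=lambda p: (p["grade"], p["like"]),
                    reverse=True)
    for project in ranked:
        # A-tier filter: grade>=3 and like>=10.
        if project["grade"] < min_grade or project["like"] < min_like:
            continue
        # Skip anything a previous batch already crawled.
        if project["id"] in crawled_ids:
            continue
        # Owner cap keeps a single prolific author from dominating.
        if per_owner[project["owner"]] >= owner_cap:
            continue
        per_owner[project["owner"]] += 1
        picks.append(project)
        if len(picks) == n:
            break
    return picks
```

Under these assumptions, running the function once per edition with `n=100` would give the 100 + 100 split.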

Estimated wall time: ~25-35 min on dev1 for Steps 1-4 (no attachments).
LFS increment: ~2.5 GB (source only), or +10 GB if Step 5 attachments
are included. Either way, well within Gitea's 200 GB migration threshold.

Step 5 (attachment download) is deferred: it is not on the critical path
for EPRO2/Std → KiCad work and can be revisited when the license-filtered
Forge projection demands it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 02:29:53 +08:00
parent 7f3729b89c
commit 7cb35020f4
3 changed files with 386 additions and 0 deletions

.gitignore vendored

@@ -7,6 +7,7 @@ data/state/*
!data/state/oshwhub_listing_full.jsonl
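# Exception: the "frozen candidate list" for each expansion-crawl batch; the plan doc treats this file as authoritative, and it can be regenerated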
!data/state/oshwhub_batch50_candidates.jsonl
!data/state/oshwhub_batch200_candidates.jsonl
# data/raw is committed (project binaries go through LFS; see .gitattributes)