Two CLI gates needed before scaling Pro batch beyond top-5:
--skip-ext mp4,qt,mov (attachment filter)
Skips video extensions in attachment download. Phase 1 measurements
showed mp4+qt occupy ~54% of attachment storage. Entry still recorded
in metadata.json with skipped:ext:<token> so we can re-fetch later if
the policy changes. Honors both server-declared `ext` and filename
suffix, case-insensitively.
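The matching rule described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the helper names `parse_skip_ext` and `skip_reason` are hypothetical, and checking the declared `ext` first and then the filename suffix is an assumption about the fallback order.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def parse_skip_ext(raw: str) -> frozenset[str]:
    """Parse a comma-separated --skip-ext value into normalized tokens."""
    tokens = (t.strip().lstrip(".").lower() for t in raw.split(","))
    # Empty tokens (e.g. from "mp4,,mov" or a trailing comma) are dropped.
    return frozenset(t for t in tokens if t)


def skip_reason(declared_ext: str | None, filename: str,
                skip: frozenset[str]) -> str | None:
    """Return 'skipped:ext:<token>' if the attachment matches, else None.

    Assumption: the server-declared ext is checked first, then the
    filename suffix; both are compared case-insensitively.
    """
    declared = (declared_ext or "").strip().lstrip(".").lower()
    suffix = PurePosixPath(filename).suffix.lstrip(".").lower()
    for token in (declared, suffix):
        if token and token in skip:
            return f"skipped:ext:{token}"
    return None
```

The returned string is what would be stored for the entry in metadata.json, so a later pass could find skipped attachments by that prefix and re-fetch them.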
--max-source-mb N (Pro source size cap)
Trips inside the chain replay loop when the cumulative encrypted-blob size exceeds N MB. On trip:
raise ProjectOversizeError, wipe partial source/, append a row to
data/state/oshwhub_pro_oversize.jsonl. Lets us shortlist 50+ Pro
projects without one X86-board-class outlier (~500 MB) blowing the
LFS budget. Std and Pro 2.x legacy are not capped (both <2 MB in
sample).
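A rough sketch of the cap behavior described above, under stated assumptions: `replay_with_cap`, the per-blob file naming, and the JSONL row fields are hypothetical, and 1 MB is taken as 1024 * 1024 bytes. Only `ProjectOversizeError`, the wipe of source/, and the appended state row come from the description.

```python
from __future__ import annotations

import json
import shutil
from pathlib import Path
from typing import Iterable


class ProjectOversizeError(Exception):
    """Raised when the cumulative encrypted-blob size exceeds the cap."""


def replay_with_cap(project_id: str, blobs: Iterable[bytes], source_dir: Path,
                    state_file: Path, max_source_mb: float | None) -> int:
    """Write blobs into source_dir, aborting once the running total passes the cap.

    Returns the total number of bytes written. max_source_mb=None means no cap.
    """
    cap_bytes = None if max_source_mb is None else int(max_source_mb * 1024 * 1024)
    total = 0
    source_dir.mkdir(parents=True, exist_ok=True)
    try:
        for index, blob in enumerate(blobs):
            total += len(blob)
            if cap_bytes is not None and total > cap_bytes:
                raise ProjectOversizeError(
                    f"{project_id}: {total} bytes exceeds cap of {cap_bytes}"
                )
            # Hypothetical naming scheme for the replayed blobs.
            (source_dir / f"{index:05d}.blob").write_bytes(blob)
    except ProjectOversizeError:
        # Wipe the partial source/ tree, then record the trip before re-raising.
        shutil.rmtree(source_dir, ignore_errors=True)
        state_file.parent.mkdir(parents=True, exist_ok=True)
        with state_file.open("a", encoding="utf-8") as fh:
            row = {"project_id": project_id, "total_bytes": total,
                   "cap_mb": max_source_mb}
            fh.write(json.dumps(row) + "\n")
        raise
    return total
```

With a cap of 0 the first non-empty blob already pushes the total past the limit, which matches the "cap=0 trips on first blob" check listed under Verified.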
Verified:
- cap=0 trips on first blob (1.2 MB), source/ wiped, state recorded
- cap=100 runs full ESP-VoCat (7.5 MB plain, 278 docs)
- skip-ext microtest: 8/8 cases (case-insensitive, declared/suffix
fallback, empty-token edge cases)
Plan + frozen candidate list for the next 50 projects:
- docs/plans/oshwhub_batch50.md
- data/state/oshwhub_batch50_candidates.jsonl (gitignore exception added)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
52 lines
938 B
Plaintext
# Derivative data (rebuildable from raw), not committed
data/processed/*
data/state/*
!data/processed/.gitkeep
!data/state/.gitkeep
# Exception: commit the full oshwhub listing index snapshot (28 MB jsonl; can be re-scraped, but we want a pinned version)
!data/state/oshwhub_listing_full.jsonl
# Exception: the "frozen candidate list" for the expanded-scrape batch; the plan doc treats this file as authoritative, and it can be regenerated
!data/state/oshwhub_batch50_candidates.jsonl

# data/raw is committed (project binaries go through LFS, see .gitattributes)

# Python
__pycache__/
*.py[cod]
*.egg-info/
.pytest_cache/
.ruff_cache/
.mypy_cache/
.venv/
venv/
.env
.env.*
!.env.example

# uv
uv.lock

# Node (if we add JS helpers)
node_modules/

# Editor / OS
.vscode/
.idea/
.DS_Store
Thumbs.db
*.swp

# Claude Code session-local state
.claude/

# Local scratch
/tmp/
/scratch/
*.log
# kicad-cli sch erc default output (when --output is not given, it goes to cwd/<input>.rpt)
*.rpt

# Private keys: never commit
*.pem
*.key