FacereDataset/.gitignore
Knowit eee1a9b97e crawler: --skip-ext + --max-source-mb gates for batch-50 expansion
Two CLI gates are needed before scaling the Pro batch beyond the top 5:

--skip-ext mp4,qt,mov  (attachment filter)
  Skips video extensions in attachment download. Phase 1 measurements
  showed mp4+qt occupy ~54% of attachment storage. Entry still recorded
  in metadata.json with skipped:ext:<token> so we can re-fetch later if
  the policy changes. Honors both server-declared `ext` and filename
  suffix, case-insensitively.
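
A minimal sketch of how the --skip-ext check described above might look. The helper names `parse_skip_ext` and `skip_reason` are hypothetical and are not taken from the crawler; only the `skipped:ext:<token>` marker and the declared-ext / filename-suffix behaviour come from this message.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def _normalize(token: str) -> str:
    """Lowercase a token and drop surrounding whitespace and a leading dot."""
    return token.strip().lstrip(".").lower()


def parse_skip_ext(raw: str) -> frozenset[str]:
    """Turn 'mp4,qt,mov' into a normalized token set; empty tokens are dropped."""
    return frozenset(t for t in (_normalize(part) for part in raw.split(",")) if t)


def skip_reason(declared_ext: str | None, filename: str, skip: frozenset[str]) -> str | None:
    """Return 'skipped:ext:<token>' if the attachment should be skipped, else None.

    Assumption: the server-declared ext is checked first and the filename
    suffix is used as a fallback; both are compared case-insensitively.
    """
    candidates = []
    if declared_ext and _normalize(declared_ext):
        candidates.append(_normalize(declared_ext))
    suffix = PurePosixPath(filename).suffix  # '' when the name has no suffix
    if suffix:
        candidates.append(_normalize(suffix))
    for token in candidates:
        if token in skip:
            return f"skipped:ext:{token}"
    return None
```

The returned string is what would be stored on the metadata.json entry, so a later re-fetch can tell which token caused the skip.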

--max-source-mb N  (Pro source size cap)
  Trips inside the chain replay loop on encrypted-blob total. On trip:
  raise ProjectOversizeError, wipe partial source/, append a row to
  data/state/oshwhub_pro_oversize.jsonl. Lets us shortlist 50+ Pro
  projects without one X86-board-class outlier (~500 MB) blowing the
  LFS budget. Std and Pro 2.x legacy are not capped (both <2 MB in
  sample).
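
A rough sketch of the cap logic described above, under stated assumptions: `replay_with_cap`, its parameters, the `blob_NNNN.bin` file naming, and the JSONL row fields are all hypothetical, and 1 MB is taken as 1024 * 1024 bytes. Only `ProjectOversizeError`, the wipe of source/, and the append to an oversize JSONL file come from this message.

```python
from __future__ import annotations

import json
import shutil
from pathlib import Path
from typing import Iterable


class ProjectOversizeError(Exception):
    """Raised when the running blob total exceeds the configured cap."""


def replay_with_cap(
    project_id: str,
    blobs: Iterable[bytes],
    source_dir: Path,
    state_file: Path,
    max_source_mb: float | None,
) -> int:
    """Write blobs into source_dir, tripping once the running total exceeds the cap.

    Returns the total number of bytes written when the cap is not hit.
    """
    cap_bytes = None if max_source_mb is None else int(max_source_mb * 1024 * 1024)
    total = 0
    source_dir.mkdir(parents=True, exist_ok=True)
    for index, blob in enumerate(blobs):
        total += len(blob)
        if cap_bytes is not None and total > cap_bytes:
            # Wipe the partial source/ so no half-fetched project is left behind.
            shutil.rmtree(source_dir, ignore_errors=True)
            # Record the trip as one JSON line (row fields are illustrative).
            state_file.parent.mkdir(parents=True, exist_ok=True)
            row = {"project_id": project_id, "bytes_seen": total, "cap_mb": max_source_mb}
            with state_file.open("a", encoding="utf-8") as fh:
                fh.write(json.dumps(row) + "\n")
            raise ProjectOversizeError(f"{project_id}: {total} bytes > {max_source_mb} MB cap")
        (source_dir / f"blob_{index:04d}.bin").write_bytes(blob)
    return total
```

With this shape, a cap of 0 trips on the first non-empty blob, which matches the cap=0 check listed under Verified.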

Verified:
  - cap=0 trips on first blob (1.2 MB), source/ wiped, state recorded
  - cap=100 runs full ESP-VoCat (7.5 MB plain, 278 docs)
  - skip-ext microtest: 8/8 cases (case-insensitive, declared/suffix
    fallback, empty-token edge cases)

Plan + frozen candidate list for the next 50 projects:
  - docs/plans/oshwhub_batch50.md
  - data/state/oshwhub_batch50_candidates.jsonl (gitignore exception added)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 00:24:55 +08:00

52 lines
938 B
Plaintext

# Derived data (can be rebuilt from raw); not committed
data/processed/*
data/state/*
!data/processed/.gitkeep
!data/state/.gitkeep
# Exception: the full oshwhub listing index snapshot is committed (28 MB jsonl; re-crawlable, but we want a pinned version)
!data/state/oshwhub_listing_full.jsonl
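# Exception: the "frozen candidate list" for the expansion crawl batch; the plan document treats this file as authoritative, and it can be regenerated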
!data/state/oshwhub_batch50_candidates.jsonl
# data/raw is committed (project binaries go through LFS; see .gitattributes)
# Python
__pycache__/
*.py[cod]
*.egg-info/
.pytest_cache/
.ruff_cache/
.mypy_cache/
.venv/
venv/
.env
.env.*
!.env.example
# uv
uv.lock
# Node (if we add JS helpers)
node_modules/
# Editor / OS
.vscode/
.idea/
.DS_Store
Thumbs.db
*.swp
# Claude Code session-local state
.claude/
# Local scratch
/tmp/
/scratch/
*.log
# kicad-cli sch erc default output (when --output not given goes to cwd/<input>.rpt)
*.rpt
# Private keys — never commit
*.pem
*.key