FacereDataset/.gitignore
Knowit 7cb35020f4 plan: batch-200 expansion (100 Pro + 100 Std)
Doubles down on what worked in batch-50:
  - dev1 (Guangzhou) is primary execution host
  - Owner cap=2 for diversity
  - --max-source-mb 200 to defend against X86-class outliers
  - Pro 2.x deprecated-board fix is already in (commit c3cac97)
  - SSH transport for dev1 -> gitea (commit 8220c99)

Candidate pool:
  200 picks from A-tier (grade>=3 & like>=10) minus already-crawled 65
  Remaining A-tier corpus is 2,741 (Pro 1326 + Std 1415)
  173 unique authors, like median 258, grade dist 4:118 / 3:82
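
The selection rule above (A-tier filter, minus already-crawled projects, owner cap of 2) could be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the field names `id`, `owner`, `grade`, and `like`, the function name `pick_candidates`, and the like-descending ordering are all assumptions made for the example.

```python
from collections import Counter


def pick_candidates(listing, crawled_ids, limit=200, owner_cap=2,
                    min_grade=3, min_like=10):
    """Pick up to `limit` A-tier rows, with at most `owner_cap` per owner.

    `listing` is an iterable of dicts with hypothetical keys
    "id", "owner", "grade", and "like".
    """
    per_owner = Counter()
    picks = []
    # Ranking by likes (highest first) is an assumption; the plan only
    # states the filter thresholds and the per-owner cap.
    for row in sorted(listing, key=lambda r: r["like"], reverse=True):
        if row["grade"] < min_grade or row["like"] < min_like:
            continue  # not A-tier
        if row["id"] in crawled_ids:
            continue  # already crawled in an earlier batch
        if per_owner[row["owner"]] >= owner_cap:
            continue  # owner cap reached, skip for diversity
        per_owner[row["owner"]] += 1
        picks.append(row)
        if len(picks) == limit:
            break
    return picks
```

With a cap of 2, 200 picks need at least 100 distinct owners, which is consistent with the 173 unique authors reported above.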

Estimated walltime ~25-35 min on dev1 for Steps 1-4 (no attachments).
LFS increment ~2.5 GB (source only), or +10 GB if Step 5 attachments are
included. Either way, well within Gitea's 200 GB migration threshold.

Step 5 (attachment download) is deferred: it is not on the critical path
for EPRO2/Std → KiCad work, and can be revisited when the license-filtered
Forge projection demands it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 02:29:53 +08:00

# Derived data (can be rebuilt from raw); not committed
data/processed/*
data/state/*
!data/processed/.gitkeep
!data/state/.gitkeep
# Exception: the full oshwhub listing index snapshot is committed (28 MB jsonl; it can be re-crawled, but we want a pinned version)
!data/state/oshwhub_listing_full.jsonl
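# Exception: the "frozen candidate lists" for the expansion crawl batches; the plan docs treat these as authoritative, and they can be regenerated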
!data/state/oshwhub_batch50_candidates.jsonl
!data/state/oshwhub_batch200_candidates.jsonl
# data/raw is committed (project binaries go through LFS; see .gitattributes)
# Python
__pycache__/
*.py[cod]
*.egg-info/
.pytest_cache/
.ruff_cache/
.mypy_cache/
.venv/
venv/
.env
.env.*
!.env.example
# uv
uv.lock
# Node (if we add JS helpers)
node_modules/
# Editor / OS
.vscode/
.idea/
.DS_Store
Thumbs.db
*.swp
# Claude Code session-local state
.claude/
# Local scratch
/tmp/
/scratch/
*.log
# kicad-cli sch erc default output (when --output not given goes to cwd/<input>.rpt)
*.rpt
# Private keys — never commit
*.pem
*.key