Add corpus size/license estimator; snapshot 90-project findings

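Why:
- The scale-up decision needs firmer data than "52 MB/project × 12,493 ≈ 650 GB". scripts/estimate_size.py samples attachments[].size for 90 hot projects and yields the actual distribution (median 9 MB, p90 54 MB). Extrapolated to the full corpus, that is about 110 GB at the median and about 660 GB as the p90 upper bound, which gives Charles a credible storage budget.
- The license and extension distributions collected along the way surface two important findings:
  (1) mp4 + qt video accounts for 54% of storage, so a --skip-ext switch could save about half;
  (2) NC (Non-Commercial) licenses make up ~11%, so downstream consumers must filter by whitelist.

What:
- scripts/estimate_size.py: download-free metadata sampler that reuses crawler.parse_detail_html
- docs/sources/oshwhub_corpus_estimate.md: results snapshot plus decision recommendations
- log.md: record of this session

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>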
2026-04-23 19:45:54 +08:00
parent c8d55a22eb
commit e222b08f27
3 changed files with 213 additions and 1 deletions
--- a/docs/sources/oshwhub_corpus_estimate.md
+++ b/docs/sources/oshwhub_corpus_estimate.md
@@ -0,0 +1,67 @@
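+# oshwhub full-corpus size and characteristics estimate
+
+**Sampling method**: `scripts/estimate_size.py` fetches **90 projects** (3 pages × 30) from `/api/project?sort=hot` and parses `attachments[]` from each project's detail-page HTML, without downloading any attachments.
+**Sampling date**: 2026-04-23
+**To rerun**: `uv run python scripts/estimate_size.py --pages 3 --page-size 30 --sort hot`
+
+## Per-project distribution
+
+| Metric | Attachments | Size |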
+|-----|-------|-----|
+| mean | 3.1 | 22.2 MB |
+| median | 2 | 9.0 MB |
+| p90 | — | 54.2 MB |
+| max | 15 | 204.5 MB |
+
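+The 90 samples total 2001 MB. Note that `sort=hot` biases the sample toward **popular projects that have files**; the long tail is likely smaller.
+
+## Full-corpus extrapolation (12,493 projects)
+
+| Basis | Estimate |
+|-----|-----|
+| mean × total | **271 GB** |
+| median × total | **110 GB** ← reasonable planning figure |
+| p90 × total | **662 GB** ← upper bound |
+
+Recommendation: budget **150 GB** (median plus a buffer, discounting the hot-sort bias) and treat **300 GB** as the capacity ceiling, leaving headroom for failures.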
+
+## File-type distribution (by bytes)
+
+| Extension | Sample total | Share |
+|------|---------|------|
+| .mp4 | 1029 MB | **51%** |
+| .zip | 676 MB | 34% |
+| .rar | 72 MB | 4% |
+| .qt | 66 MB | 3% |
+| .pdf | 32 MB | 2% |
+| .bin | 27 MB | 1% |
+| .jpeg | 26 MB | 1% |
+| .7z | 10 MB | <1% |
+
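+> **Key insight**: video (mp4 + qt ≈ 54%) takes up more than half of the storage. If the training data is mainly PCB / schematic / BOM files, adding `--skip-ext mp4,qt` to the crawler to filter out video would cut storage roughly in half.
+
+## License distribution (90 samples)
+
+| License | Count | Share |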
+|---------|-----|-----|
+| GPL 3.0 | 44 | **49%** |
+| Public Domain | 19 | 21% |
+| CC BY-NC-SA 4.0 | 5 | 6% |
+| CERN Open Hardware License | 4 | 4% |
+| CC BY-NC-SA 3.0 | 3 | 3% |
+| CC BY-SA 4.0 | 2 | 2% |
+| TAPR Open Hardware License | 2 | 2% |
+| CC-BY-NC-SA 3.0 | 2 | 2% |
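+| Other CC | 2 | 2% |
+
+**All licenses are open-source or public-domain**; the sample contains no closed-source projects. Caveats:
+- 49% GPL 3.0: using these projects for **model training** is not a direct violation (the academic view that model weights are not a derivative work is contested; to be conservative, training outputs should not simply be commercialized and redistributed)
+- **NC (Non-Commercial)** is about 11%: **filter these out** for commercial use
+- The sample skews toward large projects; the share of `license: "unknown"` may be higher in the full corpus, so downstream consumers need to filter by whitelist
+
+## Recommendations for Charles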
+
+1. **Scale-up budget**: 150 GB storage + 15% buffer ≈ 173 GB, rounded up to **180 GB of LFS space**
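+2. **Filter video**: before Phase 1.4, add a `--skip-ext mp4,qt,mov,avi` switch to the crawler, roughly halving storage needs
+3. **License whitelist**: downstream derived datasets filter by `license in {Public Domain, CC0, CC BY, CC BY-SA, MIT, Apache-2.0, BSD*, CERN-OHL*}`, dropping NC / unknown
+4. **Phased crawl**: advance page by page under `sort=hot`, checkpointing every 500 projects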
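
Recommendation 3 in the estimate document could be sketched as a small filter. This is only an illustration and not part of the commit: `normalize_license` and `is_whitelisted` are hypothetical names, and the matching rules are assumptions about how the whitelist would be applied to the license strings seen in the sample (case-insensitive, `CC-BY` treated like `CC BY`, trailing version numbers ignored, `BSD*` and `CERN-OHL*` read as prefix matches, and the spelled-out "CERN Open Hardware License" accepted as CERN-OHL).

```python
import re

# Whitelist entries copied from recommendation 3; how they are matched is an assumption.
ALLOWED_EXACT = {"public domain", "cc0", "cc by", "cc by-sa", "mit", "apache-2.0"}
ALLOWED_PREFIXES = ("bsd", "cern-ohl", "cern open hardware")


def normalize_license(lic):
    """Lowercase, unify 'CC-BY...' with 'CC BY...', and drop a trailing version number."""
    s = (lic or "unknown").strip().lower()
    s = re.sub(r"^cc[-\s]+", "cc ", s)       # 'cc-by-nc-sa 3.0' -> 'cc by-nc-sa 3.0'
    s = re.sub(r"\s+\d+(\.\d+)*$", "", s)    # 'cc by-sa 4.0' -> 'cc by-sa'
    return s


def is_whitelisted(lic):
    """True only if the license matches the whitelist and carries no NC clause."""
    s = normalize_license(lic)
    if re.search(r"\bnc\b", s):              # any Non-Commercial variant is rejected
        return False
    return s in ALLOWED_EXACT or s.startswith(ALLOWED_PREFIXES)


# License strings as they appear in the sample table, plus a missing value.
for lic in ["Public Domain", "CC BY-SA 4.0", "CC-BY-NC-SA 3.0", "GPL 3.0", None]:
    print(f"{str(lic):20} -> {is_whitelisted(lic)}")
```

The sample contains both `CC BY-NC-SA 3.0` and `CC-BY-NC-SA 3.0`, which is why the sketch normalizes spelling before matching. Under the whitelist as written, GPL 3.0 (49% of the sample) is also filtered out; whether that is intended is a policy decision for the downstream dataset.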
--- a/log.md
+++ b/log.md
@@ -83,11 +83,33 @@ jsonschema performs two layers of validation:

 ### Still needs a decision from Charles

-- Scale-up size (estimate: 52 MB/project × 12,493 ≈ 650 GB for the full corpus; Gitea LFS capacity needs to be assessed)
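+- Scale-up size: measured data is now available: **median ≈ 110 GB, p90 upper bound ≈ 660 GB, recommended budget 150–180 GB** (see `docs/sources/oshwhub_corpus_estimate.md`)
 - Whether to fetch the EasyEDA source JSON from `u.lceda.cn` (requires login; skipped in v0.1)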

 ---

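+## 2026-04-23 19:45  Measured full-corpus size + license distribution
+
+**Claude session** (working autonomously)
+
+Wrote `scripts/estimate_size.py`, which fetches only the detail HTML and parses `attachments[].size` without downloading anything; sampled 90 hot projects (3 pages × 30).
+
+**Key findings**:
+- Per project: median 9 MB / mean 22 MB / p90 54 MB / max 204 MB; across all 12,493 projects the median-based estimate is **110 GB** and the p90 upper bound is 660 GB
+- **Video (.mp4 + .qt) takes up 54% of storage.** If training only needs PCB/schematic/BOM files, adding `--skip-ext mp4,qt` cuts storage roughly in half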
+- The license distribution looks healthy: GPL 3.0 49%, Public Domain 21%, CC family ~16%, CERN/TAPR OHL ~7%; no closed-source projects in the sample
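+- **NC (Non-Commercial) is ~11%** and must be filtered out for commercial use
+
+Results are recorded in `docs/sources/oshwhub_corpus_estimate.md` and can be re-verified by rerunning the script at any time.
+
+### Recommendations for Charles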
+
+1. Set the storage budget at **180 GB** (the 150 GB planning figure + 15% buffer, rounded up)
+2. Add a `--skip-ext` switch to the crawler before Phase 1.4 to filter out video
+3. Establish a downstream license whitelist to filter out NC / unknown
+
+---
+
 ## 2026-04-23 18:50  Repository initialization & data-source survey

 **Claude session**: initialization
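
The `--skip-ext` switch recommended in the log entry above does not exist yet in this commit. A minimal sketch of how it might behave, assuming attachments are dicts with the `ext` and `size` keys that `estimate_size.py` reads; `parse_ext_list` and `split_attachments` are hypothetical helpers, and the attachment records are made up for the demo.

```python
import argparse


def parse_ext_list(raw):
    """Turn a comma-separated string such as 'mp4,.QT, mov' into {'mp4', 'qt', 'mov'}."""
    return {e.strip().lstrip(".").lower() for e in raw.split(",") if e.strip()}


def split_attachments(attachments, skip_exts):
    """Return (kept, skipped); an attachment is skipped when its extension is in skip_exts."""
    kept, skipped = [], []
    for a in attachments:
        ext = (a.get("ext") or "").lstrip(".").lower()
        (skipped if ext in skip_exts else kept).append(a)
    return kept, skipped


parser = argparse.ArgumentParser()
parser.add_argument("--skip-ext", type=parse_ext_list, default=set(),
                    help="comma-separated extensions to skip, e.g. mp4,qt")
args = parser.parse_args(["--skip-ext", "mp4,qt"])  # explicit argv for the demo

# Made-up attachment records, for illustration only.
demo = [
    {"ext": "zip", "size": 4_000_000},
    {"ext": "MP4", "size": 30_000_000},
    {"ext": "pdf", "size": 500_000},
]
kept, skipped = split_attachments(demo, args.skip_ext)
saved = sum(a.get("size") or 0 for a in skipped)
print(f"kept={len(kept)} skipped={len(skipped)} bytes_saved={saved}")
```

Normalizing case and a leading dot on both sides keeps the filter independent of how the detail page happens to spell the extension.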
--- a/scripts/estimate_size.py
+++ b/scripts/estimate_size.py
@@ -0,0 +1,123 @@
+"""Estimate full-corpus storage by sampling oshwhub detail pages (no downloads).
+
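+Fetches N projects from the list API, parses `attachments[]` on each detail page, and sums the `size` fields.
+No attachments are downloaded; only HTML pages are fetched, so server load is light. Useful for quickly giving
+Charles a storage estimate for the scale-up.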
+
+Usage:
+    uv run python scripts/estimate_size.py --pages 5 --sort hot
+"""
+
+from __future__ import annotations
+
+import argparse
+import statistics as st
+import sys
+import time
+from pathlib import Path
+
+# Reuse crawler helpers; avoid duplicating HTTP/parse code
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+from crawlers.oshwhub.crawler import (  # noqa: E402
+    make_client,
+    list_projects,
+    parse_detail_html,
+    BASE,
+)
+
+
+def fmt_mb(bytes_: float) -> str:
+    return f"{bytes_ / 1024 / 1024:.1f} MB"
+
+
+def fmt_gb(bytes_: float) -> str:
+    return f"{bytes_ / 1024 / 1024 / 1024:.2f} GB"
+
+
+def main(argv: list[str] | None = None) -> int:
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--pages", type=int, default=3)
+    ap.add_argument("--page-size", type=int, default=30)
+    ap.add_argument("--sort", default="hot")
+    ap.add_argument("--sleep", type=float, default=1.0, help="seconds between detail fetches")
+    args = ap.parse_args(argv)
+
+    sample_sizes: list[int] = []  # per-project total bytes
+    sample_counts: list[int] = []  # per-project attachment count
+    ext_hist: dict[str, int] = {}  # bytes by extension
+    lic_hist: dict[str, int] = {}
+
+    total: int | None = None  # corpus-wide project count reported by the list API
+    with make_client() as client:
+        for page in range(1, args.pages + 1):
+            res = list_projects(client, page=page, page_size=args.page_size, sort=args.sort)
+            total = res["total"]
+            for it in res["lists"]:
+                path = it["path"]
+                url = f"{BASE}/{path}"
+                try:
+                    r = client.get(url)
+                    r.raise_for_status()
+                    d = parse_detail_html(r.text)
+                except Exception as e:
+                    print(f"  skip {path}: {e}", file=sys.stderr)
+                    continue
+
+                lic = d.get("license") or "unknown"
+                lic_hist[lic] = lic_hist.get(lic, 0) + 1
+
+                proj_size = 0
+                count = 0
+                for a in d.get("attachments", []):
+                    size = a.get("size") or 0
+                    ext = (a.get("ext") or "?").lower()
+                    proj_size += size
+                    count += 1
+                    ext_hist[ext] = ext_hist.get(ext, 0) + size
+                sample_sizes.append(proj_size)
+                sample_counts.append(count)
+                print(
+                    f"  p{page:02d} {path:50.50} files={count:>2} size={fmt_mb(proj_size)}",
+                    flush=True,
+                )
+                time.sleep(args.sleep)
+
+    if not sample_sizes:
+        print("no samples")
+        return 1
+
+    n = len(sample_sizes)
+    total_bytes = sum(sample_sizes)
+    mean = st.mean(sample_sizes)
+    median = st.median(sample_sizes)
+    p90 = sorted(sample_sizes)[int(n * 0.9)] if n >= 10 else max(sample_sizes)
+    max_ = max(sample_sizes)
+
+    print()
+    print(f"sampled: {n} projects (sort={args.sort})")
+    print(f"attachments/proj: mean={st.mean(sample_counts):.1f} "
+          f"median={st.median(sample_counts):.0f} max={max(sample_counts)}")
+    print(f"size/proj:        mean={fmt_mb(mean)}  median={fmt_mb(median)}  "
+          f"p90={fmt_mb(p90)}  max={fmt_mb(max_)}")
+    print(f"sample total:     {fmt_mb(total_bytes)}")
+    if total:
+        est_full = mean * total
+        print(f"\ncorpus total (API reports):  {total} projects")
+        print(f"  × mean → estimate:    {fmt_gb(est_full)}")
+        print(f"  × median → estimate:  {fmt_gb(median * total)}")
+        print(f"  × p90 → upper bound:  {fmt_gb(p90 * total)}")
+
+    print("\ntop ext by total bytes:")
+    for ext, b in sorted(ext_hist.items(), key=lambda x: -x[1])[:10]:
+        print(f"  .{ext:6} {fmt_mb(b):>12}")
+
+    print("\nlicense distribution in sample:")
+    for lic, c in sorted(lic_hist.items(), key=lambda x: -x[1])[:10]:
+        pct = 100 * c / n
+        print(f"  {lic:30} {c:>3}  ({pct:.0f}%)")
+
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
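
The extrapolation and the p90 rule used by the script can be checked by hand. The sketch below plugs in the rounded per-project figures from the estimate document (22.2 / 9.0 / 54.2 MB) and the reported total of 12,493 projects; `extrapolate_gb` and `p90_by_index` are hypothetical names, and the 90-value list is synthetic, not real sample data.

```python
import statistics as st

MIB = 1024 * 1024
GIB = 1024 * MIB


def extrapolate_gb(per_project_bytes, total_projects):
    """Scale a per-project byte figure to the whole corpus, in binary GB as fmt_gb does."""
    return per_project_bytes * total_projects / GIB


def p90_by_index(values):
    """Same rule as estimate_size.py: the element at index int(0.9 * n) of the sorted sample."""
    n = len(values)
    return sorted(values)[int(n * 0.9)] if n >= 10 else max(values)


TOTAL = 12493
for label, mb in [("mean", 22.2), ("median", 9.0), ("p90", 54.2)]:
    print(f"{label:6} x {TOTAL} -> {extrapolate_gb(mb * MIB, TOTAL):.1f} GB")

# Synthetic sample 1..90, only to compare the two percentile rules.
sample = list(range(1, 91))
print("index rule:     ", p90_by_index(sample))
print("st.quantiles(): ", st.quantiles(sample, n=10)[-1])
```

Because the inputs here are rounded to one decimal, small differences from the table are expected, presumably because the script works from unrounded byte counts. With 90 samples the index rule picks the 82nd smallest value, slightly above the interpolated 90th percentile from `statistics.quantiles`, so the script's p90 leans conservative as an upper bound.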