Add corpus size/license estimator; snapshot 90-project findings

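Why:
- The scale-up decision needs firmer data than "52 MB/project × 12493 = 650 GB". scripts/estimate_size.py samples attachments[].size for 90 hot projects and yields the actual distribution (median 9 MB / p90 54 MB); extrapolated to the full corpus this gives a median-based estimate of 110 GB and a p90 upper bound of 660 GB. This gives Charles a credible storage budget.
- The license and extension distributions collected along the way yield two important insights: (1) mp4 + qt video accounts for 54% of storage, so a --skip-ext switch could save about half; (2) NC (Non-Commercial) licenses make up ~11%, so downstream must filter by whitelist.

What:
- scripts/estimate_size.py: download-free metadata sampler, reuses crawler.parse_detail_html
- docs/sources/oshwhub_corpus_estimate.md: result snapshot + decision recommendations
- log.md: record of this session

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>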
2026-04-23 19:45:54 +08:00
parent c8d55a22eb
commit e222b08f27
3 changed files with 213 additions and 1 deletions
--- a/docs/sources/oshwhub_corpus_estimate.md
+++ b/docs/sources/oshwhub_corpus_estimate.md
@@ -0,0 +1,67 @@
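 # oshwhub full-corpus size and characteristics estimate
 **Sampling method**: `scripts/estimate_size.py` takes **90 projects** (3 pages × 30) from `/api/project?sort=hot` and parses `attachments[]` from each project's detail-page HTML, without downloading any attachments.
 **Sampling date**: 2026-04-23
 **To rerun**: `uv run python scripts/estimate_size.py --pages 3 --page-size 30 --sort hot`
 ## Per-project distribution
 | Metric | Attachments | Size |
 |-----|-------|-----|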
 | mean | 3.1 | 22.2 MB |
 | median | 2 | 9.0 MB |
 | p90 | — | 54.2 MB |
 | max | 15 | 204.5 MB |
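 The 90 samples total 2001 MB. Note that `sort=hot` biases the sample toward **popular projects that have files**; the long tail should be smaller.
 ## Full-corpus extrapolation (12 493 projects)
 | Basis | Estimate |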
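
As a rough sketch of how the per-project figures above are derived: `scripts/estimate_size.py` takes p90 as `sorted(sample_sizes)[int(n * 0.9)]` when there are at least 10 samples, and falls back to the maximum otherwise. The sizes below are made-up illustrative values, not the real sample, and `summarize` is a hypothetical helper name.

```python
import statistics as st

MB = 1024 * 1024


def summarize(sample_sizes: list[int]) -> dict[str, float]:
    """Mean / median / p90 / max of per-project byte totals."""
    n = len(sample_sizes)
    ordered = sorted(sample_sizes)
    # Same rule as scripts/estimate_size.py: index int(n * 0.9),
    # falling back to the maximum for samples smaller than 10.
    p90 = ordered[int(n * 0.9)] if n >= 10 else ordered[-1]
    return {
        "mean": st.mean(ordered),
        "median": st.median(ordered),
        "p90": p90,
        "max": ordered[-1],
    }


# Illustrative data only (NOT the real 90-project sample).
sizes = [m * MB for m in (0, 1, 2, 4, 8, 9, 10, 20, 50, 200)]
stats = summarize(sizes)
for name, value in stats.items():
    print(f"{name:>6}: {value / MB:.1f} MB")
```

With exactly 10 samples this index rule selects the largest value, so the p90 is only meaningful on larger samples such as the 90 projects used here.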
 |-----|-----|
 | mean × total | **271 GB** |
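 | median × total | **110 GB** ← reasonable planning figure |
 | p90 × total | **662 GB** ← upper bound |
 Recommend budgeting **150 GB** (median + buffer, discounting the hot-sort bias) and treating **300 GB** as the capacity ceiling, which leaves headroom for failures.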
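
The extrapolation is simple multiplication; the sketch below reproduces it from the rounded per-project figures in the table, using the same 1024-based units as `fmt_mb` / `fmt_gb` in the script. The names `per_project_mb` and `extrapolate_gb` are illustrative, not part of the repository.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
TOTAL_PROJECTS = 12_493  # corpus size reported by the list API

# Rounded per-project figures from the table above, in MB.
per_project_mb = {"mean": 22.2, "median": 9.0, "p90": 54.2}


def extrapolate_gb(mb_per_project: float, total: int = TOTAL_PROJECTS) -> float:
    """Scale a per-project size (MB) to a full-corpus size (GB)."""
    return mb_per_project * MIB * total / GIB


estimates = {k: round(extrapolate_gb(v)) for k, v in per_project_mb.items()}
print(estimates)
```

Because the inputs are already rounded to one decimal, the p90 result comes out about 1 GB below the 662 GB in the table, which the script computes from unrounded byte counts.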
 ## File type distribution (by bytes)
 | Extension | Sample total | Share |
 |------|---------|------|
 | .mp4 | 1029 MB | **51%** |
 | .zip | 676 MB | 34% |
 | .rar | 72 MB | 4% |
 | .qt | 66 MB | 3% |
 | .pdf | 32 MB | 2% |
 | .bin | 27 MB | 1% |
 | .jpeg | 26 MB | 1% |
 | .7z | 10 MB | <1% |
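 > **Key insight**: video (mp4 + qt ≈ 54%) takes up more than half of the storage. If the training data is mainly PCB / schematics / BOM, the crawler could get a `--skip-ext mp4,qt` option to filter out video, cutting storage roughly in half.
 ## License distribution (90 samples)
 | License | Count | Share |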
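
A minimal sketch of what such a filter might look like. The `--skip-ext` option does not exist in the crawler yet; `parse_skip_ext` and `filter_attachments` are hypothetical helpers, and the attachment dicts are assumed to carry `ext` and `size` keys as read by `scripts/estimate_size.py` (the `name` key is made up for illustration).

```python
def parse_skip_ext(value: str) -> set[str]:
    """Turn a comma-separated option value like "mp4,qt" into a set of extensions."""
    return {e.strip().lower().lstrip(".") for e in value.split(",") if e.strip()}


def filter_attachments(attachments: list[dict], skip_ext: set[str]) -> list[dict]:
    """Keep only attachments whose extension is not in skip_ext."""
    kept = []
    for a in attachments:
        ext = (a.get("ext") or "").lower().lstrip(".")
        if ext in skip_ext:
            continue  # e.g. video files we do not want to store
        kept.append(a)
    return kept


# Illustrative attachments only; sizes are arbitrary byte counts.
sample = [
    {"name": "demo.mp4", "ext": "mp4", "size": 1000},
    {"name": "gerber.zip", "ext": "zip", "size": 600},
    {"name": "clip.QT", "ext": "QT", "size": 60},
]
skip = parse_skip_ext("mp4,qt")
kept = filter_attachments(sample, skip)
saved = sum(a["size"] for a in sample) - sum(a["size"] for a in kept)
print([a["name"] for a in kept], f"saved {saved} bytes")
```

Matching is case-insensitive and ignores a leading dot, so `QT`, `qt`, and `.qt` are treated alike.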
 |---------|-----|-----|
 | GPL 3.0 | 44 | **49%** |
 | Public Domain | 19 | 21% |
 | CC BY-NC-SA 4.0 | 5 | 6% |
 | CERN Open Hardware License | 4 | 4% |
 | CC BY-NC-SA 3.0 | 3 | 3% |
 | CC BY-SA 4.0 | 2 | 2% |
 | TAPR Open Hardware License | 2 | 2% |
 | CC-BY-NC-SA 3.0 | 2 | 2% |
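 | Other CC | 2 | 2% |
 **All are open-source or public-domain licenses**; the sample contains no closed-source projects. Note, however:
 - 49% GPL 3.0: using these for **model training** is not a direct violation (the academic view that model weights are not a derivative work is contested; to be conservative, training outputs should not simply be redistributed commercially)
 - **NC (Non-Commercial)** is about 11%: these should be **filtered out** for commercial use
 - The sample skews toward large projects; the share of `license: "unknown"` may be higher in the full corpus, so downstream must filter by whitelist
 ## Recommendations for Charles
 1. **Scale-up budget**: 150 GB storage + 15% buffer ≈ **180 GB of LFS space**
 2. **Filter video**: before Phase 1.4, add a `--skip-ext mp4,qt,mov,avi` switch to the crawler, cutting storage needs roughly in half
 3. **License whitelist**: filter downstream derived datasets by `license in {Public Domain, CC0, CC BY, CC BY-SA, MIT, Apache-2.0, BSD*, CERN-OHL*}` to exclude NC / unknown
 4. **Staged crawl**: advance page by page under `sort=hot`, checkpointing every 500 projects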
--- a/log.md
+++ b/log.md
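@@ -83,11 +83,33 @@ jsonschema performs two layers of validation:
 ### Still needs Charles's decision
- Scale-up size (estimate: 52 MB/project × 12493 ≈ 650 GB for the full corpus; Gitea LFS capacity needs evaluating)
+- Scale-up size, now backed by measured data: **median ≈ 110 GB, p90 upper bound ≈ 660 GB, recommended budget 150–180 GB** (see `docs/sources/oshwhub_corpus_estimate.md`)
 - Whether to fetch the EasyEDA source JSON from `u.lceda.cn` (requires login; skipped for v0.1)
 ---
 ## 2026-04-23 19:45  Measured full-corpus size + license distribution
 **Claude session** (working autonomously)
 Wrote `scripts/estimate_size.py`, which fetches only the detail HTML and parses `attachments[].size` without downloading anything; sampled 90 hot projects (3 pages × 30).
 **Key findings**:
 - Per project: median 9 MB / mean 22 MB / p90 54 MB / max 204 MB; across all 12493 projects the median-based estimate is **110 GB**, with a p90 upper bound of 660 GB
 - **Video (.mp4 + .qt) takes up 54% of storage.** If training only needs PCB / schematics / BOM, adding `--skip-ext mp4,qt` cuts storage roughly in half
 - The license distribution is healthy: GPL 3.0 at 49%, Public Domain 21%, CC family ~20%, CERN/TAPR OHL 6%; no closed-source projects in the sample
 - **NC (Non-Commercial) is ~11%** and must be filtered out for commercial use
 Results are recorded in `docs/sources/oshwhub_corpus_estimate.md` and can be rerun at any time to verify.
 ### Recommendations for Charles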
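 1. Set the storage budget at **180 GB** (150 GB planning figure + 15% buffer)
 2. Before Phase 1.4, add a `--skip-ext` switch to the crawler to filter out video
 3. Downstream, build a license whitelist to filter out NC / unknown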
 ---
 ## 2026-04-23 18:50  Repository initialization & data source survey
 **Claude session**: initialization
--- a/scripts/estimate_size.py
+++ b/scripts/estimate_size.py
@@ -0,0 +1,123 @@
 """Estimate full-corpus storage by sampling oshwhub detail pages (no downloads).
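 Takes N projects from the list API, parses `attachments[]` on each detail page, and sums the `size` field.
 No attachments are downloaded, only HTML pages are fetched, so server load is low; useful for quickly giving Charles
 a storage estimate for the scale-up.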
 Usage:
    uv run python scripts/estimate_size.py --pages 5 --sort hot
 """
 from __future__ import annotations
 import argparse
 import statistics as st
 import sys
 import time
 from pathlib import Path
 # Reuse crawler helpers; avoid duplicating HTTP/parse code
 sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
 from crawlers.oshwhub.crawler import (  # noqa: E402
    make_client,
    list_projects,
    parse_detail_html,
    BASE,
 )
 def fmt_mb(bytes_: float) -> str:
    return f"{bytes_ / 1024 / 1024:.1f} MB"
 def fmt_gb(bytes_: float) -> str:
    return f"{bytes_ / 1024 / 1024 / 1024:.2f} GB"
 def main(argv: list[str] | None = None) -> int:
    ap = argparse.ArgumentParser()
    ap.add_argument("--pages", type=int, default=3)
    ap.add_argument("--page-size", type=int, default=30)
    ap.add_argument("--sort", default="hot")
    ap.add_argument("--sleep", type=float, default=1.0, help="seconds between detail fetches")
    args = ap.parse_args(argv)
    sample_sizes: list[int] = []  # per-project total bytes
    sample_counts: list[int] = []  # per-project attachment count
    ext_hist: dict[str, int] = {}  # bytes by extension
    lic_hist: dict[str, int] = {}
    total = None
    with make_client() as client:
        for page in range(1, args.pages + 1):
            res = list_projects(client, page=page, page_size=args.page_size, sort=args.sort)
            total = res["total"]
            for it in res["lists"]:
                path = it["path"]
                url = f"{BASE}/{path}"
                try:
                    r = client.get(url)
                    r.raise_for_status()
                    d = parse_detail_html(r.text)
                except Exception as e:
                    print(f"  skip {path}: {e}", file=sys.stderr)
                    continue
                lic = d.get("license") or "unknown"
                lic_hist[lic] = lic_hist.get(lic, 0) + 1
                proj_size = 0
                count = 0
                for a in d.get("attachments", []):
                    size = a.get("size") or 0
                    ext = (a.get("ext") or "?").lower()
                    proj_size += size
                    count += 1
                    ext_hist[ext] = ext_hist.get(ext, 0) + size
                sample_sizes.append(proj_size)
                sample_counts.append(count)
                print(
                    f"  p{page:02d} {path:50.50} files={count:>2} size={fmt_mb(proj_size)}",
                    flush=True,
                )
                time.sleep(args.sleep)
    if not sample_sizes:
        print("no samples")
        return 1
    n = len(sample_sizes)
    total_bytes = sum(sample_sizes)
    mean = st.mean(sample_sizes)
    median = st.median(sample_sizes)
    p90 = sorted(sample_sizes)[int(n * 0.9)] if n >= 10 else max(sample_sizes)
    max_ = max(sample_sizes)
    print()
    print(f"sampled: {n} projects (sort={args.sort})")
    print(f"attachments/proj: mean={st.mean(sample_counts):.1f} "
          f"median={st.median(sample_counts):.0f} max={max(sample_counts)}")
    print(f"size/proj:        mean={fmt_mb(mean)}  median={fmt_mb(median)}  "
          f"p90={fmt_mb(p90)}  max={fmt_mb(max_)}")
    print(f"sample total:     {fmt_mb(total_bytes)}")
    if total:
        est_full = mean * total
        print(f"\ncorpus total (API reports):  {total} projects")
        print(f"  × mean → estimate:    {fmt_gb(est_full)}")
        print(f"  × median → estimate:  {fmt_gb(median * total)}")
        print(f"  × p90 → upper bound:  {fmt_gb(p90 * total)}")
    print("\ntop ext by total bytes:")
    for ext, b in sorted(ext_hist.items(), key=lambda x: -x[1])[:10]:
        print(f"  .{ext:6} {fmt_mb(b):>12}")
    print("\nlicense distribution in sample:")
    for lic, c in sorted(lic_hist.items(), key=lambda x: -x[1])[:10]:
        pct = 100 * c / n
        print(f"  {lic:30} {c:>3}  ({pct:.0f}%)")
    return 0
 if __name__ == "__main__":
    raise SystemExit(main())