Add corpus size/license estimator; snapshot 90-project findings
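Why:
- The scale-up decision needs firmer data than "52 MB/project × 12,493 = 650 GB". scripts/estimate_size.py samples attachments[].size for 90 hot projects and yields the real distribution (median 9 MB / p90 54 MB); extrapolated to the full corpus, the median-based estimate is 110 GB and the p90 upper bound is 660 GB. This gives Charles a credible storage budget.
- The license and extension distributions collected along the way surface two important insights: (1) mp4 + qt video accounts for 54% of storage, so a --skip-ext switch could save about half; (2) NC (Non-Commercial) licenses make up ~11%, so downstream consumers must filter by whitelist.

What:
- scripts/estimate_size.py: download-free metadata sampler that reuses crawler.parse_detail_html
- docs/sources/oshwhub_corpus_estimate.md: snapshot of results plus decision recommendations
- log.md: record of this session

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>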
67
docs/sources/oshwhub_corpus_estimate.md
Normal file
@@ -0,0 +1,67 @@
# oshwhub Full-Corpus Size and Characteristics Estimate

**Sampling method**: `scripts/estimate_size.py` fetches **90 projects** (3 pages × 30) from `/api/project?sort=hot` and parses `attachments[]` from each project's detail-page HTML, without downloading any attachments.
**Sampling date**: 2026-04-23
**Rerun with**: `uv run python scripts/estimate_size.py --pages 3 --page-size 30 --sort hot`

## Per-Project Distribution

| Metric | Attachments | Size |
|--------|-------------|------|
| mean | 3.1 | 22.2 MB |
| median | 2 | 9.0 MB |
| p90 | n/a | 54.2 MB |
| max | 15 | 204.5 MB |

The 90 samples total 2001 MB. Note that `sort=hot` biases the sample toward **popular projects that have files**; the long tail should be smaller.

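The summary rows can be derived from a list of per-project byte totals. The following is a minimal sketch, assuming the same p90 rule as `scripts/estimate_size.py` (the element at index `int(n * 0.9)` of the sorted sample); the function name `summarize` and the sample values are made up for illustration and are not the measured data.

```python
import statistics as st

MB = 1024 * 1024


def summarize(sizes: list[int]) -> dict[str, float]:
    """Return mean / median / p90 / max of per-project byte totals."""
    n = len(sizes)
    ordered = sorted(sizes)
    # Same rule as the estimator: index int(n * 0.9) once there are >= 10 samples,
    # otherwise fall back to the maximum.
    p90 = ordered[int(n * 0.9)] if n >= 10 else ordered[-1]
    return {
        "mean": st.mean(sizes),
        "median": st.median(sizes),
        "p90": p90,
        "max": ordered[-1],
    }


# Hypothetical sample: ten projects, sizes in bytes (illustrative only).
toy_sizes = [m * MB for m in (1, 2, 3, 5, 8, 9, 12, 20, 54, 200)]
stats = summarize(toy_sizes)
for name, value in stats.items():
    print(f"{name:>6}: {value / MB:.1f} MB")
```

With a heavy-tailed sample like this one, a single large project pulls the mean well above the median, which is why the median is the more stable planning basis.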
## Full-Corpus Extrapolation (12,493 projects)

| Basis | Estimate |
|-------|----------|
| mean × total | **271 GB** |
| median × total | **110 GB** ← reasonable planning figure |
| p90 × total | **662 GB** ← upper bound |

Recommendation: budget **150 GB** (median plus a buffer, discounting the hot-sort bias) and set **300 GB** as the capacity ceiling to leave headroom for failures.

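The extrapolation is a single multiplication per row. As a rough check, the sketch below redoes it from the rounded per-project figures (MB) and the project count reported by the list API; the helper name `extrapolate_gb` is hypothetical, and 1 GB is taken as 1024 MB to match the script's `fmt_gb`.

```python
MB_PER_GB = 1024
TOTAL_PROJECTS = 12_493  # corpus size reported by the list API

# Per-project sizes in MB, rounded to one decimal as in the distribution table.
per_project_mb = {"mean": 22.2, "median": 9.0, "p90": 54.2}


def extrapolate_gb(size_mb: float, total: int = TOTAL_PROJECTS) -> float:
    """Scale one per-project figure (MB) to the whole corpus (GB)."""
    return size_mb * total / MB_PER_GB


estimates = {name: extrapolate_gb(mb) for name, mb in per_project_mb.items()}
for name, gb in estimates.items():
    print(f"{name:>6} x {TOTAL_PROJECTS} = {gb:.1f} GB")
```

Because the inputs here are rounded to one decimal place, a result can differ from the script's output by about 1 GB (the script multiplies the unrounded byte counts).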
## File Type Distribution (by bytes)

| Extension | Sample total | Share |
|-----------|--------------|-------|
| .mp4 | 1029 MB | **51%** |
| .zip | 676 MB | 34% |
| .rar | 72 MB | 4% |
| .qt | 66 MB | 3% |
| .pdf | 32 MB | 2% |
| .bin | 27 MB | 1% |
| .jpeg | 26 MB | 1% |
| .7z | 10 MB | <1% |

> **Key insight**: video (mp4 + qt ≈ 54%) takes up more than half of the storage. If the training data mainly uses PCB / schematic / BOM files, a `--skip-ext mp4,qt` option could be added to the crawler to filter out video, cutting storage by roughly half.

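The `--skip-ext` option is a proposal, not an existing crawler flag. A minimal sketch of the filtering it implies is shown below, assuming attachments are dicts with `ext` and `size` keys (as read by `scripts/estimate_size.py`); the helper names `normalize_ext` and `split_by_ext` are hypothetical. The input reuses the per-extension totals from the table, one pseudo-attachment per extension.

```python
from typing import Optional


def normalize_ext(ext: Optional[str]) -> str:
    """Lower-case an extension and drop any leading dot; '?' if missing."""
    return (ext or "?").lower().lstrip(".")


def split_by_ext(attachments: list[dict], skip_exts: list[str]) -> tuple[list[dict], list[dict]]:
    """Partition attachments into (kept, skipped) by extension."""
    skip = {normalize_ext(e) for e in skip_exts}
    kept: list[dict] = []
    skipped: list[dict] = []
    for a in attachments:
        target = skipped if normalize_ext(a.get("ext")) in skip else kept
        target.append(a)
    return kept, skipped


# Byte totals per extension from the table above, in MB.
sample_mb = {"mp4": 1029, "zip": 676, "rar": 72, "qt": 66,
             "pdf": 32, "bin": 27, "jpeg": 26, "7z": 10}
SAMPLE_TOTAL_MB = 2001  # total of the 90-project sample

attachments = [{"ext": ext, "size": mb} for ext, mb in sample_mb.items()]
kept, skipped = split_by_ext(attachments, ["mp4", "qt"])
saved_share = sum(a["size"] for a in skipped) / SAMPLE_TOTAL_MB
print(f"skipped share of sampled bytes: {saved_share:.1%}")
```

Skipping at the metadata stage, before any download, means the saving applies to bandwidth as well as storage.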
## License Distribution (90 samples)

| License | Count | Share |
|---------|-------|-------|
| GPL 3.0 | 44 | **49%** |
| Public Domain | 19 | 21% |
| CC BY-NC-SA 4.0 | 5 | 6% |
| CERN Open Hardware License | 4 | 4% |
| CC BY-NC-SA 3.0 | 3 | 3% |
| CC BY-SA 4.0 | 2 | 2% |
| TAPR Open Hardware License | 2 | 2% |
| CC-BY-NC-SA 3.0 | 2 | 2% |
| Other CC | 2 | 2% |

**All are open-source or public-domain licenses**; the sample contains no closed-source projects. Caveats:

- 49% GPL 3.0: using these projects for **model training** is not a direct violation (the academic consensus that model weights are not a derivative work is contested, so to be conservative, training outputs should not simply be redistributed commercially)
- **NC (Non-Commercial)**: about 11% (the three BY-NC-SA rows, 10 of 90); these should be **filtered out** for commercial use
- The sample skews toward larger projects; across the full corpus the share of `license: "unknown"` is likely higher, so downstream consumers need to filter by whitelist

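The ~11% NC figure can be reproduced from the table counts. The sketch below tokenizes each raw license string so that spelling variants such as `CC BY-NC-SA 3.0` and `CC-BY-NC-SA 3.0` are treated alike; the helper name `is_non_commercial` is hypothetical, and the "Other CC" row is left out because its individual licenses are not listed.

```python
import re

# Counts copied from the license table (the "Other CC" row is omitted).
license_counts = {
    "GPL 3.0": 44,
    "Public Domain": 19,
    "CC BY-NC-SA 4.0": 5,
    "CERN Open Hardware License": 4,
    "CC BY-NC-SA 3.0": 3,
    "CC BY-SA 4.0": 2,
    "TAPR Open Hardware License": 2,
    "CC-BY-NC-SA 3.0": 2,
}
SAMPLE_SIZE = 90


def is_non_commercial(name: str) -> bool:
    """True if the license string contains a standalone 'NC' clause."""
    tokens = re.split(r"[\s\-_]+", name.upper())
    return "NC" in tokens


nc_total = sum(c for name, c in license_counts.items() if is_non_commercial(name))
nc_share = nc_total / SAMPLE_SIZE
print(f"NC projects: {nc_total} of {SAMPLE_SIZE} ({nc_share:.0%})")
```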
## Recommendations for Charles

1. **Scale-up budget**: 150 GB of storage + 15% buffer ≈ 172.5 GB, rounded up to **180 GB of LFS space**
2. **Filter video**: before Phase 1.4, add a `--skip-ext mp4,qt,mov,avi` switch to the crawler, cutting storage needs roughly in half
3. **License whitelist**: for downstream derived datasets, filter on `license in {Public Domain, CC0, CC BY, CC BY-SA, MIT, Apache-2.0, BSD*, CERN-OHL*}` to exclude NC / unknown licenses
4. **Phased crawl**: advance page by page under `sort=hot`, checkpointing every 500 projects
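
The license whitelist in item 3 could be applied roughly as follows. This is a sketch under stated assumptions: the normalization rules, the helper names `normalize` and `is_allowed`, and the alias mapping the raw string `CERN Open Hardware License` to `CERN-OHL*` are all illustrative choices, not part of the crawler. Under the whitelist as written, GPL 3.0 and the TAPR OHL are rejected along with NC and unknown licenses.

```python
import re
from typing import Optional

# Normalized prefixes for the non-CC entries of the whitelist.
# "CERN OPEN HARDWARE" is an assumed alias for CERN-OHL*.
ALLOWED_PREFIXES = (
    "PUBLIC DOMAIN", "CC0", "MIT", "APACHE", "BSD",
    "CERN OHL", "CERN OPEN HARDWARE",
)
ALLOWED_CC_CLAUSES = {"BY", "SA"}  # permits CC BY and CC BY-SA only


def normalize(name: str) -> str:
    """Upper-case and collapse spaces, hyphens and underscores."""
    return re.sub(r"[\s\-_]+", " ", name.strip().upper())


def is_allowed(name: Optional[str]) -> bool:
    """Return True only for licenses on the whitelist."""
    if not name or not name.strip():
        return False
    norm = normalize(name)
    tokens = norm.split()
    if tokens[0] == "CC":
        # Keep alphabetic clause tokens, ignoring version numbers like 4.0.
        clauses = {t for t in tokens[1:] if t.isalpha()}
        return bool(clauses) and clauses <= ALLOWED_CC_CLAUSES
    return norm.startswith(ALLOWED_PREFIXES)


for lic in ("Public Domain", "CC BY-SA 4.0", "CC-BY-NC-SA 3.0", "GPL 3.0", "unknown"):
    print(f"{lic:20} -> {'keep' if is_allowed(lic) else 'drop'}")
```

Defaulting to rejection means a new or misspelled license string is dropped until someone reviews it, which is the safer failure mode for a derived dataset.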
24
log.md
@@ -83,11 +83,33 @@ jsonschema performs two layers of validation:

### Still needs Charles's decision

- Scale-up size (estimate: 52 MB/project × 12,493 ≈ 650 GB for the full corpus; Gitea LFS capacity needs to be assessed)
- Scale-up size: measured data is now available: **median ≈ 110 GB, p90 upper bound ≈ 660 GB, recommended budget 150–180 GB** (see `docs/sources/oshwhub_corpus_estimate.md`)
- Whether to fetch the EasyEDA source JSON from `u.lceda.cn` (requires login; skipped in v0.1)

---

## 2026-04-23 19:45 Measured full-corpus size + license distribution

**Claude session** (autonomous)

Wrote `scripts/estimate_size.py`, which fetches only the detail HTML and parses `attachments[].size` without downloading anything; sampled 90 hot projects (3 pages × 30).

**Key findings**:
- Per-project median 9 MB / mean 22 MB / p90 54 MB / max 204 MB; across all 12,493 projects the median-based estimate is **110 GB**, with a p90 upper bound of 660 GB
- **Video (.mp4 + .qt) takes up 54% of storage!** If training only needs PCB / schematic / BOM files, adding `--skip-ext mp4,qt` cuts storage roughly in half
- License distribution is healthy: GPL 3.0 49%, Public Domain 21%, the CC family ~20%, CERN/TAPR OHL 6%; no closed-source projects in the sample
- **NC (Non-Commercial) accounts for ~11%** and must be filtered out for commercial use

The results are recorded in `docs/sources/oshwhub_corpus_estimate.md`, and the script can be rerun at any time to verify them.

### Recommendations for Charles

1. Set the storage budget at **180 GB** (the 150 GB planning figure plus a 15% buffer, rounded up)
2. Before Phase 1.4, add a `--skip-ext` switch to the crawler to filter out video
3. Build a downstream license whitelist to filter out NC / unknown licenses

---

## 2026-04-23 18:50 Repository initialization & data source survey

**Claude session**: initialization
123
scripts/estimate_size.py
Normal file
@@ -0,0 +1,123 @@
"""Estimate full-corpus storage by sampling oshwhub detail pages (no downloads).

Fetch N projects from the list API, parse `attachments[]` from each detail page, and sum the `size` fields.
No attachments are downloaded; only HTML pages are fetched, so server load stays low. Useful for quickly
giving Charles a storage estimate for the full-scale crawl.

Usage:
    uv run python scripts/estimate_size.py --pages 5 --sort hot
"""

from __future__ import annotations

import argparse
import statistics as st
import sys
import time
from pathlib import Path

# Reuse crawler helpers; avoid duplicating HTTP/parse code
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from crawlers.oshwhub.crawler import (  # noqa: E402
    make_client,
    list_projects,
    parse_detail_html,
    BASE,
)


def fmt_mb(bytes_: float) -> str:
    return f"{bytes_ / 1024 / 1024:.1f} MB"


def fmt_gb(bytes_: float) -> str:
    return f"{bytes_ / 1024 / 1024 / 1024:.2f} GB"


def main(argv: list[str] | None = None) -> int:
    ap = argparse.ArgumentParser()
    ap.add_argument("--pages", type=int, default=3)
    ap.add_argument("--page-size", type=int, default=30)
    ap.add_argument("--sort", default="hot")
    ap.add_argument("--sleep", type=float, default=1.0, help="seconds between detail fetches")
    args = ap.parse_args(argv)

    sample_sizes: list[int] = []  # per-project total bytes
    sample_counts: list[int] = []  # per-project attachment count
    ext_hist: dict[str, int] = {}  # bytes by extension
    lic_hist: dict[str, int] = {}

    total = None
    with make_client() as client:
        for page in range(1, args.pages + 1):
            res = list_projects(client, page=page, page_size=args.page_size, sort=args.sort)
            total = res["total"]
            for it in res["lists"]:
                path = it["path"]
                url = f"{BASE}/{path}"
                try:
                    r = client.get(url)
                    r.raise_for_status()
                    d = parse_detail_html(r.text)
                except Exception as e:
                    print(f" skip {path}: {e}", file=sys.stderr)
                    continue

                lic = d.get("license") or "unknown"
                lic_hist[lic] = lic_hist.get(lic, 0) + 1

                proj_size = 0
                count = 0
                for a in d.get("attachments", []):
                    size = a.get("size") or 0
                    ext = (a.get("ext") or "?").lower()
                    proj_size += size
                    count += 1
                    ext_hist[ext] = ext_hist.get(ext, 0) + size
                sample_sizes.append(proj_size)
                sample_counts.append(count)
                print(
                    f" p{page:02d} {path:50.50} files={count:>2} size={fmt_mb(proj_size)}",
                    flush=True,
                )
                time.sleep(args.sleep)

    if not sample_sizes:
        print("no samples")
        return 1

    n = len(sample_sizes)
    total_bytes = sum(sample_sizes)
    mean = st.mean(sample_sizes)
    median = st.median(sample_sizes)
    p90 = sorted(sample_sizes)[int(n * 0.9)] if n >= 10 else max(sample_sizes)
    max_ = max(sample_sizes)

    print()
    print(f"sampled: {n} projects (sort={args.sort})")
    print(f"attachments/proj: mean={st.mean(sample_counts):.1f} "
          f"median={st.median(sample_counts):.0f} max={max(sample_counts)}")
    print(f"size/proj: mean={fmt_mb(mean)} median={fmt_mb(median)} "
          f"p90={fmt_mb(p90)} max={fmt_mb(max_)}")
    print(f"sample total: {fmt_mb(total_bytes)}")
    if total:
        est_full = mean * total
        print(f"\ncorpus total (API reports): {total} projects")
        print(f" × mean → estimate: {fmt_gb(est_full)}")
        print(f" × median → estimate: {fmt_gb(median * total)}")
        print(f" × p90 → upper bound: {fmt_gb(p90 * total)}")

    print("\ntop ext by total bytes:")
    for ext, b in sorted(ext_hist.items(), key=lambda x: -x[1])[:10]:
        print(f" .{ext:6} {fmt_mb(b):>12}")

    print("\nlicense distribution in sample:")
    for lic, c in sorted(lic_hist.items(), key=lambda x: -x[1])[:10]:
        pct = 100 * c / n
        print(f" {lic:30} {c:>3} ({pct:.0f}%)")

    return 0


if __name__ == "__main__":
    raise SystemExit(main())