docs: std corpus 2026-05 snapshot + batch-1000/4000/remaining log

Snapshot of full oshwhub std corpus delivery:
- 12,493 projects total, 12,166 (97.4%) with editor source
- 4 sweep batches + 1 early-mixed = 5 zip artifacts in COS GZ + SG buckets
- 30-day SG-region presigned URLs for downstream pickup

log.md tracks the multi-batch sweep including driver bug postmortem
(bash heredoc python3 missed httpx → 26-min run wasted on empty zips,
recovered by switching to uv run).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 10:56:09 +08:00
parent d5cc6507cb
commit 6aa72faf84
2 changed files with 315 additions and 0 deletions

docs/std_corpus_2026-05.md

@@ -0,0 +1,144 @@
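# oshwhub Std corpus delivery (2026-05 snapshot)
**Snapshot date**: 2026-05-03
**Data source**: oshwhub.com (origin=std)
**Purpose**: research use only, not for redistribution; for downstream colleagues to batch-ingest into the EPRO2/Std → KiCad / Wokwi pipeline
---
## Overview
| Item | Value |
|---|---:|
| Total oshwhub Std projects (origin=std) | **12,493** |
| With complete editor source project | **12,166 (97.4%)** |
| Metadata + attachments only (upstream has no editor session) | 327 (2.6%) |
| Total sch + pcb docs (summed over multi-page projects) | **30,488** |
| Source project file size (`.json`, decoded) | 11.79 GB |
| Upstream listing pool coverage | 12,493 / 12,493 = **100%** |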
---
## By batch
| Batch | Projects | With source | attach_only | docs | Selection rule |
|---|---:|---:|---:|---:|---|
| `batch_early_std` | 112 | 108 | 4 | 427 | Early mixed crawl (std projects picked up alongside the Pro crawl) |
| `batch1000_std` | 1,000 | 963 | 37 | 2,853 | Head of tier A (like p50=43) |
| `batch4000_std` | 4,000 | 3,884 | 116 | 10,100 | Rest of tier A + B + head of C |
| `batch_remaining_a_std` | 3,691 | 3,641 | 50 | 8,877 | Mid-rank |
| `batch_remaining_b_std` | 3,690 | 3,570 | 120 | 8,231 | Long tail (grade 0/1) |
| **Total** | **12,493** | **12,166** | **327** | **30,488** | |
---
## License distribution (top 12)
| Count | Share | License |
|---:|---:|---|
| 7,050 | 56.4% | GPL 3.0 |
| 2,384 | 19.1% | Public Domain |
| 543 | 4.3% | CC-BY-NC-SA 3.0 |
| 507 | 4.1% | unknown |
| 377 | 3.0% | MIT |
| 156 | 1.2% | MIT License (same as MIT; the 立创 platform does not normalize it) |
| 147 | 1.2% | CC BY-NC-SA 4.0 |
| 147 | 1.2% | BSD |
| 144 | 1.2% | CERN Open Hardware License |
| 140 | 1.1% | LGPL 3.0 |
| 136 | 1.1% | CC-BY-NC 3.0 |
| 132 | 1.1% | CC BY-NC-SA 3.0 |
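| 630 | 5.0% | Other, 21 kinds (incl. TAPR / CC BY / CC0 / null …) |
> When building a normalized license allowlist downstream, six permissive classes appear: MIT / BSD / Apache / CC0 / CC-BY / Public Domain. Note that the license field keeps the raw string and is not normalized ("MIT" and "MIT License" count as different keys).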
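
The allowlist step described above could be sketched as follows. This is a minimal illustration, not the pipeline's actual code: `ALIASES`, `normalize_license`, and every alias spelling other than "MIT" / "MIT License" are assumptions and should be checked against the raw values in `metadata.json`.

```python
import re

# Hypothetical alias table: canonical key -> squashed spellings to accept.
# Only the "MIT" / "MIT License" equivalence is stated in this document;
# the other spellings (especially Apache) are guesses to be verified.
ALIASES = {
    "MIT": {"mit", "mit license"},
    "BSD": {"bsd"},
    "Apache": {"apache", "apache 2.0"},
    "CC0": {"cc0"},
    "CC-BY": {"cc by"},
    "Public Domain": {"public domain"},
}
LOOKUP = {alias: canon for canon, aliases in ALIASES.items() for alias in aliases}


def _squash(raw):
    """Lowercase, treat '-' and '_' as spaces, collapse runs of whitespace."""
    if raw is None:
        return ""
    return re.sub(r"[\s_-]+", " ", raw).strip().lower()


def normalize_license(raw):
    """Return the canonical allowlist key, or None if the string is not listed."""
    return LOOKUP.get(_squash(raw))
```

Matching is exact after squashing, so "CC-BY-NC-SA 3.0" does not slip through as "CC-BY"; unknown strings return `None` and can be logged for review.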
---
## EasyEDA Std editor version (top 10, counted over the 12,166 projects with source)
| Count | Share | Version |
|---:|---:|---|
| 1,192 | 9.8% | 6.4.25 |
| 906 | 7.4% | 6.4.7 |
| 678 | 5.6% | 6.5.5 |
| 564 | 4.6% | 6.5.15 |
| 535 | 4.4% | 6.5.22 |
| 447 | 3.7% | 6.4.20.6 |
| 403 | 3.3% | 6.5.1 |
| 355 | 2.9% | 6.5.23 |
| 350 | 2.9% | 6.5.34 |
| 327 | 2.7% | 6.5.28 |
The rest are spread across the full 6.3.x to 6.5.4x range. Downstream parsers only need to branch on the 6.4.x / 6.5.x major versions.
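
One way to derive the branch key is to reduce a full version string to its first two components. The helper names `version_family` and `count_families` below are hypothetical, and how a parser should treat families outside 6.4 / 6.5 is left open here.

```python
from collections import Counter


def version_family(editor_version):
    """Reduce e.g. '6.4.20.6' to '6.4'; return None for malformed input."""
    parts = (editor_version or "").strip().split(".")
    if len(parts) < 2 or not (parts[0].isdigit() and parts[1].isdigit()):
        return None
    return f"{parts[0]}.{parts[1]}"


def count_families(versions):
    """Count how many version strings fall into each major.minor family."""
    return Counter(version_family(v) for v in versions)
```

For instance, `count_families(["6.4.25", "6.4.7", "6.5.5", "6.4.20.6"])` groups three entries under `"6.4"` and one under `"6.5"`.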
---
## Data delivery
### Two-bucket replicas (Tencent Cloud COS)
| Region | Bucket |
|---|---|
| ap-guangzhou | `facere-gz-1321068335` |
| ap-singapore | `facere-1321068335` |
### Singapore region direct download links (valid 30 days, expire 2026-06-02)
| Object key | Size | Projects | Link |
|---|---:|---:|---|
| `batch_early_std.zip` | 93 MB | 112 | [download](https://facere-1321068335.cos.ap-singapore.myqcloud.com/batch_early_std.zip?q-sign-algorithm=sha1&q-ak=AKID6HF1bx6A3jCSXP3UjneIjwwj7JJ8kANN&q-sign-time=1777776835%3B1780368895&q-key-time=1777776835%3B1780368895&q-header-list=host&q-url-param-list=&q-signature=09bf61ec57fbe8d758397c73d981faff47e1086e) |
| `batch1000_std.zip` | 471 MB | 1,000 | [download](https://facere-1321068335.cos.ap-singapore.myqcloud.com/batch1000_std.zip?q-sign-algorithm=sha1&q-ak=AKID6HF1bx6A3jCSXP3UjneIjwwj7JJ8kANN&q-sign-time=1777776835%3B1780368895&q-key-time=1777776835%3B1780368895&q-header-list=host&q-url-param-list=&q-signature=fd003a27c831c0c0337615b36e7f697159f4f83e) |
| `batch4000_std.zip` | 1,378 MB | 4,000 | [download](https://facere-1321068335.cos.ap-singapore.myqcloud.com/batch4000_std.zip?q-sign-algorithm=sha1&q-ak=AKID6HF1bx6A3jCSXP3UjneIjwwj7JJ8kANN&q-sign-time=1777776835%3B1780368895&q-key-time=1777776835%3B1780368895&q-header-list=host&q-url-param-list=&q-signature=427b5f15f2888f630c91b5ca5e0107b2d6b15c4c) |
| `batch_remaining_a.zip` | 1,065 MB | 3,691 | [download](https://facere-1321068335.cos.ap-singapore.myqcloud.com/batch_remaining_a.zip?q-sign-algorithm=sha1&q-ak=AKID6HF1bx6A3jCSXP3UjneIjwwj7JJ8kANN&q-sign-time=1777776835%3B1780368895&q-key-time=1777776835%3B1780368895&q-header-list=host&q-url-param-list=&q-signature=9fa2744d879bd17842129396b2656cb1f56ae81b) |
| `batch_remaining_b.zip` | 891 MB | 3,690 | [download](https://facere-1321068335.cos.ap-singapore.myqcloud.com/batch_remaining_b.zip?q-sign-algorithm=sha1&q-ak=AKID6HF1bx6A3jCSXP3UjneIjwwj7JJ8kANN&q-sign-time=1777776835%3B1780368895&q-key-time=1777776835%3B1780368895&q-header-list=host&q-url-param-list=&q-signature=f5cfa3b6010ac2fd81df36c61e64e9da2d68d8fa) |
| **Total** | **3,898 MB** | **12,493** | |
Link properties:
- Presigned URL, valid for 30 days (expires **2026-06-02 ~02:55 UTC**, per the `q-sign-time` end value `1780368895` in the links)
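- The URL embeds `q-ak` (the COS access key id: a public identifier, not a secret); it does not contain the SecretKey
- Any machine with public internet access can fetch it with `wget` / `curl -O`
- After expiry, contact me to have it reissued; the URL itself cannot be renewed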
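
Since the validity window is carried in the URL's `q-sign-time` parameter as `start;end` Unix timestamps, a downstream job can check a link before attempting a multi-hundred-MB download. A minimal sketch using only the standard library; the function names are hypothetical:

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit


def presigned_window(url):
    """Return (start, end) of the q-sign-time window as UTC datetimes."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs percent-decodes the value, so '%3B' arrives as ';'
    start_s, end_s = query["q-sign-time"][0].split(";")

    def to_utc(seconds):
        return datetime.fromtimestamp(int(seconds), tz=timezone.utc)

    return to_utc(start_s), to_utc(end_s)


def is_expired(url, now=None):
    """True once the current time has reached the end of the signed window."""
    now = now or datetime.now(timezone.utc)
    return now >= presigned_window(url)[1]
```

Calling `presigned_window(...)` on any link in the table gives the exact signing and expiry instants, which can be compared with the expiry stated above.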
Download example:
```bash
wget -O batch1000_std.zip 'https://facere-1321068335.cos.ap-singapore.myqcloud.com/batch1000_std.zip?...'
```
Or use coscmd (requires credentials):
```bash
coscmd download batch1000_std.zip ./batch1000_std.zip
```
---
## Per-project directory layout after extraction
Each zip extracts to `data/raw/oshwhub/<project_uuid>/`:
```
data/raw/oshwhub/<uuid>/
├── metadata.json        # unified schema (see schemas/project.schema.json)
├── description.md       # title + summary + license
├── cover.{jpg,png}      # cover image (if upstream has one)
├── _urls.json           # raw URL set
└── source/              # EasyEDA Std source project (only for projects with full source)
    ├── <doc_uuid_1>.json
    ├── <doc_uuid_2>.json
    └── ...
```
`source/*.json` is the dataStr returned by the EasyEDA Std API:
- `result.docType` = 1 (schematic) / 3 (PCB) / 2 (symbol library)
- `result.dataStr.shape[]` = array of `VERB~field1~field2~...` strings (LIB / W / N / TRACK / VIA / COPPERAREA …)
- `result.dataStr.canvas` / `layers` / `head` (includes editorVersion)
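
To make the field layout above concrete, the sketch below summarizes one already-parsed `source/*.json` payload: document type, editor version, and a count of shape verbs. It assumes the keys are exactly as listed; the fallback for `dataStr` stored as an encoded JSON string is an assumption, and `summarize_doc` is a hypothetical helper, not part of `tools/epro2/std/`.

```python
import json
from collections import Counter

# docType codes as listed above; keyed by string so int or str input both work
DOC_TYPES = {"1": "schematic", "2": "symbol library", "3": "PCB"}


def summarize_doc(payload):
    """Summarize one parsed source JSON (a dict with a top-level 'result')."""
    result = payload["result"]
    data = result["dataStr"]
    if isinstance(data, str):
        # Assumption: some dumps may keep dataStr as a JSON-encoded string
        data = json.loads(data)
    # The verb is everything before the first '~' in each shape string
    verbs = Counter(shape.split("~", 1)[0] for shape in data.get("shape", []))
    return {
        "doc_type": DOC_TYPES.get(str(result.get("docType")), "unknown"),
        "editor_version": data.get("head", {}).get("editorVersion"),
        "verbs": verbs,
    }
```

Usage would be along the lines of `summarize_doc(json.load(open(path, encoding="utf-8")))` for each file under `source/`.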
The downstream EPRO2/Std → KiCad / Wokwi adapter code already works end to end under `tools/epro2/std/`; see `docs/sources/epro2_to_std_mapping.md` for the field mapping.
---
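## Notes
- **327 attach_only projects**: the upstream API returns an empty `documents` list; these are mostly early PCB-only uploads or abandoned projects. Metadata and attachment URLs are kept, but there is no editor source
- **License not normalized**: the raw oshwhub field values are kept; when filtering by allowlist downstream, watch for case / whitespace / synonyms (e.g. "MIT" vs "MIT License" vs "mit")
- **Small-sample validation (< 100 projects)**: downstream should first sample from `batch1000_std.zip` to get the parsing pipeline working, and only ingest the full set once that is confirmed
- **Reissuing URLs**: temporary script `/tmp/gen_urls.py` (present on both dev1 and the SG box); change `TTL` and rerun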
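
After extraction, attach_only projects can be told apart from projects with source by checking for `source/*.json`, as in the directory layout described earlier. A minimal sketch; `split_projects` is a hypothetical helper and `root` would be the extracted `data/raw/oshwhub/` directory:

```python
from pathlib import Path


def split_projects(root):
    """Return (with_source, attach_only) lists of project directory names."""
    with_source, attach_only = [], []
    for project in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        source_dir = project / "source"
        # A project counts as "with source" only if source/ holds at least one .json
        has_source = source_dir.is_dir() and any(source_dir.glob("*.json"))
        (with_source if has_source else attach_only).append(project.name)
    return with_source, attach_only
```

The lengths of the two lists can then be compared against the per-batch counts in the batch table as a quick integrity check.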