docs: std corpus 2026-05 snapshot + batch-1000/4000/remaining log
Snapshot of full oshwhub std corpus delivery:
- 12,493 projects total, 12,166 (97.4%) with editor source
- 4 sweep batches + 1 early-mixed = 5 zip artifacts in COS GZ + SG buckets
- 30-day SG-region presigned URLs for downstream pickup

log.md tracks the multi-batch sweep including driver bug postmortem (bash heredoc python3 missed httpx → 26-min run wasted on empty zips, recovered by switching to uv run).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
144
docs/std_corpus_2026-05.md
Normal file
@@ -0,0 +1,144 @@
# oshwhub Std corpus delivery (2026-05 snapshot)

**Snapshot date**: 2026-05-03
**Data source**: oshwhub.com (origin=std)
**Purpose**: research use only, not for redistribution; lets downstream teammates batch-feed the EPRO2/Std → KiCad / Wokwi pipeline

---

## Overview

| Item | Value |
|---|---:|
| Total oshwhub Std projects (origin=std) | **12,493** |
| With complete editor source | **12,166 (97.4%)** |
| Metadata + attachments only (no upstream editor session) | 327 (2.6%) |
| Total sch + pcb docs (each page of a multi-page project counted) | **30,488** |
| Source project file size (`.json`, decoded) | 11.79 GB |
| Upstream listing pool coverage | 12,493 / 12,493 = **100%** |

---

## By batch

| Batch | Projects | With source | attach_only | docs | Selection rule |
|---|---:|---:|---:|---:|---|
| `batch_early_std` | 112 | 108 | 4 | 427 | Early mixed crawl (std projects picked up alongside the Pro crawl) |
| `batch1000_std` | 1,000 | 963 | 37 | 2,853 | Head of tier A, median likes (p50) = 43 |
| `batch4000_std` | 4,000 | 3,884 | 116 | 10,100 | Rest of tier A + tier B + head of tier C |
| `batch_remaining_a_std` | 3,691 | 3,641 | 50 | 8,877 | Mid-rank |
| `batch_remaining_b_std` | 3,690 | 3,570 | 120 | 8,231 | Long tail (grade 0/1) |
| **Total** | **12,493** | **12,166** | **327** | **30,488** | |

---

## License distribution (top 12)

| Count | Share | License |
|---:|---:|---|
| 7,050 | 56.4% | GPL 3.0 |
| 2,384 | 19.1% | Public Domain |
| 543 | 4.3% | CC-BY-NC-SA 3.0 |
| 507 | 4.1% | unknown |
| 377 | 3.0% | MIT |
| 156 | 1.2% | MIT License (same as MIT; the Lichuang platform does not normalize it) |
| 147 | 1.2% | CC BY-NC-SA 4.0 |
| 147 | 1.2% | BSD |
| 144 | 1.2% | CERN Open Hardware License |
| 140 | 1.1% | LGPL 3.0 |
| 136 | 1.1% | CC-BY-NC 3.0 |
| 132 | 1.1% | CC BY-NC-SA 3.0 |
| 630 | 5.0% | 21 other values (including TAPR / CC BY / CC0 / null …) |

> When building a normalized license allowlist downstream, six permissive categories show up: MIT / BSD / Apache / CC0 / CC-BY / Public Domain. Note that the license field keeps the original string and is not normalized ("MIT" and "MIT License" count as different keys).
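
The allowlist normalization described in the note could be sketched roughly as follows. This is a minimal illustration, not the project's code: the `PERMISSIVE` table and the `normalize_license` name are hypothetical, and the table only covers the six permissive categories named above in a few obvious spellings, so downstream code would likely need to extend it.

```python
import re

# Hypothetical lookup: normalized spelling -> canonical permissive category.
# Anything not listed here (GPL, CC-BY-NC-*, unknown, ...) is treated as "not allowlisted".
PERMISSIVE = {
    "mit": "MIT",
    "mit license": "MIT",
    "bsd": "BSD",
    "apache": "Apache",
    "cc0": "CC0",
    "cc by": "CC-BY",
    "public domain": "Public Domain",
}

def normalize_license(raw):
    """Return the canonical permissive category for a raw license string, or None."""
    if not raw:
        return None
    # Lowercase, then collapse runs of whitespace, hyphens and underscores to one space,
    # so "CC-BY", "CC BY" and " cc_by " all map to the same key.
    key = re.sub(r"[\s\-_]+", " ", raw.strip().lower())
    return PERMISSIVE.get(key)

for raw in ["MIT", "MIT License", "mit", "CC BY", "CC-BY-NC-SA 3.0", None]:
    print(repr(raw), "->", normalize_license(raw))
```

Matching on the whole normalized string (rather than a prefix) is deliberate here: it keeps "CC-BY-NC-SA 3.0" from being mistaken for "CC-BY".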

---

## EasyEDA Std editor versions (top 10, counted over the 12,166 projects with source)

| Count | Share | Version |
|---:|---:|---|
| 1,192 | 9.8% | 6.4.25 |
| 906 | 7.4% | 6.4.7 |
| 678 | 5.6% | 6.5.5 |
| 564 | 4.6% | 6.5.15 |
| 535 | 4.4% | 6.5.22 |
| 447 | 3.7% | 6.4.20.6 |
| 403 | 3.3% | 6.5.1 |
| 355 | 2.9% | 6.5.23 |
| 350 | 2.9% | 6.5.34 |
| 327 | 2.7% | 6.5.28 |

The remaining projects are spread across the whole 6.3.x to 6.5.4x range. Downstream parsers only need to branch on the 6.4.x / 6.5.x version lines.
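
Branching on the version line might look like the sketch below. The function names and the returned parser labels are hypothetical placeholders, and routing everything outside 6.4.x / 6.5.x (for example 6.3.x) to a flagged fallback is an assumption, not something this snapshot prescribes.

```python
def version_line(editor_version):
    """Reduce an editorVersion such as '6.4.20.6' to its 'major.minor' line, e.g. '6.4'."""
    parts = editor_version.strip().split(".")
    if len(parts) < 2 or not (parts[0].isdigit() and parts[1].isdigit()):
        raise ValueError(f"unrecognized editorVersion: {editor_version!r}")
    return f"{parts[0]}.{parts[1]}"

def pick_parser(editor_version):
    """Return a (hypothetical) parser label for the given editorVersion."""
    line = version_line(editor_version)
    if line == "6.5":
        return "parser_6_5"
    if line == "6.4":
        return "parser_6_4"
    # Assumption: other lines (e.g. 6.3.x) reuse the 6.4 path but are flagged for review.
    return "parser_6_4_fallback"

for v in ["6.4.25", "6.5.34", "6.4.20.6", "6.3.1"]:
    print(v, "->", pick_parser(v))
```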

---

## Data delivery

### Dual-bucket replicas (Tencent Cloud COS)

| Region | Bucket |
|---|---|
| ap-guangzhou | `facere-gz-1321068335` |
| ap-singapore | `facere-1321068335` |

### Singapore-region direct download links (valid for 30 days, expire 2026-06-02)

| Object key | Size | Projects | Link |
|---|---:|---:|---|
| `batch_early_std.zip` | 93 MB | 112 | [download](https://facere-1321068335.cos.ap-singapore.myqcloud.com/batch_early_std.zip?q-sign-algorithm=sha1&q-ak=AKID6HF1bx6A3jCSXP3UjneIjwwj7JJ8kANN&q-sign-time=1777776835%3B1780368895&q-key-time=1777776835%3B1780368895&q-header-list=host&q-url-param-list=&q-signature=09bf61ec57fbe8d758397c73d981faff47e1086e) |
| `batch1000_std.zip` | 471 MB | 1,000 | [download](https://facere-1321068335.cos.ap-singapore.myqcloud.com/batch1000_std.zip?q-sign-algorithm=sha1&q-ak=AKID6HF1bx6A3jCSXP3UjneIjwwj7JJ8kANN&q-sign-time=1777776835%3B1780368895&q-key-time=1777776835%3B1780368895&q-header-list=host&q-url-param-list=&q-signature=fd003a27c831c0c0337615b36e7f697159f4f83e) |
| `batch4000_std.zip` | 1,378 MB | 4,000 | [download](https://facere-1321068335.cos.ap-singapore.myqcloud.com/batch4000_std.zip?q-sign-algorithm=sha1&q-ak=AKID6HF1bx6A3jCSXP3UjneIjwwj7JJ8kANN&q-sign-time=1777776835%3B1780368895&q-key-time=1777776835%3B1780368895&q-header-list=host&q-url-param-list=&q-signature=427b5f15f2888f630c91b5ca5e0107b2d6b15c4c) |
| `batch_remaining_a.zip` | 1,065 MB | 3,691 | [download](https://facere-1321068335.cos.ap-singapore.myqcloud.com/batch_remaining_a.zip?q-sign-algorithm=sha1&q-ak=AKID6HF1bx6A3jCSXP3UjneIjwwj7JJ8kANN&q-sign-time=1777776835%3B1780368895&q-key-time=1777776835%3B1780368895&q-header-list=host&q-url-param-list=&q-signature=9fa2744d879bd17842129396b2656cb1f56ae81b) |
| `batch_remaining_b.zip` | 891 MB | 3,690 | [download](https://facere-1321068335.cos.ap-singapore.myqcloud.com/batch_remaining_b.zip?q-sign-algorithm=sha1&q-ak=AKID6HF1bx6A3jCSXP3UjneIjwwj7JJ8kANN&q-sign-time=1777776835%3B1780368895&q-key-time=1777776835%3B1780368895&q-header-list=host&q-url-param-list=&q-signature=f5cfa3b6010ac2fd81df36c61e64e9da2d68d8fa) |
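| **Total** | **3,898 MB** | **12,493** | |

Link properties:
- Presigned URLs, valid for 30 days; the `q-sign-time` window embedded in the links ends at 1780368895, i.e. **2026-06-02 02:54:55 UTC**
- Each URL embeds `q-ak` (the COS access key ID, a public identifier rather than a secret); it does not contain the SecretKey
- Any machine with public internet access can fetch them with `wget` / `curl -O`
- After expiry, contact me to have them re-issued; the URLs themselves cannot be renewed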
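
To check how long a given link remains valid, the `q-sign-time` parameter can be decoded directly. The sketch below uses only the standard library; the `presigned_window` name is hypothetical, and it assumes the parameter has the `start;end` Unix-timestamp form seen in the links above.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit

def presigned_window(url):
    """Return the (start, end) of a COS presigned URL's q-sign-time window as UTC datetimes."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs percent-decodes the value, so '%3B' becomes ';'.
    start, end = query["q-sign-time"][0].split(";")
    return tuple(datetime.fromtimestamp(int(ts), tz=timezone.utc) for ts in (start, end))

# Trimmed-down link: only the parameter of interest is kept for brevity.
url = ("https://facere-1321068335.cos.ap-singapore.myqcloud.com/batch1000_std.zip"
       "?q-sign-algorithm=sha1&q-sign-time=1777776835%3B1780368895")
start, end = presigned_window(url)
print("signed at :", start.isoformat())
print("expires at:", end.isoformat())
print("remaining :", end - datetime.now(timezone.utc))
```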

Download example:
```bash
wget -O batch1000_std.zip 'https://facere-1321068335.cos.ap-singapore.myqcloud.com/batch1000_std.zip?...'
```

Or with coscmd (credentials required):
```bash
coscmd download batch1000_std.zip ./batch1000_std.zip
```
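
Because the commit message mentions a sweep run that produced empty zips, it may be worth sanity-checking each archive after download. The following is a minimal sketch with the standard `zipfile` module; `summarize_zip` is a hypothetical helper, and the `data/raw/oshwhub/` member prefix is an assumption about how entries are stored inside the archives.

```python
import zipfile

def summarize_zip(path, prefix="data/raw/oshwhub/"):
    """Verify member CRCs and return the number of distinct project directories under `prefix`."""
    with zipfile.ZipFile(path) as zf:
        bad = zf.testzip()  # name of the first corrupt member, or None if all are fine
        if bad is not None:
            raise ValueError(f"{path}: corrupt member {bad}")
        projects = set()
        for name in zf.namelist():
            if name.startswith(prefix):
                uuid = name[len(prefix):].split("/", 1)[0]
                if uuid:
                    projects.add(uuid)
    if not projects:
        raise ValueError(f"{path}: no project directories found (empty zip?)")
    return len(projects)
```

For `batch1000_std.zip`, the returned count would be expected to match the 1,000 projects listed in the table, provided the prefix assumption holds.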

---

## Per-project directory layout after extraction

Each zip extracts to `data/raw/oshwhub/<project_uuid>/`:

```
data/raw/oshwhub/<uuid>/
├── metadata.json          # unified schema, see schemas/project.schema.json
├── description.md         # title + summary + license
├── cover.{jpg,png}        # cover image (if upstream has one)
├── _urls.json             # original URL set
└── source/                # EasyEDA Std source project (only for projects with full source)
    ├── <doc_uuid_1>.json
    ├── <doc_uuid_2>.json
    └── ...
```
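
Given this layout, projects with editor source can be told apart from attach_only ones by whether `source/` contains any `.json` document. A minimal sketch, assuming the tree has been extracted under some root directory; `classify_projects` is a hypothetical helper name.

```python
from pathlib import Path

def classify_projects(root):
    """Split project directories under `root` into (with_source, attach_only) name lists."""
    with_source, attach_only = [], []
    for project in sorted(Path(root).iterdir()):
        if not project.is_dir():
            continue
        source = project / "source"
        # A project counts as "with source" if source/ exists and holds at least one .json doc.
        has_docs = source.is_dir() and any(source.glob("*.json"))
        (with_source if has_docs else attach_only).append(project.name)
    return with_source, attach_only

# Usage (path as in the layout above):
# with_source, attach_only = classify_projects("data/raw/oshwhub")
# print(len(with_source), len(attach_only))
```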

Each `source/*.json` is the document as returned by the EasyEDA Std API (the `dataStr` payload):
- `result.docType` = 1 (schematic) / 3 (PCB) / 2 (symbol library)
- `result.dataStr.shape[]` = array of `VERB~field1~field2~...` strings (LIB / W / N / TRACK / VIA / COPPERAREA …)
- `result.dataStr.canvas` / `layers` / `head` (includes editorVersion)

The downstream EPRO2/Std → KiCad / Wokwi adapter code already works end to end in `tools/epro2/std/`; see `docs/sources/epro2_to_std_mapping.md` for the field mapping.
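
Independently of that adapter code, the `shape[]` encoding described above can be explored with a few lines of Python. This is a rough sketch: `shape_verbs` is a hypothetical helper, the sample field values are made up for illustration, and the handling of a string-encoded `dataStr` is an assumption in case some documents store it undecoded.

```python
import json
from collections import Counter

def shape_verbs(doc):
    """Count the leading VERB of every `VERB~field1~...` entry in result.dataStr.shape."""
    data = doc["result"]["dataStr"]
    if isinstance(data, str):  # assumption: dataStr may still be a JSON-encoded string
        data = json.loads(data)
    return Counter(entry.split("~", 1)[0] for entry in data.get("shape", []))

# Made-up sample document; real files would be loaded with json.load(open(path)).
sample = {
    "result": {
        "docType": 3,
        "dataStr": {"shape": ["TRACK~a~b", "VIA~c~d", "TRACK~e~f"]},
    }
}
print(shape_verbs(sample))
```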
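
---

## Notes

- **327 attach_only projects**: the upstream API returns an empty `documents`; these are mostly early PCB-only uploads or abandoned projects. Metadata and attachment URLs are kept, but there is no editor source
- **Licenses are not normalized**: the original oshwhub field value is kept; when filtering by allowlist downstream, watch for case / whitespace / synonyms (e.g. "MIT" vs "MIT License" vs "mit")
- **Validate on a small sample (< 100 projects)**: downstream consumers should first draw a sample from `batch1000_std.zip` and get the parsing pipeline working, then ingest the full corpus once it checks out
- **Re-issuing URLs**: use the ad-hoc script `/tmp/gen_urls.py` (present on both dev1 and the SG box); change `TTL` and rerun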
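
For the small-sample validation suggested above, a reproducible draw of project directories could look like this. It is only a sketch: `sample_projects`, the default sample size of 50, and the fixed seed are illustrative choices, not part of the delivery.

```python
import random
from pathlib import Path

def sample_projects(root, k=50, seed=0):
    """Pick up to `k` project directories under `root`, reproducibly for a given seed."""
    projects = sorted(p for p in Path(root).iterdir() if p.is_dir())
    rng = random.Random(seed)  # local RNG so the global random state is left untouched
    return rng.sample(projects, min(k, len(projects)))

# Usage, after extracting batch1000_std.zip:
# for project in sample_projects("data/raw/oshwhub", k=50):
#     print(project.name)
```

Sorting before sampling keeps the draw independent of filesystem listing order, so two machines with the same extracted tree should get the same sample.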