FacereDataset/docs/std_corpus_2026-05.md
Knowit 6aa72faf84 docs: std corpus 2026-05 snapshot + batch-1000/4000/remaining log
Snapshot of full oshwhub std corpus delivery:
- 12,493 projects total, 12,166 (97.4%) with editor source
- 4 sweep batches + 1 early-mixed = 5 zip artifacts in COS GZ + SG buckets
- 30-day SG-region presigned URLs for downstream pickup

log.md tracks the multi-batch sweep including driver bug postmortem
(bash heredoc python3 missed httpx → 26-min run wasted on empty zips,
recovered by switching to uv run).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 10:56:09 +08:00

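# oshwhub Std corpus delivery (2026-05 snapshot)
**Snapshot date**: 2026-05-03
**Data source**: oshwhub.com (origin=std)
**Purpose**: research use only, not for further distribution; downstream colleagues can bulk-ingest it into the EPRO2/Std → KiCad / Wokwi pipeline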
---
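## Overview
| Item | Value |
|---|---:|
| Total oshwhub Std projects (origin=std) | **12,493** |
| With complete, editable editor source | **12,166 (97.4%)** |
| Metadata + attachments only (no editor session upstream) | 327 (2.6%) |
| Total sch + pcb docs (summed over multi-page projects) | **30,488** |
| Source project file size (`.json`, decoded) | 11.79 GB |
| Coverage of the upstream listing pool | 12,493 / 12,493 = **100%** |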
---
## By batch
| Batch | Projects | With source | attach_only | docs | Selection rule |
|---|---:|---:|---:|---:|---|
| `batch_early_std` | 112 | 108 | 4 | 427 | Early mixed crawl (std projects picked up alongside the Pro crawl) |
| `batch1000_std` | 1,000 | 963 | 37 | 2,853 | Top of tier A (like p50=43) |
| `batch4000_std` | 4,000 | 3,884 | 116 | 10,100 | Rest of tier A + tier B + top of tier C |
| `batch_remaining_a_std` | 3,691 | 3,641 | 50 | 8,877 | Mid-rank |
| `batch_remaining_b_std` | 3,690 | 3,570 | 120 | 8,231 | Long tail (grade 0/1) |
| **Total** | **12,493** | **12,166** | **327** | **30,488** | |
---
## License distribution (top 12)
| Count | Share | License |
|---:|---:|---|
| 7,050 | 56.4% | GPL 3.0 |
| 2,384 | 19.1% | Public Domain |
| 543 | 4.3% | CC-BY-NC-SA 3.0 |
| 507 | 4.1% | unknown |
| 377 | 3.0% | MIT |
| 156 | 1.2% | MIT License (same as MIT; the 立创 platform does not normalize it) |
| 147 | 1.2% | CC BY-NC-SA 4.0 |
| 147 | 1.2% | BSD |
| 144 | 1.2% | CERN Open Hardware License |
| 140 | 1.1% | LGPL 3.0 |
| 136 | 1.1% | CC-BY-NC 3.0 |
| 132 | 1.1% | CC BY-NC-SA 3.0 |
| 630 | 5.0% | 21 other values (including TAPR / CC BY / CC0 / null …) |
> When building a normalized license allowlist downstream, six permissive families show up: MIT / BSD / Apache / CC0 / CC-BY / Public Domain. Note that the license field keeps the raw string and is not normalized ("MIT" and "MIT License" count as different keys).
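
One way to approach the allowlist described above is a small alias table applied after basic string cleanup. The sketch below is illustrative only: `ALIASES`, `normalize_license`, and `is_permissive` are hypothetical names, and the alias spellings are guesses that should be checked against the raw values actually present in `metadata.json`.

```python
import re

# Hypothetical alias table: cleaned raw string -> canonical key.
# The spellings are assumptions; extend after inspecting the real data.
ALIASES = {
    "mit": "MIT",
    "mit license": "MIT",
    "bsd": "BSD",
    "apache": "Apache",
    "cc0": "CC0",
    "cc by": "CC-BY",
    "public domain": "Public Domain",
}

# The six permissive families named in the note above.
PERMISSIVE = {"MIT", "BSD", "Apache", "CC0", "CC-BY", "Public Domain"}


def normalize_license(raw):
    """Return a canonical license key, or None if the string is unrecognized."""
    if not raw:
        return None
    # Lowercase, then collapse runs of whitespace, hyphens and underscores
    # into a single space so "CC-BY" and "cc by" compare equal.
    cleaned = re.sub(r"[\s_-]+", " ", raw.strip().lower())
    return ALIASES.get(cleaned)


def is_permissive(raw):
    """True only when the raw string maps to one of the permissive families."""
    return normalize_license(raw) in PERMISSIVE
```

Unrecognized strings map to `None` and are therefore rejected, which keeps the filter conservative: a new spelling has to be added to the table explicitly before it passes.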
---
## EasyEDA Std editor versions (top 10, counted over the 12,166 projects with source)
| Count | Share | Version |
|---:|---:|---|
| 1,192 | 9.8% | 6.4.25 |
| 906 | 7.4% | 6.4.7 |
| 678 | 5.6% | 6.5.5 |
| 564 | 4.6% | 6.5.15 |
| 535 | 4.4% | 6.5.22 |
| 447 | 3.7% | 6.4.20.6 |
| 403 | 3.3% | 6.5.1 |
| 355 | 2.9% | 6.5.23 |
| 350 | 2.9% | 6.5.34 |
| 327 | 2.7% | 6.5.28 |
The remainder is spread across the whole 6.3.x to 6.5.4x range. Downstream parsers only need to branch on the 6.4.x / 6.5.x release lines.
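
As a rough illustration of that branching, a parser could reduce each `editorVersion` string to its release line before dispatching. `parser_branch` is a hypothetical helper, and routing everything outside 6.4.x / 6.5.x to a catch-all `"other"` bucket is an assumption, not something the corpus prescribes.

```python
def parser_branch(editor_version):
    """Map an editorVersion string such as '6.4.20.6' to a release line."""
    parts = (editor_version or "").strip().split(".")
    # Require at least numeric "major.minor"; anything else is unknown.
    if len(parts) < 2 or not (parts[0].isdigit() and parts[1].isdigit()):
        return "unknown"
    line = f"{parts[0]}.{parts[1]}"
    # Assumption: versions outside the two main lines share one fallback path.
    return line if line in ("6.4", "6.5") else "other"
```

Four-part versions like `6.4.20.6` fall into the `6.4` branch because only the first two components are inspected.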
---
## Data delivery
### Dual-bucket replicas (Tencent Cloud COS)
| Region | Bucket |
|---|---|
| ap-guangzhou | `facere-gz-1321068335` |
| ap-singapore | `facere-1321068335` |
### Singapore-region direct download links (valid for 30 days, expire 2026-06-02)
| Object key | Size | Projects | Direct link |
|---|---:|---:|---|
| `batch_early_std.zip` | 93 MB | 112 | [download](https://facere-1321068335.cos.ap-singapore.myqcloud.com/batch_early_std.zip?q-sign-algorithm=sha1&q-ak=AKID6HF1bx6A3jCSXP3UjneIjwwj7JJ8kANN&q-sign-time=1777776835%3B1780368895&q-key-time=1777776835%3B1780368895&q-header-list=host&q-url-param-list=&q-signature=09bf61ec57fbe8d758397c73d981faff47e1086e) |
| `batch1000_std.zip` | 471 MB | 1,000 | [download](https://facere-1321068335.cos.ap-singapore.myqcloud.com/batch1000_std.zip?q-sign-algorithm=sha1&q-ak=AKID6HF1bx6A3jCSXP3UjneIjwwj7JJ8kANN&q-sign-time=1777776835%3B1780368895&q-key-time=1777776835%3B1780368895&q-header-list=host&q-url-param-list=&q-signature=fd003a27c831c0c0337615b36e7f697159f4f83e) |
| `batch4000_std.zip` | 1,378 MB | 4,000 | [download](https://facere-1321068335.cos.ap-singapore.myqcloud.com/batch4000_std.zip?q-sign-algorithm=sha1&q-ak=AKID6HF1bx6A3jCSXP3UjneIjwwj7JJ8kANN&q-sign-time=1777776835%3B1780368895&q-key-time=1777776835%3B1780368895&q-header-list=host&q-url-param-list=&q-signature=427b5f15f2888f630c91b5ca5e0107b2d6b15c4c) |
| `batch_remaining_a.zip` | 1,065 MB | 3,691 | [download](https://facere-1321068335.cos.ap-singapore.myqcloud.com/batch_remaining_a.zip?q-sign-algorithm=sha1&q-ak=AKID6HF1bx6A3jCSXP3UjneIjwwj7JJ8kANN&q-sign-time=1777776835%3B1780368895&q-key-time=1777776835%3B1780368895&q-header-list=host&q-url-param-list=&q-signature=9fa2744d879bd17842129396b2656cb1f56ae81b) |
| `batch_remaining_b.zip` | 891 MB | 3,690 | [download](https://facere-1321068335.cos.ap-singapore.myqcloud.com/batch_remaining_b.zip?q-sign-algorithm=sha1&q-ak=AKID6HF1bx6A3jCSXP3UjneIjwwj7JJ8kANN&q-sign-time=1777776835%3B1780368895&q-key-time=1777776835%3B1780368895&q-header-list=host&q-url-param-list=&q-signature=f5cfa3b6010ac2fd81df36c61e64e9da2d68d8fa) |
| **Total** | **3,898 MB** | **12,493** | |
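Properties of the direct links:
- Presigned URLs, valid for 30 days (they expire at **2026-06-02 ~02:55 UTC**, the end of the `q-sign-time` window, `1780368895`, embedded in the URLs above)
- Each URL embeds `q-ak` (the COS access key ID, a public identifier rather than a secret); it does not contain the SecretKey
- Any machine with public internet access can fetch them with `wget` / `curl -O`
- After expiry, contact me to have them reissued; the URLs themselves cannot be renewed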
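
To check how long a given link remains valid, the signing window can be read back out of the URL. This is a minimal sketch that assumes the `q-sign-time` parameter holds `start;end` in Unix seconds, as it does in the links above; `presigned_window` is a hypothetical helper name.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit


def presigned_window(url):
    """Return (start, end) as UTC datetimes from the q-sign-time parameter."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs percent-decodes %3B, so the value reads "start;end".
    start, end = query["q-sign-time"][0].split(";")

    def to_utc(seconds):
        return datetime.fromtimestamp(int(seconds), tz=timezone.utc)

    return to_utc(start), to_utc(end)
```

Comparing the returned `end` with `datetime.now(timezone.utc)` before a long download run avoids starting a batch against a link that is about to lapse.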
Download example:
```bash
wget -O batch1000_std.zip 'https://facere-1321068335.cos.ap-singapore.myqcloud.com/batch1000_std.zip?...'
```
Or use coscmd (requires credentials):
```bash
coscmd download batch1000_std.zip ./batch1000_std.zip
```
---
## Per-project directory layout after extraction
Each zip extracts into `data/raw/oshwhub/<project_uuid>/`:
```
data/raw/oshwhub/<uuid>/
├── metadata.json       # unified schema, see schemas/project.schema.json
├── description.md      # title + summary + license
├── cover.{jpg,png}     # cover image (if upstream has one)
├── _urls.json          # the set of original URLs
└── source/             # EasyEDA Std source project (only for projects with complete source)
    ├── <doc_uuid_1>.json
    ├── <doc_uuid_2>.json
    └── ...
```
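
After extraction, a quick pass over the tree can confirm that the layout above holds and give per-category counts to compare against the batch table. The sketch assumes that a project without any `source/*.json` file is an attach_only project; `check_project` and `summarize` are hypothetical helpers.

```python
from pathlib import Path


def check_project(project_dir):
    """Classify one extracted project directory by what it contains."""
    root = Path(project_dir)
    if not (root / "metadata.json").is_file():
        return "missing_metadata"
    source = root / "source"
    docs = list(source.glob("*.json")) if source.is_dir() else []
    # Assumption: no source documents means metadata + attachments only.
    return "with_source" if docs else "attach_only"


def summarize(corpus_root):
    """Count project directories per category under data/raw/oshwhub/."""
    counts = {}
    for entry in sorted(Path(corpus_root).iterdir()):
        if entry.is_dir():
            kind = check_project(entry)
            counts[kind] = counts.get(kind, 0) + 1
    return counts
```

A non-zero `missing_metadata` count, or an empty result, is a cheap early signal that an archive was truncated or empty.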
`source/*.json` is the dataStr returned by the EasyEDA Std API:
- `result.docType` = 1 (schematic) / 3 (PCB) / 2 (symbol library)
- `result.dataStr.shape[]` = an array of `VERB~field1~field2~...` strings (LIB / W / N / TRACK / VIA / COPPERAREA …)
- `result.dataStr.canvas` / `layers` / `head` (which includes editorVersion)
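
The fields listed above can be read with a few lines of Python. This is a sketch under stated assumptions rather than the adapter's actual code: it assumes `head` is a JSON object carrying `editorVersion`, tolerates `dataStr` arriving either as an object or as a JSON-encoded string, and uses the hypothetical helper name `summarize_doc`.

```python
import json
from collections import Counter

DOC_TYPES = {1: "schematic", 2: "symbol library", 3: "PCB"}


def summarize_doc(payload):
    """Summarize one parsed source/*.json payload."""
    # Assumption: fall back to the payload itself if "result" is absent.
    result = payload.get("result", payload)
    data = result.get("dataStr", {})
    if isinstance(data, str):  # assumption: dataStr may be a JSON string
        data = json.loads(data)
    raw_type = str(result.get("docType", ""))
    doc_type = DOC_TYPES.get(int(raw_type), "unknown") if raw_type.isdigit() else "unknown"
    # Each shape is "VERB~field1~field2~..."; the verb is the first field.
    verbs = Counter(
        shape.split("~", 1)[0]
        for shape in data.get("shape", [])
        if isinstance(shape, str)
    )
    return {
        "doc_type": doc_type,
        "editor_version": data.get("head", {}).get("editorVersion"),
        "verbs": dict(verbs),
    }
```

Typical use is `summarize_doc(json.loads(path.read_text(encoding="utf-8")))` for each file under `source/`; the verb histogram is a convenient smoke test before running a full conversion.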
The downstream EPRO2/Std → KiCad / Wokwi adapter code already works end to end in `tools/epro2/std/`; see `docs/sources/epro2_to_std_mapping.md` for the field mapping.
---
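## Notes
- **327 attach_only projects**: the upstream API returns an empty `documents`, mostly for early PCB-only uploads or abandoned projects; metadata and attachment URLs are kept, but there is no editable editor source
- **Licenses are not normalized**: the raw oshwhub field values are kept; when filtering against an allowlist downstream, watch for case / whitespace / synonyms (e.g. "MIT" vs "MIT License" vs "mit")
- **Validate on a small sample (< 100 projects)**: downstream users should first sample from `batch1000_std.zip`, get the parsing pipeline running on it, and confirm the results before consuming the full corpus
- **Reissuing URLs**: use the temporary script `/tmp/gen_urls.py` (present on both dev1 and the SG box); change `TTL` and rerun it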
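
For the small-sample validation suggested above, project IDs can be drawn directly from a batch archive without extracting it first. The sketch assumes that entries inside the zip follow the `data/raw/oshwhub/<uuid>/...` layout; `sample_project_ids` and its default values are hypothetical.

```python
import random
import zipfile


def sample_project_ids(zip_path, k=50, seed=0):
    """Pick up to k project directory names from a batch zip, reproducibly."""
    ids = set()
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            parts = name.split("/")
            if "oshwhub" not in parts:
                continue
            idx = parts.index("oshwhub")
            # Keep only names that have something after <uuid>/, so files
            # sitting directly under oshwhub/ are not mistaken for projects.
            if idx + 1 < len(parts) - 1 and parts[idx + 1]:
                ids.add(parts[idx + 1])
    ordered = sorted(ids)  # sort first so the seed gives a stable sample
    return random.Random(seed).sample(ordered, min(k, len(ordered)))
```

An empty return value means the archive contained no project directories at all, which is worth checking before any longer run.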