
Knowit fe6971f3f9 tools/epro2: add std/ writer — EPRO2 → EasyEDA Std-format JSON for downstream

The downstream colleague consumes oshwhub Std (lceda) dict-format JSON,
not KiCad. The EPRO2 decryption part (per-doc plaintext .epro2 streams
in data/raw/<uuid>/source/) is what we already provide; the missing
piece is converting EPRO2 op-streams into the same `dataStr.shape`
tilde-delimited format their parser already speaks.
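
As a rough illustration of what "tilde-delimited" means here, the sketch below joins a verb and its positional fields with `~`, and joins nested shapes with the `#@$` separator mentioned in the writer notes. The helper names `encode_shape` and `nest_shapes` are hypothetical, and the field values in the usage lines are placeholders rather than the real Std field order.

```python
def encode_shape(verb, *fields):
    """Join a verb and its positional fields with '~'.

    Hypothetical helper: the real field order for each verb comes from
    the Std format, not from this sketch.
    """
    return "~".join([verb, *(str(f) for f in fields)])


def nest_shapes(shapes):
    """Join already-encoded shapes with the '#@$' separator used for nesting."""
    return "#@$".join(shapes)


# Placeholder fields only, to show the resulting string shape.
track = encode_shape("TRACK", 1, 1, "GND")
lib = nest_shapes([encode_shape("LIB", 10, 20), encode_shape("PAD", "RECT", 10, 20)])
```

The sketch does no escaping, so it assumes no field value itself contains `~` or `#@$`.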

New tools/epro2/std/ module, peer of tools/epro2/kicad/, kept
deliberately separate so the KiCad path stays untouched:

  - pcb_writer.write_pcb_std() — high-fidelity, validated against a Std
    PCB sample at data/raw/oshwhub/3e2f893d.../25931ddab8.json. Maps
    LINE→TRACK, VIA→VIA, POUR→COPPERAREA (with SVG `M..L..Z` path),
    POLY→CIRCLE/SOLIDREGION, COMPONENT+FOOTPRINT→LIB nested with
    #@$-separated PADs (placement rotation + translate applied so pad
    coords land at PCB-absolute positions). Layer-id mapping (EPRO2 5↔7
    flipped vs Std solder/paste, 11→10 outline, 12→11 multi, SIGNAL
    inner 15+ → Std 21+) noted inline.

  - sch_writer.write_sch_std() — best-effort. Our corpus has zero Std
    schematic samples (docType=1) so verb field orders follow the
    EasyEDA Std public spec, not direct observation. Emits W (wire),
    N (net flag, including the 5-Voltage Global Net Name power-port
    pattern), T (text), LIB (placement with #@$-nested PIN/T). If
    downstream's parser bails, the fix is almost certainly a positional
    field tweak, not a re-architecture.

  - __main__.py — flat output `<doc_uuid>.json` per doc directly under
    --out (mirrors Std's own data layout); --all-pcb / --all-sch / --all.
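
The layer remap and pad placement described for pcb_writer can be sketched roughly as follows. This is an illustration, not the module's code: `map_layer` and `place_point` are hypothetical names. Three things are assumptions: layers not listed in the note pass through unchanged, inner layers shift by a constant +6 (15 → 21, 16 → 22, and so on), and rotation is counter-clockwise in degrees.

```python
import math


def map_layer(epro2_layer: int) -> int:
    """Map an EPRO2 layer id to a Std layer id (sketch of the noted rules)."""
    fixed = {5: 7, 7: 5, 11: 10, 12: 11}  # 5<->7 swap, outline, multi
    if epro2_layer in fixed:
        return fixed[epro2_layer]
    if epro2_layer >= 15:
        # Assumption: inner signal layers keep their order, offset by 6.
        return epro2_layer + 6
    # Assumption: every other layer id is identical in both formats.
    return epro2_layer


def place_point(x, y, rotation_deg, dx, dy):
    """Rotate a footprint-local point about the origin, then translate it.

    Assumption: counter-clockwise rotation; the real writer may use the
    opposite sense or a flipped Y axis.
    """
    t = math.radians(rotation_deg)
    c, s = math.cos(t), math.sin(t)
    return (x * c - y * s + dx, x * s + y * c + dy)
```

Applying rotation before translation is what puts pad coordinates at PCB-absolute positions: the pad is first turned about the footprint origin, then moved to where the component is placed.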

Smoke test on ESP-VoCat: 6 PCB + 9 SCH = 15 JSON files, libs_unresolved=0
across the board. Compact JSON (separators=(",",":")) matches Std's
single-line format. Numbers use _num() — integers without trailing .0,
floats trimmed.
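
A minimal sketch of the number and JSON formatting described above. `format_num` is a hypothetical stand-in for `_num()`, whose actual code is not shown here; the 4-digit precision is an assumption. The `separators=(",", ":")` argument is the standard `json.dumps` option quoted in the message.

```python
import json


def format_num(value, digits=4):
    """Render a number without a trailing '.0' and with trailing zeros trimmed.

    Hypothetical stand-in for _num(); the precision is an assumed default.
    """
    if isinstance(value, int):
        return str(value)
    text = f"{value:.{digits}f}".rstrip("0").rstrip(".")
    return "0" if text == "-0" else text


def dump_compact(obj):
    """Serialize to single-line JSON with no spaces after ',' or ':'."""
    return json.dumps(obj, separators=(",", ":"))
```

For example, `format_num(3.0)` gives `"3"` and `format_num(2.50)` gives `"2.5"`, while `dump_compact({"a": [1, 2]})` gives `{"a":[1,2]}`.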

71 → 82 unit tests pass.

Open questions for downstream: (1) confirm SCH verb field orders, (2)
do they want any of the upstream metadata fields we drop (master,
owner, created_at, etc — those live on the crawler side, not the
schematic itself)?

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-29 01:16:39 +08:00

crawlers

crawler: drop SLEEP_SOURCE 5.0 -> 0.5 (Std doc endpoint probe)

2026-04-29 00:54:46 +08:00

data

crawler: --skip-ext + --max-source-mb gates for batch-50 expansion

2026-04-29 00:24:55 +08:00

docs

docs: consolidate rate-limit probe results into a proper benchmark report

2026-04-29 00:57:35 +08:00

schemas

Add EasyEDA Pro 2.x legacy source ingestion (5/5 batch closure)

2026-04-28 21:59:25 +08:00

scripts

crawler: drop SLEEP_SOURCE 5.0 -> 0.5 (Std doc endpoint probe)

2026-04-29 00:54:46 +08:00

tools

tools/epro2: add std/ writer — EPRO2 → EasyEDA Std-format JSON for downstream

2026-04-29 01:16:39 +08:00

.gitattributes

Phase 1 MVP: crawl 10 high-quality oshwhub projects into LFS

2026-04-23 19:34:09 +08:00

.gitignore

crawler: --skip-ext + --max-source-mb gates for batch-50 expansion

2026-04-29 00:24:55 +08:00

CLAUDE.md

update readme

2026-04-26 11:54:01 +08:00

log.md

tools/epro2: add std/ writer — EPRO2 → EasyEDA Std-format JSON for downstream

2026-04-29 01:16:39 +08:00

OSHWHUB_INGEST_SPEC.md

add epec md

2026-04-23 23:42:21 +08:00

plan.md

Allow login content; plan cloud infra, storage tiers, EDA→KiCad conversion

2026-04-23 20:57:30 +08:00

projects.md

projects.md: replace Comments column with 版本 (Version) column (Std / Pro 3.x / Pro 2.x)

2026-04-28 22:01:41 +08:00

pyproject.toml

Add easyeda_pro_source.md: full Pro project-source chain + EPRO2 format analysis

2026-04-24 00:11:32 +08:00

README.md

update readme

2026-04-26 11:54:01 +08:00

README.md

FacereDataset

An open-source hardware design dataset that supplies data for Facere's proprietary model training and its hardware design knowledge base.

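Goals

Collect, clean, and structure publicly available hardware design assets from the internet (schematics, PCBs, BOMs, Gerbers, 3D models, firmware, documentation), producing:

Training datasets: structured corpora that can be fed directly to LLMs / multimodal models for pretraining, SFT, and RAG.
Retrieval knowledge base: a design reference library searchable by component, topology, and application domain.
Derived artifacts: component footprint libraries, templates for common sub-circuits, BOM cost curves, and the like.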

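Data sources (first batch)

| Site | URL | Coverage | License | Complexity | Login |
| --- | --- | --- | --- | --- | --- |
| OSHWHub (立创开源平台) | oshwhub.com | 12,493 public projects (attachments + metadata) | Mostly GPL 3.0 / Public Domain / CC-BY-SA | Medium | Not required |
| LCEDA (立创 EDA) project sources | u.lceda.cn | Schematics + PCB + component JSON | Same as the oshwhub project | Medium | Required (legitimate account, see CLAUDE.md) |
| HF `bshada/open-schematics` | huggingface.co | 10K+ preprocessed KiCad schematics | CC-BY-4.0 | Very low (mirror the whole package) | Not required |
| GitHub | github.com | KiCad / EasyEDA repos | Set per repo | Low (gh API) | Not required |
| Hackaday.io | hackaday.io | Project write-ups + files | Set by author | Medium | Not required |
| CERN OHR | ohwr.org | High-quality, industrial-grade | CERN-OHL | Low | Not required |
| Wikifactory | wikifactory.com | Community projects | Set by author | Medium | Not required |

Runtime environment: a dedicated cloud server (Guangzhou), with login credentials kept together under ~/.secrets/. See docs/infra.md for details (to be created after deployment).

See plan.md for the detailed crawl plan and projects.md for the list of projects ingested so far.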

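Repository layout

FacereDataset/
├── README.md        Project overview (this file)
├── CLAUDE.md        Project-level instructions for Claude Code
├── plan.md          Phased crawl and processing plan
├── log.md           Execution log (newest first)
├── crawlers/        Per-site crawlers (one subpackage per site)
├── schemas/         Unified data schema (project.schema.json)
├── scripts/         Deduplication, format conversion, and integrity-check tools
├── data/            Data output (raw/ processed/; large files go to LFS or external storage)
└── docs/            Design notes, legal compliance, data dictionary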

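Legal and ethics

Outputs are for research only; they are not published or redistributed.
Only content that is publicly accessible and marked as open source, or that explicitly permits redistribution, is crawled.
Every record keeps source_url, author, license, and crawled_at for provenance.
Records will later be checked and cleaned one by one against their licenses (CC-BY requires attribution, CC-BY-SA requires share-alike, and so on).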
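
The four provenance fields named above could be carried on each record roughly like this. It is a sketch only: the `Provenance` class and `make_provenance` helper are hypothetical, and storing `crawled_at` as an ISO 8601 UTC string is an assumption. The actual layout is whatever schemas/project.schema.json defines.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class Provenance:
    """Provenance fields kept on every record (hypothetical container)."""
    source_url: str
    author: str
    license: str
    crawled_at: str  # Assumption: ISO 8601 timestamp in UTC.


def make_provenance(source_url, author, license, now=None):
    """Build a provenance record, stamping the crawl time if none is given."""
    now = now or datetime.now(timezone.utc)
    return Provenance(source_url, author, license, now.isoformat())


# Placeholder values, not a real project entry.
record = asdict(make_provenance("https://oshwhub.com/example", "someone", "CC-BY-SA"))
```

Keeping the license string on every record is what makes the later per-license cleaning pass possible without re-crawling.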

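Quick start

# Clone
git clone https://git.deepknow.site/Facere/FacereDataset.git
cd FacereDataset

# Install (Python 3.11+, uv)
uv sync

# Run one of the crawlers
uv run python -m crawlers.oshwhub --limit 10

The project is currently at the skeleton-initialization stage and the crawlers are not yet implemented. See plan.md, Phase 1.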

Maintenance

Primary maintainer: Charles (git.deepknow.site/Knowit)
Remote: git.deepknow.site/Facere/FacereDataset
Issue tracking: Gitea Issues
