FacereDataset/schemas/project.schema.json
Knowit 3282a028c4 Add EasyEDA Pro EPRO2 source ingestion (3/5 batch test)
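Get the EPRO2 source-fetch pipeline working for modern Pro 3.x projects
on oshwhub (origin=pro). 3 of the 5 test projects, all modern Pro,
decode completely (8423 docs / 542 MB plain in total):

- X86 motherboard            7374 docs / 481 MB plain (chain=85, editor=3.2.15)
- 220V desktop power supply   771 docs /  26 MB plain (chain=28, editor=3.2.69)
- ESP-VoCat                   278 docs / 7.5 MB plain (chain=12, editor=3.2.91)

The remaining 2/5 are legacy Pro 2.x projects (立创泰山派 RK3566 and
梁山派). Their project meta returns branch_uuid=null and
editorVersion="2.1.40". There is no git-style chain model: documents
hang directly off the boards[].sch/pcb fields, and the access endpoint
has not been worked out yet. Their metadata is written to metadata.json
and source/ is left empty.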

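Implementation notes:
- fetch_pro_source(): 4-step flow (project → branch HEAD → structures
  → /branches/<B>/histories/<HEAD>, which already returns the complete
  chain, so the ?limit batch endpoint is not needed), then per-history
  AES-128-GCM decryption (16-byte IV, natively supported by
  pycryptodome) + gunzip + splitting into per-doc EPRO2 streams at
  each DOCHEAD
- EPRO2 parsing pitfall: the single `|` at the end of a line is a line
  terminator, not a field separator. Call rstrip("|") before
  split("||"); otherwise the payload JSON parse fails, the error is
  silently swallowed, and cur_doc is never set. On the first run this
  reduced the X86 board's 7374 docs to just 2
- Observed docType values go well beyond BOARD/PCB/SCH/SCH_PAGE and
  also include SYMBOL / FOOTPRINT / DEVICE / BLOB / FONT / CONFIG. Pro
  stores a snapshot of the component library in the project history as
  well, so a downstream EPRO2→KiCad conversion must first load these
  lib docs into the symbol cache
- Pro 2.x and 3.x use different storage models: 3.x uses the branch
  model (working), 2.x uses direct boards[] links (not working yet).
  Discriminator: whether branch_uuid in the project meta is null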
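
The parsing pitfall above can be sketched as follows. This is a minimal illustration, not the project's actual parser. It assumes that every EPRO2 line consists of JSON fields joined by `||`, that a DOCHEAD line carries `"type": "DOCHEAD"` in its first field and the document `uuid` in its second field, and the helper names `parse_epro2_line` and `split_by_dochead` are hypothetical.

```python
import json


def parse_epro2_line(line: str) -> list:
    """Decode one EPRO2 line into a list of JSON fields.

    The trailing single "|" is a line terminator, so it is stripped
    before splitting on the "||" field separator.
    """
    body = line.rstrip("\r\n").rstrip("|")
    return [json.loads(field) for field in body.split("||")]


def split_by_dochead(stream: str) -> dict:
    """Group raw lines per document, switching groups at each DOCHEAD.

    Assumed layout (not confirmed by the source): field 0 is a header
    with "type"; on DOCHEAD lines field 1 holds the document "uuid".
    """
    docs = {}
    cur_doc = None
    for lineno, line in enumerate(stream.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            fields = parse_epro2_line(line)
        except json.JSONDecodeError as exc:
            # Fail loudly: swallowing this error is what leaves cur_doc unset.
            raise ValueError(f"line {lineno}: cannot parse EPRO2 record") from exc
        head = fields[0]
        if isinstance(head, dict) and head.get("type") == "DOCHEAD":
            cur_doc = fields[1]["uuid"]
            docs.setdefault(cur_doc, [])
        if cur_doc is None:
            raise ValueError(f"line {lineno}: record appears before any DOCHEAD")
        docs[cur_doc].append(line)
    return docs
```

With a plain `line.split("||")` the last field would keep its trailing `|`, and `json.loads` would reject it as extra data. Raising instead of swallowing that error makes a sudden drop in document count visible immediately.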

New CLI flags: --with-pro-source / --backfill-pro-source / --pro-cookie /
--origin (server-side filtering of the listing API by the origin
field). crawl_one() automatically dispatches to the Pro fetcher when
origin=pro.
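
A rough sketch of how the flags and the dispatch might fit together. Only the flag names, `origin=pro`, and the `branch_uuid` null check come from this commit message. The flag types and defaults, the function names `build_parser` and `choose_fetcher`, and the returned labels are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="crawler CLI (illustrative)")
    # Flag names are from the commit message; types and defaults are assumed.
    parser.add_argument("--with-pro-source", action="store_true")
    parser.add_argument("--backfill-pro-source", action="store_true")
    parser.add_argument("--pro-cookie", default=None)
    parser.add_argument("--origin", default=None)
    return parser


def choose_fetcher(origin, project_meta: dict) -> str:
    """Return a label for the fetch path (labels are hypothetical)."""
    if origin != "pro":
        return "std"
    # Pro 2.x projects report branch_uuid=null and have no history chain.
    if project_meta.get("branch_uuid") is None:
        return "pro-2x-metadata-only"
    return "pro-3x-branch"
```

For example, `choose_fetcher("pro", {"branch_uuid": None})` selects the metadata-only path, which matches the behaviour described for the two legacy projects.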

schema: docType type relaxed from integer to [integer, string, null]
(covers Std's 1/3 as well as Pro's BOARD/SCH etc.); new message_count
field added.
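
As an illustration of the relaxed schema, the sketch below builds one Pro entry for source_documents and checks docType against the [integer, string, null] union. The function names are hypothetical, and counting non-empty lines as message_count is an assumption about how that field is derived.

```python
import hashlib


def doctype_matches_schema(value) -> bool:
    """True if value fits the JSON Schema union [integer, string, null]."""
    if value is None or isinstance(value, str):
        return True
    if isinstance(value, bool):
        return False  # JSON booleans are not integers, although bool subclasses int
    if isinstance(value, float):
        return value.is_integer()  # 1.0 counts as an integer in draft 2020-12
    return isinstance(value, int)


def make_pro_entry(doc_uuid: str, doc_type, stream: bytes) -> dict:
    """Build a source_documents entry for one per-doc EPRO2 stream."""
    if not doctype_matches_schema(doc_type):
        raise TypeError(f"docType {doc_type!r} is not integer, string or null")
    return {
        "doc_uuid": doc_uuid,
        "docType": doc_type,
        "path": f"source/{doc_uuid}.epro2",
        "size": len(stream),
        "sha256": hashlib.sha256(stream).hexdigest(),
        # Assumption: one message per non-empty line of the stream.
        "message_count": sum(1 for ln in stream.splitlines() if ln.strip()),
    }
```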

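License note: all 5 projects in this batch are NC-SA / GPL and do not
meet the Forge whitelist in §4.2 of the Pro source doc
(MIT/BSD/Apache/CC0/CC-BY/CERN-OHL-P/Unlicense). Under the CLAUDE.md
principle of "research use, no redistribution", storing them raw is
fine; the Forge projection applies the whitelist separately.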
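
The whitelist check at Forge projection time could look roughly like this. The whitelist entries are the ones listed above. The normalization rules and the names `license_family` and `forge_allowed` are assumptions, since the raw license strings and the real normalization mapping are not shown here. Unrecognized spellings deliberately fall through as not allowed.

```python
import re

# Families from the whitelist above, upper-cased for comparison.
FORGE_WHITELIST = {"MIT", "BSD", "APACHE", "CC0", "CC-BY", "CERN-OHL-P", "UNLICENSE"}


def license_family(raw: str) -> str:
    """Reduce a raw license string to a coarse family name.

    Whitespace and underscores become "-", then everything from the
    first version-like component (e.g. "-4.0", "-3-Clause") is dropped.
    """
    text = re.sub(r"[\s_]+", "-", raw.strip().upper())
    return re.sub(r"-V?\d+(\.\d+)*.*$", "", text)


def forge_allowed(raw: str) -> bool:
    # Exact family match, so "CC-BY-NC-SA" does not pass as "CC-BY".
    return license_family(raw) in FORGE_WHITELIST
```

Matching on the exact family rather than on a prefix is the important design choice: a prefix test on "CC-BY" would wrongly admit the NC-SA projects in this batch.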

For full technical details see docs/sources/easyeda_pro_source.md rev 3 + log.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 21:45:52 +08:00

{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://git.deepknow.site/Facere/FacereDataset/schemas/project.schema.json",
"title": "FacereDataset Project",
"description": "统一项目记录。跨源oshwhub / hackaday / github / ...)通用。",
"type": "object",
"required": [
"source",
"source_url",
"project_id",
"title",
"author",
"license",
"crawled_at",
"files"
],
"properties": {
"source": {
"type": "string",
"description": "数据源标识,如 'oshwhub'、'hackaday'、'github'",
"enum": ["oshwhub", "hackaday", "github", "cern_ohr", "wikifactory", "other"]
},
"source_url": { "type": "string", "format": "uri" },
"project_id": {
"type": "string",
"description": "源站点内部 IDoshwhub 用 uuid"
},
"title": { "type": "string" },
"description_short": {
"type": "string",
"description": "简介(< 200 chars"
},
"description_path": {
"type": "string",
"description": "长描述 markdown 相对本项目目录的路径,如 'description.md'"
},
"author": {
"type": "object",
"required": ["username"],
"properties": {
"username": { "type": "string" },
"display_name": { "type": "string" },
"user_id": { "type": "string" }
}
},
"license": {
"type": "string",
"description": "原始许可证字符串;下游做规范化映射。未知标 'unknown'"
},
"tags": {
"type": "array",
"items": { "type": "string" }
},
"created_at": { "type": "string", "format": "date-time" },
"updated_at": { "type": "string", "format": "date-time" },
"published_at": { "type": "string", "format": "date-time" },
"crawled_at": { "type": "string", "format": "date-time" },
"metrics": {
"type": "object",
"additionalProperties": true,
"description": "任意源站点统计likes/stars/views/forks 等"
},
"cover": {
"type": "object",
"properties": {
"url": { "type": "string", "format": "uri" },
"path": { "type": "string", "description": "本地相对路径,如 'cover.png'" }
}
},
"files": {
"type": "array",
"items": {
"type": "object",
"required": ["name", "url"],
"properties": {
"name": { "type": "string" },
"url": { "type": "string", "format": "uri" },
"path": { "type": "string", "description": "本地相对路径,如 'files/xxx.pdf'。缺省表示只保留 URL" },
"size": { "type": "integer" },
"md5": { "type": "string" },
"sha256": { "type": "string" },
"ext": { "type": "string" },
"mime": { "type": "string" },
"original_id": { "type": "string", "description": "源站点内部文件 ID" }
}
}
},
"raw_fields": {
"type": "object",
"description": "不易规范化但想保留的源站原始字段grade、download_count 等)",
"additionalProperties": true
},
"source_format": {
"type": "string",
"description": "EDA 工程源格式标记。如 'easyeda-std'u.lceda.cn/ 'easyeda-pro'pro.lceda.cn EPRO2/ 'kicad'。",
"enum": ["easyeda-std", "easyeda-pro", "kicad", "altium", "eagle", "other"]
},
"source_path": {
"type": "string",
"description": "工程源文件目录,相对本项目目录,如 'source/'。"
},
"source_documents": {
"type": "array",
"description": "工程源文档清单。每条对应一个 schematic / pcb / sheet 文档。",
"items": {
"type": "object",
"required": ["doc_uuid", "path"],
"properties": {
"doc_uuid": { "type": "string" },
"docType": {
"type": ["integer", "string", "null"],
"description": "Std: integer (1=schematic, 3=pcb). Pro: string (BOARD/PCB/SCHEMATIC/SHEET/...). null when unknown."
},
"master": { "type": "string", "description": "Std: 当前 head history hash. Pro 按 doc 没有独立 master。" },
"path": { "type": "string", "description": "本地相对路径。Std: 'source/<doc_uuid>.json'; Pro: 'source/<doc_uuid>.epro2'" },
"size": { "type": "integer" },
"sha256": { "type": "string" },
"message_count": { "type": "integer", "description": "Pro EPRO2 流消息行数。Std 无此字段。" }
}
}
},
"editor_version": {
"type": "string",
"description": "EasyEDA / KiCad 编辑器版本(从 dataStr.head.editorVersion 抽取)。"
}
},
"additionalProperties": false
}