FacereDataset/schemas/project.schema.json
Zhang Jiahao 5ffa10f256 Phase 1 MVP: crawl 10 high-quality oshwhub projects into LFS
Why:
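- Per Charles: first crawl 10 high-quality projects into Gitea LFS, one folder per project,
  keeping the original files and URLs. Validate the schema + LFS pipeline on a small batch first,
  and decide on storage scale before ramping up.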

What:
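- crawlers/oshwhub: list API (`/api/project?sort=hot`) + SSR HTML parsing,
  producing metadata / description / cover / files / _urls in a single pass
- schemas/project.schema.json: unified cross-source schema
- docs/sources/oshwhub.md: API entry points / field mapping / pitfalls found during research
- pyproject.toml: httpx[http2] as the sole dependency
- .gitattributes: everything under data/raw/**/files/** goes through LFS (rule kept narrow to avoid catching schemas/*.json etc.)
- .gitignore: drop the data/raw/* exclusion (now committed via LFS)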

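The 10 projects cover: debugger / hot plate / Geiger counter / digitally controlled power supply / soldering station /
smartwatch / USB current meter / ZVS induction heater / AI dev board / infrared thermal imager.
52 attachments in total ≈ 524 MB go into LFS; selection criteria: grade=4 & likes>=100 & diversity.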
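
The selection rule stated above (grade=4 & likes>=100, then diversity) can be sketched as a simple filter. This is a minimal sketch, not the crawler's actual code: the record keys `grade`, `likes`, and `category`, the function name `select_projects`, and the reading of "diversity" as "at most one project per category" are all assumptions made for illustration.

```python
def select_projects(candidates, grade=4, min_likes=100, limit=10):
    """Pick up to `limit` projects: quality filter first, then one per category."""
    # Hypothetical record keys: 'grade', 'likes', 'category'.
    qualified = [
        p for p in candidates
        if p.get("grade") == grade and p.get("likes", 0) >= min_likes
    ]
    # Most-liked first, so each category is represented by its strongest project.
    qualified.sort(key=lambda p: p["likes"], reverse=True)

    selected, seen = [], set()
    for p in qualified:
        category = p.get("category")
        if category in seen:
            continue  # assumed diversity rule: one project per category
        seen.add(category)
        selected.append(p)
        if len(selected) == limit:
            break
    return selected
```

Sorting before deduplicating keeps the result deterministic for a given candidate list; a real diversity criterion may well be a manual judgment rather than a category key.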

Known gaps (see plan.md § Phase 1.4):
- EasyEDA source JSON requires login (u.lceda.cn); skipped in v0.1
- Project source downloads from fs-web-stream.jlc.com are untested
- Automatic schema validation in scripts/validate.py is not implemented

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 19:34:09 +08:00

96 lines
3.1 KiB
JSON

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://git.deepknow.site/Facere/FacereDataset/schemas/project.schema.json",
  "title": "FacereDataset Project",
  "description": "Unified project record, shared across sources (oshwhub / hackaday / github / ...).",
  "type": "object",
  "required": [
    "source",
    "source_url",
    "project_id",
    "title",
    "author",
    "license",
    "crawled_at",
    "files"
  ],
  "properties": {
    "source": {
      "type": "string",
      "description": "Data source identifier, e.g. 'oshwhub', 'hackaday', 'github'",
      "enum": ["oshwhub", "hackaday", "github", "cern_ohr", "wikifactory", "other"]
    },
    "source_url": { "type": "string", "format": "uri" },
    "project_id": {
      "type": "string",
      "description": "Internal ID on the source site (oshwhub uses a UUID)"
    },
    "title": { "type": "string" },
    "description_short": {
      "type": "string",
      "description": "Short summary (< 200 chars)"
    },
    "description_path": {
      "type": "string",
      "description": "Path of the long-description markdown, relative to this project's directory, e.g. 'description.md'"
    },
    "author": {
      "type": "object",
      "required": ["username"],
      "properties": {
        "username": { "type": "string" },
        "display_name": { "type": "string" },
        "user_id": { "type": "string" }
      }
    },
    "license": {
      "type": "string",
      "description": "Original license string; normalization mapping is done downstream. Use 'unknown' if not known"
    },
    "tags": {
      "type": "array",
      "items": { "type": "string" }
    },
    "created_at": { "type": "string", "format": "date-time" },
    "updated_at": { "type": "string", "format": "date-time" },
    "published_at": { "type": "string", "format": "date-time" },
    "crawled_at": { "type": "string", "format": "date-time" },
    "metrics": {
      "type": "object",
      "additionalProperties": true,
      "description": "Arbitrary source-site statistics (likes/stars/views/forks, etc.)"
    },
    "cover": {
      "type": "object",
      "properties": {
        "url": { "type": "string", "format": "uri" },
        "path": { "type": "string", "description": "Local relative path, e.g. 'cover.png'" }
      }
    },
    "files": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["name", "url"],
        "properties": {
          "name": { "type": "string" },
          "url": { "type": "string", "format": "uri" },
          "path": { "type": "string", "description": "Local relative path, e.g. 'files/xxx.pdf'. If omitted, only the URL is kept" },
          "size": { "type": "integer" },
          "md5": { "type": "string" },
          "sha256": { "type": "string" },
          "ext": { "type": "string" },
          "mime": { "type": "string" },
          "original_id": { "type": "string", "description": "Internal file ID on the source site" }
        }
      }
    },
    "raw_fields": {
      "type": "object",
      "description": "Raw source-site fields that are hard to normalize but worth keeping (grade, download_count, etc.)",
      "additionalProperties": true
    }
  },
  "additionalProperties": false
}
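
Since the commit message lists schema validation in `scripts/validate.py` as not yet implemented, here is a minimal, stdlib-only sketch of the structural rules this schema declares: the top-level `required` list, the `source` enum, top-level `additionalProperties: false`, and the nested `required` keys for `author` and `files` items. It is not a full JSON Schema validator (types and `format` are not checked; a real script would more likely rely on a Draft 2020-12 validator library). The function name `check_record` and every value in the sample record are hypothetical.

```python
# Field lists copied from the schema's "required" and "properties" sections.
REQUIRED = ["source", "source_url", "project_id", "title",
            "author", "license", "crawled_at", "files"]
OPTIONAL = ["description_short", "description_path", "tags", "created_at",
            "updated_at", "published_at", "metrics", "cover", "raw_fields"]
SOURCES = ["oshwhub", "hackaday", "github", "cern_ohr", "wikifactory", "other"]


def check_record(record):
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    for key in REQUIRED:
        if key not in record:
            problems.append(f"missing required field: {key}")
    # The schema sets additionalProperties: false at the top level.
    for key in record:
        if key not in REQUIRED and key not in OPTIONAL:
            problems.append(f"unexpected field: {key}")
    if "source" in record and record["source"] not in SOURCES:
        problems.append(f"source not in enum: {record['source']}")
    # author requires 'username'; each files item requires 'name' and 'url'.
    author = record.get("author")
    if isinstance(author, dict) and "username" not in author:
        problems.append("author.username is required")
    for i, entry in enumerate(record.get("files", [])):
        for key in ("name", "url"):
            if key not in entry:
                problems.append(f"files[{i}].{key} is required")
    return problems


# Hypothetical record; all values are invented for illustration.
record = {
    "source": "oshwhub",
    "source_url": "https://example.com/project/demo",
    "project_id": "00000000-0000-0000-0000-000000000000",
    "title": "Demo project",
    "author": {"username": "demo_user"},
    "license": "unknown",
    "crawled_at": "2026-04-23T19:34:09+08:00",
    "files": [{"name": "demo.pdf", "url": "https://example.com/demo.pdf"}],
    "raw_fields": {"grade": 4},
}
print(check_record(record))  # prints []
```

Returning a list of problems instead of raising on the first one lets a batch run report every issue in a project folder at once.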