FacereDataset

Author	SHA1	Message	Date
Knowit	3720cd176a	tools/epro2/std: add Pro 2.x JSON path — Liangshan + Taishan SCH now exportable The downstream colleague's "encrypted_external" / "string old format" projects were Pro 2.x, not Pro 3.x EPRO2. Pro 2.x ships each doc as a JSON file whose `dataStr` is a plaintext op-stream — one JSON array per line, e.g. `["COMPONENT","e1","",0,0,0,0,{},0]`. Different wire format from EPRO2's binary tilde/pipe streams; same Std envelope works for output. - tools/epro2/std/pro2_writer.py: parses dataStr line-by-line, keys objects by id (position 1 for most ops, OPTYPE for singletons), extracts BBox by walking known coord positions per OPTYPE, derives layers from LAYER ops directly (Pro 2.x almost matches Std layer string format already). PCB blobs that are encrypted-external (`dataStrId` URL + `iv` + `key`, no inline dataStr — Taishan PCB) return None so the CLI skips with a message instead of stubbing. - tools/epro2/std/__main__.py: auto-detect via manifest's editor_version. "2.x" → Pro 2.x writer; otherwise the existing EPRO2 replay path. CLI surface and output layout unchanged. - docs/sources/epro2_to_std_mapping.md: adds a Pro 2.x section. Adapter dispatches on `head.epro_format`: absent / "epro2" gets dict-shaped objects values, "pro2" gets array-shaped values (`[OPTYPE, arg1, ...]`). Lists the Pro 2.x-specific OPTYPEs (FONTSTYLE / LINESTYLE / CONNECT / OBJ / REGION / DIMENSION / STRING / TEARDROP) the EPRO2 vocabulary doesn't have. Smoke (re-running --all on all 5 Pro projects): 191 → 222 JSON files. Liangshan adds 3 (2 SCH + inline 5357-object PCB). Taishan adds 28 (SCH only — PCB skipped, encrypted-external; source/<uuid>.json still keeps the dataStrId/iv/key for a later fetch+decrypt pass). 84 → 86 unit tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 02:00:37 +08:00
Knowit	3866e24189	tools/epro2/std: rewrite to Option 2 (objects dump) per downstream spec Downstream came back with concrete requirements: don't pre-compute Std shape[] tilde strings, just dump the raw EPRO2 `objects: {id: payload}` dict and they'll write a ~100-LoC adapter on their side. Pulling the tilde-mapping work back saves us from second-guessing positional fields without their parser to verify against, and shortens our pcb_writer from ~500 lines to ~40. Output shape (Std envelope intact, just no `shape[]`): { "success": true, "code": 0, "result": { "uuid", "puuid", "title", "docType": 3 | 1, "components": {}, "dataStr": { "head": { "docType": "3" | "1", "editorVersion": "facere-epro2/0.1 (epro2 <X.Y.Z>)", "units": "mil", "epro2_doc_uuid": ..., "epro2_editor_version": ..., }, "BBox": {x, y, width, height}, # mil "layers": [...], # Std layer-string array "objects": dict(doc.objects), # raw EPRO2, 1:1 "preference": {}, "netColors": [], "DRCRULE": {}, } } } Per-doc spec downstream gave us: - shape[] dropped (empty placeholder misleads adapter) - all units mil (no mm conversion; Std canvas already declares mil) - head.units="mil" so adapter doesn't have to guess - BBox min/max across known x/y/startX/endX/centerX fields; adapter can refine by walking path arrays itself - layers[] keeps Std's 17-line default + inner SIGNAL layers actually used (21~Inner1.., 22~Inner2..) - empty stubs preference/netColors/DRCRULE for grep-based triage New: docs/sources/epro2_to_std_mapping.md with the full EPRO2 OPTYPE → Std verb table that downstream's adapter authors will copy from. Tables include the layer-id remapping (the 5↔7 paste/mask flip, 11→10 outline, 12→11 multi, SIGNAL 15+→21+), PCB op mappings, SCH op mappings (marked best-effort: no Std SCH samples in our corpus), and the 5-Voltage placeholder COMPONENT → extra net flag trick. Extracted from the previous Option-3 writer (commit `fe6971f`) so adapter writers don't have to reverse-engineer it from source. ESP-VoCat smoke: 6 PCB + 9 SCH = 15 JSON files, head.units=mil preserved, no shape[] field present. 82 → 84 unit tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 01:41:12 +08:00
Knowit	ed713fa557	docs: consolidate rate-limit probe results into a proper benchmark report The doc had been growing incrementally as each host got probed; reshape it as a polished benchmark with TL;DR top, methodology section (including safety constraints + caveats), per-host detailed tables, final crawler settings, batch-50 walltime breakdown, and a reproduce recipe. Five hosts fully covered: pro.lceda.cn API 5.0s -> 0.5s (10×) lceda.cn doc 5.0s -> 0.5s (10×) oshwhub detail 2.0s -> 1.0s ( 2×) oshwhub listing 2.0s -> 1.0s ( 2×) modules.lceda CDN 0.2s (already optimized) Net effect on batch-50 plan: sleep total ~32min -> ~3min, walltime ~2h -> ~10-15min. Key finding: the original 5s/req on Pro was set out of "logged-in account is precious" caution with zero empirical evidence. Sustained burst probe (25 distinct UUIDs at 0.5s, no recovery) showed 0/25 errors and median latency 410ms — the caution was unjustified. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 00:57:35 +08:00
Knowit	183f82a3be	crawler: drop SLEEP_SOURCE 5.0 -> 0.5 (Std doc endpoint probe) Ladder probe lceda.cn/api/documents/<uuid>: 5 tiers (5/2/1/0.5/0.25s) × 9 distinct Std doc UUIDs = 45 reqs total, all 200/success. Latency variance is dominated by payload size (Std docs span 4 KB to 4.5 MB), not server backpressure. Same posture as Pro API. Net effect on batch-50 estimate: Std 25 projects × 10 doc calls saved ~19 min wall time (21min sleep -> 2min sleep). Combined plan now projects ~2h -> ~10min walltime exclusive of download bytes. scripts/probe_rate_limit.py: --host std-doc tier added. Reads doc UUIDs from /tmp/std_doc_uuids.json (assembled by caller from any source/manifest.json upstream_version_documents lists). Reusable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 00:54:46 +08:00
Knowit	cb868988b9	crawler: drop sleep rates 10x for Pro API, 2x for oshwhub detail Calibrated against ladder probes on 2026-04-29. Findings in docs/sources/probe_rate_limit_results.md. SLEEP_PRO 5.0 -> 0.5 (pro.lceda.cn API) SLEEP_BETWEEN 2.0 -> 1.0 (oshwhub detail/listing) SLEEP_SOURCE 5.0 unchanged (lceda.cn Std endpoints — not yet probed) SLEEP_PRO_CDN 0.2 unchanged (modules.lceda.cn — already optimized) The original 5s rate for Pro API was set out of caution because Pro requires a logged-in cookie. Empirical sustained-burst probe (25 distinct UUIDs at 0.5s sleep, no recovery): 0/25 errors, median latency 410ms, p90 932ms. The "Pro is rate-sensitive" assumption was wrong — server tolerates QPS=2 cleanly. oshwhub detail HTML pages slowed from p90 6.4s at 1.0s sleep to p90 15s at 0.5s — server queue backs up. 1.0s is the headroom-safe water mark. Net effect on batch-50 estimate: ~1.5h -> ~30min. scripts/probe_rate_limit.py: rate-limit ladder probe tool. Reusable for new endpoints (Std source still owes a probe). Designed for safety: 30s tier recovery, low rep counts on auth hosts, bail on first non-200. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 00:45:34 +08:00
Knowit	fc2a45f658	docs: explain per-doc .epro2 crawl vs web-export .epro2 ZIP Colleague-facing explainer at docs/sources/pro_crawl_vs_export.md. Addresses the "I see 278 .epro2 files but my browser only downloaded one" confusion: web download is a ZIP container (extension is a UX choice, not a format), our crawl produces per-doc message streams. Both carry equivalent EPRO2 data; only real gap is IMAGE/ binary previews which we don't fetch yet. Why per-doc and not ZIP: the ZIP path has no public endpoint — three HARs confirm the export button fires zero HTTP requests, it's pure client-side JSZip on data already loaded by the editor. Our crawler hits the same chain endpoints the editor uses internally, which delivers per-doc streams. Log entry references the 278 vs 266 doc-count delta for ESP-VoCat (we walk full history chain, web export is a current snapshot). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 00:13:52 +08:00
Knowit	d89a7cdf9c	oshwhub: dump full listing index (33,695 projects) for batch sizing Probed listing API and learned: total field is exposed (Pro=21,202 / Std=12,493), pageSize accepts >=1000 (full corpus = 35 requests / 71s), sort param is silently ignored. Dump all listings via scripts/dump_listing_index.py to local jsonl so downstream batch-selection no longer hits the API. Why: needed quantitative anchors before scaling Pro batch beyond top-5. License is detail-page only (~19h serial scan), so we want to filter on grade/like locally first to shortlist before paying that cost. Quality-tier counts now known: A-tier (grade>=3 & like>=10) = 2,806 across both origins. - scripts/dump_listing_index.py: one-shot scraper, polite QPS, streams to jsonl - docs/sources/oshwhub_listing_full.md: human-readable report with growth trends, quality tiers, owner concentration, and storage-budget anchors Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 23:30:56 +08:00
Knowit	c6279bff08	Add EasyEDA Pro 2.x legacy source ingestion (5/5 batch closure) Fills in the 2 legacy Pro projects that failed in the previous batch (立创·泰山派 RK3566 and 立创·梁山派, i.e. Taishan and Liangshan) and gets the source-fetch chain for legacy Pro 2.x projects working. Together with the modern Pro 3.x path from the previous commit, EPRO2/dataStr now works end to end for all 5/5 Pro projects in this repo. Pro 2.x and Pro 3.x are two completely different storage models: - Pro 3.x: git-style branch + linear history chain, AES-128-GCM-encrypted incremental EPRO2 message streams, replayed per history (working since the previous commit) - Pro 2.x: no branch, no history. Documents are stored as EasyEDA Std plaintext dataStr (same ["DOCTYPE","SCH","1.1"] format) and fetched in bulk by doc UUID via /api/v2/documents/lists; the body is unencrypted, only the component library goes through AES The Pro 2.x fetch chain was reverse-engineered from a HAR (tmp/prodownload3.har, 178 requests): GET /api/v4/projects/<P> → boards: [{sch, pcb, name}] GET /api/projects/<P>/ticket?uuid=&g_ticket=-1 → full project manifest POST /api/schematic/lists {uuids:[<sch>]} → sort: [{uuid:<sheet>}] POST /api/v2/documents/lists {uuids,docType:1} → schematic plaintext POST /api/v2/documents/lists {uuids,docType:3} → PCB plaintext POST /api/coppers/search {paths} → copper pour layers POST /api/textpath/search {paths,project_uuid}→ fonts/text POST /api/v2/resources/search {hash,project_uuid} → BLOB images Implementation: - crawlers/oshwhub/crawler.py: - fetch_pro_source() refactored into a dispatcher: GET the project meta first and check branch_uuid; null means legacy and goes to _fetch_pro_legacy(), non-null goes to _fetch_pro_modern() - _fetch_pro_legacy() added (pulls all docs + auxiliary layers following the 9-step flow above) - _pro_post_json() POST helper (mirrors _pro_get_json) - schemas/project.schema.json: adds easyeda-pro-legacy to the source_format enum - docs/sources/easyeda_pro_source.md rev 4: §1.1 legacy-vs-modern discrimination table updated, §2.7 adds the legacy fetch flow + measured data On-disk layout (legacy): source/ticket.json full manifest source/<sheet_uuid>.json one per schematic sheet (includes dataStr) source/pcb_<pcb_uuid>.json one per PCB source/coppers.json/textpath.json/blobs.json auxiliary PCB layer resources source/manifest.json index Measured: 立创·梁山派 editor=2.1.30, 2 sheets+1 pcb, 1.0 MB, 78 sym/191 fp/128 dev 立创·泰山派 RK3566 editor=2.1.40, 29 sheets+1 pcb, 0.8 MB, 299 sym/524 fp/295 dev Legacy projects are two orders of magnitude smaller than modern ones (梁山派 1 MB vs RK3576 66 MB): there is no incremental history, the component library is served from a separate endpoint, and the data is already the current snapshot. Final summary for the 5/5 Pro projects: X86 motherboard easyeda-pro 3.2.15 7374 docs / 481 MB 泰山派 RK3566 easyeda-pro-legacy 2.1.40 30 docs / 0.8 MB 梁山派 easyeda-pro-legacy 2.1.30 3 docs / 1.0 MB 220V desktop power supply easyeda-pro 3.2.69 771 docs / 26 MB ESP-VoCat easyeda-pro 3.2.91 278 docs / 7.5 MB Total: 8456 docs / ~516 MB plain. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 21:59:25 +08:00
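Knowit	3282a028c4	Add EasyEDA Pro EPRO2 source ingestion (3/5 batch test) Gets the EPRO2 source-fetch chain working for modern Pro 3.x projects on oshwhub (origin=pro). 3/5 modern Pro projects fully decoded (8423 docs / 542 MB plain in total): - X86 motherboard 7374 docs / 481 MB plain (chain=85, editor=3.2.15) - 220V desktop power supply 771 docs / 26 MB plain (chain=28, editor=3.2.69) - ESP-VoCat 278 docs / 7.5 MB plain (chain=12, editor=3.2.91) The remaining 2/5 are legacy Pro 2.x (立创泰山派 RK3566, 梁山派): the project meta returns branch_uuid=null + editorVersion="2.1.40", there is no git-style chain model, documents hang directly off the boards[].sch/pcb fields, and the access endpoints have not been worked out yet; metadata is written to metadata.json and source/ is left empty. Implementation notes: - fetch_pro_source(): 4-step flow (project → branch HEAD → structures → /branches/<B>/histories/<HEAD>, which already returns the full chain, so the ?limit batch endpoint is not needed) + per-history AES-128-GCM decryption (16-byte IV, natively supported by pycryptodome) + gunzip + splitting into per-doc EPRO2 streams at DOCHEAD - EPRO2 parsing pitfall: a single trailing `|` at end of line is the line terminator, not a field separator; you must rstrip("|") first and then split("||"), otherwise the payload JSON parse failure is silently swallowed and cur_doc never gets set → in the first run only 2 of the X86 board's 7374 docs were extracted - In practice docType goes well beyond BOARD/PCB/SCH/SCH_PAGE and also includes SYMBOL / FOOTPRINT / DEVICE / BLOB / FONT / CONFIG: Pro stores a snapshot of the component library in the history alongside the project, so a downstream EPRO2→KiCad conversion must load these lib docs into the symbol cache first - Pro 2.x and 3.x are different storage models: 3.x uses the branch model (working), 2.x uses direct boards[] links (not yet working); discriminator: whether branch_uuid in the project meta is null CLI adds --with-pro-source / --backfill-pro-source / --pro-cookie / --origin (server-side filtering of the listing API by the origin field); crawl_one() automatically dispatches to the Pro fetcher when origin=pro. Schema: docType type relaxed from integer to [integer, string, null] (to accommodate Std's 1/3 and Pro's BOARD/SCH etc.); adds a message_count field. License note: all 5 projects in this batch are NC-SA / GPL and do not meet the Forge whitelist in Pro source doc §4.2 (MIT/BSD/Apache/CC0/CC-BY/CERN-OHL-P/Unlicense). Under the CLAUDE.md "research use, no redistribution" principle, storing them raw is fine; the whitelist is applied separately when projecting into Forge. Technical details: docs/sources/easyeda_pro_source.md rev 3 + log.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 21:45:52 +08:00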
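Knowit	d874278bc5	Add EasyEDA Std project source ingestion (10 boards backfilled) Gets the project-source (schematic + PCB dataStr) fetch chain working for oshwhub origin=std projects. plan.md §1.6 originally assumed login was required; in practice lceda.cn/api/documents/<doc>?uuid=<doc>&path=<doc> is anonymously accessible for public projects: no cookie needed, no risk of account bans. Investigation: 4 rounds of probing left traces in data/state/std_probe[1-5]/ (gitignored); the ajaxDetail endpoint was found by digging through the main.min.js bundle of Std editor v6.5.51; responses come in two shapes depending on docType (schematic project view vs PCB document view). Crawler: - make_source_client() uses a browser UA + lceda.cn/editor Referer, because the oshwhub /api/project/<uuid> endpoint rejects the FacereDataset/0.1 UA (CLAUDE.md UA exception clause: target site actively blocks custom UAs + public static resources) - fetch_std_source(): project meta → version_documents → per-document dataStr → written to source/<doc>.json + source/manifest.json - --with-source (also fetch source when crawling new projects) / --backfill-source (only scan existing ones) - self-imposed QPS ≤ 0.2 (SLEEP_SOURCE = 5s) Schema: adds source_format / source_path / source_documents / editor_version (the first 3 are pinned in an enum, to ease later alignment with Pro / KiCad sources). Backfill result: 10/10 succeeded, 45 documents, 33.2 MB; schema validation passes for all. docTypes are mainly 1 (schematic) and 3 (pcb); the USB voltage/current meter has only PCB documents (4: main board + cover + base plate + front panel; the author did not upload the schematic source). Full investigation: docs/sources/easyeda_std_source.md.	2026-04-28 20:07:40 +08:00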
Zhang Jiahao	a3942c03df	Update EasyEDA Pro source research	2026-04-24 00:40:18 +08:00
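Zhang Jiahao	a16cb11c7d	Add easyeda_pro_source.md: full Pro project-source chain + EPRO2 format analysis Why: - The project-source fetch chain for pro.lceda.cn (立创 EDA 专业版, the Pro edition) now works: 4-step API + AES-128-GCM decryption + gzip decompression + EPRO2 message-stream parsing. All of this needs to be written up and kept as a standalone document so it is not lost; it also paves the way for implementing or selecting an EPRO2 → KiCad converter later. - Stands alongside oshwhub.md (Std edition) as a separate investigation document: Pro and Std are two independent editors with different cookies, APIs and formats, and mixing them would only cause confusion. What: - docs/sources/easyeda_pro_source.md: * TL;DR table + §1 Std vs Pro comparison * §2 4-step API chain + required headers (Editor-Version/path/Referer/Cookie) + Python decryption code + measured data (2.7 MB source stream / 8357 messages) * §3 full classification of the EPRO2 format: 40 message types grouped by function (parts/geometry/PCB/layers/rules/...) + a sample for each class * §4 security and compliance (risk control / license / key-leak semantics) * §5 gap table for integration with Forge (OSHWHUB_INGEST_SPEC.md) * §6 7 known unverified items * Appendix A one-shot rerun command - pyproject.toml: + pycryptodome>=3.23.0 (dependency for AES-GCM decryption) - log.md: record of this session Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 00:11:32 +08:00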
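Zhang Jiahao	b0ddcf3f14	Allow login content; plan cloud infra, storage tiers, EDA→KiCad conversion Why: - Policy change: content accessible only after login moves from "forbidden" to "in scope for this project", with explicit red lines for credential management (legitimate accounts, never committed to git, isolated on the cloud server). This unlocks the u.lceda.cn project-source JSON, a key upgrade to training-data quality. - "Storage" and "runtime environment" had been vague in the plan; there is now a clear path based on the Guangzhou cloud server provided by Charles + tiered storage evolution (Gitea LFS → object storage). - Bridging the oshwhub (EasyEDA) and bshada/open-schematics (KiCad) ecosystems requires an EDA→KiCad batch conversion script. It goes into the plan now and will be implemented once project sources are in hand. What: - CLAUDE.md: the logged-in-content clause changes from "do not crawl" to "may crawl with a legitimate account"; credentials are fixed to live in ~/.secrets/, incidents are recorded in docs/secrets.md; compliance red lines updated to match - plan.md §0.5: new infrastructure section (machine setup / scheduling / obtaining login state) - plan.md §1.4: tiered storage evolution (< 50 GB cloud disk, 50-200 GB evaluate, > 200 GB migrate to object storage) - plan.md §1.6: crawl u.lceda.cn project sources while logged in - plan.md §1.7: batch processing with scripts/convert_to_kicad.py, candidate tool easyeda2kicad.py - plan.md risk table: adds three entries (account ban / conversion failure / cloud-server single point of failure) - docs/sources/oshwhub.md: u.lceda.cn moves from "not open" to "login required, in scope" - README.md data-source table: adds a "login state" column + runtime-environment notes - log.md: record of this policy change Not changed: docs/infra.md not added (to be written once the machine is in place and real details are known); scripts/convert_to_kicad.py not yet implemented (waiting for project-source samples). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 20:57:30 +08:00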
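Zhang Jiahao	ba501c328c	Remove personal name from suggestion/decision phrasing Why: - Phrasings such as "给 Charles 的建议" (suggestion for Charles), "待 Charles 拍板" (pending Charles's sign-off) and "需要 Charles 决策" (needs Charles's decision) tie a specific person into the docs and go stale when the maintainer changes. Switching to the neutral "建议 / 待决策 / 待拍板" (suggestion / pending decision / pending sign-off) makes the docs more general for future collaborators and agents. What: - log.md: removes "给 Charles / 还是需要 Charles 决策 / 等 Charles 拍板" in four places - plan.md: removes "待 Charles / Charles 定目标 / 需要 Charles 定" in three places - docs/sources/hf_bshada_open_schematics.md: "待 Charles 决策" → "待决策" - scripts/estimate_size.py: docstring drops "给 Charles 一个估计" (an estimate for Charles) - CLAUDE.md: the data-deletion confirmation rule changes from "先跟 Charles 确认" (confirm with Charles first) to "先跟用户确认" (confirm with the user first) The remaining mentions of Charles are all factual: - "维护者：Charles" (maintainer: Charles) in README/plan (identity field) - "Charles 要求..." / "Charles 点名..." in historical log.md entries (records of past events) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 20:01:52 +08:00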
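Zhang Jiahao	ed4837dedf	Rewrite oshwhub.md as canonical data source investigation Why: - Charles asked for the verification of the 12493 total + the 90-project sampling results to be merged into the main investigation document, removing the duplication and scatter between oshwhub_corpus_estimate.md and oshwhub.md. - A high-quality data-source investigation should be self-contained: anyone (human or agent) who reads it can reproduce the crawl / estimates / compliance judgment without piecing things together across files. What: - docs/sources/oshwhub.md rewritten as 9 sections + appendix: - TL;DR table (one-page core facts) - site architecture / robots / API entry points / project-detail SSR / attachment CDN - exclusions: fs-web-stream.jlc.com promotional icons / u.lceda.cn login-gated sources - §4 project-total verification (new): all three sort orders agree on 12493 + pagination boundary found by bisection at ≈250 pages + grade-coverage sampling - §5 sampled-corpus characteristics (merged in from corpus_estimate): size median 9MB / p90 54MB, video accounts for 54%, license distribution GPL 3.0 49% / Public Domain 21% - risk table with 7 entries, rerun commands in the appendix - deletes docs/sources/oshwhub_corpus_estimate.md (content merged into §5) - log.md: record of this change Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 19:59:05 +08:00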
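Zhang Jiahao	53b7648984	Add HF bshada/open-schematics to Phase 1 plan Why: - Charles specifically asked for this HF dataset to be included in the first batch. It is a preprocessed package (not a site to crawl), so its fetch logic differs from oshwhub's; the decision points are laid out in the plan first, before pulling anything. - Complements oshwhub (EasyEDA ecosystem) by adding a KiCad-native path. What: - docs/sources/hf_bshada_open_schematics.md: investigation document - 78 parquet shards, 6.4 GB total - CC-BY-4.0, commercial-use friendly - fields: .kicad_sch source / PNG / component list / JSON / YAML / name / desc - mirroring scheme (store the whole package under data/external/..., not split per-project) - .gitattributes suggestion (data/external/*/.{parquet,png} → LFS) - plan.md §1.5: phase description + 6.4 GB budget pending Charles's approval - README.md data-source table: adds one row - log.md: record of this change Download not triggered; waiting for Charles's sign-off. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 19:51:24 +08:00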
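Zhang Jiahao	e222b08f27	Add corpus size/license estimator; snapshot 90-project findings Why: - The scale-up decision needs firmer data than "52MB/project × 12493 = 650GB". Sampling attachments[].size for 90 hot projects with scripts/estimate_size.py gives the real distribution (median 9MB / p90 54MB): a median-based estimate of 110GB for the full corpus, with a p90 upper bound of 660GB. This gives Charles a credible storage budget. - The accompanying license and extension distributions yield two important insights: (1) mp4+qt video accounts for 54% of storage, so adding a --skip-ext switch could save about half; (2) NC (Non-Commercial) licenses are ~11%, so downstream must filter by whitelist. What: - scripts/estimate_size.py: download-free metadata sampler, reuses crawler.parse_detail_html - docs/sources/oshwhub_corpus_estimate.md: snapshot of results + decision recommendations - log.md: record of this session Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 19:45:54 +08:00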
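Zhang Jiahao	c8d55a22eb	Add schema+file validator; pin down fs-web-stream as ad icons Why: - The schema must be automatically checkable, otherwise there is no guarding against rot once the crawl scales up. scripts/validate.py now runs two layers of checks over every metadata.json (schema + sha256 of local files), so a single run signs off the full dataset; 10/10 projects pass. - docs/sources/oshwhub.md previously marked fs-web-stream.jlc.com as "project source, to be investigated"; investigation confirmed that all of those URLs are 嘉立创 (JLC) service-sidebar / promotional icons, unrelated to projects. image.lceda.cn/attachments/ is the only entry point for project attachments, which closes this open item in the investigation document. What: - scripts/validate.py: jsonschema validation + optional --check-files to verify sha256 - pyproject.toml: adds the jsonschema>=4.26 dependency - docs/sources/oshwhub.md: fs-web-stream classified as promotional resources (excluded), with context evidence attached - log.md: record of this session Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 19:40:55 +08:00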
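Zhang Jiahao	5ffa10f256	Phase 1 MVP: crawl 10 high-quality oshwhub projects into LFS Why: - As specified by Charles: first crawl 10 high-quality projects into Gitea LFS, one folder per project, keeping the original files and URLs. A small batch validates the schema + LFS pipeline first; storage scale is decided before scaling up. What: - crawlers/oshwhub: listing API (`/api/project?sort=hot`) + SSR HTML parsing, producing metadata / description / cover / files / _urls in one pass - schemas/project.schema.json: unified cross-source schema - docs/sources/oshwhub.md: API entry points / field mapping / pitfalls investigated - pyproject.toml: httpx[http2] as the sole dependency - .gitattributes: everything under data/raw//files/ goes through LFS (rule kept narrow to avoid catching schemas/.json etc. by mistake) - .gitignore: removes the data/raw/ exclusion (now committed via LFS) The 10 projects cover: debugger / hot plate / Geiger counter / digitally controlled power supply / soldering station / smartwatch / USB current meter / ZVS induction heater / AI dev board / infrared thermal imaging. 52 attachments in total ≈ 524 MB go into LFS; selection criteria: grade=4 & likes>=100 & diversity. Known gaps (see plan.md § Phase 1.4): - EasyEDA source JSON requires login (u.lceda.cn), skipped in v0.1 - project-source downloads from fs-web-stream.jlc.com untested - automatic schema validation in scripts/validate.py not implemented Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 19:34:09 +08:00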
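Zhang Jiahao	bf2370f83b	Initial skeleton for FacereDataset Why: - Facere needs a unified source of open-source hardware design data for training proprietary models and building a retrieval-based knowledge base. The repo starts with a skeleton that writes the compliance red lines, data-schema requirements and crawler conventions into CLAUDE.md, so that per-site crawler implementations do not diverge later. - plan.md uses a phased roadmap to make the "breadth before depth, compliance before scale" strategy explicit, requiring alignment with Charles before any scale-up, which lowers storage and legal risk. Contents: - README.md: project overview, data-source table, repo structure, compliance statement - CLAUDE.md: project-level Claude instructions (workflow / crawler conventions / compliance red lines) - plan.md: phased plan for Phase 0-6 + risks and open items - log.md: first log entry (investigation + initialization record) - .gitignore: excludes the contents of data/{raw,processed,state}, keeping directory placeholders - directory skeleton: crawlers/ schemas/ scripts/ data/ docs/sources/ Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 18:58:10 +08:00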
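
The EPRO2 line-splitting pitfall noted in commit 3282a028c4 (strip the trailing line terminator before splitting on the field separator) can be illustrated with a small sketch. This is not the repository's parser: the function name `split_epro2_line` and the sample line are hypothetical, and the only format facts assumed are the two the commit message states, namely that a lone trailing pipe ends the line and a double pipe separates fields.

```python
import json


def split_epro2_line(line: str) -> list[str]:
    """Split one EPRO2-style line into fields (hypothetical helper).

    The trailing single '|' is treated as the line terminator and removed
    first; only then is the line split on the '||' field separator.
    """
    return line.rstrip("\n").rstrip("|").split("||")


# Hypothetical sample line: an op name followed by a JSON payload.
sample = 'DOCHEAD||{"docType": "PCB"}|'

# Naive split keeps the terminator glued to the payload, so JSON parsing fails.
naive_fields = sample.split("||")
try:
    json.loads(naive_fields[-1])
    naive_ok = True
except json.JSONDecodeError:
    naive_ok = False

# Stripping the terminator first leaves a clean JSON payload.
fields = split_epro2_line(sample)
payload = json.loads(fields[-1])
```

If a caller swallows the `JSONDecodeError` from the naive path, the document header is never recognised, which matches the symptom the commit describes (most docs silently missing).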

20 Commits
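
As a rough illustration of the Pro 2.x `dataStr` handling described in commit 3720cd176a (one JSON array per line, objects keyed by the id at position 1, singleton ops keyed by OPTYPE), here is a minimal sketch. It is not the code in tools/epro2/std/pro2_writer.py: the function name `parse_datastr` and the contents of `SINGLETON_OPS` are assumptions made for the example, and the two sample lines are the ones quoted in the commit messages.

```python
import json

# Assumption: which OPTYPEs are singletons is not listed in the log;
# DOCTYPE is used here purely for illustration.
SINGLETON_OPS = {"DOCTYPE"}


def parse_datastr(data_str: str) -> dict[str, list]:
    """Parse a plaintext op-stream into {key: op_array} (hypothetical helper)."""
    objects: dict[str, list] = {}
    for raw in data_str.splitlines():
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        op = json.loads(raw)  # each line is one JSON array
        optype = op[0]
        if optype in SINGLETON_OPS or len(op) < 2:
            key = optype  # singleton: keyed by its OPTYPE
        else:
            key = str(op[1])  # most ops: id sits at position 1
        objects[key] = op
    return objects


sample = '["DOCTYPE","SCH","1.1"]\n["COMPONENT","e1","",0,0,0,0,{},0]\n'
objects = parse_datastr(sample)
```

Keeping the values as the original arrays matches the array-shaped `[OPTYPE, arg1, ...]` values the mapping doc describes for `head.epro_format` = "pro2".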