docs: explain per-doc .epro2 crawl vs web-export .epro2 ZIP

Colleague-facing explainer at docs/sources/pro_crawl_vs_export.md.
Addresses the "I see 278 .epro2 files but my browser only downloaded
one" confusion: web download is a ZIP container (extension is a UX
choice, not a format), our crawl produces per-doc message streams.
Both carry equivalent EPRO2 data; only real gap is IMAGE/ binary
previews which we don't fetch yet.

Why per-doc and not ZIP: the ZIP path has no public endpoint —
three HARs confirm the export button fires zero HTTP requests, it's
pure client-side JSZip on data already loaded by the editor. Our
crawler hits the same chain endpoints the editor uses internally,
which delivers per-doc streams.

Log entry references the 278 vs 266 doc-count delta for ESP-VoCat
(we walk full history chain, web export is a current snapshot).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-04-29 00:13:52 +08:00
parent 1e06ba6582
commit fc2a45f658
2 changed files with 182 additions and 0 deletions

16
log.md
View File

@@ -4,6 +4,22 @@
---
## 2026-04-29 01:00 科普文档:爬取 per-doc .epro2 vs 网页端 .epro2 ZIP 整包
**Claude 会话**
接 chain replay sleep 优化commit `1e06ba6`)后续。同事看到 `data/raw/oshwhub/<uuid>/source/` 下面躺着 278 个 `.epro2` 而不是 1 个,会直觉以为抓错了——他们认知里的 `.epro2` 是网页端"下载工程包"那个 1.4 MB 单文件。
实际上:
- **网页端 `.epro2`** = ZIP 容器(扩展名误导),里面 `.epru`(拼成一坨的 EPRO2 流)+ `project2.json` + `IMAGE/` 6 张组件预览图
- **爬取 `.epro2`** = 工程内每个文档SYMBOL / FOOTPRINT / DEVICE / SCH_PAGE / PCB ……)自己的 EPRO2 消息流per-doc 一文件
- 两者**信息量基本等价**ESP-VoCat 我们 278 vs 网页 266多的 12 个是 history chain 上演化掉的容器层旧版本);唯一真实差异是 IMAGE/ 二进制图(我们 blob 引用爬到了但没拉本体——已知 gap
- 我们走 per-doc 不走 ZIP 的硬约束:**ZIP 那条路服务端没有公开端点**,是纯前端 JS 现拼现压(三份 HAR 验证:导出按钮零 HTTP 流量)
写到 `docs/sources/pro_crawl_vs_export.md`给同事看。结构TL;DR → ESP-VoCat 实例 → docType 分布对比表 → 数量差异解释 → 体积对比 → 选型决策表。
---
## 2026-04-29 00:30 KiCad 导出 Phase 3 hierarchicalroot + global_label + 5-Voltage 电源端口
**Claude 会话**