FacereDataset

Author	SHA1	Message	Date
Knowit	183f82a3be	crawler: drop SLEEP_SOURCE 5.0 -> 0.5 (Std doc endpoint probe) Ladder probe lceda.cn/api/documents/<uuid>: 5 tiers (5/2/1/0.5/0.25s) × 9 distinct Std doc UUIDs = 45 reqs total, all 200/success. Latency variance is dominated by payload size (Std docs span 4 KB to 4.5 MB), not server backpressure. Same posture as Pro API. Net effect on batch-50 estimate: 25 Std items × 10 doc calls saves ~19 min wall time (21min sleep -> 2min sleep). Combined plan now projects ~2h -> ~10min walltime exclusive of download bytes. scripts/probe_rate_limit.py: --host std-doc tier added. Reads doc UUIDs from /tmp/std_doc_uuids.json (assembled by caller from any source/manifest.json upstream_version_documents lists). Reusable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 00:54:46 +08:00
Knowit	cb868988b9	crawler: drop sleep rates 10x for Pro API, 2x for oshwhub detail Calibrated against ladder probes on 2026-04-29. Findings in docs/sources/probe_rate_limit_results.md. SLEEP_PRO 5.0 -> 0.5 (pro.lceda.cn API) SLEEP_BETWEEN 2.0 -> 1.0 (oshwhub detail/listing) SLEEP_SOURCE 5.0 unchanged (lceda.cn Std endpoints — not yet probed) SLEEP_PRO_CDN 0.2 unchanged (modules.lceda.cn — already optimized) The original 5s rate for Pro API was set out of caution because Pro requires a logged-in cookie. Empirical sustained-burst probe (25 distinct UUIDs at 0.5s sleep, no recovery): 0/25 errors, median latency 410ms, p90 932ms. The "Pro is rate-sensitive" assumption was wrong — server tolerates QPS=2 cleanly. oshwhub detail HTML pages slowed from p90 6.4s at 1.0s sleep to p90 15s at 0.5s — server queue backs up. 1.0s is the headroom-safe water mark. Net effect on batch-50 estimate: ~1.5h -> ~30min. scripts/probe_rate_limit.py: rate-limit ladder probe tool. Reusable for new endpoints (Std source still owes a probe). Designed for safety: 30s tier recovery, low rep counts on auth hosts, bail on first non-200. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 00:45:34 +08:00
Knowit	d89a7cdf9c	oshwhub: dump full listing index (33,695 projects) for batch sizing Probed listing API and learned: total field is exposed (Pro=21,202 / Std=12,493), pageSize accepts >=1000 (full corpus = 35 requests / 71s), sort param is silently ignored. Dump all listings via scripts/dump_listing_index.py to local jsonl so downstream batch-selection no longer hits the API. Why: needed quantitative anchors before scaling Pro batch beyond top-5. License is detail-page only (~19h serial scan), so we want to filter on grade/like locally first to shortlist before paying that cost. Quality-tier counts now known: A-tier (grade>=3 & like>=10) = 2,806 across both origins. - scripts/dump_listing_index.py: one-shot scraper, polite QPS, streams to jsonl - docs/sources/oshwhub_listing_full.md: human-readable report with growth trends, quality tiers, owner concentration, and storage-budget anchors Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 23:30:56 +08:00
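Knowit	c721e08c93	projects.md: replace Comments column with 版本 (Version: Std / Pro 3.x / Pro 2.x) The Comments column was a weak signal of project quality (comment volume mostly tracks how hot a topic is); the new Version column tells readers directly which EDA format and editor version each project source uses. Of the current 15 projects, 10 are Std, 3 are Pro 3.x, and 2 are Pro 2.x. source_format field mapping: easyeda-std → Std; easyeda-pro → Pro 3.x; easyeda-pro-legacy → Pro 2.x; other values → passed through unchanged. editor_version (e.g. 6.5.43 / 3.2.91 / 2.1.40) is shown as a sub-label on the second line. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 22:01:41 +08:00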
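Zhang Jiahao	ba501c328c	Remove personal name from suggestion/decision phrasing Why: - Phrasings such as "给 Charles 的建议" (advice for Charles), "待 Charles 拍板" (pending Charles's sign-off), and "需要 Charles 决策" (needs Charles's decision) tie a specific person to the docs and go stale when the maintainer changes. Replace them with the neutral "建议 / 待决策 / 待拍板" (suggestion / pending decision / pending sign-off) so the docs stay general for future collaborators and agents. What: - log.md: remove four occurrences of "给 Charles / 还是需要 Charles 决策 / 等 Charles 拍板" - plan.md: remove three occurrences of "待 Charles / Charles 定目标 / 需要 Charles 定" - docs/sources/hf_bshada_open_schematics.md: "待 Charles 决策" → "待决策" - scripts/estimate_size.py: drop "给 Charles 一个估计" (give Charles an estimate) from the docstring - CLAUDE.md: data-deletion confirmation rule changed from "先跟 Charles 确认" (confirm with Charles first) to "先跟用户确认" (confirm with the user first) The remaining mentions of Charles are all factual: - "维护者：Charles" (Maintainer: Charles) in README/plan (identity field) - "Charles 要求..." / "Charles 点名..." (Charles requested... / Charles singled out...) in historical log.md entries (records of past events) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 20:01:52 +08:00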
Zhang Jiahao	ce22717288	Add projects.md index (stars-sorted) + build_index.py generator Why: - Charles wants an index page listing the ingested projects and their stars. A hand-maintained page would drift, so scripts/build_index.py regenerates it directly from metadata.json, ensuring projects.md always mirrors data/raw/. What: - projects.md: 10 projects in descending order of stars (from 加热台量产计划 at 3293 down to 柚子爱 AI 相机 at 236), with stars/likes/forks/views/comments/files/size, plus license and data-source distributions - scripts/build_index.py: scans metadata.json and renders markdown; supports multiple data sources in future (distinguished by the source field), so after adding oshwhub / github / hackaday projects a rerun is all that is needed - README.md: add link to projects.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 19:48:21 +08:00
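Zhang Jiahao	e222b08f27	Add corpus size/license estimator; snapshot 90-project findings Why: - The scale-up decision needs more solid data than "52MB/project × 12493 = 650GB". scripts/estimate_size.py samples attachments[].size across 90 hot projects to get the real distribution (median 9MB / p90 54MB); the full-corpus estimate is 110GB at the median, with a p90 upper bound of 660GB. This gives Charles a credible storage budget. - The license and ext distributions sampled along the way yield two important insights: (1) mp4+qt video accounts for 54% of storage, so adding a --skip-ext switch would save about half; (2) NC (Non-Commercial) licenses are ~11%, so downstream consumers must filter by whitelist. What: - scripts/estimate_size.py: download-free metadata sampler, reuses crawler.parse_detail_html - docs/sources/oshwhub_corpus_estimate.md: result snapshot + decision recommendations - log.md: record of this session Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 19:45:54 +08:00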
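Zhang Jiahao	c8d55a22eb	Add schema+file validator; pin down fs-web-stream as ad icons Why: - The schema must be automatically verifiable, otherwise nothing guards against rot once we scale up. scripts/validate.py now runs a two-layer check on every metadata.json (schema + sha256 of local files), so a single run signs off the full dataset; 10/10 projects pass. - docs/sources/oshwhub.md previously marked fs-web-stream.jlc.com as "project source, to be investigated"; investigation confirmed that those URLs are all 嘉立创 (JLC) service-sidebar and promotional icons, unrelated to projects. image.lceda.cn/attachments/ is the only entry point for project attachments, which closes out the research doc. What: - scripts/validate.py: jsonschema validation + optional --check-files sha256 verification - pyproject.toml: add jsonschema>=4.26 dependency - docs/sources/oshwhub.md: classify fs-web-stream as promotional assets (excluded), with context evidence - log.md: record of this session Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 19:40:55 +08:00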
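Zhang Jiahao	bf2370f83b	Initial skeleton for FacereDataset Why: - Facere needs a unified source of open-source hardware design data for training proprietary models and building a retrieval-based knowledge base. The repo starts as a skeleton, with the compliance red lines, data schema requirements, and crawler conventions written into CLAUDE.md so that per-site crawler implementations do not diverge later. - plan.md uses a phased roadmap to spell out the "breadth before depth, compliance before scale" strategy, requiring one alignment with Charles before scaling up, to reduce storage and legal risk. Contents: - README.md: project overview, data source table, repo structure, compliance statement - CLAUDE.md: project-level Claude instructions (workflow / crawler conventions / compliance red lines) - plan.md: Phase 0-6 phased plan + risks and open items - log.md: first log entry (research + initialization record) - .gitignore: exclude the contents of data/{raw,processed,state}, keep directory placeholders - Directory skeleton: crawlers/ schemas/ scripts/ data/ docs/sources/ Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 18:58:10 +08:00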
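
The sleep-time figures quoted in commit 183f82a3be (25 Std items × 10 doc calls, 21 min of sleep reduced to 2 min) can be reproduced with a few lines of arithmetic. This is only an illustrative sketch: the helper name `sleep_minutes` is hypothetical and not taken from the repository, and it assumes one fixed sleep per sequential call.

```python
def sleep_minutes(items: int, calls_per_item: int, sleep_s: float) -> float:
    """Total time spent sleeping, in minutes, assuming one sleep per call."""
    return items * calls_per_item * sleep_s / 60.0


# Figures from the commit message: 25 Std items, 10 doc calls each.
before = sleep_minutes(25, 10, 5.0)  # SLEEP_SOURCE = 5.0 s
after = sleep_minutes(25, 10, 0.5)   # SLEEP_SOURCE = 0.5 s
saved = before - after

print(f"sleep before: {before:.1f} min, after: {after:.1f} min, saved: {saved:.1f} min")
```

Rounded to whole minutes this gives the 21 min, 2 min, and ~19 min saved that the commit reports.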

9 Commits
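
The source_format mapping described in commit c721e08c93 could be sketched roughly as below. The names `FORMAT_LABELS` and `version_cell` are hypothetical, and the use of `<br>` to put editor_version on a second line inside a markdown table cell is an assumption; the actual scripts/build_index.py may do this differently.

```python
from typing import Optional

# Mapping taken from the commit message; unknown values pass through unchanged.
FORMAT_LABELS = {
    "easyeda-std": "Std",
    "easyeda-pro": "Pro 3.x",
    "easyeda-pro-legacy": "Pro 2.x",
}


def version_cell(source_format: str, editor_version: Optional[str] = None) -> str:
    """Render the Version cell: format label, then editor_version as a sub-label."""
    label = FORMAT_LABELS.get(source_format, source_format)
    if editor_version:
        # Assumption: "<br>" is how the second line is rendered in the table cell.
        return f"{label}<br>{editor_version}"
    return label


print(version_cell("easyeda-pro-legacy", "2.1.40"))
```

Keeping the mapping in a plain dict with a pass-through default means a new source_format value still renders as something readable instead of raising an error.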