FacereDataset

Author	SHA1	Message	Date
Knowit	c6fd111d6d	crawler: --no-cover, --concurrency, drop cross-host sleep + batch-50 Step 1 done Three crawler ergonomics for batch operations: --no-cover Skip cover image download. For scan-only modes (license/meta scrape) this drops ~1.3s/project and avoids slow-CDN hangs. --concurrency N ThreadPoolExecutor wrapping the per-project loop. Default 1 = serial (current behavior). Anonymous endpoints tolerate 5+ comfortably; output uses a print lock for readable interleaved progress. fetch_cover plumbs through crawl_one. Drop cross-host sleep #1: in crawl_one between detail HTML (oshwhub.com) and cover image (image.lceda.cn). Different hosts — sleep was unnecessary. Saves ~1s/project. Sleep #2 (post-cover, before next iteration) stays — it gates the next oshwhub.com hit. download_to gains max_seconds wall budget (default 60s, cover uses 15s). Defends against pathologically slow CDN connections — observed 10 KB/s on image.lceda.cn for one project, would have hung 6+ min on a 3.6 MB cover otherwise. httpx default timeout resets per chunk, so streaming downloads need an external wall-clock guard. batch-50 Step 1 (license/meta scrape) shipped: 50/50 candidates have metadata.json + license recorded License distribution: GPL 3.0 32, Public Domain 6, NC variants 8, CERN-OHL 1, MIT 1, CC BY 3.0 1 Forge-friendly (non-NC): 41/50 (82%) Declared attachments: 180 files / 2.36 GB (median 18 MB/proj, max 304 MB) Walltime: 3min 26s for 28 projects at concurrency=5 (server-side HTML render bound, not sleep-bound) One orphan partial cover (a670e60a...) cleaned up — leftover from the first aborted run before the timeout fix landed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 01:35:11 +08:00
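The wall-budget guard described in c6fd111d6d can be sketched as below. This is a minimal illustration, not the crawler's actual `download_to`: the names `read_with_budget`, `WallBudgetExceeded` and the injectable `clock` parameter are hypothetical, and the chunk source is any iterable of bytes (for example a streaming HTTP response body).

```python
import time
from typing import Callable, Iterable


class WallBudgetExceeded(Exception):
    """The whole transfer ran past its wall-clock budget."""


def read_with_budget(
    chunks: Iterable[bytes],
    max_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> bytes:
    """Collect streamed chunks, aborting once max_seconds have elapsed overall."""
    deadline = clock() + max_seconds
    received = bytearray()
    for chunk in chunks:
        # A per-chunk read timeout never fires on a slow-but-alive stream,
        # so the total elapsed time is checked each time a chunk arrives.
        if clock() > deadline:
            raise WallBudgetExceeded(
                f"exceeded {max_seconds}s after {len(received)} bytes"
            )
        received.extend(chunk)
    return bytes(received)
```

Passing the clock as a parameter keeps the guard testable without real sleeps; a caller would presumably also delete the partial file when the exception fires.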
Knowit	fe6971f3f9	tools/epro2: add std/ writer — EPRO2 → EasyEDA Std-format JSON for downstream The downstream colleague consumes oshwhub Std (lceda) dict-format JSON, not KiCad. The EPRO2 decryption part (per-doc plaintext .epro2 streams in data/raw/<uuid>/source/) is what we already provide; the missing piece is converting EPRO2 op-streams into the same `dataStr.shape` tilde-delimited format their parser already speaks. New tools/epro2/std/ module, peer of tools/epro2/kicad/, kept deliberately separate so the KiCad path stays untouched: - pcb_writer.write_pcb_std() — high-fidelity, validated against a Std PCB sample at data/raw/oshwhub/3e2f893d.../25931ddab8.json. Maps LINE→TRACK, VIA→VIA, POUR→COPPERAREA (with SVG `M..L..Z` path), POLY→CIRCLE/SOLIDREGION, COMPONENT+FOOTPRINT→LIB nested with #@$-separated PADs (placement rotation + translate applied so pad coords land at PCB-absolute positions). Layer-id mapping (EPRO2 5↔7 flipped vs Std solder/paste, 11→10 outline, 12→11 multi, SIGNAL inner 15+ → Std 21+) noted inline. - sch_writer.write_sch_std() — best-effort. Our corpus has zero Std schematic samples (docType=1) so verb field orders follow the EasyEDA Std public spec, not direct observation. Emits W (wire), N (net flag, including the 5-Voltage Global Net Name power-port pattern), T (text), LIB (placement with #@$-nested PIN/T). If downstream's parser bails the fix is almost certainly a positional field tweak, not a re-architecture. - __main__.py — flat output `<doc_uuid>.json` per doc directly under --out (mirrors Std's own data layout); --all-pcb / --all-sch / --all. Smoke test on ESP-VoCat: 6 PCB + 9 SCH = 15 JSON files, libs_unresolved=0 across the board. Compact JSON (separators=(",",":")) matches Std's single-line format. Numbers use _num() — integers without trailing .0, floats trimmed. 71 → 82 unit tests pass. 
Open questions for downstream: (1) confirm SCH verb field orders, (2) do they want any of the upstream metadata fields we drop (master, owner, created_at, etc — those live on the crawler side, not the schematic itself)? Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 01:16:39 +08:00
Knowit	ed713fa557	docs: consolidate rate-limit probe results into a proper benchmark report The doc had been growing incrementally as each host got probed; reshape it as a polished benchmark with TL;DR top, methodology section (including safety constraints + caveats), per-host detailed tables, final crawler settings, batch-50 walltime breakdown, and a reproduce recipe. Five hosts fully covered: pro.lceda.cn API 5.0s -> 0.5s (10×) lceda.cn doc 5.0s -> 0.5s (10×) oshwhub detail 2.0s -> 1.0s ( 2×) oshwhub listing 2.0s -> 1.0s ( 2×) modules.lceda CDN 0.2s (already optimized) Net effect on batch-50 plan: sleep total ~32min -> ~3min, walltime ~2h -> ~10-15min. Key finding: the original 5s/req on Pro was set out of "logged-in account is precious" caution with zero empirical evidence. Sustained burst probe (25 distinct UUIDs at 0.5s, no recovery) showed 0/25 errors and median latency 410ms — the caution was unjustified. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 00:57:35 +08:00
Knowit	c474f8ad83	log: X86 motherboard hit OOM swap death-spiral on its CPU board PCB write Killed at the 14-min mark — VmRSS 1.96 GB + VmSwap 1.41 GB on a 3.3 GB RAM box with 4 GB swap (3.6 GB used), read_bytes 24 GB (pure swap thrash), process state D (uninterruptible disk sleep). The CPU board PCB doc (8K+ objects, 35+ child schematic pages) overflowed our current all-in-memory build pattern: pcb_writer builds the full output list before to_sexpr serializes once at the end, plus the 35 write_sch_page calls each build their own Relations + lib_symbols dedup state. Saved what finished: 4/5 X86 boards complete (Sch-CAM-IMX415, Schematic1, SCHEMATIC1, Sch-VTX-SSC338Q), the CPU board SCHEMATIC1_1 has all its 35 child .kicad_sch but no .kicad_pcb. Final downstream delivery: 17 board projects across the 3 supported Pro projects, 32/32 files pass kicad-cli (sch erc + pcb export svg). Streaming-write fix is the next logical follow-up but out of scope for this turn. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 00:56:17 +08:00
Knowit	183f82a3be	crawler: drop SLEEP_SOURCE 5.0 -> 0.5 (Std doc endpoint probe) Ladder probe lceda.cn/api/documents/<uuid>: 5 tiers (5/2/1/0.5/0.25s) × 9 distinct Std doc UUIDs = 45 reqs total, all 200/success. Latency variance is dominated by payload size (Std docs span 4 KB to 4.5 MB), not server backpressure. Same posture as Pro API. Net effect on batch-50 estimate: 25 Std items × 10 doc calls saves ~19 min of wall time (21min sleep -> 2min sleep). Combined plan now projects ~2h -> ~10min walltime exclusive of download bytes. scripts/probe_rate_limit.py: --host std-doc tier added. Reads doc UUIDs from /tmp/std_doc_uuids.json (assembled by caller from any source/manifest.json upstream_version_documents lists). Reusable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 00:54:46 +08:00
Knowit	8b857428e3	log: post-mortem of running --all on the other 4 Pro projects Captures the two new --all crash paths fixed in `61fd3ff` (odd inner copper layers, duplicate BOARD titles) plus the Pro 2.x scope gap (Taishan + Liangshan are JSON-format, not EPRO2 streams, so our replay_project reads the bytes but doc_type stays None and _group_by_board returns no SCH/PCB groupings — needs a separate Pro 2.x writer). Status as of this commit: ESP-VoCat 6 boards + 220V power 7 boards = 13 project dirs ready for downstream corpus. X86 motherboard is the largest of the five (7374 docs, 1.9 GB RAM in flight) and still running. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 00:53:08 +08:00
Knowit	61fd3ff072	tools/epro2/kicad: fix two --all crashes found running the other 4 Pro projects Running the new --all on the remaining 4 Pro projects (X86 motherboard, 220V power supply, Taishan Pi, Liangshan Pi) surfaced two crash modes not covered by ESP-VoCat: 1. Odd inner-layer count → KiCad rejects the file at load with "3 is not a valid layer count". The 220V power boards have one used inner SIGNAL layer (3 copper total: F.Cu / In1.Cu / B.Cu), but KiCad requires an even copper count. Fixed pcb_writer to pad with one empty inner layer when the inner count is odd, so the total stays even (2, 4, 6, ...). 2. Two BOARDs sharing the same META.title — twin "显示板" boards in the 220V power project — landed in the same project directory and the second silently overwrote the first's .kicad_sch / .kicad_pcb / .kicad_pro. Fixed --all to detect title collisions and suffix every colliding basename with the BOARD uuid prefix (so both '显示板' boards become '显示板_52e8cc76' and '显示板_55d32906' rather than one quietly winning). 71 → 73 unit tests pass (test_odd_inner_signal_count_padded_to_even_total + test_duplicate_board_titles_get_distinct_basenames). Tangentially noted while running this: Taishan Pi and Liangshan Pi are Pro 2.x JSON, not EPRO2 streams — our replay layer reads the files but doesn't decode docType, so SCH/PCB grouping returns nothing. Pro 2.x needs a separate writer; out of scope for this commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 00:48:46 +08:00
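The two fixes in 61fd3ff072 reduce to small pure functions. The sketch below is an illustration under assumptions: `padded_inner_count` and `distinct_basenames` are hypothetical names, and the 8-character uuid prefix is inferred from the '显示板_52e8cc76' example in the commit message rather than read from the code.

```python
from collections import Counter


def padded_inner_count(used_inner: int) -> int:
    """Inner-layer count to emit so that total copper (inner + F.Cu + B.Cu) is even."""
    return used_inner + (used_inner % 2)


def distinct_basenames(boards: list[tuple[str, str]]) -> list[str]:
    """Map (title, board_uuid) pairs to project basenames.

    Every title that occurs more than once gets a uuid-prefix suffix,
    so no colliding board keeps the bare name and silently wins.
    """
    counts = Counter(title for title, _ in boards)
    return [
        f"{title}_{uuid[:8]}" if counts[title] > 1 else title
        for title, uuid in boards
    ]
```

Suffixing all colliding boards, not just the second one, keeps the output stable regardless of the order in which boards are visited.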
Knowit	cb868988b9	crawler: drop sleep rates 10x for Pro API, 2x for oshwhub detail Calibrated against ladder probes on 2026-04-29. Findings in docs/sources/probe_rate_limit_results.md. SLEEP_PRO 5.0 -> 0.5 (pro.lceda.cn API) SLEEP_BETWEEN 2.0 -> 1.0 (oshwhub detail/listing) SLEEP_SOURCE 5.0 unchanged (lceda.cn Std endpoints — not yet probed) SLEEP_PRO_CDN 0.2 unchanged (modules.lceda.cn — already optimized) The original 5s rate for Pro API was set out of caution because Pro requires a logged-in cookie. Empirical sustained-burst probe (25 distinct UUIDs at 0.5s sleep, no recovery): 0/25 errors, median latency 410ms, p90 932ms. The "Pro is rate-sensitive" assumption was wrong — server tolerates QPS=2 cleanly. oshwhub detail HTML pages slowed from p90 6.4s at 1.0s sleep to p90 15s at 0.5s — server queue backs up. 1.0s is the headroom-safe water mark. Net effect on batch-50 estimate: ~1.5h -> ~30min. scripts/probe_rate_limit.py: rate-limit ladder probe tool. Reusable for new endpoints (Std source still owes a probe). Designed for safety: 30s tier recovery, low rep counts on auth hosts, bail on first non-200. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 00:45:34 +08:00
Knowit	3c00edf6db	tools/epro2/kicad: --all emits paired .kicad_pro + .kicad_sch + .kicad_pcb per BOARD KiCad pairs project files purely by basename + same directory: a folder holding `Foo.kicad_pro`, `Foo.kicad_sch`, `Foo.kicad_pcb` opens as one project on double-click of the .kicad_pro, with cross-tool navigation (open footprint from schematic etc) wired up automatically. - pro_writer.write_kicad_pro() renders the minimal KiCad 8 JSON we need: meta.filename pinning the basename, sheets=[[<root_uuid>, ""]] binding the schematic root, and stub blocks for board / schematic / net_settings / erc that KiCad expects to find on the first GUI load. - root_sch_writer.write_root_sheet() now accepts an optional root_uuid so the caller can pass the same uuid into the .kicad_pro and .kicad_sch (the binding fails silently with mismatched ids). - CLI gains `--all`: groups SCH/PCB docs by their META.board uuid (1:1 in EPRO2), strips SCH-/PCB- editor prefixes from titles to derive a shared project basename, and emits one directory per BOARD with paired files. BOARDs whose SCH is DELETE_DOC (LCD-BD on ESP-VoCat) still get a .kicad_pro with sheets:[] + .kicad_pcb so pcbnew opens cleanly. ESP-VoCat smoke: 6 boards → 6 project dirs, all pairs validated by kicad-cli sch erc / pcb export svg. The CoreBoard pro/sch/pcb trio shares root uuid 366d3e53...c2fccbe4330b end-to-end. 68 → 71 unit tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 00:39:58 +08:00
Knowit	adc5dc5e1b	tools/epro2/kicad: PCB Phase-2 — POUR → (zone), CoreBoard unconnected -43% Phase-1 left 75-358 unconnected_items per board (DRC), dominated by GND/AGND/POWER nets that EPRO2 routes through copper pour, not discrete traces. Phase-2 lands those: - pcb_writer._decode_zone_path handles the three POUR.path encodings seen in ESP-VoCat: rectangle (['R', x, y, w, h, ...]), circle (['CIRCLE', cx, cy, r]) approximated as a 36-segment polygon, and polyline (numeric pairs with 'L'/'ARC' verb tokens). - Each POUR on a copper layer turns into a (zone (polygon ...) ...) block plus a (filled_polygon ...) that mirrors the boundary. Why mirror, not auto-fill: kicad-cli pcb drc does NOT run the zone filler before checking — only the KiCad GUI does. Without a pre-computed (filled_polygon ...), DRC sees zones as empty regions and reports the entire net as unconnected. Mirroring the boundary as the fill is "connectivity-correct, clearance-imprecise" — KiCad users can still hit Edit > Fill Zones to refine thermals and pad clearances. We chose this over reading EPRO2's POURED.pourFill (the editor's own post-fill polygons) because POURED paths use ARC tokens we'd need to fully decode, and the user-drawn POUR boundary is already the authoritative "intended copper" region. ESP-VoCat DRC totals: 883 → 730 unconnected_items (-17% project-wide). CoreBoard, the 4-layer board with the most pour coverage, drops 358 → 205 (-43%). Other boards see no movement because their unconnected items are non-pour issues — pads outside the user-drawn POUR rectangle, or internal $1N nets via vias on the wrong net (separate problem, separate fix). 65 → 68 unit tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 00:27:33 +08:00
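The circle case of the POUR path decoding in adc5dc5e1b (a circle approximated as a 36-segment polygon) can be illustrated as follows. This is a sketch only; `circle_to_polygon` is a hypothetical name and the real `_decode_zone_path` may order or close its points differently.

```python
import math


def circle_to_polygon(
    cx: float, cy: float, r: float, segments: int = 36
) -> list[tuple[float, float]]:
    """Approximate a circle by `segments` evenly spaced boundary vertices."""
    step = 2 * math.pi / segments
    return [
        (cx + r * math.cos(i * step), cy + r * math.sin(i * step))
        for i in range(segments)
    ]
```

With 36 segments each chord spans 10 degrees, so the polygon sits slightly inside the true circle between vertices; that is consistent with the "connectivity-correct, clearance-imprecise" trade-off the commit describes.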
Knowit	eee1a9b97e	crawler: --skip-ext + --max-source-mb gates for batch-50 expansion Two CLI gates needed before scaling Pro batch beyond top-5: --skip-ext mp4,qt,mov (attachment filter) Skips video extensions in attachment download. Phase 1 measurements showed mp4+qt occupy ~54% of attachment storage. Entry still recorded in metadata.json with skipped:ext:<token> so we can re-fetch later if the policy changes. Honors both server-declared `ext` and filename suffix, case-insensitively. --max-source-mb N (Pro source size cap) Trips inside the chain replay loop on encrypted-blob total. On trip: raise ProjectOversizeError, wipe partial source/, append a row to data/state/oshwhub_pro_oversize.jsonl. Lets us shortlist 50+ Pro projects without one X86-board-class outlier (~500 MB) blowing the LFS budget. Std and Pro 2.x legacy are not capped (both <2 MB in sample). Verified: - cap=0 trips on first blob (1.2 MB), source/ wiped, state recorded - cap=100 runs full ESP-VoCat (7.5 MB plain, 278 docs) - skip-ext microtest: 8/8 cases (case-insensitive, declared/suffix fallback, empty-token edge cases) Plan + frozen candidate list for the next 50 projects: - docs/plans/oshwhub_batch50.md - data/state/oshwhub_batch50_candidates.jsonl (gitignore exception added) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 00:24:55 +08:00
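The `--skip-ext` matching rule in eee1a9b97e (server-declared `ext` first, filename suffix as fallback, case-insensitive, empty tokens ignored) might look like the sketch below. The function name `skip_reason` and its signature are hypothetical; only the `skipped:ext:<token>` marker format comes from the commit message.

```python
import os
from typing import Optional


def skip_reason(filename: str, declared_ext: Optional[str], skip_ext: str) -> Optional[str]:
    """Return 'skipped:ext:<token>' if the attachment should be skipped, else None."""
    # Normalise the comma-separated CLI value; empty tokens are dropped.
    tokens = {t.strip().lower().lstrip(".") for t in skip_ext.split(",") if t.strip()}
    suffix = os.path.splitext(filename)[1]
    # Check the server-declared extension first, then the filename suffix.
    for candidate in (declared_ext or "", suffix):
        ext = candidate.strip().lower().lstrip(".")
        if ext and ext in tokens:
            return f"skipped:ext:{ext}"
    return None
```

Returning the marker string instead of a boolean lets the caller record it in metadata.json directly, so the entry can be re-fetched later if the policy changes.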
Knowit	e61404478e	tools/epro2/kicad: Phase-1 .kicad_pcb exporter — 6/6 boards open in KiCad 8 Phase-1 scope: produce a .kicad_pcb that kicad-cli loads cleanly and that has the right geometry (nets, footprints, tracks, vias, board outline) — not a 1:1 EDA round-trip. Skipped on purpose for Phase 2: copper pours (POUR/POURED), manual FILL, teardrops, board-level strings/images, ARC circle-center recovery. What lands: - pcb_writer.write_pcb(): header/general, data-driven layer table (F.Cu = ord 0; B.Cu = ord 31; SIGNAL inner ids 15+ allocated to In1.Cu/In2.Cu/... in EPRO2-id sorted order so used inner layers stay contiguous), net-name → integer id map (id 0 reserved for the empty net per KiCad convention), LINE→segment / LINE→gr_line on Edge.Cuts, layer-11 POLY paths walked into Edge.Cuts gr_line chains (the actual board outline lives on POLY here, not LINE — without this stats showed edge=0), VIA→via. - footprint_writer.write_footprint_placement(): inline (footprint ...) blocks per PCB COMPONENT. EPRO2 RECT/ELLIPSE/OVAL/POLYGON pad shapes mapped to KiCad rect/circle/oval/custom; SMD vs THT detected by PAD.hole presence; SLOT holes use (drill oval w h). Pad nets resolved cross-doc via the existing PCB.PAD_NET → footprint.pad chain in ProjectRelations. layerId=2 component → (layer B.Cu) + text on B.SilkS so bottom-side parts render correctly. Smoke test on ESP-VoCat (6 PCBs): all 6 pass `kicad-cli pcb export svg` and render. DRC on smallest (MicBoard) reports 145 violations + 75 unconnected — most of the unconnected are GND nets that the EPRO2 source resolves through POUR copper, which Phase 2 will export. CLI: `python -m tools.epro2.kicad <project> --all-pcb --out <dir>` emits one .kicad_pcb per PCB doc. 52 → 65 unit tests pass. Float comparisons in tests use math.isclose because the s-expr 6-decimal trim doesn't preserve strict equality through `value * MIL_TO_MM` round-trips. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 00:18:32 +08:00
Knowit	fc2a45f658	docs: explain per-doc .epro2 crawl vs web-export .epro2 ZIP Colleague-facing explainer at docs/sources/pro_crawl_vs_export.md. Addresses the "I see 278 .epro2 files but my browser only downloaded one" confusion: web download is a ZIP container (extension is a UX choice, not a format), our crawl produces per-doc message streams. Both carry equivalent EPRO2 data; only real gap is IMAGE/ binary previews which we don't fetch yet. Why per-doc and not ZIP: the ZIP path has no public endpoint — three HARs confirm the export button fires zero HTTP requests, it's pure client-side JSZip on data already loaded by the editor. Our crawler hits the same chain endpoints the editor uses internally, which delivers per-doc streams. Log entry references the 278 vs 266 doc-count delta for ESP-VoCat (we walk full history chain, web export is a current snapshot). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 00:13:52 +08:00
Knowit	1e06ba6582	crawler: split sleep policy by host — chain blob fetches drop 5s -> 0.2s The Pro modern fetch_pro_modern walks a per-history blob loop on modules.lceda.cn (CDN-flavored host serving AES-encrypted EPRO2 streams). We were sleeping 5s between every blob — same rate we use for the rate-sensitive pro.lceda.cn API host. HAR analysis (proexportNew2.har) shows the editor fires these blobs back-to-back without throttling, so 0.2s is plenty. Walltime drops linearly with chain length: ESP-VoCat (chain=12): 80s sleep -> 22s sleep (-72%) 220V power (chain=28): 160s sleep -> 26s sleep (-84%) X86 board (chain~700, projection): ~1h -> ~3min Verified by re-fetching ESP-VoCat + 220V power: byte-identical output across all per-doc .epro2 files (sha256 match), only fetched_at timestamp differs in manifest.json. Two manifest files re-stamped as proof of the validation runs. API host sleeps (4x 5s in modern fetcher, 7x 5s in legacy fetcher) are unchanged — those go to pro.lceda.cn /api/ which still wants polite QPS<=0.2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 00:09:19 +08:00
Knowit	ff5553fb06	tools/epro2/kicad: hierarchical export + global_label + 5-Voltage power ports Three coupled changes so kicad-cli sch erc runs at the project level (across all sheets of one schematic) instead of single-sheet: 1. (label) → (global_label (shape passive)). EPRO2 nets are project-global by construction (named rails span every page in the SCH and physically wire across PCBs); KiCad's local label is sheet- scoped and triggers `label_dangling` for any name not duplicated on the same page. 2. New root_sch_writer that groups SCH_PAGE docs by their parent SCH (META.schematic), emits one root .kicad_sch per group with one (sheet ...) entry per child, and threads the root-assigned uuid back into each child's (sheet_instances) so KiCad can bind them. --all-sch now defaults to this; --flat falls back to one-file-per-page. 3. EPRO2's "5-Voltage" placeholder COMPONENT (partId pid8a0e77bacb214e, 365 instances on ESP-VoCat) is the editor's power port. The rail name lives in the placement's `Global Net Name` ATTR, not in the PART. We now emit a (global_label "<rail>") at the placement coords whenever that attr is set (101/365 of them on ESP-VoCat — the rest are unconfigured drafts). ESP-VoCat 5 hierarchical roots: 2325 → 2265 violations. Modest because 5 of 6 SCHs are single-page (no cross-sheet nets to resolve), and the one 4-page schematic (CoreBoard) shares only a handful of names across sheets — most net names are de-facto sheet-local. The remaining ~190 pin_not_connected are dominated by 0402-style passives whose pin tip lies on a wire's interior, not at an endpoint; KiCad needs an explicit (junction) at those points and we don't yet emit one. Marked as the next follow-up in log.md. 47 → 52 unit tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 00:05:47 +08:00
Knowit	54f0173947	tools/epro2/kicad: fix two structural ERC bugs — wire_dangling -88%, pin_not_connected -52% Bisect found two semantics mismatches between EPRO2 and KiCad that cause the 850 real-connectivity ERC violations on the ESP-VoCat ref project: 1. sym_writer was emitting lib coords without negating Y, but KiCad lib uses Y-up and re-flips Y on placement (Y-down schematic). So vertically arranged pins ended up at Y-mirrored absolute positions and wires that reach the geometric pin tip in EPRO2 missed the rendered pin tip in KiCad. Fix: lib_y = -epro2_y, lib_rot = (360 - rot) % 360 for pin/text. 2. sch_writer was treating each LINE as an isolated wire — but EPRO2 binds segments into nets by NAME (WIRE.NET attr), not just geometry. Multi-segment nets like GND/VBUS show up as N disconnected stubs to KiCad. Fix: per-LINE, look up lineGroup → WIRE → NET attr and emit a `(label "<NET>")` at the LINE's start. Same-named labels on distinct physical wires is how KiCad's ERC recognizes a multi-segment net. ESP-VoCat 9 sheets: wire_dangling 444 → 52 (-88%) pin_not_connected 406 → 196 (-52%) real connectivity total 850 → 248 (-71%) Why we did NOT round to grid (the obvious-looking fix): EPRO2 places some pins on a 10-mil pitch (e.g. magnetic socket); rounding to KiCad's default 50-mil ERC grid would collapse those pins. The 248 residual is fundamentally cross-sheet — single-sheet ERC can't see a net's other endpoints on sibling sheets — and is a Phase-3 (hierarchical sheet) problem, not a per-sheet one. 41 → 46 unit tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 23:43:11 +08:00
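The coordinate fix in 54f0173947 is the pair of formulas lib_y = -epro2_y and lib_rot = (360 - rot) % 360. A minimal sketch, with `to_lib_coords` as a hypothetical helper name:

```python
def to_lib_coords(x: float, y: float, rot: float) -> tuple[float, float, float]:
    """Convert an EPRO2 symbol-space point (Y-down) to KiCad lib space (Y-up).

    Negating Y mirrors the point across the X axis, which also reverses the
    sense of rotation, hence the (360 - rot) % 360 term.
    """
    return x, -y, (360 - rot) % 360
```

For example, a pin at (10, 20) rotated 90 degrees maps to (10, -20) rotated 270 degrees, and KiCad's own Y re-flip on placement then puts it back where EPRO2 drew it.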
Knowit	5e63924474	oshwhub: pin listing index snapshot (33,695 rows, 29 MB) into git Previous commit added the dump script + report but the actual jsonl was caught by data/state/* gitignore. Add a targeted exception so the snapshot travels with the repo — anyone who clones can do local filtering without re-hitting the API. The data is regenerable (scripts/dump_listing_index.py is one-shot, ~1 min), but pinning a dated snapshot lets us reason about "the state of the corpus on 2026-04-28" reproducibly. Future re-dumps overwrite the same path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 23:32:58 +08:00
Knowit	d89a7cdf9c	oshwhub: dump full listing index (33,695 projects) for batch sizing Probed listing API and learned: total field is exposed (Pro=21,202 / Std=12,493), pageSize accepts >=1000 (full corpus = 35 requests / 71s), sort param is silently ignored. Dump all listings via scripts/dump_listing_index.py to local jsonl so downstream batch-selection no longer hits the API. Why: needed quantitative anchors before scaling Pro batch beyond top-5. License is detail-page only (~19h serial scan), so we want to filter on grade/like locally first to shortlist before paying that cost. Quality-tier counts now known: A-tier (grade>=3 & like>=10) = 2,806 across both origins. - scripts/dump_listing_index.py: one-shot scraper, polite QPS, streams to jsonl - docs/sources/oshwhub_listing_full.md: human-readable report with growth trends, quality tiers, owner concentration, and storage-budget anchors Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 23:30:56 +08:00
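Knowit	67a2d0448b	.gitignore: add .rpt for kicad-cli sch erc default output While bisecting KiCad 8 syntax, running kicad-cli sch erc without --output wrote each report to <input>.rpt in the current directory; a dozen or so runs left 20 stray .rpt files in the main directory. Added the pattern to ignore so they do not come back. Also cleaned up what this session left behind: 20 .rpt reports (already rm'd) data/state/std_probe[1-5]/ 5 old probe state directories (~8.5 MB stale; the probe scripts for these directories were deleted in the previous session, and the state itself is no longer useful) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 23:11:20 +08:00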
Knowit	fb577cc89f	tools/epro2/kicad: fix two KiCad 8 parse blockers (newline + pin_numbers) After installing kicad 8.0.9 (apt PPA) and running kicad-cli sch erc to validate the .kicad_sch files we emit, all 9/9 sheets initially reported "Failed to load schematic file": parsing died at the parent node. Bisect found two syntax bugs: 1. (pin_numbers (hide no)) is not accepted by KiCad 8 In KiCad 8 lib_symbols, `pin_numbers` is token-form and does not accept a (hide yes/no) sub-block. Either omit the whole block (visible by default) or write `(pin_numbers hide)` to hide. The original `(hide no)` style is old KiCad 7 syntax. Fix: tools/epro2/kicad/sym_writer.py drops the (pin_numbers (hide no)) line; KiCad's default visible behavior is exactly what we want. 2. Literal \n / \r / \t inside a string aborts the KiCad parser The Overview sheet of ESP-VoCat has TEXT "Battary\n3.7V 700mAH" (a multi-line battery label), stored in EPRO2 as a literal 0x0a character. We emitted it verbatim inside a "..."-quoted string → the KiCad reader hits \n inside the quoted string and reports a parse error with no message. Fix: tools/epro2/kicad/sexpr.py adds \n / \r / \t escaping on the str escape path; the reader gains \r decoding (for round-trip). After the fix: 9/9 sheets parse OK in KiCad 8.0.9 ERC runs to completion; the 9 sheets total 2793 violations, distributed as: 1372 endpoint_off_grid (49%, cosmetic: the 30-mil EPRO2 grid does not snap to KiCad's default 50-mil grid) 571 lib_symbol_issues (20%, cosmetic: the facere library is not registered in the user library table; the library is embedded inline in the .kicad_sch and usable) 444 wire_dangling (16%, real: wire endpoints not exactly aligned with pins) 406 pin_not_connected (15%, the other side of the same issue) Cosmetic accounts for 70%, real connectivity for 30%; to be handled in the next phase: - grid calibration (round coords precisely onto one uniform grid) - pin tip endpoint matching (KiCad needs wire endpoint == the absolute coordinate corresponding to the pin's (at) field; floats must be exactly equal) - generate a sym-lib-table that registers the facere library (clears lib_symbol_issues) Tests: + test_string_escapes_newlines_and_tabs + test_lib_symbol_omits_pin_numbers_block reader gains \r decoding 41/41 pass (39 old + 2 new). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 23:04:58 +08:00
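The string-escaping fix in fb577cc89f can be sketched as a small quoting helper. This is an assumption-laden illustration, not the contents of sexpr.py: the name `quote_sexpr_string` and the escape table are hypothetical, covering backslash and double quote plus the three control characters the commit names.

```python
# Characters that must not appear raw inside a quoted S-expression string.
_ESCAPES = {
    "\\": "\\\\",
    '"': '\\"',
    "\n": "\\n",
    "\r": "\\r",
    "\t": "\\t",
}


def quote_sexpr_string(value: str) -> str:
    """Wrap value in double quotes, escaping characters the reader rejects."""
    return '"' + "".join(_ESCAPES.get(ch, ch) for ch in value) + '"'
```

Applied to the multi-line battery label from the Overview sheet, the literal 0x0a byte becomes the two-character sequence backslash-n, so the emitted token stays on one line.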
Knowit	8a91ce43f4	tools/epro2/kicad: Phase-2 lib_symbols: render symbol bodies from SYMBOL docs The .kicad_sch emitted by Phase 1 had correct component positions and attributes, but lib_symbols was an empty stub, so KiCad rendered every component as a red "?". Phase 2 translates the PART + RECT/POLY/CIRCLE/TEXT/PIN primitives in SYMBOL documents into KiCad lib symbol blocks and fills them into lib_symbols, so KiCad shows the real schematic symbols. New tools/epro2/kicad/sym_writer.py: write_lib_symbol(symbol_doc) → S-expr list of the form: (symbol "facere:<partId>" (pin_numbers (hide no)) (pin_names (offset 1.016)) (in_bom yes) (on_board yes) (property "Reference" "U" ...) (property "Value" "<title>" ...) (property "Footprint" "" hide) (property "Datasheet" "" hide) (symbol "<partId>_1_1" (rectangle ...) ← from RECT.dotX1/Y1/dotX2/Y2 (polyline (pts ...)) ← from POLY.points + closed → fill (circle ...) ← from CIRCLE.center/radius (text "..." ...) ← from TEXT.value/x/y/rotation (pin <type> line (at ...) (length ...) (name ...) (number ...)) ← from PIN + sibling ATTR ops )) PIN name/number/electrical-type resolution (the key data-probing finding): an EPRO2 PIN does not carry number/name/type fields directly; this information is stored as separate ATTR operations (parentId=<pin_id>, key="Pin Name"/"Pin Number"/"Pin Type") Pin Type value mapping: IN→input, OUT→output, BIDIR→bidirectional, POWER_IN→power_in, POWER_OUT→power_out, NC→no_connect, ... default passive (conservative) sch_writer integration (lib_symbols filled automatically): write_sch_page(doc, project_relations=pr) gains an optional pr parameter; the internal _build_lib_symbols() collects the partIds used on this sheet → resolves them to SYMBOL documents via ProjectRelations.parts_by_id → write_lib_symbol → assembles the (lib_symbols ...) block; when one partId has several SYMBOL candidates the first is taken, deduplicated WriteStats gains lib_symbols_embedded / lib_symbols_missing CLI adds --no-lib-symbols to return to Phase-1 behavior (for placeholder debugging). ESP-VoCat re-export verification: 9/9 SCH_PAGE, all with 0 lib_miss P1_45092758.kicad_sch wires=187 symbols=138 lib_emb=29 codec_0b0163fa.kicad_sch wires=190 symbols=112 lib_emb=20 Interface_b336a7c7.kicad_sch symbols=95 lib_emb=13 ... 
P1_408c9f4f.kicad_sch wires= 6 symbols= 10 lib_emb= 3 Tests: 6 new unit tests cover outer wrapper / pin ATTR pull / multi-shape primitives / sch_writer integration path / missing-lib count / no-pr fallback to Phase 1. 39/39 pass in total (parser 6 + relations 9 + project_relations 6 + sexpr 6 + sch_writer 6 + sym_writer 6). Next step, Phase 3: footprint library + .kicad_pcb export. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 22:45:19 +08:00
Knowit	9213429a57	tools/epro2/kicad: Phase-1 EPRO2 → KiCad schematic exporter First version of the EPRO2 → .kicad_sch conversion: writes the wires + COMPONENT placements + TEXT of a SCH_PAGE Document to a sch file that KiCad 7+ can open. Symbol bodies are not included (lib_symbols is left as an empty stub), so components render in KiCad as red "?" placeholders, but wiring + positions + Designator/Value attributes are all correct. Full symbol library export is left for Phase 2. Module layout: tools/epro2/kicad/sexpr.py hand-written S-expr emitter; Sym marks bare symbols, str is auto-quoted + escaped; floats drop trailing zeros; bool→yes/no; NaN/Inf raise explicitly tools/epro2/kicad/_sexpr_reader.py minimal S-expr parser, for round-trip tests only (not a full KiCad reader) tools/epro2/kicad/sch_writer.py write_sch_page(doc) → str; handles: LINE → (wire (pts ...) ...) COMPONENT → (symbol (lib_id facere:<partId>) (at x y rot) (property Reference ...) ...) TEXT → (text "..." (at ...)) units mil → mm × 0.0254; zero-length wires skipped tools/epro2/kicad/__main__.py CLI: --doc <uuid> | --all-sch ESP-VoCat verification (python -m tools.epro2.kicad <project> --all-sch): all 9 SCH_PAGE converted successfully P1_408c9f4f.kicad_sch wires= 6 symbols= 10 text= 0 skipped= 2 (370 lines) P1_ee409917.kicad_sch wires= 20 symbols= 14 text= 0 skipped= 3 P1_54743d77.kicad_sch wires= 42 symbols= 30 text= 3 Overview_dc13d6d2.kicad_sch wires= 0 symbols= 1 text= 34 (notes page) MCU_510cff33.kicad_sch wires= 91 symbols= 86 text= 9 Interface_b336a7c7.kicad_sch wires= 99 symbols= 95 text= 6 P1_5c38f45b.kicad_sch wires=179 symbols= 86 text= 9 P1_45092758.kicad_sch wires=187 symbols=138 text= 10 (main schematic) codec_0b0163fa.kicad_sch wires=190 symbols=112 text= 10 Output lands in data/processed/kicad_sch/<filename>.kicad_sch (gitignored, regenerable; not committed). Tests: 6 sexpr tests + 6 sch_writer tests, including round-trip parse verification. The existing 21 parser/relations/project_relations tests are untouched; 33/33 pass in total. Next steps: 1. Phase 2: symbol library export (.kicad_sym), converting the PIN/RECT/TEXT primitives of SYMBOL docs into KiCad symbol bodies; fill the lib_symbols block so components render as real schematic symbols 2. footprint library + .kicad_pcb export 3. run ERC validation with the KiCad CLI (kicad-cli sch erc) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 22:29:15 +08:00
Knowit	3052e42991	tools/epro2: add ProjectRelations for cross-document resolution Per-doc Relations is not enough once cross-doc references pile up: the pad in a PCB PAD_NET composite id [PAD_NET, comp, pin, pad] is actually a pad instance in a FOOTPRINT document; a SCH_PAGE COMPONENT.partId points to the PART.id of some SYMBOL document. ProjectRelations does project-level aggregation on top of per-doc Relations and stitches these cross-document references together. Mapping rules found during the probe stage (ESP-VoCat), now written into the docstring: 1. SCH_PAGE COMPONENT.partId === PART.id in some SYMBOL doc - two naming styles: 'pid<hex>' (anonymous/system part) + '<name>.<n>' (named SKU), but both equal PART.id directly; they are not separate namespaces - the same PART.id may appear in several SYMBOL documents (library snapshots); parts_by_id keeps all of them and the consumer usually takes the first 2. PCB COMPONENT.id → FOOTPRINT document UUID via a separate ATTR op: ATTR(parentId=<comp>, key="Footprint", value=<fp_doc_uuid>) The COMPONENT.attrs sub-dict holds only housekeeping fields (Unique ID / Channel ID / ...) and no footprint reference. This differs from the schematic, where partId sits on the COMPONENT itself, and is an asymmetry in the EPRO2 stream 3. The pad in PCB PAD_NET[comp,pin,pad] is a pad id internal to the FOOTPRINT document; resolution chain: comp → ATTR Footprint → FOOTPRINT relations.pads[pad] API: ProjectRelations.build(project): single-pass build resolve_symbol_docs(sch_uuid, comp_id) → [SYMBOL doc uuids] resolve_footprint_doc(pcb_uuid, comp_id) → FOOTPRINT doc uuid | None pad_in_footprint(fp_uuid, pad_id) → PAD payload | None resolve_pcb_pad_net(pcb_uuid, comp, pin, pad) → {footprint, pad} | None attrs_for_pcb_component(pcb_uuid, comp_id) → {key: value} folded CLI adds --project-relations; ESP-VoCat run: documents 278 distinct_parts 87 duplicated_parts 9 pcb_components_with_footprint 206 pcb_components_unresolved_footprint 0 sch_components_with_partid 572 sch_components_unresolved_part 0 PCB sample check: comp=e0 → fp=1069352d81c6 Designator='U8', PAD_NET pin=1 pad=e7 net=GND resolves cross-document to coordinates (-37.4,-45.24). Tests: 6 new unit tests cover partId→symbol, comp→footprint, PAD_NET cross-document, attrs folding, unresolved counts. parser + relations + project_relations: 21/21 pass in total. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 22:22:39 +08:00
Knowit	7f9e2fad73	tools/epro2: add Relations layer for cross-object navigation Adds a Relations layer on top of replay's flat objects[id] -> payload map, building indexes and back-references that stitch isolated objects into a traversable graph; this is a prerequisite for the intermediate representation of the upcoming EPRO2 → KiCad converter. Relations.build(doc) scans all objects in a single pass and produces: Primary collections (bucketed by type): parts / components / pins / pads / wires / nets / layers / rules Composite ID parsing (key): '["LAYER",1]' → layers[1] '["NET","GND"]' → nets["GND"] '["PAD_NET","e0","1","e7"]' → pad_nets_by_pad/by_net '["RULE","SAFE","copperThickness1oz"]' → rules[("RULE","SAFE",...)] Back-references: obj_ids_by_part partId → ids of referencing objects (RECT/TEXT/PIN inside a lib all carry partId) components_by_part partId → component ids attrs_by_parent parentId → ATTR ids lines_by_wire WIRE.id → LINE ids (a wire consists of several LINE segments) pad_nets_by_pad PAD.id → PAD_NET records pad_nets_by_net net name → PAD_NET records objects_on_layer / objects_in_net reverse lookup by field Convenience accessor: attrs_dict(parent_id) folds all ATTR ops into a {key: value} dict (last write wins); the usual entry point for fetching Designator/Value/Footprint per component during KiCad conversion ATTR.parentId resolution (two pitfalls found in practice): 1. It points not only at COMPONENT/PART but also, very often, at WIRE (net labels / net attributes on the schematic). The original check missed these, producing 636 false-positive unresolved entries; changed to "any hit in doc.objects[parentId] counts as resolved" 2. The composite form `<comp_id>-<pin_id>` attaches an ATTR to a specific pin of a component (e.g. PinName). `_resolve_parent()` falls back to split("-",1) CLI gains --relations, with stats aggregated by docType: uv run python -m tools.epro2 data/raw/oshwhub/<uuid> --relations ESP-VoCat verification: SCH_PAGE 9 docs : 572 components, 563 wires, 934 lines_grouped, 4111 attrs_attached, 0 unresolved_parents PCB 6 docs : 206 components, 807 pad_nets, 173 nets, 544 layers SYMBOL 105 docs : 106 parts, 560 pins, 1680 attrs_attached FOOTPRINT 55 docs: 496 pads, 9 nets, 1771 layers, 140 rules Note: pads=6 vs pad_nets=807 inside PCB docs is not a contradiction. PAD instances live in FOOTPRINT documents, and the PCB stream references them across documents through the ["PAD_NET",comp,pin,pad] composite id; resolving "which pad of which footprint a given pin of comp goes through" needs project-level Relations aggregation (next task). Tests: 9 unit tests in tools/epro2/tests/test_relations.py cover composite id parsing, lineGroup linking, direct/composite parentId resolution, partId reverse lookup, and attrs folding. parser + relations: 15/15 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 22:17:28 +08:00
Knowit	3c57e75d51	Add tools/epro2: EPRO2 parser + replay prototype Parsing skeleton for Pro 3.x .epro2 project-source data, the groundwork before a downstream EPRO2→KiCad converter. Runs end to end on ESP-VoCat (278 docs / 7.5 MB) + the 220V desktop power supply (771 docs / 26 MB) with 0 parse errors. Module layout: tools/epro2/parser.py one line → Op: rstrip("|") + split("||") + json.loads tools/epro2/replay.py state machine: DOCHEAD sets the header; every other op is an upsert by id (payload=None means delete); EDIT_HEAD/META/CANVAS/PREFERENCE/PANELIZE are stored as doc-level singletons tools/epro2/__main__.py CLI: given a project directory, replays every doc listed in manifest.json and prints output aggregated by docType, plus an optional --dump-doc for single-document detail tools/epro2/tests/ 6 unit tests pinning down pitfalls such as trailing-pipe / three-part messages / id-only-no-payload / embedded pipe characters ESP-VoCat sample output: Documents: 278 (parse_errors=0) count docType objects ops deletes untyped_ops 105 SYMBOL 4124 4439 0 0 88 DEVICE 88 264 0 0 55 FOOTPRINT 4641 4855 0 0 9 SCH_PAGE 7982 8167 42 0 6 PCB 8428 8547 38 0 6 BOARD 9 18 0 0 6 SCH 9 26 0 0 1 BLOB 4 8 0 0 1 FONT 16 28 0 0 1 CONFIG 2 3 0 0 Top ops: ATTR 7035 / ELE_PLACEHOLDER 4225 / LINE 3005 / LAYER 2318 ... A single-document dump of a PCB doc confirms the semantics: META contains title (PCB-EchoEar-CoreBoard-V1_0) + board reference; CANVAS contains origin/grid/unit (mm); LAYER 1/2/3 = TOP/BOTTOM/TOP_SILK with complete colors. How to run: uv run python -m tools.epro2 data/raw/oshwhub/<project_uuid> uv run python -m tools.epro2 data/raw/oshwhub/<uuid> --dump-doc <doc_uuid> Next steps (not in this commit): 1. build the relations between objects (COMPONENT.partId → PART; LINE.lineGroup → WIRE; three-way PAD_NET id → PAD + NET association); replay currently produces only a flat dict 2. EPRO2 → KiCad serialization layer (hard prerequisite for Forge projection) 3. run a full regression on the three Pro 3.x projects (the X86 motherboard with 7374 docs can serve as a stress test) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 22:10:27 +08:00
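Knowit	c721e08c93	projects.md: replace Comments column with 版本 ("Version": Std / Pro 3.x / Pro 2.x) The Comments column is a weak signal of project "quality" (comment volume mostly tracks how hot the topic is); the replacement 版本 ("Version") column tells the reader directly which EDA format and editor version each project source uses. Of the current 15 projects, 10 are Std / 3 Pro 3.x / 2 Pro 2.x. source_format field mapping: easyeda-std → Std easyeda-pro → Pro 3.x easyeda-pro-legacy → Pro 2.x anything else → passed through editor_version (e.g. 6.5.43 / 3.2.91 / 2.1.40) is placed on a second line as a sub-label. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 22:01:41 +08:00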
Knowit	c6279bff08	Add EasyEDA Pro 2.x legacy source ingestion (5/5 batch closure) Completes the 2 legacy Pro projects that failed in the previous batch (立创·泰山派 RK3566 and 立创·梁山派) and gets the source-fetch pipeline working for legacy Pro 2.x projects. Together with the modern Pro 3.x path from the previous commit, EPRO2/dataStr now works end to end for all 5/5 Pro projects in this repo. Pro 2.x and Pro 3.x are two completely different storage models: - Pro 3.x: git-style branch + linear history chain; an AES-128-GCM-encrypted incremental EPRO2 message stream, replayed by history (working since the previous commit) - Pro 2.x: no branch / no history. Documents are stored as EasyEDA Std plaintext dataStr (same ["DOCTYPE","SCH","1.1"] format) and fetched in batches by doc UUID through /api/v2/documents/lists; the main body is unencrypted and only the component library uses AES The Pro 2.x fetch chain was reverse-engineered from a HAR (tmp/prodownload3.har, 178 requests): GET /api/v4/projects/<P> → boards: [{sch, pcb, name}] GET /api/projects/<P>/ticket?uuid=&g_ticket=-1 → full project manifest POST /api/schematic/lists {uuids:[<sch>]} → sort: [{uuid:<sheet>}] POST /api/v2/documents/lists {uuids,docType:1} → schematic plaintext POST /api/v2/documents/lists {uuids,docType:3} → PCB plaintext POST /api/coppers/search {paths} → copper-pour layers POST /api/textpath/search {paths,project_uuid} → fonts/text POST /api/v2/resources/search {hash,project_uuid} → BLOB images Implementation: - crawlers/oshwhub/crawler.py: - fetch_pro_source() refactored into a dispatcher: it first GETs the project meta and checks branch_uuid; null means legacy and goes to _fetch_pro_legacy(), non-null goes to _fetch_pro_modern() - _fetch_pro_legacy() added (pulls all docs + auxiliary layers following the 9-step flow above) - _pro_post_json() POST helper (symmetric with _pro_get_json) - schemas/project.schema.json: source_format enum gains easyeda-pro-legacy - docs/sources/easyeda_pro_source.md rev 4: §1.1 legacy vs modern discrimination table updated, §2.7 adds the legacy fetch flow + measured data On-disk layout (legacy): source/ticket.json full manifest source/<sheet_uuid>.json one per schematic sheet (including dataStr) source/pcb_<pcb_uuid>.json one per PCB source/coppers.json/textpath.json/blobs.json auxiliary PCB-layer resources source/manifest.json index Measured: 立创·梁山派 editor=2.1.30, 2 sheets+1 pcb, 1.0 MB, 78 sym/191 fp/128 dev 立创·泰山派 RK3566 editor=2.1.40, 29 sheets+1 pcb, 0.8 MB, 299 sym/524 fp/295 dev Legacy projects are two orders of magnitude smaller than modern ones (梁山派 1 MB vs RK3576 66 MB): there is no incremental history, the component library goes through a separate endpoint, and the data is already the current snapshot. Final summary of the 5/5 Pro projects: X86 motherboard easyeda-pro 3.2.15 7374 docs / 481 MB 泰山派 RK3566 easyeda-pro-legacy 2.1.40 30 docs / 0.8 MB 梁山派 easyeda-pro-legacy 2.1.30 3 docs / 1.0 MB 220V desktop power supply easyeda-pro 3.2.69 771 docs / 26 MB ESP-VoCat easyeda-pro 3.2.91 278 docs / 7.5 MB Total 8456 docs / ~516 MB plain. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 21:59:25 +08:00
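Knowit	3282a028c4	Add EasyEDA Pro EPRO2 source ingestion (3/5 batch test) Gets the EPRO2 source-fetch pipeline working end to end for modern Pro 3.x projects (oshwhub origin=pro). 3 of the 5 Pro projects in the batch, the modern ones, decode completely (8423 docs / 542 MB plain in total): - X86 motherboard 7374 docs / 481 MB plain (chain=85, editor=3.2.15) - 220V desktop power supply 771 docs / 26 MB plain (chain=28, editor=3.2.69) - ESP-VoCat 278 docs / 7.5 MB plain (chain=12, editor=3.2.91) The remaining 2/5 are legacy Pro 2.x (立创泰山派 RK3566, 梁山派): the project meta returns branch_uuid=null + editorVersion="2.1.40", there is no git-style chain model, documents hang directly off the boards[].sch/pcb fields, and the access endpoints have not been worked out yet; metadata is stored in metadata.json and source/ is left empty. Implementation notes: - fetch_pro_source(): 4-step flow (project → branch HEAD → structures → /branches/<B>/histories/<HEAD>, which returns the complete chain, so the ?limit batch endpoint is not needed) + per-history AES-128-GCM decryption (16-byte IV, supported natively by pycryptodome) + gunzip + splitting into per-doc EPRO2 streams at DOCHEAD - EPRO2 parsing pitfall: the single `|` at end of line is a line terminator, not a field separator; rstrip("|") must come before split("||"), otherwise payload JSON parsing fails, the error is silently swallowed, and cur_doc is never set → in the first run only 2 of the X86 board's 7374 docs were extracted - docType in practice goes far beyond BOARD/PCB/SCH/SCH_PAGE and also includes SYMBOL / FOOTPRINT / DEVICE / BLOB / FONT / CONFIG: Pro stores component-library snapshots in the project history as well, so a downstream EPRO2→KiCad conversion must first load these lib docs into a symbol cache - Pro 2.x vs 3.x are different storage models: 3.x uses the branch model (working), 2.x links documents directly from boards[] (not yet working); discriminator: whether branch_uuid in the project meta is null CLI adds --with-pro-source / --backfill-pro-source / --pro-cookie / --origin (server-side filtering of the listing API by the origin field); crawl_one() automatically dispatches to the Pro fetcher when origin=pro. schema: docType type relaxed from integer to [integer, string, null] (compatible with Std's 1/3 and Pro's BOARD/SCH etc.); new message_count field. License note: all 5 projects in this batch are NC-SA / GPL and do not meet the Forge whitelist in Pro source doc §4.2 (MIT/BSD/Apache/CC0/CC-BY/CERN-OHL-P/Unlicense). Under the CLAUDE.md "research use, no redistribution" principle, storing the raw data is fine; the whitelist is applied separately at Forge projection time. Technical details: docs/sources/easyeda_pro_source.md rev 3 + log.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 21:45:52 +08:00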
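Knowit	d874278bc5	Add EasyEDA Std project source ingestion (10 boards backfilled) Gets the project-source (schematic + PCB dataStr) fetch pipeline working for oshwhub origin=std projects. plan.md §1.6 originally assumed login was required; in practice lceda.cn/api/documents/<doc>?uuid=<doc>&path=<doc> is anonymously accessible for public projects, so no cookie is needed and there is no risk of account bans. Investigation: 4 rounds of probing left traces in data/state/std_probe[1-5]/ (gitignored); the ajaxDetail endpoint was found by reading the main.min.js bundle of Std editor v6.5.51; two response shapes are distinguished by docType (schematic project view vs PCB document view). Crawler: - make_source_client() uses a browser UA + lceda.cn/editor Referer, because the oshwhub /api/project/<uuid> endpoint rejects the FacereDataset/0.1 UA (CLAUDE.md UA exception clause: target site actively blocks custom UAs + public static resources) - fetch_std_source(): project meta → version_documents → per-document dataStr → written to source/<doc>.json + source/manifest.json - --with-source (fetch sources while crawling new projects) / --backfill-source (scan existing projects only) - self-imposed QPS ≤ 0.2 (SLEEP_SOURCE = 5s) Schema: adds source_format / source_path / source_documents / editor_version (the first 3 are locked by enum, to ease later alignment with Pro / KiCad sources). Backfill result: 10/10 succeeded, 45 documents, 33.2 MB; schema validation passes for all. docTypes are mainly 1 (schematic) and 3 (pcb); the USB voltage/current meter has PCB documents only (4: main board + cover + base + panel; the author did not upload schematic sources). Full investigation: docs/sources/easyeda_std_source.md.	2026-04-28 20:07:40 +08:00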
Knowit	b0d3afd2a9	update readme	2026-04-26 11:54:01 +08:00
Zhang Jiahao	a3942c03df	Update EasyEDA Pro source research	2026-04-24 00:40:18 +08:00
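Zhang Jiahao	a16cb11c7d	Add easyeda_pro_source.md: full Pro project-source chain + EPRO2 format analysis Why: - The project-source fetch chain for pro.lceda.cn (立创 EDA 专业版, the Professional edition) now works: 4-step API + AES-128-GCM decryption + gzip decompression + EPRO2 message-stream parsing. All of this needs to be written down and kept independently so it is not lost; it also paves the way for implementing or selecting an EPRO2 → KiCad converter. - It sits alongside oshwhub.md (Std edition) as a standalone investigation doc: Pro and Std are two independent editors with different cookies/APIs/formats, and mixing them would only confuse. What: - docs/sources/easyeda_pro_source.md: * TL;DR table + §1 Std vs Pro comparison * §2 4-step API chain + required headers (Editor-Version/path/Referer/Cookie) + Python decryption code + measured data (2.7 MB source stream / 8357 messages) * §3 full classification of the EPRO2 format: 40 message types grouped by function (parts/geometry/PCB/layers/rules/...) + samples per class * §4 security and compliance (risk control / license / key-leak semantics) * §5 gap table for integrating with Forge (OSHWHUB_INGEST_SPEC.md) * §6 7 known unverified items * Appendix A one-shot rerun commands - pyproject.toml: + pycryptodome>=3.23.0 (AES-GCM decryption dependency) - log.md: this session's record Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 00:11:32 +08:00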
Charles Zhang	f797205dc8	add epec md	2026-04-23 23:42:21 +08:00
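Zhang Jiahao	1a67df44ba	Add docs/infra.md: dev1 (Guangzhou) deployment record Why: - Delivers the infrastructure doc required by Phase 0.5. The machine is deployed; the doc records machine specs, SSH conventions, repo location, dependency installation, credential-management rules, environment variables, long-running-job and logging policy, and disk emergency thresholds, so that the author or a new collaborator can pick things up quickly. - Strictly contains no credential values (token / cookie / password); only events and structure. Deployment notes: - cloned with GIT_LFS_SKIP_SMUDGE=1: saves bandwidth and disk (local repo is 32MB instead of 535MB+); historical LFS objects are pulled on demand - uv uses the Tsinghua mirror + only-system python 3.10: sync completes in seconds and avoids downloading a standalone Python - ~/.secrets/ mode 700 is ready Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 23:36:21 +08:00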
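Zhang Jiahao	fb7c0488bc	Relax Python requirement 3.11 → 3.10 Why: - Ubuntu 22.04 (the distro on the Guangzhou cloud server dev1) ships python3.10. Requiring 3.11 made uv download a standalone Python interpreter, and pulling Astral's python-build release from inside China is slow. With 3.10 the system Python is used directly and sync completes in seconds. - The code uses no 3.11+ features (it uses `from __future__ import annotations`, which covers PEP 604/585 syntax under 3.10), so lowering the version costs nothing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 23:34:54 +08:00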
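Zhang Jiahao	b0ddcf3f14	Allow login content; plan cloud infra, storage tiers, EDA→KiCad conversion Why: - Policy change: content accessible only after login moves from "forbidden" to "in scope for this project", with explicit credential-management red lines (legitimate accounts, never in git, isolated on the cloud server). This unlocks the u.lceda.cn project-source JSON, a key upgrade in training-data quality. - "Storage" and "runtime environment" had stayed vague in the plan; there is now a clear path based on the Guangzhou cloud server provided by Charles plus tiered storage evolution (Gitea LFS → object storage). - Bridging the oshwhub (EasyEDA) and bshada/open-schematics (KiCad) ecosystems needs an EDA→KiCad batch-conversion script. It goes into the plan now and will be implemented once project sources are available. What: - CLAUDE.md: the login clause changes from "do not crawl" to "may crawl with a legitimate account"; credential management is fixed to ~/.secrets/, events are recorded in docs/secrets.md; compliance red lines updated to match - plan.md §0.5: new infrastructure section (machine initialization / scheduling / obtaining login state) - plan.md §1.4: tiered storage evolution (< 50 GB cloud disk, 50-200 GB evaluate, > 200 GB migrate to object storage) - plan.md §1.6: fetch u.lceda.cn project sources with login - plan.md §1.7: scripts/convert_to_kicad.py batch processing, candidate easyeda2kicad.py - plan.md risk table: adds three entries (account ban / conversion failure / cloud-server single point of failure) - docs/sources/oshwhub.md: u.lceda.cn moves from "not open" to "login required, in scope" - README.md data-source table: adds a "login" column + runtime-environment notes - log.md: record of this policy change Unchanged: docs/infra.md is not added yet (to be written once the machine is in place and real details are known); scripts/convert_to_kicad.py is not yet implemented (waiting for project-source samples). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 20:57:30 +08:00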
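Zhang Jiahao	ba501c328c	Remove personal name from suggestion/decision phrasing Why: - Phrases such as "suggestions for Charles", "pending Charles's sign-off", and "needs Charles's decision" tie a specific person into the docs and go stale when the maintainer changes. Switching to the neutral "suggestion / pending decision / pending sign-off" makes the docs more general for future collaborators and agents. What: - log.md: four places drop "for Charles / still needs Charles's decision / waiting for Charles's sign-off" - plan.md: three places drop "pending Charles / Charles sets the target / needs Charles to decide" - docs/sources/hf_bshada_open_schematics.md: "pending Charles's decision" → "pending decision" - scripts/estimate_size.py: docstring drops "give Charles an estimate" - CLAUDE.md: the data-deletion confirmation rule changes from "confirm with Charles first" to "confirm with the user first" The remaining mentions of Charles are all factual: - "Maintainer: Charles" in README/plan (identity field) - "Charles asked for..." / "Charles specifically named..." in historical log.md entries (records of past events) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 20:01:52 +08:00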
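Zhang Jiahao	ed4837dedf	Rewrite oshwhub.md as canonical data source investigation Why: - Charles asked for the 12493-total verification and the 90-project sampling results to be merged into the main investigation doc, removing the duplication and scatter between oshwhub_corpus_estimate.md and oshwhub.md. - A high-quality data-source investigation should be self-contained: anyone (human or agent) who reads it can reproduce the crawl / estimate / compliance judgment without piecing things together across files. What: - docs/sources/oshwhub.md rewritten into 9 sections + appendix: - TL;DR table (one-page core facts) - site architecture / robots / API entry points / project-detail SSR / attachment CDN - exclusions: fs-web-stream.jlc.com promotional icons / u.lceda.cn login-gated sources - §4 project-total verification (new): three sort orders agree on 12493 + pagination boundary found by bisection ≈250 pages + grade-coverage sampling - §5 sampled corpus characteristics (merged from corpus_estimate): size median 9MB / p90 54MB, video is 54%, license distribution GPL 3.0 49% / Public Domain 21% - risk table with 7 entries, appendix with rerun commands - delete docs/sources/oshwhub_corpus_estimate.md (content merged into §5) - log.md: this session's record Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 19:59:05 +08:00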
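Zhang Jiahao	53b7648984	Add HF bshada/open-schematics to Phase 1 plan Why: - Charles specifically named this HF dataset for the first batch. It is an already-preprocessed package (not a site to crawl), so its ingestion logic differs from oshwhub; the decision points are laid out in the plan before anything is pulled. - It complements oshwhub (EasyEDA ecosystem) by adding a KiCad-native path. What: - docs/sources/hf_bshada_open_schematics.md: investigation doc - 78 parquet shards, 6.4 GB total - CC-BY-4.0, commercial-friendly - fields: .kicad_sch source / PNG / component list / JSON / YAML / name / desc - mirroring plan (store the whole package under data/external/..., not split per project) - .gitattributes suggestion (data/external/*/.{parquet,png} → LFS) - plan.md §1.5: phase description + pending Charles's approval of the 6.4 GB budget - README.md data-source table: adds one row - log.md: this session's record Download not triggered; waiting for Charles's sign-off. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 19:51:24 +08:00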
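Zhang Jiahao	ce22717288	Add projects.md index (stars-sorted) + build_index.py generator Why: - Charles wants an index page showing the ingested projects and their stars. Hand maintenance would drift, so scripts/build_index.py regenerates the page directly from metadata.json, keeping projects.md a mirror of data/raw/. What: - projects.md: 10 projects in descending order of Stars (from the highest, 加热台量产计划 (a hot-plate mass-production project) at 3293, to the lowest, 柚子爱 AI 相机 (an AI camera) at 236), with stars/likes/forks/views/comments/files/size, plus License and data-source distribution - scripts/build_index.py: scans metadata.json and renders markdown; supports multiple future data sources (distinguished by the source field); rerun it after adding new oshwhub / github / hackaday projects - README.md: adds a link to projects.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 19:48:21 +08:00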
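Zhang Jiahao	e222b08f27	Add corpus size/license estimator; snapshot 90-project findings Why: - The scale-up decision needs sturdier data than "52MB/project × 12493 = 650GB". scripts/estimate_size.py samples attachments[].size for 90 hot projects to get the real distribution (median 9MB / p90 54MB); the full-corpus estimate is 110GB at the median, with a p90 upper bound of 660GB. This gives Charles a credible storage budget. - The license and ext distributions also yield two important insights: (1) mp4+qt video takes 54% of storage, and a --skip-ext switch could save half; (2) NC (Non-Commercial) licenses are ~11%, so downstream must filter by whitelist. What: - scripts/estimate_size.py: download-free metadata sampler, reusing crawler.parse_detail_html - docs/sources/oshwhub_corpus_estimate.md: results snapshot + decision recommendations - log.md: this session's record Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 19:45:54 +08:00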
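Zhang Jiahao	c8d55a22eb	Add schema+file validator; pin down fs-web-stream as ad icons Why: - The schema must be automatically verifiable, otherwise later scale-up has no guard against rot. scripts/validate.py now runs two checks over every metadata.json (schema + local-file sha256), so one run signs off the whole dataset; 10/10 projects pass. - docs/sources/oshwhub.md previously marked fs-web-stream.jlc.com as "project source, to be investigated"; investigation confirmed that those URLs are all JLC service-sidebar/promotional icons unrelated to projects. image.lceda.cn/attachments/ is the only entry point for project attachments, which closes out the investigation doc. What: - scripts/validate.py: jsonschema validation + optional --check-files sha256 check - pyproject.toml: adds jsonschema>=4.26 dependency - docs/sources/oshwhub.md: fs-web-stream classified as promotional resources (excluded), with context evidence - log.md: this session's record Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 19:40:55 +08:00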
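Zhang Jiahao	5ffa10f256	Phase 1 MVP: crawl 10 high-quality oshwhub projects into LFS Why: - Charles's instruction: first crawl 10 high-quality projects into Gitea LFS, one folder per project, keeping original files and URLs. Validate the schema + LFS pipeline on a small batch first, then decide storage scale before scaling up. What: - crawlers/oshwhub: listing API (`/api/project?sort=hot`) + SSR HTML parsing, producing metadata / description / cover / files / _urls in one pass - schemas/project.schema.json: unified cross-source schema - docs/sources/oshwhub.md: API entry points / field mapping / pitfalls - pyproject.toml: httpx[http2] as the single dependency - .gitattributes: everything under data/raw//files/ goes through LFS (rule kept narrow to avoid catching schemas/.json etc.) - .gitignore: removes the data/raw/ exclusion (now committed via LFS) The 10 projects cover: debugger / hot plate / Geiger counter / digitally controlled power supply / soldering station / smartwatch / USB current meter / ZVS induction heating / AI dev board / infrared thermal imaging. 52 attachments in total ≈ 524 MB into LFS; selection criteria: grade=4 & likes>=100 & diversity. Known gaps (see plan.md § Phase 1.4): - EasyEDA source JSON requires login (u.lceda.cn); skipped in v0.1 - project-source download from fs-web-stream.jlc.com untested - scripts/validate.py automatic schema validation not implemented Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 19:34:09 +08:00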
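Zhang Jiahao	bf2370f83b	Initial skeleton for FacereDataset Why: - Facere needs a unified source of open-source hardware design data for training proprietary models and building a retrieval knowledge base. The repo starts as a skeleton, with compliance red lines, data-schema requirements, and crawler conventions written into CLAUDE.md so that per-site crawlers do not diverge later. - plan.md uses a phased roadmap to make the "breadth before depth, compliance before scale" strategy explicit, requiring alignment with Charles before any scale-up, which lowers storage and legal risk. Contents: - README.md: project intro, data-source table, repo layout, compliance statement - CLAUDE.md: project-level Claude instructions (workflow / crawler conventions / compliance red lines) - plan.md: Phase 0-6 staged plan + risks and open items - log.md: first log entry (investigation + initialization record) - .gitignore: excludes the contents of data/{raw,processed,state}, keeping directory placeholders - directory skeleton: crawlers/ schemas/ scripts/ data/ docs/sources/ Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 18:58:10 +08:00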
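
The EPRO2 line rule recorded in commits 3282a028c4 and 3c57e75d51 (strip the trailing single pipe first, then split on the double pipe, then decode each field as JSON; upsert by id, with a missing payload treated as a delete) can be sketched roughly as below. This is an illustration only, not the code in tools/epro2: the function names `parse_line` and `replay`, and the assumption that the first field is a JSON object carrying `type` and `id`, are hypothetical.

```python
import json


def parse_line(line: str) -> list:
    """Split one EPRO2-style line into JSON-decoded fields.

    The single trailing "|" is a line terminator, so it is stripped
    before splitting on the "||" field separator.
    """
    body = line.rstrip("\n").rstrip("|")
    return [json.loads(part) for part in body.split("||")]


def replay(lines) -> dict:
    """Fold a stream of lines into a flat {id: payload} dict.

    Assumption (hypothetical layout): field 0 is {"type": ..., "id": ...}
    and field 1, when present, is the payload. A missing or null payload
    is treated as a delete.
    """
    objects = {}
    for line in lines:
        if not line.strip():
            continue
        fields = parse_line(line)
        head = fields[0]
        payload = fields[1] if len(fields) > 1 else None
        if payload is None:
            objects.pop(head["id"], None)      # delete
        else:
            objects[head["id"]] = {"type": head["type"], **payload}  # upsert
    return objects
```

Splitting before stripping leaves the terminator glued to the last field, which then fails JSON decoding; a string value that contains a single pipe is unaffected because only the double pipe separates fields.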

44 Commits
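
The two ATTR.parentId pitfalls noted in commit 7f9e2fad73 (any object id counts as a valid parent, and `<comp_id>-<pin_id>` attaches an attribute to a pin of a component) together with the last-write-wins folding of `attrs_dict` can be sketched as follows. This is a minimal sketch under stated assumptions: the real `_resolve_parent()` and `attrs_dict()` signatures are not shown in the log, so the standalone functions and the list-of-dicts ATTR layout here are hypothetical.

```python
def resolve_parent(objects: dict, parent_id: str):
    """Return the object id an ATTR hangs on, or None if unresolved."""
    if parent_id in objects:
        # Direct hit: COMPONENT, PART, WIRE, or any other object.
        return parent_id
    # Fallback for the composite "<comp_id>-<pin_id>" form.
    comp_id = parent_id.split("-", 1)[0]
    if comp_id != parent_id and comp_id in objects:
        return comp_id
    return None


def attrs_dict(attrs: list, parent_id: str) -> dict:
    """Fold ATTR ops for one parent into {key: value}; last write wins."""
    folded = {}
    for attr in attrs:  # attrs are assumed to be in stream order
        if attr["parentId"] == parent_id:
            folded[attr["key"]] = attr["value"]
    return folded
```

For example, with `objects = {"e0": {}, "w1": {}}`, the parent ids `"e0"`, `"w1"`, and `"e0-3"` all resolve, while `"zz-1"` is reported as unresolved.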