Commit Graph

44 Commits

Author SHA1 Message Date
c6fd111d6d crawler: --no-cover, --concurrency, drop cross-host sleep + batch-50 Step 1 done
Three crawler ergonomics for batch operations:

--no-cover  Skip cover image download. For scan-only modes (license/meta
            scrape) this drops ~1.3s/project and avoids slow-CDN hangs.

--concurrency N  ThreadPoolExecutor wrapping the per-project loop. Default
                 1 = serial (current behavior). Anonymous endpoints tolerate
                 5+ comfortably; output uses a print lock for readable
                 interleaved progress. fetch_cover plumbs through crawl_one.

Drop cross-host sleep #1: in crawl_one between detail HTML (oshwhub.com)
and cover image (image.lceda.cn). Different hosts — sleep was unnecessary.
Saves ~1s/project. Sleep #2 (post-cover, before next iteration) stays — it
gates the next oshwhub.com hit.

download_to gains max_seconds wall budget (default 60s, cover uses 15s).
Defends against pathologically slow CDN connections — observed 10 KB/s
on image.lceda.cn for one project, would have hung 6+ min on a 3.6 MB
cover otherwise. httpx default timeout resets per chunk, so streaming
downloads need an external wall-clock guard.

batch-50 Step 1 (license/meta scrape) shipped:
  50/50 candidates have metadata.json + license recorded
  License distribution: GPL 3.0 32, Public Domain 6, NC variants 8,
                        CERN-OHL 1, MIT 1, CC BY 3.0 1
  Forge-friendly (non-NC): 41/50 (82%)
  Declared attachments: 180 files / 2.36 GB (median 18 MB/proj, max 304 MB)
  Walltime: 3min 26s for 28 projects at concurrency=5 (server-side
            HTML render bound, not sleep-bound)

One orphan partial cover (a670e60a...) cleaned up — leftover from the
first aborted run before the timeout fix landed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 01:35:11 +08:00
fe6971f3f9 tools/epro2: add std/ writer — EPRO2 → EasyEDA Std-format JSON for downstream
The downstream colleague consumes oshwhub Std (lceda) dict-format JSON,
not KiCad. The EPRO2 decryption part (per-doc plaintext .epro2 streams
in data/raw/<uuid>/source/) is what we already provide; the missing
piece is converting EPRO2 op-streams into the same `dataStr.shape`
tilde-delimited format their parser already speaks.

New tools/epro2/std/ module, peer of tools/epro2/kicad/, kept
deliberately separate so the KiCad path stays untouched:

  - pcb_writer.write_pcb_std() — high-fidelity, validated against a Std
    PCB sample at data/raw/oshwhub/3e2f893d.../25931ddab8.json. Maps
    LINE→TRACK, VIA→VIA, POUR→COPPERAREA (with SVG `M..L..Z` path),
    POLY→CIRCLE/SOLIDREGION, COMPONENT+FOOTPRINT→LIB nested with
    #@$-separated PADs (placement rotation + translate applied so pad
    coords land at PCB-absolute positions). Layer-id mapping (EPRO2 5↔7
    flipped vs Std solder/paste, 11→10 outline, 12→11 multi, SIGNAL
    inner 15+ → Std 21+) noted inline.

  - sch_writer.write_sch_std() — best-effort. Our corpus has zero Std
    schematic samples (docType=1) so verb field orders follow the
    EasyEDA Std public spec, not direct observation. Emits W (wire),
    N (net flag, including the 5-Voltage Global Net Name power-port
    pattern), T (text), LIB (placement with #@$-nested PIN/T). If
    downstream's parser bails the fix is almost certainly a positional
    field tweak, not a re-architecture.

  - __main__.py — flat output `<doc_uuid>.json` per doc directly under
    --out (mirrors Std's own data layout); --all-pcb / --all-sch / --all.

Smoke test on ESP-VoCat: 6 PCB + 9 SCH = 15 JSON files, libs_unresolved=0
across the board. Compact JSON (separators=(",",":")) matches Std's
single-line format. Numbers use _num() — integers without trailing .0,
floats trimmed.

71 → 82 unit tests pass.

Open questions for downstream: (1) confirm SCH verb field orders, (2)
do they want any of the upstream metadata fields we drop (master,
owner, created_at, etc — those live on the crawler side, not the
schematic itself)?

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 01:16:39 +08:00
ed713fa557 docs: consolidate rate-limit probe results into a proper benchmark report
The doc had been growing incrementally as each host got probed; reshape
it as a polished benchmark with TL;DR top, methodology section
(including safety constraints + caveats), per-host detailed tables,
final crawler settings, batch-50 walltime breakdown, and a reproduce
recipe.

Five hosts fully covered:
  pro.lceda.cn API   5.0s -> 0.5s  (10×)
  lceda.cn doc       5.0s -> 0.5s  (10×)
  oshwhub detail     2.0s -> 1.0s  ( 2×)
  oshwhub listing    2.0s -> 1.0s  ( 2×)
  modules.lceda CDN  0.2s          (already optimized)

Net effect on batch-50 plan: sleep total ~32min -> ~3min, walltime
~2h -> ~10-15min.

Key finding: the original 5s/req on Pro was set out of "logged-in
account is precious" caution with zero empirical evidence. Sustained
burst probe (25 distinct UUIDs at 0.5s, no recovery) showed 0/25 errors
and median latency 410ms — the caution was unjustified.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 00:57:35 +08:00
c474f8ad83 log: X86 motherboard hit OOM swap death-spiral on its CPU board PCB write
Killed at the 14-min mark — VmRSS 1.96 GB + VmSwap 1.41 GB on a 3.3 GB
RAM box with 4 GB swap (3.6 GB used), read_bytes 24 GB (pure swap thrash),
process state D (uninterruptible disk sleep). The CPU board PCB doc
(8K+ objects, 35+ child schematic pages) overflowed our current
all-in-memory build pattern: pcb_writer builds the full output list
before to_sexpr serializes once at the end, plus the 35 write_sch_page
calls each build their own Relations + lib_symbols dedup state.

Saved what finished: 4/5 X86 boards complete (Sch-CAM-IMX415,
Schematic1, SCHEMATIC1, Sch-VTX-SSC338Q), the CPU board SCHEMATIC1_1
has all its 35 child .kicad_sch but no .kicad_pcb. Final downstream
delivery: 17 board projects across the 3 supported Pro projects, 32/32
files pass kicad-cli (sch erc + pcb export svg).

Streaming-write fix is the next logical follow-up but out of scope
for this turn.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 00:56:17 +08:00
183f82a3be crawler: drop SLEEP_SOURCE 5.0 -> 0.5 (Std doc endpoint probe)
Ladder probe lceda.cn/api/documents/<uuid>: 5 tiers (5/2/1/0.5/0.25s)
× 9 distinct Std doc UUIDs = 45 reqs total, all 200/success. Latency
variance is dominated by payload size (Std docs span 4 KB to 4.5 MB)
not server backpressure. Same posture as Pro API.

Net effect on batch-50 estimate: Std 25 项 × 10 doc calls saved ~19
min wall time (21min sleep -> 2min sleep). Combined plan now projects
~2h -> ~10min walltime exclusive of download bytes.

scripts/probe_rate_limit.py: --host std-doc tier added. Reads doc UUIDs
from /tmp/std_doc_uuids.json (assembled by caller from any source/manifest.json
upstream_version_documents lists). Reusable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 00:54:46 +08:00
8b857428e3 log: post-mortem of running --all on the other 4 Pro projects
Captures the two new --all crash paths fixed in 61fd3ff (odd inner
copper layers, duplicate BOARD titles) plus the Pro 2.x scope gap
(Taishan + Liangshan are JSON-format, not EPRO2 streams, so our
replay_project reads the bytes but doc_type stays None and
_group_by_board returns no SCH/PCB groupings — needs a separate
Pro 2.x writer).

Status as of this commit: ESP-VoCat 6 boards + 220V power 7 boards =
13 project dirs ready for downstream corpus. X86 motherboard is the
largest of the five (7374 docs, 1.9 GB RAM in flight) and still
running.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 00:53:08 +08:00
61fd3ff072 tools/epro2/kicad: fix two --all crashes found running the other 4 Pro projects
Running the new --all on the remaining 4 Pro projects (X86 motherboard,
220V power supply, Taishan Pi, Liangshan Pi) surfaced two crash modes
not covered by ESP-VoCat:

1. Odd inner-layer count → KiCad rejects the file at load with
   "3 is not a valid layer count". The 220V power boards have one
   used inner SIGNAL layer (3 copper total: F.Cu / In1.Cu / B.Cu),
   but KiCad requires an even copper count. Fixed pcb_writer to pad
   with one empty inner layer when the inner count is odd, so the
   total stays even (2, 4, 6, ...).

2. Two BOARDs sharing the same META.title — twin "显示板" boards in
   the 220V power project — landed in the same project directory
   and the second silently overwrote the first's .kicad_sch /
   .kicad_pcb / .kicad_pro. Fixed --all to detect title collisions
   and suffix every colliding basename with the BOARD uuid prefix
   (so both '显示板' boards become '显示板_52e8cc76' and
   '显示板_55d32906' rather than one quietly winning).

71 → 73 unit tests pass (test_odd_inner_signal_count_padded_to_even_total
+ test_duplicate_board_titles_get_distinct_basenames).

Tangentially noted while running this: Taishan Pi and Liangshan Pi are
Pro 2.x JSON, not EPRO2 streams — our replay layer reads the files but
doesn't decode docType, so SCH/PCB grouping returns nothing. Pro 2.x
needs a separate writer; out of scope for this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 00:48:46 +08:00
cb868988b9 crawler: drop sleep rates 10x for Pro API, 2x for oshwhub detail
Calibrated against ladder probes on 2026-04-29. Findings in
docs/sources/probe_rate_limit_results.md.

  SLEEP_PRO     5.0 -> 0.5  (pro.lceda.cn API)
  SLEEP_BETWEEN 2.0 -> 1.0  (oshwhub detail/listing)
  SLEEP_SOURCE  5.0 unchanged (lceda.cn Std endpoints — not yet probed)
  SLEEP_PRO_CDN 0.2 unchanged (modules.lceda.cn — already optimized)

The original 5s rate for Pro API was set out of caution because Pro
requires a logged-in cookie. Empirical sustained-burst probe (25
distinct UUIDs at 0.5s sleep, no recovery): 0/25 errors, median
latency 410ms, p90 932ms. The "Pro is rate-sensitive" assumption was
wrong — server tolerates QPS=2 cleanly.

oshwhub detail HTML pages slowed from p90 6.4s at 1.0s sleep to
p90 15s at 0.5s — server queue backs up. 1.0s is the headroom-safe
water mark.

Net effect on batch-50 estimate: ~1.5h -> ~30min.

scripts/probe_rate_limit.py: rate-limit ladder probe tool. Reusable
for new endpoints (Std source still owes a probe). Designed for safety:
30s tier recovery, low rep counts on auth hosts, bail on first non-200.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 00:45:34 +08:00
3c00edf6db tools/epro2/kicad: --all emits paired .kicad_pro + .kicad_sch + .kicad_pcb per BOARD
KiCad pairs project files purely by basename + same directory: a folder
holding `Foo.kicad_pro`, `Foo.kicad_sch`, `Foo.kicad_pcb` opens as one
project on double-click of the .kicad_pro, with cross-tool navigation
(open footprint from schematic etc) wired up automatically.

  - pro_writer.write_kicad_pro() renders the minimal KiCad 8 JSON we
    need: meta.filename pinning the basename, sheets=[[<root_uuid>,
    ""]] binding the schematic root, and stub blocks for board /
    schematic / net_settings / erc that KiCad expects to find on the
    first GUI load.
  - root_sch_writer.write_root_sheet() now accepts an optional
    root_uuid so the caller can pass the same uuid into the .kicad_pro
    and .kicad_sch (the binding fails silently with mismatched ids).
  - CLI gains `--all`: groups SCH/PCB docs by their META.board uuid
    (1:1 in EPRO2), strips SCH-/PCB- editor prefixes from titles to
    derive a shared project basename, and emits one directory per
    BOARD with paired files. BOARDs whose SCH is DELETE_DOC (LCD-BD on
    ESP-VoCat) still get a .kicad_pro with sheets:[] + .kicad_pcb so
    pcbnew opens cleanly.

ESP-VoCat smoke: 6 boards → 6 project dirs, all pairs validated by
kicad-cli sch erc / pcb export svg. The CoreBoard pro/sch/pcb trio
shares root uuid 366d3e53...c2fccbe4330b end-to-end.

68 → 71 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 00:39:58 +08:00
adc5dc5e1b tools/epro2/kicad: PCB Phase-2 — POUR → (zone), CoreBoard unconnected -43%
Phase-1 left 75-358 unconnected_items per board (DRC), dominated by
GND/AGND/POWER nets that EPRO2 routes through copper pour, not discrete
traces. Phase-2 lands those:

  - pcb_writer._decode_zone_path handles the three POUR.path encodings
    seen in ESP-VoCat: rectangle (['R', x, y, w, h, ...]), circle
    (['CIRCLE', cx, cy, r]) approximated as a 36-segment polygon, and
    polyline (numeric pairs with 'L'/'ARC' verb tokens).
  - Each POUR on a copper layer turns into a (zone (polygon ...) ...)
    block plus a (filled_polygon ...) that mirrors the boundary.

Why mirror, not auto-fill: kicad-cli pcb drc does NOT run the zone
filler before checking — only the KiCad GUI does. Without a
pre-computed (filled_polygon ...), DRC sees zones as empty regions and
reports the entire net as unconnected. Mirroring the boundary as the
fill is "connectivity-correct, clearance-imprecise" — KiCad users can
still hit Edit > Fill Zones to refine thermals and pad clearances. We
chose this over reading EPRO2's POURED.pourFill (the editor's own
post-fill polygons) because POURED paths use ARC tokens we'd need to
fully decode, and the user-drawn POUR boundary is already the
authoritative "intended copper" region.

ESP-VoCat DRC totals: 883 → 730 unconnected_items (-17% project-wide).
CoreBoard, the 4-layer board with the most pour coverage, drops 358 →
205 (-43%). Other boards see no movement because their unconnected
items are non-pour issues — pads outside the user-drawn POUR
rectangle, or internal $1N nets via vias on the wrong net (separate
problem, separate fix).

65 → 68 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 00:27:33 +08:00
eee1a9b97e crawler: --skip-ext + --max-source-mb gates for batch-50 expansion
Two CLI gates needed before scaling Pro batch beyond top-5:

--skip-ext mp4,qt,mov  (attachment filter)
  Skips video extensions in attachment download. Phase 1 measurements
  showed mp4+qt occupy ~54% of attachment storage. Entry still recorded
  in metadata.json with skipped:ext:<token> so we can re-fetch later if
  the policy changes. Honors both server-declared `ext` and filename
  suffix, case-insensitively.

--max-source-mb N  (Pro source size cap)
  Trips inside the chain replay loop on encrypted-blob total. On trip:
  raise ProjectOversizeError, wipe partial source/, append a row to
  data/state/oshwhub_pro_oversize.jsonl. Lets us shortlist 50+ Pro
  projects without one X86-board-class outlier (~500 MB) blowing the
  LFS budget. Std and Pro 2.x legacy are not capped (both <2 MB in
  sample).

Verified:
  - cap=0 trips on first blob (1.2 MB), source/ wiped, state recorded
  - cap=100 runs full ESP-VoCat (7.5 MB plain, 278 docs)
  - skip-ext microtest: 8/8 cases (case-insensitive, declared/suffix
    fallback, empty-token edge cases)

Plan + frozen candidate list for the next 50 projects:
  - docs/plans/oshwhub_batch50.md
  - data/state/oshwhub_batch50_candidates.jsonl (gitignore exception added)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 00:24:55 +08:00
e61404478e tools/epro2/kicad: Phase-1 .kicad_pcb exporter — 6/6 boards open in KiCad 8
Phase-1 scope: produce a .kicad_pcb that kicad-cli loads cleanly and
that has the right geometry (nets, footprints, tracks, vias, board
outline) — not a 1:1 EDA round-trip. Skipped on purpose for Phase 2:
copper pours (POUR/POURED), manual FILL, teardrops, board-level
strings/images, ARC circle-center recovery.

What lands:
  - pcb_writer.write_pcb(): header/general, data-driven layer table
    (F.Cu = ord 0; B.Cu = ord 31; SIGNAL inner ids 15+ allocated to
    In1.Cu/In2.Cu/... in EPRO2-id sorted order so used inner layers
    stay contiguous), net-name → integer id map (id 0 reserved for the
    empty net per KiCad convention), LINE→segment / LINE→gr_line on
    Edge.Cuts, layer-11 POLY paths walked into Edge.Cuts gr_line chains
    (the actual board outline lives on POLY here, not LINE — without
    this stats showed edge=0), VIA→via.
  - footprint_writer.write_footprint_placement(): inline (footprint ...)
    blocks per PCB COMPONENT. EPRO2 RECT/ELLIPSE/OVAL/POLYGON pad
    shapes mapped to KiCad rect/circle/oval/custom; SMD vs THT detected
    by PAD.hole presence; SLOT holes use (drill oval w h). Pad nets
    resolved cross-doc via the existing PCB.PAD_NET → footprint.pad
    chain in ProjectRelations. layerId=2 component → (layer B.Cu) +
    text on B.SilkS so bottom-side parts render correctly.

Smoke test on ESP-VoCat (6 PCBs): all 6 pass `kicad-cli pcb export svg`
and render. DRC on smallest (MicBoard) reports 145 violations + 75
unconnected — most of the unconnected are GND nets that the EPRO2
source resolves through POUR copper, which Phase 2 will export.

CLI: `python -m tools.epro2.kicad <project> --all-pcb --out <dir>`
emits one .kicad_pcb per PCB doc.

52 → 65 unit tests pass. Float comparisons in tests use math.isclose
because the s-expr 6-decimal trim doesn't preserve strict equality
through `value * MIL_TO_MM` round-trips.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 00:18:32 +08:00
fc2a45f658 docs: explain per-doc .epro2 crawl vs web-export .epro2 ZIP
Colleague-facing explainer at docs/sources/pro_crawl_vs_export.md.
Addresses the "I see 278 .epro2 files but my browser only downloaded
one" confusion: web download is a ZIP container (extension is a UX
choice, not a format), our crawl produces per-doc message streams.
Both carry equivalent EPRO2 data; only real gap is IMAGE/ binary
previews which we don't fetch yet.

Why per-doc and not ZIP: the ZIP path has no public endpoint —
three HARs confirm the export button fires zero HTTP requests, it's
pure client-side JSZip on data already loaded by the editor. Our
crawler hits the same chain endpoints the editor uses internally,
which delivers per-doc streams.

Log entry references the 278 vs 266 doc-count delta for ESP-VoCat
(we walk full history chain, web export is a current snapshot).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 00:13:52 +08:00
1e06ba6582 crawler: split sleep policy by host — chain blob fetches drop 5s -> 0.2s
The Pro modern fetch_pro_modern walks a per-history blob loop on
modules.lceda.cn (CDN-flavored host serving AES-encrypted EPRO2 streams).
We were sleeping 5s between every blob — same rate we use for the
rate-sensitive pro.lceda.cn API host. HAR analysis (proexportNew2.har)
shows the editor fires these blobs back-to-back without throttling, so
0.2s is plenty.

Walltime drops linearly with chain length:
  ESP-VoCat (chain=12):    80s sleep -> 22s sleep  (-72%)
  220V power (chain=28):  160s sleep -> 26s sleep  (-84%)
  X86 board  (chain~700, projection):  ~1h -> ~3min

Verified by re-fetching ESP-VoCat + 220V power: byte-identical output
across all per-doc .epro2 files (sha256 match), only fetched_at
timestamp differs in manifest.json. Two manifest files re-stamped as
proof of the validation runs.

API host sleeps (4x 5s in modern fetcher, 7x 5s in legacy fetcher) are
unchanged — those go to pro.lceda.cn /api/ which still wants polite
QPS<=0.2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 00:09:19 +08:00
ff5553fb06 tools/epro2/kicad: hierarchical export + global_label + 5-Voltage power ports
Three coupled changes so kicad-cli sch erc runs at the project level
(across all sheets of one schematic) instead of single-sheet:

1. (label) → (global_label (shape passive)). EPRO2 nets are
   project-global by construction (named rails span every page in the
   SCH and physically wire across PCBs); KiCad's local label is sheet-
   scoped and triggers `label_dangling` for any name not duplicated on
   the same page.

2. New root_sch_writer that groups SCH_PAGE docs by their parent SCH
   (META.schematic), emits one root .kicad_sch per group with one
   (sheet ...) entry per child, and threads the root-assigned uuid back
   into each child's (sheet_instances) so KiCad can bind them.
   --all-sch now defaults to this; --flat falls back to one-file-per-page.

3. EPRO2's "5-Voltage" placeholder COMPONENT (partId
   pid8a0e77bacb214e, 365 instances on ESP-VoCat) is the editor's power
   port. The rail name lives in the placement's `Global Net Name` ATTR,
   not in the PART. We now emit a (global_label "<rail>") at the
   placement coords whenever that attr is set (101/365 of them on
   ESP-VoCat — the rest are unconfigured drafts).

ESP-VoCat 5 hierarchical roots: 2325 → 2265 violations. Modest because
5 of 6 SCHs are single-page (no cross-sheet nets to resolve), and the
one 4-page schematic (CoreBoard) shares only a handful of names across
sheets — most net names are de-facto sheet-local. The remaining ~190
pin_not_connected are dominated by 0402-style passives whose pin tip
lies on a wire's interior, not at an endpoint; KiCad needs an explicit
(junction) at those points and we don't yet emit one. Marked as the
next follow-up in log.md.

47 → 52 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 00:05:47 +08:00
54f0173947 tools/epro2/kicad: fix two structural ERC bugs — wire_dangling -88%, pin_not_connected -52%
Bisect found two semantics mismatches between EPRO2 and KiCad that cause
the 850 real-connectivity ERC violations on the ESP-VoCat ref project:

1. sym_writer was emitting lib coords without negating Y, but KiCad lib
   uses Y-up and re-flips Y on placement (Y-down schematic). So vertically
   arranged pins ended up at Y-mirrored absolute positions and wires that
   reach the geometric pin tip in EPRO2 missed the rendered pin tip in
   KiCad. Fix: lib_y = -epro2_y, lib_rot = (360 - rot) % 360 for pin/text.

2. sch_writer was treating each LINE as an isolated wire — but EPRO2
   binds segments into nets by NAME (WIRE.NET attr), not just geometry.
   Multi-segment nets like GND/VBUS show up as N disconnected stubs to
   KiCad. Fix: per-LINE, look up lineGroup → WIRE → NET attr and emit a
   `(label "<NET>")` at the LINE's start. Same-named labels on distinct
   physical wires is how KiCad's ERC recognizes a multi-segment net.

ESP-VoCat 9 sheets:
  wire_dangling           444 →  52  (-88%)
  pin_not_connected       406 → 196  (-52%)
  real connectivity total 850 → 248  (-71%)

Why we did NOT round to grid (the obvious-looking fix): EPRO2 places
some pins on a 10-mil pitch (e.g. magnetic socket); rounding to KiCad's
default 50-mil ERC grid would collapse those pins. The 248 residual is
fundamentally cross-sheet — single-sheet ERC can't see a net's other
endpoints on sibling sheets — and is a Phase-3 (hierarchical sheet)
problem, not a per-sheet one.

41 → 46 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 23:43:11 +08:00
5e63924474 oshwhub: pin listing index snapshot (33,695 rows, 29 MB) into git
Previous commit added the dump script + report but the actual jsonl was caught
by data/state/* gitignore. Add a targeted exception so the snapshot travels with
the repo — anyone who clones can do local filtering without re-hitting the API.

The data is regenerable (scripts/dump_listing_index.py is one-shot, ~1 min), but
pinning a dated snapshot lets us reason about "the state of the corpus on
2026-04-28" reproducibly. Future re-dumps overwrite the same path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 23:32:58 +08:00
d89a7cdf9c oshwhub: dump full listing index (33,695 projects) for batch sizing
Probed listing API and learned: total field is exposed (Pro=21,202 / Std=12,493),
pageSize accepts >=1000 (full corpus = 35 requests / 71s), sort param is silently
ignored. Dump all listings via scripts/dump_listing_index.py to local jsonl so
downstream batch-selection no longer hits the API.

Why: needed quantitative anchors before scaling Pro batch beyond top-5. License
is detail-page only (~19h serial scan), so we want to filter on grade/like
*locally* first to shortlist before paying that cost. Quality-tier counts now
known: A-tier (grade>=3 & like>=10) = 2,806 across both origins.

- scripts/dump_listing_index.py: one-shot scraper, polite QPS, streams to jsonl
- docs/sources/oshwhub_listing_full.md: human-readable report with growth
  trends, quality tiers, owner concentration, and storage-budget anchors

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 23:30:56 +08:00
67a2d0448b .gitignore: add *.rpt for kicad-cli sch erc default output
bisect KiCad 8 语法时跑 kicad-cli sch erc 没传 --output 参数会把
报告写到当前目录的 <input>.rpt,跑十几次主目录就堆了 20 个 .rpt
垃圾。加进 ignore 防回流。

同时清掉本次留下的:
  20 个 *.rpt 报告(已 rm)
  data/state/std_probe[1-5]/ 5 个旧 probe 状态目录(~8.5 MB stale,
  这些目录里的 probe scripts 在前一会话已删;状态本身也没用了)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 23:11:20 +08:00
fb577cc89f tools/epro2/kicad: fix two KiCad 8 parse blockers (newline + pin_numbers)
装 kicad 8.0.9 (apt PPA) 后跑 kicad-cli sch erc 校验我们 emit 的
.kicad_sch 文件,发现 9/9 sheets 一开始全部报 "Failed to load schematic
file" — 父节点解析就挂掉。Bisect 找到两个语法 bug:

1. **(pin_numbers (hide no)) 不被 KiCad 8 接受**
   KiCad 8 lib_symbols 里 `pin_numbers` 是 token-form,不接受 (hide
   yes/no) 子块。要么省略整个 block 默认 visible,要么 `(pin_numbers
   hide)` 表示隐藏。原来的 `(hide no)` 风格是 KiCad 7 旧语法。

   Fix: tools/epro2/kicad/sym_writer.py 删掉 (pin_numbers (hide no))
        行;KiCad 默认 visible 行为正是我们想要的。

2. **String 里的字面 \n / \r / \t 让 KiCad 解析器中止**
   ESP-VoCat 的 Overview sheet 有 TEXT "Battary\n3.7V 700mAH"(多行
   电池标签),EPRO2 里以**字面 0x0a 字符**存储。我们把它原样 emit
   成 "..." 包住的字符串 → KiCad reader 在 quoted string 内遇到 \n
   就报 parse error 不给 message。

   Fix: tools/epro2/kicad/sexpr.py 在 str escape 路径加 \n / \r / \t
        转义;reader 加 \r 解码(roundtrip 用)。

修完后:

  9/9 sheets parse OK in KiCad 8.0.9
  ERC 跑通,9 个 sheet 共 2793 violations,分布:
     1372 endpoint_off_grid        (49%, cosmetic — 30-mil EPRO2 grid 不
                                    snap KiCad 默认 50-mil grid)
      571 lib_symbol_issues        (20%, cosmetic — facere 库未注册到
                                    user library table;库已 embed 在
                                    .kicad_sch 内联可用)
      444 wire_dangling            (16%, real — wire 端点没精确对齐 pin)
      406 pin_not_connected        (15%, 同上的另一面)

  Cosmetic 占 70%,real connectivity 30%,下个 phase 处理:
    - grid 校准(把 coord 精确 round 到统一 grid 上)
    - pin tip 端点匹配(KiCad 需要 wire 端点 == pin (at) 字段对应的
      绝对坐标,浮点必须精确相等)
    - 生成 sym-lib-table 注册 facere 库(消 lib_symbol_issues)

测试:
  + test_string_escapes_newlines_and_tabs
  + test_lib_symbol_omits_pin_numbers_block
  reader 加 \r 解码

41/41 通过(39 旧 + 2 新)。

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 23:04:58 +08:00
8a91ce43f4 tools/epro2/kicad: Phase-2 lib_symbols — render symbol bodies from SYMBOL docs
Phase 1 emit的 .kicad_sch 里组件位置 + 属性都对,但 lib_symbols 是空
stub —— KiCad 渲染时每个组件显示成红色 "?"。Phase 2 把 SYMBOL 文档
里的 PART + RECT/POLY/CIRCLE/TEXT/PIN primitives 翻成 KiCad lib symbol
块,填到 lib_symbols 里,让 KiCad 显示真正的原理图符号。

新增 tools/epro2/kicad/sym_writer.py:
  write_lib_symbol(symbol_doc) → S-expr list 形如:
    (symbol "facere:<partId>"
      (pin_numbers (hide no))
      (pin_names (offset 1.016))
      (in_bom yes) (on_board yes)
      (property "Reference" "U" ...)
      (property "Value" "<title>" ...)
      (property "Footprint" "" hide)
      (property "Datasheet" "" hide)
      (symbol "<partId>_1_1"
        (rectangle ...)        ← from RECT.dotX1/Y1/dotX2/Y2
        (polyline (pts ...))   ← from POLY.points + closed → fill
        (circle ...)            ← from CIRCLE.center/radius
        (text "..." ...)        ← from TEXT.value/x/y/rotation
        (pin <type> line (at ...) (length ...) (name ...) (number ...))
                                ← from PIN + sibling ATTR ops
      ))

PIN 名字/编号/电气类型解析(这是关键数据探测点):
  EPRO2 PIN 不直接带 number/name/type 字段;这些信息存为独立 ATTR 操作
  (parentId=<pin_id>, key="Pin Name"/"Pin Number"/"Pin Type")
  Pin Type 取值映射:IN→input, OUT→output, BIDIR→bidirectional,
  POWER_IN→power_in, POWER_OUT→power_out, NC→no_connect, ...
  默认 passive(保守)

sch_writer 集成(lib_symbols 自动填):
  write_sch_page(doc, project_relations=pr) — 增 pr 可选参数
  内部 _build_lib_symbols(): 收集本 sheet 用到的 partIds → 通过
  ProjectRelations.parts_by_id 解析到 SYMBOL 文档 → write_lib_symbol →
  组装 (lib_symbols ...) 块;同 partId 多 SYMBOL 候选取第一个,去重
  WriteStats 增 lib_symbols_embedded / lib_symbols_missing

CLI 加 --no-lib-symbols 用于回到 Phase-1 行为(占位符调试用)。

ESP-VoCat 重导出验证:9/9 SCH_PAGE 全部 0 lib_miss
  P1_45092758.kicad_sch    wires=187 symbols=138 lib_emb=29
  codec_0b0163fa.kicad_sch wires=190 symbols=112 lib_emb=20
  Interface_b336a7c7.kicad_sch        symbols=95  lib_emb=13
  ...
  P1_408c9f4f.kicad_sch    wires=  6 symbols= 10 lib_emb= 3

测试:6 个新单测覆盖 outer wrapper / pin ATTR pull / 多形状 primitives /
sch_writer 集成路径 / 缺失 lib 计数 / no-pr 回退到 Phase 1。
合计 **39/39 通过**(parser 6 + relations 9 + project_relations 6 +
sexpr 6 + sch_writer 6 + sym_writer 6)。

下一步 Phase 3:footprint library + .kicad_pcb 导出。

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 22:45:19 +08:00
9213429a57 tools/epro2/kicad: Phase-1 EPRO2 → KiCad schematic exporter
写第一版 EPRO2 → .kicad_sch 转换:把 SCH_PAGE Document 的 wires +
COMPONENT placements + TEXT 输出到一个可被 KiCad 7+ 打开的 sch 文件。
不含 symbol 主体(lib_symbols 留空 stub),所以 KiCad 里组件会渲染
成红色 "?" 占位,但布线 + 位置 + Designator/Value 属性都正确。完整
symbol 库导出留 Phase 2。

模块结构:
  tools/epro2/kicad/sexpr.py        手写 S-expr emitter,Sym 标记裸符号,
                                    str 自动加引号 + 转义;float 去尾零;
                                    bool→yes/no;NaN/Inf 主动报错
  tools/epro2/kicad/_sexpr_reader.py  极简 S-expr parser,仅给 round-trip
                                    测试用(非完整 KiCad reader)
  tools/epro2/kicad/sch_writer.py   write_sch_page(doc) → str;处理:
                                      LINE  → (wire (pts ...) ...)
                                      COMPONENT → (symbol (lib_id facere:<partId>)
                                                  (at x y rot) (property Reference ...) ...)
                                      TEXT  → (text "..." (at ...))
                                    单位 mil → mm × 0.0254;零长 wire 跳过
  tools/epro2/kicad/__main__.py     CLI: --doc <uuid> | --all-sch

ESP-VoCat 验证(python -m tools.epro2.kicad <project> --all-sch):
  9 SCH_PAGE 全部转换成功
  P1_408c9f4f.kicad_sch    wires=  6  symbols= 10  text=  0  skipped= 2  (370 lines)
  P1_ee409917.kicad_sch    wires= 20  symbols= 14  text=  0  skipped= 3
  P1_54743d77.kicad_sch    wires= 42  symbols= 30  text=  3
  Overview_dc13d6d2.kicad_sch wires=  0  symbols=  1  text= 34   (说明页)
  MCU_510cff33.kicad_sch   wires= 91  symbols= 86  text=  9
  Interface_b336a7c7.kicad_sch wires= 99  symbols= 95  text=  6
  P1_5c38f45b.kicad_sch    wires=179  symbols= 86  text=  9
  P1_45092758.kicad_sch    wires=187  symbols=138  text= 10  (主图)
  codec_0b0163fa.kicad_sch wires=190  symbols=112  text= 10

输出落在 data/processed/kicad_sch/<filename>.kicad_sch(gitignore 内,
可重新生成;不入库)。

测试:6 个 sexpr 测 + 6 个 sch_writer 测,含 round-trip parse 验证。
parser/relations/project_relations 的旧 21 个不动,合计 **33/33 通过**。

下一步:
1. Phase 2 — symbol library 导出 (.kicad_sym),把 SYMBOL doc 的 PIN/RECT/
   TEXT primitives 转 KiCad symbol 主体;填 lib_symbols 块让组件渲染
   出真正的 schematic 符号
2. footprint library + .kicad_pcb 导出
3. 用 KiCad CLI (kicad-cli sch erc) 跑 ERC 校验

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 22:29:15 +08:00
3052e42991 tools/epro2: add ProjectRelations for cross-document resolution
per-doc Relations 在大量 cross-doc 引用前是不够的:PCB 的 PAD_NET 复合
id [PAD_NET, comp, pin, pad] 里的 pad 实际是 FOOTPRINT 文档里的 pad
实例;SCH_PAGE 的 COMPONENT.partId 指向某个 SYMBOL 文档的 PART.id。

ProjectRelations 在 per-doc Relations 之上做项目级聚合,把这些跨文档
引用拼起来。

Probe 阶段(ESP-VoCat)发现的映射规则(已写入 docstring):

1. SCH_PAGE COMPONENT.partId  ===  PART.id in some SYMBOL doc
   - 命名两种风格:'pid<hex>' (anonymous/系统 part) + '<name>.<n>' (具
     名 SKU),但都直接相等 PART.id,**不**是不同 namespace
   - 同一 PART.id 可能出现在多个 SYMBOL 文档里(库快照),
     parts_by_id 保留全部,consumer 通常取第一个

2. PCB COMPONENT.id  →  FOOTPRINT 文档 UUID  via 单独 ATTR op:
       ATTR(parentId=<comp>, key="Footprint", value=<fp_doc_uuid>)
   COMPONENT.attrs 子 dict 只有内务字段(Unique ID / Channel ID / ...),
   **不**含 footprint 引用。这跟 schematic 的 partId 在 COMPONENT 上的
   做法不一样,是 EPRO2 流的一处不对称

3. PCB PAD_NET[comp,pin,pad] 里的 pad 是 FOOTPRINT 文档内部的 pad id;
   解析链: comp → ATTR Footprint → FOOTPRINT relations.pads[pad]

API:
  ProjectRelations.build(project) — 单遍构建
  resolve_symbol_docs(sch_uuid, comp_id) → [SYMBOL doc uuids]
  resolve_footprint_doc(pcb_uuid, comp_id) → FOOTPRINT doc uuid | None
  pad_in_footprint(fp_uuid, pad_id) → PAD payload | None
  resolve_pcb_pad_net(pcb_uuid, comp, pin, pad) → {footprint, pad} | None
  attrs_for_pcb_component(pcb_uuid, comp_id) → {key: value} 折叠

CLI 加 --project-relations,跑 ESP-VoCat:
  documents                                 278
  distinct_parts                             87
  duplicated_parts                            9
  pcb_components_with_footprint             206
  pcb_components_unresolved_footprint         0
  sch_components_with_partid                572
  sch_components_unresolved_part              0

PCB 样本验证:comp=e0 → fp=1069352d81c6 Designator='U8',
PAD_NET pin=1 pad=e7 net=GND 跨文档解到坐标 (-37.4,-45.24)。

测试:6 个新单测覆盖 partId→symbol、comp→footprint、PAD_NET 跨文档、
attrs 折叠、unresolved 计数。parser + relations + project_relations
共 21/21 通过。

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 22:22:39 +08:00
7f9e2fad73 tools/epro2: add Relations layer for cross-object navigation
在 replay 的扁平 objects[id] -> payload 之上盖一层 Relations,建索引和
反向引用,把孤立对象拼成可遍历的图,是后续 EPRO2 → KiCad 转换器的
中间表示前置。

Relations.build(doc) 单遍扫所有对象,得到:

主集合(按类型分桶):
  parts / components / pins / pads / wires / nets / layers / rules

复合 ID 解析(关键):
  '["LAYER",1]'                          → layers[1]
  '["NET","GND"]'                        → nets["GND"]
  '["PAD_NET","e0","1","e7"]'            → pad_nets_by_pad/by_net
  '["RULE","SAFE","copperThickness1oz"]' → rules[("RULE","SAFE",...)]

反向引用:
  obj_ids_by_part         partId            → 引用对象 ids(lib 内 RECT/TEXT/PIN 都带 partId)
  components_by_part      partId            → component ids
  attrs_by_parent         parentId          → ATTR ids
  lines_by_wire           WIRE.id           → LINE ids(wire 由若干 LINE 段组成)
  pad_nets_by_pad         PAD.id            → PAD_NET 记录
  pad_nets_by_net         net name          → PAD_NET 记录
  objects_on_layer / objects_in_net  字段反查

便捷 accessor:
  attrs_dict(parent_id)   折叠所有 ATTR ops 到 {key: value} dict(last
                          write wins),KiCad 转换时按 component 拿
                          Designator/Value/Footprint 的常用入口

ATTR.parentId 解析(实测发现的两种坑):
1. 不仅指向 COMPONENT/PART —— 也大量指向 WIRE(schematic 上的网络
   标签 / 网络属性)。原查重函数漏算,636 个 false positive
   unresolved;改为"任意 doc.objects[parentId] 命中即算 resolved"
2. 复合形式 `<comp_id>-<pin_id>` 用于把 ATTR 挂在某 component 的某个
   pin 上(如 PinName)。`_resolve_parent()` 用 split("-",1) 兜底

CLI 加 --relations,按 docType 聚合 stats:
  uv run python -m tools.epro2 data/raw/oshwhub/<uuid> --relations

ESP-VoCat 验证:
  SCH_PAGE 9 docs : 572 components, 563 wires, 934 lines_grouped,
                    4111 attrs_attached, 0 unresolved_parents
  PCB      6 docs : 206 components, 807 pad_nets, 173 nets, 544 layers
  SYMBOL 105 docs : 106 parts, 560 pins, 1680 attrs_attached
  FOOTPRINT 55 docs: 496 pads, 9 nets, 1771 layers, 140 rules

注:PCB 内 pads=6 vs pad_nets=807 不矛盾 —— PAD 实例存在 FOOTPRINT
文档里,PCB stream 用 ["PAD_NET",comp,pin,pad] 复合 id 跨文档引用;
解析"comp 的某 pin 通过哪个 footprint 的哪个 pad"需要 project-级
Relations 聚合(下个 task)。

测试:tools/epro2/tests/test_relations.py 9 个单测覆盖复合 id 解析、
lineGroup 链接、parentId 直/复合解析、partId 反查、attrs 折叠。
parser + relations 共 15/15 通过。

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 22:17:28 +08:00
3c57e75d51 Add tools/epro2 — EPRO2 parser + replay prototype
为 Pro 3.x .epro2 工程源数据写解析骨架,下游做 EPRO2→KiCad 转换器
前的基础设施。在 ESP-VoCat (278 docs / 7.5 MB) + 220V 桌面电源
(771 docs / 26 MB) 端到端跑通,0 parse errors。

模块结构:
  tools/epro2/parser.py    单行 → Op:rstrip("|") + split("||") + json.loads
  tools/epro2/replay.py    state-machine:DOCHEAD 设头;其它 op 按 id 做
                           upsert(payload=None 当 delete);EDIT_HEAD/
                           META/CANVAS/PREFERENCE/PANELIZE 当 doc 级单
                           例存
  tools/epro2/__main__.py  CLI:传项目目录走 manifest.json 重放每个 doc,
                           按 docType 聚合输出 + 可选 --dump-doc 看单文
                           档详情
  tools/epro2/tests/       6 个单测 pin 死 trailing-pipe / 三段消息 /
                           id-only-no-payload / 嵌入管道符等坑

ESP-VoCat 输出示例:
  Documents: 278  (parse_errors=0)
   count  docType         objects        ops  deletes  untyped_ops
     105  SYMBOL             4124       4439        0            0
      88  DEVICE               88        264        0            0
      55  FOOTPRINT          4641       4855        0            0
       9  SCH_PAGE           7982       8167       42            0
       6  PCB                8428       8547       38            0
       6  BOARD                 9         18        0            0
       6  SCH                   9         26        0            0
       1  BLOB                  4          8        0            0
       1  FONT                 16         28        0            0
       1  CONFIG                2          3        0            0
  Top ops: ATTR 7035 / ELE_PLACEHOLDER 4225 / LINE 3005 / LAYER 2318 ...

PCB 文档单 dump 验证语义正确:META 含 title (PCB-EchoEar-CoreBoard-V1_0)
+ board 引用;CANVAS 含 origin/grid/unit (mm);LAYER 1/2/3 = TOP/BOTTOM/
TOP_SILK 配色齐全。

跑法:
  uv run python -m tools.epro2 data/raw/oshwhub/<project_uuid>
  uv run python -m tools.epro2 data/raw/oshwhub/<uuid> --dump-doc <doc_uuid>

下一步(不在本 commit):
1. 把对象间关系建起来(COMPONENT.partId → PART;LINE.lineGroup → WIRE;
   PAD_NET id → PAD + NET 三方关联)—— 当前 replay 只做扁平 dict
2. EPRO2 → KiCad 序列化层(Forge 投影硬门槛)
3. 在 Pro 3.x 三个项目做整体回归(X86 主板 7374 docs 可作压力测试)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 22:10:27 +08:00
c721e08c93 projects.md: replace Comments column with 版本 (Std / Pro 3.x / Pro 2.x)
Comments 那列对工程"品质"信号弱(评论量主要看话题热度);换成"版本"
列直接告诉读者每个项目源是哪种 EDA 格式 + 编辑器版本号。当前 15
个项目里 10 Std / 3 Pro 3.x / 2 Pro 2.x。

source_format 字段映射:
  easyeda-std        → Std
  easyeda-pro        → Pro 3.x
  easyeda-pro-legacy → Pro 2.x
  其它               → 透传

editor_version(如 6.5.43 / 3.2.91 / 2.1.40)作为子标签放第二行。

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 22:01:41 +08:00
c6279bff08 Add EasyEDA Pro 2.x legacy source ingestion (5/5 batch closure)
补齐前一批失败的 2 个 legacy Pro 项目(立创·泰山派 RK3566、立创·梁山派),
打通 Pro 2.x 旧版工程的源抓取链路。结合上一 commit 的 modern Pro 3.x
路径,本仓库 5/5 Pro 项目 EPRO2/dataStr 全部端到端打通。

Pro 2.x 与 Pro 3.x 是两个完全不同的存储模型:
- Pro 3.x:git-style branch + linear history chain,AES-128-GCM 加密的
  EPRO2 增量消息流,按 history 重放(已在前一 commit 打通)
- Pro 2.x:无 branch / 无 history。文档以 EasyEDA Std plaintext dataStr
  存储(同 ["DOCTYPE","SCH","1.1"] 格式),按 doc UUID 通过
  /api/v2/documents/lists 批量 GET,主体无加密,只组件库走 AES

Pro 2.x 抓取链由 HAR (tmp/prodownload3.har, 178 请求) 反推:

  GET  /api/v4/projects/<P>                     → boards: [{sch, pcb, name}]
  GET  /api/projects/<P>/ticket?uuid=&g_ticket=-1
                                                → 完整项目 manifest
  POST /api/schematic/lists {uuids:[<sch>]}     → sort: [{uuid:<sheet>}]
  POST /api/v2/documents/lists {uuids,docType:1} → schematic plaintext
  POST /api/v2/documents/lists {uuids,docType:3} → PCB plaintext
  POST /api/coppers/search {paths}              → 铺铜层
  POST /api/textpath/search {paths,project_uuid}→ 字体/文字
  POST /api/v2/resources/search {hash,project_uuid} → BLOB 图片

实现:
- crawlers/oshwhub/crawler.py:
  - fetch_pro_source() refactor 成 dispatcher,先 GET project meta
    检查 branch_uuid,null 即旧版走 _fetch_pro_legacy(),非空走
    _fetch_pro_modern()
  - _fetch_pro_legacy() 新增(按上面 9 步流程拉所有 doc + 辅助层)
  - _pro_post_json() POST helper(与 _pro_get_json 对称)
- schemas/project.schema.json: source_format enum 加 easyeda-pro-legacy
- docs/sources/easyeda_pro_source.md rev 4: §1.1 旧版 vs 新版判别表更新、
  §2.7 新增旧版抓取流程 + 实测数据

落盘约定(旧版):
  source/ticket.json                     完整 manifest
  source/<sheet_uuid>.json               每张原理图(含 dataStr)
  source/pcb_<pcb_uuid>.json             每块 PCB
  source/coppers.json/textpath.json/blobs.json  辅助 PCB 层资源
  source/manifest.json                   索引

实测:
  立创·梁山派      editor=2.1.30, 2 sheets+1 pcb,    1.0 MB,  78 sym/191 fp/128 dev
  立创·泰山派 RK3566 editor=2.1.40, 29 sheets+1 pcb, 0.8 MB, 299 sym/524 fp/295 dev

旧版项目体量比新版小两个数量级(梁山派 1 MB vs RK3576 66 MB)—— 没有
增量 history,组件库走单独端点,本身就是当前快照。

5/5 Pro 项目终极汇总:
  X86 主板          easyeda-pro        3.2.15  7374 docs / 481 MB
  泰山派 RK3566     easyeda-pro-legacy 2.1.40    30 docs / 0.8 MB
  梁山派            easyeda-pro-legacy 2.1.30     3 docs / 1.0 MB
  220V 桌面电源     easyeda-pro        3.2.69   771 docs /  26 MB
  ESP-VoCat         easyeda-pro        3.2.91   278 docs / 7.5 MB

共 8456 docs / ~516 MB plain。

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 21:59:25 +08:00
3282a028c4 Add EasyEDA Pro EPRO2 source ingestion (3/5 batch test)
打通 oshwhub origin=pro 现代 Pro 3.x 工程的 EPRO2 源抓取链路。3/5
modern Pro 项目完整解出(共 8423 docs / 542 MB plain):

- X86 主板        7374 docs / 481 MB plain (chain=85, editor=3.2.15)
- 220V 桌面电源     771 docs /  26 MB plain (chain=28, editor=3.2.69)
- ESP-VoCat       278 docs / 7.5 MB plain (chain=12, editor=3.2.91)

剩余 2/5 是 legacy Pro 2.x(立创泰山派 RK3566、梁山派),项目 meta
返回 branch_uuid=null + editorVersion="2.1.40",没有 git-style chain
模型,文档直接挂在 boards[].sch/pcb 字段上,访问端点暂未挖通;元
数据落库 metadata.json,source/ 留空。

实现要点:
- fetch_pro_source(): 4 步流程(project → branch HEAD → structures
  → /branches/<B>/histories/<HEAD> 即返完整 chain,无需 ?limit 批量
  端点)+ 逐 history 走 AES-128-GCM 解密(16 字节 IV,pycryptodome
  原生支持)+ gunzip + 按 DOCHEAD 切 per-doc EPRO2 流
- EPRO2 解析坑:行末单 `|` 是行终止符不是字段分隔符,必须先
  rstrip("|") 再 split("||"),否则 payload JSON 解析失败 silently
  swallow 导致 cur_doc 不设 → 第一轮 X86 板 7374 docs 抽出来只剩 2 个
- docType 实测远不止 BOARD/PCB/SCH/SCH_PAGE,还含 SYMBOL /
  FOOTPRINT / DEVICE / BLOB / FONT / CONFIG —— Pro 把组件库快照也
  随项目存到 history,下游做 EPRO2→KiCad 转换时必须先把这些 lib
  doc 加载进 symbol cache
- Pro 2.x vs 3.x 是不同存储模型 —— 3.x 走 branch 模型(已打通),
  2.x 走 boards[] 直链(未打通);判别条件:project meta 的
  branch_uuid 是否为 null

CLI 新增 --with-pro-source / --backfill-pro-source / --pro-cookie /
--origin(按 origin 字段服务端过滤 listing API),crawl_one() 按
origin=pro 自动 dispatch 到 Pro fetcher。

schema:docType 类型从 integer 放宽到 [integer, string, null]
(兼容 Std 的 1/3 + Pro 的 BOARD/SCH 等),新增 message_count 字段。

License 注意:本批 5 个项目全是 NC-SA / GPL,未达 Pro source doc
§4.2 Forge 白名单(MIT/BSD/Apache/CC0/CC-BY/CERN-OHL-P/Unlicense)。
按 CLAUDE.md "研究用、不再分发" 原则 raw 入库无碍;Forge 投影时
另过白名单。

详细技术细节见 docs/sources/easyeda_pro_source.md rev 3 + log.md。

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 21:45:52 +08:00
d874278bc5 Add EasyEDA Std project source ingestion (10 boards backfilled)
打通 oshwhub origin=std 项目的工程源(schematic + PCB dataStr)抓取链路。原
plan.md §1.6 假设需要登录,实测 lceda.cn/api/documents/<doc>?uuid=<doc>&path=<doc>
对公开项目匿名可访问 —— 无需 cookie,无账号封禁风险。

调研:4 轮探测留痕在 data/state/std_probe[1-5]/(gitignored);翻 Std 编辑器
v6.5.51 的 main.min.js bundle 找到 ajaxDetail 端点;按 docType 区分两种
响应 shape(schematic 项目视图 vs PCB 文档视图)。

Crawler:
  - make_source_client() 用浏览器 UA + lceda.cn/editor Referer,因为
    oshwhub /api/project/<uuid> 端点拒绝 FacereDataset/0.1 UA(CLAUDE.md
    UA 例外条款:目标站主动封自定义 UA + 公开静态资源)
  - fetch_std_source(): 项目元 → version_documents → 逐文档 dataStr → 落
    source/<doc>.json + source/manifest.json
  - --with-source(爬新项目时一并抓源)/ --backfill-source(仅扫已有)
  - QPS ≤ 0.2 (SLEEP_SOURCE = 5s) 自律

Schema: 加 source_format / source_path / source_documents / editor_version
(前 3 进 enum 锁定,便于后续 Pro / KiCad 源对齐)。

回填结果:10/10 成功,45 个文档,33.2 MB;schema validate 全通。
docTypes 主要是 1 (schematic) 与 3 (pcb);USB 电压电流表只有 PCB 文档(4 个:
主板+盖板+底板+面板,作者未上传原理图源)。

完整调研:docs/sources/easyeda_std_source.md。
2026-04-28 20:07:40 +08:00
b0d3afd2a9 update readme 2026-04-26 11:54:01 +08:00
Zhang Jiahao
a3942c03df Update EasyEDA Pro source research 2026-04-24 00:40:18 +08:00
Zhang Jiahao
a16cb11c7d Add easyeda_pro_source.md: Pro 工程源完整链 + EPRO2 格式解析
Why:
- pro.lceda.cn (立创 EDA 专业版) 的工程源抓取链已经打通:4 步 API +
  AES-128-GCM 解密 + gzip 解压 + EPRO2 消息流解析,所有信息需要落成
  文档独立保留,避免丢失;也为后续实现 EPRO2 → KiCad 转换器/选型铺路。
- 与 oshwhub.md(Std 版)并列成为独立调研文档 —— Pro 和 Std 是两套
  独立编辑器,cookie/API/格式都不同,混在一起反而乱。

What:
- docs/sources/easyeda_pro_source.md:
  * TL;DR 表 + §1 Std vs Pro 对照
  * §2 4 步 API 链 + 必需 headers (Editor-Version/path/Referer/Cookie)
    + Python 解密代码 + 实测数据(2.7 MB 源流 / 8357 条消息)
  * §3 EPRO2 格式完整分类:40 种 message type 按功能分组
    (零件/几何/PCB/层/规则/...) + 每类样例
  * §4 安全合规(风控 / license / 密钥泄漏语义)
  * §5 接入 Forge (OSHWHUB_INGEST_SPEC.md) 的 gap 表
  * §6 已知未验证 7 条
  * 附录 A 一键重跑命令

- pyproject.toml: + pycryptodome>=3.23.0(AES-GCM 解密依赖)
- log.md: 本次会话记录

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 00:11:32 +08:00
f797205dc8 add epec md 2026-04-23 23:42:21 +08:00
Zhang Jiahao
1a67df44ba Add docs/infra.md: dev1 (广州) 部署记录
Why:
- Phase 0.5 要求的基础设施文档落地。机器已部署好,记录机器规格、
  SSH 约定、仓库位置、依赖装法、凭据管理规则、环境变量、长跑与日志
  策略、磁盘应急阈值,方便后续自己或新协作者快速接手。
- 严格不包含任何凭据值(token / cookie / 密码),只写事件与结构。

Deployment notes:
- GIT_LFS_SKIP_SMUDGE=1 克隆:省带宽和磁盘(本地仓库 32MB 而非 535MB+),
  历史 LFS 对象按需 pull
- uv 走清华镜像 + only-system python 3.10:sync 秒完成,避免下载独立 Python
- ~/.secrets/ mode 700 就绪

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 23:36:21 +08:00
Zhang Jiahao
fb7c0488bc Relax Python requirement 3.11 → 3.10
Why:
- Ubuntu 22.04(广州云服务器 dev1 的发行版)自带 python3.10。之前要求 3.11
  会让 uv 去下载独立 Python 解释器,从国内拉 Astral 的 python-build
  release 很慢。放到 3.10 直接用系统 Python,sync 秒完成。
- 代码没有用任何 3.11+ 特性(用了 `from __future__ import annotations`
  支持 PEP 604/585 在 3.10 上下文),降版本零代价。

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 23:34:54 +08:00
Zhang Jiahao
b0ddcf3f14 Allow login content; plan cloud infra, storage tiers, EDA→KiCad conversion
Why:
- 策略调整:登录后才能访问的内容从"禁止"改为"纳入本项目范围",
  同时明确凭据管理红线(合法账号、不入 git、云服务器隔离)。
  解锁 u.lceda.cn 工程源 JSON,这是训练数据质量的关键升级。
- 计划中"存储"和"运行环境"一直模糊,现在按 Charles 提供的广州云服务器
  + 存储分级演进(Gitea LFS → 对象存储)给出清晰路径。
- 打通 oshwhub (EasyEDA) 与 bshada/open-schematics (KiCad) 两个生态,
  需要一个 EDA→KiCad 批转换脚本。先把它纳入 plan,等拿到工程源再实现。

What:
- CLAUDE.md: 登录态条款从"不抓"改为"合法账号可抓",凭据管理写死在
  ~/.secrets/,事件记 docs/secrets.md;合规红线同步更新
- plan.md §0.5: 新增 基础设施段(机器初始化 / 调度 / 登录态获取)
- plan.md §1.4: 存储分级演进(< 50 GB 云盘,50-200 GB 评估,> 200 GB 迁对象存储)
- plan.md §1.6: 登录态抓 u.lceda.cn 工程源
- plan.md §1.7: scripts/convert_to_kicad.py 批处理,候选 easyeda2kicad.py
- plan.md 风险表: 加账号封禁 / 转换失败 / 云服务器单点故障三条
- docs/sources/oshwhub.md: u.lceda.cn 从"未开放"移到"需登录,已纳入范围"
- README.md 数据源表: 加"登录态"列 + 运行环境说明
- log.md: 本次策略变更记录

未改:未新增 docs/infra.md(等机器到位 + 真实细节后再写),scripts/convert_to_kicad.py
尚未实现(等拿到工程源样本再实现)。

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 20:57:30 +08:00
Zhang Jiahao
ba501c328c Remove personal name from suggestion/decision phrasing
Why:
- "给 Charles 的建议"、"待 Charles 拍板"、"需要 Charles 决策" 这些写法
  把具体人绑到了文档里,换维护者就失准。改成中性的 "建议 / 待决策 /
  待拍板",文档对未来协作者和 agent 都更通用。

What:
- log.md: 四处去掉 "给 Charles / 还是需要 Charles 决策 / 等 Charles 拍板"
- plan.md: 三处去掉 "待 Charles / Charles 定目标 / 需要 Charles 定"
- docs/sources/hf_bshada_open_schematics.md: "待 Charles 决策" → "待决策"
- scripts/estimate_size.py: docstring 去掉 "给 Charles 一个估计"
- CLAUDE.md: 数据删除确认规则从 "先跟 Charles 确认" 改成 "先跟用户确认"

保留的 Charles 提及都是事实性的:
- README/plan 里的 "维护者:Charles"(身份字段)
- log.md 历史条目里 "Charles 要求..." / "Charles 点名..."(历史事件记录)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 20:01:52 +08:00
Zhang Jiahao
ed4837dedf Rewrite oshwhub.md as canonical data source investigation
Why:
- Charles 要求把 12493 总数验证 + 90 项目采样结果合进主调研文档,消除
  oshwhub_corpus_estimate.md 与 oshwhub.md 的重复与分散。
- 一份高质量的数据源调查应该独立完备:任何人(人或 agent)读完就能
  复现爬取 / 估算 / 合规判断,不用跨文件拼凑。

What:
- docs/sources/oshwhub.md 重写为 9 节 + 附录:
  - TL;DR 表(一页纸核心事实)
  - 站点架构 / robots / API 入口 / 项目详情 SSR / 附件 CDN
  - 排除项:fs-web-stream.jlc.com 推广图标 / u.lceda.cn 登录源
  - §4 项目总数验证(新):三路 sort 一致 12493 + 分页二分边界 ≈250 页 + grade 覆盖抽样
  - §5 抽样语料特征(从 corpus_estimate 并入):体积 median 9MB/p90 54MB、
    视频占 54%、license 分布 GPL 3.0 49%/Public Domain 21%
  - 风险表 7 条、附录重跑命令
- 删除 docs/sources/oshwhub_corpus_estimate.md(内容已并入 §5)
- log.md: 本次记录

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 19:59:05 +08:00
Zhang Jiahao
53b7648984 Add HF bshada/open-schematics to Phase 1 plan
Why:
- Charles 点名把该 HF 数据集纳入第一批。它是已预处理包(非待爬网站),
  和 oshwhub 的抓取逻辑不一样,先把决策面在 plan 里讲清楚,再动手拉。
- 与 oshwhub (EasyEDA 生态) 互补,补 KiCad 原生路径。

What:
- docs/sources/hf_bshada_open_schematics.md: 调研文档
  - 78 parquet shards, 6.4 GB 总量
  - CC-BY-4.0 商用友好
  - 字段:.kicad_sch 源 / PNG / 组件列表 / JSON / YAML / name / desc
  - 镜像方案(整包存 data/external/..., 不拆 per-project)
  - .gitattributes 建议(data/external/**/*.{parquet,png} → LFS)
- plan.md §1.5: 阶段说明 + 待 Charles 批 6.4 GB 预算
- README.md 数据源表: 加一行
- log.md: 本次记录

下载未触发,等 Charles 拍板。

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 19:51:24 +08:00
Zhang Jiahao
ce22717288 Add projects.md index (stars-sorted) + build_index.py generator
Why:
- Charles 要一个索引页看入库项目 + 他们的 stars。手工维护会漂移,
  所以 scripts/build_index.py 直接读 metadata.json 重新生成,保证
  projects.md 永远是 data/raw/ 的镜像。

What:
- projects.md: 10 个项目按 Stars 倒序(最高 3293 的加热台量产计划
  → 最低 236 的柚子爱 AI 相机),含 stars/likes/forks/views/comments/
  files/size,+ License 与数据源分布
- scripts/build_index.py: 扫 metadata.json 渲染 markdown,支持未来
  多数据源(source 字段区分),下次新增 oshwhub / github / hackaday
  项目后重跑即可
- README.md: 加 projects.md 链接

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 19:48:21 +08:00
Zhang Jiahao
e222b08f27 Add corpus size/license estimator; snapshot 90-project findings
Why:
- 放量决策需要比"52MB/项目 × 12493 = 650GB"更扎实的数据。用
  scripts/estimate_size.py 采样 90 个 hot 项目的 attachments[].size
  得到真实分布(median 9MB / p90 54MB),全量 median 估算 110GB,
  p90 上界 660GB。这给 Charles 一个可信的存储预算。
- 附带 license 和 ext 分布采出两个重要洞察:
  (1) mp4+qt 视频占 54% 存储,加 --skip-ext 开关可节省一半;
  (2) NC (Non-Commercial) 许可 ~11%,下游必须按 whitelist 过滤。

What:
- scripts/estimate_size.py: 无下载的元数据采样器,复用 crawler.parse_detail_html
- docs/sources/oshwhub_corpus_estimate.md: 结果快照 + 决策建议
- log.md: 本次会话记录

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 19:45:54 +08:00
Zhang Jiahao
c8d55a22eb Add schema+file validator; pin down fs-web-stream as ad icons
Why:
- schema 必须能自动校验,否则后续放量无法防腐。现在 scripts/validate.py
  对全部 metadata.json 做两层检查(schema + 本地文件 sha256),跑一次
  即可对全量数据签收;10/10 项目已通过。
- docs/sources/oshwhub.md 之前把 fs-web-stream.jlc.com 标为"工程源待查",
  排查后确认那些 URL 全部是嘉立创服务侧栏/推广图标,与项目无关。
  image.lceda.cn/attachments/ 是项目附件的唯一入口,现在调研文档闭合。

What:
- scripts/validate.py: jsonschema 校验 + optional --check-files 核 sha256
- pyproject.toml: 加 jsonschema>=4.26 依赖
- docs/sources/oshwhub.md: fs-web-stream 归类为推广资源(已排除),附 context 证据
- log.md: 本次会话记录

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 19:40:55 +08:00
Zhang Jiahao
5ffa10f256 Phase 1 MVP: crawl 10 high-quality oshwhub projects into LFS
Why:
- Charles 指定:先爬 10 个高质量项目存 Gitea LFS,一个项目一个文件夹,
  保留原文件和 URL。先以小批量验证 schema + LFS 流水线,放量前再拍板
  存储规模。

What:
- crawlers/oshwhub: 列表 API (`/api/project?sort=hot`) + SSR HTML 解析,
  一次性产出 metadata / description / cover / files / _urls
- schemas/project.schema.json: 跨源统一 schema
- docs/sources/oshwhub.md: API 入口 / 字段映射 / 陷阱调研
- pyproject.toml: httpx[http2] 单依赖
- .gitattributes: data/raw/**/files/** 一律走 LFS(规则写窄,避免误伤 schemas/*.json 等)
- .gitignore: 移除 data/raw/* 排除(改走 LFS 入库)

10 个项目覆盖:调试器 / 加热台 / 盖革计数器 / 数控电源 / 焊台 /
智能手表 / USB 测电流 / ZVS 感应加热 / AI 开发板 / 红外热成像。
共 52 附件 ≈ 524 MB 入 LFS,筛选判据 grade=4 & likes>=100 & 多样性。

Known gaps(见 plan.md § Phase 1.4):
- EasyEDA 源 JSON 需登录 (u.lceda.cn),v0.1 跳过
- fs-web-stream.jlc.com 的工程源下载未测
- scripts/validate.py 自动 schema 校验未实现

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 19:34:09 +08:00
Zhang Jiahao
bf2370f83b Initial skeleton for FacereDataset
Why:
- Facere 需要一个统一的开源硬件设计数据源,用于训练专有模型与
  构建检索型知识库。仓库先立骨架,把合规红线、数据 schema 要求、
  爬虫规约写在 CLAUDE.md 里,避免后续实现时各站点爬虫写法发散。
- plan.md 用阶段化路线图明确"先广度后深度、先合规后规模"的策略,
  让放量前必须经过 Charles 对齐一次,降低存储与法律风险。

Contents:
- README.md: 项目简介、数据源表、仓库结构、合规声明
- CLAUDE.md: 项目级 Claude 指令(工作流 / 爬虫规约 / 合规红线)
- plan.md: Phase 0-6 分阶段计划 + 风险与未决项
- log.md: 首条日志(调研 + 初始化记录)
- .gitignore: 排除 data/{raw,processed,state} 内容,保留目录占位
- 目录骨架: crawlers/ schemas/ scripts/ data/ docs/sources/

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 18:58:10 +08:00