FacereDataset

Author	SHA1	Message	Date
Knowit	6aa72faf84	docs: std corpus 2026-05 snapshot + batch-1000/4000/remaining log Snapshot of full oshwhub std corpus delivery: - 12,493 projects total, 12,166 (97.4%) with editor source - 4 sweep batches + 1 early-mixed = 5 zip artifacts in COS GZ + SG buckets - 30-day SG-region presigned URLs for downstream pickup log.md tracks the multi-batch sweep including driver bug postmortem (bash heredoc python3 missed httpx → 26-min run wasted on empty zips, recovered by switching to uv run). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 10:56:09 +08:00
Knowit	d5cc6507cb	docs: flight-controller (飞控) std topical index (79 projects) Topical index for std-origin flight-controller projects. Combines data/state/oshwhub_listing_full.jsonl listing fields with each project's metadata.json (license, source completeness, editor_version). Useful as a flat per-topic reference vs the global projects.md sorted purely by stars. 77 added this batch (commit `29530e0`) + 2 prior. 75 have editor source, 4 are attachments-only on upstream. scripts/build_feikong_index.py is reproducible: source of truth lives in data/state/ + data/raw/, no hand-editing.	2026-04-30 19:23:52 +08:00
Knowit	5aefd7c0a7	log: flight-controller (飞控-77) batch summary	2026-04-30 19:06:36 +08:00
Knowit	29530e09d2	飞控-77: 77 std flight-controller projects ingested Topic-targeted pull from local listing index (`name OR introduction` contains 飞控). 79 std hits in oshwhub_listing_full.jsonl, 2 already crawled, 77 newly fetched. dev1 (Guangzhou) walltime: Step 1 detail scrape ~12s, Step 4 std-source backfill ~80s (concurrency=5) Source completeness: 73/77 with editor source, 4 are upstream attachments-only (no editor session ever attached, source_documents=[] is genuine — no editor_version on the SSR page either). Crawler hardening (crawlers/oshwhub/crawler.py): - count.{like,star,fork,views} are now `.get(..., 0)` defensive. Listing API omits zero-valued fields for some low-activity entries (3/77 hit this on first pass, hard-failed with KeyError 'like'). Affects rank_score, pick_top, and metadata.json metrics block. License mix: 65% GPL 3.0, 11% Public Domain, 11% MIT, ~6% CC variants. Transport: dev1 → SG via tar+scp (33 MB, ~3 min over lossy cross-region link). Bypassed gitea push from dev1 because the same 6.5%-loss link tanks single-stream throughput. feikong-77-20260430	2026-04-30 19:05:57 +08:00
qcloud	c199840ad3	飞控-77: 74 std flight-controller projects ingested Topic-pull from local listing index (`name OR introduction` contains 飞控). 77 std hits in oshwhub_listing_full.jsonl, minus 2 already crawled = 75 attempted; 74 OK + 1 hard fail (`m1_mh743_ada_v4`, listing entry missing `count.like`). dev1 walltime: Step 1 ~12s, Step 4 ~80s (concurrency=5). License mix: 65% GPL 3.0, 11% Public Domain, 11% MIT, 4% CC variants. 3 partial dirs (Step 1 KeyError on missing `count.like`) dropped — to be re-fetched after a follow-up crawler patch makes count fields defensive against listing-index outliers. Source backfill 74/74 OK, total +46 MB.	2026-04-30 18:52:22 +08:00
Knowit	f9d370e950	crawler: thread-pool concurrency for backfill paths Both _run_backfill_source and _run_backfill_pro_source now honor --concurrency N (default 1 keeps current sequential behavior). Shared dispatch helper _run_backfill_concurrent + _discover_backfill_targets factored out — the two paths had drifted but were structurally the same. Thread safety: - httpx.Client is sync-thread-safe per docs; one client shared across threads is correct - Per-project file writes (metadata.json + source/*) don't conflict since each thread owns one project dir - Oversize state file is shared; serialized via a Lock around _record_oversize - Print is wrapped in a Lock for readable progress Expected speedup on dev1 (Guangzhou): batch-200 Pro 100 projects sequential ~14 min -> concurrency 5 ~3-4 min. Std similar 2-3x. Server-side limit isn't likely to bite at this scale (probe showed Pro QPS=2 sustained clean; concurrency 5 puts effective rate around 4-5 req/s). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 02:32:03 +08:00
Knowit	7cb35020f4	plan: batch-200 expansion (100 Pro + 100 Std) Doubles down on what worked in batch-50: - dev1 (Guangzhou) is primary execution host - Owner cap=2 for diversity - --max-source-mb 200 to defend against X86-class outliers - Pro 2.x deprecated-board fix is already in (commit `c3cac97`) - SSH transport for dev1 -> gitea (commit `8220c99`) Candidate pool: 200 picks from A-tier (grade>=3 & like>=10) minus already-crawled 65 Remaining A-tier corpus is 2,741 (Pro 1326 + Std 1415) 173 unique authors, like median 258, grade dist 4:118 / 3:82 Estimated walltime ~25-35 min on dev1 for Step 1-4 (no attachments). LFS increment ~2.5 GB (source only) or +10 GB if Step 5 attachments included. Either way well within Gitea's 200 GB migration threshold. Step 5 (attachment download) deferred — not on the critical path for EPRO2/Std → KiCad work, can revisit when license-filtered Forge projection demands it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 02:29:53 +08:00
Knowit	7f3729b89c	projects.md: rebuild index after batch-50 (65 projects) Refreshed via scripts/build_index.py. Reflects the full corpus state post batch-50 Step 1-4 + Super Dial deprecated-board fix: 65 projects · 253 attachments · 3.1 GB declared by origin: Pro 30 (5 modern + 25 legacy) + Std 35 by license: GPL 3.0 dominant, ~80% Forge-friendly Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 02:25:36 +08:00
qcloud	6d9c980628	batch-50: Super Dial 7f7565ef11 source ingested after deprecated-board fix Pro 2.x legacy. 6 docs (3 sch sheets + 3 PCBs), 0.2 MB plain. The deprecated 主控板V1 sch/pcb pair is correctly skipped (filter via ticket.schematics/pcbs keys, see crawler commit `c3cac97`). batch-50 success rate is now 50/50 (was 49/50). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 02:24:07 +08:00
Knowit	c3cac97593	crawler: filter Pro 2.x deprecated boards from sch/pcb fetch Pro 2.x project metadata's boards[] can reference sch/pcb UUIDs that the project owner has since deprecated/deleted (e.g. "主控板V1（废弃）", i.e. "main control board V1 (deprecated)"). Such UUIDs are gone from ticket.schematics / ticket.pcbs but still in boards[]. Asking schematic/lists or documents/lists for them returns 401 and aborts the whole project. Filter both lists against the authoritative ticket dict before posting. Verified on 7f7565ef11 (Super Dial 电机旋钮屏): 4 boards but only 3 sch entries in schematics dict, isolating the deprecated 8bc59f to a 401 we now skip. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 02:23:39 +08:00
Knowit	8220c99f7e	infra: switch dev1 -> gitea git remote to SSH (port 222) Avoids HTTPS-over-lossy-link TCP cwnd issues that pinned the previous push to ~360 KB/s for 10 min on the batch-50 Step 3-4 commit. SSH key generated on dev1 (~/.ssh/id_ed25519), public key posted to gitea via /api/v1/user/keys (title "dev1-guangzhou"), origin URL updated to ssh://git@git.deepknow.site:222/Facere/FacereDataset.git. Also documents the kernel + git side optimizations applied: sysctl net.ipv4.tcp_congestion_control=bbr (was cubic) git config --global http.postBuffer 524288000 (500 MB) Note: gitea git SSH port is 222, not 22 (22 is the host sshd). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 02:19:00 +08:00
qcloud	5286fc45b6	batch-50: Step 3-4 source download from Guangzhou — 49/50 ok Run on dev1 (Guangzhou) for the latency advantage. Walltime 3:41 vs Singapore-estimated 1-2h (~30x speedup, mostly from image.lceda.cn RTT going from 263ms to 2.6ms). Pro 25: 24 ok + 1 fail (Super Dial 7f7565ef11 — Pro 2.x legacy schematic/lists 401, separate cookie-perm issue) 611 docs, 31 MB total Std 25: 25 ok, 97 docs, 74 MB total Combined: 49/50 success, 708 docs, 105 MB new disk usage --max-source-mb 200 cap was not tripped; the 25 Pro candidates are all under 10 MB, so the 481 MB X86-board outlier from the original sample was not representative. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 02:12:00 +08:00
Knowit	d11ca1d3be	tools/epro2/std: fetch + decrypt Pro 2.x encrypted-external blobs Pro 2.x stores some doc payloads (notably Taishan's PCB) externally at modules.lceda.cn keyed by dataStrId, AES-256-GCM encrypted with the iv/key fields stored alongside. Same crypto pattern as Pro 3.x EPRO2: last 16 bytes are the GCM auth tag, rest is gzip(plaintext-op-stream). The CDN doesn't require auth. - pro2_writer.fetch_encrypted_plaintext(): fetch + decrypt + gunzip, cache result at source/<uuid>.decrypted.txt so re-runs skip the network round-trip. Heavy imports (httpx, pycryptodome) are deferred to call-time so the pure-replay path doesn't pay for them. - pro2_writer.split_plaintext_by_doctype(): walk the multi-doc plaintext (Pro 2.x bundles N FOOTPRINTs + 1 PCB into one blob), yield (label, sub_text) per inner doc. Label = HEAD.uuid if present, else fallback `<kind>_<idx>`. - __main__._convert_pro2_encrypted(): for each sub-doc, write a synthetic inline-Pro-2.x JSON next to the original and re-route through write_pro2_doc — re-uses BBox / layers / objects-extraction instead of duplicating the logic. Output filename `<parent_uuid>__<sub_label>.json` makes the parent association visible. Smoke (Taishan): 28 inline SCHs → 55 total. Decrypts: - one PCB blob (3.4 MB plaintext, 20267-object PCB + 25 FOOTPRINT sub-docs of 130-580 objects each) - one SCH-typed encrypted doc (1 sub-SCH of 891 objects) 86 unit tests still pass; new fetch/decrypt path is covered manually via the smoke test rather than mocking httpx + AES. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 02:07:40 +08:00
Knowit	3720cd176a	tools/epro2/std: add Pro 2.x JSON path — Liangshan + Taishan SCH now exportable The downstream colleague's "encrypted_external" / "string old format" projects were Pro 2.x, not Pro 3.x EPRO2. Pro 2.x ships each doc as a JSON file whose `dataStr` is a plaintext op-stream — one JSON array per line, e.g. `["COMPONENT","e1","",0,0,0,0,{},0]`. Different wire format from EPRO2's binary tilde/pipe streams; same Std envelope works for output. - tools/epro2/std/pro2_writer.py: parses dataStr line-by-line, keys objects by id (position 1 for most ops, OPTYPE for singletons), extracts BBox by walking known coord positions per OPTYPE, derives layers from LAYER ops directly (Pro 2.x almost matches Std layer string format already). PCB blobs that are encrypted-external (`dataStrId` URL + `iv` + `key`, no inline dataStr — Taishan PCB) return None so the CLI skips with a message instead of stubbing. - tools/epro2/std/__main__.py: auto-detect via manifest's editor_version. "2.x" → Pro 2.x writer; otherwise the existing EPRO2 replay path. CLI surface and output layout unchanged. - docs/sources/epro2_to_std_mapping.md: adds a Pro 2.x section. Adapter dispatches on `head.epro_format`: absent / "epro2" gets dict-shaped objects values, "pro2" gets array-shaped values (`[OPTYPE, arg1, ...]`). Lists the Pro 2.x-specific OPTYPEs (FONTSTYLE / LINESTYLE / CONNECT / OBJ / REGION / DIMENSION / STRING / TEARDROP) the EPRO2 vocabulary doesn't have. Smoke (re-running --all on all 5 Pro projects): 191 → 222 JSON files. Liangshan adds 3 (2 SCH + inline 5357-object PCB). Taishan adds 28 (SCH only — PCB skipped, encrypted-external; source/<uuid>.json still keeps the dataStrId/iv/key for a later fetch+decrypt pass). 84 → 86 unit tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 02:00:37 +08:00
Knowit	3866e24189	tools/epro2/std: rewrite to Option 2 (objects dump) per downstream spec Downstream came back with concrete requirements: don't pre-compute Std shape[] tilde strings, just dump the raw EPRO2 `objects: {id: payload}` dict and they'll write a ~100-LoC adapter on their side. Pulling the tilde-mapping work back saves us from second-guessing positional fields without their parser to verify against, and shortens our pcb_writer from ~500 lines to ~40. Output shape (Std envelope intact, just no `shape[]`): { "success": true, "code": 0, "result": { "uuid", "puuid", "title", "docType": 3 \| 1, "components": {}, "dataStr": { "head": { "docType": "3" \| "1", "editorVersion": "facere-epro2/0.1 (epro2 <X.Y.Z>)", "units": "mil", "epro2_doc_uuid": ..., "epro2_editor_version": ..., }, "BBox": {x, y, width, height}, # mil "layers": [...], # Std layer-string array "objects": dict(doc.objects), # raw EPRO2, 1:1 "preference": {}, "netColors": [], "DRCRULE": {}, } } } Per-doc spec downstream gave us: - shape[] dropped (empty placeholder misleads adapter) - all units mil (no mm conversion — Std canvas already declares mil) - head.units="mil" so adapter doesn't have to guess - BBox min/max across known x/y/startX/endX/centerX fields; adapter can refine by walking path arrays itself - layers[] keeps Std's 17-line default + inner SIGNAL layers actually used (21~Inner1.., 22~Inner2..) - empty stubs preference/netColors/DRCRULE for grep-based triage New: docs/sources/epro2_to_std_mapping.md with the full EPRO2 OPTYPE → Std verb table that downstream's adapter authors will copy from. Tables include the layer-id remapping (the 5↔7 paste/mask flip, 11→10 outline, 12→11 multi, SIGNAL 15+→21+), PCB op mappings, SCH op mappings (marked best-effort: no Std SCH samples in our corpus), and the 5-Voltage placeholder COMPONENT → extra net flag trick. Extracted from the previous Option-3 writer (commit `fe6971f`) so adapter writers don't have to reverse-engineer it from source. 
ESP-VoCat smoke: 6 PCB + 9 SCH = 15 JSON files, head.units=mil preserved, no shape[] field present. 82 → 84 unit tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> std-deliverable-20260429	2026-04-29 01:41:12 +08:00
Knowit	c6fd111d6d	crawler: --no-cover, --concurrency, drop cross-host sleep + batch-50 Step 1 done Three crawler ergonomics for batch operations: --no-cover Skip cover image download. For scan-only modes (license/meta scrape) this drops ~1.3s/project and avoids slow-CDN hangs. --concurrency N ThreadPoolExecutor wrapping the per-project loop. Default 1 = serial (current behavior). Anonymous endpoints tolerate 5+ comfortably; output uses a print lock for readable interleaved progress. fetch_cover plumbs through crawl_one. Drop cross-host sleep #1: in crawl_one between detail HTML (oshwhub.com) and cover image (image.lceda.cn). Different hosts — sleep was unnecessary. Saves ~1s/project. Sleep #2 (post-cover, before next iteration) stays — it gates the next oshwhub.com hit. download_to gains max_seconds wall budget (default 60s, cover uses 15s). Defends against pathologically slow CDN connections — observed 10 KB/s on image.lceda.cn for one project, would have hung 6+ min on a 3.6 MB cover otherwise. httpx default timeout resets per chunk, so streaming downloads need an external wall-clock guard. batch-50 Step 1 (license/meta scrape) shipped: 50/50 candidates have metadata.json + license recorded License distribution: GPL 3.0 32, Public Domain 6, NC variants 8, CERN-OHL 1, MIT 1, CC BY 3.0 1 Forge-friendly (non-NC): 41/50 (82%) Declared attachments: 180 files / 2.36 GB (median 18 MB/proj, max 304 MB) Walltime: 3min 26s for 28 projects at concurrency=5 (server-side HTML render bound, not sleep-bound) One orphan partial cover (a670e60a...) cleaned up — leftover from the first aborted run before the timeout fix landed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 01:35:11 +08:00
Knowit	fe6971f3f9	tools/epro2: add std/ writer — EPRO2 → EasyEDA Std-format JSON for downstream The downstream colleague consumes oshwhub Std (lceda) dict-format JSON, not KiCad. The EPRO2 decryption part (per-doc plaintext .epro2 streams in data/raw/<uuid>/source/) is what we already provide; the missing piece is converting EPRO2 op-streams into the same `dataStr.shape` tilde-delimited format their parser already speaks. New tools/epro2/std/ module, peer of tools/epro2/kicad/, kept deliberately separate so the KiCad path stays untouched: - pcb_writer.write_pcb_std() — high-fidelity, validated against a Std PCB sample at data/raw/oshwhub/3e2f893d.../25931ddab8.json. Maps LINE→TRACK, VIA→VIA, POUR→COPPERAREA (with SVG `M..L..Z` path), POLY→CIRCLE/SOLIDREGION, COMPONENT+FOOTPRINT→LIB nested with #@$-separated PADs (placement rotation + translate applied so pad coords land at PCB-absolute positions). Layer-id mapping (EPRO2 5↔7 flipped vs Std solder/paste, 11→10 outline, 12→11 multi, SIGNAL inner 15+ → Std 21+) noted inline. - sch_writer.write_sch_std() — best-effort. Our corpus has zero Std schematic samples (docType=1) so verb field orders follow the EasyEDA Std public spec, not direct observation. Emits W (wire), N (net flag, including the 5-Voltage Global Net Name power-port pattern), T (text), LIB (placement with #@$-nested PIN/T). If downstream's parser bails the fix is almost certainly a positional field tweak, not a re-architecture. - __main__.py — flat output `<doc_uuid>.json` per doc directly under --out (mirrors Std's own data layout); --all-pcb / --all-sch / --all. Smoke test on ESP-VoCat: 6 PCB + 9 SCH = 15 JSON files, libs_unresolved=0 across the board. Compact JSON (separators=(",",":")) matches Std's single-line format. Numbers use _num() — integers without trailing .0, floats trimmed. 71 → 82 unit tests pass. 
Open questions for downstream: (1) confirm SCH verb field orders, (2) do they want any of the upstream metadata fields we drop (master, owner, created_at, etc — those live on the crawler side, not the schematic itself)? Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 01:16:39 +08:00
Knowit	ed713fa557	docs: consolidate rate-limit probe results into a proper benchmark report The doc had been growing incrementally as each host got probed; reshape it as a polished benchmark with TL;DR top, methodology section (including safety constraints + caveats), per-host detailed tables, final crawler settings, batch-50 walltime breakdown, and a reproduce recipe. Five hosts fully covered: pro.lceda.cn API 5.0s -> 0.5s (10×) lceda.cn doc 5.0s -> 0.5s (10×) oshwhub detail 2.0s -> 1.0s ( 2×) oshwhub listing 2.0s -> 1.0s ( 2×) modules.lceda CDN 0.2s (already optimized) Net effect on batch-50 plan: sleep total ~32min -> ~3min, walltime ~2h -> ~10-15min. Key finding: the original 5s/req on Pro was set out of "logged-in account is precious" caution with zero empirical evidence. Sustained burst probe (25 distinct UUIDs at 0.5s, no recovery) showed 0/25 errors and median latency 410ms — the caution was unjustified. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 00:57:35 +08:00
Knowit	c474f8ad83	log: X86 motherboard hit OOM swap death-spiral on its CPU board PCB write Killed at the 14-min mark — VmRSS 1.96 GB + VmSwap 1.41 GB on a 3.3 GB RAM box with 4 GB swap (3.6 GB used), read_bytes 24 GB (pure swap thrash), process state D (uninterruptible disk sleep). The CPU board PCB doc (8K+ objects, 35+ child schematic pages) overflowed our current all-in-memory build pattern: pcb_writer builds the full output list before to_sexpr serializes once at the end, plus the 35 write_sch_page calls each build their own Relations + lib_symbols dedup state. Saved what finished: 4/5 X86 boards complete (Sch-CAM-IMX415, Schematic1, SCHEMATIC1, Sch-VTX-SSC338Q), the CPU board SCHEMATIC1_1 has all its 35 child .kicad_sch but no .kicad_pcb. Final downstream delivery: 17 board projects across the 3 supported Pro projects, 32/32 files pass kicad-cli (sch erc + pcb export svg). Streaming-write fix is the next logical follow-up but out of scope for this turn. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 00:56:17 +08:00
Knowit	183f82a3be	crawler: drop SLEEP_SOURCE 5.0 -> 0.5 (Std doc endpoint probe) Ladder probe lceda.cn/api/documents/<uuid>: 5 tiers (5/2/1/0.5/0.25s) × 9 distinct Std doc UUIDs = 45 reqs total, all 200/success. Latency variance is dominated by payload size (Std docs span 4 KB to 4.5 MB) not server backpressure. Same posture as Pro API. Net effect on batch-50 estimate: Std 25 projects × 10 doc calls saved ~19 min wall time (21min sleep -> 2min sleep). Combined plan now projects ~2h -> ~10min walltime exclusive of download bytes. scripts/probe_rate_limit.py: --host std-doc tier added. Reads doc UUIDs from /tmp/std_doc_uuids.json (assembled by caller from any source/manifest.json upstream_version_documents lists). Reusable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 00:54:46 +08:00
Knowit	8b857428e3	log: post-mortem of running --all on the other 4 Pro projects Captures the two new --all crash paths fixed in `61fd3ff` (odd inner copper layers, duplicate BOARD titles) plus the Pro 2.x scope gap (Taishan + Liangshan are JSON-format, not EPRO2 streams, so our replay_project reads the bytes but doc_type stays None and _group_by_board returns no SCH/PCB groupings — needs a separate Pro 2.x writer). Status as of this commit: ESP-VoCat 6 boards + 220V power 7 boards = 13 project dirs ready for downstream corpus. X86 motherboard is the largest of the five (7374 docs, 1.9 GB RAM in flight) and still running. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 00:53:08 +08:00
Knowit	61fd3ff072	tools/epro2/kicad: fix two --all crashes found running the other 4 Pro projects Running the new --all on the remaining 4 Pro projects (X86 motherboard, 220V power supply, Taishan Pi, Liangshan Pi) surfaced two crash modes not covered by ESP-VoCat: 1. Odd inner-layer count → KiCad rejects the file at load with "3 is not a valid layer count". The 220V power boards have one used inner SIGNAL layer (3 copper total: F.Cu / In1.Cu / B.Cu), but KiCad requires an even copper count. Fixed pcb_writer to pad with one empty inner layer when the inner count is odd, so the total stays even (2, 4, 6, ...). 2. Two BOARDs sharing the same META.title — twin "显示板" boards in the 220V power project — landed in the same project directory and the second silently overwrote the first's .kicad_sch / .kicad_pcb / .kicad_pro. Fixed --all to detect title collisions and suffix every colliding basename with the BOARD uuid prefix (so both '显示板' boards become '显示板_52e8cc76' and '显示板_55d32906' rather than one quietly winning). 71 → 73 unit tests pass (test_odd_inner_signal_count_padded_to_even_total + test_duplicate_board_titles_get_distinct_basenames). Tangentially noted while running this: Taishan Pi and Liangshan Pi are Pro 2.x JSON, not EPRO2 streams — our replay layer reads the files but doesn't decode docType, so SCH/PCB grouping returns nothing. Pro 2.x needs a separate writer; out of scope for this commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 00:48:46 +08:00
Knowit	cb868988b9	crawler: drop sleep rates 10x for Pro API, 2x for oshwhub detail Calibrated against ladder probes on 2026-04-29. Findings in docs/sources/probe_rate_limit_results.md. SLEEP_PRO 5.0 -> 0.5 (pro.lceda.cn API) SLEEP_BETWEEN 2.0 -> 1.0 (oshwhub detail/listing) SLEEP_SOURCE 5.0 unchanged (lceda.cn Std endpoints — not yet probed) SLEEP_PRO_CDN 0.2 unchanged (modules.lceda.cn — already optimized) The original 5s rate for Pro API was set out of caution because Pro requires a logged-in cookie. Empirical sustained-burst probe (25 distinct UUIDs at 0.5s sleep, no recovery): 0/25 errors, median latency 410ms, p90 932ms. The "Pro is rate-sensitive" assumption was wrong — server tolerates QPS=2 cleanly. oshwhub detail HTML pages slowed from p90 6.4s at 1.0s sleep to p90 15s at 0.5s — server queue backs up. 1.0s is the headroom-safe water mark. Net effect on batch-50 estimate: ~1.5h -> ~30min. scripts/probe_rate_limit.py: rate-limit ladder probe tool. Reusable for new endpoints (Std source still owes a probe). Designed for safety: 30s tier recovery, low rep counts on auth hosts, bail on first non-200. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 00:45:34 +08:00
Knowit	3c00edf6db	tools/epro2/kicad: --all emits paired .kicad_pro + .kicad_sch + .kicad_pcb per BOARD KiCad pairs project files purely by basename + same directory: a folder holding `Foo.kicad_pro`, `Foo.kicad_sch`, `Foo.kicad_pcb` opens as one project on double-click of the .kicad_pro, with cross-tool navigation (open footprint from schematic etc) wired up automatically. - pro_writer.write_kicad_pro() renders the minimal KiCad 8 JSON we need: meta.filename pinning the basename, sheets=[[<root_uuid>, ""]] binding the schematic root, and stub blocks for board / schematic / net_settings / erc that KiCad expects to find on the first GUI load. - root_sch_writer.write_root_sheet() now accepts an optional root_uuid so the caller can pass the same uuid into the .kicad_pro and .kicad_sch (the binding fails silently with mismatched ids). - CLI gains `--all`: groups SCH/PCB docs by their META.board uuid (1:1 in EPRO2), strips SCH-/PCB- editor prefixes from titles to derive a shared project basename, and emits one directory per BOARD with paired files. BOARDs whose SCH is DELETE_DOC (LCD-BD on ESP-VoCat) still get a .kicad_pro with sheets:[] + .kicad_pcb so pcbnew opens cleanly. ESP-VoCat smoke: 6 boards → 6 project dirs, all pairs validated by kicad-cli sch erc / pcb export svg. The CoreBoard pro/sch/pcb trio shares root uuid 366d3e53...c2fccbe4330b end-to-end. 68 → 71 unit tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 00:39:58 +08:00
Knowit	adc5dc5e1b	tools/epro2/kicad: PCB Phase-2 — POUR → (zone), CoreBoard unconnected -43% Phase-1 left 75-358 unconnected_items per board (DRC), dominated by GND/AGND/POWER nets that EPRO2 routes through copper pour, not discrete traces. Phase-2 lands those: - pcb_writer._decode_zone_path handles the three POUR.path encodings seen in ESP-VoCat: rectangle (['R', x, y, w, h, ...]), circle (['CIRCLE', cx, cy, r]) approximated as a 36-segment polygon, and polyline (numeric pairs with 'L'/'ARC' verb tokens). - Each POUR on a copper layer turns into a (zone (polygon ...) ...) block plus a (filled_polygon ...) that mirrors the boundary. Why mirror, not auto-fill: kicad-cli pcb drc does NOT run the zone filler before checking — only the KiCad GUI does. Without a pre-computed (filled_polygon ...), DRC sees zones as empty regions and reports the entire net as unconnected. Mirroring the boundary as the fill is "connectivity-correct, clearance-imprecise" — KiCad users can still hit Edit > Fill Zones to refine thermals and pad clearances. We chose this over reading EPRO2's POURED.pourFill (the editor's own post-fill polygons) because POURED paths use ARC tokens we'd need to fully decode, and the user-drawn POUR boundary is already the authoritative "intended copper" region. ESP-VoCat DRC totals: 883 → 730 unconnected_items (-17% project-wide). CoreBoard, the 4-layer board with the most pour coverage, drops 358 → 205 (-43%). Other boards see no movement because their unconnected items are non-pour issues — pads outside the user-drawn POUR rectangle, or internal $1N nets via vias on the wrong net (separate problem, separate fix). 65 → 68 unit tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 00:27:33 +08:00
Knowit	eee1a9b97e	crawler: --skip-ext + --max-source-mb gates for batch-50 expansion Two CLI gates needed before scaling Pro batch beyond top-5: --skip-ext mp4,qt,mov (attachment filter) Skips video extensions in attachment download. Phase 1 measurements showed mp4+qt occupy ~54% of attachment storage. Entry still recorded in metadata.json with skipped:ext:<token> so we can re-fetch later if the policy changes. Honors both server-declared `ext` and filename suffix, case-insensitively. --max-source-mb N (Pro source size cap) Trips inside the chain replay loop on encrypted-blob total. On trip: raise ProjectOversizeError, wipe partial source/, append a row to data/state/oshwhub_pro_oversize.jsonl. Lets us shortlist 50+ Pro projects without one X86-board-class outlier (~500 MB) blowing the LFS budget. Std and Pro 2.x legacy are not capped (both <2 MB in sample). Verified: - cap=0 trips on first blob (1.2 MB), source/ wiped, state recorded - cap=100 runs full ESP-VoCat (7.5 MB plain, 278 docs) - skip-ext microtest: 8/8 cases (case-insensitive, declared/suffix fallback, empty-token edge cases) Plan + frozen candidate list for the next 50 projects: - docs/plans/oshwhub_batch50.md - data/state/oshwhub_batch50_candidates.jsonl (gitignore exception added) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 00:24:55 +08:00
Knowit	e61404478e	tools/epro2/kicad: Phase-1 .kicad_pcb exporter — 6/6 boards open in KiCad 8 Phase-1 scope: produce a .kicad_pcb that kicad-cli loads cleanly and that has the right geometry (nets, footprints, tracks, vias, board outline) — not a 1:1 EDA round-trip. Skipped on purpose for Phase 2: copper pours (POUR/POURED), manual FILL, teardrops, board-level strings/images, ARC circle-center recovery. What lands: - pcb_writer.write_pcb(): header/general, data-driven layer table (F.Cu = ord 0; B.Cu = ord 31; SIGNAL inner ids 15+ allocated to In1.Cu/In2.Cu/... in EPRO2-id sorted order so used inner layers stay contiguous), net-name → integer id map (id 0 reserved for the empty net per KiCad convention), LINE→segment / LINE→gr_line on Edge.Cuts, layer-11 POLY paths walked into Edge.Cuts gr_line chains (the actual board outline lives on POLY here, not LINE — without this stats showed edge=0), VIA→via. - footprint_writer.write_footprint_placement(): inline (footprint ...) blocks per PCB COMPONENT. EPRO2 RECT/ELLIPSE/OVAL/POLYGON pad shapes mapped to KiCad rect/circle/oval/custom; SMD vs THT detected by PAD.hole presence; SLOT holes use (drill oval w h). Pad nets resolved cross-doc via the existing PCB.PAD_NET → footprint.pad chain in ProjectRelations. layerId=2 component → (layer B.Cu) + text on B.SilkS so bottom-side parts render correctly. Smoke test on ESP-VoCat (6 PCBs): all 6 pass `kicad-cli pcb export svg` and render. DRC on smallest (MicBoard) reports 145 violations + 75 unconnected — most of the unconnected are GND nets that the EPRO2 source resolves through POUR copper, which Phase 2 will export. CLI: `python -m tools.epro2.kicad <project> --all-pcb --out <dir>` emits one .kicad_pcb per PCB doc. 52 → 65 unit tests pass. Float comparisons in tests use math.isclose because the s-expr 6-decimal trim doesn't preserve strict equality through `value * MIL_TO_MM` round-trips. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 00:18:32 +08:00
Knowit	fc2a45f658	docs: explain per-doc .epro2 crawl vs web-export .epro2 ZIP Colleague-facing explainer at docs/sources/pro_crawl_vs_export.md. Addresses the "I see 278 .epro2 files but my browser only downloaded one" confusion: web download is a ZIP container (extension is a UX choice, not a format), our crawl produces per-doc message streams. Both carry equivalent EPRO2 data; only real gap is IMAGE/ binary previews which we don't fetch yet. Why per-doc and not ZIP: the ZIP path has no public endpoint — three HARs confirm the export button fires zero HTTP requests, it's pure client-side JSZip on data already loaded by the editor. Our crawler hits the same chain endpoints the editor uses internally, which delivers per-doc streams. Log entry references the 278 vs 266 doc-count delta for ESP-VoCat (we walk full history chain, web export is a current snapshot). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 00:13:52 +08:00
Knowit	1e06ba6582	crawler: split sleep policy by host — chain blob fetches drop 5s -> 0.2s The Pro modern fetch_pro_modern walks a per-history blob loop on modules.lceda.cn (CDN-flavored host serving AES-encrypted EPRO2 streams). We were sleeping 5s between every blob — same rate we use for the rate-sensitive pro.lceda.cn API host. HAR analysis (proexportNew2.har) shows the editor fires these blobs back-to-back without throttling, so 0.2s is plenty. Walltime drops linearly with chain length: ESP-VoCat (chain=12): 80s sleep -> 22s sleep (-72%) 220V power (chain=28): 160s sleep -> 26s sleep (-84%) X86 board (chain~700, projection): ~1h -> ~3min Verified by re-fetching ESP-VoCat + 220V power: byte-identical output across all per-doc .epro2 files (sha256 match), only fetched_at timestamp differs in manifest.json. Two manifest files re-stamped as proof of the validation runs. API host sleeps (4x 5s in modern fetcher, 7x 5s in legacy fetcher) are unchanged — those go to pro.lceda.cn /api/ which still wants polite QPS<=0.2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 00:09:19 +08:00
Knowit	ff5553fb06	tools/epro2/kicad: hierarchical export + global_label + 5-Voltage power ports Three coupled changes so kicad-cli sch erc runs at the project level (across all sheets of one schematic) instead of single-sheet: 1. (label) → (global_label (shape passive)). EPRO2 nets are project-global by construction (named rails span every page in the SCH and physically wire across PCBs); KiCad's local label is sheet- scoped and triggers `label_dangling` for any name not duplicated on the same page. 2. New root_sch_writer that groups SCH_PAGE docs by their parent SCH (META.schematic), emits one root .kicad_sch per group with one (sheet ...) entry per child, and threads the root-assigned uuid back into each child's (sheet_instances) so KiCad can bind them. --all-sch now defaults to this; --flat falls back to one-file-per-page. 3. EPRO2's "5-Voltage" placeholder COMPONENT (partId pid8a0e77bacb214e, 365 instances on ESP-VoCat) is the editor's power port. The rail name lives in the placement's `Global Net Name` ATTR, not in the PART. We now emit a (global_label "<rail>") at the placement coords whenever that attr is set (101/365 of them on ESP-VoCat — the rest are unconfigured drafts). ESP-VoCat 5 hierarchical roots: 2325 → 2265 violations. Modest because 5 of 6 SCHs are single-page (no cross-sheet nets to resolve), and the one 4-page schematic (CoreBoard) shares only a handful of names across sheets — most net names are de-facto sheet-local. The remaining ~190 pin_not_connected are dominated by 0402-style passives whose pin tip lies on a wire's interior, not at an endpoint; KiCad needs an explicit (junction) at those points and we don't yet emit one. Marked as the next follow-up in log.md. 47 → 52 unit tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 00:05:47 +08:00
Knowit	54f0173947	tools/epro2/kicad: fix two structural ERC bugs — wire_dangling -88%, pin_not_connected -52% Bisect found two semantics mismatches between EPRO2 and KiCad that cause the 850 real-connectivity ERC violations on the ESP-VoCat ref project: 1. sym_writer was emitting lib coords without negating Y, but KiCad lib uses Y-up and re-flips Y on placement (Y-down schematic). So vertically arranged pins ended up at Y-mirrored absolute positions and wires that reach the geometric pin tip in EPRO2 missed the rendered pin tip in KiCad. Fix: lib_y = -epro2_y, lib_rot = (360 - rot) % 360 for pin/text. 2. sch_writer was treating each LINE as an isolated wire — but EPRO2 binds segments into nets by NAME (WIRE.NET attr), not just geometry. Multi-segment nets like GND/VBUS show up as N disconnected stubs to KiCad. Fix: per-LINE, look up lineGroup → WIRE → NET attr and emit a `(label "<NET>")` at the LINE's start. Same-named labels on distinct physical wires is how KiCad's ERC recognizes a multi-segment net. ESP-VoCat 9 sheets: wire_dangling 444 → 52 (-88%) pin_not_connected 406 → 196 (-52%) real connectivity total 850 → 248 (-71%) Why we did NOT round to grid (the obvious-looking fix): EPRO2 places some pins on a 10-mil pitch (e.g. magnetic socket); rounding to KiCad's default 50-mil ERC grid would collapse those pins. The 248 residual is fundamentally cross-sheet — single-sheet ERC can't see a net's other endpoints on sibling sheets — and is a Phase-3 (hierarchical sheet) problem, not a per-sheet one. 41 → 46 unit tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 23:43:11 +08:00
Knowit	5e63924474	oshwhub: pin listing index snapshot (33,695 rows, 29 MB) into git Previous commit added the dump script + report but the actual jsonl was caught by data/state/* gitignore. Add a targeted exception so the snapshot travels with the repo — anyone who clones can do local filtering without re-hitting the API. The data is regenerable (scripts/dump_listing_index.py is one-shot, ~1 min), but pinning a dated snapshot lets us reason about "the state of the corpus on 2026-04-28" reproducibly. Future re-dumps overwrite the same path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 23:32:58 +08:00
Knowit	d89a7cdf9c	oshwhub: dump full listing index (33,695 projects) for batch sizing Probed listing API and learned: total field is exposed (Pro=21,202 / Std=12,493), pageSize accepts >=1000 (full corpus = 35 requests / 71s), sort param is silently ignored. Dump all listings via scripts/dump_listing_index.py to local jsonl so downstream batch-selection no longer hits the API. Why: needed quantitative anchors before scaling Pro batch beyond top-5. License is detail-page only (~19h serial scan), so we want to filter on grade/like locally first to shortlist before paying that cost. Quality-tier counts now known: A-tier (grade>=3 & like>=10) = 2,806 across both origins. - scripts/dump_listing_index.py: one-shot scraper, polite QPS, streams to jsonl - docs/sources/oshwhub_listing_full.md: human-readable report with growth trends, quality tiers, owner concentration, and storage-budget anchors Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 23:30:56 +08:00
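Knowit	67a2d0448b	.gitignore: add .rpt for kicad-cli sch erc default output While bisecting KiCad 8 syntax, running kicad-cli sch erc without --output writes the report to <input>.rpt in the current directory; after a dozen or so runs the top-level directory had accumulated 20 stray .rpt files. Add the extension to the ignore list so they do not creep back in. Also cleaned up what this session left behind: 20 .rpt reports (already rm'd); data/state/std_probe[1-5]/, 5 old probe state directories (~8.5 MB stale; the probe scripts in these directories were deleted in the previous session, and the state itself is no longer useful). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 23:11:20 +08:00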
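Knowit	fb577cc89f	tools/epro2/kicad: fix two KiCad 8 parse blockers (newline + pin_numbers) After installing kicad 8.0.9 (apt PPA), we ran kicad-cli sch erc to validate the .kicad_sch files we emit and found that 9/9 sheets initially all failed with "Failed to load schematic file": parsing died at the parent node. Bisecting found two syntax bugs: 1. (pin_numbers (hide no)) is not accepted by KiCad 8. In KiCad 8 lib_symbols, `pin_numbers` is token-form and does not accept a (hide yes/no) sub-block. Either omit the whole block (default: visible) or write `(pin_numbers hide)` to hide them. The original `(hide no)` style is old KiCad 7 syntax. Fix: tools/epro2/kicad/sym_writer.py drops the (pin_numbers (hide no)) line; KiCad's default visible behavior is exactly what we want. 2. Literal \n / \r / \t inside a string makes the KiCad parser abort. ESP-VoCat's Overview sheet has TEXT "Battary\n3.7V 700mAH" (a multi-line battery label), stored in EPRO2 as a literal 0x0a character. We emitted it verbatim inside a "..."-quoted string, so the KiCad reader hit \n inside the quoted string and raised a parse error with no message. Fix: tools/epro2/kicad/sexpr.py adds \n / \r / \t escaping on the str escape path; the reader gains \r decoding (for round-trips). After the fix: 9/9 sheets parse OK in KiCad 8.0.9. ERC runs through; the 9 sheets total 2793 violations, distributed as: 1372 endpoint_off_grid (49%, cosmetic: the 30-mil EPRO2 grid does not snap to KiCad's default 50-mil grid) 571 lib_symbol_issues (20%, cosmetic: the facere library is not registered in the user library table; the library is embedded inline in the .kicad_sch and is usable) 444 wire_dangling (16%, real: wire endpoints are not precisely aligned with pins) 406 pin_not_connected (15%, the other side of the same problem) Cosmetic issues account for 70% and real connectivity for 30%; to be handled in the next phase: - grid calibration (round coords exactly onto a unified grid) - pin tip endpoint matching (KiCad needs wire endpoint == the absolute coordinate corresponding to the pin (at) field; floats must be exactly equal) - generate a sym-lib-table that registers the facere library (eliminates lib_symbol_issues) Tests: + test_string_escapes_newlines_and_tabs + test_lib_symbol_omits_pin_numbers_block; reader gains \r decoding; 41/41 pass (39 old + 2 new). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 23:04:58 +08:00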
Knowit	8a91ce43f4	tools/epro2/kicad: Phase-2 lib_symbols: render symbol bodies from SYMBOL docs In the .kicad_sch emitted by Phase 1, component positions + properties are all correct, but lib_symbols is an empty stub, so KiCad renders every component as a red "?". Phase 2 translates the PART + RECT/POLY/CIRCLE/TEXT/PIN primitives in SYMBOL documents into KiCad lib symbol blocks and fills them into lib_symbols so that KiCad shows the real schematic symbols. New tools/epro2/kicad/sym_writer.py: write_lib_symbol(symbol_doc) → S-expr list of the form: (symbol "facere:<partId>" (pin_numbers (hide no)) (pin_names (offset 1.016)) (in_bom yes) (on_board yes) (property "Reference" "U" ...) (property "Value" "<title>" ...) (property "Footprint" "" hide) (property "Datasheet" "" hide) (symbol "<partId>_1_1" (rectangle ...) ← from RECT.dotX1/Y1/dotX2/Y2 (polyline (pts ...)) ← from POLY.points + closed → fill (circle ...) ← from CIRCLE.center/radius (text "..." ...) ← from TEXT.value/x/y/rotation (pin <type> line (at ...) (length ...) (name ...) (number ...)) ← from PIN + sibling ATTR ops )) PIN name/number/electrical type resolution (the key data-probing point): an EPRO2 PIN does not carry number/name/type fields directly; this information is stored as separate ATTR operations (parentId=<pin_id>, key="Pin Name"/"Pin Number"/"Pin Type") Pin Type value mapping: IN→input, OUT→output, BIDIR→bidirectional, POWER_IN→power_in, POWER_OUT→power_out, NC→no_connect, ... default passive (conservative) sch_writer integration (lib_symbols filled automatically): write_sch_page(doc, project_relations=pr) gains the optional pr parameter. Internally, _build_lib_symbols(): collects the partIds used on this sheet → resolves them to SYMBOL documents via ProjectRelations.parts_by_id → write_lib_symbol → assembles the (lib_symbols ...) block; when one partId has several SYMBOL candidates the first is taken, with deduplication. WriteStats gains lib_symbols_embedded / lib_symbols_missing. The CLI gains --no-lib-symbols to fall back to Phase-1 behavior (for placeholder debugging). ESP-VoCat re-export verification: 9/9 SCH_PAGE, all with 0 lib_miss P1_45092758.kicad_sch wires=187 symbols=138 lib_emb=29 codec_0b0163fa.kicad_sch wires=190 symbols=112 lib_emb=20 Interface_b336a7c7.kicad_sch symbols=95 lib_emb=13 ... P1_408c9f4f.kicad_sch wires= 6 symbols= 10 lib_emb= 3 Tests: 6 new unit tests cover the outer wrapper / pin ATTR pull / multi-shape primitives / sch_writer integration path / missing-lib count / no-pr fallback to Phase 1. 39/39 pass in total (parser 6 + relations 9 + project_relations 6 + sexpr 6 + sch_writer 6 + sym_writer 6). Next step, Phase 3: footprint library + .kicad_pcb export. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 22:45:19 +08:00
Knowit	9213429a57	tools/epro2/kicad: Phase-1 EPRO2 → KiCad schematic exporter First version of the EPRO2 → .kicad_sch conversion: outputs a SCH_PAGE Document's wires + COMPONENT placements + TEXT into a sch file that KiCad 7+ can open. It does not include symbol bodies (lib_symbols is left as an empty stub), so components render as red "?" placeholders in KiCad, but wiring + positions + Designator/Value properties are all correct. Full symbol library export is left for Phase 2. Module layout: tools/epro2/kicad/sexpr.py hand-written S-expr emitter; Sym marks bare symbols, str is automatically quoted + escaped; floats drop trailing zeros; bool→yes/no; NaN/Inf raise an error tools/epro2/kicad/_sexpr_reader.py minimal S-expr parser, for round-trip tests only (not a full KiCad reader) tools/epro2/kicad/sch_writer.py write_sch_page(doc) → str; handles: LINE → (wire (pts ...) ...) COMPONENT → (symbol (lib_id facere:<partId>) (at x y rot) (property Reference ...) ...) TEXT → (text "..." (at ...)) units mil → mm × 0.0254; zero-length wires are skipped tools/epro2/kicad/__main__.py CLI: --doc <uuid> | --all-sch ESP-VoCat verification (python -m tools.epro2.kicad <project> --all-sch): all 9 SCH_PAGEs converted successfully P1_408c9f4f.kicad_sch wires= 6 symbols= 10 text= 0 skipped= 2 (370 lines) P1_ee409917.kicad_sch wires= 20 symbols= 14 text= 0 skipped= 3 P1_54743d77.kicad_sch wires= 42 symbols= 30 text= 3 Overview_dc13d6d2.kicad_sch wires= 0 symbols= 1 text= 34 (notes page) MCU_510cff33.kicad_sch wires= 91 symbols= 86 text= 9 Interface_b336a7c7.kicad_sch wires= 99 symbols= 95 text= 6 P1_5c38f45b.kicad_sch wires=179 symbols= 86 text= 9 P1_45092758.kicad_sch wires=187 symbols=138 text= 10 (main schematic) codec_0b0163fa.kicad_sch wires=190 symbols=112 text= 10 Output lands in data/processed/kicad_sch/<filename>.kicad_sch (gitignored, regenerable; not committed). Tests: 6 sexpr tests + 6 sch_writer tests, including round-trip parse verification. The existing 21 tests for parser/relations/project_relations are untouched; 33/33 pass in total. Next steps: 1. Phase 2: symbol library export (.kicad_sym), converting the PIN/RECT/ TEXT primitives of SYMBOL docs into KiCad symbol bodies; fill the lib_symbols block so components render as real schematic symbols 2. footprint library + .kicad_pcb export 3. run ERC validation with the KiCad CLI (kicad-cli sch erc) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 22:29:15 +08:00
Knowit	3052e42991	tools/epro2: add ProjectRelations for cross-document resolution Per-doc Relations is not enough once cross-doc references appear in bulk: in a PCB's PAD_NET compound id [PAD_NET, comp, pin, pad], the pad is actually a pad instance in a FOOTPRINT document; a SCH_PAGE's COMPONENT.partId points to the PART.id of some SYMBOL document. ProjectRelations does project-level aggregation on top of per-doc Relations and stitches these cross-document references together. Mapping rules found during the probe phase (ESP-VoCat), now written into the docstring: 1. SCH_PAGE COMPONENT.partId === PART.id in some SYMBOL doc - two naming styles: 'pid<hex>' (anonymous/system part) + '<name>.<n>' (named SKU), but both equal PART.id directly; they are not different namespaces - the same PART.id may appear in multiple SYMBOL documents (library snapshots); parts_by_id keeps all of them, and the consumer usually takes the first 2. PCB COMPONENT.id → FOOTPRINT document UUID via a separate ATTR op: ATTR(parentId=<comp>, key="Footprint", value=<fp_doc_uuid>) The COMPONENT.attrs sub-dict only has housekeeping fields (Unique ID / Channel ID / ...) and no footprint reference. This differs from the schematic side, where partId sits on the COMPONENT, and is an asymmetry in the EPRO2 stream 3. The pad in PCB PAD_NET[comp,pin,pad] is a pad id internal to the FOOTPRINT document; resolution chain: comp → ATTR Footprint → FOOTPRINT relations.pads[pad] API: ProjectRelations.build(project): single-pass build resolve_symbol_docs(sch_uuid, comp_id) → [SYMBOL doc uuids] resolve_footprint_doc(pcb_uuid, comp_id) → FOOTPRINT doc uuid | None pad_in_footprint(fp_uuid, pad_id) → PAD payload | None resolve_pcb_pad_net(pcb_uuid, comp, pin, pad) → {footprint, pad} | None attrs_for_pcb_component(pcb_uuid, comp_id) → {key: value} folded The CLI gains --project-relations; run on ESP-VoCat: documents 278 distinct_parts 87 duplicated_parts 9 pcb_components_with_footprint 206 pcb_components_unresolved_footprint 0 sch_components_with_partid 572 sch_components_unresolved_part 0 PCB sample check: comp=e0 → fp=1069352d81c6 Designator='U8', PAD_NET pin=1 pad=e7 net=GND resolves across documents to coordinates (-37.4,-45.24). Tests: 6 new unit tests cover partId→symbol, comp→footprint, cross-document PAD_NET, attrs folding, and unresolved counts. parser + relations + project_relations: 21/21 pass in total. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 22:22:39 +08:00
Knowit	7f9e2fad73	tools/epro2: add Relations layer for cross-object navigation Adds a Relations layer on top of replay's flat objects[id] -> payload, building indexes and back-references that stitch isolated objects into a traversable graph; it is the prerequisite intermediate representation for the later EPRO2 → KiCad converter. Relations.build(doc) scans all objects in a single pass and produces: Primary collections (bucketed by type): parts / components / pins / pads / wires / nets / layers / rules Compound ID parsing (key): '["LAYER",1]' → layers[1] '["NET","GND"]' → nets["GND"] '["PAD_NET","e0","1","e7"]' → pad_nets_by_pad/by_net '["RULE","SAFE","copperThickness1oz"]' → rules[("RULE","SAFE",...)] Back-references: obj_ids_by_part partId → referencing object ids (RECT/TEXT/PIN inside a lib all carry partId) components_by_part partId → component ids attrs_by_parent parentId → ATTR ids lines_by_wire WIRE.id → LINE ids (a wire consists of several LINE segments) pad_nets_by_pad PAD.id → PAD_NET records pad_nets_by_net net name → PAD_NET records objects_on_layer / objects_in_net reverse lookup by field Convenience accessor: attrs_dict(parent_id) folds all ATTR ops into a {key: value} dict (last write wins); the usual entry point for fetching Designator/Value/Footprint per component during KiCad conversion ATTR.parentId resolution (two pitfalls found in practice): 1. It does not only point to COMPONENT/PART; it also very often points to WIRE (net labels / net attributes on the schematic). The original check function missed these, producing 636 false-positive unresolved entries; changed to "any hit in doc.objects[parentId] counts as resolved" 2. The compound form `<comp_id>-<pin_id>` attaches an ATTR to a specific pin of a component (e.g. PinName). `_resolve_parent()` falls back to split("-",1) The CLI gains --relations, aggregating stats by docType: uv run python -m tools.epro2 data/raw/oshwhub/<uuid> --relations ESP-VoCat verification: SCH_PAGE 9 docs : 572 components, 563 wires, 934 lines_grouped, 4111 attrs_attached, 0 unresolved_parents PCB 6 docs : 206 components, 807 pad_nets, 173 nets, 544 layers SYMBOL 105 docs : 106 parts, 560 pins, 1680 attrs_attached FOOTPRINT 55 docs: 496 pads, 9 nets, 1771 layers, 140 rules Note: pads=6 vs pad_nets=807 inside PCB is not a contradiction: PAD instances live in FOOTPRINT documents, and the PCB stream references them across documents via the compound id ["PAD_NET",comp,pin,pad]; resolving "which pad of which footprint a given pin of comp goes through" needs project-level Relations aggregation (next task). Tests: tools/epro2/tests/test_relations.py, 9 unit tests covering compound id parsing, lineGroup linking, direct/compound parentId resolution, partId reverse lookup, and attrs folding. parser + relations: 15/15 pass in total. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 22:17:28 +08:00
Knowit	3c57e75d51	Add tools/epro2: EPRO2 parser + replay prototype Parsing skeleton for Pro 3.x .epro2 project source data, the groundwork before a downstream EPRO2→KiCad converter. Runs end-to-end on ESP-VoCat (278 docs / 7.5 MB) + 220V desktop power supply (771 docs / 26 MB) with 0 parse errors. Module layout: tools/epro2/parser.py single line → Op: rstrip("|") + split("||") + json.loads tools/epro2/replay.py state machine: DOCHEAD sets the header; other ops upsert by id (payload=None means delete); EDIT_HEAD/ META/CANVAS/PREFERENCE/PANELIZE are stored as doc-level singletons tools/epro2/__main__.py CLI: given a project directory, follows manifest.json to replay each doc, prints output aggregated by docType + optional --dump-doc for single-document detail tools/epro2/tests/ 6 unit tests pinning down pitfalls such as trailing-pipe / three-segment messages / id-only-no-payload / embedded pipe characters ESP-VoCat sample output: Documents: 278 (parse_errors=0) count docType objects ops deletes untyped_ops 105 SYMBOL 4124 4439 0 0 88 DEVICE 88 264 0 0 55 FOOTPRINT 4641 4855 0 0 9 SCH_PAGE 7982 8167 42 0 6 PCB 8428 8547 38 0 6 BOARD 9 18 0 0 6 SCH 9 26 0 0 1 BLOB 4 8 0 0 1 FONT 16 28 0 0 1 CONFIG 2 3 0 0 Top ops: ATTR 7035 / ELE_PLACEHOLDER 4225 / LINE 3005 / LAYER 2318 ... A single-document dump of a PCB confirms the semantics are correct: META contains title (PCB-EchoEar-CoreBoard-V1_0) + board reference; CANVAS contains origin/grid/unit (mm); LAYER 1/2/3 = TOP/BOTTOM/ TOP_SILK with complete color settings. How to run: uv run python -m tools.epro2 data/raw/oshwhub/<project_uuid> uv run python -m tools.epro2 data/raw/oshwhub/<uuid> --dump-doc <doc_uuid> Next steps (not in this commit): 1. Build inter-object relations (COMPONENT.partId → PART; LINE.lineGroup → WIRE; PAD_NET id → PAD + NET three-way association); the current replay only produces a flat dict 2. EPRO2 → KiCad serialization layer (hard gate for the Forge projection) 3. Full regression across the three Pro 3.x projects (the X86 motherboard, 7374 docs, can serve as a stress test) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 22:10:27 +08:00
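Knowit	c721e08c93	projects.md: replace Comments column with 版本 (Std / Pro 3.x / Pro 2.x) The Comments column was a weak signal of project "quality" (comment volume mostly tracks topic popularity); it is replaced by a "版本" (version) column that tells readers directly which EDA format + editor version each project's source uses. Of the current 15 projects: 10 Std / 3 Pro 3.x / 2 Pro 2.x. source_format field mapping: easyeda-std → Std easyeda-pro → Pro 3.x easyeda-pro-legacy → Pro 2.x anything else → passed through. editor_version (e.g. 6.5.43 / 3.2.91 / 2.1.40) goes on a second line as a sub-label. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 22:01:41 +08:00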
Knowit	c6279bff08	Add EasyEDA Pro 2.x legacy source ingestion (5/5 batch closure) Fills in the 2 legacy Pro projects that failed in the previous batch (立创·泰山派 RK3566, 立创·梁山派) and opens up the source-fetch path for old Pro 2.x projects. Combined with the modern Pro 3.x path from the previous commit, EPRO2/dataStr for all 5/5 Pro projects in this repo now works end-to-end. Pro 2.x and Pro 3.x are two completely different storage models: - Pro 3.x: git-style branch + linear history chain, an AES-128-GCM-encrypted incremental EPRO2 message stream replayed per history (opened up in the previous commit) - Pro 2.x: no branch / no history. Documents are stored as EasyEDA Std plaintext dataStr (same ["DOCTYPE","SCH","1.1"] format) and batch-fetched by doc UUID via /api/v2/documents/lists; the main body is unencrypted, and only the component library goes through AES. The Pro 2.x fetch chain was reverse-engineered from a HAR (tmp/prodownload3.har, 178 requests): GET /api/v4/projects/<P> → boards: [{sch, pcb, name}] GET /api/projects/<P>/ticket?uuid=&g_ticket=-1 → full project manifest POST /api/schematic/lists {uuids:[<sch>]} → sort: [{uuid:<sheet>}] POST /api/v2/documents/lists {uuids,docType:1} → schematic plaintext POST /api/v2/documents/lists {uuids,docType:3} → PCB plaintext POST /api/coppers/search {paths} → copper pour layers POST /api/textpath/search {paths,project_uuid}→ fonts/text POST /api/v2/resources/search {hash,project_uuid} → BLOB images Implementation: - crawlers/oshwhub/crawler.py: - fetch_pro_source() refactored into a dispatcher: it first GETs the project meta and checks branch_uuid; null means legacy and goes to _fetch_pro_legacy(), non-null goes to _fetch_pro_modern() - _fetch_pro_legacy() added (pulls all docs + auxiliary layers following the 9-step flow above) - _pro_post_json() POST helper (symmetric with _pro_get_json) - schemas/project.schema.json: source_format enum gains easyeda-pro-legacy - docs/sources/easyeda_pro_source.md rev 4: §1.1 legacy vs modern discrimination table updated, §2.7 adds the legacy fetch flow + measured data On-disk layout convention (legacy): source/ticket.json full manifest source/<sheet_uuid>.json each schematic sheet (with dataStr) source/pcb_<pcb_uuid>.json each PCB source/coppers.json/textpath.json/blobs.json auxiliary PCB layer resources source/manifest.json index Measured: 立创·梁山派 editor=2.1.30, 2 sheets+1 pcb, 1.0 MB, 78 sym/191 fp/128 dev 立创·泰山派 RK3566 editor=2.1.40, 29 sheets+1 pcb, 0.8 MB, 299 sym/524 fp/295 dev Legacy projects are two orders of magnitude smaller than modern ones (梁山派 1 MB vs RK3576 66 MB): there is no incremental history, the component library goes through a separate endpoint, and the project is itself the current snapshot. Final summary of the 5/5 Pro projects: X86 motherboard easyeda-pro 3.2.15 7374 docs / 481 MB 泰山派 RK3566 easyeda-pro-legacy 2.1.40 30 docs / 0.8 MB 梁山派 easyeda-pro-legacy 2.1.30 3 docs / 1.0 MB 220V desktop power supply easyeda-pro 3.2.69 771 docs / 26 MB ESP-VoCat easyeda-pro 3.2.91 278 docs / 7.5 MB Total: 8456 docs / ~516 MB plain. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 21:59:25 +08:00
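Knowit	3282a028c4	Add EasyEDA Pro EPRO2 source ingestion (3/5 batch test) Opens up the EPRO2 source-fetch path for modern Pro 3.x projects with oshwhub origin=pro. 3/5 modern Pro projects fully decoded (8423 docs / 542 MB plain in total): - X86 motherboard 7374 docs / 481 MB plain (chain=85, editor=3.2.15) - 220V desktop power supply 771 docs / 26 MB plain (chain=28, editor=3.2.69) - ESP-VoCat 278 docs / 7.5 MB plain (chain=12, editor=3.2.91) The remaining 2/5 are legacy Pro 2.x (立创泰山派 RK3566, 梁山派): the project meta returns branch_uuid=null + editorVersion="2.1.40", there is no git-style chain model, documents hang directly off the boards[].sch/pcb fields, and the access endpoints have not been worked out yet; metadata is stored in metadata.json and source/ is left empty. Implementation notes: - fetch_pro_source(): 4-step flow (project → branch HEAD → structures → /branches/<B>/histories/<HEAD>, which returns the full chain, so no ?limit batch endpoint is needed) + per-history AES-128-GCM decryption (16-byte IV, natively supported by pycryptodome) + gunzip + splitting into per-doc EPRO2 streams at DOCHEAD - EPRO2 parsing pitfall: the single `|` at end of line is a line terminator, not a field separator; you must rstrip("|") first and then split("||"), otherwise the payload JSON parse failure is silently swallowed and cur_doc never gets set, which is why the first pass over the X86 board's 7374 docs extracted only 2 - docType in practice goes far beyond BOARD/PCB/SCH/SCH_PAGE and also includes SYMBOL / FOOTPRINT / DEVICE / BLOB / FONT / CONFIG: Pro stores component library snapshots in the history along with the project, so a downstream EPRO2→KiCad converter must first load these lib docs into the symbol cache - Pro 2.x vs 3.x are different storage models: 3.x uses the branch model (working), 2.x uses direct boards[] links (not yet working); discriminator: whether the project meta's branch_uuid is null The CLI gains --with-pro-source / --backfill-pro-source / --pro-cookie / --origin (server-side filtering of the listing API by the origin field); crawl_one() automatically dispatches to the Pro fetcher when origin=pro. schema: docType type relaxed from integer to [integer, string, null] (compatible with Std's 1/3 + Pro's BOARD/SCH etc.), plus a new message_count field. License note: all 5 projects in this batch are NC-SA / GPL and do not meet the Pro source doc §4.2 Forge whitelist (MIT/BSD/Apache/CC0/CC-BY/CERN-OHL-P/Unlicense). Under the CLAUDE.md "research use, no redistribution" principle, storing the raw data is fine; the whitelist is applied separately at Forge projection time. See docs/sources/easyeda_pro_source.md rev 3 + log.md for full technical details. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 21:45:52 +08:00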
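Knowit	d874278bc5	Add EasyEDA Std project source ingestion (10 boards backfilled) Opens up the fetch path for the project source (schematic + PCB dataStr) of oshwhub origin=std projects. The original plan.md §1.6 assumed login was required; in practice lceda.cn/api/documents/<doc>?uuid=<doc>&path=<doc> is anonymously accessible for public projects: no cookie needed and no risk of account bans. Investigation: 4 rounds of probing left traces in data/state/std_probe[1-5]/ (gitignored); digging through the main.min.js bundle of Std editor v6.5.51 turned up the ajaxDetail endpoint; responses come in two shapes depending on docType (schematic project view vs PCB document view). Crawler: - make_source_client() uses a browser UA + lceda.cn/editor Referer, because the oshwhub /api/project/<uuid> endpoint rejects the FacereDataset/0.1 UA (CLAUDE.md UA exception clause: target site actively blocks custom UAs + public static resources) - fetch_std_source(): project meta → version_documents → per-document dataStr → written to source/<doc>.json + source/manifest.json - --with-source (fetch source while crawling new projects) / --backfill-source (scan existing ones only) - self-imposed QPS ≤ 0.2 (SLEEP_SOURCE = 5s) Schema: adds source_format / source_path / source_documents / editor_version (the first 3 are locked into an enum, to ease later alignment with Pro / KiCad sources). Backfill result: 10/10 succeeded, 45 documents, 33.2 MB; schema validation all passes. docTypes are mainly 1 (schematic) and 3 (pcb); the USB voltage/current meter has only PCB documents (4 of them: main board + cover + base + panel; the author did not upload schematic source). Full investigation: docs/sources/easyeda_std_source.md.	2026-04-28 20:07:40 +08:00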
Knowit	b0d3afd2a9	update readme	2026-04-26 11:54:01 +08:00
Zhang Jiahao	a3942c03df	Update EasyEDA Pro source research	2026-04-24 00:40:18 +08:00
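Zhang Jiahao	a16cb11c7d	Add easyeda_pro_source.md: full Pro project source chain + EPRO2 format analysis Why: - The project source fetch chain for pro.lceda.cn (立创 EDA Pro edition) is now working: 4-step API + AES-128-GCM decryption + gzip decompression + EPRO2 message stream parsing. All of this needs to be written up and kept as a standalone document so it is not lost; it also paves the way for later implementing or selecting an EPRO2 → KiCad converter. - It sits alongside oshwhub.md (Std edition) as an independent research document: Pro and Std are two separate editors with different cookies/APIs/formats, and mixing them would only cause confusion. What: - docs/sources/easyeda_pro_source.md: * TL;DR table + §1 Std vs Pro comparison * §2 4-step API chain + required headers (Editor-Version/path/Referer/Cookie) + Python decryption code + measured data (2.7 MB source stream / 8357 messages) * §3 full classification of the EPRO2 format: 40 message types grouped by function (parts/geometry/PCB/layers/rules/...) + samples for each group * §4 security and compliance (risk control / license / semantics of key leakage) * §5 gap table for integrating with Forge (OSHWHUB_INGEST_SPEC.md) * §6 7 known-unverified items * Appendix A one-shot rerun commands - pyproject.toml: + pycryptodome>=3.23.0 (AES-GCM decryption dependency) - log.md: record of this session Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 00:11:32 +08:00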
Charles Zhang	f797205dc8	add epec md	2026-04-23 23:42:21 +08:00
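Zhang Jiahao	1a67df44ba	Add docs/infra.md: dev1 (Guangzhou) deployment record Why: - Lands the infrastructure documentation required by Phase 0.5. The machine is already deployed; this records machine specs, SSH conventions, repo location, dependency installation, credential management rules, environment variables, long-running job and logging policy, and disk emergency thresholds, so that I or a new collaborator can pick things up quickly later. - Strictly contains no credential values (token / cookie / password); only events and structure are written down. Deployment notes: - Clone with GIT_LFS_SKIP_SMUDGE=1: saves bandwidth and disk (local repo is 32MB instead of 535MB+); historical LFS objects are pulled on demand - uv uses the Tsinghua mirror + only-system python 3.10: sync finishes in seconds and avoids downloading a standalone Python - ~/.secrets/ is in place with mode 700 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 23:36:21 +08:00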
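Zhang Jiahao	fb7c0488bc	Relax Python requirement 3.11 → 3.10 Why: - Ubuntu 22.04 (the distribution on the Guangzhou cloud server dev1) ships python3.10. Requiring 3.11 made uv download a standalone Python interpreter, and pulling Astral's python-build release from inside China is slow. Relaxing to 3.10 uses the system Python directly, so sync finishes in seconds. - The code uses no 3.11+ features (it uses `from __future__ import annotations`, which supports PEP 604/585 syntax in a 3.10 context), so lowering the version costs nothing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 23:34:54 +08:00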

59 Commits
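
Two of the coordinate rules stated in the log are easy to misread in prose: commit 9213429a57 converts units as mil → mm × 0.0254, and commit 54f0173947 fixes symbol-library coordinates with lib_y = -epro2_y and lib_rot = (360 - rot) % 360. The sketch below only restates those two formulas; the function names `mil_to_mm` and `to_kicad_lib_coords` are hypothetical and are not taken from sym_writer.py, and applying the unit conversion inside the same helper is an assumption made for illustration.

```python
MIL_TO_MM = 0.0254  # 1 mil = 0.0254 mm, as stated in commit 9213429a57


def mil_to_mm(value_mil: float) -> float:
    """Convert a length in mils to millimetres."""
    return value_mil * MIL_TO_MM


def to_kicad_lib_coords(x_mil: float, y_mil: float, rot_deg: float):
    """Hypothetical helper: map an EPRO2 symbol-space point to KiCad lib space.

    Per commit 54f0173947, Y is negated (lib_y = -epro2_y) and the rotation
    is mirrored (lib_rot = (360 - rot) % 360). Converting to mm here is an
    illustrative assumption, not a documented detail of the exporter.
    """
    lib_x = mil_to_mm(x_mil)
    lib_y = mil_to_mm(-y_mil)
    lib_rot = (360 - rot_deg) % 360
    return lib_x, lib_y, lib_rot


if __name__ == "__main__":
    # A pin at (100 mil, 50 mil) rotated 90 degrees in EPRO2 symbol space.
    print(to_kicad_lib_coords(100, 50, 90))
```

The modulo keeps a rotation of 0 at 0 rather than 360, which matters if the value is later compared against KiCad's canonical 0/90/180/270 angles.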