Snapshot of full oshwhub std corpus delivery:
- 12,493 projects total, 12,166 (97.4%) with editor source
- 4 sweep batches + 1 early-mixed = 5 zip artifacts in COS GZ + SG buckets
- 30-day SG-region presigned URLs for downstream pickup
log.md tracks the multi-batch sweep, including the driver-bug postmortem
(python3 launched from a bash heredoc could not find httpx → 26-min run
wasted on empty zips; recovered by switching to uv run).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Topical index for std-origin flight-controller projects. Combines
data/state/oshwhub_listing_full.jsonl listing fields with each
project's metadata.json (license, source completeness,
editor_version). Useful as a flat per-topic reference vs the global
projects.md sorted purely by stars.
77 added this batch (commit 29530e0) + 2 prior. 75 have editor source,
4 are attachments-only on upstream.
scripts/build_feikong_index.py is reproducible: source of truth lives
in data/state/ + data/raw/, no hand-editing.
Topic-targeted pull from local listing index (`name OR introduction`
contains 飞控, i.e. "flight controller"). 79 std hits in
oshwhub_listing_full.jsonl, 2 already crawled, 77 newly fetched.
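A minimal sketch of the topic filter described above, assuming each line
of the listing jsonl is one JSON object carrying `name` and `introduction`
string fields. The function name is hypothetical and is not taken from
the crawler.

```python
import json

def topic_hits(jsonl_lines, keyword="飞控"):
    """Yield listing entries whose name OR introduction contains keyword."""
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # Missing or null fields are treated as empty strings (assumption).
        fields = (entry.get("name") or "", entry.get("introduction") or "")
        if any(keyword in field for field in fields):
            yield entry
```

Usage would be along the lines of
`list(topic_hits(open("data/state/oshwhub_listing_full.jsonl", encoding="utf-8")))`.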
dev1 (Guangzhou) walltime:
Step 1 detail scrape ~12s, Step 4 std-source backfill ~80s
(concurrency=5)
Source completeness: 73/77 with editor source, 4 are upstream
attachments-only (no editor session ever attached, source_documents=[]
is genuine — no editor_version on the SSR page either).
Crawler hardening (crawlers/oshwhub/crawler.py):
- count.{like,star,fork,views} are now read defensively via `.get(..., 0)`.
Listing API omits zero-valued fields for some low-activity entries
(3/77 hit this on first pass, hard-failed with KeyError 'like').
Affects rank_score, pick_top, and metadata.json metrics block.
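The defensive read can be sketched as below. This is an illustration
only: `read_counts` is a hypothetical helper name, and treating a missing
`count` block as empty is an extra assumption beyond what the patch
describes.

```python
COUNT_FIELDS = ("like", "star", "fork", "views")

def read_counts(listing_entry):
    """Return all four counters, defaulting omitted ones to 0."""
    # Assumption: a missing or null "count" block is treated as empty.
    count = listing_entry.get("count") or {}
    return {field: count.get(field, 0) for field in COUNT_FIELDS}
```

With this shape a listing entry that omits `count.like` yields `like=0`
instead of raising KeyError.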
License mix: 65% GPL 3.0, 11% Public Domain, 11% MIT, ~6% CC variants.
Transport: dev1 → SG via tar+scp (33 MB, ~3 min over lossy
cross-region link). Bypassed gitea push from dev1 because the same
6.5%-loss link tanks single-stream throughput.
Topic-pull from local listing index (`name OR introduction` contains
飞控). 77 std hits in oshwhub_listing_full.jsonl, minus 2 already
crawled = 75 attempted; 74 OK + 1 hard fail (`m1_mh743_ada_v4`,
listing entry missing `count.like`).
dev1 walltime: Step 1 ~12s, Step 4 ~80s (concurrency=5).
License mix: 65% GPL 3.0, 11% Public Domain, 11% MIT, 4% CC variants.
3 partial dirs (Step 1 KeyError on missing `count.like`) dropped — to
be re-fetched after a follow-up crawler patch makes count fields
defensive against listing-index outliers.
Source backfill 74/74 OK, total +46 MB.
Both _run_backfill_source and _run_backfill_pro_source now honor
--concurrency N (default 1 keeps current sequential behavior). Shared
dispatch helper _run_backfill_concurrent + _discover_backfill_targets
factored out — the two paths had drifted but were structurally the same.
Thread safety:
- httpx.Client is sync-thread-safe per docs; one client shared across
threads is correct
- Per-project file writes (metadata.json + source/*) don't conflict
since each thread owns one project dir
- Oversize state file is shared; serialized via a Lock around
_record_oversize
- Print is wrapped in a Lock for readable progress
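A rough sketch of the locking layout listed above, not the actual
_run_backfill_concurrent. `run_backfill`, `fetch_one`, and the in-memory
`oversize` list (standing in for the shared oversize state file) are
hypothetical names for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def run_backfill(targets, fetch_one, concurrency=1):
    """Run fetch_one(target) per project; serialize shared state and prints."""
    print_lock = threading.Lock()
    state_lock = threading.Lock()
    oversize = []  # shared across threads, so guarded by state_lock

    def task(target):
        # Per-project work: each thread owns one project dir, no lock needed.
        size_mb = fetch_one(target)
        if size_mb is None:  # assumption: None signals an oversize project
            with state_lock:
                oversize.append(target)
        with print_lock:
            print(f"[done] {target}")
        return target

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        done = list(pool.map(task, targets))  # map preserves input order
    return done, oversize
```

With concurrency=1 the pool has a single worker, so behavior stays
effectively sequential, matching the default described above.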
Expected speedup on dev1 (Guangzhou): batch-200 Pro, 100 projects:
sequential ~14 min -> concurrency 5 ~3-4 min. Std similar, 2-3x.
Server-side limit isn't likely to bite at this scale (probe showed Pro
QPS=2 sustained clean; concurrency 5 puts effective rate around 4-5 req/s).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Doubles down on what worked in batch-50:
- dev1 (Guangzhou) is primary execution host
- Owner cap=2 for diversity
- --max-source-mb 200 to defend against X86-class outliers
- Pro 2.x deprecated-board fix is already in (commit c3cac97)
- SSH transport for dev1 -> gitea (commit 8220c99)
Candidate pool:
200 picks from A-tier (grade>=3 & like>=10) minus already-crawled 65
Remaining A-tier corpus is 2,741 (Pro 1326 + Std 1415)
173 unique authors, like median 258, grade dist 4:118 / 3:82
Estimated walltime ~25-35 min on dev1 for Step 1-4 (no attachments).
LFS increment ~2.5 GB (source only) or +10 GB if Step 5 attachments
included. Either way well within Gitea's 200 GB migration threshold.
Step 5 (attachment download) deferred — not on the critical path for
EPRO2/Std → KiCad work, can revisit when license-filtered Forge
projection demands it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Refreshed via scripts/build_index.py. Reflects the full corpus state
post batch-50 Step 1-4 + Super Dial deprecated-board fix:
65 projects · 253 attachments · 3.1 GB declared
by origin: Pro 30 (5 modern + 25 legacy) + Std 35
by license: GPL 3.0 dominant, ~80% Forge-friendly
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pro 2.x legacy. 6 docs (3 sch sheets + 3 PCBs), 0.2 MB plain. The
deprecated 主控板V1 sch/pcb pair is correctly skipped (filter via
ticket.schematics/pcbs keys, see crawler commit c3cac97).
batch-50 success rate is now 50/50 (was 49/50).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pro 2.x project metadata's boards[] can reference sch/pcb UUIDs that
the project owner has since deprecated/deleted (e.g. "主控板V1(废弃)").
Such UUIDs are gone from ticket.schematics / ticket.pcbs but still in
boards[]. Asking schematic/lists or documents/lists for them returns
401 and aborts the whole project.
Filter both lists against the authoritative ticket dict before posting.
Verified on 7f7565ef11 (Super Dial 电机旋钮屏): 4 boards but only 3
sch entries in schematics dict, isolating the deprecated 8bc59f to a
401 we now skip.
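The filtering step can be sketched roughly as follows. The per-board key
names `schematic` and `pcb` are assumptions for illustration; only
boards[], ticket.schematics and ticket.pcbs come from the description
above.

```python
def live_uuids(boards, ticket):
    """Keep only sch/pcb UUIDs still present in the authoritative ticket dicts."""
    schematics = ticket.get("schematics", {})
    pcbs = ticket.get("pcbs", {})
    # Hypothetical board keys: "schematic" and "pcb".
    sch = [b["schematic"] for b in boards if b.get("schematic") in schematics]
    pcb = [b["pcb"] for b in boards if b.get("pcb") in pcbs]
    return sch, pcb
```

Only the surviving UUIDs would then be posted to schematic/lists and
documents/lists, so a deprecated entry can no longer abort the project.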
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Avoids HTTPS-over-lossy-link TCP cwnd issues that pinned the previous
push to ~360 KB/s for 10 min on the batch-50 Step 3-4 commit. SSH key
generated on dev1 (~/.ssh/id_ed25519), public key posted to gitea via
/api/v1/user/keys (title "dev1-guangzhou"), origin URL updated to
ssh://git@git.deepknow.site:222/Facere/FacereDataset.git.
Also documents the kernel + git side optimizations applied:
sysctl net.ipv4.tcp_congestion_control=bbr (was cubic)
git config --global http.postBuffer 524288000 (500 MB)
Note: gitea git SSH port is 222, not 22 (22 is the host sshd).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run on dev1 (Guangzhou) for the latency advantage. Walltime 3:41 vs
Singapore-estimated 1-2h (roughly 16-33x speedup, mostly from
image.lceda.cn RTT going from 263ms to 2.6ms).
Pro 25: 24 ok + 1 fail (Super Dial 7f7565ef11 — Pro 2.x legacy
schematic/lists 401, separate cookie-perm issue)
611 docs, 31 MB total
Std 25: 25 ok, 97 docs, 74 MB total
Combined: 49/50 success, 708 docs, 105 MB new disk usage
--max-source-mb 200 cap was not tripped; the 25 Pro candidates are all
under 10 MB, so the 481 MB X86-board outlier from the original sample
was not representative.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pro 2.x stores some doc payloads (notably Taishan's PCB) externally at
modules.lceda.cn keyed by dataStrId, AES-256-GCM encrypted with the
iv/key fields stored alongside. Same crypto pattern as Pro 3.x EPRO2:
last 16 bytes are the GCM auth tag, rest is gzip(plaintext-op-stream).
The CDN doesn't require auth.
- pro2_writer.fetch_encrypted_plaintext(): fetch + decrypt + gunzip,
cache result at source/<uuid>.decrypted.txt so re-runs skip the
network round-trip. Heavy imports (httpx, pycryptodome) are
deferred to call-time so the pure-replay path doesn't pay for them.
- pro2_writer.split_plaintext_by_doctype(): walk the multi-doc
plaintext (Pro 2.x bundles N FOOTPRINTs + 1 PCB into one blob), yield
(label, sub_text) per inner doc. Label = HEAD.uuid if present, else
fallback `<kind>_<idx>`.
- __main__._convert_pro2_encrypted(): for each sub-doc, write a
synthetic inline-Pro-2.x JSON next to the original and re-route
through write_pro2_doc — re-uses BBox / layers / objects-extraction
instead of duplicating the logic. Output filename
`<parent_uuid>__<sub_label>.json` makes the parent association
visible.
Smoke (Taishan): 28 inline SCHs → 55 total. Decrypts:
- one PCB blob (3.4 MB plaintext, 20267-object PCB + 25 FOOTPRINT
sub-docs of 130-580 objects each)
- one SCH-typed encrypted doc (1 sub-SCH of 891 objects)
86 unit tests still pass; new fetch/decrypt path is covered manually
via the smoke test rather than mocking httpx + AES.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The downstream colleague's "encrypted_external" / "string old format"
projects were Pro 2.x, not Pro 3.x EPRO2. Pro 2.x ships each doc as a
JSON file whose `dataStr` is a plaintext op-stream — one JSON array per
line, e.g. `["COMPONENT","e1","",0,0,0,0,{},0]`. Different wire format
from EPRO2's binary tilde/pipe streams; same Std envelope works for
output.
- tools/epro2/std/pro2_writer.py: parses dataStr line-by-line, keys
objects by id (position 1 for most ops, OPTYPE for singletons),
extracts BBox by walking known coord positions per OPTYPE, derives
layers from LAYER ops directly (Pro 2.x almost matches Std layer
string format already). PCB blobs that are encrypted-external
(`dataStrId` URL + `iv` + `key`, no inline dataStr — Taishan PCB)
return None so the CLI skips with a message instead of stubbing.
- tools/epro2/std/__main__.py: auto-detect via manifest's
editor_version. "2.x" → Pro 2.x writer; otherwise the existing
EPRO2 replay path. CLI surface and output layout unchanged.
- docs/sources/epro2_to_std_mapping.md: adds a Pro 2.x section.
Adapter dispatches on `head.epro_format`: absent / "epro2" gets
dict-shaped objects values, "pro2" gets array-shaped values
(`[OPTYPE, arg1, ...]`). Lists the Pro 2.x-specific OPTYPEs
(FONTSTYLE / LINESTYLE / CONNECT / OBJ / REGION / DIMENSION /
STRING / TEARDROP) the EPRO2 vocabulary doesn't have.
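A minimal sketch of the line-by-line dataStr parse described for
pro2_writer.py, under stated assumptions: the singleton set below
(DOCTYPE, HEAD) is a guess for illustration, and the function name is
hypothetical.

```python
import json

# Assumed singleton OPTYPEs; the real list lives in pro2_writer.py.
SINGLETON_OPTYPES = {"DOCTYPE", "HEAD"}

def parse_pro2_datastr(data_str):
    """Parse one-JSON-array-per-line ops into {key: op}."""
    objects = {}
    for raw in data_str.splitlines():
        raw = raw.strip()
        if not raw:
            continue
        op = json.loads(raw)
        optype = op[0]
        # Most ops carry their id at position 1; singletons key by OPTYPE.
        if optype in SINGLETON_OPTYPES or len(op) < 2 or not isinstance(op[1], str):
            key = optype
        else:
            key = op[1]
        objects[key] = op
    return objects
```

The values stay array-shaped (`[OPTYPE, arg1, ...]`), which is the
"pro2" shape the adapter dispatches on.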
Smoke (re-running --all on all 5 Pro projects): 191 → 222 JSON files.
Liangshan adds 3 (2 SCH + inline 5357-object PCB). Taishan adds 28
(SCH only — PCB skipped, encrypted-external; source/<uuid>.json still
keeps the dataStrId/iv/key for a later fetch+decrypt pass).
84 → 86 unit tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Downstream came back with concrete requirements: don't pre-compute Std
shape[] tilde strings, just dump the raw EPRO2 `objects: {id: payload}`
dict and they'll write a ~100-LoC adapter on their side. Pulling the
tilde-mapping work back saves us from second-guessing positional fields
without their parser to verify against, and shortens our pcb_writer
from ~500 lines to ~40.
Output shape (Std envelope intact, just no `shape[]`):
    {
      "success": true, "code": 0,
      "result": {
        "uuid", "puuid", "title",
        "docType": 3 | 1,
        "components": {},
        "dataStr": {
          "head": {
            "docType": "3" | "1",
            "editorVersion": "facere-epro2/0.1 (epro2 <X.Y.Z>)",
            "units": "mil",
            "epro2_doc_uuid": ...,
            "epro2_editor_version": ...,
          },
          "BBox": {x, y, width, height},   # mil
          "layers": [...],                 # Std layer-string array
          "objects": dict(doc.objects),    # raw EPRO2, 1:1
          "preference": {}, "netColors": [], "DRCRULE": {},
        }
      }
    }
Per-doc spec downstream gave us:
- shape[] dropped (empty placeholder misleads adapter)
- all units mil (no mm conversion — Std canvas already declares mil)
- head.units="mil" so adapter doesn't have to guess
- BBox min/max across known x/y/startX/endX/centerX fields; adapter
can refine by walking path arrays itself
- layers[] keeps Std's 17-line default + inner SIGNAL layers actually
used (21~Inner1.., 22~Inner2..)
- empty stubs preference/netColors/DRCRULE for grep-based triage
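The BBox rule in the spec above can be sketched as below. The Y-side
field names (startY/endY/centerY) are assumed by symmetry with the X
fields named in the spec, and the all-zero fallback for documents with no
coordinates is also an assumption.

```python
X_FIELDS = ("x", "startX", "endX", "centerX")
Y_FIELDS = ("y", "startY", "endY", "centerY")  # assumed by symmetry

def _coords(payload, fields):
    """Collect numeric values of the given fields from one object payload."""
    return [payload[k] for k in fields if isinstance(payload.get(k), (int, float))]

def compute_bbox(objects):
    """Min/max bounding box (mil) over known coordinate fields."""
    xs, ys = [], []
    for payload in objects.values():
        if isinstance(payload, dict):
            xs.extend(_coords(payload, X_FIELDS))
            ys.extend(_coords(payload, Y_FIELDS))
    if not xs or not ys:
        return {"x": 0, "y": 0, "width": 0, "height": 0}
    return {"x": min(xs), "y": min(ys),
            "width": max(xs) - min(xs), "height": max(ys) - min(ys)}
```

Path arrays are deliberately ignored here, in line with leaving that
refinement to the adapter.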
New: docs/sources/epro2_to_std_mapping.md with the full EPRO2 OPTYPE →
Std verb table that downstream's adapter authors will copy from. Tables
include the layer-id remapping (the 5↔7 paste/mask flip, 11→10 outline,
12→11 multi, SIGNAL 15+→21+), PCB op mappings, SCH op mappings (marked
best-effort: no Std SCH samples in our corpus), and the 5-Voltage
placeholder COMPONENT → extra net flag trick. Extracted from the
previous Option-3 writer (commit fe6971f) so adapter writers don't
have to reverse-engineer it from source.
ESP-VoCat smoke: 6 PCB + 9 SCH = 15 JSON files, head.units=mil
preserved, no shape[] field present. 82 → 84 unit tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three crawler ergonomics for batch operations:
  --no-cover        Skip cover image download. For scan-only modes
                    (license/meta scrape) this drops ~1.3s/project and
                    avoids slow-CDN hangs.
  --concurrency N   ThreadPoolExecutor wrapping the per-project loop.
                    Default 1 = serial (current behavior). Anonymous
                    endpoints tolerate 5+ comfortably; output uses a print
                    lock for readable interleaved progress. fetch_cover
                    plumbs through crawl_one.
Drop cross-host sleep #1: in crawl_one between detail HTML (oshwhub.com)
and cover image (image.lceda.cn). Different hosts — sleep was unnecessary.
Saves ~1s/project. Sleep #2 (post-cover, before next iteration) stays — it
gates the next oshwhub.com hit.
download_to gains max_seconds wall budget (default 60s, cover uses 15s).
Defends against pathologically slow CDN connections — observed 10 KB/s
on image.lceda.cn for one project, would have hung 6+ min on a 3.6 MB
cover otherwise. httpx default timeout resets per chunk, so streaming
downloads need an external wall-clock guard.
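A sketch of the external wall-clock guard, written against a plain chunk
iterator so it does not depend on httpx. `read_with_budget` and
`DownloadTimeout` are hypothetical names, not the actual download_to
implementation.

```python
import time

class DownloadTimeout(Exception):
    """Raised when a streaming read exceeds its wall-clock budget."""

def read_with_budget(chunks, max_seconds=60.0, clock=time.monotonic):
    """Consume an iterable of byte chunks under a total wall-clock budget."""
    deadline = clock() + max_seconds
    buf = bytearray()
    for chunk in chunks:
        # A per-chunk timeout resets on every read; this check does not.
        if clock() > deadline:
            raise DownloadTimeout(f"exceeded {max_seconds}s wall budget")
        buf.extend(chunk)
    return bytes(buf)
```

The check runs after each chunk arrives, so a single stalled read is
still bounded only by the per-read timeout; the budget caps the total.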
batch-50 Step 1 (license/meta scrape) shipped:
50/50 candidates have metadata.json + license recorded
License distribution: GPL 3.0 32, Public Domain 6, NC variants 8,
CERN-OHL 1, MIT 1, CC BY 3.0 1
Forge-friendly (non-NC): 41/50 (82%)
Declared attachments: 180 files / 2.36 GB (median 18 MB/proj, max 304 MB)
Walltime: 3min 26s for 28 projects at concurrency=5 (server-side
HTML render bound, not sleep-bound)
One orphan partial cover (a670e60a...) cleaned up — leftover from the
first aborted run before the timeout fix landed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The downstream colleague consumes oshwhub Std (lceda) dict-format JSON,
not KiCad. The EPRO2 decryption part (per-doc plaintext .epro2 streams
in data/raw/<uuid>/source/) is what we already provide; the missing
piece is converting EPRO2 op-streams into the same `dataStr.shape`
tilde-delimited format their parser already speaks.
New tools/epro2/std/ module, peer of tools/epro2/kicad/, kept
deliberately separate so the KiCad path stays untouched:
- pcb_writer.write_pcb_std() — high-fidelity, validated against a Std
PCB sample at data/raw/oshwhub/3e2f893d.../25931ddab8.json. Maps
LINE→TRACK, VIA→VIA, POUR→COPPERAREA (with SVG `M..L..Z` path),
POLY→CIRCLE/SOLIDREGION, COMPONENT+FOOTPRINT→LIB nested with
#@$-separated PADs (placement rotation + translate applied so pad
coords land at PCB-absolute positions). Layer-id mapping (EPRO2 5↔7
flipped vs Std solder/paste, 11→10 outline, 12→11 multi, SIGNAL
inner 15+ → Std 21+) noted inline.
- sch_writer.write_sch_std() — best-effort. Our corpus has zero Std
schematic samples (docType=1) so verb field orders follow the
EasyEDA Std public spec, not direct observation. Emits W (wire),
N (net flag, including the 5-Voltage Global Net Name power-port
pattern), T (text), LIB (placement with #@$-nested PIN/T). If
downstream's parser bails the fix is almost certainly a positional
field tweak, not a re-architecture.
- __main__.py — flat output `<doc_uuid>.json` per doc directly under
--out (mirrors Std's own data layout); --all-pcb / --all-sch / --all.
Smoke test on ESP-VoCat: 6 PCB + 9 SCH = 15 JSON files, libs_unresolved=0
across the board. Compact JSON (separators=(",",":")) matches Std's
single-line format. Numbers use _num() — integers without trailing .0,
floats trimmed.
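For illustration, the number and JSON formatting could look like the
sketch below. `num` is a hypothetical stand-in for _num(), and the
6-decimal rounding is an assumption, not a confirmed detail of the
writer.

```python
import json

def num(value, digits=6):
    """Return an int when the value is integral, else a rounded float."""
    rounded = round(float(value), digits)  # digits=6 is an assumption
    if rounded == int(rounded):
        return int(rounded)
    return rounded

def dump_compact(obj):
    """Single-line JSON with no spaces after separators."""
    return json.dumps(obj, separators=(",", ":"), ensure_ascii=False)
```

For example, `dump_compact({"x": num(3.0), "y": num(2.5)})` gives
`{"x":3,"y":2.5}`.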
71 → 82 unit tests pass.
Open questions for downstream: (1) confirm SCH verb field orders, (2)
do they want any of the upstream metadata fields we drop (master,
owner, created_at, etc — those live on the crawler side, not the
schematic itself)?
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The doc had been growing incrementally as each host got probed; reshape
it as a polished benchmark with TL;DR top, methodology section
(including safety constraints + caveats), per-host detailed tables,
final crawler settings, batch-50 walltime breakdown, and a reproduce
recipe.
Five hosts fully covered:
pro.lceda.cn API 5.0s -> 0.5s (10×)
lceda.cn doc 5.0s -> 0.5s (10×)
oshwhub detail 2.0s -> 1.0s ( 2×)
oshwhub listing 2.0s -> 1.0s ( 2×)
modules.lceda CDN 0.2s (already optimized)
Net effect on batch-50 plan: sleep total ~32min -> ~3min, walltime
~2h -> ~10-15min.
Key finding: the original 5s/req on Pro was set out of "logged-in
account is precious" caution with zero empirical evidence. Sustained
burst probe (25 distinct UUIDs at 0.5s, no recovery) showed 0/25 errors
and median latency 410ms — the caution was unjustified.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Killed at the 14-min mark — VmRSS 1.96 GB + VmSwap 1.41 GB on a 3.3 GB
RAM box with 4 GB swap (3.6 GB used), read_bytes 24 GB (pure swap thrash),
process state D (uninterruptible disk sleep). The CPU board PCB doc
(8K+ objects, 35+ child schematic pages) overflowed our current
all-in-memory build pattern: pcb_writer builds the full output list
before to_sexpr serializes once at the end, plus the 35 write_sch_page
calls each build their own Relations + lib_symbols dedup state.
Saved what finished: 4/5 X86 boards complete (Sch-CAM-IMX415,
Schematic1, SCHEMATIC1, Sch-VTX-SSC338Q), the CPU board SCHEMATIC1_1
has all its 35 child .kicad_sch but no .kicad_pcb. Final downstream
delivery: 17 board projects across the 3 supported Pro projects, 32/32
files pass kicad-cli (sch erc + pcb export svg).
Streaming-write fix is the next logical follow-up but out of scope
for this turn.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ladder probe lceda.cn/api/documents/<uuid>: 5 tiers (5/2/1/0.5/0.25s)
× 9 distinct Std doc UUIDs = 45 reqs total, all 200/success. Latency
variance is dominated by payload size (Std docs span 4 KB to 4.5 MB)
not server backpressure. Same posture as Pro API.
Net effect on batch-50 estimate: 25 Std projects × 10 doc calls saves ~19
min wall time (21min sleep -> 2min sleep). Combined plan now projects
~2h -> ~10min walltime exclusive of download bytes.
scripts/probe_rate_limit.py: --host std-doc tier added. Reads doc UUIDs
from /tmp/std_doc_uuids.json (assembled by caller from any source/manifest.json
upstream_version_documents lists). Reusable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures the two new --all crash paths fixed in 61fd3ff (odd inner
copper layers, duplicate BOARD titles) plus the Pro 2.x scope gap
(Taishan + Liangshan are JSON-format, not EPRO2 streams, so our
replay_project reads the bytes but doc_type stays None and
_group_by_board returns no SCH/PCB groupings — needs a separate
Pro 2.x writer).
Status as of this commit: ESP-VoCat 6 boards + 220V power 7 boards =
13 project dirs ready for downstream corpus. X86 motherboard is the
largest of the five (7374 docs, 1.9 GB RAM in flight) and still
running.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Running the new --all on the remaining 4 Pro projects (X86 motherboard,
220V power supply, Taishan Pi, Liangshan Pi) surfaced two crash modes
not covered by ESP-VoCat:
1. Odd inner-layer count → KiCad rejects the file at load with
"3 is not a valid layer count". The 220V power boards have one
used inner SIGNAL layer (3 copper total: F.Cu / In1.Cu / B.Cu),
but KiCad requires an even copper count. Fixed pcb_writer to pad
with one empty inner layer when the inner count is odd, so the
total stays even (2, 4, 6, ...).
2. Two BOARDs sharing the same META.title — twin "显示板" boards in
the 220V power project — landed in the same project directory
and the second silently overwrote the first's .kicad_sch /
.kicad_pcb / .kicad_pro. Fixed --all to detect title collisions
and suffix every colliding basename with the BOARD uuid prefix
(so both '显示板' boards become '显示板_52e8cc76' and
'显示板_55d32906' rather than one quietly winning).
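Both fixes are small enough to sketch. The helper names are hypothetical
and the UUID tails in the usage note are made up; only the 8-character
prefix convention and the even-count rule come from the description
above.

```python
from collections import Counter

def padded_inner_count(used_inner):
    """Round an odd inner-layer count up so total copper stays even."""
    return used_inner + (used_inner % 2)

def distinct_basenames(boards):
    """boards: list of (title, board_uuid). Suffix every colliding title."""
    counts = Counter(title for title, _ in boards)
    return [
        f"{title}_{uuid[:8]}" if counts[title] > 1 else title
        for title, uuid in boards
    ]
```

One used inner layer becomes two (F.Cu + 2 inner + B.Cu = 4 copper), and
two boards titled 显示板 with UUIDs starting 52e8cc76 and 55d32906 come
out as 显示板_52e8cc76 and 显示板_55d32906.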
71 → 73 unit tests pass (test_odd_inner_signal_count_padded_to_even_total
+ test_duplicate_board_titles_get_distinct_basenames).
Tangentially noted while running this: Taishan Pi and Liangshan Pi are
Pro 2.x JSON, not EPRO2 streams — our replay layer reads the files but
doesn't decode docType, so SCH/PCB grouping returns nothing. Pro 2.x
needs a separate writer; out of scope for this commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Calibrated against ladder probes on 2026-04-29. Findings in
docs/sources/probe_rate_limit_results.md.
SLEEP_PRO 5.0 -> 0.5 (pro.lceda.cn API)
SLEEP_BETWEEN 2.0 -> 1.0 (oshwhub detail/listing)
SLEEP_SOURCE 5.0 unchanged (lceda.cn Std endpoints — not yet probed)
SLEEP_PRO_CDN 0.2 unchanged (modules.lceda.cn — already optimized)
The original 5s rate for Pro API was set out of caution because Pro
requires a logged-in cookie. Empirical sustained-burst probe (25
distinct UUIDs at 0.5s sleep, no recovery): 0/25 errors, median
latency 410ms, p90 932ms. The "Pro is rate-sensitive" assumption was
wrong — server tolerates QPS=2 cleanly.
oshwhub detail HTML pages slowed from p90 6.4s at 1.0s sleep to
p90 15s at 0.5s — server queue backs up. 1.0s is the headroom-safe
water mark.
Net effect on batch-50 estimate: ~1.5h -> ~30min.
scripts/probe_rate_limit.py: rate-limit ladder probe tool. Reusable
for new endpoints (Std source still owes a probe). Designed for safety:
30s tier recovery, low rep counts on auth hosts, bail on first non-200.
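The safety rules can be sketched as a small loop. This is an
illustration, not the contents of scripts/probe_rate_limit.py: `fetch` is
an injected callable returning an HTTP status code, and all names are
hypothetical.

```python
import time

def ladder_probe(uuids, tiers, fetch, sleep=time.sleep, recovery=30.0):
    """Probe each sleep tier in order; bail on the first non-200."""
    results = {}
    for i, tier in enumerate(tiers):
        statuses = []
        results[tier] = statuses
        for uuid in uuids:
            status = fetch(uuid)
            statuses.append(status)
            if status != 200:
                return results, False  # stop the whole ladder immediately
            sleep(tier)
        if i < len(tiers) - 1:
            sleep(recovery)  # let the server recover between tiers
    return results, True
```

Tiers would normally be ordered from slowest to fastest (for example
5/2/1/0.5s), so the probe stops before reaching rates the server rejects.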
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
KiCad pairs project files purely by basename + same directory: a folder
holding `Foo.kicad_pro`, `Foo.kicad_sch`, `Foo.kicad_pcb` opens as one
project on double-click of the .kicad_pro, with cross-tool navigation
(open footprint from schematic etc) wired up automatically.
- pro_writer.write_kicad_pro() renders the minimal KiCad 8 JSON we
need: meta.filename pinning the basename, sheets=[[<root_uuid>,
""]] binding the schematic root, and stub blocks for board /
schematic / net_settings / erc that KiCad expects to find on the
first GUI load.
- root_sch_writer.write_root_sheet() now accepts an optional
root_uuid so the caller can pass the same uuid into the .kicad_pro
and .kicad_sch (the binding fails silently with mismatched ids).
- CLI gains `--all`: groups SCH/PCB docs by their META.board uuid
(1:1 in EPRO2), strips SCH-/PCB- editor prefixes from titles to
derive a shared project basename, and emits one directory per
BOARD with paired files. BOARDs whose SCH is DELETE_DOC (LCD-BD on
ESP-VoCat) still get a .kicad_pro with sheets:[] + .kicad_pcb so
pcbnew opens cleanly.
ESP-VoCat smoke: 6 boards → 6 project dirs, all pairs validated by
kicad-cli sch erc / pcb export svg. The CoreBoard pro/sch/pcb trio
shares root uuid 366d3e53...c2fccbe4330b end-to-end.
68 → 71 unit tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase-1 left 75-358 unconnected_items per board (DRC), dominated by
GND/AGND/POWER nets that EPRO2 routes through copper pour, not discrete
traces. Phase-2 lands those:
- pcb_writer._decode_zone_path handles the three POUR.path encodings
seen in ESP-VoCat: rectangle (['R', x, y, w, h, ...]), circle
(['CIRCLE', cx, cy, r]) approximated as a 36-segment polygon, and
polyline (numeric pairs with 'L'/'ARC' verb tokens).
- Each POUR on a copper layer turns into a (zone (polygon ...) ...)
block plus a (filled_polygon ...) that mirrors the boundary.
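The circle case can be sketched as below: a regular 36-gon inscribed in
the circle. The function name and the choice of starting at angle 0 are
assumptions for illustration.

```python
import math

def circle_to_polygon(cx, cy, r, segments=36):
    """Approximate a circle as a regular polygon with `segments` vertices."""
    step = 2 * math.pi / segments
    return [
        (cx + r * math.cos(i * step), cy + r * math.sin(i * step))
        for i in range(segments)
    ]
```

Because the vertices lie on the circle, the polygon sits slightly inside
the true outline, which fits the "connectivity-correct,
clearance-imprecise" trade-off.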
Why mirror, not auto-fill: kicad-cli pcb drc does NOT run the zone
filler before checking — only the KiCad GUI does. Without a
pre-computed (filled_polygon ...), DRC sees zones as empty regions and
reports the entire net as unconnected. Mirroring the boundary as the
fill is "connectivity-correct, clearance-imprecise" — KiCad users can
still hit Edit > Fill Zones to refine thermals and pad clearances. We
chose this over reading EPRO2's POURED.pourFill (the editor's own
post-fill polygons) because POURED paths use ARC tokens we'd need to
fully decode, and the user-drawn POUR boundary is already the
authoritative "intended copper" region.
ESP-VoCat DRC totals: 883 → 730 unconnected_items (-17% project-wide).
CoreBoard, the 4-layer board with the most pour coverage, drops 358 →
205 (-43%). Other boards see no movement because their unconnected
items are non-pour issues — pads outside the user-drawn POUR
rectangle, or internal $1N nets via vias on the wrong net (separate
problem, separate fix).
65 → 68 unit tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two CLI gates needed before scaling Pro batch beyond top-5:
--skip-ext mp4,qt,mov (attachment filter)
Skips video extensions in attachment download. Phase 1 measurements
showed mp4+qt occupy ~54% of attachment storage. Entry still recorded
in metadata.json with skipped:ext:<token> so we can re-fetch later if
the policy changes. Honors both server-declared `ext` and filename
suffix, case-insensitively.
--max-source-mb N (Pro source size cap)
Trips inside the chain replay loop on encrypted-blob total. On trip:
raise ProjectOversizeError, wipe partial source/, append a row to
data/state/oshwhub_pro_oversize.jsonl. Lets us shortlist 50+ Pro
projects without one X86-board-class outlier (~500 MB) blowing the
LFS budget. Std and Pro 2.x legacy are not capped (both <2 MB in
sample).
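The --skip-ext matching described above might look like this sketch. The
attachment `name` key and both function names are assumptions; the `ext`
field and the `skipped:ext:<token>` marker come from the description.

```python
import os

def parse_skip_ext(raw):
    """Parse 'mp4,qt,mov' into a lowercase set, ignoring empty tokens."""
    tokens = (tok.strip().lower().lstrip(".") for tok in raw.split(","))
    return {tok for tok in tokens if tok}

def skip_reason(attachment, skip_exts):
    """Return 'skipped:ext:<token>' if the attachment should be skipped."""
    declared = (attachment.get("ext") or "").lower().lstrip(".")
    # Hypothetical "name" key holds the attachment filename.
    suffix = os.path.splitext(attachment.get("name") or "")[1].lower().lstrip(".")
    for token in (declared, suffix):
        if token and token in skip_exts:
            return f"skipped:ext:{token}"
    return None
```

The server-declared `ext` is checked first and the filename suffix acts
as the fallback.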
Verified:
- cap=0 trips on first blob (1.2 MB), source/ wiped, state recorded
- cap=100 runs full ESP-VoCat (7.5 MB plain, 278 docs)
- skip-ext microtest: 8/8 cases (case-insensitive, declared/suffix
fallback, empty-token edge cases)
Plan + frozen candidate list for the next 50 projects:
- docs/plans/oshwhub_batch50.md
- data/state/oshwhub_batch50_candidates.jsonl (gitignore exception added)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase-1 scope: produce a .kicad_pcb that kicad-cli loads cleanly and
that has the right geometry (nets, footprints, tracks, vias, board
outline) — not a 1:1 EDA round-trip. Skipped on purpose for Phase 2:
copper pours (POUR/POURED), manual FILL, teardrops, board-level
strings/images, ARC circle-center recovery.
What lands:
- pcb_writer.write_pcb(): header/general, data-driven layer table
(F.Cu = ord 0; B.Cu = ord 31; SIGNAL inner ids 15+ allocated to
In1.Cu/In2.Cu/... in EPRO2-id sorted order so used inner layers
stay contiguous), net-name → integer id map (id 0 reserved for the
empty net per KiCad convention), LINE→segment / LINE→gr_line on
Edge.Cuts, layer-11 POLY paths walked into Edge.Cuts gr_line chains
(the actual board outline lives on POLY here, not LINE — without
this stats showed edge=0), VIA→via.
- footprint_writer.write_footprint_placement(): inline (footprint ...)
blocks per PCB COMPONENT. EPRO2 RECT/ELLIPSE/OVAL/POLYGON pad
shapes mapped to KiCad rect/circle/oval/custom; SMD vs THT detected
by PAD.hole presence; SLOT holes use (drill oval w h). Pad nets
resolved cross-doc via the existing PCB.PAD_NET → footprint.pad
chain in ProjectRelations. layerId=2 component → (layer B.Cu) +
text on B.SilkS so bottom-side parts render correctly.
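Two of the tables above are easy to sketch: the inner-layer allocation
and the net-id map. Function names are hypothetical; the rules (SIGNAL
ids 15+ mapped in sorted order, id 0 reserved for the empty net) come
from the description.

```python
def allocate_inner_layers(used_layer_ids):
    """Map used EPRO2 SIGNAL ids (15+) to In1.Cu, In2.Cu, ... in sorted order."""
    inner = sorted(i for i in set(used_layer_ids) if i >= 15)
    return {layer_id: f"In{n}.Cu" for n, layer_id in enumerate(inner, start=1)}

def build_net_ids(net_names):
    """Assign integer ids in first-seen order; id 0 is the empty net."""
    ids = {"": 0}
    for name in net_names:
        if name not in ids:
            ids[name] = len(ids)
    return ids
```

Gaps in the EPRO2 ids collapse, so a board using SIGNAL layers 15 and 17
still gets contiguous In1.Cu and In2.Cu.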
Smoke test on ESP-VoCat (6 PCBs): all 6 pass `kicad-cli pcb export svg`
and render. DRC on smallest (MicBoard) reports 145 violations + 75
unconnected — most of the unconnected are GND nets that the EPRO2
source resolves through POUR copper, which Phase 2 will export.
CLI: `python -m tools.epro2.kicad <project> --all-pcb --out <dir>`
emits one .kicad_pcb per PCB doc.
52 → 65 unit tests pass. Float comparisons in tests use math.isclose
because the s-expr 6-decimal trim doesn't preserve strict equality
through `value * MIL_TO_MM` round-trips.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Colleague-facing explainer at docs/sources/pro_crawl_vs_export.md.
Addresses the "I see 278 .epro2 files but my browser only downloaded
one" confusion: web download is a ZIP container (extension is a UX
choice, not a format), our crawl produces per-doc message streams.
Both carry equivalent EPRO2 data; only real gap is IMAGE/ binary
previews which we don't fetch yet.
Why per-doc and not ZIP: the ZIP path has no public endpoint —
three HARs confirm the export button fires zero HTTP requests, it's
pure client-side JSZip on data already loaded by the editor. Our
crawler hits the same chain endpoints the editor uses internally,
which delivers per-doc streams.
Log entry references the 278 vs 266 doc-count delta for ESP-VoCat
(we walk full history chain, web export is a current snapshot).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Pro modern fetch_pro_modern walks a per-history blob loop on
modules.lceda.cn (CDN-flavored host serving AES-encrypted EPRO2 streams).
We were sleeping 5s between every blob — same rate we use for the
rate-sensitive pro.lceda.cn API host. HAR analysis (proexportNew2.har)
shows the editor fires these blobs back-to-back without throttling, so
0.2s is plenty.
Walltime drops linearly with chain length:
ESP-VoCat (chain=12): 80s sleep -> 22s sleep (-72%)
220V power (chain=28): 160s sleep -> 26s sleep (-84%)
X86 board (chain~700, projection): ~1h -> ~3min
Verified by re-fetching ESP-VoCat + 220V power: byte-identical output
across all per-doc .epro2 files (sha256 match), only fetched_at
timestamp differs in manifest.json. Two manifest files re-stamped as
proof of the validation runs.
API host sleeps (4x 5s in modern fetcher, 7x 5s in legacy fetcher) are
unchanged — those go to pro.lceda.cn /api/ which still wants polite
QPS<=0.2.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three coupled changes so kicad-cli sch erc runs at the project level
(across all sheets of one schematic) instead of single-sheet:
1. (label) → (global_label (shape passive)). EPRO2 nets are
project-global by construction (named rails span every page in the
SCH and physically wire across PCBs); KiCad's local label is sheet-
scoped and triggers `label_dangling` for any name not duplicated on
the same page.
2. New root_sch_writer that groups SCH_PAGE docs by their parent SCH
(META.schematic), emits one root .kicad_sch per group with one
(sheet ...) entry per child, and threads the root-assigned uuid back
into each child's (sheet_instances) so KiCad can bind them.
--all-sch now defaults to this; --flat falls back to one-file-per-page.
3. EPRO2's "5-Voltage" placeholder COMPONENT (partId
pid8a0e77bacb214e, 365 instances on ESP-VoCat) is the editor's power
port. The rail name lives in the placement's `Global Net Name` ATTR,
not in the PART. We now emit a (global_label "<rail>") at the
placement coords whenever that attr is set (101/365 of them on
ESP-VoCat — the rest are unconfigured drafts).
ESP-VoCat 5 hierarchical roots: 2325 → 2265 violations. Modest because
5 of 6 SCHs are single-page (no cross-sheet nets to resolve), and the
one 4-page schematic (CoreBoard) shares only a handful of names across
sheets — most net names are de-facto sheet-local. The remaining ~190
pin_not_connected are dominated by 0402-style passives whose pin tip
lies on a wire's interior, not at an endpoint; KiCad needs an explicit
(junction) at those points and we don't yet emit one. Marked as the
next follow-up in log.md.
47 → 52 unit tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bisect found two semantics mismatches between EPRO2 and KiCad that cause
the 850 real-connectivity ERC violations on the ESP-VoCat ref project:
1. sym_writer was emitting lib coords without negating Y, but KiCad lib
uses Y-up and re-flips Y on placement (Y-down schematic). So vertically
arranged pins ended up at Y-mirrored absolute positions and wires that
reach the geometric pin tip in EPRO2 missed the rendered pin tip in
KiCad. Fix: lib_y = -epro2_y, lib_rot = (360 - rot) % 360 for pin/text.
2. sch_writer was treating each LINE as an isolated wire — but EPRO2
binds segments into nets by NAME (WIRE.NET attr), not just geometry.
Multi-segment nets like GND/VBUS show up as N disconnected stubs to
KiCad. Fix: per-LINE, look up lineGroup → WIRE → NET attr and emit a
`(label "<NET>")` at the LINE's start. Same-named labels on distinct
physical wires are how KiCad's ERC recognizes a multi-segment net.
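The coordinate fix from item 1 reduces to a one-line transform, sketched
here with a hypothetical function name:

```python
def to_kicad_lib(x, y, rot):
    """Convert an EPRO2 pin/text position to KiCad lib space (Y-up)."""
    # lib_y = -epro2_y, lib_rot = (360 - rot) % 360, as in the fix above.
    return x, -y, (360 - rot) % 360
```

For example, a pin at (10, 20) rotated 90 degrees maps to (10, -20) at
270 degrees, and rotation 0 stays 0.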
ESP-VoCat 9 sheets:
wire_dangling 444 → 52 (-88%)
pin_not_connected 406 → 196 (-52%)
real connectivity total 850 → 248 (-71%)
Why we did NOT round to grid (the obvious-looking fix): EPRO2 places
some pins on a 10-mil pitch (e.g. magnetic socket); rounding to KiCad's
default 50-mil ERC grid would collapse those pins. The 248 residual is
fundamentally cross-sheet — single-sheet ERC can't see a net's other
endpoints on sibling sheets — and is a Phase-3 (hierarchical sheet)
problem, not a per-sheet one.
41 → 46 unit tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous commit added the dump script + report but the actual jsonl was caught
by data/state/* gitignore. Add a targeted exception so the snapshot travels with
the repo — anyone who clones can do local filtering without re-hitting the API.
The data is regenerable (scripts/dump_listing_index.py is one-shot, ~1 min), but
pinning a dated snapshot lets us reason about "the state of the corpus on
2026-04-28" reproducibly. Future re-dumps overwrite the same path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Probed listing API and learned: total field is exposed (Pro=21,202 / Std=12,493),
pageSize accepts values of at least 1000 (full corpus = 35 requests / 71s), sort
param is silently ignored. Dump all listings via scripts/dump_listing_index.py to
local jsonl so downstream batch-selection no longer hits the API.
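The request count follows from the totals, assuming each origin is paged
separately at pageSize=1000 (the helper name is hypothetical):

```python
import math

LISTING_TOTALS = {"Pro": 21_202, "Std": 12_493}  # totals from the probe

def pages_needed(total, page_size=1000):
    """Number of listing requests needed to cover `total` entries."""
    return math.ceil(total / page_size)

requests = {origin: pages_needed(n) for origin, n in LISTING_TOTALS.items()}
total_requests = sum(requests.values())
```

That is 22 requests for Pro plus 13 for Std, 35 in total, which at 71s
works out to about 2s per request.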
Why: needed quantitative anchors before scaling Pro batch beyond top-5. License
is detail-page only (~19h serial scan), so we want to filter on grade/like
*locally* first to shortlist before paying that cost. Quality-tier counts now
known: A-tier (grade>=3 & like>=10) = 2,806 across both origins.
- scripts/dump_listing_index.py: one-shot scraper, polite QPS, streams to jsonl
- docs/sources/oshwhub_listing_full.md: human-readable report with growth
trends, quality tiers, owner concentration, and storage-budget anchors
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>