crawler: --skip-ext + --max-source-mb gates for batch-50 expansion
Two CLI gates needed before scaling Pro batch beyond top-5:
--skip-ext mp4,qt,mov (attachment filter)
Skips video extensions in attachment download. Phase 1 measurements
showed mp4+qt occupy ~54% of attachment storage. Entry still recorded
in metadata.json with skipped:ext:<token> so we can re-fetch later if
the policy changes. Honors both server-declared `ext` and filename
suffix, case-insensitively.
--max-source-mb N (Pro source size cap)
Trips inside the chain replay loop on encrypted-blob total. On trip:
raise ProjectOversizeError, wipe partial source/, append a row to
data/state/oshwhub_pro_oversize.jsonl. Lets us shortlist 50+ Pro
projects without one X86-board-class outlier (~500 MB) blowing the
LFS budget. Std and Pro 2.x legacy are not capped (both <2 MB in
sample).
Verified:
- cap=0 trips on first blob (1.2 MB), source/ wiped, state recorded
- cap=100 runs full ESP-VoCat (7.5 MB plain, 278 docs)
- skip-ext microtest: 8/8 cases (case-insensitive, declared/suffix
fallback, empty-token edge cases)
Plan + frozen candidate list for the next 50 projects:
- docs/plans/oshwhub_batch50.md
- data/state/oshwhub_batch50_candidates.jsonl (gitignore exception added)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -2,7 +2,7 @@
|
||||
"project_uuid": "ba64bd6f1c9c467ba3b674a54943557d",
|
||||
"branch_uuid": "ef5f58bd0f1245b0a808c07e541a1b5c",
|
||||
"head_uuid": "764dd8b722a44914a915493277e204c9",
|
||||
"fetched_at": "2026-04-28T16:06:55.434479+00:00",
|
||||
"fetched_at": "2026-04-28T16:23:06.697278+00:00",
|
||||
"editor_version": "3.2.91",
|
||||
"chain_length": 12,
|
||||
"blob_bytes_total": 1195716,
|
||||
|
||||
Reference in New Issue
Block a user