oshwhub: dump full listing index (33,695 projects) for batch sizing

Probed listing API and learned: total field is exposed (Pro=21,202 / Std=12,493), pageSize accepts >=1000 (full corpus = 35 requests / 71s), sort param is silently ignored. Dump all listings via scripts/dump_listing_index.py to local jsonl so downstream batch-selection no longer hits the API.

Why: needed quantitative anchors before scaling Pro batch beyond top-5. License is detail-page only (~19h serial scan), so we want to filter on grade/like *locally* first to shortlist before paying that cost. Quality-tier counts now known: A-tier (grade>=3 & like>=10) = 2,806 across both origins.

- scripts/dump_listing_index.py: one-shot scraper, polite QPS, streams to jsonl
- docs/sources/oshwhub_listing_full.md: human-readable report with growth trends, quality tiers, owner concentration, and storage-budget anchors

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
41
log.md
@@ -4,6 +4,47 @@
---

## 2026-04-28 23:30 oshwhub full listing index dumped locally: 33,695 projects / 28.4 MB

**Claude session**

To quantify the size and quality distribution of the candidate pool before scaling up to top-30 / top-50 / the full corpus, this session swept the entire oshwhub listing API and saved the result locally.
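
### Key takeaways (previously assumed to be a black box)

- **The listing API returns a `total` field directly**: Pro 21,202 / Std 12,493, **33,695 combined**.
- **`pageSize` has no observed upper limit**: 1000 works fine in testing; the full index takes 35 requests / 71 seconds / 35 MB of traffic.
- **The `sort` parameter is silently ignored by the server**: whatever is passed, the order is the same (grade desc → an implicit quality score desc). Sorting by time requires pulling the full set first and sorting locally.
- **`origin` defaults to std**: without the parameter, the Pro pool is never visible.
- **`license` is not in the listing response**: each detail page must be fetched individually (QPS=0.5 → ~19 hours for the full corpus).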
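The paging behaviour and the request and scan-time figures above can be sketched as follows. This is a minimal illustration, not the contents of `scripts/dump_listing_index.py`: `fetch_page` is a hypothetical injected callable, and the `items` key in its response is an assumption (only `total` is confirmed by the probe).

```python
import json
import math
import time
from typing import Callable, Dict, Iterable, TextIO


def requests_needed(total: int, page_size: int) -> int:
    """Number of pages required to cover `total` records."""
    return math.ceil(total / page_size)


def serial_scan_hours(n_pages: int, qps: float) -> float:
    """Wall-clock hours to fetch `n_pages` one at a time at `qps` requests/s."""
    return n_pages / qps / 3600


def dump_listing(
    fetch_page: Callable[[str, int, int], Dict],  # hypothetical: (origin, page, page_size) -> response
    origins: Iterable[str],
    page_size: int,
    out: TextIO,
    delay_s: float = 0.0,
) -> int:
    """Stream every listing record to `out` as JSON lines; return the record count."""
    written = 0
    for origin in origins:  # origin must be passed explicitly, since the default hides Pro
        page = 1
        while True:
            resp = fetch_page(origin, page, page_size)
            for item in resp["items"]:  # "items" is an assumed key name
                out.write(json.dumps({**item, "origin": origin}, ensure_ascii=False) + "\n")
                written += 1
            if page * page_size >= resp["total"]:
                break
            page += 1
            time.sleep(delay_s)  # polite pause between requests
    return written


# Figures from the log: 22 Pro pages + 13 Std pages at pageSize=1000.
listing_requests = requests_needed(21_202, 1000) + requests_needed(12_493, 1000)
license_scan_hours = serial_scan_hours(33_695, qps=0.5)
```

With these inputs `listing_requests` is 35 and `license_scan_hours` is about 18.7, matching the 35 requests and ~19 hours quoted above.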
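
### Data profile (written to `docs/sources/oshwhub_listing_full.md`)

- **Pro has a very heavy long tail**: grade=1 accounts for 82%; true A-tier (grade≥3 & like≥10) is only 1,356 (6.4%)
- **Std has a higher share of high-quality projects**: A-tier is 1,450 (11.6%), because the platform is older (since 2016, vs. 2021 for Pro) and projects have had time to accumulate likes
- **Std has stalled**: peaked in 2021-2022 (3.4k/year), then fell off a cliff (1.5k → 0.9k → 0.4k → 0.05k in 2026 Q1)
- **Pro is still growing fast**: linear growth since 2023, 7.4k for all of 2025, already 1.1k in 2026 Q1
- **The author long tail is healthy**: 10,536 unique authors on Pro / 5,531 on Std; the top-1 author's share is 0.4% / 1.5%
- **Official Lichuang (立创) accounts hold the top spots** (course-examples / li-chuang-kai-fa-ban / li-chuang-zhi-neng-ying-jian-bu)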
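The A-tier rule can be applied to the local jsonl without touching the API. A minimal sketch, assuming each record carries `grade`, `like`, and `origin` fields under exactly those names (the real field names in the dump may differ):

```python
import json
from typing import Dict, Iterable, Tuple


def is_a_tier(rec: dict) -> bool:
    """A-tier as defined in the log: grade >= 3 and like >= 10."""
    return rec.get("grade", 0) >= 3 and rec.get("like", 0) >= 10


def tier_counts(lines: Iterable[str]) -> Dict[str, Tuple[int, int]]:
    """Map each origin to (total records, A-tier records)."""
    counts: Dict[str, Tuple[int, int]] = {}
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines in the dump
            continue
        rec = json.loads(line)
        origin = rec.get("origin", "unknown")
        total, a_tier = counts.get(origin, (0, 0))
        counts[origin] = (total + 1, a_tier + int(is_a_tier(rec)))
    return counts


def share_pct(part: int, whole: int) -> float:
    """Percentage share rounded to one decimal place."""
    return round(100 * part / whole, 1)
```

For the counts reported above, `share_pct(1_356, 21_202)` gives 6.4 and `share_pct(1_450, 12_493)` gives 11.6.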
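
### Practical implications

The scale-up decision now has quantitative anchors: S-tier 583 / A-tier 2,806 / B-tier 6,243 / full corpus 33,695. Extrapolating Pro project-source size from the mean of 5 measured projects, the full Pro pool comes to about 1 TB, which is beyond the comfort zone of Gitea LFS, so a size cap plus a license whitelist is required.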
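The extrapolation and the size cap amount to simple arithmetic. The sketch below is illustrative only: the five measured sizes are not recorded in this entry, so no sample values are filled in, `size_bytes` is a hypothetical field name, and 1 TB is taken as 10^12 bytes.

```python
from statistics import mean
from typing import Dict, List, Sequence


def extrapolate_total_bytes(sample_sizes: Sequence[float], population: int) -> float:
    """Scale the sample mean up to the whole population."""
    return mean(sample_sizes) * population


def within_cap(projects: List[Dict], cap_bytes: int) -> List[Dict]:
    """Keep only projects whose (hypothetical) `size_bytes` fits under the cap."""
    return [p for p in projects if p["size_bytes"] <= cap_bytes]


# Working backwards from the log's figure: ~1 TB across 21,202 Pro projects
# implies a per-project mean in the tens of megabytes.
implied_mean_mb = 1e12 / 21_202 / 1e6
```

`implied_mean_mb` comes out at roughly 47 MB per project. With only five measurements behind the mean, that figure should be read as an order-of-magnitude estimate.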

### Next steps

1. Filter the local jsonl to A-tier and scan the license detail pages (one-off, ~7 hours)
2. License whitelist ∩ A-tier → the real candidate list
3. Only then decide on bulk-downloading sources
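Step 2 is a plain intersection once the license scan has produced a per-project mapping. A minimal sketch, assuming each record has an `id` key and that scan results are stored as an `id -> license` dict. Both are assumptions, and the license names in the usage example are placeholders rather than the actual whitelist:

```python
from typing import Dict, Iterable, List, Set


def shortlist(
    a_tier: Iterable[Dict],
    licenses: Dict[str, str],  # hypothetical scan output: project id -> license string
    allowlist: Set[str],
) -> List[Dict]:
    """Return A-tier records whose scanned license is on the allowlist.

    Projects with no scanned license are excluded rather than assumed permissive.
    """
    return [rec for rec in a_tier if licenses.get(rec["id"]) in allowlist]


# Usage with placeholder data.
candidates = shortlist(
    a_tier=[{"id": "p1"}, {"id": "p2"}, {"id": "p3"}],
    licenses={"p1": "LICENSE-A", "p2": "LICENSE-X"},
    allowlist={"LICENSE-A", "LICENSE-B"},
)
```

Here `candidates` contains only `p1`: `p2` has a license outside the allowlist and `p3` was never scanned.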

### Files

- `scripts/dump_listing_index.py`: one-off full-scan script, re-runnable
- `data/state/oshwhub_listing_full.jsonl`: 28.4 MB, gitignored (rebuildable, not committed)
- `docs/sources/oshwhub_listing_full.md`: human-readable brief

---

## 2026-04-28 22:00 Pro 2.x legacy project-source fetch pipeline working end to end, 5/5 Pro projects all ✅

**Claude session**