log: 飞控-77 (flight-controller) batch summary
39
log.md
@@ -4,6 +4,45 @@
---

## 2026-04-30 19:10 飞控-77: topic-targeted crawl of 77 std flight-controller (飞控) boards

**Claude session**

Ran the full pipeline: filter the local index → crawl on dev1 → tar + scp back to SG → push to gitea.

### Candidate filtering

- Data source: `data/state/oshwhub_listing_full.jsonl` (33,695 entries)
- Filter: `origin=std AND ('飞控' in name OR '飞控' in introduction)` → 79 hits
- Minus the 2 entries already crawled → 77 new candidates
- Tooling: a throwaway script; the candidate jsonl is written on dev1 to `data/state/oshwhub_feikong_candidates.jsonl` (not committed to git; it can be recomputed)
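
The filter above can be sketched in a few lines of Python. This is an illustration, not the throwaway script itself: the field names `origin`, `name`, and `introduction` come from the filter expression in the log, while the `uuid` id field and the function names are assumptions made for the example.

```python
import json
from typing import Iterable, Iterator

KEYWORD = "飞控"  # topic keyword from the log ("flight controller")


def is_candidate(entry: dict) -> bool:
    """origin=std AND (keyword in name OR keyword in introduction)."""
    if entry.get("origin") != "std":
        return False
    # Treat missing or null text fields as empty strings.
    name = entry.get("name") or ""
    intro = entry.get("introduction") or ""
    return KEYWORD in name or KEYWORD in intro


def select_candidates(lines: Iterable[str], already_crawled: set[str]) -> Iterator[dict]:
    """Yield matching listing entries that have not been crawled yet."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the jsonl
        entry = json.loads(line)
        # "uuid" is a hypothetical id field; substitute the listing's real key.
        if is_candidate(entry) and entry.get("uuid") not in already_crawled:
            yield entry
```

Because `select_candidates` accepts any iterable of lines, it can be fed an open file handle for the listing jsonl directly, and the result can be recomputed at any time, which matches the decision to keep the candidate file out of git.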

### Crawl (dev1 Guangzhou, concurrency=5)

- Step 1, detail scan for license: ~12s, 74/77 OK + 3 failures
  - All 3 failures are the same bug: the listing entry's `count` dict lacks the `like` field, and the crawler indexes `count["like"]` directly, which raises KeyError
  - Fix: `rank_score` / `pick_top` / the metadata builder now all use the `count.get("like", 0)` form (commit `29530e0`)
  - Re-crawled the 3 entries → all OK
- Step 4, std-source backfill: ~80s, 73/77 pulled their source project documents (4 entries are attachments-only upstream, with no editor session, so `source_documents=[]` is the real state)
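
The KeyError fix amounts to replacing direct indexing with a defaulted lookup. The sketch below shows that pattern only; the bodies of `rank_score` and `pick_top` here are hypothetical stand-ins (the log does not show the real scoring logic), and the sample entries are invented.

```python
def rank_score(entry: dict) -> int:
    """Hypothetical score: just the like count, defaulting to 0 when absent."""
    # `count` itself may be missing or null, so fall back to an empty dict.
    count = entry.get("count") or {}
    # Before the fix this was count["like"], which raises KeyError
    # for listing entries that have no `like` field.
    return count.get("like", 0)


def pick_top(entries: list[dict], n: int) -> list[dict]:
    """Return the n highest-scoring entries (stable for ties)."""
    return sorted(entries, key=rank_score, reverse=True)[:n]
```

Defaulting to 0 treats a missing `like` field as "no likes", so such entries sort last instead of aborting the whole detail scan.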
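
### Transfer: tar + scp instead of pushing to gitea from dev1

- dev1 → SG goes over the same link with 6.5% packet loss, which flattens the cwnd of a single TCP connection
- The 33 MB tarball took ~3 min over scp (the same order of magnitude as the earlier dev1 push to gitea)
- Once on SG, pushed to gitea directly from SG (same region, low latency); this finished in seconds
- Rebase: someone had manually pushed a 74-entry commit (`c199840`) from the dev1 side; the local 77-entry superset was rebased onto it, and the only conflicts were in projects.md (resolved by regenerating it)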
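
As a rough sanity check on these numbers, the observed scp rate can be compared with the well-known Mathis approximation for loss-limited TCP throughput, roughly `(MSS / RTT) * sqrt(3/2) / sqrt(p)`. The 33 MB, ~3 min, and 6.5% figures come from the log; the MSS of 1460 bytes and the 50 ms RTT are assumptions, since the log does not state them.

```python
import math


def observed_throughput_mb_s(size_mb: float, seconds: float) -> float:
    """Average transfer rate in MB/s."""
    return size_mb / seconds


def mathis_bound_bytes_s(mss_bytes: float, rtt_s: float, loss: float) -> float:
    """Mathis et al. approximation of the steady-state TCP rate under random loss."""
    return (mss_bytes / rtt_s) * math.sqrt(1.5) / math.sqrt(loss)


# Figures from the log: 33 MB in about 3 minutes, 6.5% loss.
observed = observed_throughput_mb_s(33, 3 * 60)

# ASSUMED values, not from the log: 1460-byte MSS, 50 ms round-trip time.
bound = mathis_bound_bytes_s(mss_bytes=1460, rtt_s=0.050, loss=0.065)

print(f"observed: {observed:.2f} MB/s")
print(f"single-stream estimate: {bound / 1e6:.2f} MB/s")
```

Under those assumed values the estimate comes out at roughly 0.14 MB/s, the same ballpark as the observed ~0.18 MB/s, which is consistent with a single connection being limited by loss rather than by link capacity. A different real RTT would shift the estimate proportionally.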

### Completion status

- 79/79 std flight-controller projects have metadata
- 73 entries have complete std source projects
- 4 entries are genuinely attachments-only (the upstream API returns empty)
- License distribution: 65% GPL 3.0, 11% PD, 11% MIT, ~6% CC variants (same shape as batch-50)
- Corpus grew from 65 to 142 entries (+77)
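
The counts in this entry can be cross-checked with a few lines of arithmetic. Every number below is taken from the log; only the variable names are made up for the example.

```python
hits = 79               # std entries matching the keyword filter
already_crawled = 2     # crawled before this batch
new_candidates = hits - already_crawled

with_source = 73        # entries with complete std source projects
attachments_only = 4    # entries where upstream returns no source documents

corpus_before = 65
corpus_after = corpus_before + new_candidates

# Shares listed in the log, in percent (the CC figure is approximate).
license_pct = {"GPL 3.0": 65, "PD": 11, "MIT": 11, "CC variants": 6}
not_itemized_pct = 100 - sum(license_pct.values())

assert new_candidates == 77
assert with_source + attachments_only == new_candidates
assert corpus_after == 142
```

The listed license shares add up to about 93%, so roughly 7% of entries fall under licenses the log does not itemize.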
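
### Suggested next steps

- Cross-region transfer optimization: tencent-cloud COS cross-region replication within the same cloud rides the backbone network and is several times faster than scp; set it up before the next large batch. Alternatively, split + parallel scp could also yield a 3-5x speedup.
- Clean up the two .decrypted.txt files in the stash (pre-existing debugging leftovers)
- Could try another round on Pro flight controllers (93 hits, origin=pro)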
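
The split + parallel scp idea can be sketched as follows. It is an untested outline rather than a tool from this repo: the function names, chunk naming scheme, worker count, and the remote target are all hypothetical, and it assumes `scp` is on the PATH with key-based authentication already configured. Each chunk travels over its own TCP connection, so every stream gets its own congestion window.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def split_file(src: Path, out_dir: Path, parts: int) -> list[Path]:
    """Split src into at most `parts` chunks with zero-padded, sortable names."""
    size = src.stat().st_size
    chunk_size = max(1, -(-size // parts))  # ceiling division
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks: list[Path] = []
    with src.open("rb") as f:
        index = 0
        while True:
            buf = f.read(chunk_size)
            if not buf:
                break
            path = out_dir / f"{src.name}.part{index:03d}"
            path.write_bytes(buf)
            chunks.append(path)
            index += 1
    return chunks


def scp_command(chunk: Path, remote: str) -> list[str]:
    """Build the scp argv for one chunk; `remote` looks like user@host:/dir/."""
    return ["scp", "-q", str(chunk), remote]


def push_parallel(chunks: list[Path], remote: str, workers: int = 4) -> None:
    """Run one scp per chunk, `workers` at a time; raises if any copy fails."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(subprocess.run, scp_command(c, remote), check=True)
            for c in chunks
        ]
        for fut in futures:
            fut.result()
```

On the receiving side the chunks can be reassembled in name order, for example with `cat batch.tar.part* > batch.tar`, and verified against a checksum computed before the split.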

---

## 2026-04-29 04:30 std/ writer switched to Option 2: raw objects dump + mapping doc

**Claude session**