飞控-77: 77 std flight-controller projects ingested

Topic-targeted pull from the local listing index (`name OR introduction`
contains 飞控, i.e. "flight controller"). 79 std hits in
oshwhub_listing_full.jsonl: 2 already crawled, 77 newly fetched.
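
The selection rule above can be sketched as follows. This is only an illustration, assuming each line of oshwhub_listing_full.jsonl is a JSON object with string-or-null `name` and `introduction` fields. The function name `filter_listing` and the sample records are hypothetical, and the std-edition filter is left out because its field is not named here.

```python
import json
from typing import Iterable, Iterator


def filter_listing(lines: Iterable[str], keyword: str) -> Iterator[dict]:
    """Yield listing records whose name OR introduction contains keyword."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL input
        item = json.loads(line)
        # Either field may be missing or null, so fall back to "".
        name = item.get("name") or ""
        intro = item.get("introduction") or ""
        if keyword in name or keyword in intro:
            yield item


# Hypothetical sample records, not taken from the real listing.
sample = [
    json.dumps({"name": "F4 飞控", "introduction": "STM32F405 board"}, ensure_ascii=False),
    json.dumps({"name": "LED driver", "introduction": None}),
    json.dumps({"name": "Quad frame", "introduction": "配套飞控 PCB"}, ensure_ascii=False),
]
hits = list(filter_listing(sample, "飞控"))
print([h["name"] for h in hits])
```

Against the real file the same function would take an open file handle, since iterating a text file yields one line at a time.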

dev1 (Guangzhou) walltime:
  Step 1 detail scrape ~12s, Step 4 std-source backfill ~80s
  (concurrency=5)
Source completeness: 73/77 have editor source. The other 4 are
attachments-only upstream: no editor session was ever attached, so
source_documents=[] is genuine (the SSR page has no editor_version
either).

Crawler hardening (crawlers/oshwhub/crawler.py):
- count.{like,star,fork,views} are now read defensively via `.get(..., 0)`.
  The listing API omits zero-valued fields for some low-activity entries
  (3/77 hit this on the first pass and hard-failed with KeyError 'like').
  Affects rank_score, pick_top, and metadata.json metrics block.
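
A minimal sketch of the defensive read, mirroring the patched rank_score from this commit. The three sample items are hypothetical and exist only to work the arithmetic through for a complete record, a record with omitted count fields, and an empty record.

```python
def rank_score(item: dict) -> float:
    """Composite quality score with zero defaults for omitted fields."""
    # `or {}` also covers a `count` key that is absent or null.
    c = item.get("count") or {}
    return (
        c.get("like", 0) * 3
        + c.get("star", 0) * 1
        + c.get("fork", 0) * 2
        + c.get("views", 0) / 100
        + (item.get("comments_count") or 0) * 2
        + (item.get("grade") or 0) * 50
    )


# Hypothetical listing items.
full = {
    "count": {"like": 4, "star": 2, "fork": 1, "views": 300},
    "comments_count": 3,
    "grade": 1,
}
sparse = {"count": {"star": 2, "views": 300}, "comments_count": None}
empty = {}

print(rank_score(full))    # 12 + 2 + 2 + 3.0 + 6 + 50 = 75.0
print(rank_score(sparse))  # 0 + 2 + 0 + 3.0 + 0 + 0 = 5.0
print(rank_score(empty))   # 0.0

# The pre-patch subscript access fails on the sparse item.
try:
    sparse["count"]["like"]
except KeyError as exc:
    print("KeyError", exc)
```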

License mix: 65% GPL 3.0, 11% Public Domain, 11% MIT, ~6% CC variants.

Transport: dev1 → SG via tar+scp (33 MB, ~3 min over lossy
cross-region link). Bypassed gitea push from dev1 because the same
6.5%-loss link tanks single-stream throughput.
2026-04-30 19:04:58 +08:00
parent c199840ad3
commit 29530e09d2
20 changed files with 442 additions and 19 deletions

@@ -153,13 +153,15 @@ def list_projects(
 def rank_score(item: dict) -> float:
     """Composite quality score: favor projects with broad engagement."""
-    c = item["count"]
+    # Listing API can omit zero-valued count fields (observed: low-activity
+    # projects miss `like`, possibly others). Use .get with 0 default.
+    c = item.get("count") or {}
     return (
-        c["like"] * 3
-        + c["star"] * 1
-        + c["fork"] * 2
-        + c["views"] / 100
-        + item["comments_count"] * 2
+        c.get("like", 0) * 3
+        + c.get("star", 0) * 1
+        + c.get("fork", 0) * 2
+        + c.get("views", 0) / 100
+        + (item.get("comments_count") or 0) * 2
         + (item.get("grade") or 0) * 50
     )
@@ -175,7 +177,7 @@ def pick_top(
     for it in items:
         if exclude_copies and "_copy" in it["path"]:
             continue
-        if it["count"]["like"] < min_likes:
+        if (it.get("count") or {}).get("like", 0) < min_likes:
             continue
         if (it.get("grade") or 0) < min_grade:
             continue
@@ -1095,11 +1097,11 @@ def crawl_one(
         "published_at": list_item.get("oshwhub_publish_at"),
         "crawled_at": datetime.now(timezone.utc).isoformat(),
         "metrics": {
-            "likes": list_item["count"]["like"],
-            "stars": list_item["count"]["star"],
-            "forks": list_item["count"]["fork"],
-            "views": list_item["count"]["views"],
-            "watch": list_item["count"].get("watch", 0),
+            "likes": (list_item.get("count") or {}).get("like", 0),
+            "stars": (list_item.get("count") or {}).get("star", 0),
+            "forks": (list_item.get("count") or {}).get("fork", 0),
+            "views": (list_item.get("count") or {}).get("views", 0),
+            "watch": (list_item.get("count") or {}).get("watch", 0),
             "comments": list_item.get("comments_count", 0),
         },
         "cover": {"url": thumb_url, "path": cover_rel} if thumb_url else None,
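
The metrics block repeats the same guarded lookup five times. One possible equivalent form, shown only as a sketch and not as the code in this commit, hoists the lookup into a local variable. The function name `build_metrics` and the sample item are hypothetical.

```python
def build_metrics(list_item: dict) -> dict:
    """Build the metrics block with a single guarded `count` lookup."""
    # Evaluate the guard once; an absent or null `count` becomes {}.
    counts = list_item.get("count") or {}
    return {
        "likes": counts.get("like", 0),
        "stars": counts.get("star", 0),
        "forks": counts.get("fork", 0),
        "views": counts.get("views", 0),
        "watch": counts.get("watch", 0),
        # Kept as in the diff: defaults only when the key is absent.
        "comments": list_item.get("comments_count", 0),
    }


# Hypothetical low-activity item with most count fields omitted.
metrics = build_metrics({"count": {"star": 1}})
print(metrics)
```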