飞控-77 (flight-controller batch): ingest 77 std flight-controller projects
Topic-targeted pull from the local listing index (`name OR introduction`
contains 飞控, i.e. "flight controller"). 79 std hits in
oshwhub_listing_full.jsonl: 2 already crawled, 77 newly fetched.
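The selection step above can be sketched roughly as follows. This is a minimal illustration, not the crawler's actual code: `topic_hits` is a hypothetical helper, the sample records are invented, and using `path` as the "already crawled" key is an assumption (the diff only shows that listing items carry a `path` field). The std filter is left out because its field is not shown in this commit.

```python
import json
from typing import Iterable, Iterator


def topic_hits(lines: Iterable[str], keyword: str, crawled: set[str]) -> Iterator[dict]:
    """Yield JSONL listing records whose name or introduction contains
    `keyword`, skipping records whose path was already crawled."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the listing file
        rec = json.loads(line)
        name = rec.get("name") or ""
        intro = rec.get("introduction") or ""  # field may be null or absent
        if keyword not in name and keyword not in intro:
            continue
        if rec.get("path") in crawled:
            continue
        yield rec


# Hypothetical sample records standing in for oshwhub_listing_full.jsonl.
sample = [
    json.dumps({"name": "F4 飞控", "introduction": "", "path": "a/f4"}),
    json.dumps({"name": "ESC", "introduction": "四轴飞控配套电调", "path": "b/esc"}),
    json.dumps({"name": "LED cube", "introduction": None, "path": "c/led"}),
    json.dumps({"name": "飞控 mini", "introduction": "x", "path": "d/mini"}),
]
new_paths = [r["path"] for r in topic_hits(sample, "飞控", crawled={"d/mini"})]
print(new_paths)  # ['a/f4', 'b/esc']
```

With a real listing file, the `lines` argument could simply be the open file handle, since iterating a text file yields one line at a time.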
dev1 (Guangzhou) wall time: Step 1 detail scrape ~12 s, Step 4 std-source
backfill ~80 s (concurrency=5).
Source completeness: 73/77 have editor source. The other 4 are
attachments-only upstream: no editor session was ever attached, so
source_documents=[] is genuine (the SSR page shows no editor_version either).
Crawler hardening (crawlers/oshwhub/crawler.py):
- count.{like,star,fork,views} are now read defensively with `.get(..., 0)`.
  The listing API omits zero-valued fields for some low-activity entries
  (3/77 hit this on the first pass and hard-failed with KeyError 'like').
  Affects rank_score, pick_top, and the metadata.json metrics block.
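The failure mode and the defensive read can be reproduced in isolation. The sketch below mirrors the scoring weights of `rank_score` from this commit; the `counts` helper and the sample items are hypothetical and only illustrate how a missing key or a null `count` block falls back to 0.

```python
def counts(item: dict) -> dict:
    """Return the item's count block, or {} when it is missing or null."""
    return item.get("count") or {}


def rank_score(item: dict) -> float:
    """Composite score using the same weights as the patched crawler."""
    c = counts(item)
    return (
        c.get("like", 0) * 3
        + c.get("star", 0) * 1
        + c.get("fork", 0) * 2
        + c.get("views", 0) / 100
        + (item.get("comments_count") or 0) * 2
        + (item.get("grade") or 0) * 50
    )


# Hypothetical listing entries: complete, missing zero-valued fields, null block.
full = {
    "count": {"like": 4, "star": 2, "fork": 1, "views": 300},
    "comments_count": 1,
    "grade": 1,
}
sparse = {"count": {"star": 2, "views": 300}}  # no `like`, no `fork`
null_count = {"count": None}

print(rank_score(full))        # 12 + 2 + 2 + 3.0 + 2 + 50 = 71.0
print(rank_score(sparse))      # 2 + 3.0 = 5.0
print(rank_score(null_count))  # 0.0
```

The `or {}` matters in addition to `.get(..., 0)`: `item.get("count", {})` would still return `None` if the key is present with a null value, and the subsequent `.get` call would then raise `AttributeError`.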
License mix: 65% GPL 3.0, 11% Public Domain, 11% MIT, ~6% CC variants.
Transport: dev1 → SG via tar+scp (33 MB, ~3 min over the lossy
cross-region link). Skipped the Gitea push from dev1 because the same
6.5%-loss link tanks single-stream throughput.
@@ -153,13 +153,15 @@ def list_projects(
 
 def rank_score(item: dict) -> float:
     """Composite quality score: favor projects with broad engagement."""
-    c = item["count"]
+    # Listing API can omit zero-valued count fields (observed: low-activity
+    # projects miss `like`, possibly others). Use .get with 0 default.
+    c = item.get("count") or {}
     return (
-        c["like"] * 3
-        + c["star"] * 1
-        + c["fork"] * 2
-        + c["views"] / 100
-        + item["comments_count"] * 2
+        c.get("like", 0) * 3
+        + c.get("star", 0) * 1
+        + c.get("fork", 0) * 2
+        + c.get("views", 0) / 100
+        + (item.get("comments_count") or 0) * 2
         + (item.get("grade") or 0) * 50
     )
 
@@ -175,7 +177,7 @@ def pick_top(
     for it in items:
         if exclude_copies and "_copy" in it["path"]:
             continue
-        if it["count"]["like"] < min_likes:
+        if (it.get("count") or {}).get("like", 0) < min_likes:
             continue
         if (it.get("grade") or 0) < min_grade:
             continue
@@ -1095,11 +1097,11 @@ def crawl_one(
         "published_at": list_item.get("oshwhub_publish_at"),
         "crawled_at": datetime.now(timezone.utc).isoformat(),
         "metrics": {
-            "likes": list_item["count"]["like"],
-            "stars": list_item["count"]["star"],
-            "forks": list_item["count"]["fork"],
-            "views": list_item["count"]["views"],
-            "watch": list_item["count"].get("watch", 0),
+            "likes": (list_item.get("count") or {}).get("like", 0),
+            "stars": (list_item.get("count") or {}).get("star", 0),
+            "forks": (list_item.get("count") or {}).get("fork", 0),
+            "views": (list_item.get("count") or {}).get("views", 0),
+            "watch": (list_item.get("count") or {}).get("watch", 0),
             "comments": list_item.get("comments_count", 0),
         },
         "cover": {"url": thumb_url, "path": cover_rel} if thumb_url else None,