Files

Zhao 7a09167261 Initial commit: PastPaper Master full stack

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-21 12:27:47 +07:00

9.5 KiB

Raw Blame History

Tag Schema & Similar Question Retrieval — Requirements

Background

Current state of paper_questions tagging for COMP2211:

analytics_topic: 8 coarse buckets (e.g. "KNN and Clustering" covers both KNN and K-Means)
topic_tags: redundant copy of analytics_topic, adds no information
skill_tags: fine-grained snake_case labels (e.g. centroid_update, distance_calculation), not shown to users
question_text: at subquestion level, but currently stores parent problem header text, not the actual subquestion statement

The result is that similar question retrieval conflates KNN and K-Means, cannot distinguish "write code" from "trace algorithm", and produces low-precision recommendations.

Goal

Every subquestion should carry enough structured metadata that the retrieval system can return topically and skill-wise identical questions across different exam years, rather than just questions from the same broad topic bucket.

Precision target: a question on K-Means centroid update should retrieve other K-Means centroid update questions, not KNN distance questions.

Field Definitions (revised)

`analytics_topic` — single string, primary retrieval bucket

Granularity: algorithm or concept level, not course-section level.

Allowed values for COMP2211 (replace current 8-bucket system):

New value	Replaces / splits
`Naive Bayes`	Probabilistic Models (partial)
`Bayesian Inference`	Probabilistic Models (partial)
`KNN`	KNN and Clustering (partial)
`K-Means`	KNN and Clustering (partial)
`Perceptron`	Perceptron and MLP (partial)
`MLP`	Perceptron and MLP (partial)
`CNN`	Vision and CNN
`Evaluation Metrics`	Evaluation and Validation (partial)
`Cross Validation`	Evaluation and Validation (partial)
`Python and NumPy`	Python Fundamentals
`Search Algorithms`	Search and Games (partial)
`Game Trees`	Search and Games (partial)
`Ethics of AI`	Ethics of AI (unchanged)

Rules:

One value per question — pick the most specific algorithm being tested
If a subquestion genuinely spans two algorithms, pick the one being asked to compute/demonstrate
True/False is not a valid analytics_topic (it is a format, not a topic)

`topic_tags` — string array, secondary topic labels

Granularity: concept and variant level within the algorithm.

Purpose: catch cross-topic overlaps and concept aliases.

Examples:

analytics_topic = "K-Means"
topic_tags = ["K-Means", "Centroid Update", "Convergence"]

analytics_topic = "KNN"
topic_tags = ["KNN", "Euclidean Distance", "Classification"]

analytics_topic = "Naive Bayes"
topic_tags = ["Naive Bayes", "Prior", "Likelihood", "Posterior"]

analytics_topic = "Evaluation Metrics"
topic_tags = ["Evaluation Metrics", "Precision", "Recall", "F1 Score"]

analytics_topic = "MLP"
topic_tags = ["MLP", "Backpropagation", "Activation Function", "Hidden Layer"]

analytics_topic = "Python and NumPy"
topic_tags = ["NumPy", "Broadcasting", "Array Indexing", "Vectorization"]

Rules:

First element should match or alias analytics_topic
Include concept names a student would search for ("F1 Score", not "metric_reasoning")
2–5 tags per question; avoid over-tagging
Human-readable, title-case, no underscores

`skill_tags` — string array, task type labels

Granularity: what the student must do, not what the topic is.

Current values are acceptable in meaning but must be converted to human-readable form.

Rename convention: snake_case → Title Case with spaces

Old	New
`concept_check`	`Concept Check`
`code_tracing`	`Code Tracing`
`algorithm_tracing`	`Algorithm Tracing`
`distance_calculation`	`Distance Calculation`
`centroid_update`	`Centroid Update`
`weight_update`	`Weight Update`
`decision_boundary`	`Decision Boundary`
`implementation`	`Implementation`
`debugging`	`Debugging`
`model_selection`	`Model Selection`
`concept_explanation`	`Concept Explanation`
`architecture_reasoning`	`Architecture Reasoning`
`convergence_reasoning`	`Convergence Reasoning`
`generalization_reasoning`	`Generalization Reasoning`
`classification_decision`	`Classification Decision`

Rules:

1–3 tags per question
Describes the task type, not the subject matter
These are used for retrieval ranking, not primary display

`question_text` — the actual subquestion statement

Current problem: subquestions store the parent problem header as question_text, not the individual statement.

Required fix per subquestion type:

Type	What `question_text` should contain
True/False subquestion (Q1a–Q1j)	The specific T/F statement being judged
Code output (Q2a_i–Q2a_v)	The specific code snippet + "What is the output?"
Calculation subquestion (Q4a, Q5a)	The specific sub-task, e.g. "Compute the Euclidean distance between..."
Written explanation (Q3, Q5c)	The full question prompt for that part

This is a data extraction quality issue. The backfill script must extract the correct per-subquestion text from the source PDF or from raw_answer_text.

Backfill Requirements

Script: `backfill_comp2211_tags.py`

Target: all paper_questions where paper_id in the COMP2211 course library.

For each question:

Re-classify analytics_topic using the new value list above
- Use question_text + existing topic_tags + skill_tags as signals
- If analytics_topic is currently "KNN and Clustering":
  - Look at skill_tags and question_text
  - If centroid_update, algorithm_tracing, or text contains "K-Means" / "centroid" → set "K-Means"
  - Otherwise → set "KNN"
- If analytics_topic is currently "Perceptron and MLP":
  - If question_text or skill_tags references hidden layer, backprop, activation function → "MLP"
  - Otherwise → "Perceptron"
- If analytics_topic is currently "Probabilistic Models":
  - If Naive Bayes in text → "Naive Bayes"
  - Otherwise → "Bayesian Inference"
- If analytics_topic is currently "Evaluation and Validation":
  - If cross-validation, train/val split in text → "Cross Validation"
  - Otherwise → "Evaluation Metrics"
- If analytics_topic is currently "Search and Games":
  - If minimax, alpha-beta, game tree in text → "Game Trees"
  - Otherwise → "Search Algorithms"
Rebuild topic_tags — do not copy analytics_topic; derive from question content
Rename skill_tags — convert all snake_case values to Title Case per the mapping table above
Do not overwrite question_text in this pass (separate task)

Retrieval Algorithm Changes (backend `questions.py`)

Separate topic and skill contributions

Current similarity_score() merges analytics_topic, topic_tags, and skill_tags into one set. This causes skill tags like centroid_update to appear as "Shared topic: centroid_update" in the UI.

Required split:

def similarity_score(target, candidate):
    score = 0
    reasons = []

    # 1. analytics_topic exact match: 40 pts
    if target.get("analytics_topic") and target["analytics_topic"] == candidate.get("analytics_topic"):
        score += 40
        reasons.append(f"Same topic: {target['analytics_topic']}")

    # 2. topic_tags overlap: up to 20 pts (10 per shared tag, max 2)
    target_tt = set(t.lower() for t in (target.get("topic_tags") or []))
    candidate_tt = set(t.lower() for t in (candidate.get("topic_tags") or []))
    shared_tt = target_tt & candidate_tt
    tt_pts = min(len(shared_tt) * 10, 20)
    if tt_pts:
        score += tt_pts
        reasons.append(f"Shared concept: {', '.join(sorted(shared_tt)[:2])}")

    # 3. skill_tags overlap: up to 20 pts (10 per shared tag, max 2)
    target_st = set(t.lower() for t in (target.get("skill_tags") or []))
    candidate_st = set(t.lower() for t in (candidate.get("skill_tags") or []))
    shared_st = target_st & candidate_st
    st_pts = min(len(shared_st) * 10, 20)
    if st_pts:
        score += st_pts
        reasons.append(f"Shared skill: {', '.join(sorted(shared_st)[:2])}")

    # 4. Same question format: 10 pts
    if question_family(candidate) == question_family(target):
        score += 10
        reasons.append("Same format")

    # 5. Same difficulty: 5 pts
    if candidate.get("difficulty") and candidate["difficulty"] == target.get("difficulty"):
        score += 5
        reasons.append("Same difficulty")

    # 6. Full-text similarity: up to 20 pts (from tsvector RPC)
    # (injected externally, not computed here)

    return min(score, 99), reasons

Threshold and display

Filter: match_percent < 20 (raised from 10; ensures analytics_topic at least partially matches)
UI display: show match_reasons chips, but replace snake_case with Title Case before display

Definition of Done

All COMP2211 questions have analytics_topic from the new value list
No analytics_topic value of "KNN and Clustering", "Perceptron and MLP", "Probabilistic Models", "Evaluation and Validation", "Search and Games" remains
topic_tags contains 2–5 human-readable concept names, not a copy of analytics_topic
skill_tags values are Title Case with spaces
Similar question retrieval returns 0 cross-algorithm false positives between KNN and K-Means
match_reasons chips in the UI show no underscores
Retrieval threshold enforces analytics_topic match as a hard or near-hard requirement

9.5 KiB Raw Blame History Unescape Escape

Tag Schema & Similar Question Retrieval — Requirements

Background

Goal

Field Definitions (revised)

analytics_topic — single string, primary retrieval bucket

topic_tags — string array, secondary topic labels

skill_tags — string array, task type labels

question_text — the actual subquestion statement

Backfill Requirements

Script: backfill_comp2211_tags.py

Retrieval Algorithm Changes (backend questions.py)

Separate topic and skill contributions

Threshold and display

Definition of Done

9.5 KiB

Raw Blame History

`analytics_topic` — single string, primary retrieval bucket

`topic_tags` — string array, secondary topic labels

`skill_tags` — string array, task type labels

`question_text` — the actual subquestion statement

Script: `backfill_comp2211_tags.py`

Retrieval Algorithm Changes (backend `questions.py`)