Files
PastpaperMaster/docs/TAGGING_REQUIREMENTS.md
Zhao 7a09167261 Initial commit: PastPaper Master full stack
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-21 12:27:47 +07:00

9.5 KiB
Raw Blame History

Tag Schema & Similar Question Retrieval — Requirements

Background

Current state of paper_questions tagging for COMP2211:

  • analytics_topic: 8 coarse buckets (e.g. "KNN and Clustering" covers both KNN and K-Means)
  • topic_tags: redundant copy of analytics_topic, adds no information
  • skill_tags: fine-grained snake_case labels (e.g. centroid_update, distance_calculation), not shown to users
  • question_text: at subquestion level, but currently stores parent problem header text, not the actual subquestion statement

The result is that similar question retrieval conflates KNN and K-Means, cannot distinguish "write code" from "trace algorithm", and produces low-precision recommendations.


Goal

Every subquestion should carry enough structured metadata that the retrieval system can return topically and skill-wise identical questions across different exam years, rather than just questions from the same broad topic bucket.

Precision target: a question on K-Means centroid update should retrieve other K-Means centroid update questions, not KNN distance questions.


Field Definitions (revised)

analytics_topic — single string, primary retrieval bucket

Granularity: algorithm or concept level, not course-section level.

Allowed values for COMP2211 (replace current 8-bucket system):

New value Replaces / splits
Naive Bayes Probabilistic Models (partial)
Bayesian Inference Probabilistic Models (partial)
KNN KNN and Clustering (partial)
K-Means KNN and Clustering (partial)
Perceptron Perceptron and MLP (partial)
MLP Perceptron and MLP (partial)
CNN Vision and CNN
Evaluation Metrics Evaluation and Validation (partial)
Cross Validation Evaluation and Validation (partial)
Python and NumPy Python Fundamentals
Search Algorithms Search and Games (partial)
Game Trees Search and Games (partial)
Ethics of AI Ethics of AI (unchanged)

Rules:

  • One value per question — pick the most specific algorithm being tested
  • If a subquestion genuinely spans two algorithms, pick the one being asked to compute/demonstrate
  • True/False is not a valid analytics_topic (it is a format, not a topic)

topic_tags — string array, secondary topic labels

Granularity: concept and variant level within the algorithm.

Purpose: catch cross-topic overlaps and concept aliases.

Examples:

analytics_topic = "K-Means"
topic_tags = ["K-Means", "Centroid Update", "Convergence"]

analytics_topic = "KNN"
topic_tags = ["KNN", "Euclidean Distance", "Classification"]

analytics_topic = "Naive Bayes"
topic_tags = ["Naive Bayes", "Prior", "Likelihood", "Posterior"]

analytics_topic = "Evaluation Metrics"
topic_tags = ["Evaluation Metrics", "Precision", "Recall", "F1 Score"]

analytics_topic = "MLP"
topic_tags = ["MLP", "Backpropagation", "Activation Function", "Hidden Layer"]

analytics_topic = "Python and NumPy"
topic_tags = ["NumPy", "Broadcasting", "Array Indexing", "Vectorization"]

Rules:

  • First element should match or alias analytics_topic
  • Include concept names a student would search for ("F1 Score", not "metric_reasoning")
  • 25 tags per question; avoid over-tagging
  • Human-readable, title-case, no underscores

skill_tags — string array, task type labels

Granularity: what the student must do, not what the topic is.

Current values are acceptable in meaning but must be converted to human-readable form.

Rename convention: snake_caseTitle Case with spaces

Old New
concept_check Concept Check
code_tracing Code Tracing
algorithm_tracing Algorithm Tracing
distance_calculation Distance Calculation
centroid_update Centroid Update
weight_update Weight Update
decision_boundary Decision Boundary
implementation Implementation
debugging Debugging
model_selection Model Selection
concept_explanation Concept Explanation
architecture_reasoning Architecture Reasoning
convergence_reasoning Convergence Reasoning
generalization_reasoning Generalization Reasoning
classification_decision Classification Decision

Rules:

  • 13 tags per question
  • Describes the task type, not the subject matter
  • These are used for retrieval ranking, not primary display

question_text — the actual subquestion statement

Current problem: subquestions store the parent problem header as question_text, not the individual statement.

Required fix per subquestion type:

Type What question_text should contain
True/False subquestion (Q1aQ1j) The specific T/F statement being judged
Code output (Q2a_iQ2a_v) The specific code snippet + "What is the output?"
Calculation subquestion (Q4a, Q5a) The specific sub-task, e.g. "Compute the Euclidean distance between..."
Written explanation (Q3, Q5c) The full question prompt for that part

This is a data extraction quality issue. The backfill script must extract the correct per-subquestion text from the source PDF or from raw_answer_text.


Backfill Requirements

Script: backfill_comp2211_tags.py

Target: all paper_questions where paper_id in the COMP2211 course library.

For each question:

  1. Re-classify analytics_topic using the new value list above

    • Use question_text + existing topic_tags + skill_tags as signals
    • If analytics_topic is currently "KNN and Clustering":
      • Look at skill_tags and question_text
      • If centroid_update, algorithm_tracing, or text contains "K-Means" / "centroid" → set "K-Means"
      • Otherwise → set "KNN"
    • If analytics_topic is currently "Perceptron and MLP":
      • If question_text or skill_tags references hidden layer, backprop, activation function → "MLP"
      • Otherwise → "Perceptron"
    • If analytics_topic is currently "Probabilistic Models":
      • If Naive Bayes in text → "Naive Bayes"
      • Otherwise → "Bayesian Inference"
    • If analytics_topic is currently "Evaluation and Validation":
      • If cross-validation, train/val split in text → "Cross Validation"
      • Otherwise → "Evaluation Metrics"
    • If analytics_topic is currently "Search and Games":
      • If minimax, alpha-beta, game tree in text → "Game Trees"
      • Otherwise → "Search Algorithms"
  2. Rebuild topic_tags — do not copy analytics_topic; derive from question content

  3. Rename skill_tags — convert all snake_case values to Title Case per the mapping table above

  4. Do not overwrite question_text in this pass (separate task)


Retrieval Algorithm Changes (backend questions.py)

Separate topic and skill contributions

Current similarity_score() merges analytics_topic, topic_tags, and skill_tags into one set. This causes skill tags like centroid_update to appear as "Shared topic: centroid_update" in the UI.

Required split:

def similarity_score(target, candidate):
    score = 0
    reasons = []

    # 1. analytics_topic exact match: 40 pts
    if target.get("analytics_topic") and target["analytics_topic"] == candidate.get("analytics_topic"):
        score += 40
        reasons.append(f"Same topic: {target['analytics_topic']}")

    # 2. topic_tags overlap: up to 20 pts (10 per shared tag, max 2)
    target_tt = set(t.lower() for t in (target.get("topic_tags") or []))
    candidate_tt = set(t.lower() for t in (candidate.get("topic_tags") or []))
    shared_tt = target_tt & candidate_tt
    tt_pts = min(len(shared_tt) * 10, 20)
    if tt_pts:
        score += tt_pts
        reasons.append(f"Shared concept: {', '.join(sorted(shared_tt)[:2])}")

    # 3. skill_tags overlap: up to 20 pts (10 per shared tag, max 2)
    target_st = set(t.lower() for t in (target.get("skill_tags") or []))
    candidate_st = set(t.lower() for t in (candidate.get("skill_tags") or []))
    shared_st = target_st & candidate_st
    st_pts = min(len(shared_st) * 10, 20)
    if st_pts:
        score += st_pts
        reasons.append(f"Shared skill: {', '.join(sorted(shared_st)[:2])}")

    # 4. Same question format: 10 pts
    if question_family(candidate) == question_family(target):
        score += 10
        reasons.append("Same format")

    # 5. Same difficulty: 5 pts
    if candidate.get("difficulty") and candidate["difficulty"] == target.get("difficulty"):
        score += 5
        reasons.append("Same difficulty")

    # 6. Full-text similarity: up to 20 pts (from tsvector RPC)
    # (injected externally, not computed here)

    return min(score, 99), reasons

Threshold and display

  • Filter: match_percent < 20 (raised from 10; ensures analytics_topic at least partially matches)
  • UI display: show match_reasons chips, but replace snake_case with Title Case before display

Definition of Done

  • All COMP2211 questions have analytics_topic from the new value list
  • No analytics_topic value of "KNN and Clustering", "Perceptron and MLP", "Probabilistic Models", "Evaluation and Validation", "Search and Games" remains
  • topic_tags contains 25 human-readable concept names, not a copy of analytics_topic
  • skill_tags values are Title Case with spaces
  • Similar question retrieval returns 0 cross-algorithm false positives between KNN and K-Means
  • match_reasons chips in the UI show no underscores
  • Retrieval threshold enforces analytics_topic match as a hard or near-hard requirement