Tag Schema & Similar Question Retrieval — Requirements
Background
Current state of paper_questions tagging for COMP2211:
- analytics_topic: 8 coarse buckets (e.g. "KNN and Clustering" covers both KNN and K-Means)
- topic_tags: redundant copy of analytics_topic, adds no information
- skill_tags: fine-grained snake_case labels (e.g. centroid_update, distance_calculation), not shown to users
- question_text: at subquestion level, but currently stores parent problem header text, not the actual subquestion statement
The result is that similar question retrieval conflates KNN and K-Means, cannot distinguish "write code" from "trace algorithm", and produces low-precision recommendations.
Goal
Every subquestion should carry enough structured metadata that the retrieval system can return questions matching on both topic and skill across different exam years, rather than just questions from the same broad topic bucket.
Precision target: a question on K-Means centroid update should retrieve other K-Means centroid update questions, not KNN distance questions.
Field Definitions (revised)
analytics_topic — single string, primary retrieval bucket
Granularity: algorithm or concept level, not course-section level.
Allowed values for COMP2211 (replace current 8-bucket system):
| New value | Replaces / splits |
|---|---|
| Naive Bayes | Probabilistic Models (partial) |
| Bayesian Inference | Probabilistic Models (partial) |
| KNN | KNN and Clustering (partial) |
| K-Means | KNN and Clustering (partial) |
| Perceptron | Perceptron and MLP (partial) |
| MLP | Perceptron and MLP (partial) |
| CNN | Vision and CNN |
| Evaluation Metrics | Evaluation and Validation (partial) |
| Cross Validation | Evaluation and Validation (partial) |
| Python and NumPy | Python Fundamentals |
| Search Algorithms | Search and Games (partial) |
| Game Trees | Search and Games (partial) |
| Ethics of AI | Ethics of AI (unchanged) |
Rules:
- One value per question — pick the most specific algorithm being tested
- If a subquestion genuinely spans two algorithms, pick the one being asked to compute/demonstrate
- True/False is not a valid analytics_topic (it is a format, not a topic)
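A minimal sketch of how the allowed-value list could be enforced when a row is written. The names `ALLOWED_ANALYTICS_TOPICS` and `validate_analytics_topic` are hypothetical and not part of the existing codebase; only the values themselves come from the table above.

```python
# Hypothetical constant: the new COMP2211 values from the table above.
ALLOWED_ANALYTICS_TOPICS = frozenset({
    "Naive Bayes", "Bayesian Inference", "KNN", "K-Means",
    "Perceptron", "MLP", "CNN", "Evaluation Metrics",
    "Cross Validation", "Python and NumPy", "Search Algorithms",
    "Game Trees", "Ethics of AI",
})


def validate_analytics_topic(value):
    """Return an error message, or None if the value is acceptable."""
    if not isinstance(value, str):
        # Rule: one value per question, so lists are rejected.
        return "analytics_topic must be a single string"
    if value not in ALLOWED_ANALYTICS_TOPICS:
        # Covers legacy buckets and format labels such as "True/False".
        return f"{value!r} is not in the allowed value list"
    return None
```

Returning a message rather than raising lets a backfill run collect all violations in one pass; raising an exception would be an equally reasonable choice.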
topic_tags — string array, secondary topic labels
Granularity: concept and variant level within the algorithm.
Purpose: catch cross-topic overlaps and concept aliases.
Examples:
analytics_topic = "K-Means"
topic_tags = ["K-Means", "Centroid Update", "Convergence"]
analytics_topic = "KNN"
topic_tags = ["KNN", "Euclidean Distance", "Classification"]
analytics_topic = "Naive Bayes"
topic_tags = ["Naive Bayes", "Prior", "Likelihood", "Posterior"]
analytics_topic = "Evaluation Metrics"
topic_tags = ["Evaluation Metrics", "Precision", "Recall", "F1 Score"]
analytics_topic = "MLP"
topic_tags = ["MLP", "Backpropagation", "Activation Function", "Hidden Layer"]
analytics_topic = "Python and NumPy"
topic_tags = ["NumPy", "Broadcasting", "Array Indexing", "Vectorization"]
Rules:
- First element should match or alias analytics_topic
- Include concept names a student would search for ("F1 Score", not "metric_reasoning")
- 2–5 tags per question; avoid over-tagging
- Human-readable, title-case, no underscores
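The rules above could be checked mechanically along these lines. This is a sketch only: `TOPIC_ALIASES` and `topic_tag_problems` are hypothetical names, the alias table holds just the one alias visible in the examples, and the Title Case test is a loose proxy rather than a strict definition.

```python
# Hypothetical alias table: tags accepted as standing in for an analytics_topic.
TOPIC_ALIASES = {"Python and NumPy": {"NumPy"}}


def topic_tag_problems(analytics_topic, topic_tags):
    """Return a list of rule violations for one question's topic_tags."""
    problems = []
    if not 2 <= len(topic_tags) <= 5:
        problems.append(f"expected 2-5 tags, got {len(topic_tags)}")
    accepted_first = {analytics_topic} | TOPIC_ALIASES.get(analytics_topic, set())
    if topic_tags and topic_tags[0] not in accepted_first:
        problems.append(f"first tag {topic_tags[0]!r} does not match or alias the topic")
    for tag in topic_tags:
        # Loose proxy for "human-readable, title-case": no underscores and
        # no leading lowercase letter (so "KNN" and "F1 Score" both pass).
        if "_" in tag or tag[:1].islower():
            problems.append(f"{tag!r} is not human-readable Title Case")
    return problems
```

An empty list means the tags satisfy all three rules; a tag such as "metric_reasoning" would be reported.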
skill_tags — string array, task type labels
Granularity: what the student must do, not what the topic is.
Current values are acceptable in meaning but must be converted to human-readable form.
Rename convention: snake_case → Title Case with spaces
| Old | New |
|---|---|
| concept_check | Concept Check |
| code_tracing | Code Tracing |
| algorithm_tracing | Algorithm Tracing |
| distance_calculation | Distance Calculation |
| centroid_update | Centroid Update |
| weight_update | Weight Update |
| decision_boundary | Decision Boundary |
| implementation | Implementation |
| debugging | Debugging |
| model_selection | Model Selection |
| concept_explanation | Concept Explanation |
| architecture_reasoning | Architecture Reasoning |
| convergence_reasoning | Convergence Reasoning |
| generalization_reasoning | Generalization Reasoning |
| classification_decision | Classification Decision |
Rules:
- 1–3 tags per question
- Describes the task type, not the subject matter
- These are used for retrieval ranking, not primary display
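Because every row in the mapping table follows the same pattern, the rename can likely be done by rule instead of a lookup table. The helper below is a sketch with a hypothetical name (`skill_tag_to_title`); it leaves already-converted values unchanged, so it should be safe to re-run.

```python
def skill_tag_to_title(tag):
    """Convert a snake_case skill tag to Title Case with spaces."""
    words = tag.replace("_", " ").split()
    # Upper-case only the first letter of each word and keep the rest as-is,
    # so a value that is already "Centroid Update" passes through unchanged.
    return " ".join(word[:1].upper() + word[1:] for word in words)


old_tags = ["centroid_update", "distance_calculation"]
new_tags = [skill_tag_to_title(t) for t in old_tags]
```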
question_text — the actual subquestion statement
Current problem: subquestions store the parent problem header as question_text, not the individual statement.
Required fix per subquestion type:
| Type | What question_text should contain |
|---|---|
| True/False subquestion (Q1a–Q1j) | The specific T/F statement being judged |
| Code output (Q2a_i–Q2a_v) | The specific code snippet + "What is the output?" |
| Calculation subquestion (Q4a, Q5a) | The specific sub-task, e.g. "Compute the Euclidean distance between..." |
| Written explanation (Q3, Q5c) | The full question prompt for that part |
This is a data extraction quality issue. The backfill script must extract the correct per-subquestion text from the source PDF or from raw_answer_text.
Backfill Requirements
Script: backfill_comp2211_tags.py
Target: all paper_questions rows whose paper_id belongs to the COMP2211 course library.
For each question:
- Re-classify analytics_topic using the new value list above
  - Use question_text + existing topic_tags + skill_tags as signals
  - If analytics_topic is currently "KNN and Clustering":
    - Look at skill_tags and question_text
    - If centroid_update, algorithm_tracing, or text contains "K-Means" / "centroid" → set "K-Means"
    - Otherwise → set "KNN"
  - If analytics_topic is currently "Perceptron and MLP":
    - If question_text or skill_tags references hidden layer, backprop, activation function → "MLP"
    - Otherwise → "Perceptron"
  - If analytics_topic is currently "Probabilistic Models":
    - If Naive Bayes in text → "Naive Bayes"
    - Otherwise → "Bayesian Inference"
  - If analytics_topic is currently "Evaluation and Validation":
    - If cross-validation, train/val split in text → "Cross Validation"
    - Otherwise → "Evaluation Metrics"
  - If analytics_topic is currently "Search and Games":
    - If minimax, alpha-beta, game tree in text → "Game Trees"
    - Otherwise → "Search Algorithms"
- Rebuild topic_tags: do not copy analytics_topic; derive from question content
- Rename skill_tags: convert all snake_case values to Title Case per the mapping table above
- Do not overwrite question_text in this pass (separate task)
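The re-classification rules above can be sketched as a small rule table plus one function. Everything named here (`SPLIT_RULES`, `RENAMES`, `KMEANS_SKILLS`, `reclassify`) is hypothetical, and the exact keyword spellings are assumptions, since the requirements name only the concepts to look for. Matching is a plain case-insensitive substring test.

```python
# Old bucket -> (keywords selecting the first topic, first topic, fallback topic).
# Keyword spellings are assumptions; the requirements only name the concepts.
SPLIT_RULES = {
    "KNN and Clustering": (("k-means", "centroid"), "K-Means", "KNN"),
    "Perceptron and MLP": (("hidden layer", "backprop", "activation function"),
                           "MLP", "Perceptron"),
    "Probabilistic Models": (("naive bayes",), "Naive Bayes", "Bayesian Inference"),
    "Evaluation and Validation": (("cross-validation", "cross validation", "train/val"),
                                  "Cross Validation", "Evaluation Metrics"),
    "Search and Games": (("minimax", "alpha-beta", "game tree"),
                         "Game Trees", "Search Algorithms"),
}
# One-to-one renames taken from the analytics_topic value table.
RENAMES = {"Vision and CNN": "CNN", "Python Fundamentals": "Python and NumPy"}
# Skill tags that the rules treat as K-Means signals (normalised form).
KMEANS_SKILLS = {"centroid update", "algorithm tracing"}


def reclassify(question):
    """Return the new analytics_topic for one paper_questions row (a dict)."""
    old = question.get("analytics_topic")
    if old in RENAMES:
        return RENAMES[old]
    if old not in SPLIT_RULES:
        return old  # already a new value, or unknown: leave for manual review
    keywords, if_match, otherwise = SPLIT_RULES[old]
    text = (question.get("question_text") or "").lower()
    # Normalise skill tags so snake_case and Title Case inputs compare equal.
    skills = {s.lower().replace("_", " ") for s in (question.get("skill_tags") or [])}
    if old == "KNN and Clustering" and skills & KMEANS_SKILLS:
        return if_match
    if old == "Perceptron and MLP":
        # This rule also inspects skill_tags, so search them with the text.
        text = text + " " + " ".join(sorted(skills))
    return if_match if any(k in text for k in keywords) else otherwise
```

Since question_text currently holds the parent header for many rows, keyword matches on it may be noisy; rows where neither signal fires fall through to the "Otherwise" value and are worth spot-checking.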
Retrieval Algorithm Changes (backend questions.py)
Separate topic and skill contributions
Current similarity_score() merges analytics_topic, topic_tags, and skill_tags into one set. This causes skill tags like centroid_update to appear as "Shared topic: centroid_update" in the UI.
Required split:
def similarity_score(target, candidate):
    score = 0
    reasons = []
    # 1. analytics_topic exact match: 40 pts
    if target.get("analytics_topic") and target["analytics_topic"] == candidate.get("analytics_topic"):
        score += 40
        reasons.append(f"Same topic: {target['analytics_topic']}")
    # 2. topic_tags overlap: up to 20 pts (10 per shared tag, max 2)
    target_tt = set(t.lower() for t in (target.get("topic_tags") or []))
    candidate_tt = set(t.lower() for t in (candidate.get("topic_tags") or []))
    shared_tt = target_tt & candidate_tt
    tt_pts = min(len(shared_tt) * 10, 20)
    if tt_pts:
        score += tt_pts
        reasons.append(f"Shared concept: {', '.join(sorted(shared_tt)[:2])}")
    # 3. skill_tags overlap: up to 20 pts (10 per shared tag, max 2)
    target_st = set(t.lower() for t in (target.get("skill_tags") or []))
    candidate_st = set(t.lower() for t in (candidate.get("skill_tags") or []))
    shared_st = target_st & candidate_st
    st_pts = min(len(shared_st) * 10, 20)
    if st_pts:
        score += st_pts
        reasons.append(f"Shared skill: {', '.join(sorted(shared_st)[:2])}")
    # 4. Same question format: 10 pts
    if question_family(candidate) == question_family(target):
        score += 10
        reasons.append("Same format")
    # 5. Same difficulty: 5 pts
    if candidate.get("difficulty") and candidate["difficulty"] == target.get("difficulty"):
        score += 5
        reasons.append("Same difficulty")
    # 6. Full-text similarity: up to 20 pts (from tsvector RPC)
    # (injected externally, not computed here)
    return min(score, 99), reasons
Threshold and display
- Filter: drop candidates with match_percent < 20 (raised from 10; ensures analytics_topic at least partially matches)
- UI display: show match_reasons chips, but replace snake_case with Title Case before display
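A sketch of the threshold and display step, assuming the filter drops candidates whose match_percent is below 20 and that each candidate is a dict carrying "match_percent" and "match_reasons" keys (that shape is an assumption). `MIN_MATCH_PERCENT`, `format_reason`, and `visible_matches` are hypothetical names.

```python
MIN_MATCH_PERCENT = 20  # threshold from the requirement above


def format_reason(reason):
    """Render a match reason as a chip label with no snake_case."""
    label, sep, values = reason.partition(": ")
    if not sep or not label.startswith("Shared"):
        # "Same topic: ..." and "Same format" already read well; leave them.
        return reason
    words = values.replace("_", " ").split()
    pretty = " ".join(w[:1].upper() + w[1:] for w in words)
    return f"{label}: {pretty}"


def visible_matches(candidates):
    """Drop weak matches and prettify the reasons of the rest."""
    kept = []
    for c in candidates:
        if c["match_percent"] < MIN_MATCH_PERCENT:
            continue
        reasons = [format_reason(r) for r in c["match_reasons"]]
        kept.append({**c, "match_reasons": reasons})
    return kept
```

Formatting at display time keeps stored reasons untouched, so chips stay free of underscores even for rows that have not yet been backfilled.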
Definition of Done
- All COMP2211 questions have analytics_topic from the new value list
- No analytics_topic value of "KNN and Clustering", "Perceptron and MLP", "Probabilistic Models", "Evaluation and Validation", "Search and Games" remains
- topic_tags contains 2–5 human-readable concept names, not a copy of analytics_topic
- skill_tags values are Title Case with spaces
- Similar question retrieval returns 0 cross-algorithm false positives between KNN and K-Means
- match_reasons chips in the UI show no underscores
- Retrieval threshold enforces analytics_topic match as a hard or near-hard requirement
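The KNN / K-Means criterion can be turned into an automated check along these lines. This is a sketch under assumptions: `cross_algorithm_false_positives` is a hypothetical helper, and the input shape (pairs of a target question and its retrieved candidates, each a dict with an "analytics_topic" key) is illustrative only.

```python
def cross_algorithm_false_positives(results, pair=("KNN", "K-Means")):
    """Count retrieved candidates whose topic is the other member of `pair`.

    `results` is an iterable of (target, candidates) tuples.
    """
    count = 0
    for target, candidates in results:
        topic = target.get("analytics_topic")
        if topic not in pair:
            continue  # only targets in the pair can produce this error type
        other = pair[1] if topic == pair[0] else pair[0]
        count += sum(1 for c in candidates if c.get("analytics_topic") == other)
    return count


sample = [
    ({"analytics_topic": "K-Means"},
     [{"analytics_topic": "K-Means"}, {"analytics_topic": "KNN"}]),
    ({"analytics_topic": "KNN"}, [{"analytics_topic": "KNN"}]),
]
false_positives = cross_algorithm_false_positives(sample)
```

For the done criterion to hold, the count over all COMP2211 targets would need to be 0; the sample above contains one deliberate violation.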