# Tag Schema & Similar Question Retrieval: Requirements

## Background

Current state of `paper_questions` tagging for COMP2211:

- `analytics_topic`: 8 coarse buckets (e.g. "KNN and Clustering" covers both KNN and K-Means)
- `topic_tags`: redundant copy of `analytics_topic`, adds no information
- `skill_tags`: fine-grained snake_case labels (e.g. `centroid_update`, `distance_calculation`), not shown to users
- `question_text`: at subquestion level, but currently stores **parent problem header text**, not the actual subquestion statement

The result is that similar question retrieval conflates KNN and K-Means, cannot distinguish "write code" from "trace algorithm", and produces low-precision recommendations.

---

## Goal

Every subquestion should carry enough structured metadata that the retrieval system can return **topically and skill-wise identical questions across different exam years**, rather than just questions from the same broad topic bucket.

Precision target: a question on K-Means centroid update should retrieve other K-Means centroid update questions, not KNN distance questions.

---

## Field Definitions (revised)

### `analytics_topic`: single string, primary retrieval bucket

Granularity: **algorithm or concept level**, not course-section level.

Allowed values for COMP2211 (these replace the current 8-bucket system):

| New value | Replaces / splits |
|-----------|-------------------|
| `Naive Bayes` | Probabilistic Models (partial) |
| `Bayesian Inference` | Probabilistic Models (partial) |
| `KNN` | KNN and Clustering (partial) |
| `K-Means` | KNN and Clustering (partial) |
| `Perceptron` | Perceptron and MLP (partial) |
| `MLP` | Perceptron and MLP (partial) |
| `CNN` | Vision and CNN |
| `Evaluation Metrics` | Evaluation and Validation (partial) |
| `Cross Validation` | Evaluation and Validation (partial) |
| `Python and NumPy` | Python Fundamentals |
| `Search Algorithms` | Search and Games (partial) |
| `Game Trees` | Search and Games (partial) |
| `Ethics of AI` | Ethics of AI (unchanged) |

Rules:

- One value per question: pick the **most specific** algorithm being tested
- If a subquestion genuinely spans two algorithms, pick the one the student is asked to compute or demonstrate
- `True/False` is **not** a valid `analytics_topic` (it is a format, not a topic)

---

### `topic_tags`: string array, secondary topic labels

Granularity: **concept and variant level** within the algorithm.

Purpose: catch cross-topic overlaps and concept aliases.

Examples:

```
analytics_topic = "K-Means"
topic_tags = ["K-Means", "Centroid Update", "Convergence"]

analytics_topic = "KNN"
topic_tags = ["KNN", "Euclidean Distance", "Classification"]

analytics_topic = "Naive Bayes"
topic_tags = ["Naive Bayes", "Prior", "Likelihood", "Posterior"]

analytics_topic = "Evaluation Metrics"
topic_tags = ["Evaluation Metrics", "Precision", "Recall", "F1 Score"]

analytics_topic = "MLP"
topic_tags = ["MLP", "Backpropagation", "Activation Function", "Hidden Layer"]

analytics_topic = "Python and NumPy"
topic_tags = ["NumPy", "Broadcasting", "Array Indexing", "Vectorization"]
```

Rules:

- First element should match or alias `analytics_topic`
- Include concept names a student would search for ("F1 Score", not "metric_reasoning")
- 2–5 tags per question; avoid over-tagging
- Human-readable, title-case, no underscores

---

### `skill_tags`: string array, task type labels

Granularity: **what the student must do**, not what the topic is.

Current values are acceptable in meaning but must be converted to human-readable form.

Rename convention: `snake_case` → `Title Case with spaces`

| Old | New |
|-----|-----|
| `concept_check` | `Concept Check` |
| `code_tracing` | `Code Tracing` |
| `algorithm_tracing` | `Algorithm Tracing` |
| `distance_calculation` | `Distance Calculation` |
| `centroid_update` | `Centroid Update` |
| `weight_update` | `Weight Update` |
| `decision_boundary` | `Decision Boundary` |
| `implementation` | `Implementation` |
| `debugging` | `Debugging` |
| `model_selection` | `Model Selection` |
| `concept_explanation` | `Concept Explanation` |
| `architecture_reasoning` | `Architecture Reasoning` |
| `convergence_reasoning` | `Convergence Reasoning` |
| `generalization_reasoning` | `Generalization Reasoning` |
| `classification_decision` | `Classification Decision` |

Rules:

- 1–3 tags per question
- Describes the **task type**, not the subject matter
- These are used for retrieval ranking, not primary display

---

### `question_text`: the actual subquestion statement

Current problem: subquestions store the **parent problem header** as `question_text`, not the individual statement.

Required fix per subquestion type:

| Type | What `question_text` should contain |
|------|-------------------------------------|
| True/False subquestion (Q1a–Q1j) | The specific T/F statement being judged |
| Code output (Q2a_i–Q2a_v) | The specific code snippet + "What is the output?" |
| Calculation subquestion (Q4a, Q5a) | The specific sub-task, e.g. "Compute the Euclidean distance between..." |
| Written explanation (Q3, Q5c) | The full question prompt for that part |

This is a **data extraction quality issue**. A separate backfill pass must extract the correct per-subquestion text from the source PDF or from `raw_answer_text`; the tag backfill described below leaves `question_text` untouched.

---

## Backfill Requirements

### Script: `backfill_comp2211_tags.py`

Target: all `paper_questions` whose `paper_id` is in the COMP2211 course library.

For each question:

1. **Re-classify `analytics_topic`** using the new value list above
   - Use `question_text` + existing `topic_tags` + `skill_tags` as signals
   - If `analytics_topic` is currently `"KNN and Clustering"`:
     - Look at `skill_tags` and `question_text`
     - If `skill_tags` contains `centroid_update` or `algorithm_tracing`, or the text contains "K-Means" / "centroid" → set `"K-Means"`
     - Otherwise → set `"KNN"`
   - If `analytics_topic` is currently `"Perceptron and MLP"`:
     - If `question_text` or `skill_tags` references hidden layer, backprop, or activation function → `"MLP"`
     - Otherwise → `"Perceptron"`
   - If `analytics_topic` is currently `"Probabilistic Models"`:
     - If the text mentions Naive Bayes → `"Naive Bayes"`
     - Otherwise → `"Bayesian Inference"`
   - If `analytics_topic` is currently `"Evaluation and Validation"`:
     - If the text mentions cross-validation or a train/val split → `"Cross Validation"`
     - Otherwise → `"Evaluation Metrics"`
   - If `analytics_topic` is currently `"Search and Games"`:
     - If the text mentions minimax, alpha-beta, or game tree → `"Game Trees"`
     - Otherwise → `"Search Algorithms"`
   - Buckets with a one-to-one replacement in the value table are renamed directly: `"Vision and CNN"` → `"CNN"`, `"Python Fundamentals"` → `"Python and NumPy"`
2. **Rebuild `topic_tags`**: do not reduce it to a copy of `analytics_topic`; derive the tags from question content
3. **Rename `skill_tags`**: convert all snake_case values to Title Case per the mapping table above
4. **Do not overwrite `question_text`** in this pass (separate task)

---

## Retrieval Algorithm Changes (backend `questions.py`)

### Separate topic and skill contributions

Current `similarity_score()` merges `analytics_topic`, `topic_tags`, and `skill_tags` into one set. This causes skill tags like `centroid_update` to appear as "Shared topic: centroid_update" in the UI.

Required split:

```python
def similarity_score(target, candidate):
    # question_family() is assumed to be defined elsewhere; it is not shown here.
    score = 0
    reasons = []

    # 1. analytics_topic exact match: 40 pts
    if target.get("analytics_topic") and target["analytics_topic"] == candidate.get("analytics_topic"):
        score += 40
        reasons.append(f"Same topic: {target['analytics_topic']}")

    # 2. topic_tags overlap: up to 20 pts (10 per shared tag, max 2)
    target_tt = set(t.lower() for t in (target.get("topic_tags") or []))
    candidate_tt = set(t.lower() for t in (candidate.get("topic_tags") or []))
    shared_tt = target_tt & candidate_tt
    tt_pts = min(len(shared_tt) * 10, 20)
    if tt_pts:
        score += tt_pts
        reasons.append(f"Shared concept: {', '.join(sorted(shared_tt)[:2])}")

    # 3. skill_tags overlap: up to 20 pts (10 per shared tag, max 2)
    target_st = set(t.lower() for t in (target.get("skill_tags") or []))
    candidate_st = set(t.lower() for t in (candidate.get("skill_tags") or []))
    shared_st = target_st & candidate_st
    st_pts = min(len(shared_st) * 10, 20)
    if st_pts:
        score += st_pts
        reasons.append(f"Shared skill: {', '.join(sorted(shared_st)[:2])}")

    # 4. Same question format: 10 pts
    if question_family(candidate) == question_family(target):
        score += 10
        reasons.append("Same format")

    # 5. Same difficulty: 5 pts
    if candidate.get("difficulty") and candidate["difficulty"] == target.get("difficulty"):
        score += 5
        reasons.append("Same difficulty")

    # 6. Full-text similarity: up to 20 pts (from tsvector RPC)
    #    (injected externally, not computed here)

    return min(score, 99), reasons
```

### Threshold and display

- Filter: drop candidates with `match_percent < 20` (raised from 10 to exclude candidates with little topical overlap). Note that 20 points can be reached through tag overlap alone, so this threshold by itself does not guarantee an `analytics_topic` match.
- UI display: show `match_reasons` chips, but replace snake_case with Title Case before display

---

## Definition of Done

- [ ] All COMP2211 questions have `analytics_topic` from the new value list
- [ ] No `analytics_topic` value of `"KNN and Clustering"`, `"Perceptron and MLP"`, `"Probabilistic Models"`, `"Evaluation and Validation"`, or `"Search and Games"` remains
- [ ] `topic_tags` contains 2–5 human-readable concept names, not a copy of `analytics_topic`
- [ ] `skill_tags` values are Title Case with spaces
- [ ] Similar question retrieval returns 0 cross-algorithm false positives between KNN and K-Means
- [ ] `match_reasons` chips in the UI show no underscores
- [ ] Retrieval threshold enforces `analytics_topic` match as a hard or near-hard requirement
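
Several of the Definition of Done items can be checked mechanically per row. The following is a minimal sketch, not part of the existing codebase: `validate_question` and `ALLOWED_TOPICS` are hypothetical names, and each question is assumed to be a plain dict with the three tag fields described under Field Definitions. It checks only the value list, the tag counts, and the absence of underscores; it does not attempt a strict Title Case check, because acronyms such as `KNN` would fail one.

```python
# Hypothetical validator for the tag-related Definition of Done items.
# ALLOWED_TOPICS is copied from the "Allowed values" table.
ALLOWED_TOPICS = {
    "Naive Bayes", "Bayesian Inference", "KNN", "K-Means", "Perceptron",
    "MLP", "CNN", "Evaluation Metrics", "Cross Validation",
    "Python and NumPy", "Search Algorithms", "Game Trees", "Ethics of AI",
}


def validate_question(q):
    """Return a list of human-readable violations for one question dict."""
    problems = []
    topic = q.get("analytics_topic")
    topic_tags = q.get("topic_tags") or []
    skill_tags = q.get("skill_tags") or []

    if topic not in ALLOWED_TOPICS:
        problems.append(f"analytics_topic {topic!r} is not in the new value list")
    if not 2 <= len(topic_tags) <= 5:
        problems.append(f"topic_tags has {len(topic_tags)} entries, expected 2-5")
    if not 1 <= len(skill_tags) <= 3:
        problems.append(f"skill_tags has {len(skill_tags)} entries, expected 1-3")
    # Underscores indicate a leftover snake_case label.
    for tag in [*topic_tags, *skill_tags]:
        if "_" in tag:
            problems.append(f"tag {tag!r} contains an underscore")
    return problems
```

A backfill run could call this on every updated row and report the rows with a non-empty result before the checklist is signed off.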
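
Steps 1 and 3 of the Backfill Requirements can be sketched as below. This is an illustration under stated assumptions rather than the contents of `backfill_comp2211_tags.py`: `SPLIT_RULES`, `RENAMES`, `reclassify_topic`, and `rename_skill_tags` are hypothetical names, the regular expressions are one possible reading of the keyword rules, and questions are assumed to be plain dicts. Step 2 (rebuilding `topic_tags` from question content) is not covered.

```python
import re

# Hypothetical encoding of the step 1 rules:
# old bucket -> (text pattern, skill tags that also trigger, matched value, fallback)
SPLIT_RULES = {
    "KNN and Clustering": (r"k-?means|centroid",
                           {"centroid_update", "algorithm_tracing"}, "K-Means", "KNN"),
    "Perceptron and MLP": (r"hidden.layer|backprop|activation.function",
                           set(), "MLP", "Perceptron"),
    "Probabilistic Models": (r"na[iï]ve.bayes",
                             set(), "Naive Bayes", "Bayesian Inference"),
    "Evaluation and Validation": (r"cross[- ]?validation|train/val",
                                  set(), "Cross Validation", "Evaluation Metrics"),
    "Search and Games": (r"minimax|alpha-beta|game tree",
                         set(), "Game Trees", "Search Algorithms"),
}

# Buckets that map one-to-one in the "Allowed values" table.
RENAMES = {"Vision and CNN": "CNN", "Python Fundamentals": "Python and NumPy"}


def reclassify_topic(question):
    """Map an old 8-bucket analytics_topic to a value from the new list."""
    old = question.get("analytics_topic")
    if old in RENAMES:
        return RENAMES[old]
    if old not in SPLIT_RULES:
        return old  # already a new value, or unchanged (e.g. "Ethics of AI")
    pattern, trigger_skills, matched, fallback = SPLIT_RULES[old]
    skills = set(question.get("skill_tags") or [])
    # Search the text together with the skill tags, so a tag such as
    # "hidden_layer" would also satisfy the "hidden.layer" pattern.
    haystack = " ".join([question.get("question_text") or "", *skills]).lower()
    if skills & trigger_skills or re.search(pattern, haystack):
        return matched
    return fallback


def rename_skill_tags(tags):
    """Convert snake_case skill tags to Title Case with spaces."""
    return [t.replace("_", " ").title() for t in tags]
```

Keeping the rules in a table rather than a chain of `if` statements makes it easier to review them against the requirement text. Because `question_text` currently holds the parent header for subquestions, the text signal may be weaker than the rules suggest until that separate fix lands.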