Initial commit: PastPaper Master full stack
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
243
docs/TAGGING_REQUIREMENTS.md
Normal file
@@ -0,0 +1,243 @@
# Tag Schema & Similar Question Retrieval — Requirements

## Background

Current state of `paper_questions` tagging for COMP2211:

- `analytics_topic`: 8 coarse buckets (e.g. "KNN and Clustering" covers both KNN and K-Means)
- `topic_tags`: redundant copy of `analytics_topic`, adds no information
- `skill_tags`: fine-grained snake_case labels (e.g. `centroid_update`, `distance_calculation`), not shown to users
- `question_text`: at subquestion level, but currently stores **parent problem header text**, not the actual subquestion statement

The result is that similar question retrieval conflates KNN and K-Means, cannot distinguish "write code" from "trace algorithm", and produces low-precision recommendations.

---

## Goal

Every subquestion should carry enough structured metadata that the retrieval system can return **topically and skill-wise identical questions across different exam years**, rather than just questions from the same broad topic bucket.

Precision target: a question on K-Means centroid update should retrieve other K-Means centroid update questions, not KNN distance questions.

---

## Field Definitions (revised)

### `analytics_topic` — single string, primary retrieval bucket

Granularity: **algorithm or concept level**, not course-section level.

Allowed values for COMP2211 (replace current 8-bucket system):

| New value | Replaces / splits |
|-----------|-------------------|
| `Naive Bayes` | Probabilistic Models (partial) |
| `Bayesian Inference` | Probabilistic Models (partial) |
| `KNN` | KNN and Clustering (partial) |
| `K-Means` | KNN and Clustering (partial) |
| `Perceptron` | Perceptron and MLP (partial) |
| `MLP` | Perceptron and MLP (partial) |
| `CNN` | Vision and CNN |
| `Evaluation Metrics` | Evaluation and Validation (partial) |
| `Cross Validation` | Evaluation and Validation (partial) |
| `Python and NumPy` | Python Fundamentals |
| `Search Algorithms` | Search and Games (partial) |
| `Game Trees` | Search and Games (partial) |
| `Ethics of AI` | Ethics of AI (unchanged) |

Rules:
- One value per question — pick the **most specific** algorithm being tested
- If a subquestion genuinely spans two algorithms, pick the one being asked to compute/demonstrate
- `True/False` is **not** a valid analytics_topic (it is a format, not a topic)
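
The value list and the one-value rule above could be enforced with a small check before any row is written. This is a minimal sketch, not existing code: `ALLOWED_ANALYTICS_TOPICS` and `validate_analytics_topic` are hypothetical names chosen for illustration.

```python
# Hypothetical helper: these names do not exist in the current codebase.
ALLOWED_ANALYTICS_TOPICS = frozenset({
    "Naive Bayes", "Bayesian Inference", "KNN", "K-Means",
    "Perceptron", "MLP", "CNN", "Evaluation Metrics", "Cross Validation",
    "Python and NumPy", "Search Algorithms", "Game Trees", "Ethics of AI",
})


def validate_analytics_topic(value):
    """Return a list of problems; an empty list means the value is acceptable."""
    problems = []
    if not isinstance(value, str):
        # Rule: exactly one value per question, stored as a single string.
        problems.append("analytics_topic must be a single string")
    elif value not in ALLOWED_ANALYTICS_TOPICS:
        # Catches legacy buckets and format labels such as "True/False".
        problems.append(f"unknown analytics_topic: {value!r}")
    return problems
```

A legacy bucket such as `"KNN and Clustering"` or a format label such as `"True/False"` is reported as unknown, while every value in the table passes.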

---

### `topic_tags` — string array, secondary topic labels

Granularity: **concept and variant level** within the algorithm.

Purpose: catch cross-topic overlaps and concept aliases.

Examples:

```
analytics_topic = "K-Means"
topic_tags = ["K-Means", "Centroid Update", "Convergence"]

analytics_topic = "KNN"
topic_tags = ["KNN", "Euclidean Distance", "Classification"]

analytics_topic = "Naive Bayes"
topic_tags = ["Naive Bayes", "Prior", "Likelihood", "Posterior"]

analytics_topic = "Evaluation Metrics"
topic_tags = ["Evaluation Metrics", "Precision", "Recall", "F1 Score"]

analytics_topic = "MLP"
topic_tags = ["MLP", "Backpropagation", "Activation Function", "Hidden Layer"]

analytics_topic = "Python and NumPy"
topic_tags = ["NumPy", "Broadcasting", "Array Indexing", "Vectorization"]
```

Rules:
- First element should match or alias `analytics_topic`
- Include concept names a student would search for ("F1 Score", not "metric_reasoning")
- 2–5 tags per question; avoid over-tagging
- Human-readable, title-case, no underscores
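
The mechanical parts of these rules (tag count, no underscores, title case) lend themselves to an automated check. The sketch below uses a hypothetical function name, `validate_topic_tags`, and deliberately does not check the "match or alias" rule, because no alias table is defined in this document.

```python
def validate_topic_tags(topic_tags):
    """Return a list of rule violations for one question's topic_tags."""
    problems = []
    if not 2 <= len(topic_tags) <= 5:
        problems.append(f"expected 2-5 tags, got {len(topic_tags)}")
    for tag in topic_tags:
        words = tag.split()
        if not words:
            problems.append("empty tag")
        elif "_" in tag:
            problems.append(f"underscore in tag: {tag!r}")
        # Loose title-case test: each word starts with an uppercase letter
        # or a digit, so "F1 Score", "NumPy" and "K-Means" all pass.
        elif not all(w[0].isupper() or w[0].isdigit() for w in words):
            problems.append(f"not title case: {tag!r}")
    return problems
```

The title-case test is intentionally loose: `str.istitle()` would reject acronyms and mixed-case names such as "KNN" or "NumPy", which the examples above treat as valid.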

---

### `skill_tags` — string array, task type labels

Granularity: **what the student must do**, not what the topic is.

Current values are acceptable in meaning but must be converted to human-readable form.

Rename convention: `snake_case` → `Title Case with spaces`

| Old | New |
|-----|-----|
| `concept_check` | `Concept Check` |
| `code_tracing` | `Code Tracing` |
| `algorithm_tracing` | `Algorithm Tracing` |
| `distance_calculation` | `Distance Calculation` |
| `centroid_update` | `Centroid Update` |
| `weight_update` | `Weight Update` |
| `decision_boundary` | `Decision Boundary` |
| `implementation` | `Implementation` |
| `debugging` | `Debugging` |
| `model_selection` | `Model Selection` |
| `concept_explanation` | `Concept Explanation` |
| `architecture_reasoning` | `Architecture Reasoning` |
| `convergence_reasoning` | `Convergence Reasoning` |
| `generalization_reasoning` | `Generalization Reasoning` |
| `classification_decision` | `Classification Decision` |

Rules:
- 1–3 tags per question
- Describes the **task type**, not the subject matter
- These are used for retrieval ranking, not primary display
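
Because every row in the mapping table follows the same convention, the rename can be derived rather than hard-coded. The following is a sketch under that assumption; `to_title_case_tag` and `rename_skill_tags` are illustrative names, not existing functions.

```python
import re


def to_title_case_tag(tag):
    """Convert "centroid_update" (or "Centroid Update") to "Centroid Update"."""
    # Split on underscores and whitespace so already-renamed tags are unchanged.
    words = [w for w in re.split(r"[_\s]+", tag) if w]
    return " ".join(w.capitalize() for w in words)


def rename_skill_tags(skill_tags):
    """Rename a list of tags, dropping duplicates while keeping order."""
    renamed = []
    for tag in skill_tags or []:
        new_tag = to_title_case_tag(tag)
        if new_tag and new_tag not in renamed:
            renamed.append(new_tag)
    return renamed
```

Splitting on whitespace as well as underscores makes the conversion idempotent, so the backfill can be re-run safely on rows that were already converted.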

---

### `question_text` — the actual subquestion statement

Current problem: subquestions store the **parent problem header** as `question_text`, not the individual statement.

Required fix per subquestion type:

| Type | What `question_text` should contain |
|------|-------------------------------------|
| True/False subquestion (Q1a–Q1j) | The specific T/F statement being judged |
| Code output (Q2a_i–Q2a_v) | The specific code snippet + "What is the output?" |
| Calculation subquestion (Q4a, Q5a) | The specific sub-task, e.g. "Compute the Euclidean distance between..." |
| Written explanation (Q3, Q5c) | The full question prompt for that part |

This is a **data extraction quality issue**. The backfill script must extract the correct per-subquestion text from the source PDF or from `raw_answer_text`.
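
Before re-extracting, it may help to list which rows are affected. One possible heuristic, offered only as a sketch: if several subquestions in the same paper share identical `question_text`, that text is probably the parent header. The row shape (dicts with `id`, `paper_id`, `question_text`) and the function name are assumptions; in particular, `id` is a hypothetical primary-key field.

```python
from collections import defaultdict


def find_suspect_question_text(rows):
    """Return ids of rows whose question_text is shared within the same paper."""
    groups = defaultdict(list)
    for row in rows:
        # Collapse whitespace and case so trivial differences do not hide duplicates.
        text = " ".join((row.get("question_text") or "").split()).lower()
        if text:
            groups[(row["paper_id"], text)].append(row["id"])
    suspects = []
    for ids in groups.values():
        if len(ids) > 1:
            suspects.extend(ids)
    return sorted(suspects)
```

The heuristic will miss a parent header that was copied to only one subquestion, so it gives a lower bound on the affected rows rather than a complete list.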

---

## Backfill Requirements

### Script: `backfill_comp2211_tags.py`

Target: all `paper_questions` rows whose `paper_id` belongs to the COMP2211 course library.

For each question:

1. **Re-classify `analytics_topic`** using the new value list above
   - Use `question_text` + existing `topic_tags` + `skill_tags` as signals
   - If `analytics_topic` is currently `"KNN and Clustering"`:
     - Look at `skill_tags` and `question_text`
     - If `skill_tags` contains `centroid_update` or `algorithm_tracing`, or the text contains "K-Means" / "centroid" → set `"K-Means"`
     - Otherwise → set `"KNN"`
   - If `analytics_topic` is currently `"Perceptron and MLP"`:
     - If `question_text` or `skill_tags` references hidden layer, backprop, activation function → `"MLP"`
     - Otherwise → `"Perceptron"`
   - If `analytics_topic` is currently `"Probabilistic Models"`:
     - If Naive Bayes in text → `"Naive Bayes"`
     - Otherwise → `"Bayesian Inference"`
   - If `analytics_topic` is currently `"Evaluation and Validation"`:
     - If cross-validation, train/val split in text → `"Cross Validation"`
     - Otherwise → `"Evaluation Metrics"`
   - If `analytics_topic` is currently `"Search and Games"`:
     - If minimax, alpha-beta, game tree in text → `"Game Trees"`
     - Otherwise → `"Search Algorithms"`

2. **Rebuild `topic_tags`** — do not copy `analytics_topic`; derive from question content

3. **Rename `skill_tags`** — convert all snake_case values to Title Case per the mapping table above

4. **Do not overwrite `question_text`** in this pass (separate task)
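
The re-classification rules in step 1 are simple enough to express as a lookup table plus keyword matching. The following is a minimal sketch, assuming each question is available as a dict with the fields named above. The table names, the function name, and the exact keyword spellings are illustrative choices, and the sketch does not use `topic_tags` as a signal.

```python
# Sketch only: rule tables and function name are illustrative assumptions.
DIRECT_RENAMES = {
    "Vision and CNN": "CNN",
    "Python Fundamentals": "Python and NumPy",
}

# legacy topic -> (text keywords, skill keywords, value on match, fallback)
SPLIT_RULES = {
    "KNN and Clustering": (
        ("k-means", "centroid"),
        ("centroid update", "algorithm tracing"),
        "K-Means", "KNN"),
    "Perceptron and MLP": (
        ("hidden layer", "backprop", "activation function"),
        ("hidden layer", "backprop", "activation function"),
        "MLP", "Perceptron"),
    "Probabilistic Models": (
        ("naive bayes",), (), "Naive Bayes", "Bayesian Inference"),
    "Evaluation and Validation": (
        ("cross-validation", "cross validation", "train/val"), (),
        "Cross Validation", "Evaluation Metrics"),
    "Search and Games": (
        ("minimax", "alpha-beta", "game tree"), (),
        "Game Trees", "Search Algorithms"),
}


def reclassify_analytics_topic(question):
    """Map a legacy analytics_topic to the new value list."""
    topic = question.get("analytics_topic")
    if topic in DIRECT_RENAMES:
        return DIRECT_RENAMES[topic]
    if topic not in SPLIT_RULES:
        return topic  # already a new value, or unknown: leave untouched
    text_keys, skill_keys, on_match, fallback = SPLIT_RULES[topic]
    text = (question.get("question_text") or "").lower()
    # Normalise so "centroid_update" and "Centroid Update" compare equal.
    skills = [s.lower().replace("_", " ")
              for s in (question.get("skill_tags") or [])]
    hit = (any(k in text for k in text_keys)
           or any(k in s for s in skills for k in skill_keys))
    return on_match if hit else fallback
```

Keeping the rules in a table rather than in nested `if` statements makes it easier to review them against the list above and to extend them if another bucket needs splitting.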

---

## Retrieval Algorithm Changes (backend `questions.py`)

### Separate topic and skill contributions

Current `similarity_score()` merges `analytics_topic`, `topic_tags`, and `skill_tags` into one set. This causes skill tags like `centroid_update` to appear as "Shared topic: centroid_update" in the UI.

Required split:

```python
def similarity_score(target, candidate):
    """Score how similar `candidate` is to `target`; returns (score, reasons)."""
    score = 0
    reasons = []

    # 1. analytics_topic exact match: 40 pts
    if target.get("analytics_topic") and target["analytics_topic"] == candidate.get("analytics_topic"):
        score += 40
        reasons.append(f"Same topic: {target['analytics_topic']}")

    # 2. topic_tags overlap: up to 20 pts (10 per shared tag, max 2)
    target_tt = set(t.lower() for t in (target.get("topic_tags") or []))
    candidate_tt = set(t.lower() for t in (candidate.get("topic_tags") or []))
    shared_tt = target_tt & candidate_tt
    tt_pts = min(len(shared_tt) * 10, 20)
    if tt_pts:
        score += tt_pts
        reasons.append(f"Shared concept: {', '.join(sorted(shared_tt)[:2])}")

    # 3. skill_tags overlap: up to 20 pts (10 per shared tag, max 2)
    target_st = set(t.lower() for t in (target.get("skill_tags") or []))
    candidate_st = set(t.lower() for t in (candidate.get("skill_tags") or []))
    shared_st = target_st & candidate_st
    st_pts = min(len(shared_st) * 10, 20)
    if st_pts:
        score += st_pts
        reasons.append(f"Shared skill: {', '.join(sorted(shared_st)[:2])}")

    # 4. Same question format: 10 pts
    # (question_family() is not defined in this document; it is assumed to
    # be defined elsewhere in questions.py)
    if question_family(candidate) == question_family(target):
        score += 10
        reasons.append("Same format")

    # 5. Same difficulty: 5 pts
    if candidate.get("difficulty") and candidate["difficulty"] == target.get("difficulty"):
        score += 5
        reasons.append("Same difficulty")

    # 6. Full-text similarity: up to 20 pts (from tsvector RPC)
    #    (injected externally, not computed here)

    return min(score, 99), reasons
```
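
Step 6 is left to the caller, so the way the full-text component is merged is not specified here. One possible approach is sketched below, assuming the RPC result has already been normalised to a rank between 0 and 1. That normalisation, the function name `add_fulltext_points`, and the "Similar wording" reason label are all assumptions made for illustration.

```python
def add_fulltext_points(base_score, reasons, text_rank, max_points=20):
    """Fold a full-text rank in [0, 1] into the tag-based score (sketch)."""
    # Clamp defensively in case the rank falls outside the assumed range.
    clamped = max(0.0, min(float(text_rank), 1.0))
    points = round(clamped * max_points)
    new_reasons = list(reasons)  # do not mutate the caller's list
    if points:
        new_reasons.append("Similar wording")
    # Keep the same 99-point ceiling that similarity_score() applies.
    return min(base_score + points, 99), new_reasons
```

Reapplying the 99-point cap after the addition keeps the combined value on the same scale as the tag-based score, whose five components sum to at most 95.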

### Threshold and display

- Filter: drop candidates with `match_percent < 20` (threshold raised from 10; ensures `analytics_topic` at least partially matches)
- UI display: show `match_reasons` chips, but replace snake_case with Title Case before display
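
Both bullets can be illustrated with two small helpers. This is a sketch only: `filter_matches` and `format_reason_chip` are hypothetical names, and candidates are assumed to be dicts carrying a `match_percent` key.

```python
MIN_MATCH_PERCENT = 20  # threshold stated in this section


def filter_matches(candidates, threshold=MIN_MATCH_PERCENT):
    """Drop candidates scoring below the threshold, preserving order."""
    return [c for c in candidates if c["match_percent"] >= threshold]


def format_reason_chip(reason):
    """Rewrite snake_case values in a reason string as Title Case."""
    label, sep, values = reason.partition(": ")
    if not sep:
        return reason  # e.g. "Same format" has no value part

    def prettify(value):
        # Only touch snake_case values, so "K-Means" or "KNN" keep their case.
        if "_" not in value:
            return value
        return " ".join(w.capitalize() for w in value.split("_") if w)

    return f"{label}: {', '.join(prettify(v) for v in values.split(', '))}"
```

Values without underscores are left untouched on purpose: blindly title-casing them would turn "KNN" into "Knn".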

---

## Definition of Done

- [ ] All COMP2211 questions have `analytics_topic` from the new value list
- [ ] No `analytics_topic` value of `"KNN and Clustering"`, `"Perceptron and MLP"`, `"Probabilistic Models"`, `"Evaluation and Validation"`, `"Search and Games"` remains
- [ ] `topic_tags` contains 2–5 human-readable concept names, not a copy of `analytics_topic`
- [ ] `skill_tags` values are Title Case with spaces
- [ ] Similar question retrieval returns 0 cross-algorithm false positives between KNN and K-Means
- [ ] `match_reasons` chips in the UI show no underscores
- [ ] Retrieval threshold enforces `analytics_topic` match as a hard or near-hard requirement
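
The KNN / K-Means item in this checklist can be measured rather than inspected by eye. A possible check is sketched below, assuming retrieval output can be collected as `(target, candidate)` pairs of question dicts; the function name and that input shape are assumptions.

```python
CONFUSABLE_TOPICS = {"KNN", "K-Means"}


def count_cross_algorithm_hits(result_pairs):
    """Count retrieved pairs that mix KNN with K-Means (target vs candidate)."""
    count = 0
    for target, candidate in result_pairs:
        t_topic = target.get("analytics_topic")
        c_topic = candidate.get("analytics_topic")
        # A false positive is one topic from each side of the confusable pair.
        if t_topic in CONFUSABLE_TOPICS and c_topic in CONFUSABLE_TOPICS and t_topic != c_topic:
            count += 1
    return count
```

The checklist item is satisfied when this count is 0 over the retrieval results for every KNN and K-Means question.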