Initial commit: PastPaper Master full stack

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-21 12:15:35 +07:00
commit 7a09167261
105 changed files with 24799 additions and 0 deletions
--- a/docs/PAGE_NUMBER_BACKFILL.md
+++ b/docs/PAGE_NUMBER_BACKFILL.md
@@ -0,0 +1,152 @@
+# Sub-question Page Number Backfill — Requirements
+
+## Problem
+
+All six `split_comp2211_*.py` scripts create sub-questions by inheriting `page_number`
+from their parent question:
+
+```python
+"page_number": parent.get("page_number"),
+```
+
+This is wrong when a parent's sub-questions are spread across several pages. For example,
+Q1 True/False has 10 statements (a–j); if (a)–(f) are on page 1 and (g)–(j) are on page 2,
+all ten inherit page 1 from the parent. Clicking Q1h in the UI scrolls to page 1 instead of page 2.
+
+## Goal
+
+Every `ChildSpec` in every split script should carry its own correct `page_number`.
+When the script runs, it writes that page number to the database instead of inheriting
+from the parent.
+
+## Files to modify
+
+```
+backend/split_comp2211_2022_fall_midterm.py      ← does not exist yet; parent is seed SQL
+backend/split_comp2211_2022_spring_midterm.py
+backend/split_comp2211_2022_spring_final_part_a.py
+backend/split_comp2211_2022_spring_final_part_b.py
+backend/split_comp2211_2023_spring_midterm.py
+backend/split_comp2211_2024_spring_midterm.py
+backend/split_comp2211_2024_spring_final.py
+```
+
+Note: `2022-fall-midterm` sub-questions were inserted directly via the seed SQL
+(`supabase/seeds/comp2211_problem_level_questions.sql`), not via a split script.
+Their page numbers must be fixed directly in that SQL file or via a separate UPDATE.
+
+## How to determine page numbers
+
+Use PyMuPDF (`import pymupdf` — already in the venv) to search for question markers
+in the local PDF files. The PDFs are at:
+
+```
+../pastpaper-scraper/papers/COMP2211/<filename>
+```
+
+Filename mapping (from `upload_course_library_pdfs.py`):
+
+| Exam key | Local paper PDF |
+|----------|----------------|
+| COMP2211-2022-fall-midterm | (COMP2211)[2022](f)midterm~=yjz8dxdd^_27002.pdf |
+| COMP2211-2022-spring-midterm | (COMP2211)[2022](s)midterm~=b8bidkgs^_14629.pdf |
+| COMP2211-2022-spring-final-part-a | (COMP2211)[2022](s)final~=b8bidkgs^_33018.pdf |
+| COMP2211-2022-spring-final-part-b | (COMP2211)[2022](s)final~=b8bidkgs^_40627.pdf |
+| COMP2211-2023-spring-midterm | (COMP2211)[2023](s)midterm~=bxbidkmj^_26587.pdf |
+| COMP2211-2024-spring-midterm | (COMP2211)[2024](s)midterm~=rcidkjgf^_82003.pdf |
+| COMP2211-2024-spring-final | (COMP2211)[2024](s)final~=igk5mmg^_90365.pdf |
+
+### Suggested search strategy
+
+```python
+import pymupdf
+
+doc = pymupdf.open("path/to/paper.pdf")
+for page_num, page in enumerate(doc, start=1):
+    text = page.get_text()
+    print(f"--- Page {page_num} ---")
+    print(text[:500])
+```
+
+Search for markers like:
+- `"(a)"`, `"(b)"`, ... for True/False sub-statements
+- `"Q2(a)"`, `"2(a)"`, `"Question 2"` for major sub-questions
+- `"(i)"`, `"(ii)"` for nested sub-questions
+
+Page numbers are 1-indexed (matching the `page_number` field in the database).
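+
+The marker search above could be automated along these lines. This is a minimal
+sketch, not part of the existing scripts: `find_marker_pages` and its arguments are
+hypothetical names, and it works on plain page text so it can be tried without a PDF.
+
+```python
+def find_marker_pages(page_texts, markers):
+    """Map each marker to the 1-indexed pages whose text contains it."""
+    hits = {marker: [] for marker in markers}
+    for page_num, text in enumerate(page_texts, start=1):
+        for marker in markers:
+            if marker in text:
+                hits[marker].append(page_num)
+    return hits
+
+
+# With PyMuPDF, page_texts would come from the PDF, e.g.:
+#   page_texts = [page.get_text() for page in pymupdf.open("path/to/paper.pdf")]
+sample_pages = ["Question 1 (a) ... (f) ...", "(g) ... (j) ... Question 2"]
+print(find_marker_pages(sample_pages, ["(a)", "(g)", "Question 2"]))
+```
+
+A marker such as `(a)` usually appears under several questions, so the result is a
+list of candidate pages to confirm by eye, not a final answer.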
+
+## Code changes per split script
+
+### Step 1 — Add `page_number` field to `ChildSpec`
+
+Each script has its own `ChildSpec` dataclass. Add the field with a default so
+existing call sites don't break immediately:
+
+```python
+@dataclass(frozen=True)
+class ChildSpec:
+    ...
+    page_number: int = 1   # new field: 1-indexed PDF page; the default only keeps old call sites working
+```
+
+### Step 2 — Set correct page numbers in each `ChildSpec` instance
+
+Fill in the actual page after inspecting the PDF:
+
+```python
+ChildSpec("1a", "1", "1", ("a",), 1.5, "true_false", page_number=1),
+ChildSpec("1b", "1", "1", ("b",), 1.5, "true_false", page_number=1),
+...
+ChildSpec("1h", "1", "1", ("h",), 1.5, "true_false", page_number=2),
+```
+
+### Step 3 — Write `page_number` in the upsert payload
+
+Find where the script builds the INSERT/upsert dict and replace the inherited value:
+
+```python
+# Before:
+"page_number": parent.get("page_number"),
+
+# After:
+"page_number": child.page_number,
+```
+
+### Step 4 — Update existing rows in the database
+
+After modifying the scripts, run each script once — they already use upsert/update
+semantics, so re-running overwrites the old (inherited) page numbers with the correct ones.
+
+If a script does INSERT-only (not upsert), add a separate UPDATE pass:
+
+```python
+sb.table("paper_questions").update({"page_number": child.page_number}) \
+  .eq("paper_id", paper_id) \
+  .eq("question_number", child.question_number) \
+  .execute()
+```
+
+## 2022-fall-midterm (seed SQL)
+
+Sub-questions for this paper are in:
+`supabase/seeds/comp2211_problem_level_questions.sql`
+
+The seed has a `page_number` column in the VALUES rows. Find all rows for
+`COMP2211-2022-fall-midterm` and correct the values. Then run a direct UPDATE
+against the live database:
+
+```sql
+-- Example — adjust actual page numbers after inspecting the PDF
+UPDATE paper_questions
+SET page_number = 2
+WHERE paper_id = (SELECT id FROM papers WHERE source_exam_key = 'COMP2211-2022-fall-midterm')
+  AND question_number IN ('1g', '1h', '1i', '1j');
+```
+
+## Definition of Done
+
+- [ ] Every `ChildSpec` in every split script has an explicit `page_number`
+- [ ] No script uses `parent.get("page_number")` for the upsert payload
+- [ ] All six scripts have been re-run against the live database
+- [ ] 2022-fall-midterm sub-questions updated via SQL
+- [ ] Spot-check: clicking Q1h in a paper where Q1 spans 2 pages scrolls to page 2 in the UI
--- a/docs/TAGGING_REQUIREMENTS.md
+++ b/docs/TAGGING_REQUIREMENTS.md
@@ -0,0 +1,243 @@
+# Tag Schema & Similar Question Retrieval — Requirements
+
+## Background
+
+Current state of `paper_questions` tagging for COMP2211:
+
+- `analytics_topic`: 8 coarse buckets (e.g. "KNN and Clustering" covers both KNN and K-Means)
+- `topic_tags`: redundant copy of `analytics_topic`, adds no information
+- `skill_tags`: fine-grained snake_case labels (e.g. `centroid_update`, `distance_calculation`), not shown to users
+- `question_text`: present on every subquestion row, but currently stores the **parent problem header text**, not the actual subquestion statement
+
+The result is that similar question retrieval conflates KNN and K-Means, cannot distinguish "write code" from "trace algorithm", and produces low-precision recommendations.
+
+---
+
+## Goal
+
+Every subquestion should carry enough structured metadata that the retrieval system can return **questions testing the same topic and the same skill across different exam years**, rather than just questions from the same broad topic bucket.
+
+Precision target: a question on K-Means centroid update should retrieve other K-Means centroid update questions, not KNN distance questions.
+
+---
+
+## Field Definitions (revised)
+
+### `analytics_topic` — single string, primary retrieval bucket
+
+Granularity: **algorithm or concept level**, not course-section level.
+
+Allowed values for COMP2211 (replace current 8-bucket system):
+
+| New value | Replaces / splits |
+|-----------|-------------------|
+| `Naive Bayes` | Probabilistic Models (partial) |
+| `Bayesian Inference` | Probabilistic Models (partial) |
+| `KNN` | KNN and Clustering (partial) |
+| `K-Means` | KNN and Clustering (partial) |
+| `Perceptron` | Perceptron and MLP (partial) |
+| `MLP` | Perceptron and MLP (partial) |
+| `CNN` | Vision and CNN |
+| `Evaluation Metrics` | Evaluation and Validation (partial) |
+| `Cross Validation` | Evaluation and Validation (partial) |
+| `Python and NumPy` | Python Fundamentals |
+| `Search Algorithms` | Search and Games (partial) |
+| `Game Trees` | Search and Games (partial) |
+| `Ethics of AI` | Ethics of AI (unchanged) |
+
+Rules:
+- One value per question — pick the **most specific** algorithm being tested
+- If a subquestion genuinely spans two algorithms, pick the one the student is asked to compute or demonstrate
+- `True/False` is **not** a valid analytics_topic (it is a format, not a topic)
+
+---
+
+### `topic_tags` — string array, secondary topic labels
+
+Granularity: **concept and variant level** within the algorithm.
+
+Purpose: catch cross-topic overlaps and concept aliases.
+
+Examples:
+
+```
+analytics_topic = "K-Means"
+topic_tags = ["K-Means", "Centroid Update", "Convergence"]
+
+analytics_topic = "KNN"
+topic_tags = ["KNN", "Euclidean Distance", "Classification"]
+
+analytics_topic = "Naive Bayes"
+topic_tags = ["Naive Bayes", "Prior", "Likelihood", "Posterior"]
+
+analytics_topic = "Evaluation Metrics"
+topic_tags = ["Evaluation Metrics", "Precision", "Recall", "F1 Score"]
+
+analytics_topic = "MLP"
+topic_tags = ["MLP", "Backpropagation", "Activation Function", "Hidden Layer"]
+
+analytics_topic = "Python and NumPy"
+topic_tags = ["NumPy", "Broadcasting", "Array Indexing", "Vectorization"]
+```
+
+Rules:
+- First element should match or alias `analytics_topic`
+- Include concept names a student would search for ("F1 Score", not "metric_reasoning")
+- 2–5 tags per question; avoid over-tagging
+- Human-readable, title-case, no underscores
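+
+The rules above lend themselves to a mechanical check. The following is a rough
+sketch only: `check_topic_tags` is a hypothetical helper, and its title-case test
+(first character not lowercase) is deliberately loose so that tags such as `K-Means`
+or `NumPy` pass. It does not check the first-element rule, since an alias like
+`NumPy` for `Python and NumPy` cannot be recognized mechanically.
+
+```python
+def check_topic_tags(tags):
+    """Return a list of rule violations for one question's topic_tags."""
+    problems = []
+    if not 2 <= len(tags) <= 5:
+        problems.append(f"expected 2-5 tags, got {len(tags)}")
+    for tag in tags:
+        if "_" in tag:
+            problems.append(f"underscore in {tag!r}")
+        elif tag[:1].islower():
+            problems.append(f"not title-case: {tag!r}")
+    return problems
+
+
+print(check_topic_tags(["K-Means", "Centroid Update", "Convergence"]))
+print(check_topic_tags(["metric_reasoning"]))
+```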
+
+---
+
+### `skill_tags` — string array, task type labels
+
+Granularity: **what the student must do**, not what the topic is.
+
+Current values are acceptable in meaning but must be converted to human-readable form.
+
+Rename convention: `snake_case` → `Title Case with spaces`
+
+| Old | New |
+|-----|-----|
+| `concept_check` | `Concept Check` |
+| `code_tracing` | `Code Tracing` |
+| `algorithm_tracing` | `Algorithm Tracing` |
+| `distance_calculation` | `Distance Calculation` |
+| `centroid_update` | `Centroid Update` |
+| `weight_update` | `Weight Update` |
+| `decision_boundary` | `Decision Boundary` |
+| `implementation` | `Implementation` |
+| `debugging` | `Debugging` |
+| `model_selection` | `Model Selection` |
+| `concept_explanation` | `Concept Explanation` |
+| `architecture_reasoning` | `Architecture Reasoning` |
+| `convergence_reasoning` | `Convergence Reasoning` |
+| `generalization_reasoning` | `Generalization Reasoning` |
+| `classification_decision` | `Classification Decision` |
+
+Rules:
+- 1–3 tags per question
+- Describes the **task type**, not the subject matter
+- These are used for retrieval ranking, not primary display
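+
+Every entry in the mapping table follows the same mechanical rule, so the rename
+could be done with a small function instead of a lookup table. A minimal sketch,
+assuming a hypothetical helper name `to_title_case`; it would mangle acronyms
+(`knn` becomes `Knn`), which is acceptable only because none of the current skill
+tags contain one.
+
+```python
+def to_title_case(tag):
+    """Convert a snake_case skill tag to Title Case with spaces."""
+    words = tag.replace("_", " ").split()
+    return " ".join(word.capitalize() for word in words)
+
+
+for old in ["concept_check", "centroid_update", "generalization_reasoning"]:
+    print(old, "->", to_title_case(old))
+```
+
+Splitting on spaces as well as underscores keeps the function idempotent, so
+re-running the backfill on already converted tags leaves them unchanged.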
+
+---
+
+### `question_text` — the actual subquestion statement
+
+Current problem: subquestions store the **parent problem header** as `question_text`, not the individual statement.
+
+Required fix per subquestion type:
+
+| Type | What `question_text` should contain |
+|------|-------------------------------------|
+| True/False subquestion (Q1a–Q1j) | The specific T/F statement being judged |
+| Code output (Q2a_i–Q2a_v) | The specific code snippet + "What is the output?" |
+| Calculation subquestion (Q4a, Q5a) | The specific sub-task, e.g. "Compute the Euclidean distance between..." |
+| Written explanation (Q3, Q5c) | The full question prompt for that part |
+
+This is a **data extraction quality issue**. A dedicated backfill (separate from the tag backfill below, which leaves `question_text` untouched) must extract the correct per-subquestion text from the source PDF or from `raw_answer_text`.
+
+---
+
+## Backfill Requirements
+
+### Script: `backfill_comp2211_tags.py`
+
+Target: all `paper_questions` rows whose `paper_id` belongs to a paper in the COMP2211 course library.
+
+For each question:
+
+1. **Re-classify `analytics_topic`** using the new value list above
+   - Use `question_text` + existing `topic_tags` + `skill_tags` as signals
+   - If `analytics_topic` is currently `"KNN and Clustering"`:
+     - Look at `skill_tags` and `question_text`
+     - If `skill_tags` contains `centroid_update` or `algorithm_tracing`, or the text contains "K-Means" / "centroid" → set `"K-Means"`
+     - Otherwise → set `"KNN"`
+   - If `analytics_topic` is currently `"Perceptron and MLP"`:
+     - If `question_text` or `skill_tags` references hidden layer, backprop, activation function → `"MLP"`
+     - Otherwise → `"Perceptron"`
+   - If `analytics_topic` is currently `"Probabilistic Models"`:
+     - If the text mentions Naive Bayes → `"Naive Bayes"`
+     - Otherwise → `"Bayesian Inference"`
+   - If `analytics_topic` is currently `"Evaluation and Validation"`:
+     - If the text mentions cross-validation or a train/val split → `"Cross Validation"`
+     - Otherwise → `"Evaluation Metrics"`
+   - If `analytics_topic` is currently `"Search and Games"`:
+     - If the text mentions minimax, alpha-beta, or a game tree → `"Game Trees"`
+     - Otherwise → `"Search Algorithms"`
+
+2. **Rebuild `topic_tags`**: derive 2–5 concept tags from the question content instead of storing only a copy of `analytics_topic`
+
+3. **Rename `skill_tags`** — convert all snake_case values to Title Case per the mapping table above
+
+4. **Do not overwrite `question_text`** in this pass (separate task)
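+
+The re-classification rules in step 1 could be sketched as a single function. This
+is an illustration under assumptions, not the actual `backfill_comp2211_tags.py`:
+`reclassify_topic`, `mentions`, and `RENAMES` are hypothetical names, each row is
+assumed to be a plain dict, and only `question_text` is searched for keywords
+(the MLP rule above also allows `skill_tags` as a signal). Because `question_text`
+currently holds the parent header, keyword matches may be less reliable than they
+look here.
+
+```python
+# One-to-one renames taken from the "Replaces / splits" table.
+RENAMES = {"Vision and CNN": "CNN", "Python Fundamentals": "Python and NumPy"}
+
+
+def reclassify_topic(question):
+    """Return the new analytics_topic for one paper_questions row (a dict)."""
+    topic = question.get("analytics_topic")
+    text = (question.get("question_text") or "").lower()
+    skills = set(question.get("skill_tags") or [])
+
+    def mentions(*keywords):
+        return any(keyword in text for keyword in keywords)
+
+    if topic == "KNN and Clustering":
+        if skills & {"centroid_update", "algorithm_tracing"} or mentions("k-means", "centroid"):
+            return "K-Means"
+        return "KNN"
+    if topic == "Perceptron and MLP":
+        if mentions("hidden layer", "backprop", "activation function"):
+            return "MLP"
+        return "Perceptron"
+    if topic == "Probabilistic Models":
+        return "Naive Bayes" if mentions("naive bayes") else "Bayesian Inference"
+    if topic == "Evaluation and Validation":
+        if mentions("cross-validation", "cross validation", "train/val"):
+            return "Cross Validation"
+        return "Evaluation Metrics"
+    if topic == "Search and Games":
+        if mentions("minimax", "alpha-beta", "game tree"):
+            return "Game Trees"
+        return "Search Algorithms"
+    # Unchanged topics (e.g. "Ethics of AI") and simple renames.
+    return RENAMES.get(topic, topic)
+```
+
+The function reads the old snake_case `skill_tags`, so it would need to run before
+the rename in step 3.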
+
+---
+
+## Retrieval Algorithm Changes (backend `questions.py`)
+
+### Separate topic and skill contributions
+
+Current `similarity_score()` merges `analytics_topic`, `topic_tags`, and `skill_tags` into one set. This causes skill tags like `centroid_update` to appear as "Shared topic: centroid_update" in the UI.
+
+Required split:
+
+```python
+def similarity_score(target, candidate):
+    score = 0
+    reasons = []
+
+    # 1. analytics_topic exact match: 40 pts
+    if target.get("analytics_topic") and target["analytics_topic"] == candidate.get("analytics_topic"):
+        score += 40
+        reasons.append(f"Same topic: {target['analytics_topic']}")
+
+    # 2. topic_tags overlap: up to 20 pts (10 per shared tag, max 2)
+    target_tt = set(t.lower() for t in (target.get("topic_tags") or []))
+    candidate_tt = set(t.lower() for t in (candidate.get("topic_tags") or []))
+    shared_tt = target_tt & candidate_tt
+    tt_pts = min(len(shared_tt) * 10, 20)
+    if tt_pts:
+        score += tt_pts
+        reasons.append(f"Shared concept: {', '.join(sorted(shared_tt)[:2])}")
+
+    # 3. skill_tags overlap: up to 20 pts (10 per shared tag, max 2)
+    target_st = set(t.lower() for t in (target.get("skill_tags") or []))
+    candidate_st = set(t.lower() for t in (candidate.get("skill_tags") or []))
+    shared_st = target_st & candidate_st
+    st_pts = min(len(shared_st) * 10, 20)
+    if st_pts:
+        score += st_pts
+        reasons.append(f"Shared skill: {', '.join(sorted(shared_st)[:2])}")
+
+    # 4. Same question format: 10 pts
+    if question_family(candidate) == question_family(target):
+        score += 10
+        reasons.append("Same format")
+
+    # 5. Same difficulty: 5 pts
+    if candidate.get("difficulty") and candidate["difficulty"] == target.get("difficulty"):
+        score += 5
+        reasons.append("Same difficulty")
+
+    # 6. Full-text similarity: up to 20 pts (from tsvector RPC)
+    # (injected externally, not computed here)
+
+    return min(score, 99), reasons
+```
+
+### Threshold and display
+
+- Filter: drop candidates with `match_percent < 20` (raised from 10, to favor candidates that at least partially match on topic; note that skill and format overlap alone can also reach 20)
+- UI display: show `match_reasons` chips, but replace snake_case with Title Case before display
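+
+To see how the weights interact with the threshold, the point values from
+`similarity_score()` can be tabulated for a few cases. A sketch only: `base_points`
+is a hypothetical helper that mirrors rules 1–5 and ignores the full-text component,
+and `THRESHOLD` stands in for the filter value above.
+
+```python
+def base_points(same_topic, shared_topic_tags, shared_skill_tags, same_format, same_difficulty):
+    """Points from rules 1-5 of similarity_score(), before any full-text points."""
+    points = 40 if same_topic else 0
+    points += min(shared_topic_tags * 10, 20)
+    points += min(shared_skill_tags * 10, 20)
+    points += 10 if same_format else 0
+    points += 5 if same_difficulty else 0
+    return points
+
+
+THRESHOLD = 20  # candidates scoring below this are filtered out
+
+cases = {
+    "same topic only": base_points(True, 0, 0, False, False),
+    "two shared skills, different topic": base_points(False, 0, 2, False, False),
+    "same format and difficulty only": base_points(False, 0, 0, True, True),
+}
+for label, points in cases.items():
+    print(f"{label}: {points} pts, kept={points >= THRESHOLD}")
+```
+
+The three cases score 40, 20, and 15 points, so only the last is filtered out.
+Under these weights, a hard topic requirement would need an explicit
+`analytics_topic` comparison in addition to the threshold.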
+
+---
+
+## Definition of Done
+
+- [ ] All COMP2211 questions have `analytics_topic` from the new value list
+- [ ] No `analytics_topic` value of `"KNN and Clustering"`, `"Perceptron and MLP"`, `"Probabilistic Models"`, `"Evaluation and Validation"`, `"Search and Games"` remains
+- [ ] `topic_tags` contains 2–5 human-readable concept names, not a copy of `analytics_topic`
+- [ ] `skill_tags` values are Title Case with spaces
+- [ ] Similar question retrieval returns 0 cross-algorithm false positives between KNN and K-Means
+- [ ] `match_reasons` chips in the UI show no underscores
+- [ ] Retrieval threshold enforces `analytics_topic` match as a hard or near-hard requirement