PastpaperMaster/docs/PAGE_NUMBER_BACKFILL.md
Zhao 7a09167261 Initial commit: PastPaper Master full stack
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-21 12:27:47 +07:00

# Sub-question Page Number Backfill — Requirements
## Problem
All six `split_comp2211_*.py` scripts create sub-questions by inheriting `page_number`
from their parent question:
```python
"page_number": parent.get("page_number"),
```
This is wrong when a parent question's sub-questions span multiple pages. For example,
Q1 True/False has 10 statements, (a) to (j); if (a) to (f) are on page 1 and (g) to (j)
are on page 2, all ten inherit page 1 from the parent. Clicking Q1h in the UI then
scrolls to page 1 instead of page 2.
## Goal
Every `ChildSpec` in every split script should carry its own correct `page_number`.
When the script runs, it writes that page number to the database instead of inheriting
from the parent.
## Files to modify
```
backend/split_comp2211_2022_fall_midterm.py ← does not exist yet; parent is seed SQL
backend/split_comp2211_2022_spring_midterm.py
backend/split_comp2211_2022_spring_final_part_a.py
backend/split_comp2211_2022_spring_final_part_b.py
backend/split_comp2211_2023_spring_midterm.py
backend/split_comp2211_2024_spring_midterm.py
backend/split_comp2211_2024_spring_final.py
```
Note: `2022-fall-midterm` sub-questions were inserted directly via the seed SQL
(`supabase/seeds/comp2211_problem_level_questions.sql`), not via a split script.
Their page numbers must be fixed directly in that SQL file or via a separate UPDATE.
## How to determine page numbers
Use PyMuPDF (`import pymupdf` — already in the venv) to search for question markers
in the local PDF files. The PDFs are at:
```
../pastpaper-scraper/papers/COMP2211/<filename>
```
Filename mapping (from `upload_course_library_pdfs.py`):
| Exam key | Local paper PDF |
|----------|-----------------|
| COMP2211-2022-fall-midterm | `(COMP2211)[2022](f)midterm~=yjz8dxdd^_27002.pdf` |
| COMP2211-2022-spring-midterm | `(COMP2211)[2022](s)midterm~=b8bidkgs^_14629.pdf` |
| COMP2211-2022-spring-final-part-a | `(COMP2211)[2022](s)final~=b8bidkgs^_33018.pdf` |
| COMP2211-2022-spring-final-part-b | `(COMP2211)[2022](s)final~=b8bidkgs^_40627.pdf` |
| COMP2211-2023-spring-midterm | `(COMP2211)[2023](s)midterm~=bxbidkmj^_26587.pdf` |
| COMP2211-2024-spring-midterm | `(COMP2211)[2024](s)midterm~=rcidkjgf^_82003.pdf` |
| COMP2211-2024-spring-final | `(COMP2211)[2024](s)final~=igk5mmg^_90365.pdf` |
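
For scripted inspection, the mapping above can be kept as a small lookup. This is only a sketch: `PAPERS_DIR`, `PAPER_FILES`, and `paper_path` are hypothetical names introduced here, and the directory is the relative path given above, so it assumes the working directory the path was written for.

```python
from pathlib import Path

# Relative location of the local PDFs, as stated in this document.
PAPERS_DIR = Path("../pastpaper-scraper/papers/COMP2211")

# Exam key -> local PDF filename, copied from the mapping table.
PAPER_FILES = {
    "COMP2211-2022-fall-midterm": "(COMP2211)[2022](f)midterm~=yjz8dxdd^_27002.pdf",
    "COMP2211-2022-spring-midterm": "(COMP2211)[2022](s)midterm~=b8bidkgs^_14629.pdf",
    "COMP2211-2022-spring-final-part-a": "(COMP2211)[2022](s)final~=b8bidkgs^_33018.pdf",
    "COMP2211-2022-spring-final-part-b": "(COMP2211)[2022](s)final~=b8bidkgs^_40627.pdf",
    "COMP2211-2023-spring-midterm": "(COMP2211)[2023](s)midterm~=bxbidkmj^_26587.pdf",
    "COMP2211-2024-spring-midterm": "(COMP2211)[2024](s)midterm~=rcidkjgf^_82003.pdf",
    "COMP2211-2024-spring-final": "(COMP2211)[2024](s)final~=igk5mmg^_90365.pdf",
}


def paper_path(exam_key: str) -> Path:
    """Return the local PDF path for an exam key (raises KeyError if unknown)."""
    return PAPERS_DIR / PAPER_FILES[exam_key]
```

The filenames contain brackets and parentheses, so quote them when passing a path to a shell.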
### Suggested search strategy
```python
import pymupdf

doc = pymupdf.open("path/to/paper.pdf")
# Dump the first 500 characters of each page, numbering pages from 1.
for page_num, page in enumerate(doc, start=1):
    text = page.get_text()
    print(f"--- Page {page_num} ---")
    print(text[:500])
```
Search for markers like:
- `"(a)"`, `"(b)"`, ... for True/False sub-statements
- `"Q2(a)"`, `"2(a)"`, `"Question 2"` for major sub-questions
- `"(i)"`, `"(ii)"` for nested sub-questions

Page numbers are 1-indexed (matching the `page_number` field in the database).
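
The marker search can be partly automated. The sketch below is one possible heuristic, not part of the existing scripts: `locate_markers` is a hypothetical helper that takes the per-page text (for example `[page.get_text() for page in doc]`) and a list of markers in reading order, and returns the 1-indexed page where each marker first appears after the previous one. It assumes the markers occur in order and that the PDF text extracts cleanly, so the result should still be checked against the PDF by eye.

```python
def locate_markers(page_texts, markers):
    """Map each marker to the 1-indexed page where it first appears.

    Markers are expected in reading order: the search for each marker
    resumes just after the previous match, so a stray "(a)" later in the
    paper does not override an earlier hit. Unfound markers map to None.
    """
    result = {}
    page_idx, offset = 0, 0  # where the next search starts
    for marker in markers:
        result[marker] = None
        p, off = page_idx, offset
        while p < len(page_texts):
            pos = page_texts[p].find(marker, off)
            if pos != -1:
                result[marker] = p + 1  # convert to 1-indexed page number
                page_idx, offset = p, pos + len(marker)
                break
            p, off = p + 1, 0  # continue from the top of the next page
    return result


# Illustrative text only; real input comes from page.get_text().
pages = ["Q1 (a) foo (b) bar", "(c) baz (a) unrelated"]
print(locate_markers(pages, ["(a)", "(b)", "(c)"]))
```

A marker that maps to `None` usually means the extracted text formats it differently (for example `a)` or `(a )`), which is a cue to inspect that page manually.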
## Code changes per split script
### Step 1 — Add `page_number` field to `ChildSpec`
Each script has its own `ChildSpec` dataclass. Add the field with a default so
existing call sites don't break immediately:
```python
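from dataclasses import dataclass

@dataclass(frozen=True)
class ChildSpec:
    ...  # existing fields stay unchanged
    page_number: int = 1  # new field; must come after all fields without defaults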
```
### Step 2 — Set correct page numbers in each `ChildSpec` instance
Fill in the actual page after inspecting the PDF:
```python
ChildSpec("1a", "1", "1", ("a",), 1.5, "true_false", page_number=1),
ChildSpec("1b", "1", "1", ("b",), 1.5, "true_false", page_number=1),
...
ChildSpec("1h", "1", "1", ("h",), 1.5, "true_false", page_number=2),
```
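
Because the Step 1 default is `1`, a `ChildSpec` that was never updated is indistinguishable from one that really sits on page 1. One possible variant, sketched below under assumptions, is to default to `None` while the backfill is in progress and fail fast on anything still unset. The two-field `ChildSpec` here is a hypothetical, trimmed-down stand-in for the real dataclass, and `missing_page_numbers` is a helper name introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChildSpec:
    # Hypothetical minimal field set; the real scripts define more fields.
    question_number: str
    page_number: Optional[int] = None  # None means "not filled in yet"


def missing_page_numbers(children):
    """Return the question numbers of specs that still lack a page number."""
    return [c.question_number for c in children if c.page_number is None]


children = [
    ChildSpec("1a", page_number=1),
    ChildSpec("1h"),  # forgotten during the backfill
]
unset = missing_page_numbers(children)
if unset:
    print(f"page_number not set for: {unset}")
```

Running a check like this before the upsert keeps a forgotten spec from silently writing the default to the database.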
### Step 3 — Write `page_number` in the upsert payload
Find where the script builds the INSERT/upsert dict and replace the inherited value:
```python
# Before:
"page_number": parent.get("page_number"),
# After:
"page_number": child.page_number,
```
### Step 4 — Update existing rows in the database
After modifying the scripts, run each script once — they already use upsert/update
semantics, so re-running overwrites the old (inherited) page numbers with the correct ones.
If a script does INSERT-only (not upsert), add a separate UPDATE pass:
```python
sb.table("paper_questions").update({"page_number": child.page_number}) \
    .eq("paper_id", paper_id) \
    .eq("question_number", child.question_number) \
    .execute()
```
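
Wrapped in a loop, the UPDATE pass might look like the sketch below. `backfill_page_numbers` is a hypothetical helper; it assumes `sb` is the client object the script already uses, that `paper_id` has been resolved earlier in the script, and that each child exposes `question_number` and `page_number` as in the snippet above.

```python
def backfill_page_numbers(sb, paper_id, children):
    """Issue one UPDATE per child and return how many were sent."""
    sent = 0
    for child in children:
        (
            sb.table("paper_questions")
            .update({"page_number": child.page_number})
            .eq("paper_id", paper_id)
            .eq("question_number", child.question_number)
            .execute()
        )
        sent += 1
    return sent
```

The count is the number of requests sent, not the number of rows changed; a `question_number` that matches no row still counts, so a follow-up query is the reliable way to confirm the new values.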
## 2022-fall-midterm (seed SQL)
Sub-questions for this paper are in:
`supabase/seeds/comp2211_problem_level_questions.sql`
The seed has a `page_number` column in the VALUES rows. Find all rows for
`COMP2211-2022-fall-midterm` and correct the values. Then run a direct UPDATE
against the live database:
```sql
-- Example — adjust actual page numbers after inspecting the PDF
UPDATE paper_questions
SET page_number = 2
WHERE paper_id = (SELECT id FROM papers WHERE source_exam_key = 'COMP2211-2022-fall-midterm')
AND question_number IN ('1g', '1h', '1i', '1j');
```
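
If several pages need correcting, the UPDATE statements can be generated from a question-to-page mapping rather than typed by hand. This is a sketch only: `build_page_updates` is a hypothetical helper, the mapping values below are placeholders to be replaced after inspecting the PDF, and the plain string formatting is acceptable only because the inputs are hand-written, trusted values.

```python
from collections import defaultdict

EXAM_KEY = "COMP2211-2022-fall-midterm"


def build_page_updates(pages_by_question, exam_key=EXAM_KEY):
    """Return one UPDATE statement per page, grouping question numbers."""
    by_page = defaultdict(list)
    for question_number, page in pages_by_question.items():
        by_page[int(page)].append(question_number)

    statements = []
    for page in sorted(by_page):
        in_list = ", ".join(f"'{q}'" for q in sorted(by_page[page]))
        statements.append(
            "UPDATE paper_questions\n"
            f"SET page_number = {page}\n"
            "WHERE paper_id = (SELECT id FROM papers "
            f"WHERE source_exam_key = '{exam_key}')\n"
            f"  AND question_number IN ({in_list});"
        )
    return statements


# Placeholder pages; confirm each one against the PDF before running.
for stmt in build_page_updates({"1g": 2, "1h": 2, "1i": 2, "1j": 2}):
    print(stmt)
```

Review the printed SQL before running it against the live database.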
## Definition of Done
- [ ] Every `ChildSpec` in every split script has an explicit `page_number`
- [ ] No script uses `parent.get("page_number")` for the upsert payload
- [ ] All six scripts have been re-run against the live database
- [ ] 2022-fall-midterm sub-questions updated via SQL
- [ ] Spot-check: clicking Q1h in a paper where Q1 spans 2 pages scrolls to page 2 in the UI
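
The second checklist item can be checked mechanically. The sketch below scans the split scripts for the inherited lookup; `find_inherited_page_numbers` is a hypothetical helper, and it assumes it is run from the repository root so that `backend/` resolves. It matches the literal text only, so a differently spelled lookup (for example `parent["page_number"]`) would need its own pattern.

```python
from pathlib import Path

PATTERN = 'parent.get("page_number")'


def find_inherited_page_numbers(backend_dir="backend"):
    """Return (filename, line number) for every remaining inherited lookup."""
    hits = []
    for script in sorted(Path(backend_dir).glob("split_comp2211_*.py")):
        lines = script.read_text(encoding="utf-8").splitlines()
        for line_no, line in enumerate(lines, start=1):
            if PATTERN in line:
                hits.append((script.name, line_no))
    return hits


for name, line_no in find_inherited_page_numbers():
    print(f"{name}:{line_no} still inherits page_number from the parent")
```

An empty result means no script contains the literal pattern; it does not by itself prove that every `ChildSpec` carries the right page.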