PastpaperMaster/docs/PAGE_NUMBER_BACKFILL.md

# Sub-question Page Number Backfill — Requirements

## Problem

All six `split_comp2211_*.py` scripts create sub-questions by inheriting `page_number`
from their parent question:

```python
"page_number": parent.get("page_number"),
```

This is wrong when a parent question spans multiple pages. For example, Q1 True/False
has 10 statements (a–j); if (a)–(f) are on page 1 and (g)–(j) are on page 2, all ten
inherit page 1 from the parent. Clicking Q1h in the UI then scrolls to page 1 instead of page 2.

## Goal

Every `ChildSpec` in every split script should carry its own correct `page_number`.
When the script runs, it writes that page number to the database instead of inheriting
from the parent.

## Files to modify

```
backend/split_comp2211_2022_fall_midterm.py      ← does not exist yet; parent is seed SQL
backend/split_comp2211_2022_spring_midterm.py
backend/split_comp2211_2022_spring_final_part_a.py
backend/split_comp2211_2022_spring_final_part_b.py
backend/split_comp2211_2023_spring_midterm.py
backend/split_comp2211_2024_spring_midterm.py
backend/split_comp2211_2024_spring_final.py
```

Note: `2022-fall-midterm` sub-questions were inserted directly via the seed SQL
(`supabase/seeds/comp2211_problem_level_questions.sql`), not via a split script.
Their page numbers must be fixed in that SQL file and, for rows already in the live
database, via a separate UPDATE (see the 2022-fall-midterm section).

## How to determine page numbers

Use PyMuPDF (`import pymupdf` — already in the venv) to search for question markers
in the local PDF files. The PDFs are at:

```
../pastpaper-scraper/papers/COMP2211/<filename>
```

Filename mapping (from `upload_course_library_pdfs.py`):

| Exam key | Local paper PDF |
|----------|----------------|
| COMP2211-2022-fall-midterm | `(COMP2211)[2022](f)midterm~=yjz8dxdd^_27002.pdf` |
| COMP2211-2022-spring-midterm | `(COMP2211)[2022](s)midterm~=b8bidkgs^_14629.pdf` |
| COMP2211-2022-spring-final-part-a | `(COMP2211)[2022](s)final~=b8bidkgs^_33018.pdf` |
| COMP2211-2022-spring-final-part-b | `(COMP2211)[2022](s)final~=b8bidkgs^_40627.pdf` |
| COMP2211-2023-spring-midterm | `(COMP2211)[2023](s)midterm~=bxbidkmj^_26587.pdf` |
| COMP2211-2024-spring-midterm | `(COMP2211)[2024](s)midterm~=rcidkjgf^_82003.pdf` |
| COMP2211-2024-spring-final | `(COMP2211)[2024](s)final~=igk5mmg^_90365.pdf` |

### Suggested search strategy

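```python
import pymupdf

doc = pymupdf.open("path/to/paper.pdf")
# Dump the start of each page to see how the paper writes its question markers.
for page_num, page in enumerate(doc, start=1):  # 1-indexed, like the database
    text = page.get_text()
    print(f"--- Page {page_num} ---")
    print(text[:500])  # truncated preview; markers further down are not shown
```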

Search for markers like:
- `"(a)"`, `"(b)"`, ... for True/False sub-statements
- `"Q2(a)"`, `"2(a)"`, `"Question 2"` for major sub-questions
- `"(i)"`, `"(ii)"` for nested sub-questions

Page numbers are 1-indexed (matching the `page_number` field in the database).
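
Once the marker spelling is known, the lookup itself can be automated. The following is a minimal sketch, not part of the existing scripts: `first_page_of_markers` is a hypothetical helper that takes the per-page text as a plain list of strings and reports the first 1-indexed page on which each marker appears. The page texts below are stand-ins, not content from a real paper.

```python
def first_page_of_markers(page_texts, markers):
    """Map each marker to the 1-indexed page where it first appears.

    page_texts: list of strings, one per page, in page order.
    Markers that never appear map to None so they can be checked by hand.
    """
    found = {}
    for marker in markers:
        found[marker] = next(
            (num for num, text in enumerate(page_texts, start=1) if marker in text),
            None,
        )
    return found


# Stand-in page texts; with a real paper these would come from
# [page.get_text() for page in pymupdf.open(pdf_path)].
pages = [
    "Question 1 (a) ... (b) ... (f) ...",
    "(g) ... (h) ... (j) ... Question 2 (a) ...",
]
print(first_page_of_markers(pages, ["(a)", "(h)", "Question 2", "(z)"]))
```

Short markers such as `"(a)"` recur under every major question, so a first-match lookup like this is only reliable for the first question that uses them. For later questions, the search would need to start from the page where the parent marker (for example `"Question 2"`) was found, and the result should still be confirmed against the PDF.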

## Code changes per split script

### Step 1 — Add `page_number` field to `ChildSpec`

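Each script has its own `ChildSpec` dataclass. Add the field with a default so
existing call sites don't break immediately. Put it after the fields that have no
default, since dataclass fields with defaults must come last. The default is only a
placeholder until Step 2 sets every value explicitly: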

```python
@dataclass(frozen=True)
class ChildSpec:
    ...
    page_number: int = 1   # add this field
```

### Step 2 — Set correct page numbers in each `ChildSpec` instance

Fill in the actual page after inspecting the PDF:

```python
ChildSpec("1a", "1", "1", ("a",), 1.5, "true_false", page_number=1),
ChildSpec("1b", "1", "1", ("b",), 1.5, "true_false", page_number=1),
...
ChildSpec("1h", "1", "1", ("h",), 1.5, "true_false", page_number=2),
```
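
Because a default value lets a forgotten entry pass silently, it may help to check that every spec really was filled in. The sketch below is one possible approach under stated assumptions: the `ChildSpec` shown is a hypothetical, trimmed-down stand-in (the real dataclasses have more fields), it uses `None` rather than `1` as the default so that unset entries are detectable, and `missing_page_numbers` is an illustrative helper, not an existing function.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChildSpec:
    # Hypothetical stand-in for a script's real ChildSpec.
    question_number: str
    question_type: str
    page_number: Optional[int] = None  # None marks "not filled in yet"


def missing_page_numbers(specs):
    """Return question numbers whose page_number is unset or not a positive int."""
    return [
        s.question_number
        for s in specs
        if not isinstance(s.page_number, int) or s.page_number < 1
    ]


specs = [
    ChildSpec("1a", "true_false", page_number=1),
    ChildSpec("1h", "true_false", page_number=2),
    ChildSpec("2a", "short_answer"),  # page number forgotten
]
print(missing_page_numbers(specs))
```

Running a check like this before the upsert would stop a script from writing a placeholder page number to the database.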

### Step 3 — Write `page_number` in the upsert payload

Find where the script builds the INSERT/upsert dict and replace the inherited value:

```python
# Before:
"page_number": parent.get("page_number"),

# After:
"page_number": child.page_number,
```

### Step 4 — Update existing rows in the database

After modifying the scripts, run each script once. They already use upsert/update
semantics, so re-running overwrites the old (inherited) page numbers with the correct ones.

If a script turns out to be INSERT-only (no upsert), add a separate UPDATE pass:

```python
sb.table("paper_questions").update({"page_number": child.page_number}) \
  .eq("paper_id", paper_id) \
  .eq("question_number", child.question_number) \
  .execute()
```
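
Applied to a whole paper, the UPDATE pass is a loop over the script's child specs. The sketch below wraps the call above in a hypothetical helper, `backfill_page_numbers`; it assumes `sb` is the Supabase client the script already holds and that each child exposes `question_number` and `page_number` attributes.

```python
def backfill_page_numbers(sb, paper_id, children):
    """Send one UPDATE per child and return how many updates were sent.

    Assumes sb is a Supabase client and each child has question_number
    and page_number attributes (names taken from the snippet above).
    """
    sent = 0
    for child in children:
        (
            sb.table("paper_questions")
            .update({"page_number": child.page_number})
            .eq("paper_id", paper_id)
            .eq("question_number", child.question_number)
            .execute()
        )
        sent += 1
    return sent
```

The returned count only reflects requests sent, not rows changed; a question number that does not exist for the paper would match nothing, so spot-checking the table afterwards is still worthwhile.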

## 2022-fall-midterm (seed SQL)

Sub-questions for this paper are in:
`supabase/seeds/comp2211_problem_level_questions.sql`

The seed has a `page_number` column in the VALUES rows. Find all rows for
`COMP2211-2022-fall-midterm` and correct the values. Then run a direct UPDATE
against the live database:

```sql
-- Example — adjust actual page numbers after inspecting the PDF
UPDATE paper_questions
SET page_number = 2
WHERE paper_id = (SELECT id FROM papers WHERE source_exam_key = 'COMP2211-2022-fall-midterm')
  AND question_number IN ('1g', '1h', '1i', '1j');
```
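
If many sub-questions need correcting, the statements can be generated from a question-to-page mapping instead of written by hand. This is a rough sketch only: `build_updates` is a hypothetical helper, the table and column names are copied from the example above, and the page numbers in the usage lines are placeholders rather than values read from the PDF. The output is meant to be reviewed and pasted, not executed on untrusted input.

```python
from collections import defaultdict


def build_updates(exam_key, page_by_question):
    """Return one UPDATE statement per page, covering all its sub-questions."""
    by_page = defaultdict(list)
    for question, page in page_by_question.items():
        by_page[page].append(question)

    def quote(value):
        # Double any single quotes so the value is a valid SQL string literal.
        return "'" + value.replace("'", "''") + "'"

    statements = []
    for page in sorted(by_page):
        questions = ", ".join(quote(q) for q in sorted(by_page[page]))
        statements.append(
            "UPDATE paper_questions\n"
            f"SET page_number = {int(page)}\n"
            "WHERE paper_id = (SELECT id FROM papers "
            f"WHERE source_exam_key = {quote(exam_key)})\n"
            f"  AND question_number IN ({questions});"
        )
    return statements


# Placeholder page numbers, not read from the actual PDF.
for stmt in build_updates("COMP2211-2022-fall-midterm", {"1a": 1, "1g": 2, "1h": 2}):
    print(stmt)
```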

## Definition of Done

- [ ] Every `ChildSpec` in every split script has an explicit `page_number`
- [ ] No script uses `parent.get("page_number")` for the upsert payload
- [ ] All six scripts have been re-run against the live database
- [ ] 2022-fall-midterm sub-questions updated via SQL
- [ ] Spot-check: clicking Q1h in a paper where Q1 spans 2 pages scrolls to page 2 in the UI