# Sub-question Page Number Backfill: Requirements

## Problem

All six `split_comp2211_*.py` scripts create sub-questions by inheriting `page_number`
from their parent question:

```python
"page_number": parent.get("page_number"),
```

This is wrong when the parent question spans multiple pages. For example, Q1 True/False
has 10 statements (a–j); if (a)–(f) are on page 1 and (g)–(j) are on page 2, all ten
inherit page 1 from the parent. Clicking Q1h in the UI then scrolls to page 1 instead of page 2.

## Goal

Every `ChildSpec` in every split script should carry its own correct `page_number`.
When the script runs, it writes that page number to the database instead of inheriting
it from the parent.

## Files to modify

```
backend/split_comp2211_2022_fall_midterm.py ← does not exist yet; parent is seed SQL
backend/split_comp2211_2022_spring_midterm.py
backend/split_comp2211_2022_spring_final_part_a.py
backend/split_comp2211_2022_spring_final_part_b.py
backend/split_comp2211_2023_spring_midterm.py
backend/split_comp2211_2024_spring_midterm.py
backend/split_comp2211_2024_spring_final.py
```

Note: the `2022-fall-midterm` sub-questions were inserted directly via the seed SQL
(`supabase/seeds/comp2211_problem_level_questions.sql`), not via a split script.
Their page numbers must be fixed directly in that SQL file and applied to the live
database with a separate UPDATE.

## How to determine page numbers

Use PyMuPDF (`import pymupdf`, already in the venv) to search for question markers
in the local PDF files. The PDFs are at:

```
../pastpaper-scraper/papers/COMP2211/<filename>
```

Filename mapping (from `upload_course_library_pdfs.py`):

| Exam key | Local paper PDF |
|----------|-----------------|
| COMP2211-2022-fall-midterm | `(COMP2211)[2022](f)midterm~=yjz8dxdd^_27002.pdf` |
| COMP2211-2022-spring-midterm | `(COMP2211)[2022](s)midterm~=b8bidkgs^_14629.pdf` |
| COMP2211-2022-spring-final-part-a | `(COMP2211)[2022](s)final~=b8bidkgs^_33018.pdf` |
| COMP2211-2022-spring-final-part-b | `(COMP2211)[2022](s)final~=b8bidkgs^_40627.pdf` |
| COMP2211-2023-spring-midterm | `(COMP2211)[2023](s)midterm~=bxbidkmj^_26587.pdf` |
| COMP2211-2024-spring-midterm | `(COMP2211)[2024](s)midterm~=rcidkjgf^_82003.pdf` |
| COMP2211-2024-spring-final | `(COMP2211)[2024](s)final~=igk5mmg^_90365.pdf` |

### Suggested search strategy

```python
import pymupdf

doc = pymupdf.open("path/to/paper.pdf")
# Dump the start of each page to see where each question begins.
for page_num, page in enumerate(doc, start=1):  # 1-indexed, like `page_number`
    text = page.get_text()
    print(f"--- Page {page_num} ---")
    print(text[:500])
```

Search for markers like:
- `"(a)"`, `"(b)"`, ... for True/False sub-statements
- `"Q2(a)"`, `"2(a)"`, `"Question 2"` for major sub-questions
- `"(i)"`, `"(ii)"` for nested sub-questions

Page numbers are 1-indexed (matching the `page_number` field in the database).

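The marker search can be partly automated. The following is a minimal sketch, not part of the existing scripts: `first_page_of_markers` is a hypothetical helper that takes the per-page text (for example `[page.get_text() for page in doc]`) and an ordered list of markers, and returns the 1-indexed page where each marker first appears. It assumes the markers occur in the PDF text in the same order as they are listed, and that text extraction preserves them literally, so the results still need a manual check against the PDF.

```python
from typing import Dict, List, Optional


def first_page_of_markers(
    page_texts: List[str], markers: List[str]
) -> Dict[str, Optional[int]]:
    """Return the 1-indexed page on which each marker first appears.

    Markers are searched in order: the search for each marker resumes where
    the previous one was found, so a later "(a)" belonging to another
    question is not matched too early. Unfound markers map to None.
    """
    result: Dict[str, Optional[int]] = {}
    page_idx = 0  # 0-based index of the page where the search resumes
    offset = 0    # character offset within that page
    for marker in markers:
        found: Optional[int] = None
        idx, off = page_idx, offset
        while idx < len(page_texts):
            pos = page_texts[idx].find(marker, off)
            if pos != -1:
                found = idx + 1  # convert to 1-indexed page number
                page_idx, offset = idx, pos + len(marker)
                break
            idx, off = idx + 1, 0
        result[marker] = found
    return result


# Hypothetical usage with PyMuPDF (not executed here):
#   doc = pymupdf.open("path/to/paper.pdf")
#   pages = first_page_of_markers([p.get_text() for p in doc], ["(a)", "(b)"])
```

A marker that maps to `None` usually means the PDF text renders it differently (extra whitespace, a different bracket style), in which case the page has to be read off the dump manually.
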
## Code changes per split script

### Step 1: Add `page_number` field to `ChildSpec`

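Each script has its own `ChildSpec` dataclass. Add the field with a default so
existing call sites don't break immediately. Because it has a default, it must come
after any fields without defaults:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChildSpec:
    ...  # existing fields stay as they are
    page_number: int = 1  # new field; the default keeps existing call sites working
```
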
### Step 2: Set correct page numbers in each `ChildSpec` instance

Fill in the actual page after inspecting the PDF:

```python
ChildSpec("1a", "1", "1", ("a",), 1.5, "true_false", page_number=1),
ChildSpec("1b", "1", "1", ("b",), 1.5, "true_false", page_number=1),
...
ChildSpec("1h", "1", "1", ("h",), 1.5, "true_false", page_number=2),
```

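Hand-entered page numbers are easy to mistype, so a quick sanity check before writing to the database may help. The sketch below is an illustration only: the trimmed-down `ChildSpec` and the `check_page_numbers` helper are hypothetical, and the check assumes the specs are listed in paper order and that later sub-questions never appear on an earlier page than the ones before them.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ChildSpec:
    # Hypothetical, trimmed-down field set; the real scripts define more fields.
    question_number: str
    page_number: int = 1


def check_page_numbers(children: List[ChildSpec], page_count: int) -> List[str]:
    """Return human-readable problems; an empty list means nothing looks off."""
    problems: List[str] = []
    previous = 1  # highest valid page seen so far
    for child in children:
        if not 1 <= child.page_number <= page_count:
            problems.append(
                f"{child.question_number}: page {child.page_number} "
                f"outside 1..{page_count}"
            )
        elif child.page_number < previous:
            problems.append(
                f"{child.question_number}: page {child.page_number} "
                f"precedes page {previous}"
            )
        else:
            previous = child.page_number
    return problems
```

Passing the PDF's page count (`len(doc)` in PyMuPDF) as `page_count` catches out-of-range values; the ordering check catches a sub-question accidentally left on the default page.
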
### Step 3: Write `page_number` in the upsert payload

Find where the script builds the INSERT/upsert dict and replace the inherited value:

```python
# Before:
"page_number": parent.get("page_number"),

# After:
"page_number": child.page_number,
```

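To make the change concrete, here is a small self-contained sketch of a payload builder after the fix. The `build_child_payload` function, the reduced `ChildSpec`, and the `paper_id` key on the parent dict are assumptions for illustration; each real script builds a larger dict in its own way.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class ChildSpec:
    # Hypothetical minimal fields; the real dataclass has more.
    question_number: str
    page_number: int = 1


def build_child_payload(parent: Dict[str, Any], child: ChildSpec) -> Dict[str, Any]:
    """Build the row for one sub-question.

    The page number comes from the child spec, never from the parent row.
    """
    return {
        "paper_id": parent.get("paper_id"),
        "question_number": child.question_number,
        "page_number": child.page_number,
    }
```

With this shape, a parent row on page 1 and a child spec with `page_number=2` yields a payload with `page_number` 2, which is the behaviour the backfill needs.
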
### Step 4: Update existing rows in the database

After modifying the scripts, run each script once. They already use upsert/update
semantics, so re-running overwrites the old (inherited) page numbers with the correct ones.

If a script does INSERT-only (not upsert), add a separate UPDATE pass:

```python
sb.table("paper_questions").update({"page_number": child.page_number}) \
    .eq("paper_id", paper_id) \
    .eq("question_number", child.question_number) \
    .execute()
```

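For an INSERT-only script, the UPDATE pass could be wrapped in a loop along these lines. This is a sketch under assumptions: `backfill_page_numbers` is a hypothetical helper, `sb` is the Supabase client the script already holds, and the input is a list of `(question_number, page_number)` pairs derived from the `ChildSpec` instances.

```python
from typing import Any, Iterable, Tuple


def backfill_page_numbers(
    sb: Any, paper_id: str, pages: Iterable[Tuple[str, int]]
) -> int:
    """Issue one UPDATE per (question_number, page_number) pair.

    Returns the number of UPDATE calls issued.
    """
    count = 0
    for question_number, page_number in pages:
        (
            sb.table("paper_questions")
            .update({"page_number": page_number})
            .eq("paper_id", paper_id)
            .eq("question_number", question_number)
            .execute()
        )
        count += 1
    return count
```

A call such as `backfill_page_numbers(sb, paper_id, [(c.question_number, c.page_number) for c in children])` would then cover every sub-question of one paper, assuming `children` holds that paper's specs.
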
## 2022-fall-midterm (seed SQL)

Sub-questions for this paper are in:
`supabase/seeds/comp2211_problem_level_questions.sql`

The seed has a `page_number` column in the VALUES rows. Find all rows for
`COMP2211-2022-fall-midterm` and correct the values. Then run a direct UPDATE
against the live database:

```sql
-- Example: adjust the actual page numbers after inspecting the PDF
UPDATE paper_questions
SET page_number = 2
WHERE paper_id = (SELECT id FROM papers WHERE source_exam_key = 'COMP2211-2022-fall-midterm')
  AND question_number IN ('1g', '1h', '1i', '1j');
```

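If several pages need correcting, the UPDATE statements can be generated from a question-to-page mapping instead of being written by hand. The sketch below is hypothetical: `build_page_updates` is not an existing helper, and it interpolates values directly into the SQL text, so it is only suitable for hand-written, trusted inputs whose output is reviewed before being run.

```python
from collections import defaultdict
from typing import Dict, List


def build_page_updates(exam_key: str, page_by_question: Dict[str, int]) -> List[str]:
    """Return one UPDATE statement per page, grouping question numbers by page."""
    by_page: Dict[int, List[str]] = defaultdict(list)
    for question_number, page in page_by_question.items():
        by_page[page].append(question_number)

    statements: List[str] = []
    for page in sorted(by_page):
        # Values are interpolated as-is: trusted, hand-written inputs only.
        in_list = ", ".join(f"'{q}'" for q in sorted(by_page[page]))
        statements.append(
            "UPDATE paper_questions\n"
            f"SET page_number = {page}\n"
            "WHERE paper_id = (SELECT id FROM papers "
            f"WHERE source_exam_key = '{exam_key}')\n"
            f"  AND question_number IN ({in_list});"
        )
    return statements
```

Printing the returned statements gives a script in the same shape as the example above, with one statement per distinct page.
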
## Definition of Done

- [ ] Every `ChildSpec` in every split script has an explicit `page_number`
- [ ] No script uses `parent.get("page_number")` for the upsert payload
- [ ] All six scripts have been re-run against the live database
- [ ] 2022-fall-midterm sub-questions updated via SQL
- [ ] Spot-check: clicking Q1h in a paper where Q1 spans 2 pages scrolls to page 2 in the UI