Sub-question Page Number Backfill — Requirements
Problem
All six split_comp2211_*.py scripts create sub-questions by inheriting page_number
from their parent question:
"page_number": parent.get("page_number"),
This is wrong whenever a parent question spans multiple pages. For example, Q1 True/False has 10 statements (a–j); if (a)–(f) are on page 1 and (g)–(j) are on page 2, all ten inherit page 1 from the parent, so clicking Q1h in the UI scrolls to page 1 instead of page 2.
Goal
Every ChildSpec in every split script should carry its own correct page_number.
When the script runs, it writes that page number to the database instead of inheriting
from the parent.
Files to modify
backend/split_comp2211_2022_fall_midterm.py ← does not exist yet; parent is seed SQL
backend/split_comp2211_2022_spring_midterm.py
backend/split_comp2211_2022_spring_final_part_a.py
backend/split_comp2211_2022_spring_final_part_b.py
backend/split_comp2211_2023_spring_midterm.py
backend/split_comp2211_2024_spring_midterm.py
backend/split_comp2211_2024_spring_final.py
Note: 2022-fall-midterm sub-questions were inserted directly via the seed SQL
(supabase/seeds/comp2211_problem_level_questions.sql), not via a split script.
Their page numbers must be fixed directly in that SQL file or via a separate UPDATE.
How to determine page numbers
Use PyMuPDF (import pymupdf — already in the venv) to search for question markers
in the local PDF files. The PDFs are at:
../pastpaper-scraper/papers/COMP2211/<filename>
Filename mapping (from upload_course_library_pdfs.py):
| Exam key | Local paper PDF |
|---|---|
| COMP2211-2022-fall-midterm | (COMP2211)2022midterm~=yjz8dxdd^_27002.pdf |
| COMP2211-2022-spring-midterm | (COMP2211)2022midterm~=b8bidkgs^_14629.pdf |
| COMP2211-2022-spring-final-part-a | (COMP2211)2022final~=b8bidkgs^_33018.pdf |
| COMP2211-2022-spring-final-part-b | (COMP2211)2022final~=b8bidkgs^_40627.pdf |
| COMP2211-2023-spring-midterm | (COMP2211)2023midterm~=bxbidkmj^_26587.pdf |
| COMP2211-2024-spring-midterm | (COMP2211)2024midterm~=rcidkjgf^_82003.pdf |
| COMP2211-2024-spring-final | (COMP2211)2024final~=igk5mmg^_90365.pdf |
Suggested search strategy
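import pymupdf

doc = pymupdf.open("path/to/paper.pdf")
# enumerate from 1 so printed page numbers match the database's page_number
for page_num, page in enumerate(doc, start=1):
    text = page.get_text()
    print(f"--- Page {page_num} ---")
    print(text[:500])  # first 500 characters are enough to spot markers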
Search for markers like:
- "(a)", "(b)", ... for True/False sub-statements
- "Q2(a)", "2(a)", "Question 2" for major sub-questions
- "(i)", "(ii)" for nested sub-questions
Page numbers are 1-indexed (matching the page_number field in the database).
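The marker search can be automated along these lines. This is a minimal sketch, not code from any existing script: `find_marker_pages` and `load_page_texts` are hypothetical names, and the regex assumes each marker starts a line in the extracted text, which should be checked against each PDF.

```python
import re


def find_marker_pages(page_texts, markers):
    """Map each marker, e.g. "(g)", to the first 1-indexed page containing it.

    Assumption: a marker starts a line (after optional whitespace) in the
    text extracted from the PDF.
    """
    found = {}
    for page_num, text in enumerate(page_texts, start=1):
        for marker in markers:
            if marker in found:
                continue  # keep the first page only
            pattern = r"^\s*" + re.escape(marker)
            if re.search(pattern, text, flags=re.MULTILINE):
                found[marker] = page_num
    return found


def load_page_texts(pdf_path):
    """Extract plain text for every page with PyMuPDF (imported lazily)."""
    import pymupdf

    doc = pymupdf.open(pdf_path)
    texts = [page.get_text() for page in doc]
    doc.close()
    return texts
```

Because markers such as "(a)" recur under several questions, the result is only reliable when `page_texts` is limited to the pages of one parent question. Markers missing from the returned dict were not found and need a manual look.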
Code changes per split script
Step 1 — Add page_number field to ChildSpec
Each script has its own ChildSpec dataclass. Add the field with a default so
existing call sites don't break immediately:
@dataclass(frozen=True)
class ChildSpec:
    ...
    page_number: int = 1  # add this field
Step 2 — Set correct page numbers in each ChildSpec instance
Fill in the actual page after inspecting the PDF:
ChildSpec("1a", "1", "1", ("a",), 1.5, "true_false", page_number=1),
ChildSpec("1b", "1", "1", ("b",), 1.5, "true_false", page_number=1),
...
ChildSpec("1h", "1", "1", ("h",), 1.5, "true_false", page_number=2),
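Once the page numbers are filled in, a quick range check can catch typos before anything is written to the database. The sketch below is illustrative only: this `ChildSpec` is a trimmed-down stand-in with assumed field names (the real dataclasses have more fields), and `invalid_page_numbers` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChildSpec:
    # Hypothetical, reduced field set; the real scripts define more fields.
    question_number: str
    parent_number: str
    page_number: int = 1


def invalid_page_numbers(children, page_count):
    """Return question numbers whose page_number is outside 1..page_count."""
    return [
        child.question_number
        for child in children
        if not 1 <= child.page_number <= page_count
    ]
```

A range check cannot tell an explicit `page_number=1` from a child that still relies on the default, so it complements rather than replaces reviewing each instance against the PDF.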
Step 3 — Write page_number in the upsert payload
Find where the script builds the INSERT/upsert dict and replace the inherited value:
# Before:
"page_number": parent.get("page_number"),
# After:
"page_number": child.page_number,
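In context, the change might look like the sketch below. `build_child_payload` is a hypothetical helper, and the assumption that the parent row carries `paper_id` is a guess; each script builds its payload in its own way and with more columns.

```python
def build_child_payload(parent, child_question_number, child_page_number):
    """Copy shared fields from the parent row, but take the page from the child."""
    return {
        "paper_id": parent.get("paper_id"),  # assumed to live on the parent row
        "question_number": child_question_number,
        "page_number": child_page_number,  # previously parent.get("page_number")
    }
```

For a parent on page 1, `build_child_payload(parent, "1h", 2)` yields a payload whose `page_number` is 2, regardless of the parent's own page.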
Step 4 — Update existing rows in the database
After modifying the scripts, run each script once — they already use upsert/update semantics, so re-running overwrites the old (inherited) page numbers with the correct ones.
If a script does INSERT-only (not upsert), add a separate UPDATE pass:
sb.table("paper_questions").update({"page_number": child.page_number}) \
    .eq("paper_id", paper_id) \
    .eq("question_number", child.question_number) \
    .execute()
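Wrapped in a loop, the UPDATE pass could look like this. It is a sketch under assumptions: `backfill_page_numbers` is a hypothetical name, `sb` is taken to be the Supabase client the scripts already use, and each child is assumed to expose `question_number` and `page_number` attributes.

```python
def backfill_page_numbers(sb, paper_id, children):
    """Send one UPDATE per child and return how many were sent.

    Assumptions: `sb` is a Supabase client, and every child has
    `question_number` and `page_number` attributes.
    """
    sent = 0
    for child in children:
        (
            sb.table("paper_questions")
            .update({"page_number": child.page_number})
            .eq("paper_id", paper_id)
            .eq("question_number", child.question_number)
            .execute()
        )
        sent += 1
    return sent
```

The return value counts requests sent, not rows changed; an UPDATE whose filters match no row still counts, so a follow-up query is needed to confirm the rows exist.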
2022-fall-midterm (seed SQL)
Sub-questions for this paper are in:
supabase/seeds/comp2211_problem_level_questions.sql
The seed has a page_number column in the VALUES rows. Find all rows for
COMP2211-2022-fall-midterm and correct the values. Then run a direct UPDATE
against the live database:
-- Example — adjust actual page numbers after inspecting the PDF
UPDATE paper_questions
SET page_number = 2
WHERE paper_id = (SELECT id FROM papers WHERE source_exam_key = 'COMP2211-2022-fall-midterm')
AND question_number IN ('1g', '1h', '1i', '1j');
Definition of Done
- Every ChildSpec in every split script has an explicit page_number
- No script uses parent.get("page_number") for the upsert payload
- All six scripts have been re-run against the live database
- 2022-fall-midterm sub-questions updated via SQL
- Spot-check: clicking Q1h in a paper where Q1 spans 2 pages scrolls to page 2 in the UI