# Sub-question Page Number Backfill: Requirements

## Problem

All six `split_comp2211_*.py` scripts create sub-questions by inheriting `page_number` from their parent question:

```python
"page_number": parent.get("page_number"),
```

This is wrong whenever a parent question's sub-questions are spread across more than one page. For example, Q1 True/False has 10 statements (a–j); if (a)–(f) are on page 1 and (g)–(j) are on page 2, all ten inherit page 1 from the parent. Clicking Q1h in the UI scrolls to page 1 instead of page 2.

## Goal

Every `ChildSpec` in every split script should carry its own correct `page_number`. When the script runs, it writes that page number to the database instead of inheriting from the parent.

## Files to modify

```
backend/split_comp2211_2022_fall_midterm.py   ← does not exist yet; parent is seed SQL
backend/split_comp2211_2022_spring_midterm.py
backend/split_comp2211_2022_spring_final_part_a.py
backend/split_comp2211_2022_spring_final_part_b.py
backend/split_comp2211_2023_spring_midterm.py
backend/split_comp2211_2024_spring_midterm.py
backend/split_comp2211_2024_spring_final.py
```

Note: `2022-fall-midterm` sub-questions were inserted directly via the seed SQL (`supabase/seeds/comp2211_problem_level_questions.sql`), not via a split script. Their page numbers must be fixed directly in that SQL file or via a separate UPDATE.

## How to determine page numbers

Use PyMuPDF (`import pymupdf`, already in the venv) to search for question markers in the local PDF files.
The PDFs are at:

```
../pastpaper-scraper/papers/COMP2211/
```

Filename mapping (from `upload_course_library_pdfs.py`):

| Exam key | Local paper PDF |
|----------|-----------------|
| `COMP2211-2022-fall-midterm` | `(COMP2211)[2022](f)midterm~=yjz8dxdd^_27002.pdf` |
| `COMP2211-2022-spring-midterm` | `(COMP2211)[2022](s)midterm~=b8bidkgs^_14629.pdf` |
| `COMP2211-2022-spring-final-part-a` | `(COMP2211)[2022](s)final~=b8bidkgs^_33018.pdf` |
| `COMP2211-2022-spring-final-part-b` | `(COMP2211)[2022](s)final~=b8bidkgs^_40627.pdf` |
| `COMP2211-2023-spring-midterm` | `(COMP2211)[2023](s)midterm~=bxbidkmj^_26587.pdf` |
| `COMP2211-2024-spring-midterm` | `(COMP2211)[2024](s)midterm~=rcidkjgf^_82003.pdf` |
| `COMP2211-2024-spring-final` | `(COMP2211)[2024](s)final~=igk5mmg^_90365.pdf` |

### Suggested search strategy

```python
import pymupdf

doc = pymupdf.open("path/to/paper.pdf")
# Print the first 500 characters of each page, numbered from 1
for page_num, page in enumerate(doc, start=1):
    text = page.get_text()
    print(f"--- Page {page_num} ---")
    print(text[:500])
```

Search for markers like:

- `"(a)"`, `"(b)"`, ... for True/False sub-statements
- `"Q2(a)"`, `"2(a)"`, `"Question 2"` for major sub-questions
- `"(i)"`, `"(ii)"` for nested sub-questions

Page numbers are 1-indexed (matching the `page_number` field in the database).

## Code changes per split script

### Step 1: Add `page_number` field to `ChildSpec`

Each script has its own `ChildSpec` dataclass. Add the field with a default so existing call sites don't break immediately:

```python
@dataclass(frozen=True)
class ChildSpec:
    ...
    page_number: int = 1  # add this field
```

### Step 2: Set correct page numbers in each `ChildSpec` instance

Fill in the actual page after inspecting the PDF:

```python
ChildSpec("1a", "1", "1", ("a",), 1.5, "true_false", page_number=1),
ChildSpec("1b", "1", "1", ("b",), 1.5, "true_false", page_number=1),
...
ChildSpec("1h", "1", "1", ("h",), 1.5, "true_false", page_number=2),
```

### Step 3: Write `page_number` in the upsert payload

Find where the script builds the INSERT/upsert dict and replace the inherited value:

```python
# Before:
"page_number": parent.get("page_number"),

# After:
"page_number": child.page_number,
```

### Step 4: Update existing rows in the database

After modifying the scripts, run each script once. They already use upsert/update semantics, so re-running overwrites the old (inherited) page numbers with the correct ones.

If a script does INSERT-only (not upsert), add a separate UPDATE pass:

```python
sb.table("paper_questions").update({"page_number": child.page_number}) \
    .eq("paper_id", paper_id) \
    .eq("question_number", child.question_number) \
    .execute()
```

## 2022-fall-midterm (seed SQL)

Sub-questions for this paper are in: `supabase/seeds/comp2211_problem_level_questions.sql`

The seed has a `page_number` column in the VALUES rows. Find all rows for `COMP2211-2022-fall-midterm` and correct the values. Then run a direct UPDATE against the live database:

```sql
-- Example: adjust actual page numbers after inspecting the PDF
UPDATE paper_questions
SET page_number = 2
WHERE paper_id = (SELECT id FROM papers WHERE source_exam_key = 'COMP2211-2022-fall-midterm')
  AND question_number IN ('1g', '1h', '1i', '1j');
```

## Definition of Done

- [ ] Every `ChildSpec` in every split script has an explicit `page_number`
- [ ] No script uses `parent.get("page_number")` for the upsert payload
- [ ] All six scripts have been re-run against the live database
- [ ] 2022-fall-midterm sub-questions updated via SQL
- [ ] Spot-check: clicking Q1h in a paper where Q1 spans 2 pages scrolls to page 2 in the UI
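
To speed up the PDF inspection described under "How to determine page numbers", the marker search could be automated roughly as follows. This is a minimal sketch, not part of the existing scripts: the helper names `first_page_per_marker` and `load_page_texts` are hypothetical, and the sample page texts are stand-ins for real extracted text.

```python
from typing import Dict, List, Optional, Sequence


def first_page_per_marker(
    page_texts: Sequence[str], markers: Sequence[str]
) -> Dict[str, Optional[int]]:
    """Map each marker to the 1-indexed page of its first occurrence (None if absent)."""
    found: Dict[str, Optional[int]] = {marker: None for marker in markers}
    for page_num, text in enumerate(page_texts, start=1):
        for marker in markers:
            # Record only the first page on which the marker appears
            if found[marker] is None and marker in text:
                found[marker] = page_num
    return found


def load_page_texts(pdf_path: str) -> List[str]:
    """Extract plain text per page with PyMuPDF (imported lazily, hypothetical helper)."""
    import pymupdf

    with pymupdf.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


# Stand-in page texts; real input would come from load_page_texts("path/to/paper.pdf").
sample_pages = ["Question 1 (a) ... (f) ...", "(g) ... (h) ... Question 2 ..."]
result = first_page_per_marker(sample_pages, ["(a)", "(h)", "Question 2", "(z)"])
print(result)
```

A bare marker such as `"(a)"` matches the first page on which any question uses it, so for papers with several lettered questions the search should be limited to the pages of the relevant parent question, and each result should still be confirmed by eye before it goes into a `ChildSpec`.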