PastpaperMaster/docs/PAGE_NUMBER_BACKFILL.md
Zhao 7a09167261 Initial commit: PastPaper Master full stack
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-21 12:27:47 +07:00


Sub-question Page Number Backfill — Requirements

Problem

All six split_comp2211_*.py scripts create sub-questions by inheriting page_number from their parent question:

"page_number": parent.get("page_number"),

This is wrong when a question's sub-questions span multiple pages. For example, Q1 True/False has 10 statements (a–j); if (a)–(f) are on page 1 and (g)–(j) are on page 2, all ten inherit page 1 from the parent. Clicking Q1h in the UI scrolls to page 1 instead of page 2.

Goal

Every ChildSpec in every split script should carry its own correct page_number. When the script runs, it writes that page number to the database instead of inheriting from the parent.

Files to modify

backend/split_comp2211_2022_fall_midterm.py      ← does not exist yet; parent is seed SQL
backend/split_comp2211_2022_spring_midterm.py
backend/split_comp2211_2022_spring_final_part_a.py
backend/split_comp2211_2022_spring_final_part_b.py
backend/split_comp2211_2023_spring_midterm.py
backend/split_comp2211_2024_spring_midterm.py
backend/split_comp2211_2024_spring_final.py

Note: 2022-fall-midterm sub-questions were inserted directly via the seed SQL (supabase/seeds/comp2211_problem_level_questions.sql), not via a split script. Their page numbers must be corrected in that SQL file, and the rows already in the live database must be fixed with a separate UPDATE (see the 2022-fall-midterm section).

How to determine page numbers

Use PyMuPDF (import pymupdf — already in the venv) to search for question markers in the local PDF files. The PDFs are at:

../pastpaper-scraper/papers/COMP2211/<filename>

Filename mapping (from upload_course_library_pdfs.py):

Exam key                            Local paper PDF
COMP2211-2022-fall-midterm          (COMP2211)2022midterm~=yjz8dxdd^_27002.pdf
COMP2211-2022-spring-midterm        (COMP2211)2022midterm~=b8bidkgs^_14629.pdf
COMP2211-2022-spring-final-part-a   (COMP2211)2022final~=b8bidkgs^_33018.pdf
COMP2211-2022-spring-final-part-b   (COMP2211)2022final~=b8bidkgs^_40627.pdf
COMP2211-2023-spring-midterm        (COMP2211)2023midterm~=bxbidkmj^_26587.pdf
COMP2211-2024-spring-midterm        (COMP2211)2024midterm~=rcidkjgf^_82003.pdf
COMP2211-2024-spring-final          (COMP2211)2024final~=igk5mmg^_90365.pdf

Suggested search strategy

import pymupdf

doc = pymupdf.open("path/to/paper.pdf")
for page_num, page in enumerate(doc, start=1):
    text = page.get_text()
    print(f"--- Page {page_num} ---")
    print(text[:500])

Search for markers like:

  • "(a)", "(b)", ... for True/False sub-statements
  • "Q2(a)", "2(a)", "Question 2" for major sub-questions
  • "(i)", "(ii)" for nested sub-questions

Page numbers are 1-indexed (matching the page_number field in the database). PyMuPDF itself indexes pages from 0, which is why the loop above uses enumerate(doc, start=1).
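
The marker search could be automated along the following lines. This is a minimal sketch, not part of the existing scripts: `load_page_texts` and `first_pages_for_labels` are hypothetical helper names, and the pattern assumes each sub-question label such as "(g)" starts a line in the extracted text, which may not hold for every paper. Treat the result as a first guess and confirm it against the PDF.

```python
import re


def load_page_texts(pdf_path):
    """Return the extracted text of every page, in page order."""
    import pymupdf  # imported lazily so the search helper below works without it

    doc = pymupdf.open(pdf_path)
    try:
        return [page.get_text() for page in doc]
    finally:
        doc.close()


def first_pages_for_labels(page_texts, labels):
    """Map each label (e.g. "a") to the 1-indexed page where "(a)" first starts a line."""
    found = {}
    for label in labels:
        pattern = re.compile(r"^\s*\(" + re.escape(label) + r"\)", re.MULTILINE)
        found[label] = next(
            (num for num, text in enumerate(page_texts, start=1) if pattern.search(text)),
            None,  # the label was not found on any page
        )
    return found


# Stand-in page texts for illustration; with a real paper, use
# load_page_texts("path/to/paper.pdf") instead.
sample_pages = [
    "Question 1\n(a) First statement\n(b) Second statement",
    "(g) Seventh statement\n(h) Eighth statement",
]
result = first_pages_for_labels(sample_pages, ["a", "b", "g", "h", "z"])
print(result)
```

Because only the first match is reported, a label that recurs under several questions (every question has an "(a)") resolves to its earliest page. Restricting `page_texts` to the pages of one parent question avoids that ambiguity.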

Code changes per split script

Step 1 — Add page_number field to ChildSpec

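Each script has its own ChildSpec dataclass. Add the field with a default so existing call sites don't break immediately. Place it after the existing fields, because dataclass fields with defaults must come after fields without them. The default is only a stopgap: Step 2 replaces it with an explicit value in every instance.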

@dataclass(frozen=True)
class ChildSpec:
    ...
    page_number: int = 1   # add this field

Step 2 — Set correct page numbers in each ChildSpec instance

Fill in the actual page after inspecting the PDF:

ChildSpec("1a", "1", "1", ("a",), 1.5, "true_false", page_number=1),
ChildSpec("1b", "1", "1", ("b",), 1.5, "true_false", page_number=1),
...
ChildSpec("1h", "1", "1", ("h",), 1.5, "true_false", page_number=2),
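
Before running a script, a quick sanity check can catch typos in the hand-entered values. The sketch below assumes the specs are listed in the order the sub-questions appear in the paper, so page numbers should never decrease, and that every page lies between 1 and the PDF's page count. The `ChildSpec` here is a reduced stand-in with only two fields, `check_page_numbers` is a hypothetical helper, and the page values are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChildSpec:
    # Reduced stand-in: the real ChildSpec in each script has more fields.
    question_number: str
    page_number: int = 1


def check_page_numbers(children, page_count):
    """Return a list of human-readable problems; an empty list means the check passed."""
    problems = []
    previous = 1
    for child in children:
        if not 1 <= child.page_number <= page_count:
            problems.append(
                f"{child.question_number}: page {child.page_number} is outside 1..{page_count}"
            )
        elif child.page_number < previous:
            problems.append(
                f"{child.question_number}: page {child.page_number} "
                f"is lower than the preceding page {previous}"
            )
        else:
            previous = child.page_number
    return problems


children = [
    ChildSpec("1a", page_number=1),
    ChildSpec("1g", page_number=2),
    ChildSpec("1h", page_number=1),  # deliberate typo: should be 2
]
problems = check_page_numbers(children, page_count=2)
print(problems)
```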

Step 3 — Write page_number in the upsert payload

Find where the script builds the INSERT/upsert dict and replace the inherited value:

# Before:
"page_number": parent.get("page_number"),

# After:
"page_number": child.page_number,

Step 4 — Update existing rows in the database

After modifying the scripts, run each script once — they already use upsert/update semantics, so re-running overwrites the old (inherited) page numbers with the correct ones.

If a script does INSERT-only (not upsert), add a separate UPDATE pass:

sb.table("paper_questions").update({"page_number": child.page_number}) \
  .eq("paper_id", paper_id) \
  .eq("question_number", child.question_number) \
  .execute()
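
In context, the UPDATE pass would run once per child spec. A minimal sketch follows, assuming `sb` is the Supabase client the script already creates and that each spec exposes `question_number` and `page_number`. `backfill_page_numbers` is a hypothetical name, and `RecordingClient` exists only so the example runs without a database.

```python
from types import SimpleNamespace


def backfill_page_numbers(sb, paper_id, children):
    """Issue one UPDATE per child, using the same call chain as the snippet above."""
    for child in children:
        (
            sb.table("paper_questions")
            .update({"page_number": child.page_number})
            .eq("paper_id", paper_id)
            .eq("question_number", child.question_number)
            .execute()
        )


class RecordingClient:
    """Test double that mimics the chained builder and records each executed update."""

    def __init__(self):
        self.executed = []
        self._pending = None

    def table(self, name):
        self._pending = {"table": name, "filters": {}}
        return self

    def update(self, values):
        self._pending["values"] = values
        return self

    def eq(self, column, value):
        self._pending["filters"][column] = value
        return self

    def execute(self):
        self.executed.append(self._pending)
        return self


# Illustrative specs and paper id; real values come from the split script.
children = [
    SimpleNamespace(question_number="1a", page_number=1),
    SimpleNamespace(question_number="1h", page_number=2),
]
client = RecordingClient()
backfill_page_numbers(client, "paper-123", children)
for call in client.executed:
    print(call["filters"]["question_number"], "->", call["values"]["page_number"])
```

Running the function against the recording client first is a cheap dry run: the printed list shows exactly which rows would change before anything touches the live database.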

2022-fall-midterm (seed SQL)

Sub-questions for this paper are in: supabase/seeds/comp2211_problem_level_questions.sql

The seed has a page_number column in the VALUES rows. Find all rows for COMP2211-2022-fall-midterm and correct the values. Then run a direct UPDATE against the live database:

-- Example — adjust actual page numbers after inspecting the PDF
UPDATE paper_questions
SET page_number = 2
WHERE paper_id = (SELECT id FROM papers WHERE source_exam_key = 'COMP2211-2022-fall-midterm')
  AND question_number IN ('1g', '1h', '1i', '1j');
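
If many rows need correcting, the UPDATE statements could be generated from a page mapping instead of being typed by hand. The sketch below is one possible approach: `build_updates` is a hypothetical helper, the mapping values are placeholders to be replaced after inspecting the PDF, and the plain string interpolation is acceptable only because the inputs are trusted constants.

```python
from collections import defaultdict

EXAM_KEY = "COMP2211-2022-fall-midterm"


def build_updates(pages_by_question, exam_key=EXAM_KEY):
    """Group question numbers by page and return one UPDATE statement per page."""
    by_page = defaultdict(list)
    for question_number, page in pages_by_question.items():
        by_page[page].append(question_number)

    statements = []
    for page in sorted(by_page):
        quoted = ", ".join(f"'{q}'" for q in sorted(by_page[page]))
        statements.append(
            "UPDATE paper_questions\n"
            f"SET page_number = {int(page)}\n"
            f"WHERE paper_id = (SELECT id FROM papers WHERE source_exam_key = '{exam_key}')\n"
            f"  AND question_number IN ({quoted});"
        )
    return statements


# Placeholder mapping; fill in the real pages after inspecting the PDF.
statements = build_updates({"1a": 1, "1g": 2, "1h": 2})
print("\n\n".join(statements))
```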

Definition of Done

  • Every ChildSpec in every split script has an explicit page_number
  • No script uses parent.get("page_number") for the upsert payload
  • All six scripts have been re-run against the live database
  • 2022-fall-midterm sub-questions updated via SQL
  • Spot-check: clicking Q1h in a paper where Q1 spans 2 pages scrolls to page 2 in the UI
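
The second item can be checked mechanically. The following sketch scans a directory for split scripts that still contain the inherited lookup; `scripts_still_inheriting` is a hypothetical helper, and the temporary files stand in for the real backend/ directory so the example runs anywhere.

```python
import re
import tempfile
from pathlib import Path

# Matches parent.get("page_number") with either quote style.
INHERIT_PATTERN = re.compile(r"""parent\.get\(\s*["']page_number["']\s*\)""")


def scripts_still_inheriting(backend_dir):
    """Return the names of split scripts that still read page_number from the parent."""
    offenders = []
    for path in sorted(Path(backend_dir).glob("split_comp2211_*.py")):
        if INHERIT_PATTERN.search(path.read_text(encoding="utf-8")):
            offenders.append(path.name)
    return offenders


# Demonstration with two throwaway files instead of the real scripts.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "split_comp2211_old.py").write_text(
        'payload = {"page_number": parent.get("page_number")}\n', encoding="utf-8"
    )
    Path(tmp, "split_comp2211_new.py").write_text(
        'payload = {"page_number": child.page_number}\n', encoding="utf-8"
    )
    offenders = scripts_still_inheriting(tmp)
print(offenders)
```

Pointing the function at backend/ should return an empty list once every script is converted. The match is purely textual, so a commented-out copy of the old line is reported as well.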