Initial commit: PastPaper Master full stack

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-21 12:15:35 +07:00
commit 7a09167261
105 changed files with 24799 additions and 0 deletions
--- a/HANDOFF_COMP2211.md
+++ b/HANDOFF_COMP2211.md
@@ -0,0 +1,328 @@
+# COMP2211 Handoff
+
+## Current Status
+
+`COMP2211` course-library papers are now fully loaded into Supabase and normalized to subquestion-level granularity.
+
+Canonical papers currently in DB:
+
+- `COMP2211-2022-fall-midterm`
+- `COMP2211-2022-spring-midterm`
+- `COMP2211-2022-spring-final-part-a`
+- `COMP2211-2022-spring-final-part-b`
+- `COMP2211-2023-spring-midterm`
+- `COMP2211-2024-spring-midterm`
+- `COMP2211-2024-spring-final`
+
+All seven papers are:
+
+- `status = ready`
+- split to subquestion level
+- tagged with `analytics_topic`, `topic_primary`, `topic_tags`, `skill_tags`
+
+Question counts:
+
+- 2022 fall midterm: `43`
+- 2022 spring midterm: `38`
+- 2022 spring final part A: `24`
+- 2022 spring final part B: `19`
+- 2023 spring midterm: `36`
+- 2024 spring midterm: `42`
+- 2024 spring final: `48`
+
+## Key Files
+
+Schema / SQL:
+
+- [001_init_schema.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/001_init_schema.sql)
+- [002_course_library_fields.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/002_course_library_fields.sql)
+- [003_question_taxonomy_fields.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/003_question_taxonomy_fields.sql)
+- [004_decouple_course_library_from_auth.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/004_decouple_course_library_from_auth.sql)
+- [005_allow_long_question_format_alias.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/005_allow_long_question_format_alias.sql)
+- [006_make_scores_numeric.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/006_make_scores_numeric.sql)
+
+Course-library seeds:
+
+- [comp2211_course_library_papers.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/seeds/comp2211_course_library_papers.sql)
+- [comp2211_problem_taxonomy_backfill.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/seeds/comp2211_problem_taxonomy_backfill.sql)
+- [comp2211_problem_level_questions.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/seeds/comp2211_problem_level_questions.sql)
+
+Manual splitters used for final subquestion rebuild:
+
+- [split_comp2211_2022_spring_midterm.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2022_spring_midterm.py)
+- [split_comp2211_2022_spring_final_part_a.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2022_spring_final_part_a.py)
+- [split_comp2211_2022_spring_final_part_b.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2022_spring_final_part_b.py)
+- [split_comp2211_2023_spring_midterm.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2023_spring_midterm.py)
+- [split_comp2211_2024_spring_midterm.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2024_spring_midterm.py)
+- [split_comp2211_2024_spring_final.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2024_spring_final.py)
+
+Deprecated filler script:
+
+- [fill_manual_study_aids.py](/Users/soda/Desktop/PastPaper%20Master/backend/fill_manual_study_aids.py)
+
+Audit / taxonomy references:
+
+- [COMP2211.json](/Users/soda/Desktop/PastPaper%20Master/pastpaper-scraper/manifests/COMP2211.json)
+- [COMP2211_taxonomy.json](/Users/soda/Desktop/PastPaper%20Master/pastpaper-scraper/manifests/COMP2211_taxonomy.json)
+- [summary.json](/Users/soda/Desktop/PastPaper%20Master/pastpaper-scraper/reviews/COMP2211/summary.json)
+- [problem_topics.json](/Users/soda/Desktop/PastPaper%20Master/pastpaper-scraper/reviews/COMP2211/problem_topics.json)
+- [problem_seed.json](/Users/soda/Desktop/PastPaper%20Master/pastpaper-scraper/reviews/COMP2211/problem_seed.json)
+
+Frontend / backend areas already adapted to real taxonomy:
+
+- [frontend/src/pages/HomePage.tsx](/Users/soda/Desktop/PastPaper%20Master/frontend/src/pages/HomePage.tsx)
+- [frontend/src/pages/AnalyticsPage.tsx](/Users/soda/Desktop/PastPaper%20Master/frontend/src/pages/AnalyticsPage.tsx)
+- [frontend/src/components/workbench/SimilarHistoryPanel.tsx](/Users/soda/Desktop/PastPaper%20Master/frontend/src/components/workbench/SimilarHistoryPanel.tsx)
+- [backend/app/routers/analytics.py](/Users/soda/Desktop/PastPaper%20Master/backend/app/routers/analytics.py)
+- [backend/app/routers/questions.py](/Users/soda/Desktop/PastPaper%20Master/backend/app/routers/questions.py)
+- [backend/app/routers/attempts.py](/Users/soda/Desktop/PastPaper%20Master/backend/app/routers/attempts.py)
+
+## Important Product / Data Decisions Already Made
+
+### Course library vs user upload
+
+The two are now distinguished by `source_kind` in the `papers` table:
+
+- `source_kind = 'course_library'` for platform-owned papers
+- `source_kind = 'user_upload'` for user-contributed papers
+
+Course-library papers no longer require `user_id`.
+
+### Taxonomy model
+
+`question_type` is not the main analytics dimension.
+
+Current intended usage:
+
+- `question_type` / `question_format`: rendering and answer interaction
+- `analytics_topic`: normalized analytics bucket
+- `topic_tags`: multi-tag topical indexing
+- `skill_tags`: finer-grained retrieval / grading / similarity support
+
+### Score field
+
+Scores are `NUMERIC`, not integer, because many subquestions use fractional marks like `1.5`.
+
+## Known Issues
+
+### 1. Similar question retrieval is not yet production-ready
+
+Current state:
+
+- backend route exists
+- frontend panel exists
+- demo fallback still exists in the UI when retrieval returns empty / fails
+
+What needs to be done:
+
+- remove demo fallback behavior once real retrieval is stable
+- improve ranking beyond current basic topic/type matching
+- ideally add indexed text retrieval, then embeddings if needed
+
+Recommended order:
+
+1. build deterministic same-course retrieval first
+2. rank by `analytics_topic`, `topic_tags`, `skill_tags`, `question_format`, text similarity
+3. only then consider vector search
+
+### 2. Analytics is real, but still not the final version
+
+Current state:
+
+- analytics already reads real DB data
+- taxonomy fields are being used
+
+Still missing:
+
+- better topic normalization for edge cases
+- per-paper and per-subtopic drill-down
+- cleaner stats for mixed-format questions
+- verified aggregate counts across all courses, not only `COMP2211`
+
+### 3. LaTeX / math rendering is still fragile
+
+Known symptoms:
+
+- OCR / extracted math strings are noisy
+- some generated HTML contains malformed or hard-to-read math fragments
+- not all backend feedback is rendered with the same quality
+
+What needs work:
+
+- normalize math strings before rendering
+- improve KaTeX preprocessing
+- avoid dumping broken extracted formulas directly into UI
+- ensure solution / feedback content is consistently rendered through the same component path
+
+### 4. Presentation quality is still uneven
+
+Data is now real, but UI still needs polish:
+
+- question nav is still too weak for long real papers
+- status / difficulty / topic chips can be clearer
+- workbench hierarchy is inconsistent across question types
+- some pages still read like an internal demo rather than a finished study product
+
+### 5. User upload flow still lacks dedup / library filtering
+
+This is the next big backend product task.
+
+Desired logic:
+
+- when user uploads a paper, compare against existing course-library papers
+- if it is already covered, do not create a duplicate paper
+- if it is new, ingest it as `user_upload`
+- if high quality and non-duplicate, optionally promote into library workflow later
+
+### 6. Most non-Spring-2024 study aids are contaminated by template filler content
+
+Current state:
+
+- `COMP2211-2022-fall-midterm` has question-level LLM-authored study aids
+- `COMP2211-2024-spring-midterm` is the intended quality bar
+- the remaining papers were backfilled with a deprecated template script and should not be treated as production-quality AI content
+
+Impact:
+
+- `knowledge_reminder` is often generic topic boilerplate
+- `ai_hint` often points to a parent problem header instead of the actual subquestion
+- `solution` is often just wrapped reference text, not a true worked solution
+
+Required action:
+
+1. detect and clear templated study aids from affected papers
+2. regenerate them through the real LLM path in [paper_processor.py](/Users/soda/Desktop/PastPaper%20Master/backend/app/services/paper_processor.py)
+3. review output quality before marking the papers as complete
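+
+One possible starting point for step 1 is to flag study aids whose text repeats verbatim across many questions, since template filler tends to be identical boilerplate. This is a heuristic sketch under assumptions: the row shape (dicts with `id` and `knowledge_reminder`), the helper names, and the `min_repeats` threshold are hypothetical and not taken from the real schema or from `fill_manual_study_aids.py`. The sample rows are made up.
+
+```python
+from collections import defaultdict
+
+
+def normalize(text: str) -> str:
+    """Lowercase and collapse whitespace so trivial differences do not hide repeats."""
+    return " ".join(text.lower().split())
+
+
+def find_templated_ids(rows, field="knowledge_reminder", min_repeats=3):
+    """Return ids of rows whose `field` text is shared by at least `min_repeats` rows."""
+    groups = defaultdict(list)
+    for row in rows:
+        text = normalize(row.get(field) or "")
+        if text:  # ignore empty study aids
+            groups[text].append(row["id"])
+    flagged = []
+    for ids in groups.values():
+        if len(ids) >= min_repeats:
+            flagged.extend(ids)
+    return sorted(flagged)
+
+
+# Illustrative rows only; real rows would come from `paper_questions`.
+rows = [
+    {"id": 1, "knowledge_reminder": "Review the key ideas of this topic."},
+    {"id": 2, "knowledge_reminder": "Review the key  ideas of this topic."},
+    {"id": 3, "knowledge_reminder": "review the key ideas of this topic."},
+    {"id": 4, "knowledge_reminder": "Ties are broken toward the nearest point."},
+]
+print(find_templated_ids(rows))  # [1, 2, 3]
+```
+
+Flagged ids are candidates for manual review before clearing, since a legitimately shared reminder would also repeat.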
+
+## Next Major Workstreams
+
+### A. Real similar-question retrieval
+
+Goal:
+
+- no demo fallback
+- same-course retrieval that feels trustworthy
+
+Suggested implementation:
+
+1. add a richer retrieval score in [questions.py](/Users/soda/Desktop/PastPaper%20Master/backend/app/routers/questions.py)
+2. use:
+   - same `course_code`
+   - same `analytics_topic`
+   - overlapping `topic_tags`
+   - overlapping `skill_tags`
+   - same or compatible `question_format`
+   - lexical similarity on `question_text`
+3. expose match reasons in response if useful
+4. update UI to show why a question was retrieved
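+
+The scoring in steps 2 and 3 could be sketched roughly as below. This is a minimal illustration, not the current code in `questions.py`: the weights, the `score_candidate` and `jaccard` names, and the sample values are assumptions, `difflib.SequenceMatcher` is only a stand-in for lexical similarity, and "compatible" formats are reduced to exact equality.
+
+```python
+from difflib import SequenceMatcher
+
+# Hypothetical weights; tune against real retrieval results.
+WEIGHTS = {"analytics_topic": 3.0, "topic_tags": 2.0, "skill_tags": 2.0,
+           "question_format": 1.0, "text": 2.0}
+
+
+def jaccard(a, b):
+    """Shared tags divided by all distinct tags (0.0 when both lists are empty)."""
+    a, b = set(a), set(b)
+    return len(a & b) / len(a | b) if a | b else 0.0
+
+
+def score_candidate(query, cand):
+    """Return (score, reasons) for one candidate; other courses always score 0."""
+    if query["course_code"] != cand["course_code"]:
+        return 0.0, []
+    score, reasons = 0.0, []
+    if query["analytics_topic"] == cand["analytics_topic"]:
+        score += WEIGHTS["analytics_topic"]
+        reasons.append("same analytics_topic")
+    for field in ("topic_tags", "skill_tags"):
+        overlap = jaccard(query[field], cand[field])
+        if overlap > 0:
+            score += WEIGHTS[field] * overlap
+            reasons.append(f"overlapping {field}")
+    if query["question_format"] == cand["question_format"]:
+        score += WEIGHTS["question_format"]
+        reasons.append("same question_format")
+    # Character-level similarity on lowercased question text.
+    ratio = SequenceMatcher(None, query["question_text"].lower(),
+                            cand["question_text"].lower()).ratio()
+    score += WEIGHTS["text"] * ratio
+    return score, reasons
+
+
+# Illustrative questions; topic and tag values are made up.
+q = {"course_code": "COMP2211", "analytics_topic": "knn",
+     "topic_tags": ["knn", "distance"], "skill_tags": ["compute"],
+     "question_format": "short_answer",
+     "question_text": "Compute the Euclidean distance between the two points."}
+close = dict(q, topic_tags=["knn"],
+             question_text="Compute the Manhattan distance between the two points.")
+far = dict(q, analytics_topic="cnn", topic_tags=["convolution"],
+           skill_tags=["derive"], question_format="mcq",
+           question_text="What is the output size of the convolution layer?")
+other_course = dict(close, course_code="COMP9999")
+
+ranked = sorted([far, close], key=lambda c: score_candidate(q, c)[0], reverse=True)
+```
+
+Returning `reasons` alongside the score gives the UI something to show when explaining why a question was retrieved.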
+
+Potential DB improvement:
+
+- add `search_text` / `tsvector` on `paper_questions`
+- later optionally add `embedding`
+
+### B. Real paper / topic statistics
+
+Goal:
+
+- analytics should be fully trustworthy at subquestion level
+
+Suggested improvements:
+
+- topic frequency by `analytics_topic`
+- question-format distribution by subquestion, not by top-level problem
+- per-paper breakdown
+- high-yield topic trend across years
+- topic-to-question index page for drill mode
+
+### C. LaTeX and content rendering cleanup
+
+Goal:
+
+- all math-heavy content should render legibly
+
+Suggested work:
+
+- centralize HTML + KaTeX normalization
+- strip broken OCR artifacts before render
+- make study-aid content generation avoid malformed formula formatting
+- ensure grading feedback and solutions share the same renderer pipeline
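+
+A centralized normalization step might begin as a single function that every content path calls before rendering. The sketch below is an assumption-heavy illustration: `normalize_math_text` is a hypothetical name, it handles only whitespace, Unicode replacement characters, and unbalanced `$` delimiters, and it assumes the renderer treats `\$` as a literal dollar sign.
+
+```python
+import re
+
+
+def normalize_math_text(raw: str) -> str:
+    """Clean an extracted string before it reaches the renderer."""
+    text = raw.replace("\ufffd", "")       # drop Unicode replacement characters
+    text = re.sub(r"[ \t]+", " ", text)    # collapse runs of spaces and tabs
+    text = re.sub(r"\s*\n\s*", "\n", text).strip()
+    # Count unescaped `$`; an odd count means a math delimiter is missing.
+    if len(re.findall(r"(?<!\\)\$", text)) % 2 == 1:
+        # Escape every unescaped `$` so nothing is parsed as half-open math.
+        text = re.sub(r"(?<!\\)\$", r"\\$", text)
+    return text
+
+
+print(normalize_math_text("  $x^2$   +  1 "))
+print(normalize_math_text("cost is $5 \ufffd"))
+```
+
+Falling back to literal text for unbalanced delimiters trades a lost formula for a readable line, which seems preferable to dumping a broken fragment into the UI.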
+
+### D. User upload deduplication and library filtering
+
+Goal:
+
+- new uploads should not pollute the DB with duplicates
+
+Suggested logic:
+
+1. normalize upload metadata
+2. compare against existing papers in same course:
+   - year / term / exam_type / part_label
+   - title similarity
+   - extracted first-page markers
+   - optional text fingerprint
+3. if duplicate:
+   - attach to existing paper or reject with explanation
+4. if not duplicate:
+   - create `user_upload`
+   - process normally
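+
+The comparison in steps 1 to 4 could be prototyped along these lines. This is a sketch under stated assumptions: the `title` and `first_page_text` keys, the helper names, and the `0.9` title threshold are hypothetical, the sample first-page text is made up, and the fingerprint is a plain SHA-256 over normalized text rather than any existing field on `papers`.
+
+```python
+import hashlib
+import re
+from difflib import SequenceMatcher
+
+
+def text_fingerprint(text: str) -> str:
+    """SHA-256 of the text with case, punctuation, and spacing differences removed."""
+    canonical = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
+    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
+
+
+def metadata_key(paper: dict) -> tuple:
+    """Normalized (course_code, year, term, exam_type, part_label) tuple."""
+    return (
+        paper["course_code"].strip().upper(),
+        int(paper["year"]),
+        paper["term"].strip().lower(),
+        paper["exam_type"].strip().lower(),
+        (paper.get("part_label") or "").strip().lower(),
+    )
+
+
+def find_duplicate(upload: dict, library: list, title_threshold: float = 0.9):
+    """Return the first library paper that looks like the upload, else None."""
+    upload_fp = text_fingerprint(upload["first_page_text"])
+    upload_key = metadata_key(upload)
+    for paper in library:
+        key = metadata_key(paper)
+        if key[0] != upload_key[0]:
+            continue  # only compare within the same course
+        if text_fingerprint(paper["first_page_text"]) == upload_fp:
+            return paper  # identical first page after normalization
+        title_ratio = SequenceMatcher(
+            None, paper["title"].lower(), upload["title"].lower()
+        ).ratio()
+        if key == upload_key and title_ratio >= title_threshold:
+            return paper
+    return None
+
+
+# Illustrative records only.
+library = [{
+    "course_code": "COMP2211", "year": 2024, "term": "spring",
+    "exam_type": "midterm", "part_label": None,
+    "title": "COMP2211 Spring 2024 Midterm",
+    "first_page_text": "COMP 2211 Midterm Examination, Spring 2024",
+}]
+upload_dup = dict(library[0], first_page_text="comp 2211  midterm examination -- spring 2024")
+upload_new = dict(library[0], year=2025, term="fall",
+                  title="COMP2211 Fall 2025 Midterm",
+                  first_page_text="COMP 2211 Midterm Examination, Fall 2025")
+```
+
+A `None` result would lead to the `user_upload` branch; a match would lead to the attach-or-reject branch.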
+
+Likely schema additions later:
+
+- content fingerprint field on `papers`
+- upload provenance fields
+- moderation / promotion state for community uploads
+
+### E. UI / UX pass
+
+Priority items:
+
+- stronger question navigation for real papers
+- clearer ready / processing / failed states
+- better paper list and filtering UX
+- richer workbench metadata:
+  - topic
+  - difficulty
+  - format
+  - score
+  - answered / wrong / mastered state
+- unify visual style across analytics, error book, workbench
+
+## Suggested Development Order
+
+1. Remove similar-question demo fallback and ship real retrieval
+2. Improve analytics and topic drill views using subquestion-level data
+3. Fix LaTeX / rendering quality
+4. Build upload dedup / filtering against existing library papers
+5. Do a focused UI / UX pass after the real data flows are stable
+
+## Operational Notes
+
+### Frontend entry issue that was fixed
+
+The homepage previously used mock papers and an old hardcoded `COMP2211` id.
+It now reads real papers from `listPapers()`.
+
+### Manual content generation
+
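+Most of the current `COMP2211` study aids (`knowledge_reminder`, `ai_hint`, `solution`) were filled through local scripts and deterministic templates, not through external LLM batch processing. This kept the dataset stable, but the templated content is not production quality and is slated for regeneration (see Known Issue 6).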
+
+### If rebuilding papers again
+
+For `COMP2211`, use the manual splitters rather than rerunning generic extraction blindly. `2024-spring-midterm` especially required reconstruction from PDF page spans because the earlier top-level extraction had already truncated `Problem 5` and `Problem 7`.
+
+## Ready-to-Verify Checklist
+
+If you want to sanity-check the current product quickly:
+
+1. Open home page and filter `COMP2211`
+2. Open each paper and confirm `status = ready`
+3. Check question count matches:
+   - `43 / 38 / 24 / 19 / 36 / 42 / 48`
+4. Open analytics page for `COMP2211`
+5. Open several papers and verify:
+   - question nav loads
+   - the three study aids (`knowledge_reminder`, `ai_hint`, `solution`) exist
+   - topics render
+   - similar-question panel does not block the page
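+
+Step 3 of the checklist can also be checked mechanically. The sketch below hardcodes the expected counts from this document and compares them with whatever counts the caller supplies. How the actual counts are fetched from Supabase is left out, and `count_mismatches` is a hypothetical helper name.
+
+```python
+# Expected subquestion counts per paper, as listed under "Question counts".
+EXPECTED_COUNTS = {
+    "COMP2211-2022-fall-midterm": 43,
+    "COMP2211-2022-spring-midterm": 38,
+    "COMP2211-2022-spring-final-part-a": 24,
+    "COMP2211-2022-spring-final-part-b": 19,
+    "COMP2211-2023-spring-midterm": 36,
+    "COMP2211-2024-spring-midterm": 42,
+    "COMP2211-2024-spring-final": 48,
+}
+
+
+def count_mismatches(actual: dict) -> dict:
+    """Map each paper to (expected, actual) where they differ; a missing paper has actual None."""
+    return {
+        paper: (expected, actual.get(paper))
+        for paper, expected in EXPECTED_COUNTS.items()
+        if actual.get(paper) != expected
+    }
+
+
+# Example: one paper is a question short, everything else matches.
+actual = dict(EXPECTED_COUNTS)
+actual["COMP2211-2024-spring-final"] = 47
+print(count_mismatches(actual))  # {'COMP2211-2024-spring-final': (48, 47)}
+```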