# COMP2211 Handoff

## Current Status

`COMP2211` course-library papers are now fully loaded into Supabase and normalized to subquestion-level granularity.

Canonical papers currently in DB:

- `COMP2211-2022-fall-midterm`
- `COMP2211-2022-spring-midterm`
- `COMP2211-2022-spring-final-part-a`
- `COMP2211-2022-spring-final-part-b`
- `COMP2211-2023-spring-midterm`
- `COMP2211-2024-spring-midterm`
- `COMP2211-2024-spring-final`

All seven papers are:

- `status = ready`
- split to subquestion level
- tagged with `analytics_topic`, `topic_primary`, `topic_tags`, `skill_tags`

Question counts:

- 2022 fall midterm: `43`
- 2022 spring midterm: `38`
- 2022 spring final part A: `24`
- 2022 spring final part B: `19`
- 2023 spring midterm: `36`
- 2024 spring midterm: `42`
- 2024 spring final: `48`

## Key Files

Schema / SQL:

- [001_init_schema.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/001_init_schema.sql)
- [002_course_library_fields.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/002_course_library_fields.sql)
- [003_question_taxonomy_fields.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/003_question_taxonomy_fields.sql)
- [004_decouple_course_library_from_auth.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/004_decouple_course_library_from_auth.sql)
- [005_allow_long_question_format_alias.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/005_allow_long_question_format_alias.sql)
- [006_make_scores_numeric.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/006_make_scores_numeric.sql)

Course-library seeds:

- [comp2211_course_library_papers.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/seeds/comp2211_course_library_papers.sql)
- [comp2211_problem_taxonomy_backfill.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/seeds/comp2211_problem_taxonomy_backfill.sql)
- [comp2211_problem_level_questions.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/seeds/comp2211_problem_level_questions.sql)

Manual splitters used for final subquestion rebuild:

- [split_comp2211_2022_spring_midterm.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2022_spring_midterm.py)
- [split_comp2211_2022_spring_final_part_a.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2022_spring_final_part_a.py)
- [split_comp2211_2022_spring_final_part_b.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2022_spring_final_part_b.py)
- [split_comp2211_2023_spring_midterm.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2023_spring_midterm.py)
- [split_comp2211_2024_spring_midterm.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2024_spring_midterm.py)
- [split_comp2211_2024_spring_final.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2024_spring_final.py)

Deprecated filler script:

- [fill_manual_study_aids.py](/Users/soda/Desktop/PastPaper%20Master/backend/fill_manual_study_aids.py)

Audit / taxonomy references:

- [COMP2211.json](/Users/soda/Desktop/PastPaper%20Master/pastpaper-scraper/manifests/COMP2211.json)
- [COMP2211_taxonomy.json](/Users/soda/Desktop/PastPaper%20Master/pastpaper-scraper/manifests/COMP2211_taxonomy.json)
- [summary.json](/Users/soda/Desktop/PastPaper%20Master/pastpaper-scraper/reviews/COMP2211/summary.json)
- [problem_topics.json](/Users/soda/Desktop/PastPaper%20Master/pastpaper-scraper/reviews/COMP2211/problem_topics.json)
- [problem_seed.json](/Users/soda/Desktop/PastPaper%20Master/pastpaper-scraper/reviews/COMP2211/problem_seed.json)

Frontend / backend areas already adapted to real taxonomy:

- [frontend/src/pages/HomePage.tsx](/Users/soda/Desktop/PastPaper%20Master/frontend/src/pages/HomePage.tsx)
- [frontend/src/pages/AnalyticsPage.tsx](/Users/soda/Desktop/PastPaper%20Master/frontend/src/pages/ErrorBookPage.tsx)
- [frontend/src/components/workbench/SimilarHistoryPanel.tsx](/Users/soda/Desktop/PastPaper%20Master/frontend/src/components/workbench/SimilarHistoryPanel.tsx)
- [backend/app/routers/analytics.py](/Users/soda/Desktop/PastPaper%20Master/backend/app/routers/analytics.py)
- [backend/app/routers/questions.py](/Users/soda/Desktop/PastPaper%20Master/backend/app/routers/questions.py)
- [backend/app/routers/attempts.py](/Users/soda/Desktop/PastPaper%20Master/backend/app/routers/attempts.py)

## Important Product / Data Decisions Already Made

### Course library vs user upload

This is now separated semantically inside `papers`:

- `source_kind = 'course_library'` for platform-owned papers
- `source_kind = 'user_upload'` for user-contributed papers

Course-library papers no longer require `user_id`.

### Taxonomy model

`question_type` is not the main analytics dimension.

Current intended usage:

- `question_type` / `question_format`: rendering and answer interaction
- `analytics_topic`: normalized analytics bucket
- `topic_tags`: multi-tag topical indexing
- `skill_tags`: finer-grained retrieval / grading / similarity support

### Score field

Scores are `NUMERIC`, not integer, because many subquestions use fractional marks like `1.5`.

## Known Issues

### 1. Similar question retrieval is still not truly production-ready

Current state:

- backend route exists
- frontend panel exists
- demo fallback still exists in the UI when retrieval returns empty / fails

What needs to be done:

- remove demo fallback behavior once real retrieval is stable
- improve ranking beyond current basic topic/type matching
- ideally add indexed text retrieval, then embeddings if needed

Recommended order:

1. build deterministic same-course retrieval first
2. rank by `analytics_topic`, `topic_tags`, `skill_tags`, `question_format`, text similarity
3. only then consider vector search

### 2. Analytics is real, but still not the final version

Current state:

- analytics already reads real DB data
- taxonomy fields are being used

Still missing:

- better topic normalization for edge cases
- per-paper and per-subtopic drill-down
- cleaner stats for mixed-format questions
- confidence around aggregated counts across all courses, not only `COMP2211`

### 3. LaTeX / math rendering is still fragile

Known symptoms:

- OCR / extracted math strings are noisy
- some generated HTML contains malformed or hard-to-read math fragments
- not all backend feedback is rendered with the same quality

What needs work:

- normalize math strings before rendering
- improve KaTeX preprocessing
- avoid dumping broken extracted formulas directly into UI
- ensure solution / feedback content is consistently rendered through the same component path

### 4. Presentation quality is still uneven

Data is now real, but UI still needs polish:

- question nav is still too weak for long real papers
- status / difficulty / topic chips can be clearer
- workbench hierarchy is inconsistent across question types
- some pages still read like an internal demo rather than a finished study product

### 5. User upload flow still lacks dedup / library filtering

This is the next big backend product task.

Desired logic:

- when user uploads a paper, compare against existing course-library papers
- if it is already covered, do not create a duplicate paper
- if it is new, ingest it as `user_upload`
- if high quality and non-duplicate, optionally promote into library workflow later

### 6. Most non-Spring-2024 study aids are contaminated by template filler content

Current state:

- `COMP2211-2022-fall-midterm` has question-level LLM-authored study aids
- `COMP2211-2024-spring-midterm` is the intended quality bar
- the remaining papers were backfilled with a deprecated template script and should not be treated as production-quality AI content

Impact:

- `knowledge_reminder` is often generic topic boilerplate
- `ai_hint` often points to a parent problem header instead of the actual subquestion
- `solution` is often just wrapped reference text, not a true worked solution

Required action:

1. detect and clear templated study aids from affected papers
2. regenerate them through the real LLM path in [paper_processor.py](/Users/soda/Desktop/PastPaper%20Master/backend/app/services/paper_processor.py)
3. review output quality before marking the papers as complete

## Next Major Workstreams

### A. Real similar-question retrieval

Goal:

- no demo fallback
- same-course retrieval that feels trustworthy

Suggested implementation:

1. add a richer retrieval score in [questions.py](/Users/soda/Desktop/PastPaper%20Master/backend/app/routers/questions.py)
2. use:
   - same `course_code`
   - same `analytics_topic`
   - overlapping `topic_tags`
   - overlapping `skill_tags`
   - same or compatible `question_format`
   - lexical similarity on `question_text`
3. expose match reasons in response if useful
4. update UI to show why a question was retrieved

Potential DB improvement:

- add `search_text` / `tsvector` on `paper_questions`
- later optionally add `embedding`

### B. Real paper / topic statistics

Goal:

- analytics should be fully trustworthy at subquestion level

Suggested improvements:

- topic frequency by `analytics_topic`
- question-format distribution by subquestion, not by top-level problem
- per-paper breakdown
- high-yield topic trend across years
- topic-to-question index page for drill mode

### C. LaTeX and content rendering cleanup

Goal:

- all math-heavy content should render legibly

Suggested work:

- centralize HTML + KaTeX normalization
- strip broken OCR artifacts before render
- make study-aid content generation avoid malformed formula formatting
- ensure grading feedback and solutions share the same renderer pipeline

### D. User upload deduplication and library filtering

Goal:

- new uploads should not pollute the DB with duplicates

Suggested logic:

1. normalize upload metadata
2. compare against existing papers in same course:
   - year / term / exam_type / part_label
   - title similarity
   - extracted first-page markers
   - optional text fingerprint
3. if duplicate:
   - attach to existing paper or reject with explanation
4. if not duplicate:
   - create `user_upload`
   - process normally

Likely schema additions later:

- content fingerprint field on `papers`
- upload provenance fields
- moderation / promotion state for community uploads

### E. UI / UX pass

Priority items:

- stronger question navigation for real papers
- clearer ready / processing / failed states
- better paper list and filtering UX
- richer workbench metadata:
  - topic
  - difficulty
  - format
  - score
  - answered / wrong / mastered state
- unify visual style across analytics, error book, workbench

## Suggested Development Order

1. Remove similar-question demo fallback and ship real retrieval
2. Improve analytics and topic drill views using subquestion-level data
3. Fix LaTeX / rendering quality
4. Build upload dedup / filtering against existing library papers
5. Do a focused UI / UX pass after the real data flows are stable

## Operational Notes

### Frontend entry issue that was fixed

Homepage was previously still using mock papers and an old hardcoded `COMP2211` id. It now reads real papers from `listPapers()`.

### Manual content generation

The current `COMP2211` three-piece study aids were filled manually through local scripts and deterministic templates, not through external LLM batch processing. This is deliberate and keeps the current dataset stable.

### If rebuilding papers again

For `COMP2211`, use the manual splitters rather than rerunning generic extraction blindly.

`2024-spring-midterm` especially required reconstruction from PDF page spans because the earlier top-level extraction had already truncated `Problem 5` and `Problem 7`.

## Ready-to-Verify Checklist

If you want to sanity-check the current product quickly:

1. Open home page and filter `COMP2211`
2. Open each paper and confirm `status = ready`
3. Check question count matches:
   - `43 / 38 / 24 / 19 / 36 / 42 / 48`
4. Open analytics page for `COMP2211`
5. Open several papers and verify:
   - question nav loads
   - AI trio exists
   - topics render
   - similar-question panel does not block the page
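
Step 3 of the checklist can also be scripted instead of eyeballed. The snippet below is a minimal sketch: the expected counts and paper ids are the ones listed under Current Status, while `count_mismatches` and the shape of its `actual` argument (a plain `{paper_id: question_count}` mapping that the caller fetches from the DB however they like) are assumptions made for illustration.

```python
# Expected subquestion counts, copied from the "Question counts" list in this handoff.
EXPECTED_COUNTS = {
    "COMP2211-2022-fall-midterm": 43,
    "COMP2211-2022-spring-midterm": 38,
    "COMP2211-2022-spring-final-part-a": 24,
    "COMP2211-2022-spring-final-part-b": 19,
    "COMP2211-2023-spring-midterm": 36,
    "COMP2211-2024-spring-midterm": 42,
    "COMP2211-2024-spring-final": 48,
}


def count_mismatches(actual):
    """Return {paper_id: (expected, actual_or_None)} for every paper whose count is off.

    `actual` is a hypothetical {paper_id: question_count} mapping supplied by the caller.
    """
    return {
        paper_id: (expected, actual.get(paper_id))
        for paper_id, expected in EXPECTED_COUNTS.items()
        if actual.get(paper_id) != expected
    }
```

An empty result means all seven papers match; a missing paper shows up with `None` as its actual count.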
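
## Sketch: Similar-Question Scoring (Workstream A)

The ranking signals listed under workstream A could be combined roughly as follows. This is a sketch under stated assumptions, not the code in `questions.py`: the `Question` class, the `WEIGHTS` values, and `score_candidate` are hypothetical, and token-set Jaccard overlap stands in for "lexical similarity" until indexed text retrieval exists. Only the field names come from this handoff.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Question:
    # Field names mirror the columns named in this handoff; the class itself is hypothetical.
    course_code: str
    analytics_topic: str
    question_format: str
    question_text: str
    topic_tags: set = field(default_factory=set)
    skill_tags: set = field(default_factory=set)


def _jaccard(a, b):
    """Overlap of two sets in [0, 1]; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def _tokens(text):
    """Lower-cased alphanumeric tokens of a question text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


# Assumed weights, chosen only for illustration.
WEIGHTS = {
    "analytics_topic": 3.0,
    "topic_tags": 2.0,
    "skill_tags": 2.0,
    "question_format": 1.0,
    "text": 2.0,
}


def score_candidate(target, cand):
    """Return (score, match_reasons); candidates from another course score 0."""
    if cand.course_code != target.course_code:
        return 0.0, []
    score, reasons = 0.0, []
    if cand.analytics_topic == target.analytics_topic:
        score += WEIGHTS["analytics_topic"]
        reasons.append("same analytics_topic")
    for name in ("topic_tags", "skill_tags"):
        overlap = _jaccard(getattr(target, name), getattr(cand, name))
        if overlap > 0:
            score += WEIGHTS[name] * overlap
            reasons.append(f"overlapping {name}")
    if cand.question_format == target.question_format:
        score += WEIGHTS["question_format"]
        reasons.append("same question_format")
    text_sim = _jaccard(_tokens(target.question_text), _tokens(cand.question_text))
    if text_sim > 0:
        score += WEIGHTS["text"] * text_sim
        reasons.append("lexical overlap")
    return score, reasons
```

Returning the reasons alongside the score keeps the "expose match reasons" and "show why a question was retrieved" steps cheap, since the UI can render the list directly.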
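
## Sketch: Upload Duplicate Check (Workstream D)

The dedup logic suggested under workstream D might start from something like the following. Everything here is illustrative: `normalize_meta`, `text_fingerprint`, `is_duplicate`, the dict-shaped paper records, the `fingerprint` key, and the `0.9` title threshold are assumptions, not existing code or schema. First-page marker comparison is left out.

```python
import hashlib
import re
from difflib import SequenceMatcher


def normalize_meta(meta):
    """Lower-case and trim the identifying fields so trivially different uploads compare equal."""
    keys = ("course_code", "year", "term", "exam_type", "part_label")
    return tuple(str(meta.get(k) or "").strip().lower() for k in keys)


def text_fingerprint(text):
    """SHA-256 over whitespace-collapsed, lower-cased text (exact-match fingerprint only)."""
    collapsed = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(collapsed.encode("utf-8")).hexdigest()


def is_duplicate(upload, existing, title_threshold=0.9):
    """True if the upload looks like a paper that is already stored."""
    # An identical text fingerprint is treated as conclusive.
    fp = upload.get("fingerprint")
    if fp and fp == existing.get("fingerprint"):
        return True
    # Otherwise require matching metadata AND a sufficiently similar title.
    if normalize_meta(upload) != normalize_meta(existing):
        return False
    ratio = SequenceMatcher(
        None,
        str(upload.get("title", "")).lower(),
        str(existing.get("title", "")).lower(),
    ).ratio()
    return ratio >= title_threshold
```

A hash-based fingerprint only catches byte-identical text after normalization; noisy OCR would need a fuzzier fingerprint, which is why the metadata and title path exists as a second route.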