PastpaperMaster/HANDOFF_COMP2211.md
Zhao 7a09167261 Initial commit: PastPaper Master full stack
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-21 12:27:47 +07:00


# COMP2211 Handoff
## Current Status
`COMP2211` course-library papers are now fully loaded into Supabase and normalized to subquestion-level granularity.
Canonical papers currently in DB:
- `COMP2211-2022-fall-midterm`
- `COMP2211-2022-spring-midterm`
- `COMP2211-2022-spring-final-part-a`
- `COMP2211-2022-spring-final-part-b`
- `COMP2211-2023-spring-midterm`
- `COMP2211-2024-spring-midterm`
- `COMP2211-2024-spring-final`
All seven papers are:
- `status = ready`
- split to subquestion level
- tagged with `analytics_topic`, `topic_primary`, `topic_tags`, `skill_tags`
Question counts:
- 2022 fall midterm: `43`
- 2022 spring midterm: `38`
- 2022 spring final part A: `24`
- 2022 spring final part B: `19`
- 2023 spring midterm: `36`
- 2024 spring midterm: `42`
- 2024 spring final: `48`
## Key Files
Schema / SQL:
- [001_init_schema.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/001_init_schema.sql)
- [002_course_library_fields.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/002_course_library_fields.sql)
- [003_question_taxonomy_fields.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/003_question_taxonomy_fields.sql)
- [004_decouple_course_library_from_auth.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/004_decouple_course_library_from_auth.sql)
- [005_allow_long_question_format_alias.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/005_allow_long_question_format_alias.sql)
- [006_make_scores_numeric.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/006_make_scores_numeric.sql)
Course-library seeds:
- [comp2211_course_library_papers.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/seeds/comp2211_course_library_papers.sql)
- [comp2211_problem_taxonomy_backfill.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/seeds/comp2211_problem_taxonomy_backfill.sql)
- [comp2211_problem_level_questions.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/seeds/comp2211_problem_level_questions.sql)
Manual splitters used for final subquestion rebuild:
- [split_comp2211_2022_spring_midterm.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2022_spring_midterm.py)
- [split_comp2211_2022_spring_final_part_a.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2022_spring_final_part_a.py)
- [split_comp2211_2022_spring_final_part_b.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2022_spring_final_part_b.py)
- [split_comp2211_2023_spring_midterm.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2023_spring_midterm.py)
- [split_comp2211_2024_spring_midterm.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2024_spring_midterm.py)
- [split_comp2211_2024_spring_final.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2024_spring_final.py)
Deprecated filler script:
- [fill_manual_study_aids.py](/Users/soda/Desktop/PastPaper%20Master/backend/fill_manual_study_aids.py)
Audit / taxonomy references:
- [COMP2211.json](/Users/soda/Desktop/PastPaper%20Master/pastpaper-scraper/manifests/COMP2211.json)
- [COMP2211_taxonomy.json](/Users/soda/Desktop/PastPaper%20Master/pastpaper-scraper/manifests/COMP2211_taxonomy.json)
- [summary.json](/Users/soda/Desktop/PastPaper%20Master/pastpaper-scraper/reviews/COMP2211/summary.json)
- [problem_topics.json](/Users/soda/Desktop/PastPaper%20Master/pastpaper-scraper/reviews/COMP2211/problem_topics.json)
- [problem_seed.json](/Users/soda/Desktop/PastPaper%20Master/pastpaper-scraper/reviews/COMP2211/problem_seed.json)
Frontend / backend areas already adapted to real taxonomy:
- [frontend/src/pages/HomePage.tsx](/Users/soda/Desktop/PastPaper%20Master/frontend/src/pages/HomePage.tsx)
- [frontend/src/pages/AnalyticsPage.tsx](/Users/soda/Desktop/PastPaper%20Master/frontend/src/pages/AnalyticsPage.tsx)
- [frontend/src/components/workbench/SimilarHistoryPanel.tsx](/Users/soda/Desktop/PastPaper%20Master/frontend/src/components/workbench/SimilarHistoryPanel.tsx)
- [backend/app/routers/analytics.py](/Users/soda/Desktop/PastPaper%20Master/backend/app/routers/analytics.py)
- [backend/app/routers/questions.py](/Users/soda/Desktop/PastPaper%20Master/backend/app/routers/questions.py)
- [backend/app/routers/attempts.py](/Users/soda/Desktop/PastPaper%20Master/backend/app/routers/attempts.py)
## Important Product / Data Decisions Already Made
### Course library vs user upload
The two sources are now distinguished inside the `papers` table by `source_kind`:
- `source_kind = 'course_library'` for platform-owned papers
- `source_kind = 'user_upload'` for user-contributed papers
Course-library papers no longer require `user_id`.
### Taxonomy model
`question_type` is not the main analytics dimension.
Current intended usage:
- `question_type` / `question_format`: rendering and answer interaction
- `analytics_topic`: normalized analytics bucket
- `topic_tags`: multi-tag topical indexing
- `skill_tags`: finer-grained retrieval / grading / similarity support
### Score field
Scores are `NUMERIC`, not integer, because many subquestions use fractional marks like `1.5`.
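A minimal sketch of why fractional marks should stay exact on the Python side as well. It assumes marks arrive as strings or `Decimal` values; `total_score` is a hypothetical helper, not an existing backend function.

```python
from decimal import Decimal


def total_score(marks):
    """Sum per-subquestion marks exactly, without binary float rounding."""
    # Going through str() keeps inputs like "1.5" or Decimal("1.5") exact.
    return sum((Decimal(str(m)) for m in marks), Decimal("0"))


subquestion_marks = ["1.5", "2", "0.5", "1.5"]
print(total_score(subquestion_marks))  # 5.5
```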
## Known Issues
### 1. Similar-question retrieval is not yet production-ready
Current state:
- backend route exists
- frontend panel exists
- demo fallback still exists in the UI when retrieval returns empty / fails
What needs to be done:
- remove demo fallback behavior once real retrieval is stable
- improve ranking beyond current basic topic/type matching
- ideally add indexed text retrieval, then embeddings if needed
Recommended order:
1. build deterministic same-course retrieval first
2. rank by `analytics_topic`, `topic_tags`, `skill_tags`, `question_format`, text similarity
3. only then consider vector search
### 2. Analytics is real, but still not the final version
Current state:
- analytics already reads real DB data
- taxonomy fields are being used
Still missing:
- better topic normalization for edge cases
- per-paper and per-subtopic drill-down
- cleaner stats for mixed-format questions
- verified aggregate counts across all courses, not only `COMP2211`
### 3. LaTeX / math rendering is still fragile
Known symptoms:
- OCR / extracted math strings are noisy
- some generated HTML contains malformed or hard-to-read math fragments
- not all backend feedback is rendered with the same quality
What needs work:
- normalize math strings before rendering
- improve KaTeX preprocessing
- avoid dumping broken extracted formulas directly into UI
- ensure solution / feedback content is consistently rendered through the same component path
### 4. Presentation quality is still uneven
Data is now real, but UI still needs polish:
- question nav is still too weak for long real papers
- status / difficulty / topic chips can be clearer
- workbench hierarchy is inconsistent across question types
- some pages still read like an internal demo rather than a finished study product
### 5. User upload flow still lacks dedup / library filtering
This is the next big backend product task.
Desired logic:
- when user uploads a paper, compare against existing course-library papers
- if it is already covered, do not create a duplicate paper
- if it is new, ingest it as `user_upload`
- if high quality and non-duplicate, optionally promote into library workflow later
### 6. Most non-Spring-2024 study aids are contaminated by template filler content
Current state:
- `COMP2211-2022-fall-midterm` has question-level LLM-authored study aids
- `COMP2211-2024-spring-midterm` is the intended quality bar
- the remaining papers were backfilled with a deprecated template script and should not be treated as production-quality AI content
Impact:
- `knowledge_reminder` is often generic topic boilerplate
- `ai_hint` often points to a parent problem header instead of the actual subquestion
- `solution` is often just wrapped reference text, not a true worked solution
Required action:
1. detect and clear templated study aids from affected papers
2. regenerate them through the real LLM path in [paper_processor.py](/Users/soda/Desktop/PastPaper%20Master/backend/app/services/paper_processor.py)
3. review output quality before marking the papers as complete
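One possible heuristic for step 1, sketched under assumptions: template boilerplate tends to repeat verbatim across many questions, so a study-aid text shared by several questions in a paper is suspect. The row shape (dicts with `id` plus the study-aid field) and the `min_repeats` threshold are hypothetical and would need tuning against real data.

```python
from collections import defaultdict


def find_templated_aids(questions, field="knowledge_reminder", min_repeats=3):
    """Return sorted ids of questions whose `field` text is shared by
    at least `min_repeats` questions (likely template filler)."""
    groups = defaultdict(list)
    for q in questions:
        # Normalize whitespace and case so trivial variations still group together.
        text = " ".join((q.get(field) or "").split()).lower()
        if text:
            groups[text].append(q["id"])
    flagged = []
    for ids in groups.values():
        if len(ids) >= min_repeats:
            flagged.extend(ids)
    return sorted(flagged)
```

The same function can be run with `field="ai_hint"` or `field="solution"`. Flagged rows should still be reviewed before clearing, since short legitimate reminders can also repeat.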
## Next Major Workstreams
### A. Real similar-question retrieval
Goal:
- no demo fallback
- same-course retrieval that feels trustworthy
Suggested implementation:
1. add a richer retrieval score in [questions.py](/Users/soda/Desktop/PastPaper%20Master/backend/app/routers/questions.py)
2. use:
   - same `course_code`
   - same `analytics_topic`
   - overlapping `topic_tags`
   - overlapping `skill_tags`
   - same or compatible `question_format`
   - lexical similarity on `question_text`
3. expose match reasons in response if useful
4. update UI to show why a question was retrieved
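The scoring steps above can be sketched as a deterministic, dependency-free function. This is an illustration under assumptions, not the current code in `questions.py`: the `Question` dataclass, the `WEIGHTS` values, and the token-overlap text similarity are all hypothetical choices.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Question:
    id: str
    course_code: str
    analytics_topic: str
    question_format: str
    question_text: str
    topic_tags: set = field(default_factory=set)
    skill_tags: set = field(default_factory=set)


# Hypothetical weights; tune against real retrieval results.
WEIGHTS = {"analytics_topic": 3.0, "topic_tags": 2.0, "skill_tags": 2.0,
           "question_format": 1.0, "text": 2.0}


def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets count as no overlap."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(query, cand):
    """Return (score, match_reasons); other courses and self score zero."""
    if cand.id == query.id or cand.course_code != query.course_code:
        return 0.0, []
    parts = {
        "analytics_topic": float(cand.analytics_topic == query.analytics_topic),
        "topic_tags": jaccard(query.topic_tags, cand.topic_tags),
        "skill_tags": jaccard(query.skill_tags, cand.skill_tags),
        "question_format": float(cand.question_format == query.question_format),
        "text": jaccard(tokens(query.question_text), tokens(cand.question_text)),
    }
    total = sum(WEIGHTS[name] * value for name, value in parts.items())
    reasons = [name for name, value in parts.items() if value > 0]
    return total, reasons


def rank(query, candidates, limit=5):
    """Return up to `limit` (id, score, reasons) tuples, best first."""
    scored = []
    for cand in candidates:
        total, reasons = score(query, cand)
        if total > 0:
            scored.append((cand.id, total, reasons))
    # Tie-break on id so the ordering is deterministic.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:limit]
```

The `reasons` list is what step 3 would expose in the response, so the UI can show why a question was retrieved.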
Potential DB improvement:
- add `search_text` / `tsvector` on `paper_questions`
- later optionally add `embedding`
### B. Real paper / topic statistics
Goal:
- analytics should be fully trustworthy at subquestion level
Suggested improvements:
- topic frequency by `analytics_topic`
- question-format distribution by subquestion, not by top-level problem
- per-paper breakdown
- high-yield topic trend across years
- topic-to-question index page for drill mode
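A small sketch of the first and third items, counting at subquestion level. It assumes each row is a dict for one subquestion carrying `analytics_topic` and a `paper_id` key; `paper_id` is a hypothetical name for whatever column links a question to its paper.

```python
from collections import Counter, defaultdict


def topic_frequency(rows):
    """Count subquestions per analytics_topic across all given rows."""
    return Counter(row["analytics_topic"] for row in rows)


def topic_frequency_by_paper(rows):
    """Per-paper breakdown: paper_id -> Counter of analytics_topic."""
    breakdown = defaultdict(Counter)
    for row in rows:
        breakdown[row["paper_id"]][row["analytics_topic"]] += 1
    return dict(breakdown)
```

Because each row is one subquestion, a problem with five parts contributes five counts rather than one, which matches the subquestion-level goal stated above.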
### C. LaTeX and content rendering cleanup
Goal:
- all math-heavy content should render legibly
Suggested work:
- centralize HTML + KaTeX normalization
- strip broken OCR artifacts before render
- make study-aid content generation avoid malformed formula formatting
- ensure grading feedback and solutions share the same renderer pipeline
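A minimal sketch of a pre-render check that could sit in a centralized normalization step. It only collapses whitespace and tests whether `$` delimiters and braces are balanced; it does not repair formulas. `normalize_math` and `braces_balanced` are hypothetical helpers, and the real KaTeX preprocessing may need more rules.

```python
import re


def braces_balanced(s):
    """True if unescaped { and } are balanced and never close early."""
    depth = 0
    i = 0
    while i < len(s):
        ch = s[i]
        if ch == "\\":
            i += 2  # skip the escaped character, e.g. \{ or \$
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
        i += 1
    return depth == 0


def normalize_math(text):
    """Return (cleaned_text, is_renderable).

    is_renderable is False when math delimiters or braces look broken,
    so the caller can fall back to plain text instead of passing the
    fragment to the math renderer.
    """
    cleaned = re.sub(r"\s+", " ", text).strip()
    unescaped_dollars = len(re.findall(r"(?<!\\)\$", cleaned))
    ok = unescaped_dollars % 2 == 0 and braces_balanced(cleaned)
    return cleaned, ok
```

Routing solutions, hints, and grading feedback through the same check is one way to keep their rendering quality consistent.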
### D. User upload deduplication and library filtering
Goal:
- new uploads should not pollute the DB with duplicates
Suggested logic:
1. normalize upload metadata
2. compare against existing papers in same course:
   - year / term / exam_type / part_label
   - title similarity
   - extracted first-page markers
   - optional text fingerprint
3. if duplicate:
   - attach to existing paper or reject with explanation
4. if not duplicate:
   - create `user_upload`
   - process normally
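The comparison in steps 1 to 3 can be sketched as follows. Assumptions: papers are dicts with `course_code`, `year`, `term`, `exam_type`, and `part_label` keys, plus a hypothetical `first_page_text` field holding extracted first-page text. Title similarity is left out, and the exact-hash fingerprint is a simplification that OCR noise would defeat.

```python
import hashlib
import re

METADATA_FIELDS = ("course_code", "year", "term", "exam_type", "part_label")


def metadata_key(paper):
    """Normalized metadata tuple; missing values become empty strings."""
    return tuple(str(paper.get(f) or "").strip().lower() for f in METADATA_FIELDS)


def text_fingerprint(text):
    """SHA-256 of the text with case, punctuation, and spacing normalized."""
    normalized = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicate(upload, library_papers):
    """Return (paper_id, reason) for the first match in the same course, else None."""
    upload_key = metadata_key(upload)
    upload_text = upload.get("first_page_text")
    for paper in library_papers:
        key = metadata_key(paper)
        if key[0] != upload_key[0]:
            continue  # only compare within the same course
        if key == upload_key:
            return paper["id"], "metadata"
        paper_text = paper.get("first_page_text")
        if upload_text and paper_text and (
            text_fingerprint(upload_text) == text_fingerprint(paper_text)
        ):
            return paper["id"], "fingerprint"
    return None
```

A `None` result corresponds to step 4: ingest the paper as `user_upload` and process it normally.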
Likely schema additions later:
- content fingerprint field on `papers`
- upload provenance fields
- moderation / promotion state for community uploads
### E. UI / UX pass
Priority items:
- stronger question navigation for real papers
- clearer ready / processing / failed states
- better paper list and filtering UX
- richer workbench metadata:
  - topic
  - difficulty
  - format
  - score
  - answered / wrong / mastered state
- unify visual style across analytics, error book, workbench
## Suggested Development Order
1. Remove similar-question demo fallback and ship real retrieval
2. Improve analytics and topic drill views using subquestion-level data
3. Fix LaTeX / rendering quality
4. Build upload dedup / filtering against existing library papers
5. Do a focused UI / UX pass after the real data flows are stable
## Operational Notes
### Frontend entry issue that was fixed
The home page previously used mock papers and a stale hardcoded `COMP2211` id.
It now reads real papers from `listPapers()`.
### Manual content generation
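Most of the current `COMP2211` three-piece study aids (`knowledge_reminder`, `ai_hint`, `solution`) were filled through local scripts and deterministic templates, not through external LLM batch processing. This was deliberate and kept the dataset stable, but the templated content is not production quality (see Known Issue 6).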
### If rebuilding papers again
For `COMP2211`, use the manual splitters rather than blindly rerunning generic extraction. The `2024-spring-midterm` paper in particular had to be reconstructed from PDF page spans because the earlier top-level extraction had truncated `Problem 5` and `Problem 7`.
## Ready-to-Verify Checklist
If you want to sanity-check the current product quickly:
1. Open home page and filter `COMP2211`
2. Open each paper and confirm `status = ready`
3. Check that question counts match, in the paper order listed under Question counts:
   - `43 / 38 / 24 / 19 / 36 / 42 / 48`
4. Open analytics page for `COMP2211`
5. Open several papers and verify:
   - question nav loads
   - the three study aids (`knowledge_reminder`, `ai_hint`, `solution`) exist
   - topics render
   - similar-question panel does not block the page
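Step 3 can also be checked with a short script. The expected counts below are copied from the Question counts list; `count_mismatches` is a hypothetical helper, and the `actual` mapping of paper id to question count would have to come from a real DB query that is not shown here.

```python
EXPECTED_COUNTS = {
    "COMP2211-2022-fall-midterm": 43,
    "COMP2211-2022-spring-midterm": 38,
    "COMP2211-2022-spring-final-part-a": 24,
    "COMP2211-2022-spring-final-part-b": 19,
    "COMP2211-2023-spring-midterm": 36,
    "COMP2211-2024-spring-midterm": 42,
    "COMP2211-2024-spring-final": 48,
}


def count_mismatches(actual):
    """Return {paper_id: (expected, actual)} for every paper that differs.

    A paper missing from `actual` is reported with an actual value of None.
    """
    return {
        paper_id: (expected, actual.get(paper_id))
        for paper_id, expected in EXPECTED_COUNTS.items()
        if actual.get(paper_id) != expected
    }
```

An empty result means all seven papers match the documented counts.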