Initial commit: PastPaper Master full stack
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
328
HANDOFF_COMP2211.md
Normal file
@@ -0,0 +1,328 @@
# COMP2211 Handoff

## Current Status

`COMP2211` course-library papers are now fully loaded into Supabase and normalized to subquestion-level granularity.

Canonical papers currently in DB:

- `COMP2211-2022-fall-midterm`
- `COMP2211-2022-spring-midterm`
- `COMP2211-2022-spring-final-part-a`
- `COMP2211-2022-spring-final-part-b`
- `COMP2211-2023-spring-midterm`
- `COMP2211-2024-spring-midterm`
- `COMP2211-2024-spring-final`

All seven papers are:

- `status = ready`
- split to subquestion level
- tagged with `analytics_topic`, `topic_primary`, `topic_tags`, `skill_tags`

Question counts:

- 2022 fall midterm: `43`
- 2022 spring midterm: `38`
- 2022 spring final part A: `24`
- 2022 spring final part B: `19`
- 2023 spring midterm: `36`
- 2024 spring midterm: `42`
- 2024 spring final: `48`

## Key Files

Schema / SQL:

- [001_init_schema.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/001_init_schema.sql)
- [002_course_library_fields.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/002_course_library_fields.sql)
- [003_question_taxonomy_fields.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/003_question_taxonomy_fields.sql)
- [004_decouple_course_library_from_auth.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/004_decouple_course_library_from_auth.sql)
- [005_allow_long_question_format_alias.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/005_allow_long_question_format_alias.sql)
- [006_make_scores_numeric.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/006_make_scores_numeric.sql)

Course-library seeds:

- [comp2211_course_library_papers.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/seeds/comp2211_course_library_papers.sql)
- [comp2211_problem_taxonomy_backfill.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/seeds/comp2211_problem_taxonomy_backfill.sql)
- [comp2211_problem_level_questions.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/seeds/comp2211_problem_level_questions.sql)

Manual splitters used for final subquestion rebuild:

- [split_comp2211_2022_spring_midterm.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2022_spring_midterm.py)
- [split_comp2211_2022_spring_final_part_a.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2022_spring_final_part_a.py)
- [split_comp2211_2022_spring_final_part_b.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2022_spring_final_part_b.py)
- [split_comp2211_2023_spring_midterm.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2023_spring_midterm.py)
- [split_comp2211_2024_spring_midterm.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2024_spring_midterm.py)
- [split_comp2211_2024_spring_final.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2024_spring_final.py)

Deprecated filler script:

- [fill_manual_study_aids.py](/Users/soda/Desktop/PastPaper%20Master/backend/fill_manual_study_aids.py)

Audit / taxonomy references:

- [COMP2211.json](/Users/soda/Desktop/PastPaper%20Master/pastpaper-scraper/manifests/COMP2211.json)
- [COMP2211_taxonomy.json](/Users/soda/Desktop/PastPaper%20Master/pastpaper-scraper/manifests/COMP2211_taxonomy.json)
- [summary.json](/Users/soda/Desktop/PastPaper%20Master/pastpaper-scraper/reviews/COMP2211/summary.json)
- [problem_topics.json](/Users/soda/Desktop/PastPaper%20Master/pastpaper-scraper/reviews/COMP2211/problem_topics.json)
- [problem_seed.json](/Users/soda/Desktop/PastPaper%20Master/pastpaper-scraper/reviews/COMP2211/problem_seed.json)

Frontend / backend areas already adapted to real taxonomy:

- [frontend/src/pages/HomePage.tsx](/Users/soda/Desktop/PastPaper%20Master/frontend/src/pages/HomePage.tsx)
- [frontend/src/pages/AnalyticsPage.tsx](/Users/soda/Desktop/PastPaper%20Master/frontend/src/pages/ErrorBookPage.tsx)
- [frontend/src/components/workbench/SimilarHistoryPanel.tsx](/Users/soda/Desktop/PastPaper%20Master/frontend/src/components/workbench/SimilarHistoryPanel.tsx)
- [backend/app/routers/analytics.py](/Users/soda/Desktop/PastPaper%20Master/backend/app/routers/analytics.py)
- [backend/app/routers/questions.py](/Users/soda/Desktop/PastPaper%20Master/backend/app/routers/questions.py)
- [backend/app/routers/attempts.py](/Users/soda/Desktop/PastPaper%20Master/backend/app/routers/attempts.py)

## Important Product / Data Decisions Already Made

### Course library vs user upload

The two sources are now distinguished inside `papers` by `source_kind`:

- `source_kind = 'course_library'` for platform-owned papers
- `source_kind = 'user_upload'` for user-contributed papers

Course-library papers no longer require `user_id`.

### Taxonomy model

`question_type` is not the main analytics dimension.

Current intended usage:

- `question_type` / `question_format`: rendering and answer interaction
- `analytics_topic`: normalized analytics bucket
- `topic_tags`: multi-tag topical indexing
- `skill_tags`: finer-grained retrieval / grading / similarity support

### Score field

Scores are `NUMERIC`, not integer, because many subquestions use fractional marks like `1.5`.
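
To illustrate why integer scores would lose information, here is a minimal sketch that sums fractional marks exactly with Python's `decimal.Decimal`. The mark values are made up for the example, and whether this project's data path actually surfaces `NUMERIC` as `Decimal` is an assumption (drivers such as psycopg do, while JSON-based clients return plain numbers).

```python
from decimal import Decimal

# Hypothetical subquestion marks; real values live in the DB.
marks = [Decimal("1.5"), Decimal("2"), Decimal("0.5"), Decimal("1.5")]


def total_score(values):
    """Sum marks exactly, without binary floating-point drift."""
    return sum(values, Decimal("0"))


def truncated_total(values):
    """What the total would be if each mark were stored as an integer."""
    return sum(int(v) for v in values)


exact = total_score(marks)      # keeps the half marks
lossy = truncated_total(marks)  # silently drops them
```

The gap between `exact` and `lossy` is the amount of credit an integer column would discard across a paper.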

## Known Issues

### 1. Similar-question retrieval is not yet production-ready

Current state:

- backend route exists
- frontend panel exists
- demo fallback still exists in the UI when retrieval returns empty / fails

What needs to be done:

- remove demo fallback behavior once real retrieval is stable
- improve ranking beyond current basic topic/type matching
- ideally add indexed text retrieval, then embeddings if needed

Recommended order:

1. build deterministic same-course retrieval first
2. rank by `analytics_topic`, `topic_tags`, `skill_tags`, `question_format`, text similarity
3. only then consider vector search

### 2. Analytics is real, but still not the final version

Current state:

- analytics already reads real DB data
- taxonomy fields are being used

Still missing:

- better topic normalization for edge cases
- per-paper and per-subtopic drill-down
- cleaner stats for mixed-format questions
- confidence around aggregated counts across all courses, not only `COMP2211`

### 3. LaTeX / math rendering is still fragile

Known symptoms:

- OCR / extracted math strings are noisy
- some generated HTML contains malformed or hard-to-read math fragments
- not all backend feedback is rendered with the same quality

What needs work:

- normalize math strings before rendering
- improve KaTeX preprocessing
- avoid dumping broken extracted formulas directly into UI
- ensure solution / feedback content is consistently rendered through the same component path

### 4. Presentation quality is still uneven

Data is now real, but UI still needs polish:

- question nav is still too weak for long real papers
- status / difficulty / topic chips can be clearer
- workbench hierarchy is inconsistent across question types
- some pages still read like an internal demo rather than a finished study product

### 5. User upload flow still lacks dedup / library filtering

This is the next big backend product task.

Desired logic:

- when user uploads a paper, compare against existing course-library papers
- if it is already covered, do not create a duplicate paper
- if it is new, ingest it as `user_upload`
- if high quality and non-duplicate, optionally promote into library workflow later

### 6. Study aids for most papers are contaminated by template filler content

Current state:

- `COMP2211-2022-fall-midterm` has question-level LLM-authored study aids
- `COMP2211-2024-spring-midterm` is the intended quality bar
- the remaining papers were backfilled with a deprecated template script and should not be treated as production-quality AI content

Impact:

- `knowledge_reminder` is often generic topic boilerplate
- `ai_hint` often points to a parent problem header instead of the actual subquestion
- `solution` is often just wrapped reference text, not a true worked solution

Required action:

1. detect and clear templated study aids from affected papers
2. regenerate them through the real LLM path in [paper_processor.py](/Users/soda/Desktop/PastPaper%20Master/backend/app/services/paper_processor.py)
3. review output quality before marking the papers as complete
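
One possible way to approach step 1 is a repetition heuristic: template filler tends to produce the same text across many questions. The sketch below flags questions whose study-aid text is shared by several others. The `id` key, the dict-shaped rows, and the `min_repeats` threshold are assumptions for illustration, not the project's actual detection logic.

```python
from collections import Counter


def _normalize(text):
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return " ".join((text or "").lower().split())


def find_templated_aids(questions, field="knowledge_reminder", min_repeats=3):
    """Return ids of questions whose `field` text is shared by >= min_repeats questions."""
    counts = Counter(_normalize(q.get(field)) for q in questions)
    flagged = []
    for q in questions:
        key = _normalize(q.get(field))
        # Empty or missing text is not treated as templated here.
        if key and counts[key] >= min_repeats:
            flagged.append(q["id"])
    return flagged
```

The same function can be run with `field="ai_hint"` or `field="solution"`. Flagged rows are candidates for manual review rather than automatic deletion, since legitimately similar questions may share a reminder.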

## Next Major Workstreams

### A. Real similar-question retrieval

Goal:

- no demo fallback
- same-course retrieval that feels trustworthy

Suggested implementation:

1. add a richer retrieval score in [questions.py](/Users/soda/Desktop/PastPaper%20Master/backend/app/routers/questions.py)
2. use:
   - same `course_code`
   - same `analytics_topic`
   - overlapping `topic_tags`
   - overlapping `skill_tags`
   - same or compatible `question_format`
   - lexical similarity on `question_text`
3. expose match reasons in response if useful
4. update UI to show why a question was retrieved
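
The signals listed above could be combined roughly as follows. This is a minimal sketch, not the code in `questions.py`: the `Question` dataclass, the `id` field, and the weights are hypothetical, and character-level `difflib` similarity stands in for whatever lexical matching is eventually chosen. Only exact `question_format` equality is handled; "compatible" formats would need a mapping this document does not define.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Question:
    # Field names follow the columns named in this handoff; `id` is assumed.
    id: str
    course_code: str
    analytics_topic: str
    question_format: str
    question_text: str
    topic_tags: set = field(default_factory=set)
    skill_tags: set = field(default_factory=set)


# Hypothetical weights; tune against real retrieval feedback.
WEIGHTS = {"topic": 3.0, "topic_tags": 2.0, "skill_tags": 2.0, "format": 1.0, "text": 2.0}


def jaccard(a, b):
    """Overlap of two tag sets in [0, 1]; two empty sets score 0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score(query, cand):
    """Return (score, reasons) for one candidate from the same course."""
    total, reasons = 0.0, []
    if cand.analytics_topic == query.analytics_topic:
        total += WEIGHTS["topic"]
        reasons.append("same analytics_topic")
    for name in ("topic_tags", "skill_tags"):
        overlap = jaccard(getattr(query, name), getattr(cand, name))
        if overlap > 0:
            total += WEIGHTS[name] * overlap
            reasons.append(f"overlapping {name}")
    if cand.question_format == query.question_format:
        total += WEIGHTS["format"]
        reasons.append("same question_format")
    # Character-level similarity as a cheap stand-in for lexical matching.
    text_sim = SequenceMatcher(
        None, query.question_text.lower(), cand.question_text.lower()
    ).ratio()
    total += WEIGHTS["text"] * text_sim
    return total, reasons


def rank_similar(query, candidates, limit=5):
    """Rank same-course candidates by descending score, excluding the query itself."""
    pool = [c for c in candidates if c.course_code == query.course_code and c.id != query.id]
    scored = [(c, *score(query, c)) for c in pool]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:limit]
```

Each result is a `(question, score, reasons)` tuple, so the `reasons` list can be passed straight through to the response and shown in the UI. The scoring is deterministic, which makes ranking regressions easy to test before any vector search is introduced.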

Potential DB improvement:

- add `search_text` / `tsvector` on `paper_questions`
- later optionally add `embedding`

### B. Real paper / topic statistics

Goal:

- analytics should be fully trustworthy at subquestion level

Suggested improvements:

- topic frequency by `analytics_topic`
- question-format distribution by subquestion, not by top-level problem
- per-paper breakdown
- high-yield topic trend across years
- topic-to-question index page for drill mode
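
The first three aggregations can be sketched over subquestion rows with `collections.Counter`. The row shape is an assumption: each row is taken to be a dict with `paper_id`, `analytics_topic`, and `question_format` keys (`paper_id` in particular is a guessed column name), with one row per subquestion.

```python
from collections import Counter, defaultdict


def topic_frequency(rows):
    """Count subquestions per analytics_topic."""
    return Counter(r["analytics_topic"] for r in rows)


def format_distribution(rows):
    """Count subquestions per question_format (one row = one subquestion)."""
    return Counter(r["question_format"] for r in rows)


def per_paper_breakdown(rows):
    """Map each paper to its own topic counts."""
    breakdown = defaultdict(Counter)
    for r in rows:
        breakdown[r["paper_id"]][r["analytics_topic"]] += 1
    return {paper: dict(counts) for paper, counts in breakdown.items()}
```

Because every function consumes subquestion rows directly, a multi-part problem contributes one count per part rather than one count per top-level problem.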

### C. LaTeX and content rendering cleanup

Goal:

- all math-heavy content should render legibly

Suggested work:

- centralize HTML + KaTeX normalization
- strip broken OCR artifacts before render
- make study-aid content generation avoid malformed formula formatting
- ensure grading feedback and solutions share the same renderer pipeline
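
A centralized normalization step might start with something as small as the sketch below: collapse stray whitespace and check that `$` delimiters and braces are balanced, so the caller can fall back to plain text instead of handing a broken formula to the renderer. The function names are hypothetical, and the check is deliberately shallow. It does not validate LaTeX commands or handle `\(...\)` delimiters.

```python
import re


def delimiters_balanced(text):
    """Return True if unescaped `$` signs pair up and braces nest correctly."""
    # Drop backslash escapes (\$, \{, \\, and command prefixes) before counting.
    stripped = re.sub(r"\\.", "", text)
    if stripped.count("$") % 2 != 0:
        return False
    depth = 0
    for ch in stripped:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def normalize_math(text):
    """Collapse runs of spaces/tabs and report whether the result looks renderable."""
    cleaned = re.sub(r"[ \t]+", " ", text).strip()
    return cleaned, delimiters_balanced(cleaned)
```

If both grading feedback and solutions pass through the same `normalize_math` call before rendering, malformed fragments are caught in one place rather than per component.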

### D. User upload deduplication and library filtering

Goal:

- new uploads should not pollute the DB with duplicates

Suggested logic:

1. normalize upload metadata
2. compare against existing papers in same course:
   - year / term / exam_type / part_label
   - title similarity
   - extracted first-page markers
   - optional text fingerprint
3. if duplicate:
   - attach to existing paper or reject with explanation
4. if not duplicate:
   - create `user_upload`
   - process normally
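
The comparison in step 2 could look roughly like this. It is a sketch under assumptions: papers are dicts, `title` and `first_page_text` are hypothetical field names, the fingerprint is a plain SHA-256 over whitespace-normalized text (so it only catches near-verbatim copies), and the `0.9` title threshold is arbitrary.

```python
import hashlib
from difflib import SequenceMatcher


def metadata_key(paper):
    """Normalized (course, year, term, exam_type, part_label) tuple."""
    return (
        paper["course_code"].upper(),
        paper["year"],
        paper["term"].lower(),
        paper["exam_type"].lower(),
        (paper.get("part_label") or "").lower(),
    )


def text_fingerprint(text):
    """SHA-256 of lowercased, whitespace-collapsed text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicate(upload, existing, title_threshold=0.9):
    """Return (matching_paper, reason), or (None, 'no match')."""
    upload_key = metadata_key(upload)
    upload_fp = text_fingerprint(upload["first_page_text"])
    for paper in existing:
        if paper["course_code"].upper() != upload["course_code"].upper():
            continue  # only compare within the same course
        if metadata_key(paper) == upload_key:
            return paper, "metadata match"
        if text_fingerprint(paper["first_page_text"]) == upload_fp:
            return paper, "text fingerprint match"
        ratio = SequenceMatcher(None, paper["title"].lower(), upload["title"].lower()).ratio()
        if ratio >= title_threshold:
            return paper, "title similarity"
    return None, "no match"
```

Returning the reason alongside the match supports the "reject with explanation" branch in step 3. Storing the fingerprint on `papers` would avoid recomputing it for every upload.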

Likely schema additions later:

- content fingerprint field on `papers`
- upload provenance fields
- moderation / promotion state for community uploads

### E. UI / UX pass

Priority items:

- stronger question navigation for real papers
- clearer ready / processing / failed states
- better paper list and filtering UX
- richer workbench metadata:
  - topic
  - difficulty
  - format
  - score
  - answered / wrong / mastered state
- unify visual style across analytics, error book, workbench

## Suggested Development Order

1. Remove similar-question demo fallback and ship real retrieval
2. Improve analytics and topic drill views using subquestion-level data
3. Fix LaTeX / rendering quality
4. Build upload dedup / filtering against existing library papers
5. Do a focused UI / UX pass after the real data flows are stable

## Operational Notes

### Frontend entry issue that was fixed

The homepage previously used mock papers and an old hardcoded `COMP2211` id.
It now reads real papers from `listPapers()`.

### Manual content generation

Most of the current `COMP2211` three-piece study aids (`knowledge_reminder`, `ai_hint`, `solution`) were filled through local scripts and deterministic templates, not through external LLM batch processing. This was a deliberate choice to keep the dataset stable, but the templated content is not production quality and needs regeneration (see Known Issue 6).

### If rebuilding papers again

For `COMP2211`, use the manual splitters rather than rerunning generic extraction blindly. `2024-spring-midterm` especially required reconstruction from PDF page spans because the earlier top-level extraction had already truncated `Problem 5` and `Problem 7`.

## Ready-to-Verify Checklist

If you want to sanity-check the current product quickly:

1. Open home page and filter `COMP2211`
2. Open each paper and confirm `status = ready`
3. Check question count matches:
   - `43 / 38 / 24 / 19 / 36 / 42 / 48`
4. Open analytics page for `COMP2211`
5. Open several papers and verify:
   - question nav loads
   - AI trio exists
   - topics render
   - similar-question panel does not block the page
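
Step 3 can also be checked mechanically. The sketch below compares observed per-paper counts against the figures in this handoff. How `actual` is obtained (for example, a grouped count over `paper_questions`) is left open, and keying it by the canonical paper names is an assumption.

```python
# Expected subquestion counts, taken from the "Question counts" list in this handoff.
EXPECTED_COUNTS = {
    "COMP2211-2022-fall-midterm": 43,
    "COMP2211-2022-spring-midterm": 38,
    "COMP2211-2022-spring-final-part-a": 24,
    "COMP2211-2022-spring-final-part-b": 19,
    "COMP2211-2023-spring-midterm": 36,
    "COMP2211-2024-spring-midterm": 42,
    "COMP2211-2024-spring-final": 48,
}


def count_mismatches(actual):
    """Return {paper: (expected, observed)} for every paper whose count differs."""
    return {
        paper: (expected, actual.get(paper))
        for paper, expected in EXPECTED_COUNTS.items()
        if actual.get(paper) != expected
    }
```

An empty result means all seven papers match. A missing paper shows up with an observed value of `None`.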