# COMP2211 Handoff

## Current Status

`COMP2211` course-library papers are now fully loaded into Supabase and normalized to subquestion-level granularity.

Canonical papers currently in DB:

- `COMP2211-2022-fall-midterm`
- `COMP2211-2022-spring-midterm`
- `COMP2211-2022-spring-final-part-a`
- `COMP2211-2022-spring-final-part-b`
- `COMP2211-2023-spring-midterm`
- `COMP2211-2024-spring-midterm`
- `COMP2211-2024-spring-final`

All seven papers are:

- `status = ready`
- split to subquestion level
- tagged with `analytics_topic`, `topic_primary`, `topic_tags`, `skill_tags`

Question counts:

- 2022 fall midterm: `43`
- 2022 spring midterm: `38`
- 2022 spring final part A: `24`
- 2022 spring final part B: `19`
- 2023 spring midterm: `36`
- 2024 spring midterm: `42`
- 2024 spring final: `48`

## Key Files

Schema / SQL:

- [001_init_schema.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/001_init_schema.sql)
- [002_course_library_fields.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/002_course_library_fields.sql)
- [003_question_taxonomy_fields.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/003_question_taxonomy_fields.sql)
- [004_decouple_course_library_from_auth.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/004_decouple_course_library_from_auth.sql)
- [005_allow_long_question_format_alias.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/005_allow_long_question_format_alias.sql)
- [006_make_scores_numeric.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/006_make_scores_numeric.sql)

Course-library seeds:

- [comp2211_course_library_papers.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/seeds/comp2211_course_library_papers.sql)
- [comp2211_problem_taxonomy_backfill.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/seeds/comp2211_problem_taxonomy_backfill.sql)
- [comp2211_problem_level_questions.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/seeds/comp2211_problem_level_questions.sql)

Manual splitters used for final subquestion rebuild:

- [split_comp2211_2022_spring_midterm.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2022_spring_midterm.py)
- [split_comp2211_2022_spring_final_part_a.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2022_spring_final_part_a.py)
- [split_comp2211_2022_spring_final_part_b.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2022_spring_final_part_b.py)
- [split_comp2211_2023_spring_midterm.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2023_spring_midterm.py)
- [split_comp2211_2024_spring_midterm.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2024_spring_midterm.py)
- [split_comp2211_2024_spring_final.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2024_spring_final.py)

Deprecated filler script:

- [fill_manual_study_aids.py](/Users/soda/Desktop/PastPaper%20Master/backend/fill_manual_study_aids.py)

Audit / taxonomy references:

- [COMP2211.json](/Users/soda/Desktop/PastPaper%20Master/pastpaper-scraper/manifests/COMP2211.json)
- [COMP2211_taxonomy.json](/Users/soda/Desktop/PastPaper%20Master/pastpaper-scraper/manifests/COMP2211_taxonomy.json)
- [summary.json](/Users/soda/Desktop/PastPaper%20Master/pastpaper-scraper/reviews/COMP2211/summary.json)
- [problem_topics.json](/Users/soda/Desktop/PastPaper%20Master/pastpaper-scraper/reviews/COMP2211/problem_topics.json)
- [problem_seed.json](/Users/soda/Desktop/PastPaper%20Master/pastpaper-scraper/reviews/COMP2211/problem_seed.json)

Frontend / backend areas already adapted to real taxonomy:

- [frontend/src/pages/HomePage.tsx](/Users/soda/Desktop/PastPaper%20Master/frontend/src/pages/HomePage.tsx)
- [frontend/src/pages/AnalyticsPage.tsx](/Users/soda/Desktop/PastPaper%20Master/frontend/src/pages/ErrorBookPage.tsx)
- [frontend/src/components/workbench/SimilarHistoryPanel.tsx](/Users/soda/Desktop/PastPaper%20Master/frontend/src/components/workbench/SimilarHistoryPanel.tsx)
- [backend/app/routers/analytics.py](/Users/soda/Desktop/PastPaper%20Master/backend/app/routers/analytics.py)
- [backend/app/routers/questions.py](/Users/soda/Desktop/PastPaper%20Master/backend/app/routers/questions.py)
- [backend/app/routers/attempts.py](/Users/soda/Desktop/PastPaper%20Master/backend/app/routers/attempts.py)

## Important Product / Data Decisions Already Made

### Course library vs user upload

The two sources are now distinguished inside `papers` by `source_kind`:

- `source_kind = 'course_library'` for platform-owned papers
- `source_kind = 'user_upload'` for user-contributed papers

Course-library papers no longer require `user_id`.

### Taxonomy model

`question_type` is not the main analytics dimension.

Current intended usage:

- `question_type` / `question_format`: rendering and answer interaction
- `analytics_topic`: normalized analytics bucket
- `topic_tags`: multi-tag topical indexing
- `skill_tags`: finer-grained retrieval / grading / similarity support

### Score field

Scores are `NUMERIC`, not integer, because many subquestions use fractional marks like `1.5`.

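
Fractional marks are easy to lose on the application side even when the column is `NUMERIC`. A minimal sketch of the difference, assuming scores are available as decimal strings; `total_score` and the mark values are hypothetical and not taken from the repo:

```python
from decimal import Decimal


def total_score(marks: list[str]) -> Decimal:
    """Sum per-subquestion marks exactly, keeping fractional values."""
    return sum((Decimal(m) for m in marks), Decimal("0"))


marks = ["1.5", "2", "0.5", "1.5"]  # made-up marks for four subquestions
print(total_score(marks))                 # 5.5
print(sum(int(float(m)) for m in marks))  # 4, integer storage drops the halves
```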
## Known Issues

### 1. Similar-question retrieval is not yet production-ready

Current state:

- backend route exists
- frontend panel exists
- demo fallback still exists in the UI when retrieval returns empty / fails

What needs to be done:

- remove demo fallback behavior once real retrieval is stable
- improve ranking beyond current basic topic/type matching
- ideally add indexed text retrieval, then embeddings if needed

Recommended order:

1. build deterministic same-course retrieval first
2. rank by `analytics_topic`, `topic_tags`, `skill_tags`, `question_format`, text similarity
3. only then consider vector search

### 2. Analytics reads real data but is not yet final

Current state:

- analytics already reads real DB data
- taxonomy fields are being used

Still missing:

- better topic normalization for edge cases
- per-paper and per-subtopic drill-down
- cleaner stats for mixed-format questions
- confidence that aggregated counts are correct across all courses, not only `COMP2211`

### 3. LaTeX / math rendering is still fragile

Known symptoms:

- OCR / extracted math strings are noisy
- some generated HTML contains malformed or hard-to-read math fragments
- not all backend feedback is rendered with the same quality

What needs work:

- normalize math strings before rendering
- improve KaTeX preprocessing
- avoid dumping broken extracted formulas directly into UI
- ensure solution / feedback content is consistently rendered through the same component path

### 4. Presentation quality is still uneven

Data is now real, but UI still needs polish:

- question nav is still too weak for long real papers
- status / difficulty / topic chips can be clearer
- workbench hierarchy is inconsistent across question types
- some pages still read like an internal demo rather than a finished study product

### 5. User upload flow still lacks dedup / library filtering

This is the next big backend product task.

Desired logic:

- when user uploads a paper, compare against existing course-library papers
- if it is already covered, do not create a duplicate paper
- if it is new, ingest it as `user_upload`
- if high quality and non-duplicate, optionally promote into library workflow later

### 6. Most non-Spring-2024 study aids are contaminated by template filler content

Current state:

- `COMP2211-2022-fall-midterm` has question-level LLM-authored study aids
- `COMP2211-2024-spring-midterm` is the intended quality bar
- the remaining papers were backfilled with a deprecated template script and should not be treated as production-quality AI content

Impact:

- `knowledge_reminder` is often generic topic boilerplate
- `ai_hint` often points to a parent problem header instead of the actual subquestion
- `solution` is often just wrapped reference text, not a true worked solution

Required action:

1. detect and clear templated study aids from affected papers
2. regenerate them through the real LLM path in [paper_processor.py](/Users/soda/Desktop/PastPaper%20Master/backend/app/services/paper_processor.py)
3. review output quality before marking the papers as complete

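
Step 1 could start from a simple repetition heuristic: identical study-aid text attached to several different subquestions is very likely template output. A minimal sketch under that assumption; the row shape (`id` plus a study-aid field such as `knowledge_reminder`), the `find_templated` helper, and the threshold are all hypothetical and would need tuning against real data:

```python
from collections import defaultdict


def find_templated(rows, field, min_repeats=3):
    """Return ids of questions whose `field` text is shared by at least
    `min_repeats` questions (compared case- and whitespace-insensitively)."""
    groups = defaultdict(list)
    for row in rows:
        text = " ".join((row.get(field) or "").split()).lower()
        if text:  # ignore empty / missing study aids
            groups[text].append(row["id"])
    return sorted(
        qid
        for ids in groups.values()
        if len(ids) >= min_repeats
        for qid in ids
    )


# Made-up rows: three questions share one boilerplate reminder.
rows = [
    {"id": "q1", "knowledge_reminder": "Review the key ideas of this topic."},
    {"id": "q2", "knowledge_reminder": "Review the key  ideas of this topic."},
    {"id": "q3", "knowledge_reminder": "review the key ideas of this topic."},
    {"id": "q4", "knowledge_reminder": "A reminder written for this subquestion."},
]
print(find_templated(rows, "knowledge_reminder"))
```

Flagged ids are only candidates for clearing; a manual spot check before deleting anything is still advisable.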
## Next Major Workstreams

### A. Real similar-question retrieval

Goal:

- no demo fallback
- same-course retrieval that feels trustworthy

Suggested implementation:

1. add a richer retrieval score in [questions.py](/Users/soda/Desktop/PastPaper%20Master/backend/app/routers/questions.py)
2. use:
   - same `course_code`
   - same `analytics_topic`
   - overlapping `topic_tags`
   - overlapping `skill_tags`
   - same or compatible `question_format`
   - lexical similarity on `question_text`
3. expose match reasons in response if useful
4. update UI to show why a question was retrieved

Potential DB improvement:

- add `search_text` / `tsvector` on `paper_questions`
- later optionally add `embedding`

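
The signals listed above can be combined into one deterministic score. A minimal sketch, assuming a flat in-memory question record; the `Question` dataclass, the weights, and `rank_similar` are illustrative and do not reflect the actual code in `questions.py`:

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    id: str
    course_code: str
    analytics_topic: str
    question_format: str
    question_text: str
    topic_tags: frozenset = frozenset()
    skill_tags: frozenset = frozenset()


# Illustrative weights only; real values would need tuning.
WEIGHTS = {"analytics_topic": 3.0, "topic_tags": 2.0, "skill_tags": 2.0,
           "question_format": 1.0, "text": 1.0}


def jaccard(a, b) -> float:
    """Set overlap in [0, 1]; two empty sets count as no overlap."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def tokens(text: str) -> frozenset:
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))


def score_candidate(query: Question, cand: Question):
    """Return (score, match reasons); different-course candidates score 0."""
    if cand.id == query.id or cand.course_code != query.course_code:
        return 0.0, []
    score, reasons = 0.0, []
    if cand.analytics_topic == query.analytics_topic:
        score += WEIGHTS["analytics_topic"]
        reasons.append("same analytics_topic")
    for name in ("topic_tags", "skill_tags"):
        overlap = jaccard(getattr(query, name), getattr(cand, name))
        if overlap > 0:
            score += WEIGHTS[name] * overlap
            reasons.append(f"overlapping {name}")
    if cand.question_format == query.question_format:
        score += WEIGHTS["question_format"]
        reasons.append("same question_format")
    text_sim = jaccard(tokens(query.question_text), tokens(cand.question_text))
    if text_sim > 0:
        score += WEIGHTS["text"] * text_sim
        reasons.append("lexical similarity")
    return score, reasons


def rank_similar(query: Question, pool, limit: int = 5):
    """Rank by score, breaking ties by id so results are reproducible."""
    hits = []
    for cand in pool:
        score, reasons = score_candidate(query, cand)
        if score > 0:
            hits.append((score, reasons, cand))
    hits.sort(key=lambda h: (-h[0], h[2].id))
    return hits[:limit]
```

Returning the reasons alongside the score keeps steps 3 and 4 cheap: the response can carry them as-is and the UI can show them next to each retrieved question.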
### B. Real paper / topic statistics

Goal:

- analytics should be fully trustworthy at subquestion level

Suggested improvements:

- topic frequency by `analytics_topic`
- question-format distribution by subquestion, not by top-level problem
- per-paper breakdown
- high-yield topic trend across years
- topic-to-question index page for drill mode

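
The first three aggregations reduce to counting over subquestion rows. A minimal sketch, assuming each row is a dict with `paper_id`, `analytics_topic`, and `question_format` keys; the helper names and the sample topic / format values are placeholders, not real taxonomy entries:

```python
from collections import Counter, defaultdict


def topic_frequency(rows):
    """Count subquestions per analytics_topic."""
    return Counter(r["analytics_topic"] for r in rows)


def format_distribution(rows):
    """Count subquestions (not top-level problems) per question_format."""
    return Counter(r["question_format"] for r in rows)


def per_paper_breakdown(rows):
    """Map paper_id -> {analytics_topic: subquestion count}."""
    breakdown = defaultdict(Counter)
    for r in rows:
        breakdown[r["paper_id"]][r["analytics_topic"]] += 1
    return {paper: dict(counts) for paper, counts in breakdown.items()}


# Placeholder rows, one per subquestion.
rows = [
    {"paper_id": "COMP2211-2024-spring-midterm", "analytics_topic": "topic_a", "question_format": "mcq"},
    {"paper_id": "COMP2211-2024-spring-midterm", "analytics_topic": "topic_a", "question_format": "long_question"},
    {"paper_id": "COMP2211-2024-spring-final", "analytics_topic": "topic_b", "question_format": "mcq"},
]
print(topic_frequency(rows).most_common())
print(per_paper_breakdown(rows))
```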
### C. LaTeX and content rendering cleanup

Goal:

- all math-heavy content should render legibly

Suggested work:

- centralize HTML + KaTeX normalization
- strip broken OCR artifacts before render
- make study-aid content generation avoid malformed formula formatting
- ensure grading feedback and solutions share the same renderer pipeline

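
One possible shape for the centralized step is a small pre-render pass that cleans invisible characters and refuses to hand obviously broken math to the renderer. A minimal sketch; `normalize_math` and `is_renderable` are hypothetical helpers, the checks are deliberately crude (an escaped backslash directly before a brace is not handled), and nothing here calls KaTeX itself:

```python
import re

# Zero-width characters and BOMs that commonly survive OCR / PDF extraction.
_INVISIBLE = re.compile("[\u200b\u200c\u200d\ufeff]")


def normalize_math(text: str) -> str:
    """Drop invisible characters and collapse runs of spaces and tabs."""
    text = _INVISIBLE.sub("", text).replace("\u00a0", " ")
    return re.sub(r"[ \t]+", " ", text).strip()


def is_renderable(text: str) -> bool:
    """Cheap sanity check: paired `$` delimiters and balanced braces."""
    if len(re.findall(r"(?<!\\)\$", text)) % 2:
        return False
    depth = 0
    for prev, ch in zip(" " + text, text):
        if prev == "\\":  # skip escaped braces such as \{ and \}
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


raw = "$\\frac{1}{2$"  # extraction lost a closing brace
cleaned = normalize_math(raw)
print(cleaned if is_renderable(cleaned) else "[formula unavailable]")
```

Running both solutions and grading feedback through the same pair of functions is one way to get the shared renderer path described above.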
### D. User upload deduplication and library filtering

Goal:

- new uploads should not pollute the DB with duplicates

Suggested logic:

1. normalize upload metadata
2. compare against existing papers in same course:
   - year / term / exam_type / part_label
   - title similarity
   - extracted first-page markers
   - optional text fingerprint
3. if duplicate:
   - attach to existing paper or reject with explanation
4. if not duplicate:
   - create `user_upload`
   - process normally

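
The comparison in steps 1 to 4 could look roughly like the following. This is a sketch under stated assumptions: `PaperMeta`, `fingerprint`, `is_duplicate`, `classify_upload`, and the `0.9` title threshold are all hypothetical, first-page markers are omitted, and the text fingerprint is a plain SHA-256 over whitespace-normalized text, so it only catches near-exact copies:

```python
import hashlib
import re
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class PaperMeta:
    course_code: str
    year: int
    term: str
    exam_type: str
    title: str
    text: str
    part_label: str = ""


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return re.sub(r"\s+", " ", value).strip().lower()


def fingerprint(text: str) -> str:
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def is_duplicate(upload: PaperMeta, existing: PaperMeta,
                 title_threshold: float = 0.9) -> bool:
    if normalize(upload.course_code) != normalize(existing.course_code):
        return False
    if fingerprint(upload.text) == fingerprint(existing.text):
        return True
    same_slot = (
        upload.year == existing.year
        and normalize(upload.term) == normalize(existing.term)
        and normalize(upload.exam_type) == normalize(existing.exam_type)
        and normalize(upload.part_label) == normalize(existing.part_label)
    )
    title_sim = SequenceMatcher(
        None, normalize(upload.title), normalize(existing.title)
    ).ratio()
    return same_slot and title_sim >= title_threshold


def classify_upload(upload: PaperMeta, library) -> str:
    """Return 'duplicate' or 'user_upload' for a new upload."""
    if any(is_duplicate(upload, paper) for paper in library):
        return "duplicate"
    return "user_upload"
```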
Likely schema additions later:

- content fingerprint field on `papers`
- upload provenance fields
- moderation / promotion state for community uploads

### E. UI / UX pass

Priority items:

- stronger question navigation for real papers
- clearer ready / processing / failed states
- better paper list and filtering UX
- richer workbench metadata:
  - topic
  - difficulty
  - format
  - score
  - answered / wrong / mastered state
- unify visual style across analytics, error book, workbench

## Suggested Development Order

1. Remove similar-question demo fallback and ship real retrieval
2. Improve analytics and topic drill views using subquestion-level data
3. Fix LaTeX / rendering quality
4. Build upload dedup / filtering against existing library papers
5. Do a focused UI / UX pass after the real data flows are stable

## Operational Notes

### Frontend entry issue that was fixed

The home page previously used mock papers and an old hardcoded `COMP2211` id. It now reads real papers from `listPapers()`.

### Manual content generation

Most of the current `COMP2211` three-piece study aids (`knowledge_reminder`, `ai_hint`, `solution`) were filled through local scripts and deterministic templates, not through external LLM batch processing. This was deliberate and keeps the current dataset stable, but the templated content is not production quality (see Known Issue 6).

### If rebuilding papers again

For `COMP2211`, use the manual splitters rather than rerunning generic extraction blindly. `2024-spring-midterm` in particular had to be reconstructed from PDF page spans because the earlier top-level extraction had truncated `Problem 5` and `Problem 7`.

## Ready-to-Verify Checklist

If you want to sanity-check the current product quickly:

1. Open home page and filter `COMP2211`
2. Open each paper and confirm `status = ready`
3. Check question count matches:
   - `43 / 38 / 24 / 19 / 36 / 42 / 48`
4. Open analytics page for `COMP2211`
5. Open several papers and verify:
   - question nav loads
   - AI trio exists
   - topics render
   - similar-question panel does not block the page

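
Step 3 is easy to automate once per-paper counts can be pulled from the DB. A minimal sketch; the expected numbers come from the Current Status section, while `count_mismatches` and the idea of keying counts by the canonical paper name are assumptions (the real key may be a DB id rather than this slug):

```python
from typing import Optional

EXPECTED_COUNTS = {
    "COMP2211-2022-fall-midterm": 43,
    "COMP2211-2022-spring-midterm": 38,
    "COMP2211-2022-spring-final-part-a": 24,
    "COMP2211-2022-spring-final-part-b": 19,
    "COMP2211-2023-spring-midterm": 36,
    "COMP2211-2024-spring-midterm": 42,
    "COMP2211-2024-spring-final": 48,
}


def count_mismatches(actual: dict[str, int]) -> dict[str, tuple[int, Optional[int]]]:
    """Return {paper: (expected, actual)} for papers whose count is off.

    A missing paper shows up with an actual value of None.
    """
    return {
        paper: (expected, actual.get(paper))
        for paper, expected in EXPECTED_COUNTS.items()
        if actual.get(paper) != expected
    }


# `actual` would normally come from a per-paper COUNT query.
actual = dict(EXPECTED_COUNTS, **{"COMP2211-2024-spring-final": 47})
print(count_mismatches(actual))
```

An empty result means all seven papers match the documented counts.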