# COMP2211 Handoff

## Current Status

`COMP2211` course-library papers are now fully loaded into Supabase and normalized to subquestion-level granularity.

Canonical papers currently in DB:

- `COMP2211-2022-fall-midterm`
- `COMP2211-2022-spring-midterm`
- `COMP2211-2022-spring-final-part-a`
- `COMP2211-2022-spring-final-part-b`
- `COMP2211-2023-spring-midterm`
- `COMP2211-2024-spring-midterm`
- `COMP2211-2024-spring-final`

All seven papers are:

- `status = ready`
- split to subquestion level
- tagged with `analytics_topic`, `topic_primary`, `topic_tags`, `skill_tags`

Question counts:

- 2022 fall midterm: `43`
- 2022 spring midterm: `38`
- 2022 spring final part A: `24`
- 2022 spring final part B: `19`
- 2023 spring midterm: `36`
- 2024 spring midterm: `42`
- 2024 spring final: `48`

## Key Files

Schema / SQL:

- [001_init_schema.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/001_init_schema.sql)
- [002_course_library_fields.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/002_course_library_fields.sql)
- [003_question_taxonomy_fields.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/003_question_taxonomy_fields.sql)
- [004_decouple_course_library_from_auth.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/004_decouple_course_library_from_auth.sql)
- [005_allow_long_question_format_alias.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/005_allow_long_question_format_alias.sql)
- [006_make_scores_numeric.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/migrations/006_make_scores_numeric.sql)

Course-library seeds:

- [comp2211_course_library_papers.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/seeds/comp2211_course_library_papers.sql)
- [comp2211_problem_taxonomy_backfill.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/seeds/comp2211_problem_taxonomy_backfill.sql)
- [comp2211_problem_level_questions.sql](/Users/soda/Desktop/PastPaper%20Master/supabase/seeds/comp2211_problem_level_questions.sql)

Manual splitters used for final subquestion rebuild:

- [split_comp2211_2022_spring_midterm.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2022_spring_midterm.py)
- [split_comp2211_2022_spring_final_part_a.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2022_spring_final_part_a.py)
- [split_comp2211_2022_spring_final_part_b.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2022_spring_final_part_b.py)
- [split_comp2211_2023_spring_midterm.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2023_spring_midterm.py)
- [split_comp2211_2024_spring_midterm.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2024_spring_midterm.py)
- [split_comp2211_2024_spring_final.py](/Users/soda/Desktop/PastPaper%20Master/backend/split_comp2211_2024_spring_final.py)

Deprecated filler script:

- [fill_manual_study_aids.py](/Users/soda/Desktop/PastPaper%20Master/backend/fill_manual_study_aids.py)

Audit / taxonomy references:

- [COMP2211.json](/Users/soda/Desktop/PastPaper%20Master/pastpaper-scraper/manifests/COMP2211.json)
- [COMP2211_taxonomy.json](/Users/soda/Desktop/PastPaper%20Master/pastpaper-scraper/manifests/COMP2211_taxonomy.json)
- [summary.json](/Users/soda/Desktop/PastPaper%20Master/pastpaper-scraper/reviews/COMP2211/summary.json)
- [problem_topics.json](/Users/soda/Desktop/PastPaper%20Master/pastpaper-scraper/reviews/COMP2211/problem_topics.json)
- [problem_seed.json](/Users/soda/Desktop/PastPaper%20Master/pastpaper-scraper/reviews/COMP2211/problem_seed.json)

Frontend / backend areas already adapted to real taxonomy:

- [frontend/src/pages/HomePage.tsx](/Users/soda/Desktop/PastPaper%20Master/frontend/src/pages/HomePage.tsx)
- [frontend/src/pages/AnalyticsPage.tsx](/Users/soda/Desktop/PastPaper%20Master/frontend/src/pages/AnalyticsPage.tsx)
- [frontend/src/components/workbench/SimilarHistoryPanel.tsx](/Users/soda/Desktop/PastPaper%20Master/frontend/src/components/workbench/SimilarHistoryPanel.tsx)
- [backend/app/routers/analytics.py](/Users/soda/Desktop/PastPaper%20Master/backend/app/routers/analytics.py)
- [backend/app/routers/questions.py](/Users/soda/Desktop/PastPaper%20Master/backend/app/routers/questions.py)
- [backend/app/routers/attempts.py](/Users/soda/Desktop/PastPaper%20Master/backend/app/routers/attempts.py)

## Important Product / Data Decisions Already Made

### Course library vs user upload

The two sources are now distinguished inside `papers` by `source_kind`:

- `source_kind = 'course_library'` for platform-owned papers
- `source_kind = 'user_upload'` for user-contributed papers

Course-library papers no longer require `user_id`.

### Taxonomy model

`question_type` is not the main analytics dimension.

Current intended usage:

- `question_type` / `question_format`: rendering and answer interaction
- `analytics_topic`: normalized analytics bucket
- `topic_tags`: multi-tag topical indexing
- `skill_tags`: finer-grained retrieval / grading / similarity support

### Score field

Scores are `NUMERIC`, not integer, because many subquestions use fractional marks like `1.5`.
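
A minimal sketch of why this matters on the Python side. It assumes scores are summed with `Decimal`; whether the DB client actually returns `Decimal`, strings, or JSON numbers depends on the driver, so the helper `total_score` and the sample marks are illustrative only:

```python
from decimal import Decimal

def total_score(scores):
    """Sum subquestion marks exactly, avoiding float rounding drift."""
    # str() first so that a float like 0.1 becomes Decimal("0.1"),
    # not its long binary expansion.
    return sum((Decimal(str(s)) for s in scores), Decimal("0"))

# Hypothetical marks for the subquestions of one problem
marks = ["1.5", "0.5", "2", "1.5"]
print(total_score(marks))  # 5.5
```

An integer column would have forced `1.5` to be rounded or truncated, which is why the score columns were migrated.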

## Known Issues

### 1. Similar-question retrieval is not yet production-ready

Current state:

- backend route exists
- frontend panel exists
- demo fallback still exists in the UI when retrieval returns empty / fails

What needs to be done:

- remove demo fallback behavior once real retrieval is stable
- improve ranking beyond current basic topic/type matching
- ideally add indexed text retrieval, then embeddings if needed

Recommended order:

1. build deterministic same-course retrieval first
2. rank by `analytics_topic`, `topic_tags`, `skill_tags`, `question_format`, text similarity
3. only then consider vector search

### 2. Analytics is real, but still not the final version

Current state:

- analytics already reads real DB data
- taxonomy fields are being used

Still missing:

- better topic normalization for edge cases
- per-paper and per-subtopic drill-down
- cleaner stats for mixed-format questions
- verified aggregate counts across all courses, not only `COMP2211`

### 3. LaTeX / math rendering is still fragile

Known symptoms:

- OCR / extracted math strings are noisy
- some generated HTML contains malformed or hard-to-read math fragments
- not all backend feedback is rendered with the same quality

What needs work:

- normalize math strings before rendering
- improve KaTeX preprocessing
- avoid dumping broken extracted formulas directly into UI
- ensure solution / feedback content is consistently rendered through the same component path

### 4. Presentation quality is still uneven

Data is now real, but UI still needs polish:

- question nav is still too weak for long real papers
- status / difficulty / topic chips can be clearer
- workbench hierarchy is inconsistent across question types
- some pages still read like an internal demo rather than a finished study product

### 5. User upload flow still lacks dedup / library filtering

This is the next big backend product task.

Desired logic:

- when user uploads a paper, compare against existing course-library papers
- if it is already covered, do not create a duplicate paper
- if it is new, ingest it as `user_upload`
- if high quality and non-duplicate, optionally promote into library workflow later

### 6. Study aids on most papers are contaminated by template filler content

Current state:

- `COMP2211-2022-fall-midterm` has question-level LLM-authored study aids
- `COMP2211-2024-spring-midterm` is the intended quality bar
- the remaining papers were backfilled with a deprecated template script and should not be treated as production-quality AI content

Impact:

- `knowledge_reminder` is often generic topic boilerplate
- `ai_hint` often points to a parent problem header instead of the actual subquestion
- `solution` is often just wrapped reference text, not a true worked solution

Required action:

1. detect and clear templated study aids from affected papers
2. regenerate them through the real LLM path in [paper_processor.py](/Users/soda/Desktop/PastPaper%20Master/backend/app/services/paper_processor.py)
3. review output quality before marking the papers as complete
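
One possible heuristic for step 1, sketched under the assumption that template filler shows up as the same text repeated across many questions. The function name, the `id` key, and the `min_repeats` threshold are hypothetical; only the `knowledge_reminder` field name comes from the notes above:

```python
from collections import defaultdict

def find_repeated_aids(questions, field="knowledge_reminder", min_repeats=3):
    """Return ids of questions whose `field` text is shared (after case and
    whitespace normalization) by at least `min_repeats` questions."""
    groups = defaultdict(list)
    for q in questions:
        text = " ".join((q.get(field) or "").lower().split())
        if text:  # ignore empty or missing study aids
            groups[text].append(q["id"])
    flagged = set()
    for ids in groups.values():
        if len(ids) >= min_repeats:
            flagged.update(ids)
    return flagged
```

This only catches verbatim boilerplate. Hints that point at a parent problem header, or solutions that merely wrap reference text, would need separate checks or manual review.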

## Next Major Workstreams

### A. Real similar-question retrieval

Goal:

- no demo fallback
- same-course retrieval that feels trustworthy

Suggested implementation:

1. add a richer retrieval score in [questions.py](/Users/soda/Desktop/PastPaper%20Master/backend/app/routers/questions.py)
2. use:
   - same `course_code`
   - same `analytics_topic`
   - overlapping `topic_tags`
   - overlapping `skill_tags`
   - same or compatible `question_format`
   - lexical similarity on `question_text`
3. expose match reasons in response if useful
4. update UI to show why a question was retrieved
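
The scoring idea in steps 1 to 3 could look roughly like the sketch below. It is not the current code in `questions.py`: the weights, the helper names, and the `id` key are assumptions, the format check is exact match only (no notion of "compatible" formats), and lexical similarity uses the standard library's `difflib` as a stand-in for real text search:

```python
from difflib import SequenceMatcher

# Hypothetical weights, not tuned against real data.
WEIGHTS = {
    "analytics_topic": 3.0,
    "topic_tags": 2.0,
    "skill_tags": 2.0,
    "question_format": 1.0,
    "question_text": 2.0,
}

def jaccard(a, b):
    """Tag overlap as |A & B| / |A | B|; 0.0 when both lists are empty."""
    a, b = set(a or []), set(b or [])
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def same_value(query, cand, key):
    """1.0 when both sides have the same non-empty value for `key`."""
    return 1.0 if query.get(key) and query.get(key) == cand.get(key) else 0.0

def text_similarity(a, b):
    """Character-level similarity in [0, 1]; 0.0 if either text is missing."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def score_candidate(query, cand):
    """Return (score, reasons), or None for a candidate from another course."""
    if query["course_code"] != cand["course_code"]:
        return None
    parts = {
        "analytics_topic": same_value(query, cand, "analytics_topic"),
        "topic_tags": jaccard(query.get("topic_tags"), cand.get("topic_tags")),
        "skill_tags": jaccard(query.get("skill_tags"), cand.get("skill_tags")),
        "question_format": same_value(query, cand, "question_format"),
        "question_text": text_similarity(query.get("question_text"),
                                         cand.get("question_text")),
    }
    score = sum(WEIGHTS[k] * v for k, v in parts.items())
    # Report text similarity as a reason only when it is substantial.
    reasons = [k for k, v in parts.items()
               if (v >= 0.5 if k == "question_text" else v > 0)]
    return score, reasons

def rank_similar(query, candidates, limit=5):
    """Return up to `limit` (score, id, reasons) tuples, best match first."""
    scored = []
    for cand in candidates:
        if cand.get("id") == query.get("id"):
            continue  # never return the question itself
        result = score_candidate(query, cand)
        if result is not None:
            scored.append((result[0], cand["id"], result[1]))
    scored.sort(key=lambda t: (-t[0], t[1]))  # deterministic tie-break on id
    return scored[:limit]
```

Because the ranking is deterministic and returns its reasons, the UI can show why each question was retrieved, and results stay reproducible while the weights are tuned.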

Potential DB improvement:

- add `search_text` / `tsvector` on `paper_questions`
- later optionally add `embedding`

### B. Real paper / topic statistics

Goal:

- analytics should be fully trustworthy at subquestion level

Suggested improvements:

- topic frequency by `analytics_topic`
- question-format distribution by subquestion, not by top-level problem
- per-paper breakdown
- high-yield topic trend across years
- topic-to-question index page for drill mode
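
As a rough illustration of the first and third items, topic frequency can be counted per subquestion both overall and per paper. The `paper_id` key and the `"untagged"` bucket are assumptions for this sketch, not confirmed column names or values:

```python
from collections import Counter

def topic_frequency(questions):
    """Count subquestions per analytics_topic, overall and per paper."""
    overall = Counter()
    per_paper = {}
    for q in questions:
        topic = q.get("analytics_topic") or "untagged"
        overall[topic] += 1
        per_paper.setdefault(q["paper_id"], Counter())[topic] += 1
    return overall, per_paper
```

Feeding it subquestion rows rather than top-level problems is what keeps the counts aligned with the subquestion-level split described earlier.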

### C. LaTeX and content rendering cleanup

Goal:

- all math-heavy content should render legibly

Suggested work:

- centralize HTML + KaTeX normalization
- strip broken OCR artifacts before render
- make study-aid content generation avoid malformed formula formatting
- ensure grading feedback and solutions share the same renderer pipeline
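
A small sketch of what a centralized normalization step might do before content reaches KaTeX. It is written in Python for the backend side, though the same logic could live in the frontend renderer. The character map and function names are illustrative assumptions and cover only a few common OCR substitutions:

```python
import re

# A few Unicode symbols that OCR commonly emits in place of LaTeX commands.
UNICODE_TO_LATEX = {
    "×": r"\times ",
    "≤": r"\le ",
    "≥": r"\ge ",
    "≠": r"\ne ",
    "−": "-",  # U+2212 minus sign
}

def normalize_math(s):
    """Replace known Unicode operators and collapse runs of whitespace."""
    for ch, tex in UNICODE_TO_LATEX.items():
        s = s.replace(ch, tex)
    return re.sub(r"\s+", " ", s).strip()

def is_balanced(s):
    """True when unescaped braces nest correctly and `$` delimiters pair up."""
    stripped = re.sub(r"\\.", "", s)  # drop escaped characters such as \{ or \$
    depth = 0
    for ch in stripped:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0 and stripped.count("$") % 2 == 0

def safe_math(s):
    """Return the normalized string, or None if it looks too broken to render."""
    cleaned = normalize_math(s)
    return cleaned if is_balanced(cleaned) else None
```

Returning `None` for unbalanced input gives the caller a clear signal to fall back to plain text instead of dumping a broken formula into the UI.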

### D. User upload deduplication and library filtering

Goal:

- new uploads should not pollute the DB with duplicates

Suggested logic:

1. normalize upload metadata
2. compare against existing papers in same course:
   - year / term / exam_type / part_label
   - title similarity
   - extracted first-page markers
   - optional text fingerprint
3. if duplicate:
   - attach to existing paper or reject with explanation
4. if not duplicate:
   - create `user_upload`
   - process normally
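
The comparison in step 2 might be sketched as follows. The `fingerprint` and `title` keys, the helper names, and the `0.85` threshold are assumptions, and first-page marker extraction is left out. A title-only match is a weak signal and probably belongs in manual review rather than automatic rejection:

```python
import hashlib
from difflib import SequenceMatcher

META_KEYS = ("year", "term", "exam_type", "part_label")

def text_fingerprint(text):
    """SHA-256 of the case- and whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def find_duplicate(upload, existing_papers, title_threshold=0.85):
    """Return (paper, reason) for the first likely duplicate, else None."""
    for paper in existing_papers:
        if paper.get("course_code") != upload.get("course_code"):
            continue  # only compare within the same course
        fp = upload.get("fingerprint")
        if fp and fp == paper.get("fingerprint"):
            return paper, "fingerprint"
        # Require a known year so two papers with no metadata never "match".
        if upload.get("year") is not None and all(
            upload.get(k) == paper.get(k) for k in META_KEYS
        ):
            return paper, "metadata"
        t_up, t_lib = upload.get("title") or "", paper.get("title") or ""
        if t_up and t_lib:
            sim = SequenceMatcher(None, t_up.lower(), t_lib.lower()).ratio()
            if sim >= title_threshold:
                return paper, "title"
    return None
```

The returned reason can feed the "reject with explanation" path, so the user sees which existing paper their upload collided with and why.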

Likely schema additions later:

- content fingerprint field on `papers`
- upload provenance fields
- moderation / promotion state for community uploads

### E. UI / UX pass

Priority items:

- stronger question navigation for real papers
- clearer ready / processing / failed states
- better paper list and filtering UX
- richer workbench metadata:
  - topic
  - difficulty
  - format
  - score
  - answered / wrong / mastered state
- unify visual style across analytics, error book, workbench

## Suggested Development Order

1. Remove similar-question demo fallback and ship real retrieval
2. Improve analytics and topic drill views using subquestion-level data
3. Fix LaTeX / rendering quality
4. Build upload dedup / filtering against existing library papers
5. Do a focused UI / UX pass after the real data flows are stable

## Operational Notes

### Frontend entry issue that was fixed

Homepage was previously still using mock papers and an old hardcoded `COMP2211` id.
It now reads real papers from `listPapers()`.

### Manual content generation

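Most of the current `COMP2211` three-piece study aids (`knowledge_reminder`, `ai_hint`, `solution`) were filled through local scripts and deterministic templates, not through external LLM batch processing. This was a deliberate choice to keep the dataset stable during the rebuild, but the templated content is not production quality and still needs regeneration (see Known Issue 6).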

### If rebuilding papers again

For `COMP2211`, use the manual splitters rather than rerunning generic extraction blindly. `2024-spring-midterm` especially required reconstruction from PDF page spans because the earlier top-level extraction had already truncated `Problem 5` and `Problem 7`.

## Ready-to-Verify Checklist

If you want to sanity-check the current product quickly:

1. Open home page and filter `COMP2211`
2. Open each paper and confirm `status = ready`
3. Check question count matches:
   - `43 / 38 / 24 / 19 / 36 / 42 / 48`
4. Open analytics page for `COMP2211`
5. Open several papers and verify:
   - question nav loads
   - AI trio (`knowledge_reminder`, `ai_hint`, `solution`) exists
   - topics render
   - similar-question panel does not block the page
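
Step 3 can also be checked programmatically. The sketch below hardcodes the expected counts from the Current Status section, keyed by the canonical paper names. How `actual_counts` is fetched from the database is left open, and the function name is hypothetical:

```python
# Expected subquestion counts, copied from the Current Status section.
EXPECTED_COUNTS = {
    "COMP2211-2022-fall-midterm": 43,
    "COMP2211-2022-spring-midterm": 38,
    "COMP2211-2022-spring-final-part-a": 24,
    "COMP2211-2022-spring-final-part-b": 19,
    "COMP2211-2023-spring-midterm": 36,
    "COMP2211-2024-spring-midterm": 42,
    "COMP2211-2024-spring-final": 48,
}

def count_mismatches(actual_counts, expected=EXPECTED_COUNTS):
    """Return {paper: (expected, actual)} for every paper whose count differs.
    A paper missing from `actual_counts` is reported with actual = None."""
    return {
        paper: (n, actual_counts.get(paper))
        for paper, n in expected.items()
        if actual_counts.get(paper) != n
    }
```

An empty result means all seven papers match; the expected total across them is 250 subquestions.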