COMP2211 Handoff
Current Status
COMP2211 course-library papers are now fully loaded into Supabase and normalized to subquestion-level granularity.
Canonical papers currently in DB:
- COMP2211-2022-fall-midterm
- COMP2211-2022-spring-midterm
- COMP2211-2022-spring-final-part-a
- COMP2211-2022-spring-final-part-b
- COMP2211-2023-spring-midterm
- COMP2211-2024-spring-midterm
- COMP2211-2024-spring-final
All seven papers are:
- status = ready
- split to subquestion level
- tagged with analytics_topic, topic_primary, topic_tags, skill_tags
Question counts:
- 2022 fall midterm: 43
- 2022 spring midterm: 38
- 2022 spring final part A: 24
- 2022 spring final part B: 19
- 2023 spring midterm: 36
- 2024 spring midterm: 42
- 2024 spring final: 48
Key Files
Schema / SQL:
- 001_init_schema.sql
- 002_course_library_fields.sql
- 003_question_taxonomy_fields.sql
- 004_decouple_course_library_from_auth.sql
- 005_allow_long_question_format_alias.sql
- 006_make_scores_numeric.sql
Course-library seeds:
- comp2211_course_library_papers.sql
- comp2211_problem_taxonomy_backfill.sql
- comp2211_problem_level_questions.sql
Manual splitters used for final subquestion rebuild:
- split_comp2211_2022_spring_midterm.py
- split_comp2211_2022_spring_final_part_a.py
- split_comp2211_2022_spring_final_part_b.py
- split_comp2211_2023_spring_midterm.py
- split_comp2211_2024_spring_midterm.py
- split_comp2211_2024_spring_final.py
Deprecated filler script:
Audit / taxonomy references:
Frontend / backend areas already adapted to real taxonomy:
- frontend/src/pages/HomePage.tsx
- frontend/src/pages/AnalyticsPage.tsx
- frontend/src/components/workbench/SimilarHistoryPanel.tsx
- backend/app/routers/analytics.py
- backend/app/routers/questions.py
- backend/app/routers/attempts.py
Important Product / Data Decisions Already Made
Course library vs user upload
This is now separated semantically inside papers:
- source_kind = 'course_library' for platform-owned papers
- source_kind = 'user_upload' for user-contributed papers
Course-library papers no longer require user_id.
Taxonomy model
question_type is not the main analytics dimension.
Current intended usage:
- question_type / question_format: rendering and answer interaction
- analytics_topic: normalized analytics bucket
- topic_tags: multi-tag topical indexing
- skill_tags: finer-grained retrieval / grading / similarity support
Score field
Scores are NUMERIC, not integer, because many subquestions use fractional marks like 1.5.
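As a small illustration of why an integer column would be lossy here, the sketch below sums a few fractional marks with Python's decimal.Decimal (the type that drivers such as psycopg2 return for NUMERIC columns). The marks are made-up examples, not rows from the database.

```python
from decimal import Decimal

# Hypothetical subquestion marks; not taken from the real papers.
marks = [Decimal("1.5"), Decimal("2"), Decimal("0.5"), Decimal("1.5")]

# Exact total when marks are stored as NUMERIC.
total = sum(marks, Decimal("0"))

# What the total would be if each mark had been truncated to an integer.
truncated_total = sum(int(m) for m in marks)

print(total, truncated_total)
```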
Known Issues
1. Similar question retrieval is still not truly production-ready
Current state:
- backend route exists
- frontend panel exists
- demo fallback still exists in the UI when retrieval returns empty / fails
What needs to be done:
- remove demo fallback behavior once real retrieval is stable
- improve ranking beyond current basic topic/type matching
- ideally add indexed text retrieval, then embeddings if needed
Recommended order:
- build deterministic same-course retrieval first
- rank by analytics_topic, topic_tags, skill_tags, question_format, text similarity
- only then consider vector search
2. Analytics is real, but still not the final version
Current state:
- analytics already reads real DB data
- taxonomy fields are being used
Still missing:
- better topic normalization for edge cases
- per-paper and per-subtopic drill-down
- cleaner stats for mixed-format questions
- confidence around aggregated counts across all courses, not only COMP2211
3. LaTeX / math rendering is still fragile
Known symptoms:
- OCR / extracted math strings are noisy
- some generated HTML contains malformed or hard-to-read math fragments
- not all backend feedback is rendered with the same quality
What needs work:
- normalize math strings before rendering
- improve KaTeX preprocessing
- avoid dumping broken extracted formulas directly into UI
- ensure solution / feedback content is consistently rendered through the same component path
4. Presentation quality is still uneven
Data is now real, but UI still needs polish:
- question nav is still too weak for long real papers
- status / difficulty / topic chips can be clearer
- workbench hierarchy is inconsistent across question types
- some pages still read like an internal demo rather than a finished study product
5. User upload flow still lacks dedup / library filtering
This is the next big backend product task.
Desired logic:
- when user uploads a paper, compare against existing course-library papers
- if it is already covered, do not create a duplicate paper
- if it is new, ingest it as user_upload
- if high quality and non-duplicate, optionally promote into library workflow later
6. Most non-Spring-2024 study aids are contaminated by template filler content
Current state:
- COMP2211-2022-fall-midterm has question-level LLM-authored study aids
- COMP2211-2024-spring-midterm is the intended quality bar
- the remaining papers were backfilled with a deprecated template script and should not be treated as production-quality AI content
Impact:
- knowledge_reminder is often generic topic boilerplate
- ai_hint often points to a parent problem header instead of the actual subquestion
- solution is often just wrapped reference text, not a true worked solution
Required action:
- detect and clear templated study aids from affected papers
- regenerate them through the real LLM path in paper_processor.py
- review output quality before marking the papers as complete
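One possible way to detect templated study aids is to look for text that repeats verbatim across many questions in the same paper. The sketch below is only a heuristic under that assumption: the function name, the min_repeats threshold, and the dict shape of a question row are all hypothetical, and it does not reflect how the deprecated script actually produced its output.

```python
from collections import Counter


def find_templated(questions, field="knowledge_reminder", min_repeats=3):
    """Return ids of questions whose `field` text repeats at least min_repeats times.

    Assumes each question is a dict with an "id" key plus the study-aid fields.
    """
    # Normalize case and whitespace so trivially different copies still match.
    normalized = [" ".join((q.get(field) or "").lower().split()) for q in questions]
    counts = Counter(text for text in normalized if text)
    return [
        q["id"]
        for q, text in zip(questions, normalized)
        if text and counts[text] >= min_repeats
    ]
```

Flagged ids would still need a manual look before clearing anything, since short legitimate reminders can also repeat.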
Next Major Workstreams
A. Real similar-question retrieval
Goal:
- no demo fallback
- same-course retrieval that feels trustworthy
Suggested implementation:
- add a richer retrieval score in questions.py
- use:
  - same course_code
  - same analytics_topic
  - overlapping topic_tags
  - overlapping skill_tags
  - same or compatible question_format
  - lexical similarity on question_text
- expose match reasons in response if useful
- update UI to show why a question was retrieved
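The signals listed above could be combined roughly as follows. This is a minimal sketch, not the existing code in questions.py: the Question class, the weights, and the retrieval_score name are assumptions for illustration, "compatible" formats are simplified to an exact match, and difflib stands in for whatever lexical similarity is eventually chosen.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Question:
    # Fields mirror the taxonomy columns; the class itself is hypothetical.
    course_code: str
    analytics_topic: str
    question_format: str
    question_text: str
    topic_tags: set = field(default_factory=set)
    skill_tags: set = field(default_factory=set)


# Illustrative weights only; they would need tuning on real papers.
WEIGHTS = {"topic": 3.0, "topic_tags": 2.0, "skill_tags": 2.0, "format": 1.0, "text": 2.0}


def jaccard(a: set, b: set) -> float:
    """Overlap of two tag sets in [0, 1]; 0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieval_score(query: Question, candidate: Question):
    """Return (score, match_reasons). Candidates from another course score 0."""
    if query.course_code != candidate.course_code:
        return 0.0, []
    total = 0.0
    reasons = []
    if query.analytics_topic == candidate.analytics_topic:
        total += WEIGHTS["topic"]
        reasons.append("same analytics_topic")
    for name in ("topic_tags", "skill_tags"):
        overlap = jaccard(getattr(query, name), getattr(candidate, name))
        if overlap > 0:
            total += WEIGHTS[name] * overlap
            reasons.append(f"overlapping {name}")
    if query.question_format == candidate.question_format:
        total += WEIGHTS["format"]
        reasons.append("same question_format")
    # Character-level similarity as a stand-in for real lexical retrieval.
    text_sim = SequenceMatcher(
        None, query.question_text.lower(), candidate.question_text.lower()
    ).ratio()
    total += WEIGHTS["text"] * text_sim
    return total, reasons
```

Returning the reasons alongside the score keeps the ranking deterministic and gives the UI something concrete to display for each retrieved question.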
Potential DB improvement:
- add search_text / tsvector on paper_questions
- later optionally add embedding
B. Real paper / topic statistics
Goal:
- analytics should be fully trustworthy at subquestion level
Suggested improvements:
- topic frequency by analytics_topic
- question-format distribution by subquestion, not by top-level problem
- per-paper breakdown
- high-yield topic trend across years
- topic-to-question index page for drill mode
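The first three aggregations above can be sketched over subquestion rows as below. The row shape (dicts with "paper", "analytics_topic", and "question_format" keys) is an assumption for illustration; the real rows would come from paper_questions and may be shaped differently.

```python
from collections import Counter, defaultdict


def topic_frequency(rows):
    """Count subquestions per analytics_topic."""
    return Counter(row["analytics_topic"] for row in rows)


def format_distribution(rows):
    """Count subquestions per question_format (one row per subquestion)."""
    return Counter(row["question_format"] for row in rows)


def per_paper_breakdown(rows):
    """Map each paper to its own analytics_topic counts."""
    breakdown = defaultdict(Counter)
    for row in rows:
        breakdown[row["paper"]][row["analytics_topic"]] += 1
    return {paper: dict(counts) for paper, counts in breakdown.items()}
```

Because each row is one subquestion, a problem with five parts contributes five counts rather than one, which is the behaviour the format-distribution item asks for.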
C. LaTeX and content rendering cleanup
Goal:
- all math-heavy content should render legibly
Suggested work:
- centralize HTML + KaTeX normalization
- strip broken OCR artifacts before render
- make study-aid content generation avoid malformed formula formatting
- ensure grading feedback and solutions share the same renderer pipeline
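A centralized normalization step might start with something like the sketch below: clean up whitespace, then decide whether a string is safe to hand to KaTeX or should fall back to plain text. The function names and the two checks (balanced braces, even number of unescaped $ delimiters) are assumptions chosen for illustration; real OCR noise will need more rules, and escaped braces such as \{ are not handled here.

```python
import re


def braces_balanced(text: str) -> bool:
    """True if every { has a matching } in the right order."""
    depth = 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def normalize_math(raw: str):
    """Return (cleaned_text, renderable).

    renderable is False when the string looks broken enough that the UI
    should show it as plain text instead of passing it to the math renderer.
    """
    # Replace non-breaking spaces and collapse whitespace runs.
    cleaned = re.sub(r"\s+", " ", raw.replace("\u00a0", " ")).strip()
    # An odd number of unescaped $ means a math span was never closed.
    dollar_count = len(re.findall(r"(?<!\\)\$", cleaned))
    renderable = braces_balanced(cleaned) and dollar_count % 2 == 0
    return cleaned, renderable
```

Running solutions, hints, and grading feedback through the same function before render is one way to get the shared pipeline described above.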
D. User upload deduplication and library filtering
Goal:
- new uploads should not pollute the DB with duplicates
Suggested logic:
- normalize upload metadata
- compare against existing papers in same course:
  - year / term / exam_type / part_label
  - title similarity
  - extracted first-page markers
  - optional text fingerprint
- if duplicate:
  - attach to existing paper or reject with explanation
- if not duplicate:
  - create user_upload
  - process normally
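The comparison step could look roughly like this. It is a sketch under stated assumptions: papers are plain dicts, the "title" and "text" keys and the 0.9 title threshold are hypothetical, first-page markers are left out, and the fingerprint is a simple SHA-256 over normalized text rather than anything fuzzy.

```python
import hashlib
import re
from difflib import SequenceMatcher

SLOT_FIELDS = ("course_code", "year", "term", "exam_type", "part_label")


def metadata_key(paper: dict) -> tuple:
    """Normalized (course, year, term, exam_type, part_label) tuple."""
    return tuple(str(paper.get(k, "")).strip().lower() for k in SLOT_FIELDS)


def text_fingerprint(text: str) -> str:
    """Hash of the text with case, punctuation, and spacing normalized away."""
    normalized = re.sub(r"\W+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicate(upload: dict, library: list, title_threshold: float = 0.9):
    """Return the matching library paper, or None if the upload looks new."""
    upload_course = str(upload.get("course_code", "")).strip().lower()
    upload_fp = text_fingerprint(upload.get("text", ""))
    for paper in library:
        if str(paper.get("course_code", "")).strip().lower() != upload_course:
            continue  # only compare within the same course
        # Identical normalized text is treated as a duplicate outright.
        if upload.get("text") and text_fingerprint(paper.get("text", "")) == upload_fp:
            return paper
        # Otherwise require the same exam slot plus a near-identical title.
        title_sim = SequenceMatcher(
            None, paper.get("title", "").lower(), upload.get("title", "").lower()
        ).ratio()
        if metadata_key(paper) == metadata_key(upload) and title_sim >= title_threshold:
            return paper
    return None
```

A non-None result would feed the "attach or reject" branch; None would proceed to creating a user_upload paper.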
Likely schema additions later:
- content fingerprint field on papers
- upload provenance fields
- moderation / promotion state for community uploads
E. UI / UX pass
Priority items:
- stronger question navigation for real papers
- clearer ready / processing / failed states
- better paper list and filtering UX
- richer workbench metadata:
  - topic
  - difficulty
  - format
  - score
  - answered / wrong / mastered state
- unify visual style across analytics, error book, workbench
Suggested Development Order
- Remove similar-question demo fallback and ship real retrieval
- Improve analytics and topic drill views using subquestion-level data
- Fix LaTeX / rendering quality
- Build upload dedup / filtering against existing library papers
- Do a focused UI / UX pass after the real data flows are stable
Operational Notes
Frontend entry issue that was fixed
Homepage was previously still using mock papers and an old hardcoded COMP2211 id.
It now reads real papers from listPapers().
Manual content generation
Most of the current COMP2211 three-piece study aids were filled manually through local scripts and deterministic templates, not through external LLM batch processing. This was deliberate and keeps the current dataset stable, but the templated content is not production quality (see Known Issue 6).
If rebuilding papers again
For COMP2211, use the manual splitters rather than rerunning generic extraction blindly. 2024-spring-midterm especially required reconstruction from PDF page spans because the earlier top-level extraction had already truncated Problem 5 and Problem 7.
Ready-to-Verify Checklist
If you want to sanity-check the current product quickly:
- Open home page and filter COMP2211
- Open each paper and confirm status = ready
- Check question count matches: 43 / 38 / 24 / 19 / 36 / 42 / 48
- Open analytics page for COMP2211
- Open several papers and verify:
  - question nav loads
  - AI trio exists
  - topics render
  - similar-question panel does not block the page
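The question-count check can also be scripted. The sketch below hardcodes the expected counts from this document and compares them against whatever counts you pull from the DB; how those actual counts are fetched is left out, and the count_mismatches helper is hypothetical.

```python
from typing import Dict, Optional, Tuple

# Expected subquestion counts, copied from the Current Status section.
EXPECTED_COUNTS = {
    "COMP2211-2022-fall-midterm": 43,
    "COMP2211-2022-spring-midterm": 38,
    "COMP2211-2022-spring-final-part-a": 24,
    "COMP2211-2022-spring-final-part-b": 19,
    "COMP2211-2023-spring-midterm": 36,
    "COMP2211-2024-spring-midterm": 42,
    "COMP2211-2024-spring-final": 48,
}


def count_mismatches(actual: Dict[str, int]) -> Dict[str, Tuple[int, Optional[int]]]:
    """Return {paper: (expected, found)} for every paper whose count differs.

    found is None when the paper is missing from `actual` entirely.
    """
    mismatches = {}
    for paper, expected in EXPECTED_COUNTS.items():
        found = actual.get(paper)
        if found != expected:
            mismatches[paper] = (expected, found)
    return mismatches
```

An empty result means all seven papers match; anything else names the paper and shows the expected and observed counts side by side.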