Zhao 7a09167261 Initial commit: PastPaper Master full stack

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-21 12:27:47 +07:00

12 KiB

COMP2211 Handoff

Current Status

COMP2211 course-library papers are now fully loaded into Supabase and normalized to subquestion-level granularity.

Canonical papers currently in DB:

COMP2211-2022-fall-midterm
COMP2211-2022-spring-midterm
COMP2211-2022-spring-final-part-a
COMP2211-2022-spring-final-part-b
COMP2211-2023-spring-midterm
COMP2211-2024-spring-midterm
COMP2211-2024-spring-final

All seven papers are:

status = ready
split to subquestion level
tagged with analytics_topic, topic_primary, topic_tags, skill_tags

Question counts:

2022 fall midterm: 43
2022 spring midterm: 38
2022 spring final part A: 24
2022 spring final part B: 19
2023 spring midterm: 36
2024 spring midterm: 42
2024 spring final: 48

Key Files

Schema / SQL:

Course-library seeds:

Manual splitters used for final subquestion rebuild:

Deprecated filler script:

fill_manual_study_aids.py

Audit / taxonomy references:

Frontend / backend areas already adapted to real taxonomy:

Important Product / Data Decisions Already Made

Course library vs user upload

This is now separated semantically inside papers:

source_kind = 'course_library' for platform-owned papers
source_kind = 'user_upload' for user-contributed papers

Course-library papers no longer require user_id.

Taxonomy model

question_type is not the main analytics dimension.

Current intended usage:

question_type / question_format: rendering and answer interaction
analytics_topic: normalized analytics bucket
topic_tags: multi-tag topical indexing
skill_tags: finer-grained retrieval / grading / similarity support

Score field

Scores are NUMERIC, not integer, because many subquestions use fractional marks like 1.5.
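A small illustration of why an integer score column would lose information. This is only a sketch: it uses Python's `decimal` module to mirror a NUMERIC column, and the mark values are invented for the example rather than taken from any paper.

```python
from decimal import Decimal

# Hypothetical subquestion marks for one problem; values are illustrative only.
subquestion_scores = [Decimal("1.5"), Decimal("2"), Decimal("0.5"), Decimal("1.5")]

# Exact decimal arithmetic keeps fractional marks intact.
total = sum(subquestion_scores, Decimal("0"))

# Casting each mark to int first silently drops the fractional part.
truncated_total = sum(int(score) for score in subquestion_scores)

print(total)            # 5.5
print(truncated_total)  # 4
```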

Known Issues

1. Similar question retrieval is still not truly production-ready

Current state:

backend route exists
frontend panel exists
demo fallback still exists in the UI when retrieval returns empty / fails

What needs to be done:

remove demo fallback behavior once real retrieval is stable
improve ranking beyond current basic topic/type matching
ideally add indexed text retrieval, then embeddings if needed

Recommended order:

build deterministic same-course retrieval first
rank by analytics_topic, topic_tags, skill_tags, question_format, text similarity
only then consider vector search

2. Analytics is real, but still not the final version

Current state:

analytics already reads real DB data
taxonomy fields are being used

Still missing:

better topic normalization for edge cases
per-paper and per-subtopic drill-down
cleaner stats for mixed-format questions
confidence in aggregated counts across all courses, not only COMP2211

3. LaTeX / math rendering is still fragile

Known symptoms:

OCR / extracted math strings are noisy
some generated HTML contains malformed or hard-to-read math fragments
not all backend feedback is rendered with the same quality

What needs work:

normalize math strings before rendering
improve KaTeX preprocessing
avoid dumping broken extracted formulas directly into UI
ensure solution / feedback content is consistently rendered through the same component path

4. Presentation quality is still uneven

Data is now real, but UI still needs polish:

question nav is still too weak for long real papers
status / difficulty / topic chips can be clearer
workbench hierarchy is inconsistent across question types
some pages still read like an internal demo rather than a finished study product

5. User upload flow still lacks dedup / library filtering

This is the next big backend product task.

Desired logic:

when user uploads a paper, compare against existing course-library papers
if it is already covered, do not create a duplicate paper
if it is new, ingest it as user_upload
if high quality and non-duplicate, optionally promote into library workflow later

6. Most non-Spring-2024 study aids are contaminated by template filler content

Current state:

COMP2211-2022-fall-midterm has question-level LLM-authored study aids
COMP2211-2024-spring-midterm is the intended quality bar
the remaining papers were backfilled with a deprecated template script and should not be treated as production-quality AI content

Impact:

knowledge_reminder is often generic topic boilerplate
ai_hint often points to a parent problem header instead of the actual subquestion
solution is often just wrapped reference text, not a true worked solution

Required action:

detect and clear templated study aids from affected papers
regenerate them through the real LLM path in paper_processor.py
review output quality before marking the papers as complete
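One possible way to approach the detection step, sketched under assumptions: because the filler knowledge_reminder text is described as generic topic boilerplate, the same text is likely to appear verbatim on several questions. The function name, the `id` key, and the `min_repeats` threshold below are hypothetical, and this heuristic would not catch the ai_hint or solution symptoms.

```python
from collections import defaultdict


def find_templated_reminders(questions, min_repeats=3):
    """Return sorted ids of questions whose knowledge_reminder text is
    shared verbatim (ignoring case and whitespace) by at least
    min_repeats questions. Heuristic only; not the project's method."""
    ids_by_text = defaultdict(list)
    for question in questions:
        raw = question.get("knowledge_reminder") or ""
        normalized = " ".join(raw.split()).lower()
        if normalized:
            ids_by_text[normalized].append(question["id"])

    flagged = []
    for ids in ids_by_text.values():
        if len(ids) >= min_repeats:
            flagged.extend(ids)
    return sorted(flagged)


# Illustrative rows; the reminder strings are made up.
sample_questions = [
    {"id": 1, "knowledge_reminder": "Review the core ideas of this topic."},
    {"id": 2, "knowledge_reminder": "Review the core  ideas of this topic."},
    {"id": 3, "knowledge_reminder": "review the core ideas of this topic."},
    {"id": 4, "knowledge_reminder": "A reminder written for this subquestion."},
]
flagged_ids = find_templated_reminders(sample_questions)
```

Flagged ids are candidates for manual review before clearing, not a definitive list.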

Next Major Workstreams

A. Real similar-question retrieval

Goal:

no demo fallback
same-course retrieval that feels trustworthy

Suggested implementation:

add a richer retrieval score in questions.py
use:
- same course_code
- same analytics_topic
- overlapping topic_tags
- overlapping skill_tags
- same or compatible question_format
- lexical similarity on question_text
expose match reasons in response if useful
update UI to show why a question was retrieved
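The scoring signals listed above could be combined roughly as follows. This is a sketch under assumptions, not the current contents of questions.py: the weights, the helper names, and the dict-shaped rows are hypothetical, "compatible" formats are reduced to exact equality, and lexical similarity is approximated with Jaccard overlap of word tokens.

```python
import re

# Hypothetical weights; these would need tuning against real results.
WEIGHTS = {
    "analytics_topic": 3.0,
    "topic_tags": 2.0,
    "skill_tags": 2.0,
    "question_format": 1.0,
    "text": 2.0,
}


def jaccard(a, b):
    """Jaccard overlap of two collections; 0.0 if either is empty."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a and b else 0.0


def tokens(text):
    """Lowercased alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", (text or "").lower())


def score_candidate(query, candidate):
    """Return (score, reasons). Candidates from other courses score 0."""
    if query["course_code"] != candidate["course_code"]:
        return 0.0, []
    score, reasons = 0.0, []
    if query.get("analytics_topic") == candidate.get("analytics_topic"):
        score += WEIGHTS["analytics_topic"]
        reasons.append("same analytics_topic")
    for field in ("topic_tags", "skill_tags"):
        overlap = jaccard(query.get(field) or [], candidate.get(field) or [])
        if overlap > 0:
            score += WEIGHTS[field] * overlap
            reasons.append(f"overlapping {field}")
    fmt = query.get("question_format")
    if fmt and fmt == candidate.get("question_format"):
        score += WEIGHTS["question_format"]
        reasons.append("same question_format")
    lexical = jaccard(tokens(query.get("question_text")),
                      tokens(candidate.get("question_text")))
    if lexical > 0:
        score += WEIGHTS["text"] * lexical
        reasons.append("similar question_text")
    return score, reasons


# Illustrative rows; topic, tag, and format values are placeholders.
query = {
    "course_code": "COMP2211",
    "analytics_topic": "knn",
    "topic_tags": ["knn", "distance"],
    "skill_tags": ["computation"],
    "question_format": "short_answer",
    "question_text": "Compute the Euclidean distance between the two points.",
}
candidate = dict(
    query,
    topic_tags=["knn"],
    question_text="Compute the Manhattan distance between the two points.",
)
score, reasons = score_candidate(query, candidate)
```

Returning the reasons alongside the score keeps the ranking deterministic and gives the UI something concrete to display for each match.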

Potential DB improvement:

add search_text / tsvector on paper_questions
later optionally add embedding

B. Real paper / topic statistics

Goal:

analytics should be fully trustworthy at subquestion level

Suggested improvements:

topic frequency by analytics_topic
question-format distribution by subquestion, not by top-level problem
per-paper breakdown
high-yield topic trend across years
topic-to-question index page for drill mode
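As a rough sketch of the first few items, counting at subquestion level could look like this. It assumes each subquestion row is available as a dict; the `paper_id` key and the sample topic and format values are placeholders, not real taxonomy entries.

```python
from collections import Counter, defaultdict


def topic_frequency(subquestions):
    """Count subquestions per analytics_topic."""
    return Counter(q["analytics_topic"] for q in subquestions)


def format_distribution(subquestions):
    """Count subquestions per question_format (not per top-level problem)."""
    return Counter(q["question_format"] for q in subquestions)


def topic_frequency_by_paper(subquestions):
    """Per-paper breakdown: {paper_id: {analytics_topic: count}}."""
    per_paper = defaultdict(Counter)
    for q in subquestions:
        per_paper[q["paper_id"]][q["analytics_topic"]] += 1
    return {paper_id: dict(counts) for paper_id, counts in per_paper.items()}


# Illustrative rows only.
rows = [
    {"paper_id": "COMP2211-2024-spring-final", "analytics_topic": "knn", "question_format": "mcq"},
    {"paper_id": "COMP2211-2024-spring-final", "analytics_topic": "knn", "question_format": "short_answer"},
    {"paper_id": "COMP2211-2023-spring-midterm", "analytics_topic": "perceptron", "question_format": "mcq"},
]
overall = topic_frequency(rows)
formats = format_distribution(rows)
breakdown = topic_frequency_by_paper(rows)
```

In practice the same aggregation would more likely be pushed into a database query; the Python version only shows the intended grouping.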

C. LaTeX and content rendering cleanup

Goal:

all math-heavy content should render legibly

Suggested work:

centralize HTML + KaTeX normalization
strip broken OCR artifacts before render
make study-aid content generation avoid malformed formula formatting
ensure grading feedback and solutions share the same renderer pipeline
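A minimal sketch of what a centralized pre-render check might do, assuming math is delimited with `$` and that whitespace cleanup plus a delimiter balance check is a useful first filter. The function names are hypothetical, the escape handling is deliberately simple, and where this belongs (backend or frontend) is not decided here.

```python
import re


def braces_balanced(text):
    """True if unescaped { and } are balanced and never close early."""
    depth = 0
    prev = ""
    for ch in text:
        if prev != "\\":
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth < 0:
                    return False
        prev = ch
    return depth == 0


def normalize_math(raw):
    """Collapse whitespace and report whether the string looks safe to
    hand to the math renderer. Returns (cleaned_text, is_renderable)."""
    text = re.sub(r"\s+", " ", raw).strip()
    unescaped_dollars = len(re.findall(r"(?<!\\)\$", text))
    is_renderable = unescaped_dollars % 2 == 0 and braces_balanced(text)
    return text, is_renderable


clean_ok = normalize_math("$\\frac{1}{2}$")
clean_bad = normalize_math("x  =\n $a_{i$")
```

When `is_renderable` is false, the UI could fall back to showing the cleaned string as plain text instead of passing a broken formula to the renderer.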

D. User upload deduplication and library filtering

Goal:

new uploads should not pollute the DB with duplicates

Suggested logic:

normalize upload metadata
compare against existing papers in same course:
- year / term / exam_type / part_label
- title similarity
- extracted first-page markers
- optional text fingerprint
if duplicate:
- attach to existing paper or reject with explanation
if not duplicate:
- create user_upload
- process normally
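The comparison above might be sketched like this, under assumptions: papers are dicts, `first_page_text` is a hypothetical field holding extracted first-page text, and the fingerprint is a SHA-256 of normalized text. Title similarity is left out, and an exact hash will miss near-duplicates with OCR noise.

```python
import hashlib
import re

KEY_FIELDS = ("course_code", "year", "term", "exam_type", "part_label")


def paper_key(meta):
    """Normalized metadata tuple used for exact-match comparison."""
    return tuple(str(meta.get(f) or "").strip().lower() for f in KEY_FIELDS)


def text_fingerprint(text):
    """SHA-256 of lowercased text with punctuation and spacing collapsed."""
    normalized = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def classify_upload(upload, library_papers):
    """Return ("duplicate", paper_id) or ("user_upload", None)."""
    upload_key = paper_key(upload)
    upload_text = (upload.get("first_page_text") or "").strip()
    upload_fp = text_fingerprint(upload_text) if upload_text else None
    for paper in library_papers:
        key = paper_key(paper)
        if key[0] != upload_key[0]:
            continue  # only compare within the same course
        if key == upload_key:
            return "duplicate", paper["id"]
        paper_text = (paper.get("first_page_text") or "").strip()
        if upload_fp and paper_text and text_fingerprint(paper_text) == upload_fp:
            return "duplicate", paper["id"]
    return "user_upload", None


# Illustrative library row and uploads.
library = [{
    "id": "COMP2211-2024-spring-final",
    "course_code": "COMP2211", "year": 2024, "term": "spring",
    "exam_type": "final", "part_label": None,
}]
same_paper = {"course_code": "comp2211", "year": "2024", "term": "Spring ",
              "exam_type": "Final", "part_label": ""}
new_paper = dict(same_paper, year="2025")
dup_result = classify_upload(same_paper, library)
new_result = classify_upload(new_paper, library)
```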

Likely schema additions later:

content fingerprint field on papers
upload provenance fields
moderation / promotion state for community uploads

E. UI / UX pass

Priority items:

stronger question navigation for real papers
clearer ready / processing / failed states
better paper list and filtering UX
richer workbench metadata:
- topic
- difficulty
- format
- score
- answered / wrong / mastered state
unify visual style across analytics, error book, workbench

Suggested Development Order

1. Remove similar-question demo fallback and ship real retrieval
2. Improve analytics and topic drill views using subquestion-level data
3. Fix LaTeX / rendering quality
4. Build upload dedup / filtering against existing library papers
5. Do a focused UI / UX pass after the real data flows are stable

Operational Notes

Frontend entry issue that was fixed

The homepage previously used mock papers and an old hardcoded COMP2211 id. It now reads real papers from listPapers().

Manual content generation

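Most of the current COMP2211 three-piece study aids (knowledge_reminder, ai_hint, solution) were filled through local scripts and deterministic templates, not through external LLM batch processing. This was deliberate and kept the dataset stable during the rebuild, but the templated content is not production quality; see Known Issue 6 for the required cleanup and regeneration.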

If rebuilding papers again

For COMP2211, use the manual splitters rather than rerunning generic extraction blindly. 2024-spring-midterm especially required reconstruction from PDF page spans because the earlier top-level extraction had already truncated Problem 5 and Problem 7.

Ready-to-Verify Checklist

If you want to sanity-check the current product quickly:

Open home page and filter COMP2211
Open each paper and confirm status = ready
Check question count matches:
- 43 / 38 / 24 / 19 / 36 / 42 / 48
Open analytics page for COMP2211
Open several papers and verify:
- question nav loads
- AI trio exists
- topics render
- similar-question panel does not block the page
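The question-count check can also be scripted. The sketch below hardcodes the expected counts from this document and compares them with a dict of actual counts; how those actual counts are fetched (Supabase query, API call) is left open, and the function name is hypothetical.

```python
# Expected subquestion counts, taken from the Current Status section.
EXPECTED_COUNTS = {
    "COMP2211-2022-fall-midterm": 43,
    "COMP2211-2022-spring-midterm": 38,
    "COMP2211-2022-spring-final-part-a": 24,
    "COMP2211-2022-spring-final-part-b": 19,
    "COMP2211-2023-spring-midterm": 36,
    "COMP2211-2024-spring-midterm": 42,
    "COMP2211-2024-spring-final": 48,
}


def count_mismatches(actual_counts):
    """Return {paper_id: (expected, actual)} for every paper whose actual
    count differs from the expected one. Missing papers report None."""
    return {
        paper_id: (expected, actual_counts.get(paper_id))
        for paper_id, expected in EXPECTED_COUNTS.items()
        if actual_counts.get(paper_id) != expected
    }


expected_total = sum(EXPECTED_COUNTS.values())
print(expected_total)  # 250
```

An empty result from `count_mismatches` means all seven papers match.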
