Files

Zhao 7a09167261 Initial commit: PastPaper Master full stack

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-21 12:27:47 +07:00

16 KiB

Raw Permalink Blame History

KnowIt Technical Defense Q&A

Part 1: AI & Product Technical Questions

Q: How does your AI analyze past papers? What model do you use?

We use a multi-model pipeline with clear separation of concerns:

Vision extraction (Gemini 2.5 Flash): We render each PDF page to a 96 DPI PNG image using PyMuPDF, encode it as base64, and send it to Gemini's vision API via OpenAI-compatible endpoint. The model extracts every question into structured JSON — question number, type, text, options, score, topics, difficulty. We process in batches of 8 pages to stay within token limits.
Solution generation (DeepSeek V3): For each extracted question, we generate three components — a knowledge reminder, a progressive hint, and a complete step-by-step solution. This is done in batches of 3 questions per API call to balance throughput and quality.
Answer matching (Gemini Vision): If an answer PDF is provided, we send its pages to the vision model and match answers to corresponding questions by question number.

The key architectural decision is splitting vision and text tasks across different models — Gemini for anything that needs to "see" the document, DeepSeek for pure text reasoning. This cuts cost by ~60% compared to using a single vision model for everything.

Q: Why vision mode instead of traditional PDF text extraction?

We started with pdfplumber-based text extraction and hit critical failures:

Multi-line code blocks break apart: C = np.array([[[0,1,2,3], gets separated from its closing brackets across lines
Mathematical notation is lost or garbled
Table structures collapse into unreadable strings
Mixed formatting (code + text + formulas on the same page) confuses parsers

Vision mode sends the raw page image — the model sees exactly what a student sees. On our COMP2211 benchmark (Python-heavy, lots of NumPy arrays), vision mode correctly extracts 100% of questions vs ~70% with text extraction.

Q: How accurate is the AI on code questions?

For code output questions (e.g., "What does print(A[2:-2:3]) output?"), we don't rely on the LLM to calculate. We run Python exec() on the actual code:

ns = {"np": np}
exec(extract_code_lines(question_text), ns)  # setup variables
output = exec("print(A[2:-2:3])", ns)         # capture stdout
# output = "[ 7 10]" — ground truth, fed to AI as reference

We maintain a shared namespace per question group so variables defined in a parent question (e.g., A = np.arange(5,15)) are available to all sub-questions. This gives us 100% accuracy on Python output questions.

Q: How does auto-grading work technically?

Three-step pipeline, all non-blocking (run in thread pool via asyncio.to_thread):

Gemini Vision OCR: Student's photo → base64 → Gemini with OCR prompt → extracted text with LaTeX formulas preserved ( $\mu = 8.03$ )

DeepSeek Grading: We inject the question text, reference answer (from DB), and OCR'd student answer into a structured prompt. The model returns JSON:

{
  "is_correct": false,
  "score_given": 2,
  "feedback": "<HTML with KaTeX>Step 1 correct, but in Step 3...",
  "error_at_step": 3
}

Persistence: Result stored in user_attempts table with user_id, question_id, feedback, photo_url. Wrong answers auto-added to error book. Frontend loads historical results on page load via GET /api/attempts/by-paper/{paper_id}.

Grading runs in a thread pool (asyncio.to_thread) so it never blocks the main event loop — other users can browse papers, load questions, etc. while grading runs.

Q: How does the similar question retrieval work?

Multi-signal similarity scoring with caching:

Topic normalization: We maintain an alias map (e.g., "Numpy"/"NumPy" → "NumPy", "Naïve Bayes"/"Naive Bayes Classifier" → "Naive Bayes") — about 80 aliases covering common variations.
Candidate filtering: Pre-filter by analytics_topic in PostgreSQL (cuts candidates from ~250 to ~30 for a given course).
Scoring (up to 100 points):
- Topic overlap: up to 40 pts (exact analytics_topic match = 30, shared topic_tags = 10)
- Question type match: 15 pts
- Difficulty match: 10 pts
- PostgreSQL ts_rank_cd full-text similarity: up to 20 pts
- Same parent question structure: 5 pts
Caching: First computation is stored in similar_questions JSONB column on the question row. Subsequent loads are instant. In-memory cache with 5-minute TTL for hot questions.
Deduplication: Only the best-matching question per paper is shown (avoid showing 5 questions from the same exam).

Q: What's the processing pipeline architecture? How do you handle failures?

Checkpoint-based processing with auto-resume:

The pipeline has 5 stages, each checkpointed to the database:

Stage	What happens	Checkpoint
1. Render	PDF → PNG images (96 DPI)	In memory only
2. Extract	Vision API → structured questions	Progress bar updated per batch
3. Match answers	Answer PDF → question mapping	Optional, failure skipped
4. Save questions	Write all questions to DB	Each question persisted immediately
5. AI trio	Generate solutions per question	Each solution written individually

If the server crashes at stage 5 (say, 15/35 solutions generated), on restart:

lifespan startup hook detects papers with status=processing
Checks paper_questions table — finds 35 questions, 15 with solutions
Calls _resume_ai_trio() which only processes the 20 missing ones
Marks paper as ready when done

The processing runs in a daemon thread with its own event loop (threading.Thread + asyncio.run), completely isolated from the FastAPI server.

Q: What's your RAG pipeline for the AI Tutor?

We use LangChain with a vector database (SurrealDB) to index three content types:

Lecture recordings: Downloaded from Canvas → FFmpeg extracts audio → Whisper transcribes with timestamps → chunked into ~500 token segments with navigation markers
Courseware PDFs/PPTs: Extracted and chunked with metadata (course code, topic, page)
Past paper content: Question text + solutions indexed with topic tags

Retrieval flow: Student query → embedding → top-K vector search → retrieved chunks + question context → LLM generates grounded answer with source citations.

Q: How do you handle API costs?

Cost optimization at every layer:

Strategy	Savings
Model splitting (Gemini vision + DeepSeek text)	~60% vs single model
96 DPI rendering (down from 120)	~26% fewer tokens per page
8-page batches for vision	Fewer API calls
Answer matching failure = skip, not retry forever	Prevents cost runaway
`similar_questions` cached in DB column	One-time compute per question
DeepSeek at $0.28/M input vs Gemini at $0.15/M + vision overhead	Text tasks 2-3x cheaper

Per-paper cost: ~$1-2 USD for full processing (extraction + answer matching + 40 solutions). Per-grading: ~$0.02 (one vision OCR + one text grading call).

Q: What's your tech stack?

Layer	Technology	Why
Frontend	React 18 + Vite + TypeScript	Fast SPA, hot reload
Backend	FastAPI (Python 3.12, async)	Native async, OpenAPI docs
Database	PostgreSQL via Supabase	Relational + Auth + Storage + RLS
Vector DB	SurrealDB	RAG retrieval for AI Tutor
Cache	Redis	Session cache, rate limiting
Vision AI	Gemini 2.5 Flash (Google official API)	Best vision quality, free tier
Text AI	DeepSeek V3 (deepseek-chat)	Cheapest frontier model, no rate limits
PDF Rendering	PyMuPDF (fitz)	Fast, accurate page-to-image
Code Execution	Python exec() with sandboxed namespace	Ground-truth for code output questions
Math Rendering	KaTeX (client-side)	Fast LaTeX rendering, no server round-trip
Transcription	Whisper + FFmpeg	Lecture recording → text
Deployment	Docker + OpenResty + Let's Encrypt	Single server, HTTPS, reverse proxy
Hosting	Tencent Cloud Singapore (2C4G)	Low latency to HK, Gemini API accessible

Q: How do you handle concurrent uploads?

Upload endpoint reads file bytes, creates DB record (status: processing), returns paper ID immediately (~200ms response)
Processing spawns in a daemon thread with its own asyncio event loop — completely isolated from the FastAPI server
Frontend polls GET /api/papers/mine every 4 seconds, shows real-time progress bar ("Reading pages 1-8...", "Generating solutions 12/35 questions")
Multiple papers can process simultaneously (each in its own thread)
Server stays responsive for all other requests during processing

Q: How do you handle JSON parsing issues from LLM responses?

LLMs often return invalid JSON, especially with LaTeX content. We handle three categories:

Markdown code fences: Strip ```json ... ``` wrappers
Control characters: Remove \x00-\x1f except \t\n\r
Invalid escape sequences: LaTeX like \sqrt, \sigma produces invalid JSON escapes. We use a regex that only fixes odd-count backslash sequences before non-escape characters:
```
re.sub(r'(?<!\\)((?:\\\\)*)\\([^"\\/bfnrtu])', r'\1\\\\\2', text)
```
This correctly handles \\sqrt (valid: literal backslash + sqrt) vs \sqrt (invalid: needs fixing) vs \\\sqrt (odd count: needs fixing).

Q: What about data privacy and security?

All data stored in Supabase with Row Level Security (RLS) — users can only access their own attempts, error books, and uploads
Photo uploads stored in Supabase Storage with per-user path isolation (attempts/{user_id}/{question_id}/)
API authentication via Supabase JWT tokens, validated on every request
No student data is sent to AI models beyond the current question context — no cross-user data leakage
Server deployed in Singapore (Tencent Cloud), compliant with HK data regulations

Part 2: Blockchain & KnowIt Coin

Q: Why blockchain? Isn't this just a points system?

No — a points system is centralized and opaque. Blockchain gives us three things a database can't:

Verifiable attribution — When a student uploads a high-quality note or past paper analysis, the contribution is recorded on-chain with a tamper-proof timestamp and creator identity. This isn't just a database entry we control — it's a credential the student owns.
Transparent usage tracking — When that content gets used by other students, referenced by AI, or bundled into a paid study pack, the usage chain is publicly verifiable. No black-box algorithms deciding who gets credit.
Trustless revenue sharing — Smart contracts can automatically distribute earnings to original creators based on on-chain usage records, without requiring users to trust our platform's accounting.

Q: How would the on-chain content attribution actually work technically?

When a student uploads content (notes, flashcards, paper analysis), we:

Generate a content hash (SHA-256 of the file/text)
Record a transaction on-chain containing: creator_address, content_hash, timestamp, course_tags, content_type
Store the actual content off-chain (IPFS or our own storage) — only the hash goes on-chain for cost efficiency
When content is referenced or reused, a new transaction records the citation relationship: derived_from: [original_content_hash]

This creates an immutable provenance graph — you can trace any piece of study material back to its original creator(s).

Q: Which blockchain? Gas fees would kill micro-transactions for student content.

We'd use a Layer 2 solution or a low-cost chain like Polygon, Base, or Arbitrum. Transaction costs are fractions of a cent. For batch efficiency, we can aggregate multiple attribution records into a single on-chain transaction using Merkle trees — record 100 contributions in one tx.

Alternatively, we operate a hybrid model: maintain an off-chain ledger for real-time tracking, then periodically anchor batch proofs on-chain (e.g., daily settlement). This gives us the speed of centralized systems with the verifiability of blockchain.

Q: What's the role of the stablecoin vs KnowIt Coin?

KnowIt Coin = internal contribution token

Earned by: uploading quality content, annotating questions, correcting errors, community moderation
Spent on: premium AI features, accessing curated study packs, unlocking advanced analytics
Non-speculative, platform-governed supply

Stablecoin (e.g., USDC) = settlement layer

Used when real money enters/exits: creator payouts, premium subscriptions, cross-university transactions
Enables frictionless micro-payments: a student in HKUST pays 0.50 HKD for a study guide made by a student at CityU — no bank transfer needed
Smart contract handles revenue split automatically (e.g., 70% to creator, 20% to platform, 10% to content curators)

Q: How do you prevent low-quality content farming for coins?

Multi-layer quality control:

AI quality scoring — content is automatically evaluated for completeness, accuracy, and originality before earning coins
Community validation — peer ratings and usage metrics (how many students actually found it helpful)
Stake-weighted moderation — users with higher reputation (earned through sustained quality contributions) have more influence in content curation
Diminishing returns — bulk uploads of low-quality content yield progressively fewer coins

Q: What's the realistic timeline for blockchain integration?

Phase 1 (Current): Centralized platform with traditional database. Points/achievements tracked internally. This is where we are now.

Phase 2 (6-12 months): Introduce KnowIt Coin as an off-chain token with on-chain anchoring. Content attribution hashes recorded on-chain. Creator dashboard showing contribution history.

Phase 3 (12-24 months): Smart contract-based revenue sharing. Cross-university content marketplace. Stablecoin integration for payouts.

We're building the AI engine and user base first. Blockchain is the trust infrastructure layer that makes the community self-sustaining long-term.

Q: Isn't this over-engineered? Why not just use a database?

For a single university, yes — a database is fine. But our vision is a cross-university knowledge marketplace across Hong Kong and the Greater Bay Area. When content flows between institutions, you need:

Attribution that no single institution controls
Revenue sharing that creators can independently verify
Content provenance that survives platform changes

That's where blockchain becomes necessary, not optional. AI is our learning engine. Blockchain is our trust and incentive layer. Together, they turn one-time knowledge sharing into a sustainable, accumulating knowledge economy.

Part 3: Business & Scalability Questions

Q: How does this scale beyond HKUST?

The system is course-code agnostic — it works on any university's past papers. To expand:

Students at new universities upload their own papers (crowdsourced)
We partner with student unions for bulk paper sourcing
The AI pipeline is fully automated — no manual work per course

The topic normalization and analytics adapt automatically to each course's vocabulary.

Q: What's your competitive advantage vs ChatGPT / NotebookLM?

Feature	KnowIt	ChatGPT	NotebookLM
Localized past paper library	✅	❌	❌
Exam-oriented workflow	✅	❌	❌
Auto-grading with photo upload	✅	❌	❌
Similar question retrieval	✅	❌	❌
Variant question generation	✅	❌	❌
Error book with spaced repetition	✅	❌	❌
Course-specific analytics	✅	❌	❌
Price	Affordable	$20/mo	Free but limited

ChatGPT and NotebookLM are general-purpose tools. KnowIt is a vertical solution built specifically for exam preparation with features they can't replicate without building what we already have.

Q: What if Google or OpenAI builds this?

They build horizontal platforms. We build vertical depth — localized paper libraries, university-specific communities, exam-pattern analytics across semesters. Our data moat grows with every paper uploaded and every student interaction. A general-purpose AI can't replicate 5 years of COMP2211 exam pattern analysis overnight.

16 KiB Raw Permalink Blame History