PastpaperMaster/tech_defense.md
Zhao 7a09167261 Initial commit: PastPaper Master full stack
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-21 12:27:47 +07:00

# KnowIt Technical Defense Q&A
## Part 1: AI & Product Technical Questions
### Q: How does your AI analyze past papers? What model do you use?
We use a multi-model pipeline with clear separation of concerns:
1. **Vision extraction** (Gemini 2.5 Flash): We render each PDF page to a 96 DPI PNG image using PyMuPDF, encode it as base64, and send it to Gemini's vision API via an OpenAI-compatible endpoint. The model extracts every question into structured JSON: question number, type, text, options, score, topics, and difficulty. We process in batches of 8 pages to stay within token limits.
2. **Solution generation** (DeepSeek V3): For each extracted question, we generate three components — a knowledge reminder, a progressive hint, and a complete step-by-step solution. This is done in batches of 3 questions per API call to balance throughput and quality.
3. **Answer matching** (Gemini Vision): If an answer PDF is provided, we send its pages to the vision model and match answers to corresponding questions by question number.
The key architectural decision is **splitting vision and text tasks across different models** — Gemini for anything that needs to "see" the document, DeepSeek for pure text reasoning. This cuts cost by ~60% compared to using a single vision model for everything.
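
The render, encode, and batch steps can be sketched roughly as follows. This is a minimal sketch rather than the project's actual code: the helper names `render_pages`, `to_data_url`, and `batched` are hypothetical, and it assumes a PyMuPDF version (1.19.2 or later) where `get_pixmap` accepts a `dpi` argument.

```python
import base64
from typing import Iterator


def render_pages(pdf_path: str, dpi: int = 96) -> list[bytes]:
    """Render every page of a PDF to PNG bytes (requires PyMuPDF)."""
    import fitz  # PyMuPDF; imported lazily so the helpers below work without it

    with fitz.open(pdf_path) as doc:
        return [page.get_pixmap(dpi=dpi).tobytes("png") for page in doc]


def to_data_url(png_bytes: bytes) -> str:
    """Encode PNG bytes as a base64 data URL for an image message part."""
    return "data:image/png;base64," + base64.b64encode(png_bytes).decode("ascii")


def batched(items: list, size: int = 8) -> Iterator[list]:
    """Yield consecutive chunks of at most `size` items (8 pages per request)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

A 19-page paper would therefore go out as three requests of 8, 8, and 3 pages.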
### Q: Why vision mode instead of traditional PDF text extraction?
We started with pdfplumber-based text extraction and hit critical failures:
- **Multi-line code blocks** break apart: `C = np.array([[[0,1,2,3],` gets separated from its closing brackets across lines
- **Mathematical notation** is lost or garbled
- **Table structures** collapse into unreadable strings
- **Mixed formatting** (code + text + formulas on the same page) confuses parsers
Vision mode sends the raw page image — the model sees exactly what a student sees. On our COMP2211 benchmark (Python-heavy, lots of NumPy arrays), vision mode correctly extracts 100% of questions vs ~70% with text extraction.
### Q: How accurate is the AI on code questions?
For code output questions (e.g., "What does `print(A[2:-2:3])` output?"), we don't rely on the LLM to calculate. We run **Python exec()** on the actual code:
```python
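import contextlib, io
import numpy as np
ns = {"np": np}
exec(extract_code_lines(question_text), ns)  # set up variables from the question
buf = io.StringIO()
with contextlib.redirect_stdout(buf):  # exec() returns None, so capture stdout
    exec("print(A[2:-2:3])", ns)
output = buf.getvalue().strip()  # "[ 7 10]": ground truth, fed to AI as reference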
```
We maintain a **shared namespace per question group** so variables defined in a parent question (e.g., `A = np.arange(5,15)`) are available to all sub-questions. This gives us 100% accuracy on Python output questions.
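
The shared-namespace idea can be illustrated with a small, self-contained sketch. The function name `run_question_group` is hypothetical, and plain Python lists stand in for NumPy arrays so the example runs without extra dependencies.

```python
import contextlib
import io


def run_question_group(setup_code: str, sub_snippets: list[str]) -> list[str]:
    """Run the parent setup once, then each sub-question in the same namespace."""
    namespace: dict = {}
    exec(setup_code, namespace)  # parent-question variables live here
    outputs = []
    for snippet in sub_snippets:
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):  # capture what print() writes
            exec(snippet, namespace)
        outputs.append(buffer.getvalue().strip())
    return outputs


results = run_question_group(
    "A = list(range(5, 15))",
    ["print(A[2:-2:3])", "print(len(A))"],
)
print(results)  # ['[7, 10]', '10']
```

Note that a namespace dict only isolates variables between question groups. It is not a security boundary, so untrusted code would need process-level isolation on top of this.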
### Q: How does auto-grading work technically?
Three-step pipeline, all non-blocking:
1. **Gemini Vision OCR**: Student's photo → base64 → Gemini with OCR prompt → extracted text with LaTeX formulas preserved (`$\mu = 8.03$`)
2. **DeepSeek Grading**: We inject the question text, reference answer (from DB), and OCR'd student answer into a structured prompt. The model returns JSON:
```json
{
"is_correct": false,
"score_given": 2,
"feedback": "<HTML with KaTeX>Step 1 correct, but in Step 3...",
"error_at_step": 3
}
```
3. **Persistence**: Result stored in `user_attempts` table with `user_id`, `question_id`, `feedback`, `photo_url`. Wrong answers auto-added to error book. Frontend loads historical results on page load via `GET /api/attempts/by-paper/{paper_id}`.
Grading runs in a thread pool (`asyncio.to_thread`) so it never blocks the main event loop — other users can browse papers, load questions, etc. while grading runs.
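
A minimal sketch of the non-blocking pattern, assuming the OCR and grading calls are ordinary blocking functions. Here `ocr_photo` and `grade_answer` are hypothetical stand-ins that sleep instead of calling a model.

```python
import asyncio
import time


def ocr_photo(photo_b64: str) -> str:
    """Stand-in for the blocking vision OCR call."""
    time.sleep(0.05)
    return "mu = 8.03"


def grade_answer(question: str, reference: str, student: str) -> dict:
    """Stand-in for the blocking text-model grading call."""
    time.sleep(0.05)
    correct = student.strip() == reference.strip()
    return {"is_correct": correct, "score_given": 2 if correct else 0}


async def grade_submission(question: str, reference: str, photo_b64: str) -> dict:
    # Each blocking call runs in a worker thread, so the event loop stays free.
    student_text = await asyncio.to_thread(ocr_photo, photo_b64)
    return await asyncio.to_thread(grade_answer, question, reference, student_text)


async def main() -> dict:
    ticks = []

    async def heartbeat() -> None:
        # Simulates other requests being served while grading is in progress.
        for _ in range(3):
            ticks.append(time.monotonic())
            await asyncio.sleep(0.01)

    result, _ = await asyncio.gather(
        grade_submission("Estimate mu", "mu = 8.03", "<base64>"), heartbeat()
    )
    return {"result": result, "ticks": len(ticks)}


summary = asyncio.run(main())
```

The heartbeat coroutine keeps ticking while the two blocking calls run, which is the behavior the thread pool is meant to guarantee.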
### Q: How does the similar question retrieval work?
Multi-signal similarity scoring with caching:
1. **Topic normalization**: We maintain an alias map (e.g., "Numpy"/"NumPy" → "NumPy", "Naïve Bayes"/"Naive Bayes Classifier" → "Naive Bayes") — about 80 aliases covering common variations.
2. **Candidate filtering**: Pre-filter by `analytics_topic` in PostgreSQL (cuts candidates from ~250 to ~30 for a given course).
3. **Scoring** (up to 100 points):
   - Topic overlap: up to 40 pts (exact `analytics_topic` match = 30, shared `topic_tags` = 10)
   - Question type match: 15 pts
   - Difficulty match: 10 pts
   - PostgreSQL `ts_rank_cd` full-text similarity: up to 20 pts
   - Same parent question structure: 5 pts
4. **Caching**: The first computation is stored in the `similar_questions` JSONB column on the question row, so subsequent loads are instant. Hot questions are also kept in an in-memory cache with a 5-minute TTL.
5. **Deduplication**: Only the best-matching question per paper is shown (avoid showing 5 questions from the same exam).
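
The scoring and deduplication steps above can be sketched as follows. This is an illustration under stated assumptions, not the production scorer: the dict field names, the three-entry alias map, and the helper names are hypothetical; `text_rank` is assumed to be a full-text similarity already scaled to [0, 1]; and "same parent question structure" is interpreted as both questions either having or lacking a parent. Only the weights listed above are used.

```python
ALIASES = {
    "numpy": "NumPy",
    "naïve bayes": "Naive Bayes",
    "naive bayes classifier": "Naive Bayes",
}


def normalize(topic: str) -> str:
    """Map known spelling variants onto one canonical topic name."""
    return ALIASES.get(topic.strip().lower(), topic.strip())


def score(query: dict, cand: dict, text_rank: float) -> float:
    points = 0.0
    if normalize(query["analytics_topic"]) == normalize(cand["analytics_topic"]):
        points += 30  # exact analytics_topic match
    q_tags = {normalize(t) for t in query["topic_tags"]}
    c_tags = {normalize(t) for t in cand["topic_tags"]}
    if q_tags & c_tags:
        points += 10  # at least one shared topic tag
    if query["type"] == cand["type"]:
        points += 15
    if query["difficulty"] == cand["difficulty"]:
        points += 10
    points += 20 * max(0.0, min(1.0, text_rank))  # clamp, then scale to 20 pts
    if query["has_parent"] == cand["has_parent"]:
        points += 5
    return points


def best_per_paper(scored: list[tuple[dict, float]]) -> list[tuple[dict, float]]:
    """Keep only the top-scoring candidate from each paper, best first."""
    best: dict = {}
    for cand, pts in scored:
        paper_id = cand["paper_id"]
        if paper_id not in best or pts > best[paper_id][1]:
            best[paper_id] = (cand, pts)
    return sorted(best.values(), key=lambda pair: pair[1], reverse=True)
```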
### Q: What's the processing pipeline architecture? How do you handle failures?
**Checkpoint-based processing with auto-resume:**
The pipeline has 5 stages. Stages 4 and 5 checkpoint each item to the database as it is produced:
| Stage | What happens | Checkpoint |
|-------|-------------|-----------|
| 1. Render | PDF → PNG images (96 DPI) | In memory only |
| 2. Extract | Vision API → structured questions | Progress bar updated per batch |
| 3. Match answers | Answer PDF → question mapping | Optional, failure skipped |
| 4. Save questions | Write all questions to DB | **Each question persisted immediately** |
| 5. AI trio | Generate solutions per question | **Each solution written individually** |
If the server crashes at stage 5 (say, 15/35 solutions generated), on restart:
- `lifespan` startup hook detects papers with `status=processing`
- Checks `paper_questions` table — finds 35 questions, 15 with solutions
- Calls `_resume_ai_trio()` which only processes the 20 missing ones
- Marks paper as `ready` when done
The processing runs in a **daemon thread** with its own event loop (`threading.Thread` + `asyncio.run`), completely isolated from the FastAPI server.
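
The resume logic amounts to "process only what is missing". A minimal sketch, with an in-memory dict standing in for the `paper_questions` table; the names `find_resumable` and `resume_missing_solutions` and the `generate` callback are hypothetical.

```python
from typing import Callable


def find_resumable(papers: list[dict]) -> list[dict]:
    """Papers still marked as processing were interrupted mid-pipeline."""
    return [paper for paper in papers if paper["status"] == "processing"]


def resume_missing_solutions(paper: dict, generate: Callable[[str], str]) -> int:
    """Generate solutions only for questions that lack one; return how many ran."""
    generated = 0
    for question in paper["questions"]:
        if question.get("solution") is None:
            # In the real pipeline each result would be written to the DB here,
            # so a second crash would lose at most the in-flight question.
            question["solution"] = generate(question["text"])
            generated += 1
    paper["status"] = "ready"
    return generated


paper = {
    "status": "processing",
    "questions": [
        {"text": f"Q{i}", "solution": "done" if i < 15 else None} for i in range(35)
    ],
}
for interrupted in find_resumable([paper]):
    count = resume_missing_solutions(interrupted, lambda text: f"solution for {text}")
```

With 35 questions and 15 existing solutions, only the remaining 20 are regenerated.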
### Q: What's your RAG pipeline for the AI Tutor?
We use **LangChain** with a vector database (**SurrealDB**) to index three content types:
1. **Lecture recordings**: Downloaded from Canvas → FFmpeg extracts audio → Whisper transcribes with timestamps → chunked into ~500 token segments with navigation markers
2. **Courseware PDFs/PPTs**: Extracted and chunked with metadata (course code, topic, page)
3. **Past paper content**: Question text + solutions indexed with topic tags
Retrieval flow: Student query → embedding → top-K vector search → retrieved chunks + question context → LLM generates grounded answer with source citations.
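
The retrieval flow can be illustrated without any specific framework. The sketch below uses pure-Python cosine similarity over toy two-dimensional embeddings. It does not use the LangChain or SurrealDB APIs, and the chunk fields and function names are assumptions made for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunks: list[dict], k: int = 2) -> list[dict]:
    """Return the k chunks whose embeddings are most similar to the query."""
    ranked = sorted(
        chunks, key=lambda c: cosine(query_vec, c["embedding"]), reverse=True
    )
    return ranked[:k]


def build_prompt(question: str, retrieved: list[dict]) -> str:
    """Number the retrieved chunks so the model can cite them."""
    sources = "\n".join(
        f"[{i + 1}] ({c['source']}) {c['text']}" for i, c in enumerate(retrieved)
    )
    return (
        "Answer using only the sources below and cite them by number.\n"
        f"{sources}\n\nQuestion: {question}"
    )


chunks = [
    {"source": "lecture", "text": "Slicing uses start:stop:step.", "embedding": [1.0, 0.0]},
    {"source": "slides", "text": "Negative indices count from the end.", "embedding": [0.9, 0.1]},
    {"source": "past paper", "text": "Bayes' rule question.", "embedding": [0.0, 1.0]},
]
retrieved = top_k([1.0, 0.0], chunks, k=2)
prompt = build_prompt("How does A[2:-2:3] work?", retrieved)
```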
### Q: How do you handle API costs?
**Cost optimization at every layer:**
| Strategy | Savings |
|----------|---------|
| Model splitting (Gemini vision + DeepSeek text) | ~60% vs single model |
| 96 DPI rendering (down from 120) | ~26% fewer tokens per page |
| 8-page batches for vision | Fewer API calls |
| Answer matching failure = skip, not retry forever | Prevents cost runaway |
| `similar_questions` cached in DB column | One-time compute per question |
| DeepSeek at $0.28/M input vs Gemini at $0.15/M + vision overhead | Text tasks 2-3x cheaper |
Per-paper cost: **~$1-2 USD** for full processing (extraction + answer matching + 40 solutions).
Per-grading: **~$0.02** (one vision OCR + one text grading call).
### Q: What's your tech stack?
| Layer | Technology | Why |
|-------|-----------|-----|
| Frontend | React 18 + Vite + TypeScript | Fast SPA, hot reload |
| Backend | FastAPI (Python 3.12, async) | Native async, OpenAPI docs |
| Database | PostgreSQL via Supabase | Relational + Auth + Storage + RLS |
| Vector DB | SurrealDB | RAG retrieval for AI Tutor |
| Cache | Redis | Session cache, rate limiting |
| Vision AI | Gemini 2.5 Flash (Google official API) | Best vision quality, free tier |
| Text AI | DeepSeek V3 (deepseek-chat) | Cheapest frontier model, no rate limits |
| PDF Rendering | PyMuPDF (fitz) | Fast, accurate page-to-image |
| Code Execution | Python exec() with a dedicated namespace | Ground-truth for code output questions |
| Math Rendering | KaTeX (client-side) | Fast LaTeX rendering, no server round-trip |
| Transcription | Whisper + FFmpeg | Lecture recording → text |
| Deployment | Docker + OpenResty + Let's Encrypt | Single server, HTTPS, reverse proxy |
| Hosting | Tencent Cloud Singapore (2C4G) | Low latency to HK, Gemini API accessible |
### Q: How do you handle concurrent uploads?
1. Upload endpoint reads file bytes, creates DB record (`status: processing`), returns paper ID immediately (~200ms response)
2. Processing spawns in a **daemon thread** with its own asyncio event loop — completely isolated from the FastAPI server
3. Frontend polls `GET /api/papers/mine` every 4 seconds, shows real-time progress bar ("Reading pages 1-8...", "Generating solutions 12/35 questions")
4. Multiple papers can process simultaneously (each in its own thread)
5. Server stays responsive for all other requests during processing
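
A minimal sketch of the "return immediately, process in a daemon thread" pattern. The `STATUS` dict stands in for the database record that the frontend polls, and `process_paper` and `start_processing` are hypothetical names with sleeps in place of real work.

```python
import asyncio
import threading

STATUS: dict[str, str] = {}  # stand-in for the papers table


async def process_paper(paper_id: str) -> None:
    STATUS[paper_id] = "processing: extracting"
    await asyncio.sleep(0.05)  # placeholder for vision extraction
    STATUS[paper_id] = "processing: generating solutions"
    await asyncio.sleep(0.05)  # placeholder for solution generation
    STATUS[paper_id] = "ready"


def start_processing(paper_id: str) -> threading.Thread:
    """Return immediately; the work runs in a daemon thread with its own loop."""
    STATUS[paper_id] = "processing"
    worker = threading.Thread(
        target=lambda: asyncio.run(process_paper(paper_id)), daemon=True
    )
    worker.start()
    return worker


threads = [start_processing(pid) for pid in ("paper-1", "paper-2")]
for thread in threads:
    thread.join()  # only for the demo; a server would let the frontend poll instead
```

Each call to `asyncio.run` creates a fresh event loop in its own thread, so two papers can progress at the same time without sharing a loop with the web server.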
### Q: How do you handle JSON parsing issues from LLM responses?
LLMs often return invalid JSON, especially with LaTeX content. We handle three categories:
1. **Markdown code fences**: Strip ` ```json ... ``` ` wrappers
2. **Control characters**: Remove `\x00-\x1f` except `\t\n\r`
3. **Invalid escape sequences**: LaTeX commands such as `\sqrt` and `\sigma` produce invalid JSON escapes. We use a regex that fixes only **odd-count backslash sequences** before non-escape characters:
```python
re.sub(r'(?<!\\)((?:\\\\)*)\\([^"\\/bfnrtu])', r'\1\\\\\2', text)
```
This correctly handles `\\sqrt` (valid: literal backslash + sqrt) vs `\sqrt` (invalid: needs fixing) vs `\\\sqrt` (odd count: needs fixing).
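
The three repair steps can be combined into one helper. The sketch below reuses the regex shown above. The function name `parse_llm_json` and the exact fence-stripping pattern are assumptions.

```python
import json
import re

# Leading ```json (or bare ```) and trailing ``` wrappers.
FENCE = re.compile(r"^\s*```(?:json)?\s*|\s*```\s*$")
# Control characters \x00-\x1f, except \t (\x09), \n (\x0a), and \r (\x0d).
CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")
# An unescaped backslash followed by a character that is not a valid JSON escape.
BAD_ESCAPE = re.compile(r'(?<!\\)((?:\\\\)*)\\([^"\\/bfnrtu])')


def parse_llm_json(raw: str) -> dict:
    text = FENCE.sub("", raw)
    text = CONTROL.sub("", text)
    text = BAD_ESCAPE.sub(r"\1\\\\\2", text)  # double the offending backslash
    return json.loads(text)


raw = (
    "```json\n"
    + r'{"feedback": "Use $\sqrt{x}$ and $\\alpha$", "score_given": 2}'
    + "\x07\n```"
)
parsed = parse_llm_json(raw)
print(parsed["feedback"])  # Use $\sqrt{x}$ and $\alpha$
```

One caveat with this approach: commands that begin with a valid JSON escape letter, such as `\theta`, `\nu`, or `\frac`, are not touched by the regex, so they decode as a tab, newline, or form feed followed by the rest of the word. Prompting the model to double its backslashes is one way to reduce how often that happens.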
### Q: What about data privacy and security?
- All data stored in **Supabase with Row Level Security (RLS)** — users can only access their own attempts, error books, and uploads
- Photo uploads stored in Supabase Storage with per-user path isolation (`attempts/{user_id}/{question_id}/`)
- API authentication via Supabase JWT tokens, validated on every request
- No student data is sent to AI models beyond the current question context — no cross-user data leakage
- Server deployed in Singapore (Tencent Cloud), compliant with HK data regulations
---
## Part 2: Blockchain & KnowIt Coin
### Q: Why blockchain? Isn't this just a points system?
No — a points system is centralized and opaque. Blockchain gives us three things a database can't:
1. **Verifiable attribution** — When a student uploads a high-quality note or past paper analysis, the contribution is recorded on-chain with a tamper-proof timestamp and creator identity. This isn't just a database entry we control — it's a credential the student owns.
2. **Transparent usage tracking** — When that content gets used by other students, referenced by AI, or bundled into a paid study pack, the usage chain is publicly verifiable. No black-box algorithms deciding who gets credit.
3. **Trustless revenue sharing** — Smart contracts can automatically distribute earnings to original creators based on on-chain usage records, without requiring users to trust our platform's accounting.
### Q: How would the on-chain content attribution actually work technically?
When a student uploads content (notes, flashcards, paper analysis), we:
1. Generate a **content hash** (SHA-256 of the file/text)
2. Record a transaction on-chain containing: `creator_address`, `content_hash`, `timestamp`, `course_tags`, `content_type`
3. Store the actual content off-chain (IPFS or our own storage) — only the hash goes on-chain for cost efficiency
4. When content is referenced or reused, a new transaction records the **citation relationship**: `derived_from: [original_content_hash]`
This creates an immutable provenance graph — you can trace any piece of study material back to its original creator(s).
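
As a rough illustration of the record that would be anchored on-chain, the sketch below hashes content with SHA-256 and builds a plain dict with the fields listed above. It only models the data: no chain interaction is shown, and the function names and the example address are hypothetical.

```python
import hashlib
import time
from typing import Optional


def content_hash(data: bytes) -> str:
    """SHA-256 of the raw content, hex-encoded."""
    return hashlib.sha256(data).hexdigest()


def attribution_record(
    creator_address: str,
    data: bytes,
    course_tags: list[str],
    content_type: str,
    derived_from: Optional[list[str]] = None,
) -> dict:
    """Build the metadata record; the content itself stays off-chain."""
    return {
        "creator_address": creator_address,
        "content_hash": content_hash(data),
        "timestamp": int(time.time()),
        "course_tags": course_tags,
        "content_type": content_type,
        "derived_from": derived_from or [],  # hashes of the cited originals
    }


original = attribution_record("0xabc", b"abc", ["COMP2211"], "note")
derived = attribution_record(
    "0xdef", b"abc, annotated", ["COMP2211"], "flashcards",
    derived_from=[original["content_hash"]],
)
```

Following `derived_from` links from any record back to records with an empty list recovers the provenance chain.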
### Q: Which blockchain? Gas fees would kill micro-transactions for student content.
We'd use a **Layer 2 solution** or a low-cost chain like **Polygon**, **Base**, or **Arbitrum**. Transaction costs are fractions of a cent. For batch efficiency, we can aggregate multiple attribution records into a single on-chain transaction using **Merkle trees** — record 100 contributions in one tx.
Alternatively, we operate a **hybrid model**: maintain an off-chain ledger for real-time tracking, then periodically anchor batch proofs on-chain (e.g., daily settlement). This gives us the speed of centralized systems with the verifiability of blockchain.
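
Batch anchoring with a Merkle tree can be sketched as below. Only the root would go on-chain. Duplicating the last node on odd-sized levels is one common convention and is an assumption here, as is the function name `merkle_root`.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves: list[bytes]) -> str:
    """Hash leaves pairwise, level by level, until a single root remains."""
    if not leaves:
        raise ValueError("need at least one leaf")
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:  # odd count: duplicate the last node
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0].hex()


records = [f"contribution-{i}".encode() for i in range(100)]
root = merkle_root(records)  # one 32-byte value summarizing 100 records
```

Changing any single record changes the root, which is what lets one on-chain value vouch for the whole batch.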
### Q: What's the role of the stablecoin vs KnowIt Coin?
**KnowIt Coin** = internal contribution token
- Earned by: uploading quality content, annotating questions, correcting errors, community moderation
- Spent on: premium AI features, accessing curated study packs, unlocking advanced analytics
- Non-speculative, platform-governed supply
**Stablecoin** (e.g., USDC) = settlement layer
- Used when real money enters/exits: creator payouts, premium subscriptions, cross-university transactions
- Enables frictionless micro-payments: a student at HKUST pays 0.50 HKD for a study guide made by a student at CityU, with no bank transfer needed
- Smart contract handles revenue split automatically (e.g., 70% to creator, 20% to platform, 10% to content curators)
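
The arithmetic of the example split is simple but needs care with rounding on very small payments. The sketch below works in integer cents. Sending any rounding remainder to the creator is an assumption, and a real implementation would live in a smart contract rather than in Python.

```python
SHARES = {"creator": 70, "platform": 20, "curators": 10}  # percentages from the example


def split_revenue(amount_cents: int) -> dict[str, int]:
    """Split an integer amount; any rounding remainder goes to the creator."""
    payouts = {name: amount_cents * pct // 100 for name, pct in SHARES.items()}
    payouts["creator"] += amount_cents - sum(payouts.values())
    return payouts


print(split_revenue(50))  # {'creator': 35, 'platform': 10, 'curators': 5}
print(split_revenue(7))   # {'creator': 6, 'platform': 1, 'curators': 0}
```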
### Q: How do you prevent low-quality content farming for coins?
Multi-layer quality control:
1. **AI quality scoring** — content is automatically evaluated for completeness, accuracy, and originality before earning coins
2. **Community validation** — peer ratings and usage metrics (how many students actually found it helpful)
3. **Stake-weighted moderation** — users with higher reputation (earned through sustained quality contributions) have more influence in content curation
4. **Diminishing returns** — bulk uploads of low-quality content yield progressively fewer coins
### Q: What's the realistic timeline for blockchain integration?
**Phase 1 (Current)**: Centralized platform with traditional database. Points/achievements tracked internally. This is where we are now.
**Phase 2 (6-12 months)**: Introduce KnowIt Coin as an off-chain token with on-chain anchoring. Content attribution hashes recorded on-chain. Creator dashboard showing contribution history.
**Phase 3 (12-24 months)**: Smart contract-based revenue sharing. Cross-university content marketplace. Stablecoin integration for payouts.
We're building the AI engine and user base first. Blockchain is the trust infrastructure layer that makes the community self-sustaining long-term.
### Q: Isn't this over-engineered? Why not just use a database?
For a single university, yes — a database is fine. But our vision is a **cross-university knowledge marketplace** across Hong Kong and the Greater Bay Area. When content flows between institutions, you need:
- Attribution that no single institution controls
- Revenue sharing that creators can independently verify
- Content provenance that survives platform changes
That's where blockchain becomes necessary, not optional. AI is our learning engine. Blockchain is our trust and incentive layer. Together, they turn one-time knowledge sharing into a sustainable, accumulating knowledge economy.
---
## Part 3: Business & Scalability Questions
### Q: How does this scale beyond HKUST?
The system is **course-code agnostic** — it works on any university's past papers. To expand:
1. Students at new universities upload their own papers (crowdsourced)
2. We partner with student unions for bulk paper sourcing
3. The AI pipeline is fully automated — no manual work per course
The topic normalization and analytics adapt automatically to each course's vocabulary.
### Q: What's your competitive advantage vs ChatGPT / NotebookLM?
| Feature | KnowIt | ChatGPT | NotebookLM |
|---------|--------|---------|------------|
| Localized past paper library | ✅ | ❌ | ❌ |
| Exam-oriented workflow | ✅ | ❌ | ❌ |
| Auto-grading with photo upload | ✅ | ❌ | ❌ |
| Similar question retrieval | ✅ | ❌ | ❌ |
| Variant question generation | ✅ | ❌ | ❌ |
| Error book with spaced repetition | ✅ | ❌ | ❌ |
| Course-specific analytics | ✅ | ❌ | ❌ |
| Price | Affordable | $20/mo | Free but limited |
ChatGPT and NotebookLM are general-purpose tools. KnowIt is a **vertical solution** built specifically for exam preparation with features they can't replicate without building what we already have.
### Q: What if Google or OpenAI builds this?
They build horizontal platforms. We build **vertical depth** — localized paper libraries, university-specific communities, exam-pattern analytics across semesters. Our data moat grows with every paper uploaded and every student interaction. A general-purpose AI can't replicate 5 years of COMP2211 exam pattern analysis overnight.