Initial commit: PastPaper Master full stack
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
274
tech_defense.md
Normal file
274
tech_defense.md
Normal file
@@ -0,0 +1,274 @@
|
||||
# KnowIt Technical Defense Q&A
|
||||
|
||||
## Part 1: AI & Product Technical Questions
|
||||
|
||||
### Q: How does your AI analyze past papers? What model do you use?
|
||||
|
||||
We use a multi-model pipeline with clear separation of concerns:
|
||||
|
||||
1. **Vision extraction** (Gemini 2.5 Flash): We render each PDF page to a 96 DPI PNG image using PyMuPDF, encode it as base64, and send it to Gemini's vision API via OpenAI-compatible endpoint. The model extracts every question into structured JSON — question number, type, text, options, score, topics, difficulty. We process in batches of 8 pages to stay within token limits.
|
||||
|
||||
2. **Solution generation** (DeepSeek V3): For each extracted question, we generate three components — a knowledge reminder, a progressive hint, and a complete step-by-step solution. This is done in batches of 3 questions per API call to balance throughput and quality.
|
||||
|
||||
3. **Answer matching** (Gemini Vision): If an answer PDF is provided, we send its pages to the vision model and match answers to corresponding questions by question number.
|
||||
|
||||
The key architectural decision is **splitting vision and text tasks across different models** — Gemini for anything that needs to "see" the document, DeepSeek for pure text reasoning. This cuts cost by ~60% compared to using a single vision model for everything.
|
||||
|
||||
### Q: Why vision mode instead of traditional PDF text extraction?
|
||||
|
||||
We started with pdfplumber-based text extraction and hit critical failures:
|
||||
|
||||
- **Multi-line code blocks** break apart: `C = np.array([[[0,1,2,3],` gets separated from its closing brackets across lines
|
||||
- **Mathematical notation** is lost or garbled
|
||||
- **Table structures** collapse into unreadable strings
|
||||
- **Mixed formatting** (code + text + formulas on the same page) confuses parsers
|
||||
|
||||
Vision mode sends the raw page image — the model sees exactly what a student sees. On our COMP2211 benchmark (Python-heavy, lots of NumPy arrays), vision mode correctly extracts 100% of questions vs ~70% with text extraction.
|
||||
|
||||
### Q: How accurate is the AI on code questions?
|
||||
|
||||
For code output questions (e.g., "What does `print(A[2:-2:3])` output?"), we don't rely on the LLM to calculate. We run **Python exec()** on the actual code:
|
||||
|
||||
```python
|
||||
ns = {"np": np}
|
||||
exec(extract_code_lines(question_text), ns) # setup variables
|
||||
output = exec("print(A[2:-2:3])", ns) # capture stdout
|
||||
# output = "[ 7 10]" — ground truth, fed to AI as reference
|
||||
```
|
||||
|
||||
We maintain a **shared namespace per question group** so variables defined in a parent question (e.g., `A = np.arange(5,15)`) are available to all sub-questions. This gives us 100% accuracy on Python output questions.
|
||||
|
||||
### Q: How does auto-grading work technically?
|
||||
|
||||
Three-step pipeline, all non-blocking (run in thread pool via `asyncio.to_thread`):
|
||||
|
||||
1. **Gemini Vision OCR**: Student's photo → base64 → Gemini with OCR prompt → extracted text with LaTeX formulas preserved (`$\mu = 8.03$`)
|
||||
|
||||
2. **DeepSeek Grading**: We inject the question text, reference answer (from DB), and OCR'd student answer into a structured prompt. The model returns JSON:
|
||||
```json
|
||||
{
|
||||
"is_correct": false,
|
||||
"score_given": 2,
|
||||
"feedback": "<HTML with KaTeX>Step 1 correct, but in Step 3...",
|
||||
"error_at_step": 3
|
||||
}
|
||||
```
|
||||
|
||||
3. **Persistence**: Result stored in `user_attempts` table with `user_id`, `question_id`, `feedback`, `photo_url`. Wrong answers auto-added to error book. Frontend loads historical results on page load via `GET /api/attempts/by-paper/{paper_id}`.
|
||||
|
||||
Grading runs in a thread pool (`asyncio.to_thread`) so it never blocks the main event loop — other users can browse papers, load questions, etc. while grading runs.
|
||||
|
||||
### Q: How does the similar question retrieval work?
|
||||
|
||||
Multi-signal similarity scoring with caching:
|
||||
|
||||
1. **Topic normalization**: We maintain an alias map (e.g., "Numpy"/"NumPy" → "NumPy", "Naïve Bayes"/"Naive Bayes Classifier" → "Naive Bayes") — about 80 aliases covering common variations.
|
||||
|
||||
2. **Candidate filtering**: Pre-filter by `analytics_topic` in PostgreSQL (cuts candidates from ~250 to ~30 for a given course).
|
||||
|
||||
3. **Scoring** (up to 100 points):
|
||||
- Topic overlap: up to 40 pts (exact analytics_topic match = 30, shared topic_tags = 10)
|
||||
- Question type match: 15 pts
|
||||
- Difficulty match: 10 pts
|
||||
- PostgreSQL `ts_rank_cd` full-text similarity: up to 20 pts
|
||||
- Same parent question structure: 5 pts
|
||||
|
||||
4. **Caching**: First computation is stored in `similar_questions` JSONB column on the question row. Subsequent loads are instant. In-memory cache with 5-minute TTL for hot questions.
|
||||
|
||||
5. **Deduplication**: Only the best-matching question per paper is shown (avoid showing 5 questions from the same exam).
|
||||
|
||||
### Q: What's the processing pipeline architecture? How do you handle failures?
|
||||
|
||||
**Checkpoint-based processing with auto-resume:**
|
||||
|
||||
The pipeline has 5 stages, each checkpointed to the database:
|
||||
|
||||
| Stage | What happens | Checkpoint |
|
||||
|-------|-------------|-----------|
|
||||
| 1. Render | PDF → PNG images (96 DPI) | In memory only |
|
||||
| 2. Extract | Vision API → structured questions | Progress bar updated per batch |
|
||||
| 3. Match answers | Answer PDF → question mapping | Optional, failure skipped |
|
||||
| 4. Save questions | Write all questions to DB | **Each question persisted immediately** |
|
||||
| 5. AI trio | Generate solutions per question | **Each solution written individually** |
|
||||
|
||||
If the server crashes at stage 5 (say, 15/35 solutions generated), on restart:
|
||||
- `lifespan` startup hook detects papers with `status=processing`
|
||||
- Checks `paper_questions` table — finds 35 questions, 15 with solutions
|
||||
- Calls `_resume_ai_trio()` which only processes the 20 missing ones
|
||||
- Marks paper as `ready` when done
|
||||
|
||||
The processing runs in a **daemon thread** with its own event loop (`threading.Thread` + `asyncio.run`), completely isolated from the FastAPI server.
|
||||
|
||||
### Q: What's your RAG pipeline for the AI Tutor?
|
||||
|
||||
We use **LangChain** with a vector database (**SurrealDB**) to index three content types:
|
||||
|
||||
1. **Lecture recordings**: Downloaded from Canvas → FFmpeg extracts audio → Whisper transcribes with timestamps → chunked into ~500 token segments with navigation markers
|
||||
2. **Courseware PDFs/PPTs**: Extracted and chunked with metadata (course code, topic, page)
|
||||
3. **Past paper content**: Question text + solutions indexed with topic tags
|
||||
|
||||
Retrieval flow: Student query → embedding → top-K vector search → retrieved chunks + question context → LLM generates grounded answer with source citations.
|
||||
|
||||
### Q: How do you handle API costs?
|
||||
|
||||
**Cost optimization at every layer:**
|
||||
|
||||
| Strategy | Savings |
|
||||
|----------|---------|
|
||||
| Model splitting (Gemini vision + DeepSeek text) | ~60% vs single model |
|
||||
| 96 DPI rendering (down from 120) | ~26% fewer tokens per page |
|
||||
| 8-page batches for vision | Fewer API calls |
|
||||
| Answer matching failure = skip, not retry forever | Prevents cost runaway |
|
||||
| `similar_questions` cached in DB column | One-time compute per question |
|
||||
| DeepSeek at $0.28/M input vs Gemini at $0.15/M + vision overhead | Text tasks 2-3x cheaper |
|
||||
|
||||
Per-paper cost: **~$1-2 USD** for full processing (extraction + answer matching + 40 solutions).
|
||||
Per-grading: **~$0.02** (one vision OCR + one text grading call).
|
||||
|
||||
### Q: What's your tech stack?
|
||||
|
||||
| Layer | Technology | Why |
|
||||
|-------|-----------|-----|
|
||||
| Frontend | React 18 + Vite + TypeScript | Fast SPA, hot reload |
|
||||
| Backend | FastAPI (Python 3.12, async) | Native async, OpenAPI docs |
|
||||
| Database | PostgreSQL via Supabase | Relational + Auth + Storage + RLS |
|
||||
| Vector DB | SurrealDB | RAG retrieval for AI Tutor |
|
||||
| Cache | Redis | Session cache, rate limiting |
|
||||
| Vision AI | Gemini 2.5 Flash (Google official API) | Best vision quality, free tier |
|
||||
| Text AI | DeepSeek V3 (deepseek-chat) | Cheapest frontier model, no rate limits |
|
||||
| PDF Rendering | PyMuPDF (fitz) | Fast, accurate page-to-image |
|
||||
| Code Execution | Python exec() with sandboxed namespace | Ground-truth for code output questions |
|
||||
| Math Rendering | KaTeX (client-side) | Fast LaTeX rendering, no server round-trip |
|
||||
| Transcription | Whisper + FFmpeg | Lecture recording → text |
|
||||
| Deployment | Docker + OpenResty + Let's Encrypt | Single server, HTTPS, reverse proxy |
|
||||
| Hosting | Tencent Cloud Singapore (2C4G) | Low latency to HK, Gemini API accessible |
|
||||
|
||||
### Q: How do you handle concurrent uploads?
|
||||
|
||||
1. Upload endpoint reads file bytes, creates DB record (`status: processing`), returns paper ID immediately (~200ms response)
|
||||
2. Processing spawns in a **daemon thread** with its own asyncio event loop — completely isolated from the FastAPI server
|
||||
3. Frontend polls `GET /api/papers/mine` every 4 seconds, shows real-time progress bar ("Reading pages 1-8...", "Generating solutions 12/35 questions")
|
||||
4. Multiple papers can process simultaneously (each in its own thread)
|
||||
5. Server stays responsive for all other requests during processing
|
||||
|
||||
### Q: How do you handle JSON parsing issues from LLM responses?
|
||||
|
||||
LLMs often return invalid JSON, especially with LaTeX content. We handle three categories:
|
||||
|
||||
1. **Markdown code fences**: Strip ` ```json ... ``` ` wrappers
|
||||
2. **Control characters**: Remove `\x00-\x1f` except `\t\n\r`
|
||||
3. **Invalid escape sequences**: LaTeX like `\sqrt`, `\sigma` produces invalid JSON escapes. We use a regex that only fixes **odd-count backslash sequences** before non-escape characters:
|
||||
```python
|
||||
re.sub(r'(?<!\\)((?:\\\\)*)\\([^"\\/bfnrtu])', r'\1\\\\\2', text)
|
||||
```
|
||||
This correctly handles `\\sqrt` (valid: literal backslash + sqrt) vs `\sqrt` (invalid: needs fixing) vs `\\\sqrt` (odd count: needs fixing).
|
||||
|
||||
### Q: What about data privacy and security?
|
||||
|
||||
- All data stored in **Supabase with Row Level Security (RLS)** — users can only access their own attempts, error books, and uploads
|
||||
- Photo uploads stored in Supabase Storage with per-user path isolation (`attempts/{user_id}/{question_id}/`)
|
||||
- API authentication via Supabase JWT tokens, validated on every request
|
||||
- No student data is sent to AI models beyond the current question context — no cross-user data leakage
|
||||
- Server deployed in Singapore (Tencent Cloud), compliant with HK data regulations
|
||||
|
||||
---
|
||||
|
||||
## Part 2: Blockchain & KnowIt Coin
|
||||
|
||||
### Q: Why blockchain? Isn't this just a points system?
|
||||
|
||||
No — a points system is centralized and opaque. Blockchain gives us three things a database can't:
|
||||
|
||||
1. **Verifiable attribution** — When a student uploads a high-quality note or past paper analysis, the contribution is recorded on-chain with a tamper-proof timestamp and creator identity. This isn't just a database entry we control — it's a credential the student owns.
|
||||
|
||||
2. **Transparent usage tracking** — When that content gets used by other students, referenced by AI, or bundled into a paid study pack, the usage chain is publicly verifiable. No black-box algorithms deciding who gets credit.
|
||||
|
||||
3. **Trustless revenue sharing** — Smart contracts can automatically distribute earnings to original creators based on on-chain usage records, without requiring users to trust our platform's accounting.
|
||||
|
||||
### Q: How would the on-chain content attribution actually work technically?
|
||||
|
||||
When a student uploads content (notes, flashcards, paper analysis), we:
|
||||
|
||||
1. Generate a **content hash** (SHA-256 of the file/text)
|
||||
2. Record a transaction on-chain containing: `creator_address`, `content_hash`, `timestamp`, `course_tags`, `content_type`
|
||||
3. Store the actual content off-chain (IPFS or our own storage) — only the hash goes on-chain for cost efficiency
|
||||
4. When content is referenced or reused, a new transaction records the **citation relationship**: `derived_from: [original_content_hash]`
|
||||
|
||||
This creates an immutable provenance graph — you can trace any piece of study material back to its original creator(s).
|
||||
|
||||
### Q: Which blockchain? Gas fees would kill micro-transactions for student content.
|
||||
|
||||
We'd use a **Layer 2 solution** or a low-cost chain like **Polygon**, **Base**, or **Arbitrum**. Transaction costs are fractions of a cent. For batch efficiency, we can aggregate multiple attribution records into a single on-chain transaction using **Merkle trees** — record 100 contributions in one tx.
|
||||
|
||||
Alternatively, we operate a **hybrid model**: maintain an off-chain ledger for real-time tracking, then periodically anchor batch proofs on-chain (e.g., daily settlement). This gives us the speed of centralized systems with the verifiability of blockchain.
|
||||
|
||||
### Q: What's the role of the stablecoin vs KnowIt Coin?
|
||||
|
||||
**KnowIt Coin** = internal contribution token
|
||||
- Earned by: uploading quality content, annotating questions, correcting errors, community moderation
|
||||
- Spent on: premium AI features, accessing curated study packs, unlocking advanced analytics
|
||||
- Non-speculative, platform-governed supply
|
||||
|
||||
**Stablecoin** (e.g., USDC) = settlement layer
|
||||
- Used when real money enters/exits: creator payouts, premium subscriptions, cross-university transactions
|
||||
- Enables frictionless micro-payments: a student in HKUST pays 0.50 HKD for a study guide made by a student at CityU — no bank transfer needed
|
||||
- Smart contract handles revenue split automatically (e.g., 70% to creator, 20% to platform, 10% to content curators)
|
||||
|
||||
### Q: How do you prevent low-quality content farming for coins?
|
||||
|
||||
Multi-layer quality control:
|
||||
1. **AI quality scoring** — content is automatically evaluated for completeness, accuracy, and originality before earning coins
|
||||
2. **Community validation** — peer ratings and usage metrics (how many students actually found it helpful)
|
||||
3. **Stake-weighted moderation** — users with higher reputation (earned through sustained quality contributions) have more influence in content curation
|
||||
4. **Diminishing returns** — bulk uploads of low-quality content yield progressively fewer coins
|
||||
|
||||
### Q: What's the realistic timeline for blockchain integration?
|
||||
|
||||
**Phase 1 (Current)**: Centralized platform with traditional database. Points/achievements tracked internally. This is where we are now.
|
||||
|
||||
**Phase 2 (6-12 months)**: Introduce KnowIt Coin as an off-chain token with on-chain anchoring. Content attribution hashes recorded on-chain. Creator dashboard showing contribution history.
|
||||
|
||||
**Phase 3 (12-24 months)**: Smart contract-based revenue sharing. Cross-university content marketplace. Stablecoin integration for payouts.
|
||||
|
||||
We're building the AI engine and user base first. Blockchain is the trust infrastructure layer that makes the community self-sustaining long-term.
|
||||
|
||||
### Q: Isn't this over-engineered? Why not just use a database?
|
||||
|
||||
For a single university, yes — a database is fine. But our vision is a **cross-university knowledge marketplace** across Hong Kong and the Greater Bay Area. When content flows between institutions, you need:
|
||||
- Attribution that no single institution controls
|
||||
- Revenue sharing that creators can independently verify
|
||||
- Content provenance that survives platform changes
|
||||
|
||||
That's where blockchain becomes necessary, not optional. AI is our learning engine. Blockchain is our trust and incentive layer. Together, they turn one-time knowledge sharing into a sustainable, accumulating knowledge economy.
|
||||
|
||||
---
|
||||
|
||||
## Part 3: Business & Scalability Questions
|
||||
|
||||
### Q: How does this scale beyond HKUST?
|
||||
|
||||
The system is **course-code agnostic** — it works on any university's past papers. To expand:
|
||||
1. Students at new universities upload their own papers (crowdsourced)
|
||||
2. We partner with student unions for bulk paper sourcing
|
||||
3. The AI pipeline is fully automated — no manual work per course
|
||||
|
||||
The topic normalization and analytics adapt automatically to each course's vocabulary.
|
||||
|
||||
### Q: What's your competitive advantage vs ChatGPT / NotebookLM?
|
||||
|
||||
| Feature | KnowIt | ChatGPT | NotebookLM |
|
||||
|---------|--------|---------|------------|
|
||||
| Localized past paper library | ✅ | ❌ | ❌ |
|
||||
| Exam-oriented workflow | ✅ | ❌ | ❌ |
|
||||
| Auto-grading with photo upload | ✅ | ❌ | ❌ |
|
||||
| Similar question retrieval | ✅ | ❌ | ❌ |
|
||||
| Variant question generation | ✅ | ❌ | ❌ |
|
||||
| Error book with spaced repetition | ✅ | ❌ | ❌ |
|
||||
| Course-specific analytics | ✅ | ❌ | ❌ |
|
||||
| Price | Affordable | $20/mo | Free but limited |
|
||||
|
||||
ChatGPT and NotebookLM are general-purpose tools. KnowIt is a **vertical solution** built specifically for exam preparation with features they can't replicate without building what we already have.
|
||||
|
||||
### Q: What if Google or OpenAI builds this?
|
||||
|
||||
They build horizontal platforms. We build **vertical depth** — localized paper libraries, university-specific communities, exam-pattern analytics across semesters. Our data moat grows with every paper uploaded and every student interaction. A general-purpose AI can't replicate 5 years of COMP2211 exam pattern analysis overnight.
|
||||
Reference in New Issue
Block a user