16 KiB
KnowIt Technical Defense Q&A
Part 1: AI & Product Technical Questions
Q: How does your AI analyze past papers? What model do you use?
We use a multi-model pipeline with clear separation of concerns:
-
Vision extraction (Gemini 2.5 Flash): We render each PDF page to a 96 DPI PNG image using PyMuPDF, encode it as base64, and send it to Gemini's vision API via OpenAI-compatible endpoint. The model extracts every question into structured JSON — question number, type, text, options, score, topics, difficulty. We process in batches of 8 pages to stay within token limits.
-
Solution generation (DeepSeek V3): For each extracted question, we generate three components — a knowledge reminder, a progressive hint, and a complete step-by-step solution. This is done in batches of 3 questions per API call to balance throughput and quality.
-
Answer matching (Gemini Vision): If an answer PDF is provided, we send its pages to the vision model and match answers to corresponding questions by question number.
The key architectural decision is splitting vision and text tasks across different models — Gemini for anything that needs to "see" the document, DeepSeek for pure text reasoning. This cuts cost by ~60% compared to using a single vision model for everything.
Q: Why vision mode instead of traditional PDF text extraction?
We started with pdfplumber-based text extraction and hit critical failures:
- Multi-line code blocks break apart:
C = np.array([[[0,1,2,3],gets separated from its closing brackets across lines - Mathematical notation is lost or garbled
- Table structures collapse into unreadable strings
- Mixed formatting (code + text + formulas on the same page) confuses parsers
Vision mode sends the raw page image — the model sees exactly what a student sees. On our COMP2211 benchmark (Python-heavy, lots of NumPy arrays), vision mode correctly extracts 100% of questions vs ~70% with text extraction.
Q: How accurate is the AI on code questions?
For code output questions (e.g., "What does print(A[2:-2:3]) output?"), we don't rely on the LLM to calculate. We run Python exec() on the actual code:
ns = {"np": np}
exec(extract_code_lines(question_text), ns) # setup variables
output = exec("print(A[2:-2:3])", ns) # capture stdout
# output = "[ 7 10]" — ground truth, fed to AI as reference
We maintain a shared namespace per question group so variables defined in a parent question (e.g., A = np.arange(5,15)) are available to all sub-questions. This gives us 100% accuracy on Python output questions.
Q: How does auto-grading work technically?
Three-step pipeline, all non-blocking (run in thread pool via asyncio.to_thread):
-
Gemini Vision OCR: Student's photo → base64 → Gemini with OCR prompt → extracted text with LaTeX formulas preserved (
$\mu = 8.03$) -
DeepSeek Grading: We inject the question text, reference answer (from DB), and OCR'd student answer into a structured prompt. The model returns JSON:
{ "is_correct": false, "score_given": 2, "feedback": "<HTML with KaTeX>Step 1 correct, but in Step 3...", "error_at_step": 3 } -
Persistence: Result stored in
user_attemptstable withuser_id,question_id,feedback,photo_url. Wrong answers auto-added to error book. Frontend loads historical results on page load viaGET /api/attempts/by-paper/{paper_id}.
Grading runs in a thread pool (asyncio.to_thread) so it never blocks the main event loop — other users can browse papers, load questions, etc. while grading runs.
Q: How does the similar question retrieval work?
Multi-signal similarity scoring with caching:
-
Topic normalization: We maintain an alias map (e.g., "Numpy"/"NumPy" → "NumPy", "Naïve Bayes"/"Naive Bayes Classifier" → "Naive Bayes") — about 80 aliases covering common variations.
-
Candidate filtering: Pre-filter by
analytics_topicin PostgreSQL (cuts candidates from ~250 to ~30 for a given course). -
Scoring (up to 100 points):
- Topic overlap: up to 40 pts (exact analytics_topic match = 30, shared topic_tags = 10)
- Question type match: 15 pts
- Difficulty match: 10 pts
- PostgreSQL
ts_rank_cdfull-text similarity: up to 20 pts - Same parent question structure: 5 pts
-
Caching: First computation is stored in
similar_questionsJSONB column on the question row. Subsequent loads are instant. In-memory cache with 5-minute TTL for hot questions. -
Deduplication: Only the best-matching question per paper is shown (avoid showing 5 questions from the same exam).
Q: What's the processing pipeline architecture? How do you handle failures?
Checkpoint-based processing with auto-resume:
The pipeline has 5 stages, each checkpointed to the database:
| Stage | What happens | Checkpoint |
|---|---|---|
| 1. Render | PDF → PNG images (96 DPI) | In memory only |
| 2. Extract | Vision API → structured questions | Progress bar updated per batch |
| 3. Match answers | Answer PDF → question mapping | Optional, failure skipped |
| 4. Save questions | Write all questions to DB | Each question persisted immediately |
| 5. AI trio | Generate solutions per question | Each solution written individually |
If the server crashes at stage 5 (say, 15/35 solutions generated), on restart:
lifespanstartup hook detects papers withstatus=processing- Checks
paper_questionstable — finds 35 questions, 15 with solutions - Calls
_resume_ai_trio()which only processes the 20 missing ones - Marks paper as
readywhen done
The processing runs in a daemon thread with its own event loop (threading.Thread + asyncio.run), completely isolated from the FastAPI server.
Q: What's your RAG pipeline for the AI Tutor?
We use LangChain with a vector database (SurrealDB) to index three content types:
- Lecture recordings: Downloaded from Canvas → FFmpeg extracts audio → Whisper transcribes with timestamps → chunked into ~500 token segments with navigation markers
- Courseware PDFs/PPTs: Extracted and chunked with metadata (course code, topic, page)
- Past paper content: Question text + solutions indexed with topic tags
Retrieval flow: Student query → embedding → top-K vector search → retrieved chunks + question context → LLM generates grounded answer with source citations.
Q: How do you handle API costs?
Cost optimization at every layer:
| Strategy | Savings |
|---|---|
| Model splitting (Gemini vision + DeepSeek text) | ~60% vs single model |
| 96 DPI rendering (down from 120) | ~26% fewer tokens per page |
| 8-page batches for vision | Fewer API calls |
| Answer matching failure = skip, not retry forever | Prevents cost runaway |
similar_questions cached in DB column |
One-time compute per question |
| DeepSeek at $0.28/M input vs Gemini at $0.15/M + vision overhead | Text tasks 2-3x cheaper |
Per-paper cost: ~$1-2 USD for full processing (extraction + answer matching + 40 solutions). Per-grading: ~$0.02 (one vision OCR + one text grading call).
Q: What's your tech stack?
| Layer | Technology | Why |
|---|---|---|
| Frontend | React 18 + Vite + TypeScript | Fast SPA, hot reload |
| Backend | FastAPI (Python 3.12, async) | Native async, OpenAPI docs |
| Database | PostgreSQL via Supabase | Relational + Auth + Storage + RLS |
| Vector DB | SurrealDB | RAG retrieval for AI Tutor |
| Cache | Redis | Session cache, rate limiting |
| Vision AI | Gemini 2.5 Flash (Google official API) | Best vision quality, free tier |
| Text AI | DeepSeek V3 (deepseek-chat) | Cheapest frontier model, no rate limits |
| PDF Rendering | PyMuPDF (fitz) | Fast, accurate page-to-image |
| Code Execution | Python exec() with sandboxed namespace | Ground-truth for code output questions |
| Math Rendering | KaTeX (client-side) | Fast LaTeX rendering, no server round-trip |
| Transcription | Whisper + FFmpeg | Lecture recording → text |
| Deployment | Docker + OpenResty + Let's Encrypt | Single server, HTTPS, reverse proxy |
| Hosting | Tencent Cloud Singapore (2C4G) | Low latency to HK, Gemini API accessible |
Q: How do you handle concurrent uploads?
- Upload endpoint reads file bytes, creates DB record (
status: processing), returns paper ID immediately (~200ms response) - Processing spawns in a daemon thread with its own asyncio event loop — completely isolated from the FastAPI server
- Frontend polls
GET /api/papers/mineevery 4 seconds, shows real-time progress bar ("Reading pages 1-8...", "Generating solutions 12/35 questions") - Multiple papers can process simultaneously (each in its own thread)
- Server stays responsive for all other requests during processing
Q: How do you handle JSON parsing issues from LLM responses?
LLMs often return invalid JSON, especially with LaTeX content. We handle three categories:
- Markdown code fences: Strip
```json ... ```wrappers - Control characters: Remove
\x00-\x1fexcept\t\n\r - Invalid escape sequences: LaTeX like
\sqrt,\sigmaproduces invalid JSON escapes. We use a regex that only fixes odd-count backslash sequences before non-escape characters:This correctly handlesre.sub(r'(?<!\\)((?:\\\\)*)\\([^"\\/bfnrtu])', r'\1\\\\\2', text)\\sqrt(valid: literal backslash + sqrt) vs\sqrt(invalid: needs fixing) vs\\\sqrt(odd count: needs fixing).
Q: What about data privacy and security?
- All data stored in Supabase with Row Level Security (RLS) — users can only access their own attempts, error books, and uploads
- Photo uploads stored in Supabase Storage with per-user path isolation (
attempts/{user_id}/{question_id}/) - API authentication via Supabase JWT tokens, validated on every request
- No student data is sent to AI models beyond the current question context — no cross-user data leakage
- Server deployed in Singapore (Tencent Cloud), compliant with HK data regulations
Part 2: Blockchain & KnowIt Coin
Q: Why blockchain? Isn't this just a points system?
No — a points system is centralized and opaque. Blockchain gives us three things a database can't:
-
Verifiable attribution — When a student uploads a high-quality note or past paper analysis, the contribution is recorded on-chain with a tamper-proof timestamp and creator identity. This isn't just a database entry we control — it's a credential the student owns.
-
Transparent usage tracking — When that content gets used by other students, referenced by AI, or bundled into a paid study pack, the usage chain is publicly verifiable. No black-box algorithms deciding who gets credit.
-
Trustless revenue sharing — Smart contracts can automatically distribute earnings to original creators based on on-chain usage records, without requiring users to trust our platform's accounting.
Q: How would the on-chain content attribution actually work technically?
When a student uploads content (notes, flashcards, paper analysis), we:
- Generate a content hash (SHA-256 of the file/text)
- Record a transaction on-chain containing:
creator_address,content_hash,timestamp,course_tags,content_type - Store the actual content off-chain (IPFS or our own storage) — only the hash goes on-chain for cost efficiency
- When content is referenced or reused, a new transaction records the citation relationship:
derived_from: [original_content_hash]
This creates an immutable provenance graph — you can trace any piece of study material back to its original creator(s).
Q: Which blockchain? Gas fees would kill micro-transactions for student content.
We'd use a Layer 2 solution or a low-cost chain like Polygon, Base, or Arbitrum. Transaction costs are fractions of a cent. For batch efficiency, we can aggregate multiple attribution records into a single on-chain transaction using Merkle trees — record 100 contributions in one tx.
Alternatively, we operate a hybrid model: maintain an off-chain ledger for real-time tracking, then periodically anchor batch proofs on-chain (e.g., daily settlement). This gives us the speed of centralized systems with the verifiability of blockchain.
Q: What's the role of the stablecoin vs KnowIt Coin?
KnowIt Coin = internal contribution token
- Earned by: uploading quality content, annotating questions, correcting errors, community moderation
- Spent on: premium AI features, accessing curated study packs, unlocking advanced analytics
- Non-speculative, platform-governed supply
Stablecoin (e.g., USDC) = settlement layer
- Used when real money enters/exits: creator payouts, premium subscriptions, cross-university transactions
- Enables frictionless micro-payments: a student in HKUST pays 0.50 HKD for a study guide made by a student at CityU — no bank transfer needed
- Smart contract handles revenue split automatically (e.g., 70% to creator, 20% to platform, 10% to content curators)
Q: How do you prevent low-quality content farming for coins?
Multi-layer quality control:
- AI quality scoring — content is automatically evaluated for completeness, accuracy, and originality before earning coins
- Community validation — peer ratings and usage metrics (how many students actually found it helpful)
- Stake-weighted moderation — users with higher reputation (earned through sustained quality contributions) have more influence in content curation
- Diminishing returns — bulk uploads of low-quality content yield progressively fewer coins
Q: What's the realistic timeline for blockchain integration?
Phase 1 (Current): Centralized platform with traditional database. Points/achievements tracked internally. This is where we are now.
Phase 2 (6-12 months): Introduce KnowIt Coin as an off-chain token with on-chain anchoring. Content attribution hashes recorded on-chain. Creator dashboard showing contribution history.
Phase 3 (12-24 months): Smart contract-based revenue sharing. Cross-university content marketplace. Stablecoin integration for payouts.
We're building the AI engine and user base first. Blockchain is the trust infrastructure layer that makes the community self-sustaining long-term.
Q: Isn't this over-engineered? Why not just use a database?
For a single university, yes — a database is fine. But our vision is a cross-university knowledge marketplace across Hong Kong and the Greater Bay Area. When content flows between institutions, you need:
- Attribution that no single institution controls
- Revenue sharing that creators can independently verify
- Content provenance that survives platform changes
That's where blockchain becomes necessary, not optional. AI is our learning engine. Blockchain is our trust and incentive layer. Together, they turn one-time knowledge sharing into a sustainable, accumulating knowledge economy.
Part 3: Business & Scalability Questions
Q: How does this scale beyond HKUST?
The system is course-code agnostic — it works on any university's past papers. To expand:
- Students at new universities upload their own papers (crowdsourced)
- We partner with student unions for bulk paper sourcing
- The AI pipeline is fully automated — no manual work per course
The topic normalization and analytics adapt automatically to each course's vocabulary.
Q: What's your competitive advantage vs ChatGPT / NotebookLM?
| Feature | KnowIt | ChatGPT | NotebookLM |
|---|---|---|---|
| Localized past paper library | ✅ | ❌ | ❌ |
| Exam-oriented workflow | ✅ | ❌ | ❌ |
| Auto-grading with photo upload | ✅ | ❌ | ❌ |
| Similar question retrieval | ✅ | ❌ | ❌ |
| Variant question generation | ✅ | ❌ | ❌ |
| Error book with spaced repetition | ✅ | ❌ | ❌ |
| Course-specific analytics | ✅ | ❌ | ❌ |
| Price | Affordable | $20/mo | Free but limited |
ChatGPT and NotebookLM are general-purpose tools. KnowIt is a vertical solution built specifically for exam preparation with features they can't replicate without building what we already have.
Q: What if Google or OpenAI builds this?
They build horizontal platforms. We build vertical depth — localized paper libraries, university-specific communities, exam-pattern analytics across semesters. Our data moat grows with every paper uploaded and every student interaction. A general-purpose AI can't replicate 5 years of COMP2211 exam pattern analysis overnight.