# KnowIt Technical Defense Q&A ## Part 1: AI & Product Technical Questions ### Q: How does your AI analyze past papers? What model do you use? We use a multi-model pipeline with clear separation of concerns: 1. **Vision extraction** (Gemini 2.5 Flash): We render each PDF page to a 96 DPI PNG image using PyMuPDF, encode it as base64, and send it to Gemini's vision API via OpenAI-compatible endpoint. The model extracts every question into structured JSON — question number, type, text, options, score, topics, difficulty. We process in batches of 8 pages to stay within token limits. 2. **Solution generation** (DeepSeek V3): For each extracted question, we generate three components — a knowledge reminder, a progressive hint, and a complete step-by-step solution. This is done in batches of 3 questions per API call to balance throughput and quality. 3. **Answer matching** (Gemini Vision): If an answer PDF is provided, we send its pages to the vision model and match answers to corresponding questions by question number. The key architectural decision is **splitting vision and text tasks across different models** — Gemini for anything that needs to "see" the document, DeepSeek for pure text reasoning. This cuts cost by ~60% compared to using a single vision model for everything. ### Q: Why vision mode instead of traditional PDF text extraction? We started with pdfplumber-based text extraction and hit critical failures: - **Multi-line code blocks** break apart: `C = np.array([[[0,1,2,3],` gets separated from its closing brackets across lines - **Mathematical notation** is lost or garbled - **Table structures** collapse into unreadable strings - **Mixed formatting** (code + text + formulas on the same page) confuses parsers Vision mode sends the raw page image — the model sees exactly what a student sees. On our COMP2211 benchmark (Python-heavy, lots of NumPy arrays), vision mode correctly extracts 100% of questions vs ~70% with text extraction. ### Q: How accurate is the AI on code questions? For code output questions (e.g., "What does `print(A[2:-2:3])` output?"), we don't rely on the LLM to calculate. We run **Python exec()** on the actual code: ```python ns = {"np": np} exec(extract_code_lines(question_text), ns) # setup variables output = exec("print(A[2:-2:3])", ns) # capture stdout # output = "[ 7 10]" — ground truth, fed to AI as reference ``` We maintain a **shared namespace per question group** so variables defined in a parent question (e.g., `A = np.arange(5,15)`) are available to all sub-questions. This gives us 100% accuracy on Python output questions. ### Q: How does auto-grading work technically? Three-step pipeline, all non-blocking (run in thread pool via `asyncio.to_thread`): 1. **Gemini Vision OCR**: Student's photo → base64 → Gemini with OCR prompt → extracted text with LaTeX formulas preserved (`$\mu = 8.03$`) 2. **DeepSeek Grading**: We inject the question text, reference answer (from DB), and OCR'd student answer into a structured prompt. The model returns JSON: ```json { "is_correct": false, "score_given": 2, "feedback": "Step 1 correct, but in Step 3...", "error_at_step": 3 } ``` 3. **Persistence**: Result stored in `user_attempts` table with `user_id`, `question_id`, `feedback`, `photo_url`. Wrong answers auto-added to error book. Frontend loads historical results on page load via `GET /api/attempts/by-paper/{paper_id}`. Grading runs in a thread pool (`asyncio.to_thread`) so it never blocks the main event loop — other users can browse papers, load questions, etc. while grading runs. ### Q: How does the similar question retrieval work? Multi-signal similarity scoring with caching: 1. **Topic normalization**: We maintain an alias map (e.g., "Numpy"/"NumPy" → "NumPy", "Naïve Bayes"/"Naive Bayes Classifier" → "Naive Bayes") — about 80 aliases covering common variations. 2. **Candidate filtering**: Pre-filter by `analytics_topic` in PostgreSQL (cuts candidates from ~250 to ~30 for a given course). 3. **Scoring** (up to 100 points): - Topic overlap: up to 40 pts (exact analytics_topic match = 30, shared topic_tags = 10) - Question type match: 15 pts - Difficulty match: 10 pts - PostgreSQL `ts_rank_cd` full-text similarity: up to 20 pts - Same parent question structure: 5 pts 4. **Caching**: First computation is stored in `similar_questions` JSONB column on the question row. Subsequent loads are instant. In-memory cache with 5-minute TTL for hot questions. 5. **Deduplication**: Only the best-matching question per paper is shown (avoid showing 5 questions from the same exam). ### Q: What's the processing pipeline architecture? How do you handle failures? **Checkpoint-based processing with auto-resume:** The pipeline has 5 stages, each checkpointed to the database: | Stage | What happens | Checkpoint | |-------|-------------|-----------| | 1. Render | PDF → PNG images (96 DPI) | In memory only | | 2. Extract | Vision API → structured questions | Progress bar updated per batch | | 3. Match answers | Answer PDF → question mapping | Optional, failure skipped | | 4. Save questions | Write all questions to DB | **Each question persisted immediately** | | 5. AI trio | Generate solutions per question | **Each solution written individually** | If the server crashes at stage 5 (say, 15/35 solutions generated), on restart: - `lifespan` startup hook detects papers with `status=processing` - Checks `paper_questions` table — finds 35 questions, 15 with solutions - Calls `_resume_ai_trio()` which only processes the 20 missing ones - Marks paper as `ready` when done The processing runs in a **daemon thread** with its own event loop (`threading.Thread` + `asyncio.run`), completely isolated from the FastAPI server. ### Q: What's your RAG pipeline for the AI Tutor? We use **LangChain** with a vector database (**SurrealDB**) to index three content types: 1. **Lecture recordings**: Downloaded from Canvas → FFmpeg extracts audio → Whisper transcribes with timestamps → chunked into ~500 token segments with navigation markers 2. **Courseware PDFs/PPTs**: Extracted and chunked with metadata (course code, topic, page) 3. **Past paper content**: Question text + solutions indexed with topic tags Retrieval flow: Student query → embedding → top-K vector search → retrieved chunks + question context → LLM generates grounded answer with source citations. ### Q: How do you handle API costs? **Cost optimization at every layer:** | Strategy | Savings | |----------|---------| | Model splitting (Gemini vision + DeepSeek text) | ~60% vs single model | | 96 DPI rendering (down from 120) | ~26% fewer tokens per page | | 8-page batches for vision | Fewer API calls | | Answer matching failure = skip, not retry forever | Prevents cost runaway | | `similar_questions` cached in DB column | One-time compute per question | | DeepSeek at $0.28/M input vs Gemini at $0.15/M + vision overhead | Text tasks 2-3x cheaper | Per-paper cost: **~$1-2 USD** for full processing (extraction + answer matching + 40 solutions). Per-grading: **~$0.02** (one vision OCR + one text grading call). ### Q: What's your tech stack? | Layer | Technology | Why | |-------|-----------|-----| | Frontend | React 18 + Vite + TypeScript | Fast SPA, hot reload | | Backend | FastAPI (Python 3.12, async) | Native async, OpenAPI docs | | Database | PostgreSQL via Supabase | Relational + Auth + Storage + RLS | | Vector DB | SurrealDB | RAG retrieval for AI Tutor | | Cache | Redis | Session cache, rate limiting | | Vision AI | Gemini 2.5 Flash (Google official API) | Best vision quality, free tier | | Text AI | DeepSeek V3 (deepseek-chat) | Cheapest frontier model, no rate limits | | PDF Rendering | PyMuPDF (fitz) | Fast, accurate page-to-image | | Code Execution | Python exec() with sandboxed namespace | Ground-truth for code output questions | | Math Rendering | KaTeX (client-side) | Fast LaTeX rendering, no server round-trip | | Transcription | Whisper + FFmpeg | Lecture recording → text | | Deployment | Docker + OpenResty + Let's Encrypt | Single server, HTTPS, reverse proxy | | Hosting | Tencent Cloud Singapore (2C4G) | Low latency to HK, Gemini API accessible | ### Q: How do you handle concurrent uploads? 1. Upload endpoint reads file bytes, creates DB record (`status: processing`), returns paper ID immediately (~200ms response) 2. Processing spawns in a **daemon thread** with its own asyncio event loop — completely isolated from the FastAPI server 3. Frontend polls `GET /api/papers/mine` every 4 seconds, shows real-time progress bar ("Reading pages 1-8...", "Generating solutions 12/35 questions") 4. Multiple papers can process simultaneously (each in its own thread) 5. Server stays responsive for all other requests during processing ### Q: How do you handle JSON parsing issues from LLM responses? LLMs often return invalid JSON, especially with LaTeX content. We handle three categories: 1. **Markdown code fences**: Strip ` ```json ... ``` ` wrappers 2. **Control characters**: Remove `\x00-\x1f` except `\t\n\r` 3. **Invalid escape sequences**: LaTeX like `\sqrt`, `\sigma` produces invalid JSON escapes. We use a regex that only fixes **odd-count backslash sequences** before non-escape characters: ```python re.sub(r'(?