PastpaperMaster/backend/BATCH_IMPORT_GUIDE.md
Zhao 9c09944c96 feat: expandable previews, KaTeX rendering, variant speedup, batch import
- Analytics/Similar: expandable question preview with KaTeX rendering
- KaTeXRenderer: auto markdown-to-HTML (code blocks, tables, bold), auto Unicode→LaTeX
- ErrorBook: full question text rendering instead of truncated preview
- Variant: remove hint/solution from generation (faster), async, fix null crash
- Grading: add max_tokens limit
- JSON parser: robust multi-layer repair + JSONDecodeError retry
- Extraction prompt: enforce LaTeX notation for math
- Upload: redirect to home instead of blank paper page
- ProcessingBanner: add ETA time estimate + percentage
- Batch import script + handoff guide for team

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-24 22:41:57 +09:00

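# Batch Import Guide for Exam Papers
## Overview
`batch_import.py` bulk-loads exam papers into the PastPaper Master database. For each paper it automatically:
1. Creates the DB record
2. Uploads the PDF to Supabase Storage
3. Extracts the question structure with Gemini Vision
4. Generates the AI solution trio (knowledge reminder + hint + solution) with DeepSeek
## Environment Setup
### 1. Server details
| Item | Value |
|------|-----|
| Production server | `129.226.210.66` |
| SSH | `ssh -i ~/.ssh/id_ed25519 root@129.226.210.66` |
| Backend container | `pastpaper-backend-1` |
| Project path | `/opt/pastpaper/` |
| Frontend static files | `/opt/1panel/www/pastpaper/` |
### 2. Run locally (recommended)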
```bash
cd /path/to/PastPaper\ Master/backend
# Make sure .env is in the project root (../.env)
# Required keys: SUPABASE_URL, SUPABASE_SERVICE_ROLE_KEY, GOOGLE_GEMINI_API_KEY, DEEPSEEK_API_KEY
# Activate the virtual environment
source .venv/bin/activate
# Or call the venv's python directly
.venv/bin/python batch_import.py ...
```
### 3. Run inside the Docker container on the server
```bash
# First copy the script and the paper files to the server
scp -i ~/.ssh/id_ed25519 batch_import.py root@129.226.210.66:/opt/pastpaper/backend/
scp -i ~/.ssh/id_ed25519 -r /path/to/papers root@129.226.210.66:/opt/pastpaper/papers_to_import/
# Then enter the container and run it
ssh -i ~/.ssh/id_ed25519 root@129.226.210.66
docker exec -it pastpaper-backend-1 bash
cd /app
python batch_import.py /path/to/papers --batch
```
## Usage
### Importing a single paper
```bash
python batch_import.py paper.pdf \
  --course COMP2211 \
  --year 2024 \
  --term spring \
  --exam midterm
# With an answer file
python batch_import.py paper.pdf \
  --answer answer.pdf \
  --course COMP2211 \
  --year 2024 \
  --term spring \
  --exam midterm
```
### Batch import
#### Required directory layout
```
papers_to_import/
├── COMP2211/
│   ├── 2024_spring_midterm.pdf
│   ├── 2024_spring_midterm_answer.pdf   <- matched automatically
│   ├── 2024_fall_final.pdf
│   └── 2023_spring_midterm.pdf
├── COMP2011/
│   ├── 2024_spring_midterm.pdf
│   └── 2024_fall_final.pdf
├── MATH1014/
│   └── 2024_spring_midterm.pdf
└── FINA2303/
    └── 2023_fall_midterm.pdf
```
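- The top-level directory name is the course code (upper-cased automatically)
- File name format: `{year}_{term}_{examtype}.pdf`
- Answer file: `{year}_{term}_{examtype}_answer.pdf` (optional; place it in the same directory and it is matched automatically)
- term: `spring` / `fall` / `summer`
- examtype: `midterm` / `final` / `quiz`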
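The naming rules above can be illustrated with a small parser. This is a minimal sketch, not the code in `batch_import.py`: the names `PaperFile` and `parse_paper_path` are hypothetical, and it assumes lowercase file names exactly as listed.

```python
import re
from pathlib import Path
from typing import NamedTuple, Optional

TERMS = {"spring", "fall", "summer"}
EXAM_TYPES = {"midterm", "final", "quiz"}

# {year}_{term}_{examtype}.pdf, optionally with an _answer suffix
_NAME_RE = re.compile(r"^(\d{4})_([a-z]+)_([a-z]+)(_answer)?\.pdf$")


class PaperFile(NamedTuple):
    course_code: str
    year: int
    term: str
    exam_type: str
    is_answer: bool


def parse_paper_path(path: str) -> Optional[PaperFile]:
    """Return the metadata encoded in a path, or None if it breaks the rules."""
    p = Path(path)
    m = _NAME_RE.match(p.name)
    if m is None:
        return None
    year, term, exam_type, answer_suffix = m.groups()
    if term not in TERMS or exam_type not in EXAM_TYPES:
        return None
    # The parent directory name is the course code, upper-cased.
    return PaperFile(p.parent.name.upper(), int(year), term, exam_type,
                     answer_suffix is not None)
```

Under these assumptions, an answer file and its paper yield the same `(course_code, year, term, exam_type)` tuple and differ only in `is_answer`, which is what makes automatic matching possible.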
#### Commands
```bash
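# Do a dry run first to see what would be imported
python batch_import.py papers_to_import/ --batch --dry-run
# Real import (serial, the safest option)
python batch_import.py papers_to_import/ --batch
# Concurrent import (2 papers at a time; faster, but the APIs may rate-limit)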
python batch_import.py papers_to_import/ --batch --concurrency 2
```
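To make the effect of `--concurrency 2` concrete, the following sketch shows one common way to cap the number of papers processed at once, using `asyncio.Semaphore`. It is an illustration under assumptions: `run_with_limit` and `fake_import` are hypothetical names, and the real script may implement its concurrency differently.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List


async def run_with_limit(
    items: Iterable[str],
    worker: Callable[[str], Awaitable[str]],
    concurrency: int = 1,
) -> List[str]:
    """Run worker(item) for every item, at most `concurrency` at a time."""
    sem = asyncio.Semaphore(concurrency)

    async def guarded(item: str) -> str:
        # Wait here until one of the `concurrency` slots is free.
        async with sem:
            return await worker(item)

    # gather() returns results in the same order as the input items.
    return await asyncio.gather(*(guarded(i) for i in items))


async def fake_import(path: str) -> str:
    # Hypothetical stand-in for importing one paper.
    await asyncio.sleep(0.01)
    return f"imported {path}"


if __name__ == "__main__":
    papers = ["a.pdf", "b.pdf", "c.pdf"]
    print(asyncio.run(run_with_limit(papers, fake_import, concurrency=2)))
```

With `concurrency=1` this degenerates to serial processing, which corresponds to the default behaviour described above.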
### Automatic duplicate detection
The script automatically skips papers that already exist (same course_code + year + term + exam_type, with status `ready` or `processing`).
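The duplicate rule can be written as a small predicate. This sketch works on rows that have already been fetched from the `papers` table; the function name `is_duplicate` is hypothetical, and the way the real script queries the database is not shown here.

```python
from typing import Iterable, Mapping

# Only papers in these states block a new import; failed ones do not.
ACTIVE_STATUSES = {"ready", "processing"}


def is_duplicate(
    existing: Iterable[Mapping[str, object]],
    course_code: str,
    year: int,
    term: str,
    exam_type: str,
) -> bool:
    """True if an active paper with the same four-part key already exists."""
    key = (course_code, year, term, exam_type)
    return any(
        (row["course_code"], row["year"], row["term"], row["exam_type"]) == key
        and row["status"] in ACTIVE_STATUSES
        for row in existing
    )
```

Note that a paper with `status=error` does not count as a duplicate under this rule, so a failed import does not by itself prevent a later attempt.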
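## Estimated processing time
The processing time for a single paper depends on its page count and number of questions:
| Stage | Time |
|------|------|
| PDF rendering | 2-5s |
| Vision extraction (batches of 8 pages) | 30-60s/batch |
| Answer matching | 20-40s |
| AI trio generation (batches of 3 questions) | 15-25s/batch |
| **Total (30-question paper)** | **~3-5 min** |
| **Total (40+-question paper)** | **~5-8 min** |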
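
The per-stage figures in the table can be combined into a rough estimate for a given paper. The sketch below simply adds the stages up, assuming they run strictly one after another; `estimate_seconds` is a hypothetical helper, and the real pipeline may overlap stages and finish sooner.

```python
import math
from typing import Tuple

# (low, high) seconds per stage, copied from the table above
RENDER_S = (2, 5)
VISION_BATCH_S = (30, 60)
ANSWER_MATCH_S = (20, 40)
TRIO_BATCH_S = (15, 25)

PAGES_PER_VISION_BATCH = 8
QUESTIONS_PER_TRIO_BATCH = 3


def estimate_seconds(pages: int, questions: int,
                     has_answer: bool = False) -> Tuple[int, int]:
    """Return a (low, high) estimate in seconds for one paper."""
    vision_batches = math.ceil(pages / PAGES_PER_VISION_BATCH)
    trio_batches = math.ceil(questions / QUESTIONS_PER_TRIO_BATCH)
    bounds = []
    for i in (0, 1):  # 0 = optimistic bound, 1 = pessimistic bound
        total = (RENDER_S[i]
                 + vision_batches * VISION_BATCH_S[i]
                 + trio_batches * TRIO_BATCH_S[i])
        if has_answer:
            total += ANSWER_MATCH_S[i]
        bounds.append(total)
    return bounds[0], bounds[1]
```

For example, a 30-question paper assumed to span 12 pages, with no answer file, needs 2 vision batches and 10 trio batches, so most of the time goes to AI trio generation.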
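Recommendation: keep concurrency at 2 or below; otherwise the Gemini API may rate-limit requests (HTTP 429 errors; the script retries automatically, but the run gets slower).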
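Automatic retry after a 429 is typically done with exponential backoff, which is also why rate-limited runs slow down. The following is a generic sketch of that technique, not the script's actual retry code: `RateLimited` and `call_with_backoff` are hypothetical names, and the delay values are illustrative.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429 response from an API."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on RateLimited with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            # Delay doubles each attempt; jitter avoids synchronized retries.
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
    raise ValueError("max_attempts must be at least 1")
```

Each additional concurrent worker increases the chance of hitting the limit, and every hit adds at least one backoff delay, so higher concurrency can end up slower overall.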
## API cost
| Model | Purpose | Cost |
|------|------|------|
| Gemini 2.5 Flash | Vision extraction + answer matching | Usually covered by the free tier |
| DeepSeek V3 | AI trio generation | ~$0.5-1.5 per paper |
Monitoring cost:
- Gemini: https://aistudio.google.com (see usage on the API keys page)
- DeepSeek: https://platform.deepseek.com (Usage page)
## FAQ
### Q: What if processing fails?
The paper is marked `status=error`. You can delete it and start over:
```bash
# Run from the backend/ directory
.venv/bin/python -c "
import sys; sys.path.insert(0, '.')
from dotenv import load_dotenv; load_dotenv('../.env')
from app.services.supabase_client import get_supabase
sb = get_supabase()
# Find every paper whose processing failed
errors = sb.table('papers').select('id, course_code').eq('status', 'error').execute().data
for p in errors:
    # Delete the extracted questions first, then the paper record itself
    sb.table('paper_questions').delete().eq('paper_id', p['id']).execute()
    sb.table('papers').delete().eq('id', p['id']).execute()
    print('Deleted', p['course_code'])
"
```
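### Q: JSON parsing errors?
Multi-layer JSON repair and automatic retry (up to 6 attempts) are built in. If parsing still fails, the paper content is usually too complex (heavy LaTeX plus code). You can try:
1. Delete the error record and import again
2. If it keeps failing, you may need to split the paper PDF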
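To give a feel for what layered JSON repair can look like, here is a minimal sketch that applies progressively more aggressive fixes and retries `json.loads` after each one. It is not the project's parser: every function name is hypothetical and the chosen repair layers are assumptions about typical model output problems.

```python
import json
import re
from typing import Any

FENCE = "`" * 3  # Markdown code fence, built indirectly to keep this example readable


def strip_fences(text: str) -> str:
    """Drop a surrounding Markdown code fence, if present."""
    text = text.strip()
    if text.startswith(FENCE):
        text = text.split("\n", 1)[1] if "\n" in text else ""
    if text.rstrip().endswith(FENCE):
        text = text.rstrip()[: -len(FENCE)]
    return text.strip()


def extract_braces(text: str) -> str:
    """Keep only the span from the first '{' to the last '}'."""
    start, end = text.find("{"), text.rfind("}")
    return text[start:end + 1] if 0 <= start < end else text


def drop_trailing_commas(text: str) -> str:
    return re.sub(r",\s*([}\]])", r"\1", text)


def escape_stray_backslashes(text: str) -> str:
    """Double backslashes that do not start a valid JSON escape (e.g. \\sum).

    Limitation: LaTeX such as \\frac or \\theta begins with a valid escape
    (\\f, \\t) and is therefore left as-is and decoded incorrectly.
    """
    valid = '"\\/bfnrtu'

    def fix(m: re.Match) -> str:
        return m.group(0) if m.group(1) in valid else "\\" + m.group(0)

    return re.sub(r"\\(.)", fix, text)


def parse_json_leniently(raw: str) -> Any:
    """Try json.loads after each repair layer; re-raise the last error."""
    candidate = raw
    last_error = None
    for repair in (lambda s: s, strip_fences, extract_braces,
                   drop_trailing_commas, escape_stray_backslashes):
        candidate = repair(candidate)
        try:
            return json.loads(candidate)
        except json.JSONDecodeError as exc:
            last_error = exc
    raise last_error
```

The limitation noted in `escape_stray_backslashes` hints at why LaTeX-heavy papers are the hardest case: some LaTeX commands are indistinguishable from legitimate JSON escapes, so local repair cannot always recover them and a fresh request to the model is needed.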
### Q: How do I regenerate only the AI trio (questions already extracted)?
```bash
# Clear the AI trio fields; after a restart the backend resumes generation automatically
.venv/bin/python -c "
import sys; sys.path.insert(0, '.')
from dotenv import load_dotenv; load_dotenv('../.env')
from app.services.supabase_client import get_supabase
sb = get_supabase()
PAPER_ID = 'xxxxxxxx-xxxx-...' # replace with the real paper id
qs = sb.table('paper_questions').select('id').eq('paper_id', PAPER_ID).execute().data
for q in qs:
    sb.table('paper_questions').update({'solution': None, 'ai_hint': None, 'knowledge_reminder': None}).eq('id', q['id']).execute()
# Mark the paper as processing so the backend picks it up again
sb.table('papers').update({'status': 'processing'}).eq('id', PAPER_ID).execute()
print(f'Reset {len(qs)} questions, restart backend to regenerate')
"
# Then restart the backend
ssh -i ~/.ssh/id_ed25519 root@129.226.210.66 "sudo docker restart pastpaper-backend-1"
```
### Q: How do I deploy backend code changes?
```bash
# Upload the changed file(s)
scp -i ~/.ssh/id_ed25519 app/services/paper_processor.py root@129.226.210.66:/opt/pastpaper/backend/app/services/
# Rebuild the container
ssh -i ~/.ssh/id_ed25519 root@129.226.210.66 "cd /opt/pastpaper && sudo docker compose up -d --build backend"
```
### Q: How do I deploy frontend changes?
```bash
cd frontend
npm run build
cp public/favicon.jpg dist/
ssh -i ~/.ssh/id_ed25519 root@129.226.210.66 "rm -rf /opt/1panel/www/pastpaper/assets"
scp -i ~/.ssh/id_ed25519 dist/index.html dist/favicon.jpg root@129.226.210.66:/opt/1panel/www/pastpaper/
scp -i ~/.ssh/id_ed25519 -r dist/assets root@129.226.210.66:/opt/1panel/www/pastpaper/
```
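## Paper sources
The `pastpaper-scraper/papers/` directory holds past exam PDFs scraped from HKUST, organized into one directory per course. Pick popular courses from it to import.
Courses to import first (largest user base):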
- COMP2011, COMP2211, COMP2711H
- MATH1013, MATH1014, MATH2023
- PHYS1112
- ELEC2100
- FINA2303
Organize the files into the directory layout described above, then run the script with `--batch`.