# Batch exam paper import guide

## Overview

`batch_import.py` bulk-loads exam papers into the PastPaper Master database. For each paper it automatically:

1. Creates the DB record
2. Uploads the PDF to Supabase Storage
3. Extracts the question structure with Gemini Vision
4. Generates the AI solution trio (knowledge reminder + hint + solution) with DeepSeek

## Environment setup

### 1. Server information

| Item | Value |
|------|-------|
| Production server | `129.226.210.66` |
| SSH | `ssh -i ~/.ssh/id_ed25519 root@129.226.210.66` |
| Backend container | `pastpaper-backend-1` |
| Project path | `/opt/pastpaper/` |
| Frontend static files | `/opt/1panel/www/pastpaper/` |

### 2. Run locally (recommended)

```bash
cd /path/to/PastPaper\ Master/backend

# Make sure .env is in the project root (../.env)
# Required keys: SUPABASE_URL, SUPABASE_SERVICE_ROLE_KEY, GOOGLE_GEMINI_API_KEY, DEEPSEEK_API_KEY

# Activate the virtual environment
source .venv/bin/activate

# Or call the venv's python directly
.venv/bin/python batch_import.py ...
```

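
Before a long batch run it can help to confirm that the four keys listed above are actually set. The following is a minimal sketch and not part of `batch_import.py`: the helper name `missing_keys` is hypothetical, and it only inspects a mapping such as `os.environ`, so `../.env` has to be loaded first (for example with `python-dotenv`, as the FAQ snippets in this guide do).

```python
import os
from typing import Mapping

# The four keys this guide lists as required in ../.env
REQUIRED_KEYS = (
    "SUPABASE_URL",
    "SUPABASE_SERVICE_ROLE_KEY",
    "GOOGLE_GEMINI_API_KEY",
    "DEEPSEEK_API_KEY",
)


def missing_keys(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required keys that are absent or blank in `env` (hypothetical helper)."""
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing keys:", ", ".join(missing))
    else:
        print("All required keys are set")
```

An empty list means the run can start; anything else names the keys to add to `../.env`.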
### 3. Run inside the Docker container on the server

```bash
# First copy the script and the paper files to the server
scp -i ~/.ssh/id_ed25519 batch_import.py root@129.226.210.66:/opt/pastpaper/backend/
scp -i ~/.ssh/id_ed25519 -r /path/to/papers root@129.226.210.66:/opt/pastpaper/papers_to_import/

# Enter the container and run
ssh -i ~/.ssh/id_ed25519 root@129.226.210.66
docker exec -it pastpaper-backend-1 bash
cd /app
# Replace /path/to/papers with the papers directory as seen from inside the container
python batch_import.py /path/to/papers --batch
```

## Usage

### Single-paper import

```bash
python batch_import.py paper.pdf \
  --course COMP2211 \
  --year 2024 \
  --term spring \
  --exam midterm

# With an answer PDF
python batch_import.py paper.pdf \
  --answer answer.pdf \
  --course COMP2211 \
  --year 2024 \
  --term spring \
  --exam midterm
```

### Batch import

#### Required directory structure

```
papers_to_import/
├── COMP2211/
│   ├── 2024_spring_midterm.pdf
│   ├── 2024_spring_midterm_answer.pdf   <- matched automatically
│   ├── 2024_fall_final.pdf
│   └── 2023_spring_midterm.pdf
├── COMP2011/
│   ├── 2024_spring_midterm.pdf
│   └── 2024_fall_final.pdf
├── MATH1014/
│   └── 2024_spring_midterm.pdf
└── FINA2303/
    └── 2023_fall_midterm.pdf
```

- Top-level directory name = course code (converted to uppercase automatically)
- File name format: `{year}_{term}_{examtype}.pdf`
- Answer file: `{year}_{term}_{examtype}_answer.pdf` (optional; place it in the same directory and it is matched automatically)
- term: `spring` / `fall` / `summer`
- examtype: `midterm` / `final` / `quiz`

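
The naming rules above can be checked before an import with a small parser. This is a sketch of the convention as described, not the script's actual parsing code: `PaperFile`, `parse_paper_path`, and the four-digit-year assumption are illustrative.

```python
import re
from pathlib import Path
from typing import NamedTuple, Optional

TERMS = ("spring", "fall", "summer")
EXAM_TYPES = ("midterm", "final", "quiz")

# {year}_{term}_{examtype}.pdf with an optional _answer suffix.
# A four-digit year is an assumption; the guide only shows examples like 2024.
FILENAME_RE = re.compile(
    r"^(?P<year>\d{4})_(?P<term>[a-z]+)_(?P<exam>[a-z]+)(?P<answer>_answer)?\.pdf$",
    re.IGNORECASE,
)


class PaperFile(NamedTuple):
    course: str
    year: int
    term: str
    exam_type: str
    is_answer: bool


def parse_paper_path(path: str) -> Optional[PaperFile]:
    """Parse 'COURSE/2024_spring_midterm.pdf'; return None if it breaks the convention."""
    p = Path(path)
    match = FILENAME_RE.match(p.name)
    if match is None:
        return None
    term = match["term"].lower()
    exam = match["exam"].lower()
    if term not in TERMS or exam not in EXAM_TYPES:
        return None
    return PaperFile(
        course=p.parent.name.upper(),  # directory name is the course code, uppercased
        year=int(match["year"]),
        term=term,
        exam_type=exam,
        is_answer=match["answer"] is not None,
    )
```

A file that returns `None` here would not fit the layout described above, so it is worth renaming before running `--batch --dry-run`.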
#### Commands

```bash
# Do a dry run first to see what would be imported
python batch_import.py papers_to_import/ --batch --dry-run

# Real import (serial, safest)
python batch_import.py papers_to_import/ --batch

# Concurrent import (2 papers at a time; faster, but the APIs may rate-limit)
python batch_import.py papers_to_import/ --batch --concurrency 2
```

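
How `--concurrency` is implemented is not shown in this guide. One common way to cap the number of papers in flight is an `asyncio.Semaphore`; the sketch below illustrates that idea with a stand-in `process_paper` coroutine (hypothetical) and should not be read as the script's actual code.

```python
import asyncio


async def process_paper(name: str) -> str:
    """Stand-in for the real per-paper pipeline (hypothetical)."""
    await asyncio.sleep(0.01)  # simulate API latency
    return f"done:{name}"


async def import_all(names: list[str], concurrency: int = 1) -> list[str]:
    """Process every paper with at most `concurrency` running at the same time."""
    gate = asyncio.Semaphore(concurrency)

    async def guarded(name: str) -> str:
        async with gate:  # wait here while `concurrency` papers are already running
            return await process_paper(name)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(n) for n in names))


results = asyncio.run(import_all(["a.pdf", "b.pdf", "c.pdf"], concurrency=2))
```

With `concurrency=1` this degenerates to the serial mode, which is why serial is the safest default when API quotas are tight.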
### Automatic deduplication

The script automatically skips papers that already exist, meaning a record with the same `course_code` + `year` + `term` + `exam_type` whose `status` is `ready` or `processing`.

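
The rule above amounts to a lookup on a four-part key plus a status filter. A minimal in-memory sketch follows; the names `paper_key` and `is_duplicate` and the dict-shaped records are assumptions for illustration, and the real script presumably queries the `papers` table instead.

```python
# Only these statuses block a new import, per the rule stated above
BLOCKING_STATUSES = {"ready", "processing"}


def paper_key(rec: dict) -> tuple:
    """Identity of a paper: course_code + year + term + exam_type."""
    return (rec["course_code"].upper(), int(rec["year"]), rec["term"], rec["exam_type"])


def is_duplicate(candidate: dict, existing: list[dict]) -> bool:
    """True if an existing record with the same key is ready or processing."""
    key = paper_key(candidate)
    return any(
        paper_key(rec) == key and rec["status"] in BLOCKING_STATUSES
        for rec in existing
    )
```

Under the rule as stated, a record with any other status (such as `error`) does not count as a duplicate.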
## Processing time estimates

Processing time per paper depends on its page count and number of questions:

| Stage | Duration |
|-------|----------|
| PDF rendering | 2-5 s |
| Vision extraction (batches of 8 pages) | 30-60 s per batch |
| Answer matching | 20-40 s |
| AI trio generation (batches of 3 questions) | 15-25 s per batch |
| **Total (30-question paper)** | **~3-5 min** |
| **Total (40+ question paper)** | **~5-8 min** |

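
The per-stage figures can be combined into a rough estimate for a specific paper. The sketch below assumes the stages run one after another and that answer matching only happens when an answer PDF is supplied; both are assumptions, and `estimate_seconds` is a hypothetical helper, not part of the script.

```python
import math


def estimate_seconds(pages: int, questions: int, has_answer: bool = False) -> tuple[int, int]:
    """Return a (low, high) estimate in seconds from the per-stage figures in the table."""
    vision_batches = math.ceil(pages / 8)    # Vision extraction: 8 pages per batch
    trio_batches = math.ceil(questions / 3)  # AI trio generation: 3 questions per batch
    low = 2 + 30 * vision_batches + 15 * trio_batches
    high = 5 + 60 * vision_batches + 25 * trio_batches
    if has_answer:  # assumption: answer matching runs only with an answer PDF
        low += 20
        high += 40
    return low, high


print(estimate_seconds(pages=16, questions=30))  # → (212, 375)
```

For a 30-question paper with an assumed 16 pages this gives roughly 3.5 to 6 minutes, in the same range as the table's total; the AI trio stage dominates because it scales with the question count.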
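Recommendation: keep concurrency at 2 or below; otherwise the Gemini API may rate-limit requests (HTTP 429). The script retries automatically, but the run gets slower.
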
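
The retry logic itself lives in the script and is not shown here. To illustrate why retries slow a run down, here is a generic exponential-backoff sketch; `RateLimitError`, `with_backoff`, and the delay values are all hypothetical and do not reflect the script's actual names or settings.

```python
import random
import time


class RateLimitError(Exception):
    """Stand-in for an HTTP 429 response (hypothetical)."""


def with_backoff(call, max_attempts: int = 5, base_delay: float = 1.0, sleep=time.sleep):
    """Run `call`, retrying on RateLimitError with exponentially growing, jittered waits."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Wait 1x, 2x, 4x, ... the base delay, plus up to 10% random jitter
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
```

Each additional 429 doubles the wait, so a run that hits the limit repeatedly can end up slower than a serial one.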
## API costs

| Model | Purpose | Cost |
|-------|---------|------|
| Gemini 2.5 Flash | Vision extraction + answer matching | Usually within the free tier |
| DeepSeek V3 | AI trio generation | ~$0.5-1.5 per paper |

Monitoring spend:

- Gemini: https://aistudio.google.com (usage is shown on the API keys page)
- DeepSeek: https://platform.deepseek.com (Usage page)

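
For budgeting a batch, the per-paper DeepSeek range can simply be scaled by the number of papers. A small sketch, with `deepseek_cost_range` as a hypothetical helper and Gemini left out on the assumption that usage stays within the free tier:

```python
def deepseek_cost_range(
    num_papers: int,
    low_per_paper: float = 0.5,
    high_per_paper: float = 1.5,
) -> tuple[float, float]:
    """Return the (low, high) DeepSeek spend in USD for a batch of papers."""
    return num_papers * low_per_paper, num_papers * high_per_paper


low, high = deepseek_cost_range(40)
print(f"40 papers: ${low:.0f}-${high:.0f}")  # → 40 papers: $20-$60
```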
## FAQ

### Q: What if processing fails?

The paper is marked `status=error`. You can delete it and start over:

```bash
# Run from the backend/ directory
.venv/bin/python -c "
import sys; sys.path.insert(0, '.')
from dotenv import load_dotenv; load_dotenv('../.env')
from app.services.supabase_client import get_supabase
sb = get_supabase()
errors = sb.table('papers').select('id, course_code').eq('status', 'error').execute().data
for p in errors:
    # Delete the child questions first, then the paper record itself
    sb.table('paper_questions').delete().eq('paper_id', p['id']).execute()
    sb.table('papers').delete().eq('id', p['id']).execute()
    print('Deleted', p['course_code'])
"
```

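### Q: JSON parse errors?

Multi-layer JSON repair and automatic retry (up to 6 attempts) are built in. If parsing still fails, the paper content is usually too complex (heavy LaTeX plus code). You can try:

1. Delete the error record and re-import
2. If it fails repeatedly, you may need to split the paper PDF
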
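
The script's actual repair layers are not shown in this guide. As an illustration of what layered repair plus retry can look like, here is a minimal sketch; `repair_json`, `parse_with_retry`, and the specific layers chosen are assumptions, and invalid backslash escapes from LaTeX are not handled.

```python
import json
import re


def repair_json(raw: str):
    """Try progressively more aggressive clean-ups until json.loads succeeds."""
    text = raw.strip()
    # Layer 1: drop a Markdown code fence wrapped around the payload
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text)
    # Layer 2: also consider only the outermost {...} or [...] span
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    ends = [i for i in (text.rfind("}"), text.rfind("]")) if i != -1]
    candidates = [text]
    if starts and ends and min(starts) < max(ends):
        candidates.append(text[min(starts): max(ends) + 1])
    for candidate in candidates:
        # Layer 3: retry with trailing commas before } or ] removed
        # (crude: this would also alter such a sequence inside a string value)
        for attempt in (candidate, re.sub(r",\s*([}\]])", r"\1", candidate)):
            try:
                return json.loads(attempt)
            except json.JSONDecodeError:
                continue
    raise ValueError("could not repair JSON")


def parse_with_retry(fetch, max_attempts: int = 6):
    """Call `fetch()` for a fresh model response until one can be parsed."""
    for _ in range(max_attempts):
        try:
            return repair_json(fetch())
        except ValueError:
            continue
    raise ValueError(f"no parseable response after {max_attempts} attempts")
```

When every attempt fails, the input itself is usually the problem, which is why splitting the PDF is the suggested fallback.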
### Q: How do I regenerate only the AI trio (questions already extracted)?

```bash
# Clear the solution fields; after a restart the backend resumes generation automatically
.venv/bin/python -c "
import sys; sys.path.insert(0, '.')
from dotenv import load_dotenv; load_dotenv('../.env')
from app.services.supabase_client import get_supabase
sb = get_supabase()
PAPER_ID = 'xxxxxxxx-xxxx-...'  # replace with the real paper ID
qs = sb.table('paper_questions').select('id').eq('paper_id', PAPER_ID).execute().data
for q in qs:
    sb.table('paper_questions').update({'solution': None, 'ai_hint': None, 'knowledge_reminder': None}).eq('id', q['id']).execute()
sb.table('papers').update({'status': 'processing'}).eq('id', PAPER_ID).execute()
print(f'Reset {len(qs)} questions, restart backend to regenerate')
"

# Then restart the backend
ssh -i ~/.ssh/id_ed25519 root@129.226.210.66 "sudo docker restart pastpaper-backend-1"
```

### Q: How do I deploy backend code changes?

```bash
# Upload the changed file(s)
scp -i ~/.ssh/id_ed25519 app/services/paper_processor.py root@129.226.210.66:/opt/pastpaper/backend/app/services/

# Rebuild the container
ssh -i ~/.ssh/id_ed25519 root@129.226.210.66 "cd /opt/pastpaper && sudo docker compose up -d --build backend"
```

### Q: How do I deploy frontend changes?

```bash
# Build the frontend and add the favicon to the build output
cd frontend
npm run build
cp public/favicon.jpg dist/
# Remove the old assets on the server, then upload the new build
ssh -i ~/.ssh/id_ed25519 root@129.226.210.66 "rm -rf /opt/1panel/www/pastpaper/assets"
scp -i ~/.ssh/id_ed25519 dist/index.html dist/favicon.jpg root@129.226.210.66:/opt/1panel/www/pastpaper/
scp -i ~/.ssh/id_ed25519 -r dist/assets root@129.226.210.66:/opt/1panel/www/pastpaper/
```

## Paper sources

The `pastpaper-scraper/papers/` directory holds past exam PDFs scraped from HKUST, organized into one directory per course. Pick popular courses from it to import.

Courses to import first (largest user base):

- COMP2011, COMP2211, COMP2711H
- MATH1013, MATH1014, MATH2023
- PHYS1112
- ELEC2100
- FINA2303

Arrange the files in the directory structure described above, then run with `--batch`.