Inspiration
Students often study with slides, notes, and past exams — but the bottleneck is always the same: creating good practice questions that truly match their course, their instructor, and their level. Generic AI chatbots fail because they hallucinate, gloss over course-specific notation, or make questions that don’t resemble real exams.
We wanted a system that could:
- Ingest a course exactly as taught.
- Generate questions grounded strictly in the instructor’s materials.
- Run locally on Windows-on-Snapdragon hardware with on-device acceleration.
- Provide remediation loops that adapt to each student’s weaknesses.
That became Test-me-please — a full-stack, locally accelerated exam generation & grading platform.
What it does
Test-me-please is an AI-powered practice exam system that turns your uploaded course materials (PDF slides, past exams, lecture notes) into:
- Automatically generated practice exams (MCQ and FRQ).
- Course-grounded, non-hallucinated questions thanks to local RAG.
- Auto-grading for both MCQ and short-answer.
- Weak-topic extraction and personalized follow-up question sets.
- A clean, simple Streamlit UI for both instructors and students.
It supports three modes depending on hardware:
- Stub mode: instant iteration for development.
- Local ONNX runtime: semantic retrieval using Nomic Embed Text.
- QNN accelerated mode: fully on-device vector search using the Snapdragon NPU.
AnythingLLM integration is available for remote LLM generation or remote retrieval when needed.
How we built it
1. Ingestion Pipeline
- PDFs → text via PyMuPDF, stored with metadata (`week`, `topic`, `source`).
- Chunking ensures each piece is semantically meaningful and retrievable (sketched below).
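A minimal sketch of that step, assuming PyMuPDF and a simple character-budget chunker; the function name `ingest_pdf` and its parameters are illustrative, not the project's exact API:

```python
# Illustrative ingestion sketch: extract page text with PyMuPDF and yield
# chunks carrying the metadata used for retrieval (week, topic, source).
import fitz  # PyMuPDF

def ingest_pdf(path: str, week: int, topic: str, chunk_size: int = 800):
    """Yield text chunks with retrieval metadata attached."""
    doc = fitz.open(path)
    buffer = ""
    for page in doc:
        buffer += page.get_text("text") + "\n"
        # Flush roughly chunk_size characters at a line boundary.
        while len(buffer) >= chunk_size:
            cut = buffer.rfind("\n", 0, chunk_size)
            cut = cut if cut > 0 else chunk_size
            yield {"text": buffer[:cut].strip(), "week": week,
                   "topic": topic, "source": path}
            buffer = buffer[cut:]
    if buffer.strip():
        yield {"text": buffer.strip(), "week": week, "topic": topic, "source": path}
```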
2. Embedding + Vector Store
- Embeddings computed with Nomic Embed Text ONNX.
- Optional QNN acceleration via Qualcomm AI Engine Direct.
- Stored in a local Weaviate instance (dockerized).
- Hybrid retrieval mode that falls back to local retrieval when remote retrieval fails.
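Roughly how the embed-and-store path fits together, as a sketch only: it assumes the nomic-embed-text ONNX export, the `tokenizers` library, and the v4 `weaviate-client`. The model paths, ONNX input names, and the `CourseChunk` collection are illustrative, and the exact input signature depends on the export.

```python
# Embedding + vector store sketch (illustrative paths and names).
import numpy as np
import onnxruntime as ort
import weaviate
from tokenizers import Tokenizer

session = ort.InferenceSession("models/nomic-embed-text.onnx",
                               providers=["CPUExecutionProvider"])
tokenizer = Tokenizer.from_file("models/tokenizer.json")

def embed(text: str) -> list[float]:
    # nomic-embed-text expects a task prefix; input names vary per export.
    enc = tokenizer.encode("search_document: " + text)
    ids = np.array([enc.ids], dtype=np.int64)
    mask = np.array([enc.attention_mask], dtype=np.int64)
    hidden = session.run(None, {"input_ids": ids, "attention_mask": mask})[0]
    pooled = hidden.mean(axis=1)[0]                  # mean pooling over tokens
    return (pooled / np.linalg.norm(pooled)).tolist()

client = weaviate.connect_to_local()                 # dockerized single-node instance
collection = client.collections.get("CourseChunk")
for c in ingest_pdf("week3_slides.pdf", week=3, topic="dynamic programming"):
    collection.data.insert(properties=c, vector=embed(c["text"]))
```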
3. Exam Generation
The backend builds a structured prompt from:
- the top-k retrieved chunks
- instructor parameters (count, difficulty, type)
ExamAgent generates the exam using:
- local stub → fast iteration
- AnythingLLM → remote generation
- Future path: Phi-3.5 loaded locally
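A sketch of the kind of structured prompt the backend assembles; the function name and JSON schema here are illustrative, not the exact `ExamAgent` prompt:

```python
# Illustrative prompt builder: retrieved chunks + instructor parameters.
def build_exam_prompt(chunks: list[dict], count: int, difficulty: str, qtype: str) -> str:
    context = "\n\n".join(
        f"[{c['source']} | week {c['week']} | {c['topic']}]\n{c['text']}"
        for c in chunks
    )
    return (
        "You are generating a practice exam. Use ONLY the course material below; "
        "do not introduce outside facts.\n\n"
        f"COURSE MATERIAL:\n{context}\n\n"
        f"TASK: Write {count} {qtype} questions at {difficulty} difficulty. "
        "Return JSON: [{\"question\": ..., \"choices\": ..., \"answer\": ..., \"topic\": ...}]"
    )
```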
4. Grading + Personalization
- MCQs graded directly.
- FRQs graded via semantic similarity + rubric.
- Weak topics extracted from incorrect answers.
- Follow-up exam generated focusing on those topics.
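A simplified sketch of that grading loop, reusing the `embed` helper from the embedding sketch above; the 0.75 similarity threshold and field names are illustrative:

```python
# Grading sketch: exact match for MCQs, embedding similarity for FRQs,
# and weak-topic collection from incorrect answers.
import numpy as np

def grade(questions: list[dict], answers: list[str], threshold: float = 0.75):
    results, weak_topics = [], set()
    for q, given in zip(questions, answers):
        if q["type"] == "mcq":
            correct = given.strip().lower() == q["answer"].strip().lower()
        else:
            # FRQ: cosine similarity of normalized embeddings vs. reference answer.
            a, b = np.array(embed(given)), np.array(embed(q["answer"]))
            correct = float(a @ b) >= threshold
        results.append(correct)
        if not correct:
            weak_topics.add(q["topic"])
    return results, weak_topics  # weak_topics seeds the follow-up exam
```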
5. Frontend (Streamlit)
- Session state handled by shared utilities: `merged_questions`, `first_skipped_index`, `timer_remaining`, etc. (see the sketch below).
- Instructor and student views share the same backend API helpers.
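A sketch of what such a shared session-state initializer can look like; only the key names come from the project, the helper itself is illustrative:

```python
# Shared session-state defaults used by both instructor and student views.
import streamlit as st

DEFAULTS = {
    "merged_questions": [],
    "first_skipped_index": None,
    "timer_remaining": 0,
}

def init_session_state() -> None:
    """Populate st.session_state with defaults exactly once per session."""
    for key, value in DEFAULTS.items():
        if key not in st.session_state:
            st.session_state[key] = value
```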
6. Backend (FastAPI)
Single consolidated runtime:
- Unified CORS logic
- Lazy initialization of heavy models
- Stable import path for uvicorn
Shared code extracted into `src/common/` to eliminate duplication: `api.py`, `config.py`, `frontend.py`, `runtime.py`, `path.py`.
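A sketch of that consolidated runtime; the module path `src.common.runtime` and the `load_embedder` helper are hypothetical, and only the overall shape (one app, one CORS policy, lazy model loading) reflects the description above:

```python
# Consolidated FastAPI backend sketch: unified CORS, lazy model init.
from functools import lru_cache
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI(title="Test-me-please API")
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:8501"],  # Streamlit dev server
    allow_methods=["*"],
    allow_headers=["*"],
)

@lru_cache(maxsize=1)
def get_embedder():
    """Heavy ONNX session is created on first use, not at import time."""
    from src.common.runtime import load_embedder  # hypothetical helper
    return load_embedder()

@app.post("/generate")
def generate_exam(payload: dict):
    embedder = get_embedder()
    # ...retrieve chunks, build the prompt, call the exam agent...
    return {"status": "ok"}
```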
Challenges we ran into
**QNN acceleration complexity:** ONNX Runtime's QNN execution provider required precise environment setup and the correct backend libraries on Snapdragon hardware.
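For reference, the provider setup boils down to something like the following sketch; the `backend_path` and model path are illustrative:

```python
# Try the QNN execution provider (Snapdragon NPU via the HTP backend),
# fall back to CPU if it cannot load.
import onnxruntime as ort

providers = [
    ("QNNExecutionProvider", {"backend_path": "QnnHtp.dll"}),
    "CPUExecutionProvider",
]
session = ort.InferenceSession("models/nomic-embed-text.onnx", providers=providers)
print(session.get_providers())  # confirms whether the QNN EP actually loaded
```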
**Avoiding hallucination:** We had to design strict retrieval and structured prompting so question generation stayed faithful to the uploaded materials.
**Weaviate cluster quirks:** Local single-node deployments had to be stabilized for fast ingestion and retrieval in dev environments.
**Windows on Snapdragon toolchain:** Python wheels for ONNX Runtime, tokenizers, and QNN had to be pinned exactly to ensure compatibility.
**Cross-platform UI and backend imports:** Streamlit's execution directory differs from uvicorn's, so we had to centralize path handling to keep imports consistent.
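A sketch of what a centralized `src/common/path.py` can look like; the helper name and directory constants are illustrative:

```python
# Resolve everything from this file's location so imports work the same
# whether the process is launched by Streamlit or by uvicorn.
import sys
from pathlib import Path

PROJECT_ROOT = Path(__file__).resolve().parents[2]  # repo root above src/common/
DATA_DIR = PROJECT_ROOT / "data"

def ensure_importable() -> None:
    """Put the repo root on sys.path so `import src.common...` always works."""
    root = str(PROJECT_ROOT)
    if root not in sys.path:
        sys.path.insert(0, root)
```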
**Fast iteration without breaking the system:** We introduced stub modes for both the LLM and the embedder so we could prototype UI and API flows before hooking up real models.
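For example, a stub embedder can be as simple as the following sketch (illustrative, not the project's exact stub); the real ONNX embedder is swapped in behind the same interface later:

```python
# Deterministic fake embeddings: cheap, reproducible, no model required.
import hashlib
import numpy as np

class StubEmbedder:
    def __init__(self, dim: int = 768):
        self.dim = dim

    def embed(self, text: str) -> list[float]:
        # Seed a RNG from the text hash so the same input always maps
        # to the same unit vector.
        seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")
        v = np.random.default_rng(seed).standard_normal(self.dim)
        return (v / np.linalg.norm(v)).tolist()
```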
Accomplishments that we're proud of
- A fully functioning exam workflow: ingest → generate → test → grade → personalize.
- Consistent local-only deployment with no OpenAI or cloud dependencies.
- Seamless hybrid mode where AnythingLLM retrieval can be used when available, but falls back to local RAG when not.
- A shared codebase where backend/frontend duplication has been removed via `src/common/`.
- Real progress toward on-device exam generation with Phi-3.5 + NPU acceleration.
- A stable Windows-on-Snapdragon developer environment that others can reuse.
What we learned
- Vector stores need careful metadata design for good retrieval; chunk size and structure matter more than model size.
- Qualcomm’s QNN stack is powerful but unforgiving: correct EP selection, library paths, and ABI versioning matter.
- Strict grounding is essential when generating educational content — retrieval + filtering greatly reduce hallucination.
- Streamlit works surprisingly well for fast multimodal UI prototyping.
- Building a reusable internal module layer (`src/common/`) greatly simplifies a multi-script hackathon project.
- AnythingLLM is useful as a fallback or hybrid component, but local retrieval offers higher reliability and lower latency.
What’s next for Test-me-please
- Full Phi-3.5-Mini integration running entirely offline on-device.
- User accounts + auth (JWT or API keys).
- In-browser React UI with real-time streaming answers.
- Analytics dashboard showing topic mastery progression.
- Rubric-aware FRQ grading using structured reasoning.
- Better deduplication using remote or local similarity scoring.
- Classroom mode where instructors can push exams to multiple student accounts.
- Semantic tagging of course materials.