A platform for undergraduate students to find research supervisors and automate personalized cold emails.
Hybrid Microservice Architecture:
- Next.js 16 Web App: Frontend UI, Google OAuth, rate limiting, MongoDB queries
- Python FastAPI Agent Service: LangChain/LangGraph AI agents, PDF parsing, Gemini integration
- Firecrawl Data Pipeline: Standalone scraping service to populate professor database
Before you begin, ensure you have the following installed:

- **Node.js** (v18 or higher) and **npm** (v9 or higher)
  - Download from nodejs.org
  - Verify installation: `node --version && npm --version`
- **Python** (v3.10 or higher) and **pip**
  - Download from python.org
  - Verify installation: `python3 --version && pip3 --version`
- **Git**
  - Download from git-scm.com
- **MongoDB Atlas account** (free tier available)
  - Sign up at mongodb.com/cloud/atlas
```
find-my-prof/
├── web/                    # Next.js 16 Application
│   ├── app/                # App router pages and API routes
│   ├── lib/                # Core utilities (MongoDB, Auth, Rate limiting)
│   ├── components/         # React components
│   ├── CLAUDE.md           # Context file for AI assistance
│   └── SPEC.md             # Technical specification
│
├── agent-service/          # Python FastAPI Microservice
│   ├── app/
│   │   ├── agents/         # LangGraph agents (matcher, drafter)
│   │   ├── services/       # PDF parsing, Gemini, Semantic Scholar
│   │   └── main.py         # FastAPI app
│   ├── requirements.txt
│   └── Dockerfile
│
└── data-pipeline/          # Firecrawl Scraping Pipeline
    ├── scrapers/           # Firecrawl integration
    ├── processors/         # Data cleaning and MongoDB seeding
    ├── config.yaml         # University URLs
    └── run_pipeline.py     # Main execution script
```
**Web app (Next.js 16):**
- TypeScript, Tailwind CSS
- NextAuth.js (Google OAuth 2.0)
- MongoDB Atlas
- Gmail API

**Agent service (Python):**
- FastAPI
- LangChain + LangGraph
- Gemini 1.5 Flash
- Semantic Scholar API
- PyPDF2, pdfplumber

**Data pipeline:**
- Firecrawl API
- MongoDB Atlas
- YAML configuration
Follow these steps to get the entire project up and running on your local machine.

**Step 1: Clone the repository**

```bash
git clone https://github.com/yourusername/find-my-prof.git
cd find-my-prof
```

**Step 2: Set up MongoDB Atlas**

- Create a free account at MongoDB Atlas
- Create a new cluster (the free tier M0 is sufficient)
- Create a database user with read/write permissions
- Whitelist your IP address (or use `0.0.0.0/0` for development)
- Get your connection string; it should look like `mongodb+srv://<username>:<password>@<cluster>.mongodb.net/`
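You can sanity-check the connection string before wiring it into the apps. A minimal sketch, assuming `pymongo` is installed (`pip install pymongo`):

```python
# Ping the cluster to verify the connection string and IP whitelist
from pymongo import MongoClient

uri = "mongodb+srv://<username>:<password>@<cluster>.mongodb.net/"  # paste your real URI
client = MongoClient(uri, serverSelectionTimeoutMS=5000)
print(client.admin.command("ping"))  # prints {'ok': 1.0} on success
```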
**Step 3: Get API keys**

You'll need the following API keys:

- **Google OAuth credentials**
  - Go to the Google Cloud Console
  - Create a new project or select an existing one
  - Enable the Gmail API
  - Go to "Credentials" → "Create Credentials" → "OAuth 2.0 Client ID"
  - Set the authorized redirect URI: `http://localhost:3000/api/auth/callback/google`
  - Save your Client ID and Client Secret
- **Google Gemini API key**
  - Visit Google AI Studio
  - Create a new API key and save it
- **Firecrawl API key** (optional, only needed for the data pipeline)
  - Sign up at Firecrawl
  - Get your API key from the dashboard
- **NextAuth secret**
  - Generate a random secret: `openssl rand -base64 32`
  - Or visit https://generate-secret.vercel.app/32
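If you don't have `openssl`, Python's standard library generates an equivalent secret:

```python
# Equivalent to `openssl rand -base64 32`: 32 bytes of randomness, URL-safe encoded
import secrets
print(secrets.token_urlsafe(32))
```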
**Step 4: Set up the agent service**

```bash
# Navigate to the agent-service directory
cd agent-service

# Create a virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Copy the environment file
cp .env.example .env

# Edit .env and add your API keys:
# GEMINI_API_KEY=your_gemini_api_key_here
# API_KEY=create_any_secure_random_string_here
```

Start the agent service:

```bash
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```

The agent service should now be running at http://localhost:8000.
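To confirm it's up, you can hit the health endpoint (a quick sketch using the `requests` library, which you may need to `pip install` first):

```python
# Smoke-test the agent service's health endpoint
import requests

resp = requests.get("http://localhost:8000/health", timeout=5)
print(resp.status_code, resp.text)  # expect HTTP 200 when the service is up
```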
**Step 5: Set up the web application**

Open a new terminal window:

```bash
# Navigate to the web directory
cd web

# Install dependencies
npm install

# Copy the environment file
cp .env.example .env.local

# Edit .env.local and fill in all the environment variables:
# MONGODB_URI=your_mongodb_connection_string
# GOOGLE_CLIENT_ID=your_google_oauth_client_id
# GOOGLE_CLIENT_SECRET=your_google_oauth_client_secret
# NEXTAUTH_URL=http://localhost:3000
# NEXTAUTH_SECRET=your_generated_secret
# AGENT_SERVICE_URL=http://localhost:8000
# AGENT_SERVICE_API_KEY=same_as_API_KEY_in_agent_service
```

Start the web application:

```bash
npm run dev
```

The web app should now be running at http://localhost:3000.
The data pipeline uses Firecrawl and Google Gemini to scrape professor information from university faculty pages and automatically populate your MongoDB database. This is how you'll get professor data into your system.
You'll need:

- **Python 3.12** (required for compatibility)
- **Firecrawl API key** - sign up at firecrawl.dev
- **Google Gemini API key** - get it from Google AI Studio
- **MongoDB URI** - your connection string from Step 2
Open a new terminal window:

```bash
# Navigate to the data-pipeline directory
cd data-pipeline

# Verify you have Python 3.12 installed
python3.12 --version

# Create a virtual environment with Python 3.12
python3.12 -m venv venv

# Activate the virtual environment
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Copy the environment template
cp .env.example .env
```

Edit the `.env` file in the data-pipeline directory:

```bash
# Required: Get from https://www.firecrawl.dev/
FIRECRAWL_API_KEY=fc-your_firecrawl_api_key_here

# Required: Get from https://makersuite.google.com/app/apikey
GEMINI_API_KEY=AIzaSy_your_gemini_api_key_here

# Required: Your MongoDB Atlas connection string
MONGODB_URI=mongodb+srv://<username>:<password>@<cluster>.mongodb.net/
```

The pipeline works best with faculty directory pages that list professors. Here's how to find them:
1. **Navigate to your target university's website**
   - Example: University of Toronto, McMaster University, etc.
2. **Find the faculty/department page**
   - Look for pages like "Faculty", "People", "Faculty Directory", or "Our Team"
   - These pages typically list all professors with their names, titles, and research areas
3. **Copy the full URL**
   - The URL should look like one of these:
     - https://www.cs.toronto.edu/people/faculty/
     - https://healthsci.mcmaster.ca/faculty-members
     - https://engineering.university.edu/faculty-directory
4. **Verify the page contains professor information** (a quick sanity check follows this list)
   - Check that the page lists professor names, titles, and research interests
   - The more structured the page, the better the extraction will work
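Before committing to a URL, a rough reachability check can save a wasted run (a sketch using `requests`; the keyword test is only a heuristic, not part of the pipeline):

```python
# Heuristic check: is the candidate page reachable and faculty-like?
import requests

url = "https://www.cs.toronto.edu/people/faculty/"  # replace with your candidate URL
resp = requests.get(url, timeout=10)
text = resp.text.lower()
print("HTTP status:", resp.status_code)
print("mentions faculty terms:", any(k in text for k in ("professor", "faculty", "research")))
```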
Use the following command structure:

```bash
python run_pipeline.py --url "FACULTY_PAGE_URL" --university "UNIVERSITY_NAME" --faculty "FACULTY_NAME"
```

Example 1: Computer Science Department

```bash
python run_pipeline.py \
  --url "https://www.cs.toronto.edu/people/faculty/" \
  --university "University of Toronto" \
  --faculty "Computer Science"
```

Example 2: Health Sciences

```bash
python run_pipeline.py \
  --url "https://healthsci.mcmaster.ca/faculty-members" \
  --university "McMaster University" \
  --faculty "Health Sciences"
```

Example 3: Engineering

```bash
python run_pipeline.py \
  --url "https://engineering.example.edu/faculty" \
  --university "Example University" \
  --faculty "Engineering"
```

The pipeline then does the following:

- Scrapes the faculty page using Firecrawl (respects robots.txt)
- Extracts professor information using Google Gemini AI:
  - Name
  - Title
  - Research interests
  - Personal website (if available)
- Cleans and validates the data
- Seeds to MongoDB with university and faculty metadata
- Upserts data: updates existing professors or inserts new ones (sketched below)
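The actual seeding code lives in `data-pipeline/processors/`; purely as an illustration of the upsert step, with hypothetical field and collection names:

```python
# Illustrative upsert: insert a professor if new, update if already present
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<username>:<password>@<cluster>.mongodb.net/")
professors = client["findmyprof"]["professors"]  # hypothetical db/collection names

prof = {
    "name": "Ada Lovelace",
    "title": "Professor",
    "research_interests": ["computing", "mathematics"],
    "university": "Example University",
    "faculty": "Computer Science",
}

# Match on name + university so re-runs update rather than duplicate
result = professors.update_one(
    {"name": prof["name"], "university": prof["university"]},
    {"$set": prof},
    upsert=True,
)
print("inserted" if result.upserted_id else "updated")
```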
When the pipeline runs successfully, you'll see output like this:

```
============================================================
Scraping: University of Toronto - Computer Science
URL: https://www.cs.toronto.edu/people/faculty/
============================================================
Scraping with Firecrawl...
Parsing with Gemini...
Extracted 45 professors
Cleaning data...
Cleaned professors: 43
Seeding to MongoDB...
============================================================
✓ Pipeline complete!
Total processed: 43
Inserted: 40
Updated: 3
============================================================
```
You can run the pipeline multiple times for different universities and faculties:

```bash
# Add Computer Science professors
python run_pipeline.py --url "https://cs.uni1.edu/faculty" --university "University 1" --faculty "Computer Science"

# Add Health Sciences professors
python run_pipeline.py --url "https://healthsci.uni1.edu/faculty" --university "University 1" --faculty "Health Sciences"

# Add Engineering professors from another university
python run_pipeline.py --url "https://eng.uni2.edu/people" --university "University 2" --faculty "Engineering"
```

**"No professors extracted" error:**

- The page might not have structured professor data
- Try a different faculty directory page
- Ensure the URL is accessible and contains professor listings
**Firecrawl rate limits:**

- The free tier has rate limits
- Wait a few minutes between runs or upgrade your Firecrawl plan

**Gemini API errors:**

- Verify your Gemini API key is correct
- Check that you haven't exceeded the free tier quota
- Ensure the API key has the correct permissions

**MongoDB connection failed:**

- Verify your IP is whitelisted in MongoDB Atlas
- Check that the connection string is correct
- Ensure the database user has write permissions
**Verify everything works:**

- Open http://localhost:3000 in your browser
- You should see the Find My Prof landing page
- Click "Login with Google" to authenticate
- Upload a resume PDF to test the matching functionality
- Check http://localhost:8000/health to verify the agent service is running
**Port already in use:**

```bash
# Find and kill the process on port 3000 or 8000
lsof -ti:3000 | xargs kill -9  # Web app
lsof -ti:8000 | xargs kill -9  # Agent service
```

**MongoDB connection issues:**

- Ensure your IP is whitelisted in MongoDB Atlas
- Verify the connection string is correct
- Check that the database user has proper permissions

**Google OAuth not working:**

- Verify the redirect URI matches exactly in the Google Cloud Console
- Ensure the Gmail API is enabled
- Check that `NEXTAUTH_URL` matches your local environment

**Module not found errors:**

- Make sure you're in the correct directory
- Verify the virtual environment is activated (for the Python services)
- Re-run `npm install` or `pip install -r requirements.txt`
Web app (`web/.env.local`):

| Variable | Description | Example |
|---|---|---|
| `MONGODB_URI` | MongoDB Atlas connection string | `mongodb+srv://user:pass@cluster.mongodb.net/` |
| `GOOGLE_CLIENT_ID` | Google OAuth client ID | `123456789.apps.googleusercontent.com` |
| `GOOGLE_CLIENT_SECRET` | Google OAuth client secret | `GOCSPX-xxxxxxxxxxxxxx` |
| `NEXTAUTH_URL` | Base URL for NextAuth | `http://localhost:3000` |
| `NEXTAUTH_SECRET` | NextAuth JWT encryption secret | Generated via `openssl rand -base64 32` |
| `AGENT_SERVICE_URL` | Python agent service URL | `http://localhost:8000` |
| `AGENT_SERVICE_API_KEY` | Agent service API key (must match the agent service's `API_KEY`) | Any secure random string |
Agent service (`agent-service/.env`):

| Variable | Description | Example |
|---|---|---|
| `GEMINI_API_KEY` | Google Gemini API key | `AIzaSyXXXXXXXXXXXXXXXXXXXXXX` |
| `API_KEY` | Service API key for authentication | Any secure random string |
Data pipeline (`data-pipeline/.env`):

| Variable | Description | Example |
|---|---|---|
| `FIRECRAWL_API_KEY` | Firecrawl API key for web scraping | `fc-xxxxxxxxxxxxxx` |
| `GEMINI_API_KEY` | Google Gemini API key | `AIzaSyXXXXXXXXXXXXXXXXXXXXXX` |
| `MONGODB_URI` | MongoDB Atlas connection string | `mongodb+srv://user:pass@cluster.mongodb.net/` |
- **Login** → Google OAuth (grants the `gmail.send` permission)
- **Upload Resume** → PDF processed in-memory (see the sketch after this list)
- **Select Faculty** → choose from Med Sci, Health Sci, CS, or Engineering
- **View Matches** → see 12 professors with compatibility scores
- **Review Drafts** → edit or delete personalized emails
- **Send** → batch send via the Gmail API
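As an illustration of the in-memory PDF handling above, a minimal sketch with PyPDF2 (one of the libraries in the agent service's stack; the service's actual parsing code may differ):

```python
# Parse an uploaded PDF entirely in memory; nothing is written to disk
import io
from PyPDF2 import PdfReader

def extract_text(pdf_bytes: bytes) -> str:
    reader = PdfReader(io.BytesIO(pdf_bytes))
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```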
**Web (Vercel):**

```bash
cd web
vercel deploy
```

**Agent service (Railway):**

```bash
cd agent-service
railway up
```

**Data pipeline:** run once locally to populate the database:

```bash
cd data-pipeline
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python run_pipeline.py  # plus the --url/--university/--faculty arguments described above
```

**Development tips:**

- Web: use Turbopack for fast development (`npm run dev`)
- Agent: hot reload is enabled with `uvicorn --reload` (use a virtual environment)
- Pipeline: run once locally to seed the database
Web app API routes:

- `POST /api/agent/match` - Match student with professors
- `POST /api/agent/draft` - Generate email drafts
- `POST /api/gmail/send` - Send email via Gmail
- `GET /api/professors` - Query professor database (example below)
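For example, with the web app running you could query the professor database like this (a sketch; the route may require authentication or query parameters, so check `web/SPEC.md` for the actual contract):

```python
# Query the professor database through the web app's API
import requests

resp = requests.get("http://localhost:3000/api/professors", timeout=5)
print(resp.status_code)
print(resp.json() if resp.ok else resp.text)
```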
Agent service endpoints:

- `POST /draft` - Run drafter agent
- `GET /health` - Health check
Security notes:

- API key authentication for the agent service (see the sketch below)
- In-memory PDF processing (no storage)
- Server-side-only secrets (no NEXT_PUBLIC_ vars)
- HTTPS only in production
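The first point could look roughly like this on the FastAPI side (a sketch: the `X-API-Key` header name and route wiring are assumptions for illustration, not the service's actual implementation, which lives in `agent-service/app/`):

```python
# Sketch: guard agent-service routes with a shared API key header
import os
from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()
API_KEY = os.environ["API_KEY"]  # the same value as AGENT_SERVICE_API_KEY on the web side

def require_api_key(x_api_key: str = Header(...)) -> None:
    # FastAPI reads this parameter from the `X-API-Key` request header
    if x_api_key != API_KEY:
        raise HTTPException(status_code=401, detail="Invalid API key")

@app.post("/draft", dependencies=[Depends(require_api_key)])
def draft() -> dict:
    return {"drafts": []}  # placeholder; the real drafter agent logic lives in app/agents/
```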