A platform for undergraduate students to find research supervisors and automate personalized cold emails.
Hybrid Microservice Architecture:
- Next.js 16 Web App: Frontend UI, Google OAuth, rate limiting, MongoDB queries
- Python FastAPI Agent Service: LangChain/LangGraph AI agents, PDF parsing, Gemini integration
- Firecrawl Data Pipeline: Standalone scraping service to populate professor database
Before you begin, ensure you have the following installed:

- **Node.js** (v18 or higher) and **npm** (v9 or higher)
  - Download from nodejs.org
  - Verify installation: `node --version && npm --version`
- **Python** (v3.10 or higher) and **pip**
  - Download from python.org
  - Verify installation: `python3 --version && pip3 --version`
- **Git**
  - Download from git-scm.com
- **MongoDB Atlas account** (free tier available)
  - Sign up at mongodb.com/cloud/atlas
```
find-my-prof/
├── web/                    # Next.js 16 Application
│   ├── app/                # App router pages and API routes
│   ├── lib/                # Core utilities (MongoDB, Auth, Rate limiting)
│   ├── components/         # React components
│   ├── CLAUDE.md           # Context file for AI assistance
│   └── SPEC.md             # Technical specification
│
├── agent-service/          # Python FastAPI Microservice
│   ├── app/
│   │   ├── agents/         # LangGraph agents (matcher, drafter)
│   │   ├── services/       # PDF parsing, Gemini, Semantic Scholar
│   │   └── main.py         # FastAPI app
│   ├── requirements.txt
│   └── Dockerfile
│
└── data-pipeline/          # Firecrawl Scraping Pipeline
    ├── scrapers/           # Firecrawl integration
    ├── processors/         # Data cleaning and MongoDB seeding
    ├── config.yaml         # University URLs
    └── run_pipeline.py     # Main execution script
```
**Web app (Next.js 16):**
- TypeScript, Tailwind CSS
- NextAuth.js (Google OAuth 2.0)
- MongoDB Atlas
- Gmail API

**Agent service (Python):**
- FastAPI
- LangChain + LangGraph
- Gemini 1.5 Flash
- Semantic Scholar API
- PyPDF2, pdfplumber

**Data pipeline:**
- Firecrawl API
- MongoDB Atlas
- YAML configuration
Follow these steps to get the entire project up and running on your local machine.

**Step 1: Clone the repository**

```bash
git clone https://github.com/yourusername/find-my-prof.git
cd find-my-prof
```

**Step 2: Set up MongoDB Atlas**

- Create a free account at MongoDB Atlas
- Create a new cluster (the free tier M0 is sufficient)
- Create a database user with read/write permissions
- Whitelist your IP address (or use `0.0.0.0/0` for development)
- Get your connection string; it should look like `mongodb+srv://<username>:<password>@<cluster>.mongodb.net/`
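You can sanity-check the connection string before wiring it into the apps. A minimal sketch, assuming `pymongo` is installed (`pip install pymongo`):

```python
# Ping the cluster to verify the connection string and IP whitelist
from pymongo import MongoClient

uri = "mongodb+srv://<username>:<password>@<cluster>.mongodb.net/"  # paste your real URI
client = MongoClient(uri, serverSelectionTimeoutMS=5000)
print(client.admin.command("ping"))  # prints {'ok': 1.0} on success
```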
**Step 3: Get API keys**

You'll need the following API keys:

- **Google OAuth credentials**
  - Go to the Google Cloud Console
  - Create a new project or select an existing one
  - Enable the Gmail API
  - Go to "Credentials" → "Create Credentials" → "OAuth 2.0 Client ID"
  - Set the authorized redirect URI: `http://localhost:3000/api/auth/callback/google`
  - Save your Client ID and Client Secret
- **Google Gemini API key**
  - Visit Google AI Studio
  - Create a new API key and save it
- **Firecrawl API key** (optional, only needed for the data pipeline)
  - Sign up at Firecrawl
  - Get your API key from the dashboard
- **NextAuth secret**
  - Generate a random secret: `openssl rand -base64 32`
  - Or visit https://generate-secret.vercel.app/32
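If you don't have `openssl`, Python's standard library generates an equivalent secret:

```python
# Equivalent to `openssl rand -base64 32`: 32 bytes of randomness, URL-safe encoded
import secrets
print(secrets.token_urlsafe(32))
```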
**Step 4: Set up the agent service**

```bash
# Navigate to the agent-service directory
cd agent-service

# Create a virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Copy the environment file
cp .env.example .env

# Edit .env and add your API keys:
# GEMINI_API_KEY=your_gemini_api_key_here
# API_KEY=create_any_secure_random_string_here
```

Start the agent service:

```bash
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```

The agent service should now be running at http://localhost:8000.
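To confirm it's up, you can hit the health endpoint (a quick sketch using the `requests` library, which you may need to `pip install` first):

```python
# Smoke-test the agent service's health endpoint
import requests

resp = requests.get("http://localhost:8000/health", timeout=5)
print(resp.status_code, resp.text)  # expect HTTP 200 when the service is up
```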
**Step 5: Set up the web application**

Open a new terminal window:

```bash
# Navigate to the web directory
cd web

# Install dependencies
npm install

# Copy the environment file
cp .env.example .env.local

# Edit .env.local and fill in all the environment variables:
# MONGODB_URI=your_mongodb_connection_string
# GOOGLE_CLIENT_ID=your_google_oauth_client_id
# GOOGLE_CLIENT_SECRET=your_google_oauth_client_secret
# NEXTAUTH_URL=http://localhost:3000
# NEXTAUTH_SECRET=your_generated_secret
# AGENT_SERVICE_URL=http://localhost:8000
# AGENT_SERVICE_API_KEY=same_as_API_KEY_in_agent_service
```

Start the web application:

```bash
npm run dev
```

The web app should now be running at http://localhost:3000.
The data pipeline uses Firecrawl and Google Gemini to scrape professor information from university faculty pages and automatically populate your MongoDB database. This is how you'll get professor data into your system.
You'll need:

- **Python 3.12** (required for compatibility)
- **Firecrawl API key** - sign up at firecrawl.dev
- **Google Gemini API key** - get it from Google AI Studio
- **MongoDB URI** - your connection string from Step 2
Open a new terminal window:

```bash
# Navigate to the data-pipeline directory
cd data-pipeline

# Verify you have Python 3.12 installed
python3.12 --version

# Create a virtual environment with Python 3.12
python3.12 -m venv venv

# Activate the virtual environment
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Copy the environment template
cp .env.example .env
```

Edit the `.env` file in the data-pipeline directory:

```bash
# Required: Get from https://www.firecrawl.dev/
FIRECRAWL_API_KEY=fc-your_firecrawl_api_key_here

# Required: Get from https://makersuite.google.com/app/apikey
GEMINI_API_KEY=AIzaSy_your_gemini_api_key_here

# Required: Your MongoDB Atlas connection string
MONGODB_URI=mongodb+srv://<username>:<password>@<cluster>.mongodb.net/
```

The pipeline works best with faculty directory pages that list professors. Here's how to find them:
1. **Navigate to your target university's website**
   - Example: University of Toronto, McMaster University, etc.
2. **Find the faculty/department page**
   - Look for pages like "Faculty", "People", "Faculty Directory", or "Our Team"
   - These pages typically list all professors with their names, titles, and research areas
3. **Copy the full URL**
   - The URL should look like one of these:
     - https://www.cs.toronto.edu/people/faculty/
     - https://healthsci.mcmaster.ca/faculty-members
     - https://engineering.university.edu/faculty-directory
4. **Verify the page contains professor information** (a quick sanity check follows this list)
   - Check that the page lists professor names, titles, and research interests
   - The more structured the page, the better the extraction will work
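Before committing to a URL, a rough reachability check can save a wasted run (a sketch using `requests`; the keyword test is only a heuristic, not part of the pipeline):

```python
# Heuristic check: is the candidate page reachable and faculty-like?
import requests

url = "https://www.cs.toronto.edu/people/faculty/"  # replace with your candidate URL
resp = requests.get(url, timeout=10)
text = resp.text.lower()
print("HTTP status:", resp.status_code)
print("mentions faculty terms:", any(k in text for k in ("professor", "faculty", "research")))
```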
Use the following command structure:

```bash
python run_pipeline.py --url "FACULTY_PAGE_URL" --university "UNIVERSITY_NAME" --faculty "FACULTY_NAME"
```

Example 1: Computer Science Department

```bash
python run_pipeline.py \
  --url "https://www.cs.toronto.edu/people/faculty/" \
  --university "University of Toronto" \
  --faculty "Computer Science"
```

Example 2: Health Sciences

```bash
python run_pipeline.py \
  --url "https://healthsci.mcmaster.ca/faculty-members" \
  --university "McMaster University" \
  --faculty "Health Sciences"
```

Example 3: Engineering

```bash
python run_pipeline.py \
  --url "https://engineering.example.edu/faculty" \
  --university "Example University" \
  --faculty "Engineering"
```

The pipeline then does the following:

- Scrapes the faculty page using Firecrawl (respects robots.txt)
- Extracts professor information using Google Gemini AI:
  - Name
  - Title
  - Research interests
  - Personal website (if available)
- Cleans and validates the data
- Seeds to MongoDB with university and faculty metadata
- Upserts data: updates existing professors or inserts new ones (sketched below)
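The actual seeding code lives in `data-pipeline/processors/`; purely as an illustration of the upsert step, with hypothetical field and collection names:

```python
# Illustrative upsert: insert a professor if new, update if already present
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<username>:<password>@<cluster>.mongodb.net/")
professors = client["findmyprof"]["professors"]  # hypothetical db/collection names

prof = {
    "name": "Ada Lovelace",
    "title": "Professor",
    "research_interests": ["computing", "mathematics"],
    "university": "Example University",
    "faculty": "Computer Science",
}

# Match on name + university so re-runs update rather than duplicate
result = professors.update_one(
    {"name": prof["name"], "university": prof["university"]},
    {"$set": prof},
    upsert=True,
)
print("inserted" if result.upserted_id else "updated")
```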
When the pipeline runs successfully, you'll see output like this:

```
============================================================
Scraping: University of Toronto - Computer Science
URL: https://www.cs.toronto.edu/people/faculty/
============================================================
Scraping with Firecrawl...
Parsing with Gemini...
Extracted 45 professors
Cleaning data...
Cleaned professors: 43
Seeding to MongoDB...
============================================================
✓ Pipeline complete!
Total processed: 43
Inserted: 40
Updated: 3
============================================================
```
You can run the pipeline multiple times for different universities and faculties:

```bash
# Add Computer Science professors
python run_pipeline.py --url "https://cs.uni1.edu/faculty" --university "University 1" --faculty "Computer Science"

# Add Health Sciences professors
python run_pipeline.py --url "https://healthsci.uni1.edu/faculty" --university "University 1" --faculty "Health Sciences"

# Add Engineering professors from another university
python run_pipeline.py --url "https://eng.uni2.edu/people" --university "University 2" --faculty "Engineering"
```

**"No professors extracted" error:**

- The page might not have structured professor data
- Try a different faculty directory page
- Ensure the URL is accessible and contains professor listings
**Firecrawl rate limits:**

- The free tier has rate limits
- Wait a few minutes between runs or upgrade your Firecrawl plan

**Gemini API errors:**

- Verify your Gemini API key is correct
- Check that you haven't exceeded the free tier quota
- Ensure the API key has the correct permissions

**MongoDB connection failed:**

- Verify your IP is whitelisted in MongoDB Atlas
- Check that the connection string is correct
- Ensure the database user has write permissions
**Verify everything works:**

- Open http://localhost:3000 in your browser
- You should see the Find My Prof landing page
- Click "Login with Google" to authenticate
- Upload a resume PDF to test the matching functionality
- Check http://localhost:8000/health to verify the agent service is running
**Port already in use:**

```bash
# Find and kill the process on port 3000 or 8000
lsof -ti:3000 | xargs kill -9  # Web app
lsof -ti:8000 | xargs kill -9  # Agent service
```

**MongoDB connection issues:**

- Ensure your IP is whitelisted in MongoDB Atlas
- Verify the connection string is correct
- Check that the database user has proper permissions

**Google OAuth not working:**

- Verify the redirect URI matches exactly in the Google Cloud Console
- Ensure the Gmail API is enabled
- Check that `NEXTAUTH_URL` matches your local environment

**Module not found errors:**

- Make sure you're in the correct directory
- Verify the virtual environment is activated (for the Python services)
- Re-run `npm install` or `pip install -r requirements.txt`
Web app (`web/.env.local`):

| Variable | Description | Example |
|---|---|---|
| `MONGODB_URI` | MongoDB Atlas connection string | `mongodb+srv://user:pass@cluster.mongodb.net/` |
| `GOOGLE_CLIENT_ID` | Google OAuth client ID | `123456789.apps.googleusercontent.com` |
| `GOOGLE_CLIENT_SECRET` | Google OAuth client secret | `GOCSPX-xxxxxxxxxxxxxx` |
| `NEXTAUTH_URL` | Base URL for NextAuth | `http://localhost:3000` |
| `NEXTAUTH_SECRET` | NextAuth JWT encryption secret | Generated via `openssl rand -base64 32` |
| `AGENT_SERVICE_URL` | Python agent service URL | `http://localhost:8000` |
| `AGENT_SERVICE_API_KEY` | Agent service API key (must match the agent service's `API_KEY`) | Any secure random string |
Agent service (`agent-service/.env`):

| Variable | Description | Example |
|---|---|---|
| `GEMINI_API_KEY` | Google Gemini API key | `AIzaSyXXXXXXXXXXXXXXXXXXXXXX` |
| `API_KEY` | Service API key for authentication | Any secure random string |
Data pipeline (`data-pipeline/.env`):

| Variable | Description | Example |
|---|---|---|
| `FIRECRAWL_API_KEY` | Firecrawl API key for web scraping | `fc-xxxxxxxxxxxxxx` |
| `GEMINI_API_KEY` | Google Gemini API key | `AIzaSyXXXXXXXXXXXXXXXXXXXXXX` |
| `MONGODB_URI` | MongoDB Atlas connection string | `mongodb+srv://user:pass@cluster.mongodb.net/` |
- **Login** → Google OAuth (grants the `gmail.send` permission)
- **Upload Resume** → PDF processed in-memory (see the sketch after this list)
- **Select Faculty** → choose from Med Sci, Health Sci, CS, or Engineering
- **View Matches** → see 12 professors with compatibility scores
- **Review Drafts** → edit or delete personalized emails
- **Send** → batch send via the Gmail API
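As an illustration of the in-memory PDF handling above, a minimal sketch with PyPDF2 (one of the libraries in the agent service's stack; the service's actual parsing code may differ):

```python
# Parse an uploaded PDF entirely in memory; nothing is written to disk
import io
from PyPDF2 import PdfReader

def extract_text(pdf_bytes: bytes) -> str:
    reader = PdfReader(io.BytesIO(pdf_bytes))
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```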
**Web (Vercel):**

```bash
cd web
vercel deploy
```

**Agent service (Railway):**

```bash
cd agent-service
railway up
```

**Data pipeline:** run once locally to populate the database:

```bash
cd data-pipeline
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python run_pipeline.py  # plus the --url/--university/--faculty arguments described above
```

**Development tips:**

- Web: use Turbopack for fast development (`npm run dev`)
- Agent: hot reload is enabled with `uvicorn --reload` (use a virtual environment)
- Pipeline: run once locally to seed the database
Web app API routes:

- `POST /api/agent/match` - Match student with professors
- `POST /api/agent/draft` - Generate email drafts
- `POST /api/gmail/send` - Send email via Gmail
- `GET /api/professors` - Query professor database (example below)
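For example, with the web app running you could query the professor database like this (a sketch; the route may require authentication or query parameters, so check `web/SPEC.md` for the actual contract):

```python
# Query the professor database through the web app's API
import requests

resp = requests.get("http://localhost:3000/api/professors", timeout=5)
print(resp.status_code)
print(resp.json() if resp.ok else resp.text)
```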
Agent service endpoints:

- `POST /draft` - Run drafter agent
- `GET /health` - Health check
Security notes:

- API key authentication for the agent service (see the sketch below)
- In-memory PDF processing (no storage)
- Server-side-only secrets (no NEXT_PUBLIC_ vars)
- HTTPS only in production
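The first point could look roughly like this on the FastAPI side (a sketch: the `X-API-Key` header name and route wiring are assumptions for illustration, not the service's actual implementation, which lives in `agent-service/app/`):

```python
# Sketch: guard agent-service routes with a shared API key header
import os
from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()
API_KEY = os.environ["API_KEY"]  # the same value as AGENT_SERVICE_API_KEY on the web side

def require_api_key(x_api_key: str = Header(...)) -> None:
    # FastAPI reads this parameter from the `X-API-Key` request header
    if x_api_key != API_KEY:
        raise HTTPException(status_code=401, detail="Invalid API key")

@app.post("/draft", dependencies=[Depends(require_api_key)])
def draft() -> dict:
    return {"drafts": []}  # placeholder; the real drafter agent logic lives in app/agents/
```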