Find My Prof

A platform for undergraduate students to find research supervisors and automate personalized cold emails.

Architecture

Hybrid Microservice Architecture:

  • Next.js 16 Web App: Frontend UI, Google OAuth, rate limiting, MongoDB queries
  • Python FastAPI Agent Service: LangChain/LangGraph AI agents, PDF parsing, Gemini integration
  • Firecrawl Data Pipeline: Standalone scraping service to populate professor database

Prerequisites

Before you begin, ensure you have the following installed:

  • Node.js (v18 or higher) and npm (v9 or higher)

    • Download from nodejs.org
    • Verify installation: node --version && npm --version
  • Python (v3.10 or higher) and pip

    • Download from python.org
    • Verify installation: python3 --version && pip3 --version
  • Git

  • MongoDB Atlas Account (Free tier available)

Project Structure

find-my-prof/
├── web/                    # Next.js 16 Application
│   ├── app/               # App router pages and API routes
│   ├── lib/               # Core utilities (MongoDB, Auth, Rate limiting)
│   ├── components/        # React components
│   ├── CLAUDE.md          # Context file for AI assistance
│   └── SPEC.md            # Technical specification
│
├── agent-service/         # Python FastAPI Microservice
│   ├── app/
│   │   ├── agents/        # LangGraph agents (matcher, drafter)
│   │   ├── services/      # PDF parsing, Gemini, Semantic Scholar
│   │   └── main.py        # FastAPI app
│   ├── requirements.txt
│   └── Dockerfile
│
└── data-pipeline/         # Firecrawl Scraping Pipeline
    ├── scrapers/          # Firecrawl integration
    ├── processors/        # Data cleaning and MongoDB seeding
    ├── config.yaml        # University URLs
    └── run_pipeline.py    # Main execution script

Tech Stack

Web (Next.js 16)

  • TypeScript, Tailwind CSS
  • NextAuth.js (Google OAuth 2.0)
  • MongoDB Atlas
  • Gmail API

Agent Service (Python)

  • FastAPI
  • LangChain + LangGraph
  • Gemini 1.5 Flash
  • Semantic Scholar API
  • PyPDF2, pdfplumber

Data Pipeline (Python)

  • Firecrawl API
  • MongoDB Atlas
  • YAML configuration

Getting Started

Follow these steps to get the entire project up and running on your local machine.

Step 1: Clone the Repository

git clone https://github.com/yourusername/find-my-prof.git
cd find-my-prof

Step 2: Set Up MongoDB Atlas

  1. Create a free account at MongoDB Atlas
  2. Create a new cluster (free tier M0 is sufficient)
  3. Create a database user with read/write permissions
  4. Whitelist your IP address (or use 0.0.0.0/0 for development)
  5. Get your connection string (it should look like mongodb+srv://username:password@cluster.mongodb.net/); a quick connection test is shown below
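If you want to confirm the connection string works before wiring it into the apps, you can test it from the command line (this assumes you have the mongosh shell installed; it is not otherwise required by this project):

# Replace username, password, and cluster with your own values
mongosh "mongodb+srv://username:password@cluster.mongodb.net/" --eval "db.runCommand({ ping: 1 })"

# A successful connection prints { ok: 1 }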

Step 3: Obtain API Keys

You'll need the following API keys:

  1. Google OAuth Credentials

    • Go to Google Cloud Console
    • Create a new project or select an existing one
    • Enable the Google+ API and Gmail API
    • Go to "Credentials" → "Create Credentials" → "OAuth 2.0 Client ID"
    • Set authorized redirect URI: http://localhost:3000/api/auth/callback/google
    • Save your Client ID and Client Secret
  2. Google Gemini API Key

    • Get it from Google AI Studio (https://makersuite.google.com/app/apikey)
  3. Firecrawl API Key (Optional - only needed for the data pipeline)

    • Sign up at Firecrawl
    • Get your API key from the dashboard
  4. NextAuth Secret

    • Generate a secure random string; one way to do this is shown below
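A quick way to generate the NextAuth secret (and the shared agent-service key) is with openssl; any other securely generated random string works just as well:

# Generate a value for NEXTAUTH_SECRET
openssl rand -base64 32

# Generate a value to use for both API_KEY (agent service) and AGENT_SERVICE_API_KEY (web app)
openssl rand -base64 32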

Step 4: Set Up the Python Agent Service

# Navigate to agent-service directory
cd agent-service

# Create a virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Copy environment file
cp .env.example .env

# Edit .env and add your API keys
# GEMINI_API_KEY=your_gemini_api_key_here
# API_KEY=create_any_secure_random_string_here

Start the agent service:

uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

The agent service should now be running at http://localhost:8000
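You can sanity-check it from another terminal via the health endpoint (see API Endpoints below):

# Should return a success response if the service is up
curl http://localhost:8000/health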

Step 5: Set Up the Web Application

Open a new terminal window:

# Navigate to web directory
cd web

# Install dependencies
npm install

# Copy environment file
cp .env.example .env.local

# Edit .env.local and fill in all the environment variables:
# MONGODB_URI=your_mongodb_connection_string
# GOOGLE_CLIENT_ID=your_google_oauth_client_id
# GOOGLE_CLIENT_SECRET=your_google_oauth_client_secret
# NEXTAUTH_URL=http://localhost:3000
# NEXTAUTH_SECRET=your_generated_secret
# AGENT_SERVICE_URL=http://localhost:8000
# AGENT_SERVICE_API_KEY=same_as_API_KEY_in_agent_service

Start the web application:

npm run dev

The web app should now be running at http://localhost:3000

Step 6: Populate Your Database with Professor Data

The data pipeline uses Firecrawl and Google Gemini to scrape professor information from university faculty pages and automatically populate your MongoDB database. This is how you'll get professor data into your system.

Prerequisites for Data Pipeline

You'll need:

  • Python 3.12 (required for compatibility)
  • Firecrawl API Key - Sign up at firecrawl.dev
  • Google Gemini API Key - Get it from Google AI Studio
  • MongoDB URI - Your connection string from Step 2

Setup Instructions

Open a new terminal window:

# Navigate to data-pipeline directory
cd data-pipeline

# Verify you have Python 3.12 installed
python3.12 --version

# Create a virtual environment with Python 3.12
python3.12 -m venv venv

# Activate the virtual environment
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Copy environment template
cp .env.example .env

Configure Environment Variables

Edit the .env file in the data-pipeline directory:

# Required: Get from https://www.firecrawl.dev/
FIRECRAWL_API_KEY=fc-your_firecrawl_api_key_here

# Required: Get from https://makersuite.google.com/app/apikey
GEMINI_API_KEY=AIzaSy_your_gemini_api_key_here

# Required: Your MongoDB Atlas connection string
MONGODB_URI=mongodb+srv://username:password@cluster.mongodb.net/

Finding the Right Faculty Page URL

The pipeline works best with faculty directory pages that list professors. Here's how to find them:

  1. Navigate to your target university's website

    • Example: University of Toronto, McMaster University, etc.
  2. Find the faculty/department page

    • Look for pages like "Faculty", "People", "Faculty Directory", or "Our Team"
    • These pages typically list all professors with their names, titles, and research areas
  3. Copy the full URL

    • The URL should look like:
      • https://www.cs.toronto.edu/people/faculty/
      • https://healthsci.mcmaster.ca/faculty-members
      • https://engineering.university.edu/faculty-directory
  4. Verify the page contains professor information

    • Check that the page lists professor names, titles, and research interests
    • The more structured the page, the better the extraction will work
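Before running the pipeline, you can also confirm the page is reachable from your machine. A 200 status code means the URL resolves; it does not guarantee the page is well structured:

# Prints the HTTP status code for the faculty page
curl -s -o /dev/null -w "%{http_code}\n" "https://www.cs.toronto.edu/people/faculty/"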

Running the Pipeline

Use the following command structure:

python run_pipeline.py --url "FACULTY_PAGE_URL" --university "UNIVERSITY_NAME" --faculty "FACULTY_NAME"

Example 1: Computer Science Department

python run_pipeline.py \
  --url "https://www.cs.toronto.edu/people/faculty/" \
  --university "University of Toronto" \
  --faculty "Computer Science"

Example 2: Health Sciences

python run_pipeline.py \
  --url "https://healthsci.mcmaster.ca/faculty-members" \
  --university "McMaster University" \
  --faculty "Health Sciences"

Example 3: Engineering

python run_pipeline.py \
  --url "https://engineering.example.edu/faculty" \
  --university "Example University" \
  --faculty "Engineering"

What the Pipeline Does

  1. Scrapes the faculty page using Firecrawl (respects robots.txt)
  2. Extracts professor information using Google Gemini AI:
    • Name
    • Title
    • Email
    • Research interests
    • Personal website (if available)
  3. Cleans and validates the data
  4. Seeds to MongoDB with university and faculty metadata
  5. Upserts data - Updates existing professors or inserts new ones

Expected Output

When the pipeline runs successfully, you'll see:

============================================================
Scraping: University of Toronto - Computer Science
URL: https://www.cs.toronto.edu/people/faculty/
============================================================

Scraping with Firecrawl...
Parsing with Gemini...
Extracted 45 professors

Cleaning data...
Cleaned professors: 43

Seeding to MongoDB...
============================================================
✓ Pipeline complete!
  Total processed: 43
  Inserted: 40
  Updated: 3
============================================================
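To double-check that the documents actually landed in your cluster, you can query MongoDB directly. This sketch assumes the mongosh shell and a professors collection; check the processors/ code for the actual database and collection names:

# Count professors seeded for a given university (database and collection names are assumptions)
mongosh "mongodb+srv://username:password@cluster.mongodb.net/" --eval \
  'db.getSiblingDB("findmyprof").professors.countDocuments({ university: "University of Toronto" })'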

Running for Multiple Universities

You can run the pipeline multiple times for different universities and faculties:

# Add Computer Science professors
python run_pipeline.py --url "https://cs.uni1.edu/faculty" --university "University 1" --faculty "Computer Science"

# Add Health Sciences professors
python run_pipeline.py --url "https://healthsci.uni1.edu/faculty" --university "University 1" --faculty "Health Sciences"

# Add Engineering professors from another university
python run_pipeline.py --url "https://eng.uni2.edu/people" --university "University 2" --faculty "Engineering"

Troubleshooting the Pipeline

"No professors extracted" error:

  • The page might not have structured professor data
  • Try a different faculty directory page
  • Ensure the URL is accessible and contains professor listings

Firecrawl rate limits:

  • Free tier has rate limits
  • Wait a few minutes between runs or upgrade your Firecrawl plan

Gemini API errors:

  • Verify your Gemini API key is correct
  • Check you haven't exceeded the free tier quota
  • Ensure the API key has the correct permissions

MongoDB connection failed:

  • Verify your IP is whitelisted in MongoDB Atlas
  • Check the connection string is correct
  • Ensure the database user has write permissions

Verify Installation

  1. Open http://localhost:3000 in your browser
  2. You should see the Find My Prof landing page
  3. Click "Login with Google" to authenticate
  4. Upload a resume PDF to test the matching functionality
  5. Check http://localhost:8000/health to verify the agent service is running

Troubleshooting

Port already in use:

# Find and kill process on port 3000 or 8000
lsof -ti:3000 | xargs kill -9  # Web app
lsof -ti:8000 | xargs kill -9  # Agent service

MongoDB connection issues:

  • Ensure your IP is whitelisted in MongoDB Atlas
  • Verify the connection string is correct
  • Check that the database user has proper permissions

Google OAuth not working:

  • Verify redirect URI matches exactly in Google Cloud Console
  • Ensure both Google+ API and Gmail API are enabled
  • Check that NEXTAUTH_URL matches your local environment

Module not found errors:

  • Make sure you're in the correct directory
  • Verify virtual environment is activated (for Python services)
  • Re-run npm install or pip install -r requirements.txt

Environment Variables Reference

Web App (.env.local)

  • MONGODB_URI: MongoDB Atlas connection string (e.g., mongodb+srv://user:password@cluster.mongodb.net/)
  • GOOGLE_CLIENT_ID: Google OAuth client ID (e.g., 123456789.apps.googleusercontent.com)
  • GOOGLE_CLIENT_SECRET: Google OAuth client secret (e.g., GOCSPX-xxxxxxxxxxxxxx)
  • NEXTAUTH_URL: Base URL for NextAuth (e.g., http://localhost:3000)
  • NEXTAUTH_SECRET: NextAuth JWT encryption secret (generate via openssl rand -base64 32)
  • AGENT_SERVICE_URL: Python agent service URL (e.g., http://localhost:8000)
  • AGENT_SERVICE_API_KEY: Agent service API key; must match API_KEY in the agent service (any secure random string)

Agent Service (.env)

  • GEMINI_API_KEY: Google Gemini API key (e.g., AIzaSyXXXXXXXXXXXXXXXXXXXXXX)
  • API_KEY: Service API key for authentication (any secure random string)

Data Pipeline (.env)

  • FIRECRAWL_API_KEY: Firecrawl API key for web scraping (e.g., fc-xxxxxxxxxxxxxx)
  • GEMINI_API_KEY: Google Gemini API key (e.g., AIzaSyXXXXXXXXXXXXXXXXXXXXXX)
  • MONGODB_URI: MongoDB Atlas connection string (e.g., mongodb+srv://user:password@cluster.mongodb.net/)

User Flow

  1. Login → Google OAuth (grants gmail.send permission)
  2. Upload Resume → PDF processed in-memory
  3. Select Faculty → Choose from: Med Sci, Health Sci, CS, Engineering
  4. View Matches → See 12 professors with compatibility scores
  5. Review Drafts → Edit/delete personalized emails
  6. Send → Batch send via Gmail API

Deployment

Web (Vercel)

cd web
vercel deploy

Agent Service (Railway)

cd agent-service
railway up

Data Pipeline (One-Time Local Execution)

Run once locally to populate the database:

cd data-pipeline
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python run_pipeline.py --url "FACULTY_PAGE_URL" --university "UNIVERSITY_NAME" --faculty "FACULTY_NAME"

Development

  • Web: Use Turbopack for fast development (npm run dev)
  • Agent: Hot reload enabled with uvicorn --reload (use virtual environment)
  • Pipeline: Run once locally to seed database

API Endpoints

Web App

  • POST /api/agent/match - Match student with professors
  • POST /api/agent/draft - Generate email drafts
  • POST /api/gmail/send - Send email via Gmail
  • GET /api/professors - Query professor database

Agent Service

  • POST /draft - Run drafter agent
  • GET /health - Health check
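A rough way to exercise the agent service from the command line. The health check matches the endpoint above; the /draft request body and the API key header name are assumptions, so check app/main.py for the actual contract:

# Health check
curl http://localhost:8000/health

# Drafter agent (hypothetical payload and header name; adjust to the real schema)
curl -X POST http://localhost:8000/draft \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your_agent_service_api_key" \
  -d '{"student_profile": "...", "professor": "..."}'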

Security Features

  • API key authentication for agent service
  • In-memory PDF processing (no storage)
  • Server-side only secrets (no NEXT_PUBLIC_ vars)
  • HTTPS only in production

About

ConU Hacks 2026
