Inspiration

Students hunt for faculty whose research, courses, or open positions match their passions, but the information is scattered across directory listings, personal websites, and lab sites. We want students to be able to paste a link to their dream job listing and let us do the rest, pointing them to the opportunities at their own school that lead there. The goal is a one-stop knowledge graph that surfaces who works on what, with direct links to emails, lab pages, and recent publications, so students can discover mentors in seconds.

What it does

CLI + JSON knowledge base that:

  • Crawls public “People” pages (faculty, grads, post-docs, researchers) and each person’s Campus Directory profile.
  • Extracts name, title, department, email, websites, and rich extras (research interests, courses, awards).
  • Enriches each profile with research topics from the Semantic Scholar API (work in progress).
  • Embeds the combined text into 1,536-dim OpenAI vectors so downstream apps (chatbots, semantic search, match-making dashboards) can query by meaning.
  • Persists everything to data/records.json, ready for search, ranking, or feeding into a vector DB.
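
As a rough illustration of what "query by meaning" could look like downstream, the sketch below ranks stored records by cosine similarity against a query vector. The record shape (a `name` plus an `embedding` array) and the helper names `cosineSimilarity` and `topMatches` are assumptions made for this example, not the actual layout of data/records.json. Toy 3-dimensional vectors stand in for the real 1,536-dim embeddings.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat it as matching nothing.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k records whose embeddings are closest to the query embedding.
function topMatches(records, queryEmbedding, k = 3) {
  return records
    .map((record) => ({
      record,
      score: cosineSimilarity(record.embedding, queryEmbedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Hypothetical records; real ones would be read from data/records.json.
const records = [
  { name: 'Person A', embedding: [1, 0, 0] },
  { name: 'Person B', embedding: [0.6, 0.8, 0] },
  { name: 'Person C', embedding: [0, 0, 1] },
];

const matches = topMatches(records, [1, 0, 0], 2);
console.log(matches.map((m) => `${m.record.name}: ${m.score.toFixed(2)}`));
// prints [ 'Person A: 1.00', 'Person B: 0.60' ]
```

A linear scan like this is fine for one department's worth of profiles; a vector database would only start to matter at a much larger scale.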

How we built it

  1. Node.js script (crawl.js) orchestrates the crawl.
  2. cheerio parses the HTML DOM for fast, jQuery-style scraping.
  3. node-fetch retrieves seed pages and profile URLs (with automatic HTTPS→HTTP fallback).
  4. Semantic Scholar API returns top topics for each author.
  5. OpenAI Embeddings API converts profile + topic + bio into semantic vectors.
  6. Output is stored locally; future steps can upsert into PostgreSQL + pgvector or any vector store.
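
Step 5 can be sketched as follows: combine the scraped fields into one block of text, then wrap it in a request body for the OpenAI embeddings endpoint. The profile field names and both helper functions are hypothetical, and the model name is an assumption (`text-embedding-3-small` returns 1,536-dim vectors by default, which matches the dimension mentioned above, but the project may use a different model). The sketch only builds the payload and makes no network call.

```javascript
// Join the available profile fields into a single text to embed.
// Field names are illustrative, not taken from crawl.js.
function buildEmbeddingInput(profile) {
  const parts = [
    profile.name,
    profile.title,
    profile.department,
    (profile.topics || []).join(', '),
    profile.bio,
  ];
  // Drop missing fields, then collapse whitespace left over from scraped HTML.
  return parts
    .filter((part) => typeof part === 'string' && part.trim() !== '')
    .map((part) => part.replace(/\s+/g, ' ').trim())
    .join('\n');
}

// Request body for POST https://api.openai.com/v1/embeddings.
// The default model here is an assumption, chosen for its 1,536-dim output.
function buildEmbeddingRequest(profile, model = 'text-embedding-3-small') {
  return { model, input: buildEmbeddingInput(profile) };
}

// Placeholder profile for demonstration only.
const example = buildEmbeddingRequest({
  name: 'Ada Example',
  title: 'Professor',
  department: 'Computer Science and Engineering',
  topics: ['graph algorithms', 'databases'],
  bio: 'Works on   query\n optimization.',
});
console.log(example.input);
```

Keeping the text assembly separate from the API call makes it easy to inspect exactly what is embedded for each person.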

Challenges we ran into

  • Semantic Scholar API: we ran out of time to fully debug the integration. The rate limit on unauthenticated requests is probably the cause, and our request for API access was still awaiting approval.
  • Directory downtime: occasional DNS hiccups required an HTTP fallback and retry logic.
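
The fallback-and-retry idea can be sketched roughly as below. This is not the code from crawl.js: the function names, the retry count, and the backoff delays are all assumptions. It uses the `fetch` built into Node 18+ by default, and `node-fetch` can be passed in as `fetchImpl` instead.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Swap an https:// prefix for http://; other URLs are returned unchanged.
function toHttp(url) {
  return url.startsWith('https://')
    ? 'http://' + url.slice('https://'.length)
    : url;
}

// Fetch a page as text. Each round tries the URL as given, then its
// plain-HTTP variant; failed rounds are retried with exponential backoff.
async function fetchText(
  url,
  { fetchImpl = globalThis.fetch, retries = 3, baseDelayMs = 500 } = {}
) {
  let lastError;
  for (let attempt = 0; attempt < retries; attempt++) {
    // The Set removes the duplicate when the URL is already plain HTTP.
    for (const candidate of new Set([url, toHttp(url)])) {
      try {
        const res = await fetchImpl(candidate);
        if (!res.ok) throw new Error(`HTTP ${res.status} for ${candidate}`);
        return await res.text();
      } catch (err) {
        lastError = err;
      }
    }
    // Wait before the next round: baseDelayMs, then 2x, then 4x, ...
    if (attempt < retries - 1) await sleep(baseDelayMs * 2 ** attempt);
  }
  throw lastError ?? new Error('No fetch attempts were made');
}
```

Passing the fetch function in as an option also makes the logic testable offline with a stub that simulates a DNS failure.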

What we learned

  • Different colleges have vastly different HTML and front ends, so we had to focus on UCSC alone.
  • Departments within one college also differ widely, so we narrowed the scope further to the Engineering department.

What's next for College Matcher

  • Get the Semantic Scholar integration fully working so students can see each faculty member's research.
  • Expand to any college with an intelligent scraper that can handle unfamiliar websites.
  • Build a UI where students can type in the careers they want or paste in job listings.
