Stories by Graduates Rising in Information and Data Science on Medium

ResumeGenie: Building an Intelligent Resume Optimization Engine

Graduates Rising in Information and Data Science — Mon, 28 Apr 2025 20:52:15 GMT

By Aryan Vats, Nidhi Choudhary, Eben Gunadi, Muqi Zhang and Justin Chen

LangChain-Powered Resume Creation with LLMs, RAG, and Streamlit

Introduction

In a hyper-competitive job market, securing interview opportunities increasingly relies on crafting highly personalized, ATS-optimized resumes. Traditional manual tailoring is slow and inefficient. ResumeGenie presents an intelligent, automated system that optimizes resumes using a sophisticated architecture of parsing pipelines, retrieval-augmented generation (RAG), and LLM-driven personalization.

In this technical deep dive, we’ll explore how ResumeGenie is architected, how each component operates, and how various technologies come together to streamline resume optimization.

Problem Statement

Most resumes fail at two critical checkpoints:

ATS Systems: Automated filters reject resumes lacking proper structure and relevant keywords.
Human Recruiters: Generic resumes don’t showcase alignment to specific job descriptions.

Candidates need a scalable, intelligent method to tailor resumes across multiple applications, without sacrificing quality or personalization.

Solution Architecture Overview

Figure 1: Solution Architecture

Key pipeline stages:

Candidate Upload
Parsing Layer
Matching & Ranking via ChromaDB
Resume Generation with LangChain and LLMs
Output PDF Conversion

Each stage is modularized for maximum flexibility and traceability.

1. Candidate Upload and Parsing Layer: Converting PDFs into Structured JSON

Figure 2: Parsing PDF based Resume

The candidate first uploads their raw resume and the corresponding job description through our web interface.

On the backend, we immediately capture these files and start parsing:

PDF Extraction: Using PDFPlumber, we extract raw text from user-uploaded PDF resumes.
Structured Data Modeling: We use Pydantic to specify a desired JSON schema (with fields like education, experience, skills).
LLM-Assisted Refinement: The extracted text and JSON schema are then fed into ChatGPT-4o, which formats the parsed data neatly into a clean, machine-readable JSON object.

This multi-step parsing ensures high fidelity in transferring information from PDFs to structured data, making subsequent processing seamless.

2. Retrieval-Augmented Generation (RAG) with ChromaDB

Figure 3: Vector Storage and Retrival with ChromaDB

After parsing, we apply RAG to ground our LLM outputs:

Knowledge Base Construction: We build a knowledge base of successful, domain-specific resumes.
Vectorization and Storage: These documents are converted into embeddings and stored in ChromaDB, an open-source vector database.
Retriever Query: Based on the candidate’s extracted field, we query ChromaDB to retrieve the top two most relevant resume examples using similarity search.

This approach reduces hallucinations by grounding the LLM’s generation on high-quality, retrieved examples.

3. LangChain Workflow for Resume Generation

Figure 4: Langchain Workflow to generate Resume in JSON

The full transformation from raw resume to ATS-ready JSON document unfolds as follows:

Input Chain: Ingests the candidate’s parsed resume JSON and applies a custom PromptTemplate to extract the applicant's major or field of expertise.
Retriever Query Chain: Powered by the field information, a retriever queries ChromaDB to fetch two top-matching resumes.
Few-Shot Prompt Assembly: Constructs a FewShotChatMessagePromptTemplate by inserting the retrieved examples as demonstrations.
Resume Generation Chain: This enriched prompt, including system instructions, few-shot examples, and the raw resume, is passed into a ChatOpenAI model to generate an ATS-optimized resume.
Output Parsing Chain: The LLM’s response is validated and cleaned by an OutputParser, ensuring it adheres to the required JSON structure.

This pipeline dramatically enhances the resume’s quality, personalization, and ATS compatibility.

4. JSON to Downloadable PDF Resume

The structured JSON resume is mapped back into a polished, professional-looking PDF document via our custom rendering pipeline built in Streamlit. It uses weasyprint and jinja2 as the templating engine with a customizable HTML template. The final product is fully ATS-compliant and visually appealing, ready for immediate job applications.

Technology Stack

Figure 5: Overview of Technologies Used in the ResumeGenie System Architecture

Figure 5: UI for ResumeGenie deployed at Streamlit

ResumeGenie empowers candidates with intelligent, automated, and personalized resume crafting — unlocking greater opportunities in today’s competitive landscape.

About the Team

ResumeGenie was built by Aryan Vats, Nidhi Choudhary, Eben Gunadi Muqi Zhang, Justin Chen who are students at GRIDS USC passionate about applying cutting-edge AI to real-world challenges.

From Hours to Moments: Transforming Video Search with Multimodal Retrieval

Graduates Rising in Information and Data Science — Mon, 28 Apr 2025 16:56:03 GMT

By Muhammad Adil Fayyaz, Tanmay Surendra Thakare, and Harshita Dooja Poojary

Image Source: https://www.searchenginejournal.com/how-reverse-video-search/464654/

Imagine sitting through a two-hour lecture just to revisit a single concept that was discussed for five minutes. For many students, this isn’t just frustrating; it’s a real barrier to efficient learning. Traditional video platforms and search tools often lack the semantic depth to accurately pinpoint where specific topics are explained, especially when information is conveyed both visually (slides, diagrams) and verbally (spoken words).

To tackle this challenge, we set out to build an intelligent multimodal search system that allows students to search within videos, not just using text, but also leveraging visual content. Our project aims to make educational videos truly searchable by creating a Unified Semantic Space for Multimodal Retrieval.

Why This Problem Matters (Especially for Students)

Students today rely heavily on recorded lectures and online courses. However, current search systems are typically unimodal — they either look at the transcript (if available) or, worse, only search titles and descriptions.

No Good Open-Source Solutions: Proprietary models offer impressive video search capabilities, but being paid services, they limit accessibility for students.
Lost Time = Lost Learning: Without intelligent search, students waste valuable time scrubbing through long videos, often leading to frustration and missed learning opportunities.
Multimodal Understanding is Key: In real educational content, meaning is communicated both through speech and visuals. Ignoring one mode leads to incomplete search results.

Our goal was simple: help students find exactly what they need inside a lecture, fast and free.

The Technology Behind It —

Model Selection: Why We Chose BridgeTower Over CLIP

We evaluated different models and ultimately chose BridgeTower, a powerful multimodal model that fuses visual and textual data early in its architecture. Compared to models like CLIP (which perform late fusion) or LXMERT (which are heavyweight), BridgeTower offers:

Efficient early fusion of text and images.
Fewer visual tokens, reducing computational cost.
Faster training and inference with a strong speed-accuracy tradeoff.
Unified transformer layers, meaning lower memory requirements.

This made it ideal for scaling across large lecture datasets without overwhelming our resources.

How It Works: Our Processing Pipeline

Architecture Overview

1. Video Collection

We downloaded a full YouTube playlist (NPTEL IIT Guwahati — AI courses) as MP4 files.

def download_youtube_video(url: str, outdir: Path) -> Path:
    """Download YouTube video via yt-dlp. Returns the path to the saved .mp4."""
    outdir.mkdir(parents=True, exist_ok=True)
    ydl_opts = {
        'format': 'best',
        'outtmpl': str(outdir / '%(id)s.%(ext)s'),
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=True)
    # yt-dlp will save as .mp4 (or .mkv, etc.)
    filename = outdir / f"{info['id']}.{info['ext']}"
    return filenamep

2. Frame Sampling & Transcripts

We extracted all frames and generated timestamped transcripts using OpenAI’s Whisper model. Two key optimizations improved our quality:

Added ±7 words around each frame for better transcript context.
Linked skipped-frame transcripts to the nearest sampled frame.

def download_transcript(video_id: str, outdir: Path) -> Path:
    """Fetch transcript via youtube_transcript_api and save as .vtt."""
    outdir.mkdir(parents=True, exist_ok=True)
    transcript = YouTubeTranscriptApi.get_transcript(video_id)
    vtt_file = outdir / f"{video_id}.vtt"
    with open(vtt_file, "w", encoding="utf-8") as f:
        f.write("WEBVTT\n\n")
        for seg in transcript:
            start = seg['start']
            end = start + seg['duration']
            f.write(f"{format_time(start)} --> {format_time(end)}\n")
            f.write(seg['text'] + "\n\n")
    return vtt_file

def preprocess_text(text: str) -> str:
    """Lowercase, strip punctuation/brackets, remove stopwords."""
    text = re.sub(r'\[.*?\]', '', text.lower())
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = word_tokenize(text)
    stops = set(stopwords.words('english'))
    return ' '.join(w for w in tokens if w not in stops)

3. Metadata Creation

Each frame was tagged with:

Image path
Video ID
Mid-time (timestamp)
Cleaned transcript snippet

def process_videos(video_urls: list, output_root: Path = Path("/content/drive.../path")):
    for url in video_urls:
        vid_id = extract_video_id(url)
        workdir = output_root / vid_id
        vids_dir = workdir / "video"
        trans_dir = workdir / "transcript"
        frames_dir = workdir / "frames"

        # 1) Download video
        video_path = download_youtube_video(url, vids_dir)

        # 2) Download transcript
        transcript_path = download_transcript(vid_id, trans_dir)

        # 3) Extract frames + metadata
        metas = extract_frames_and_metadata(video_path, transcript_path, frames_dir)

        # 4) Save per-video metadata.json
        out = {
            "video_url": url,
            "video_id": vid_id,
            "video_path": str(video_path),
            "transcript_path": str(transcript_path),
            "frames": metas
        }
        with open(workdir / "metadata.json", "w") as f:
            json.dump(out, f, indent=2)

        print(f"[✔] Processed {vid_id}: {len(metas)} frames extracted.")

4. Embedding Generation

We passed the frames and their associated transcript snippets into BridgeTower to generate a 512-dimensional multimodal embedding for each video segment.

processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
model     = BridgeTowerForContrastiveLearning.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
model.eval()  # make sure we’re in eval mode

def extract_frame_embeddings(metadata):
    embeddings = {}
    for entry in metadata:
        image = Image.open(entry['extracted_frame_path'])
        text  = entry['transcript']
        encoding = processor(image, text=text, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**encoding)
        # unchanged: use cross_embeds pooled output
        image_embeddings = outputs.cross_embeds.squeeze().cpu().numpy()
        embeddings[entry['video_segment_id']] = image_embeddings
    return embeddings

5. Indexing with LanceDB

Using LanceDB’s IVF_HNSW_PQ indexing, we stored the embeddings for scalable and fast approximate nearest neighbor search.

import lancedb
import pyarrow as pa
db = lancedb.connect("lancedb_data")  # This will create a database in the "lancedb_data" directory
table = db.create_table(
    "frame_embeddings",
    schema=pa.schema([
        ("video_segment_id", pa.int32()),
        ("vector", pa.list_(pa.float32(), 512)),
        ("extracted_frame_path", pa.string()),
        ("transcript", pa.string()),
        ("video_path", pa.string()),
        ("mid_time_ms", pa.float64()),
        ("associated_transcript_times", pa.list_(pa.list_(pa.float64()))),  # List of tuples as list of float64
    ]),
    mode="create"
)

records = []
for meta in extracted_metadatas:
    segment_id = meta["video_segment_id"]
    if segment_id in loaded_embeddings:
        records.append({
            "video_segment_id": segment_id,
            "vector": list(loaded_embeddings[segment_id]),
            "extracted_frame_path": meta["extracted_frame_path"],
            "transcript": meta["transcript"],
            "video_path": meta["video_path"],
            "mid_time_ms": meta["mid_time_ms"],
            "associated_transcript_times": [
                [start, end] for start, end in meta.get("associated_transcript_times", [])
            ],
        })
table.add(records)
table.create_index(index_type="IVF_HNSW_SQ")

6. Retrieval

At search time:

User query is embedded using BridgeTower.
Top 5 closest matches are retrieved from LanceDB.
Clicking a result jumps straight to the relevant moment in the video.

from PIL import Image

dummy_img = Image.new("RGB", (224, 224), (255, 255, 255))
from transformers import BridgeTowerForContrastiveLearning
processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
def bt_embedding_from_bridgetower(prompt, base64_image=None):
    model2 = BridgeTowerForContrastiveLearning.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
    inputs = {"text": [prompt], "images": dummy_img}

    if base64_image:
        image_bytes = base64.b64decode(base64_image)
        image = Image.open(BytesIO(image_bytes)).convert("RGB")
        inputs["images"] = [image]

    inputs = processor(**inputs, padding=True, return_tensors="pt")

    with torch.no_grad():
        outputs = model2(**inputs)

    text_embeddings = outputs.text_embeds.squeeze().tolist()

    return text_embeddings


class BridgeTowerEmbeddings(BaseModel, Embeddings):
    """ BridgeTower embedding model """

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Embed a list of documents using BridgeTower.
        Args:
            texts: The list of texts to embed.
        Returns:
            List of embeddings, one for each text.
        """
        embeddings = []
        for text in texts:
            embedding = bt_embedding_from_bridgetower(text, None)
            embeddings.append(embedding)
        return embeddings

    def embed_query(self, text: str) -> List[float]:
        """Embed a query using BridgeTower.
        Args:
            text: The text to embed.
        Returns:
            Embeddings for the text.
        """
        return self.embed_documents([text])[0]

bt_embeddings = BridgeTowerEmbeddings()

query_text = "intelligent behavior"  # Example
query_embedding = bt_embeddings.embed_query(query_text)

# Perform similarity search
results = table.search(query_embedding, vector_column_name="vector").limit(5).nprobes(2).refine_factor(5).to_list()

# Print results
for result in results:
    print(result)

Frontend: Seamless Student Experience

We built a lightweight Streamlit app where students can:

Choose a video from a list.
Enter a text query (e.g., “Bayesian Networks”).
Instantly see top-matching timestamps.
Click to jump directly to the right part of the video.

This transforms hours of manual browsing into seconds of smart retrieval.

How Well Does It Work?

We evaluated the system using metrics like Precision@5, Recall@5, and Mean Reciprocal Rank (MRR):

Metric Score

Average Precision@5: 0.52

Average Recall@5: 0.49

MRR: 0.67

These results show that, on average, students are likely to find the right content within the top few results, a huge improvement over random scrubbing!

Challenges We Faced

Fine-Tuning the Model:

We initially wanted to fine-tune BridgeTower (using LoRA) to better adapt it to lecture-specific domains. However, GPU memory limitations prevented us from doing so.

Ambiguous Frames:

Lectures often have silent frames (e.g., when professors are writing on boards or switching slides), making it tricky to align frames with meaningful transcript snippets.

Contrastive Learning:

Using BridgeTower for contrastive learning across ambiguous frames was another challenge, given the variability in lecture visuals.

Future Directions

Our work is just the beginning. Some promising next steps include:

Separate Indexes: Create distinct indexes for visual and textual embeddings to improve search granularity.
Fine-Tuning: Revisit LoRA-based fine-tuning once more compute is available.
Advanced Metadata Extraction: Use OCR for reading slide text and Named Entity Recognition (NER) to enhance query alignment.
External Knowledge Integration: Pull in Wikipedia or ArXiv summaries for expanded learning within the app.

Conclusion

By combining cutting-edge multimodal retrieval techniques with accessible open-source tools, we took a big step toward making lecture content truly searchable and student-friendly. No more endless scrubbing through videos; just search, click, and learn.

If you’re passionate about education, AI, or just hate wasting time finding the right clip in a 2-hour lecture, we hope this inspires you to imagine smarter, more human-centric learning tools for the future.

Contributors

Muhammad Adil Fayyaz, Tanmay Surendra Thakare, and Harshita Dooja Poojary