<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Graduates Rising in Information and Data Science on Medium]]></title>
        <description><![CDATA[Stories by Graduates Rising in Information and Data Science on Medium]]></description>
        <link>https://medium.com/@grids_61465?source=rss-01e61bc624f5------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/0*za7EvhzxoBEVLsoh</url>
            <title>Stories by Graduates Rising in Information and Data Science on Medium</title>
            <link>https://medium.com/@grids_61465?source=rss-01e61bc624f5------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sat, 06 Jun 2026 22:13:39 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@grids_61465/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[ResumeGenie: Building an Intelligent Resume Optimization Engine]]></title>
            <link>https://medium.com/@grids_61465/resumegenie-building-an-intelligent-resume-optimization-engine-d02a58a7a163?source=rss-01e61bc624f5------2</link>
            <guid isPermaLink="false">https://medium.com/p/d02a58a7a163</guid>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[streamlit]]></category>
            <category><![CDATA[retrieval-augmented-gen]]></category>
            <category><![CDATA[pydantic]]></category>
            <category><![CDATA[langchain]]></category>
            <dc:creator><![CDATA[Graduates Rising in Information and Data Science]]></dc:creator>
            <pubDate>Mon, 28 Apr 2025 20:52:15 GMT</pubDate>
            <atom:updated>2025-04-28T21:34:20.607Z</atom:updated>
            <content:encoded><![CDATA[<p><strong>By Aryan Vats, Nidhi Choudhary, Eben Gunadi, Muqi Zhang and Justin Chen</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*LKQJ8iCTRk4IimhXx4ILaQ.png" /></figure><h3>LangChain-Powered Resume Creation with LLMs, RAG, and Streamlit</h3><p><strong>Introduction</strong></p><p>In a hyper-competitive job market, securing interview opportunities increasingly relies on crafting highly personalized, ATS-optimized resumes. Traditional manual tailoring is slow and inefficient. ResumeGenie presents an intelligent, automated system that optimizes resumes using a sophisticated architecture of parsing pipelines, retrieval-augmented generation (RAG), and LLM-driven personalization.</p><p>In this technical deep dive, we’ll explore how ResumeGenie is architected, how each component operates, and how various technologies come together to streamline resume optimization.</p><p><strong>Problem Statement</strong></p><p>Most resumes fail at two critical checkpoints:</p><ul><li><strong>ATS Systems</strong>: Automated filters reject resumes lacking proper structure and relevant keywords.</li><li><strong>Human Recruiters</strong>: Generic resumes don’t showcase alignment to specific job descriptions.</li></ul><p>Candidates need a scalable, intelligent method to tailor resumes across multiple applications, without sacrificing quality or personalization.</p><p><strong>Solution Architecture Overview</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*C0HIJVo6qcRTrJUJxNydug.png" /><figcaption><strong>Figure 1: Solution Architecture</strong></figcaption></figure><p>Key pipeline stages:</p><ul><li><strong>Candidate Upload</strong></li><li><strong>Parsing Layer</strong></li><li><strong>Matching &amp; Ranking via ChromaDB</strong></li><li><strong>Resume Generation with LangChain and LLMs</strong></li><li><strong>Output PDF Conversion</strong></li></ul><p>Each stage is modularized for maximum 
flexibility and traceability.</p><p><strong>1. Candidate Upload and Parsing Layer: Converting PDFs into Structured JSON</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/507/1*Qjl5_O1LPpZxKV5tyR9afw.png" /><figcaption>Figure 2: Parsing a PDF-based Resume</figcaption></figure><p>The candidate first uploads their raw resume and the corresponding job description through our web interface.</p><p>On the backend, we immediately capture these files and start parsing:</p><ul><li><strong>PDF Extraction</strong>: Using pdfplumber, we extract raw text from user-uploaded PDF resumes.</li><li><strong>Structured Data Modeling</strong>: We use Pydantic to specify the desired JSON schema (with fields like education, experience, and skills).</li><li><strong>LLM-Assisted Refinement</strong>: The extracted text and JSON schema are then fed into GPT-4o, which formats the parsed data into a clean, machine-readable JSON object.</li></ul><p>This multi-step parsing ensures high fidelity in transferring information from PDFs to structured data, making subsequent processing seamless.</p><p><strong>2. 
Retrieval-Augmented Generation (RAG) with ChromaDB</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/470/1*hxhAFzNxamFo-D6CuEbqhw.png" /><figcaption>Figure 3: Vector Storage and Retrieval with ChromaDB</figcaption></figure><p>After parsing, we apply RAG to ground our LLM outputs:</p><ul><li><strong>Knowledge Base Construction</strong>: We build a knowledge base of successful, domain-specific resumes.</li><li><strong>Vectorization and Storage</strong>: These documents are converted into embeddings and stored in ChromaDB, an open-source vector database.</li><li><strong>Retriever Query</strong>: Based on the candidate’s extracted field, we query ChromaDB to retrieve the top two most relevant resume examples using similarity search.</li></ul><p>This approach reduces hallucinations by grounding the LLM’s generation in high-quality, retrieved examples.</p><p><strong>3. LangChain Workflow for Resume Generation</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*s8xfg2TRSjmRfGsmx7d5Rg.png" /><figcaption>Figure 4: LangChain Workflow to Generate the Resume in JSON</figcaption></figure><p>The full transformation from raw resume to ATS-ready JSON document unfolds as follows:</p><ul><li><strong>Input Chain</strong>: Ingests the candidate’s parsed resume JSON and applies a custom PromptTemplate to extract the applicant&#39;s major or field of expertise.</li><li><strong>Retriever Query Chain</strong>: Using the extracted field, a retriever queries ChromaDB to fetch the two top-matching resumes.</li><li><strong>Few-Shot Prompt Assembly</strong>: Constructs a FewShotChatMessagePromptTemplate by inserting the retrieved examples as demonstrations.</li><li><strong>Resume Generation Chain</strong>: This enriched prompt, including system instructions, few-shot examples, and the raw resume, is passed into a ChatOpenAI model to generate an ATS-optimized resume.</li><li><strong>Output Parsing Chain</strong>: The LLM’s response is validated 
and cleaned by an OutputParser, ensuring it adheres to the required JSON structure.</li></ul><p>This pipeline improves the resume’s quality, personalization, and ATS compatibility.</p><p><strong>4. JSON to Downloadable PDF Resume</strong></p><p>The structured JSON resume is mapped back into a polished, professional-looking PDF document via our custom rendering pipeline built in Streamlit. It uses Jinja2 as the templating engine with a customizable HTML template, and WeasyPrint to render the resulting HTML to PDF. The final product is ATS-compliant and visually appealing, ready for immediate job applications.</p><p><strong>Technology Stack</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/411/1*m81EO2BXs8e4t2WIKGGGYA.png" /><figcaption>Figure 5: Overview of Technologies Used in the ResumeGenie System Architecture</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nubmyf8S-kZ3J6AcmOfONQ.png" /><figcaption><strong>Figure 6: UI for ResumeGenie deployed on Streamlit</strong></figcaption></figure><p>ResumeGenie empowers candidates with intelligent, automated, and personalized resume crafting, unlocking greater opportunities in today’s competitive landscape.</p><p><strong>About the Team</strong></p><p>ResumeGenie was built by <a href="https://www.linkedin.com/in/aryan-vats/">Aryan Vats</a>, <a href="https://www.linkedin.com/in/nidhi-choudhary7">Nidhi Choudhary</a>, <a href="https://www.linkedin.com/in/eben-gunadi-048898141">Eben Gunadi</a>, <a href="https://www.linkedin.com/in/muqiz">Muqi Zhang</a>, and Justin Chen, students at GRIDS USC who are passionate about applying cutting-edge AI to real-world challenges.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=d02a58a7a163" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[From Hours to Moments: Transforming Video Search with Multimodal Retrieval]]></title>
            <link>https://medium.com/@grids_61465/from-hours-to-moments-transforming-video-search-with-multimodal-retrieval-e87da4dacde8?source=rss-01e61bc624f5------2</link>
            <guid isPermaLink="false">https://medium.com/p/e87da4dacde8</guid>
            <category><![CDATA[search]]></category>
            <category><![CDATA[multimodal]]></category>
            <category><![CDATA[students]]></category>
            <category><![CDATA[vidéo]]></category>
            <category><![CDATA[information-retrieval]]></category>
            <dc:creator><![CDATA[Graduates Rising in Information and Data Science]]></dc:creator>
            <pubDate>Mon, 28 Apr 2025 16:56:03 GMT</pubDate>
            <atom:updated>2025-04-28T19:26:21.549Z</atom:updated>
            <content:encoded><![CDATA[<p><strong>By Muhammad Adil Fayyaz, Tanmay Surendra Thakare, and Harshita Dooja Poojary</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*S5R01VhTf6SIRiY2sk1RUA.png" /><figcaption>Image Source: <a href="https://www.searchenginejournal.com/how-reverse-video-search/464654/">https://www.searchenginejournal.com/how-reverse-video-search/464654/</a></figcaption></figure><p>Imagine sitting through a two-hour lecture just to revisit a single concept that was discussed for five minutes. For many students, this isn’t just frustrating; it’s a real barrier to efficient learning. Traditional video platforms and search tools often lack the semantic depth to accurately pinpoint where specific topics are explained, especially when information is conveyed both <em>visually</em> (slides, diagrams) and <em>verbally</em> (spoken words).</p><p>To tackle this challenge, we set out to build an <strong>intelligent multimodal search system</strong> that allows students to search <em>within</em> videos, not just using text, but also leveraging visual content. Our project aims to make educational videos truly searchable by creating a <strong>Unified Semantic Space for Multimodal Retrieval</strong>.</p><h3>Why This Problem Matters (Especially for Students)</h3><p>Students today rely heavily on recorded lectures and online courses. 
However, <strong>current search systems</strong> are typically <em>unimodal</em>: they either search the transcript (if available) or, worse, only titles and descriptions.</p><ul><li><strong>No Good Open-Source Solutions</strong>: Proprietary models offer impressive video search capabilities, but as paid services they are out of reach for many students.</li><li><strong>Lost Time = Lost Learning</strong>: Without intelligent search, students waste valuable time scrubbing through long videos, often leading to frustration and missed learning opportunities.</li><li><strong>Multimodal Understanding is Key</strong>: In real educational content, meaning is communicated through both speech <em>and</em> visuals. Ignoring one mode leads to incomplete search results.</li></ul><p>Our goal was simple: <strong>help students find exactly what they need inside a lecture, fast and free.</strong></p><h3>The Technology Behind It</h3><h3>Model Selection: Why We Chose BridgeTower Over CLIP</h3><p>We evaluated different models and ultimately chose <strong>BridgeTower</strong>, a powerful <strong>multimodal model</strong> that fuses visual and textual data early in its architecture. Compared to models like CLIP (which performs late fusion) or LXMERT (which is heavyweight), BridgeTower offers:</p><ul><li><strong>Efficient early fusion</strong> of text and images.</li><li><strong>Fewer visual tokens</strong>, reducing computational cost.</li><li><strong>Faster training and inference</strong> with a strong speed-accuracy tradeoff.</li><li><strong>Unified transformer layers</strong>, meaning lower memory requirements.</li></ul><p>This made it ideal for scaling across large lecture datasets without overwhelming our resources.</p><h3>How It Works: Our Processing Pipeline</h3><h4>Architecture Overview</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*kqz6WQWbpWm94edv" /></figure><p><strong>1. 
Video Collection</strong></p><p>We downloaded a full YouTube playlist (NPTEL IIT Guwahati, AI courses) as MP4 files.</p><pre>from pathlib import Path<br><br>import yt_dlp<br><br>def download_youtube_video(url: str, outdir: Path) -&gt; Path:<br>    &quot;&quot;&quot;Download YouTube video via yt-dlp. Returns the path to the saved .mp4.&quot;&quot;&quot;<br>    outdir.mkdir(parents=True, exist_ok=True)<br>    ydl_opts = {<br>        &#39;format&#39;: &#39;best&#39;,<br>        &#39;outtmpl&#39;: str(outdir / &#39;%(id)s.%(ext)s&#39;),<br>    }<br>    with yt_dlp.YoutubeDL(ydl_opts) as ydl:<br>        info = ydl.extract_info(url, download=True)<br>    # yt-dlp will save as &lt;video_id&gt;.mp4 (or .mkv, etc.)<br>    filename = outdir / f&quot;{info[&#39;id&#39;]}.{info[&#39;ext&#39;]}&quot;<br>    return filename</pre><p><strong>2. Frame Sampling &amp; Transcripts</strong></p><p>We extracted all frames and generated timestamped transcripts using OpenAI’s Whisper model. Two key optimizations improved our quality:</p><ul><li>Added ±7 words around each frame for better transcript context.</li><li>Linked skipped-frame transcripts to the nearest sampled frame.</li></ul><pre>import re<br>import string<br><br>from nltk.corpus import stopwords<br>from nltk.tokenize import word_tokenize<br>from youtube_transcript_api import YouTubeTranscriptApi<br><br>def download_transcript(video_id: str, outdir: Path) -&gt; Path:<br>    &quot;&quot;&quot;Fetch transcript via youtube_transcript_api and save as .vtt.&quot;&quot;&quot;<br>    outdir.mkdir(parents=True, exist_ok=True)<br>    transcript = YouTubeTranscriptApi.get_transcript(video_id)<br>    vtt_file = outdir / f&quot;{video_id}.vtt&quot;<br>    with open(vtt_file, &quot;w&quot;, encoding=&quot;utf-8&quot;) as f:<br>        f.write(&quot;WEBVTT\n\n&quot;)<br>        for seg in transcript:<br>            start = seg[&#39;start&#39;]<br>            end = start + seg[&#39;duration&#39;]<br>            f.write(f&quot;{format_time(start)} --&gt; {format_time(end)}\n&quot;)<br>            f.write(seg[&#39;text&#39;] + &quot;\n\n&quot;)<br>    return vtt_file<br><br>def preprocess_text(text: str) -&gt; str:<br>    &quot;&quot;&quot;Lowercase, strip 
punctuation/brackets, remove stopwords.&quot;&quot;&quot;<br>    text = re.sub(r&#39;\[.*?\]&#39;, &#39;&#39;, text.lower())<br>    text = text.translate(str.maketrans(&#39;&#39;, &#39;&#39;, string.punctuation))<br>    tokens = word_tokenize(text)<br>    stops = set(stopwords.words(&#39;english&#39;))<br>    return &#39; &#39;.join(w for w in tokens if w not in stops)</pre><p><strong>3. Metadata Creation</strong></p><p>Each frame was tagged with:</p><ul><li>Image path</li><li>Video ID</li><li>Mid-time (timestamp)</li><li>Cleaned transcript snippet</li></ul><pre>def process_videos(video_urls: list, output_root: Path = Path(&quot;/content/drive.../path&quot;)):<br>    for url in video_urls:<br>        vid_id = extract_video_id(url)<br>        workdir = output_root / vid_id<br>        vids_dir = workdir / &quot;video&quot;<br>        trans_dir = workdir / &quot;transcript&quot;<br>        frames_dir = workdir / &quot;frames&quot;<br><br>        # 1) Download video<br>        video_path = download_youtube_video(url, vids_dir)<br><br>        # 2) Download transcript<br>        transcript_path = download_transcript(vid_id, trans_dir)<br><br>        # 3) Extract frames + metadata<br>        metas = extract_frames_and_metadata(video_path, transcript_path, frames_dir)<br><br>        # 4) Save per-video metadata.json<br>        out = {<br>            &quot;video_url&quot;: url,<br>            &quot;video_id&quot;: vid_id,<br>            &quot;video_path&quot;: str(video_path),<br>            &quot;transcript_path&quot;: str(transcript_path),<br>            &quot;frames&quot;: metas<br>        }<br>        with open(workdir / &quot;metadata.json&quot;, &quot;w&quot;) as f:<br>            json.dump(out, f, indent=2)<br><br>        print(f&quot;[✔] Processed {vid_id}: {len(metas)} frames extracted.&quot;)</pre><p><strong>4. 
Embedding Generation</strong></p><p>We passed the frames and their associated transcript snippets into <strong>BridgeTower</strong> to generate a <strong>512-dimensional multimodal embedding</strong> for each video segment.</p><pre>import torch<br>from PIL import Image<br>from transformers import BridgeTowerForContrastiveLearning, BridgeTowerProcessor<br><br>processor = BridgeTowerProcessor.from_pretrained(&quot;BridgeTower/bridgetower-large-itm-mlm-itc&quot;)<br>model     = BridgeTowerForContrastiveLearning.from_pretrained(&quot;BridgeTower/bridgetower-large-itm-mlm-itc&quot;)<br>model.eval()  # make sure we’re in eval mode<br><br>def extract_frame_embeddings(metadata):<br>    embeddings = {}<br>    for entry in metadata:<br>        image = Image.open(entry[&#39;extracted_frame_path&#39;])<br>        text  = entry[&#39;transcript&#39;]<br>        encoding = processor(image, text=text, return_tensors=&quot;pt&quot;)<br>        with torch.no_grad():<br>            outputs = model(**encoding)<br>        # Use the fused image-text output (cross_embeds) as the segment vector<br>        image_embeddings = outputs.cross_embeds.squeeze().cpu().numpy()<br>        embeddings[entry[&#39;video_segment_id&#39;]] = image_embeddings<br>    return embeddings</pre><p><strong>5. 
Indexing with LanceDB</strong></p><p>Using LanceDB’s <strong>IVF_HNSW_SQ indexing</strong>, we stored the embeddings for scalable and fast approximate nearest-neighbor search.</p><pre>import lancedb<br>import pyarrow as pa<br>db = lancedb.connect(&quot;lancedb_data&quot;)  # This will create a database in the &quot;lancedb_data&quot; directory<br>table = db.create_table(<br>    &quot;frame_embeddings&quot;,<br>    schema=pa.schema([<br>        (&quot;video_segment_id&quot;, pa.int32()),<br>        (&quot;vector&quot;, pa.list_(pa.float32(), 512)),<br>        (&quot;extracted_frame_path&quot;, pa.string()),<br>        (&quot;transcript&quot;, pa.string()),<br>        (&quot;video_path&quot;, pa.string()),<br>        (&quot;mid_time_ms&quot;, pa.float64()),<br>        (&quot;associated_transcript_times&quot;, pa.list_(pa.list_(pa.float64()))),  # List of [start, end] pairs stored as lists of float64<br>    ]),<br>    mode=&quot;create&quot;<br>)</pre><pre>records = []<br>for meta in extracted_metadatas:<br>    segment_id = meta[&quot;video_segment_id&quot;]<br>    if segment_id in loaded_embeddings:<br>        records.append({<br>            &quot;video_segment_id&quot;: segment_id,<br>            &quot;vector&quot;: list(loaded_embeddings[segment_id]),<br>            &quot;extracted_frame_path&quot;: meta[&quot;extracted_frame_path&quot;],<br>            &quot;transcript&quot;: meta[&quot;transcript&quot;],<br>            &quot;video_path&quot;: meta[&quot;video_path&quot;],<br>            &quot;mid_time_ms&quot;: meta[&quot;mid_time_ms&quot;],<br>            &quot;associated_transcript_times&quot;: [<br>                [start, end] for start, end in meta.get(&quot;associated_transcript_times&quot;, [])<br>            ],<br>        })<br>table.add(records)<br>table.create_index(index_type=&quot;IVF_HNSW_SQ&quot;)</pre><p><strong>6. 
Retrieval</strong></p><p>At search time:</p><ul><li>The user query is embedded using BridgeTower.</li><li>The top 5 closest matches are retrieved from LanceDB.</li><li>Clicking a result jumps straight to the relevant moment in the video.</li></ul><pre>import base64<br>from io import BytesIO<br>from typing import List<br><br>import torch<br>from PIL import Image<br>from pydantic import BaseModel  # import path assumed<br>from langchain_core.embeddings import Embeddings  # import path assumed<br>from transformers import BridgeTowerForContrastiveLearning, BridgeTowerProcessor<br><br># Blank placeholder image used when the query is text-only<br>dummy_img = Image.new(&quot;RGB&quot;, (224, 224), (255, 255, 255))<br>processor = BridgeTowerProcessor.from_pretrained(&quot;BridgeTower/bridgetower-large-itm-mlm-itc&quot;)<br># Load the model once rather than on every call<br>model2 = BridgeTowerForContrastiveLearning.from_pretrained(&quot;BridgeTower/bridgetower-large-itm-mlm-itc&quot;)<br>model2.eval()<br><br>def bt_embedding_from_bridgetower(prompt, base64_image=None):<br>    inputs = {&quot;text&quot;: [prompt], &quot;images&quot;: [dummy_img]}<br><br>    if base64_image:<br>        image_bytes = base64.b64decode(base64_image)<br>        image = Image.open(BytesIO(image_bytes)).convert(&quot;RGB&quot;)<br>        inputs[&quot;images&quot;] = [image]<br><br>    inputs = processor(**inputs, padding=True, return_tensors=&quot;pt&quot;)<br><br>    with torch.no_grad():<br>        outputs = model2(**inputs)<br><br>    text_embeddings = outputs.text_embeds.squeeze().tolist()<br><br>    return text_embeddings<br><br><br>class BridgeTowerEmbeddings(BaseModel, Embeddings):<br>    &quot;&quot;&quot; BridgeTower embedding model &quot;&quot;&quot;<br><br>    def embed_documents(self, texts: List[str]) -&gt; List[List[float]]:<br>        &quot;&quot;&quot;Embed a list of documents using BridgeTower.<br>        Args:<br>            texts: The list of texts to embed.<br>        Returns:<br>            List of embeddings, one for each text.<br>        &quot;&quot;&quot;<br>        embeddings = []<br>        for text in texts:<br>            embedding = bt_embedding_from_bridgetower(text, None)<br>            embeddings.append(embedding)<br>        return embeddings<br><br>    def embed_query(self, text: str) -&gt; List[float]:<br>        &quot;&quot;&quot;Embed a 
query using BridgeTower.<br>        Args:<br>            text: The text to embed.<br>        Returns:<br>            Embeddings for the text.<br>        &quot;&quot;&quot;<br>        return self.embed_documents([text])[0]<br><br>bt_embeddings = BridgeTowerEmbeddings()<br><br>query_text = &quot;intelligent behavior&quot;  # Example<br>query_embedding = bt_embeddings.embed_query(query_text)</pre><pre># Perform similarity search<br>results = table.search(query_embedding, vector_column_name=&quot;vector&quot;).limit(5).nprobes(2).refine_factor(5).to_list()<br><br># Print results<br>for result in results:<br>    print(result)</pre><h3>Frontend: Seamless Student Experience</h3><p>We built a lightweight <strong>Streamlit app</strong> where students can:</p><ul><li>Choose a video from a list.</li><li>Enter a text query (e.g., “Bayesian Networks”).</li><li>Instantly see top-matching timestamps.</li><li>Click to jump directly to the right part of the video.</li></ul><p>This transforms hours of manual browsing into seconds of smart retrieval.</p><h3>How Well Does It Work?</h3><p>We evaluated the system using metrics like <strong>Precision@5</strong>, <strong>Recall@5</strong>, and <strong>Mean Reciprocal Rank (MRR)</strong>:</p><p><strong>Metric Score</strong></p><p><strong>Average Precision@5:</strong> 0.52</p><p><strong>Average Recall@5:</strong> 0.49</p><p><strong>MRR:</strong> 0.67</p><p>These results show that, on average, students are likely to find the right content within the top few results, a huge improvement over random scrubbing!</p><h3>Challenges We Faced</h3><p><strong>Fine-Tuning the Model</strong>:</p><ul><li>We initially wanted to fine-tune BridgeTower (using LoRA) to better adapt it to lecture-specific domains. 
However, GPU memory limitations prevented us from doing so.</li></ul><p><strong>Ambiguous Frames</strong>:</p><ul><li>Lectures often have silent frames (e.g., when professors are writing on boards or switching slides), making it tricky to align frames with meaningful transcript snippets.</li></ul><p><strong>Contrastive Learning</strong>:</p><ul><li>Using BridgeTower for contrastive learning across ambiguous frames was another challenge, given the variability in lecture visuals.</li></ul><h3>Future Directions</h3><p>Our work is just the beginning. Some promising next steps include:</p><ol><li><strong>Separate Indexes</strong>: Create distinct indexes for visual and textual embeddings to improve search granularity.</li><li><strong>Fine-Tuning</strong>: Revisit LoRA-based fine-tuning once more compute is available.</li><li><strong>Advanced Metadata Extraction</strong>: Use OCR to read slide text and Named Entity Recognition (NER) to enhance query alignment.</li><li><strong>External Knowledge Integration</strong>: Pull in Wikipedia or arXiv summaries for expanded learning within the app.</li></ol><h3>Conclusion</h3><p>By combining cutting-edge multimodal retrieval techniques with accessible open-source tools, we took a big step toward making <strong>lecture content truly searchable and student-friendly</strong>. 
No more endless scrubbing through videos; just search, click, and learn.</p><p>If you’re passionate about education, AI, or just hate wasting time finding the right clip in a 2-hour lecture, we hope this inspires you to imagine smarter, more human-centric learning tools for the future.</p><h3><strong>Contributors</strong></h3><p><a href="https://www.linkedin.com/in/muhammad-adil-fayyaz/">Muhammad Adil Fayyaz</a>, <a href="https://www.linkedin.com/in/tanmay-thakare/">Tanmay Surendra Thakare</a>, and <a href="https://www.linkedin.com/in/harshitapoojary/">Harshita Dooja Poojary</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e87da4dacde8" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>