Inspiration
The first 48 hours after a crime are critical. That fact inspired us to build a tool that automatically analyzes surveillance footage and flags people of interest from a written description. As we worked on the project, we broadened our focus from people to any feature in a video, given a short description. At 2,600 tokens per second of inference, FrameSleuth is a scalable solution for ultra-fast video processing.
What it does
FrameSleuth processes videos with Llama 4 to automatically flag frames that contain a specified feature or object. A user-provided description tells the model what to look for, letting detectives or forensic analysts sift through massive amounts of footage efficiently. Our solution also scales well because we process several video frames in parallel, enabling faster inference. At 2,600 tokens/second of inference, an hour-long video takes about 1.5 minutes to analyze, roughly 40x faster than watching it in real time.
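The parallel frame-flagging idea described above can be sketched as follows. This is an illustration, not FrameSleuth's actual code: `sample_indices`, `analyze_frame`, and the one-frame-per-second sampling rate are assumptions, and the real `analyze_frame` would send each frame to Llama 4 through the Groq API.

```python
# Illustrative sketch of parallel frame flagging (assumed helper names;
# the real per-frame call would go to Llama 4 via the Groq API).
from concurrent.futures import ThreadPoolExecutor

def sample_indices(total_frames: int, fps: float, interval_s: float = 1.0) -> list:
    """Pick one frame per `interval_s` seconds of video."""
    step = max(1, round(fps * interval_s))
    return list(range(0, total_frames, step))

def analyze_frame(frame_idx: int, description: str) -> bool:
    # Placeholder: in FrameSleuth this would ask Llama 4 whether the frame
    # matches the user's description and return True/False.
    raise NotImplementedError

def flag_frames(total_frames, fps, description, analyze=analyze_frame, workers=8):
    """Run the per-frame check concurrently and return matching indices."""
    indices = sample_indices(total_frames, fps)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        hits = pool.map(lambda i: (i, analyze(i, description)), indices)
    return [i for i, match in hits if match]

# Example with a stubbed analyzer: "match" any frame past the 30-second mark
# of a one-minute, 30 fps clip.
stub = lambda i, _desc: i >= 900
print(flag_frames(total_frames=1800, fps=30, description="red jacket", analyze=stub))
# one sampled frame per second → indices 900, 930, ..., 1770
```

Because each sampled frame is an independent request, throughput scales with the worker count until the API's rate limit becomes the bottleneck.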
How we built it
We built this on Meta's latest large language model, Llama 4, running inference through the Groq API. We developed an interactive frontend in TypeScript and React, plus a Flask backend that runs the Python logic for Llama 4 inference.
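A minimal sketch of the kind of inference call the Flask backend would make. The message shape assumes Groq's OpenAI-compatible chat API; the model ID, the yes/no prompt, and the helper names are our illustration, not the exact production code.

```python
# Sketch of a per-frame Llama 4 check via the Groq API (assumed model ID
# and prompt wording; the message format follows Groq's OpenAI-compatible
# chat API). The Groq import lives inside check_frame so build_messages
# stays dependency-free.
import base64

def build_messages(description: str, frame_jpeg: bytes) -> list:
    """Chat messages asking Llama 4 whether a JPEG frame matches the description."""
    data_url = "data:image/jpeg;base64," + base64.b64encode(frame_jpeg).decode()
    return [{
        "role": "user",
        "content": [
            {"type": "text",
             "text": f"Does this frame contain: {description}? Answer yes or no."},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }]

def check_frame(description: str, frame_jpeg: bytes) -> bool:
    from groq import Groq  # pip install groq
    client = Groq()  # reads GROQ_API_KEY from the environment
    resp = client.chat.completions.create(
        model="meta-llama/llama-4-scout-17b-16e-instruct",  # assumed model ID
        messages=build_messages(description, frame_jpeg),
    )
    return "yes" in resp.choices[0].message.content.lower()
```

The Flask route would simply decode an uploaded frame, call `check_frame`, and return the boolean to the React frontend.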
Challenges we ran into
We ran into rate limits on the Groq API's free tier, which slowed our requests. We also had to refine our prompts to the Llama model and test them iteratively to ensure accurate results.
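One common way to soften rate limiting like this is retrying with exponential backoff. The sketch below is a generic illustration (the `with_backoff` helper and its defaults are our own, not part of our codebase or the Groq SDK); `call` stands in for any Groq request.

```python
# Illustrative retry-with-exponential-backoff helper for rate-limited APIs
# (hypothetical helper; `call` stands in for a Groq request).
import time

def with_backoff(call, retries=4, base_delay=1.0, sleep=time.sleep):
    """Retry `call` up to `retries` times, doubling the wait each attempt."""
    for attempt in range(retries):
        try:
            return call()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the error
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
```

Injecting `sleep` keeps the helper easy to test; production code would likely catch only rate-limit errors rather than every exception.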
Accomplishments that we're proud of
We’re proud of building a novel application for the Llama 4 model that solves general object-detection problems by leveraging the latest AI technology. Despite API limitations, we achieved fast, accurate frame-level detection with a smooth and intuitive interface. Most importantly, we created a solution that could meaningfully assist in real-world scenarios like time-sensitive investigations, bringing cutting-edge AI closer to impactful use cases.
What we learned
- How to use the Groq API to integrate an LLM into a React application
- How to build a desktop app using Electron and React
- How to integrate LLMs like Llama 4 for high-throughput, frame-level video understanding
- The importance of prompt engineering when generating visual detection criteria from natural language
What's next for FrameSleuth
- Add multi-object tracking, allowing the system to track individuals across frames once detected.
- Incorporate real-time streaming support so FrameSleuth can flag objects in live camera feeds.
- Improve the natural language interface with dialog-based refinement (e.g., "Show me only people wearing red jackets").
- Introduce a visual dashboard for interactive playback and frame annotation.
- Extend to edge deployment using optimized models for on-premises or mobile hardware.
- Explore integration with law enforcement tools or emergency response pipelines.