SnapQuery: Your Personalized Image Search Assistant
Inspiration
Digital devices have become an essential part of our everyday lives. From waking up in the morning to going to sleep at night, we interact with cameras, smartphones, and online platforms countless times. With smartphone penetration increasing by over 10% each year, more people than ever are capturing, storing, and sharing moments seamlessly. Whether using dedicated cameras for high-quality shots, smartphones for instant captures, or cloud platforms for storage and sharing, our reliance on these technologies continues to grow.
Stored media, from photographs and videos to memes and stickers, have become an integral part of our digital lives. They are a popular way to celebrate special occasions: we pick images from our past that mark our special connections and share them on social media, helping those relationships grow stronger and last longer. However, finding that one special image you have in mind can be a time-consuming task, especially when gigabytes of storage are filled with images from different sources, occasions, and people. If your images aren't well organized, locating a specific memory, or even your own headshot, can be daunting.
What it does
When you have a gallery of over 1,000 images, how do you find the right one? You could spend a long time scrolling. But what if it were easier? We introduce our personalized solution: SnapQuery. You simply describe the image, and it shows up in your search. Think of the trip you took with your friends, the holiday you spent with your parents at home, or that hike in the mountains. And not just a generic description: you can name the people you're looking for. For example, you can enter “mom and dad hugging each other during Christmas” or “Sid and I during our hike”, and SnapQuery does its magic. It also performs face recognition, so when you ask for images of your mom and dad, it fetches them. If you want their picture from a specific occasion or event, all you have to do is type in the details, and SnapQuery finds it.
How we built it
Developing SnapQuery was a challenging task: we needed a solution that could accurately recognize people, locations, and occasions. We achieved this by integrating three techniques: facial recognition, image description generation, and indexing with similarity search.
- Facial Recognition: First, we teach SnapQuery who is who. We give it reference images along with names; for example, you provide a picture of yourself and one of each of your parents and label them “me”, “mom”, and “dad”, so it knows who you are referring to when you say “mom” or “dad”. After that, whenever it detects faces in an image, it can compare them against these reference images. This gives SnapQuery the personalized touch we want it to have.
- Image Description Generation: How do we correlate images with text? We use a multimodal model to generate a representation of each image that is aligned with natural language, and we do the same for the text query. With these, we can identify which images relate to the user's query the most. This gives SnapQuery the flexibility and specificity it needs when handling queries.
- Indexing and Similarity Search: When the number of images is high, we need to store all the different vector representations (faces, image descriptions, etc.) in an index so they can be searched easily and efficiently. Based on the user's query, SnapQuery quickly locates candidate images and calculates the similarity between these vector representations to fetch the most relevant results. This keeps SnapQuery efficient, fast, and light.
Organizing the Data:
Given a collection of images, you first tell the model who is who (through a single reference image each) and where your personal collection lives. The AI engine uses OpenAI's CLIP from the Qualcomm® AI Hub, which has efficiently learnt visual concepts from natural language through deep learning. Given an image, CLIP creates a vector representation that lies close to that of its natural language description.
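To make this concrete, here is a minimal sketch of turning reference images into CLIP embeddings. It uses the Hugging Face `transformers` CLIP checkpoint as a stand-in for the Qualcomm® AI Hub export we actually run; the file paths and the `reference_embeddings` dictionary are purely illustrative.

```python
# Sketch: build L2-normalized CLIP embeddings for reference images.
# Stand-in checkpoint: openai/clip-vit-base-patch32 from Hugging Face.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_image(path: str) -> torch.Tensor:
    """Return an L2-normalized CLIP embedding for a single image file."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    return features / features.norm(dim=-1, keepdim=True)

# Hypothetical reference images: one per named person.
reference_embeddings = {
    "me": embed_image("refs/me.jpg"),
    "mom": embed_image("refs/mom.jpg"),
    "dad": embed_image("refs/dad.jpg"),
}
```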
Next, both the collection and the reference images are processed with MediaPipe, a lightweight yet accurate face detection model from the Qualcomm® AI Hub. It identifies each face in a given image along with landmark features such as the eyes and nose. Each detected face then goes through the CLIP model again to obtain its vector representation.
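Below is a rough sketch of that face step, using the stock `mediapipe` Python package in place of the Qualcomm® AI Hub MediaPipe Face Detector; the cropping logic and helper names are illustrative.

```python
# Sketch: detect faces with MediaPipe and crop them for CLIP encoding.
import cv2
import mediapipe as mp
from PIL import Image

mp_face = mp.solutions.face_detection

def detect_face_crops(path: str) -> list:
    """Return a list of PIL face crops found in the image at `path`."""
    bgr = cv2.imread(path)
    rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
    h, w, _ = rgb.shape
    crops = []
    with mp_face.FaceDetection(model_selection=1,
                               min_detection_confidence=0.5) as detector:
        results = detector.process(rgb)
        for det in results.detections or []:
            box = det.location_data.relative_bounding_box
            x0, y0 = max(int(box.xmin * w), 0), max(int(box.ymin * h), 0)
            x1 = int((box.xmin + box.width) * w)
            y1 = int((box.ymin + box.height) * h)
            crops.append(Image.fromarray(rgb[y0:y1, x0:x1]))
    return crops

# Each crop is then passed through the CLIP image encoder (as in the
# previous sketch) to produce a face embedding.
```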
All these vector representations are stored in their own indexes (Reference Embeddings, Face Embeddings, and Image Embeddings). We use Facebook AI Similarity Search (FAISS) as the store for these vectors so that they can be looked up and compared efficiently.
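A minimal sketch of the three FAISS indexes, assuming 512-dimensional, L2-normalized CLIP embeddings as in the snippets above; the index and helper names are our own.

```python
# Sketch: three FAISS indexes for reference, face, and image embeddings.
# IndexFlatIP on normalized vectors gives cosine similarity.
import faiss
import numpy as np

DIM = 512  # CLIP ViT-B/32 embedding size assumed above

reference_index = faiss.IndexFlatIP(DIM)  # one vector per named person
face_index = faiss.IndexFlatIP(DIM)       # one vector per detected face
image_index = faiss.IndexFlatIP(DIM)      # one vector per gallery image

def add_vectors(index: faiss.Index, vectors: np.ndarray) -> None:
    """Add a batch of already-normalized float32 vectors to an index."""
    index.add(np.ascontiguousarray(vectors, dtype=np.float32))

# Alongside each index we keep a simple list mapping row id -> image path
# (or person name), so a FAISS hit can be traced back to its source.
```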
Analyzing the Query:
Once the user enters a query, we analyze it for mentions of the names assigned to the reference images. So if the query is “me and mom”, SnapQuery retrieves the face vectors for you and your mom (keyword embeddings) and compares them with the images in the collection, making the search personalized and intuitive.
Apart from names, SnapQuery also looks at contextual clues in the query, such as location and time of day. These clues are converted into a vector by CLIP's text encoder so that they can be compared against our image vectors.
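A simplified sketch of this query-analysis step follows. It reuses the CLIP `model` and `processor` from the earlier snippet, and the name matching is shown as a plain substring check purely for illustration.

```python
# Sketch: split a query into keyword (face) embeddings and a text embedding.
import torch

def analyze_query(query: str, reference_embeddings: dict):
    """Return the face embeddings of mentioned people and the query text embedding."""
    mentioned = [name for name in reference_embeddings
                 if name.lower() in query.lower()]
    keyword_embeddings = [reference_embeddings[name] for name in mentioned]

    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_features = model.get_text_features(**inputs)
    query_embedding = text_features / text_features.norm(dim=-1, keepdim=True)
    return keyword_embeddings, query_embedding
```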
Matching Relevant Images:
Once we have the keyword embeddings and the query embedding, SnapQuery compares them to the indexed vector representations through similarity search. The keyword embeddings are compared against the face embeddings to rank images by closeness (Face-Keyword scoring). Likewise, the query embedding is compared to the image embeddings from CLIP, and SnapQuery scores images by how well their content matches the query (Image-Text scoring). This hybrid search approach gives more accurate results when searching through your personal gallery.
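A sketch of the two scoring passes against the FAISS indexes is shown below. Here `face_to_image` (mapping a face row back to its source image) and `image_paths` are bookkeeping structures we assume alongside the indexes, and the `top_k` value is illustrative.

```python
# Sketch: hybrid scoring via FAISS inner-product search on normalized vectors.
import numpy as np

def face_keyword_scores(keyword_embedding, face_index, face_to_image, top_k=50):
    """Map face hits back to their source images with similarity scores."""
    query = np.ascontiguousarray(keyword_embedding.numpy(), dtype=np.float32)
    sims, ids = face_index.search(query, top_k)
    return {face_to_image[i]: float(s)
            for i, s in zip(ids[0], sims[0]) if i != -1}

def image_text_scores(query_embedding, image_index, image_paths, top_k=50):
    """Score gallery images against the text query."""
    query = np.ascontiguousarray(query_embedding.numpy(), dtype=np.float32)
    sims, ids = image_index.search(query, top_k)
    return {image_paths[i]: float(s)
            for i, s in zip(ids[0], sims[0]) if i != -1}
```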
Ranking and Displaying the Results:
Once SnapQuery has the two sets of scores, it filters out unwanted images. Applying a threshold to the scores leaves two sets of images: one containing the person we are looking for, and one whose content matches the contextual description in the query.
SnapQuery ranks these images by their similarity scores and then combines the two rankings into a single metric using weighted Reciprocal Rank Fusion (RRF). The combined ranking is displayed to the user through a Streamlit application, where they can refine their query on the spot.
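A minimal sketch of weighted RRF as we use it; the weights and the constant `k` shown here are illustrative rather than the exact values in our app.

```python
# Sketch: weighted Reciprocal Rank Fusion over two ranked lists of image paths.
def weighted_rrf(rankings: dict, weights: dict, k: int = 60) -> list:
    """Fuse several ranked lists into one ordering by summed w / (k + rank)."""
    fused = {}
    for source, ranked_paths in rankings.items():
        w = weights.get(source, 1.0)
        for rank, path in enumerate(ranked_paths, start=1):
            fused[path] = fused.get(path, 0.0) + w / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# Example usage, favoring the face-based ranking:
# final = weighted_rrf(
#     {"face": face_ranked, "text": text_ranked},
#     weights={"face": 0.7, "text": 0.3},
# )
```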
Challenges we ran into
We faced challenges at every stage of development and had to experiment with multiple models to reach the right balance of accuracy and latency. For facial recognition, we tested MTCNN and InceptionResnetV1 from the facenet-pytorch library. They were effective at detecting faces and generating embeddings, but they are large models with high computational overhead, unsuitable for handheld and edge devices.
Similarly, we tested several image-text models such as FLAVA and ALIGN. Multimodal models like these perform well in general, but for a specific use case like ours they would require retraining, increasing both effort and computational requirements.
The biggest challenge, however, was combining facial recognition and image retrieval while maintaining accuracy at a reasonable latency. Models that gave high accuracy had poor latency, and models with good latency had poor accuracy. We experimented with several techniques, such as quantization and Matryoshka embeddings, to find the balance.
After some optimization, we struck a good balance between accuracy and latency, thanks to the Qualcomm® AI Hub versions of CLIP and MediaPipe. The result is a lightweight but powerful application that solves our problem.
Accomplishments that we're proud of
We were thrilled to see the system perform well on a gallery of over 600 images, consistently returning accurate results: the images we were looking for always appeared in the top 10. With this, we can proudly say we have:
- Successfully built a hybrid image retrieval system that improves search accuracy.
- Optimized performance to ensure real-time results.
- Optimized models for on-device deployment using Qualcomm® AI Hub for the Samsung Galaxy S23 Family (Snapdragon® 8 Gen 2 | SM8550):
  - Achieved a minimum inference time of 40.7 ms for the CLIP Image Encoder and 7.3 ms for the CLIP Text Encoder
  - Achieved a minimum inference time of 2 ms for the MediaPipe Face Detector
- Implemented a scalable and lightweight solution that can be extended for various applications.
What we learned
Building the face recognition pipeline from the Qualcomm® AI Hub models MediaPipe Face and OpenAI CLIP solved the problem of high computational overhead. Using OpenAI CLIP also let us avoid extensive retraining or fine-tuning and lowered overall resource demand. What surprised us most was how effective a generic image-text embedding model like CLIP was at comparing faces.

Image quality plays a crucial role, however. Lower-quality images (low resolution, poor lighting) do not always show up in the top results; we think this is because CLIP cannot extract enough detail into the vector representation. In those cases we sometimes need to adjust the similarity filter threshold.

Filtering out unwanted images based on a similarity threshold was also a challenging exercise. Because our goal was to integrate image search with facial recognition, we prioritized facial similarity in the search process, using a weighted RRF with a higher weight on the facial similarity rank. The similarity thresholds we set for our gallery were 0.81 for face recognition and 0.2 for image-text similarity. Depending on the user's gallery, these values may need slight tweaking; in our experience, never by more than about 0.05.
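For reference, those thresholds translate into a small configuration like the following; the constant names and the filtering helper are illustrative.

```python
# Illustrative defaults from our gallery; per the note above, they may need
# small per-gallery adjustments (in our experience, within about ±0.05).
FACE_SIMILARITY_THRESHOLD = 0.81   # minimum face-keyword cosine similarity
IMAGE_TEXT_THRESHOLD = 0.20        # minimum image-text cosine similarity

def apply_threshold(scores: dict, threshold: float) -> dict:
    """Drop candidates whose similarity score falls below the threshold."""
    return {path: s for path, s in scores.items() if s >= threshold}
```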
What's next for SnapQuery
While our solution does a commendable job of finding the right image out of hundreds, we understand it is far from perfect. We are excited about bringing it to handheld devices, and that is where we see the potential of the Qualcomm® AI Engine SDKs. Building on the CLIP and MediaPipe models optimized for the Samsung Galaxy S23 Family (Snapdragon® 8 Gen 2 | SM8550) using Qualcomm® AI Hub, we believe we can push performance even further on smartphones. We are also looking forward to new additions to the Qualcomm® AI Hub: it already hosts several amazing open-source models, and the team is continuously expanding the catalog. We hope to take advantage of new state-of-the-art models as they are published.
SnapQuery isn’t just an image search tool—it’s a way to keep your most treasured moments at your fingertips, effortlessly and intuitively.