In addition to my academic pursuits, I have gained industry experience as a Generative AI Researcher at TikTok and as a Computer Vision Research Engineer at Retrocausal.
COLLAGE is a few-shot imitation learning method that adaptively fuses demonstrations from multiple similarity modalities. It estimates each modality’s usefulness by training a policy on its retrieved data and measuring how probable the target actions are under that policy.
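The weighting idea can be illustrated with a minimal sketch. It is not the published implementation: the unit-variance Gaussian action likelihood, the softmax conversion of scores into weights, and the names `gaussian_log_prob` and `modality_weights` are all assumptions made for illustration.

```python
import math

def gaussian_log_prob(action, mean, std=1.0):
    """Log-density of `action` under an isotropic Gaussian centered at `mean` (assumed likelihood)."""
    d = len(action)
    sq_dist = sum((a - m) ** 2 for a, m in zip(action, mean))
    return -0.5 * sq_dist / std ** 2 - d * math.log(std) - 0.5 * d * math.log(2 * math.pi)

def modality_weights(policies, target_demos, temperature=1.0):
    """Score each modality's policy by the mean log-likelihood of the target actions.

    policies: dict mapping a modality name to a callable obs -> predicted action mean,
              each assumed to be trained on the data retrieved with that modality.
    target_demos: list of (obs, action) pairs from the few-shot target demonstrations.
    Returns a dict of weights that sum to 1 (softmax over scores, an assumption).
    """
    scores = {}
    for name, policy in policies.items():
        log_probs = [gaussian_log_prob(act, policy(obs)) for obs, act in target_demos]
        scores[name] = sum(log_probs) / len(log_probs)
    top = max(scores.values())  # subtract the max for numerical stability
    exp_scores = {k: math.exp((v - top) / temperature) for k, v in scores.items()}
    total = sum(exp_scores.values())
    return {k: e / total for k, e in exp_scores.items()}

# Toy usage: the "visual" policy reproduces the target actions, the "language" policy is offset.
demos = [([0.0, 0.0], [0.0, 0.0]), ([1.0, 2.0], [1.0, 2.0])]
policies = {
    "visual": lambda obs: obs,
    "language": lambda obs: [o + 1.0 for o in obs],
}
weights = modality_weights(policies, demos)
```

In this toy setup the modality whose policy explains the target actions better receives the larger weight, which could then control how much of its retrieved data is used.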
This paper introduces a method that uses finetuned multimodal language models to filter image–text pairs more accurately than traditional metrics like CLIPScore. It defines multiple specialized quality metrics and builds instruction tuning data guided by stronger models such as GPT-4 to train the models to score data effectively. The result is a reliable and efficient filter that better evaluates image–text alignment.
We introduce a three-stage data filtering strategy for improving model performance: single-modality filtering, cross-modality filtering, and data distribution alignment. The proposed approach significantly surpasses previous methods on the DataComp benchmark.
GraphIRL is a self-supervised method for learning a task reward solely from videos.
We build an object-centric graph abstraction from video demonstrations and then learn an embedding space that captures task progression in a self-supervised manner by exploiting the temporal cue in the videos.
We propose temporal optimal transport for jointly learning representations and performing online clustering in an unsupervised manner.
The approach learns prototype vectors via backpropagation. The prototype vectors are initialized at random and act as cluster centroids.
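As a rough illustration of how optimal transport with a temporal cue can turn frame-to-prototype similarities into cluster assignments, here is a minimal sketch. The Sinkhorn-Knopp iteration, the balanced cluster marginals, the form of the temporal prior, and the names `temporal_prior` and `sinkhorn_assign` are assumptions for illustration, not details taken from the paper.

```python
import math

def temporal_prior(n_frames, n_clusters):
    """Assumed prior: frame i prefers the cluster whose relative position is closest in time."""
    return [[-abs((i + 0.5) / n_frames - (j + 0.5) / n_clusters)
             for j in range(n_clusters)] for i in range(n_frames)]

def sinkhorn_assign(scores, rho=1.0, epsilon=0.1, n_iters=50):
    """Soft assignments of frames to prototypes via entropy-regularized optimal transport.

    scores: n_frames x n_clusters similarities between frame embeddings and prototypes.
    rho: weight of the temporal prior; epsilon: entropy regularization strength.
    Returns a matrix whose rows each sum to 1.
    """
    n, k = len(scores), len(scores[0])
    prior = temporal_prior(n, k)
    combined = [[scores[i][j] + rho * prior[i][j] for j in range(k)] for i in range(n)]
    top = max(max(row) for row in combined)  # shift for numerical stability
    q = [[math.exp((c - top) / epsilon) for c in row] for row in combined]
    for _ in range(n_iters):
        # Scale columns so every cluster receives equal total mass (1/k).
        col_sums = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] / (col_sums[j] * k) for j in range(k)] for i in range(n)]
        # Scale rows so every frame carries equal mass (1/n).
        q = [[v / (sum(row) * n) for v in row] for row in q]
    return [[v * n for v in row] for row in q]

# Toy usage: uninformative similarities, so the temporal prior alone decides the assignment.
assignments = sinkhorn_assign([[0.0, 0.0] for _ in range(4)])
```

In a training loop, assignments like these would typically serve as pseudo-labels for a loss whose gradients update both the encoder and the prototype vectors.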
We propose alignment as a pretext task for self-supervised video representation learning.
The proposed approach leverages differentiable dynamic time warping for learning global alignment across pairs of videos.
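One common way to make dynamic time warping differentiable is to replace the hard minimum in its recursion with a smooth soft-minimum (often called soft-DTW). The sketch below assumes that formulation and a squared Euclidean frame cost; the specific variant used in the paper may differ, and the names `soft_min` and `soft_dtw` are illustrative.

```python
import math

def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    low = min(values)
    return low - gamma * math.log(sum(math.exp(-(v - low) / gamma) for v in values))

def soft_dtw(x, y, gamma=0.1):
    """Soft-DTW alignment cost between two sequences of frame embeddings.

    x, y: lists of equal-dimension feature vectors (one per frame).
    gamma: smoothing strength; as gamma approaches 0 the value approaches classic DTW.
    """
    n, m = len(x), len(y)
    inf = float("inf")
    r = [[inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Squared Euclidean distance between frame i of x and frame j of y.
            cost = sum((a - b) ** 2 for a, b in zip(x[i - 1], y[j - 1]))
            r[i][j] = cost + soft_min([r[i - 1][j], r[i][j - 1], r[i - 1][j - 1]], gamma)
    return r[n][m]

# Toy usage: the second video repeats one frame of the first; the third is unrelated.
video_a = [[0.0], [1.0], [2.0]]
video_b = [[0.0], [1.0], [1.0], [2.0]]
video_c = [[5.0], [5.0], [5.0]]
aligned_cost = soft_dtw(video_a, video_b, gamma=0.01)
unrelated_cost = soft_dtw(video_a, video_c, gamma=0.01)
```

Because every operation in the recursion is smooth, the alignment cost can be used directly as a training loss in an autodiff framework; this pure-Python version only computes the forward value.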
We collect a video dataset of road-based anomalies. We propose an object-object interaction reasoning approach for detecting anomalies without additional supervision.
We experiment with reconstruction-based and one-class classification-based approaches.