I graduated with a Ph.D. in Computer Science from the Center for Research in
Computer Vision (CRCV) at UCF, under the guidance of Prof. Mubarak Shah.
My research interests span various domains within computer vision and machine learning.
During my doctoral studies, I focused extensively on tackling diverse challenges in video comprehension using supervised,
weakly supervised, self-supervised, and zero-shot learning. This includes tasks such as action detection, temporal action localization,
and complex activity recognition. I have also worked on anomaly detection, gait recognition, person re-identification, and video understanding with large language models. I am experienced with deep learning frameworks such as PyTorch, Keras, and TensorFlow.
  May 2025: Paper accepted to ICML 2025 as an Oral (Top 1%)
  September 2024: Joined PhotoDay as a Machine Learning Researcher
  August 2024: Graduated with a Ph.D. in Computer Science from UCF
  June 2024: Successfully defended my Ph.D. dissertation
  December 2023: Paper accepted to AAAI 2024
  May 2023: Started summer internship at Amazon
  October 2022: Patent granted for real-time spatio-temporal activity detection from untrimmed videos
  May 2022: Started summer internship at Pinterest
  March 2021: Paper accepted to CVPR 2021 as an Oral
  January 2021: Our Gabriella paper received the Best Scientific Paper Award at ICPR 2020
  June 2020: Placed first in the ActEV SDL Challenge (ActivityNet workshop at CVPR 2020)
  October 2019: Placed second on the TRECVID leaderboard
  August 2018: Paper accepted to ACM MM 2018
We present a novel dataset distillation approach that uses a pre-trained diffusion model to generate high-quality synthetic datasets without any retraining or fine-tuning. Our approach avoids expensive model retraining, which saves computation; delivers better diversity and representativeness than existing methods; and matches or exceeds state-of-the-art performance on ImageNette, ImageIDC, ImageNet-100, and ImageNet-1K.
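The core idea lends itself to a short sketch: sample class-conditional images from a frozen, pre-trained diffusion model and use them directly as the distilled set. The checkpoint, prompts, and per-class budget below are illustrative assumptions, not the exact setup from the paper.

```python
# Minimal sketch: class-conditional synthetic-set generation with a frozen,
# pre-trained diffusion model (no retraining or fine-tuning). Checkpoint name,
# prompts, and the images-per-class budget are placeholders.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

class_names = ["tench", "English springer", "cassette player"]  # e.g., ImageNette classes
images_per_class = 10  # distilled images-per-class budget

distilled = {}
for name in class_names:
    prompt = f"a photo of a {name}"
    # generate synthetic images for this class with the frozen model
    images = pipe(prompt, num_images_per_prompt=images_per_class).images
    distilled[name] = images  # later used to train a downstream classifier
```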
This study introduces a novel approach to multi-view action recognition that separates learned action representations from view-specific details in videos. To address challenges such as background variation and differing visibility across viewpoints, we propose a unique configuration of learnable transformer decoder queries combined with two supervised contrastive losses. Our model significantly outperforms all uni-modal counterparts on four different datasets.
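For illustration, here is a minimal sketch of a supervised contrastive loss of the kind used to pull together features of the same action across viewpoints; the learnable decoder queries and the second loss from the paper are not reproduced here.

```python
# Hedged sketch of a supervised contrastive loss over action embeddings.
import torch
import torch.nn.functional as F

def sup_con_loss(features, labels, temperature=0.1):
    """features: (N, D) embeddings, labels: (N,) action labels."""
    features = F.normalize(features, dim=1)
    sim = features @ features.t() / temperature              # (N, N) similarities
    pos_mask = labels.unsqueeze(0) == labels.unsqueeze(1)    # positives share a label
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=features.device)
    pos_mask = (pos_mask & ~self_mask).float()
    # log-softmax over all other samples, averaged over the positives
    logits = sim.masked_fill(self_mask, float("-inf"))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    loss = -(log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
    return loss.mean()
```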
Our work examines attributes that measure dataset quality for video action detection, probes the limitations of existing datasets, and proposes the Multi Actor Multi Action (MAMA) dataset to address real-world application needs. We also conduct a bias study on the significance of the temporal aspect, questioning assumptions about the importance of temporal ordering and revealing biases despite meticulous modeling.
We propose an attention-based architecture to capture action relationships in the context of
temporal action localization within untrimmed videos. Our approach distinguishes relationships among
actions occurring at the same time step (co-occurrence dependencies) from those occurring at different
time steps (temporal dependencies). To enhance action localization performance, we introduce a novel
Multi-Label Action Dependency (MLAD) layer, leveraging attention mechanisms to model these intricate
dependencies.
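As a rough illustration of the idea (not the paper's implementation), an MLAD-style layer can be sketched as one attention over action classes within a time step and one over time within a class; the dimensions and head count below are placeholders.

```python
# Hedged sketch of attention-based action-dependency modeling.
import torch
import torch.nn as nn

class ActionDependencyLayer(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.class_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.time_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (batch, time, classes, dim) per-class features at each time step
        b, t, c, d = x.shape
        # attention across classes at each time step (co-occurrence dependencies)
        xc = x.reshape(b * t, c, d)
        xc, _ = self.class_attn(xc, xc, xc)
        x = x + xc.reshape(b, t, c, d)
        # attention across time steps for each class (temporal dependencies)
        xt = x.permute(0, 2, 1, 3).reshape(b * c, t, d)
        xt, _ = self.time_attn(xt, xt, xt)
        x = x + xt.reshape(b, c, t, d).permute(0, 2, 1, 3)
        return x
```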
This paper outlines the TinyAction Challenge held at CVPR 2021, focusing on recognizing real-world
low-resolution activities in security videos. It introduces the benchmark dataset TinyVIRAT-v2, an
extension of TinyVIRAT, featuring naturally occurring low-resolution actions from security videos.
The challenge aims to address the difficulty of action recognition in tiny regions, providing a
benchmark for state-of-the-art methods.
Gabriella consists of three stages: tubelet extraction, activity classification, and online
tubelet merging. It uses a localization network for tubelet extraction, with a novel
Patch-Dice loss to handle variations in actor size, and a Tubelet-Merge Action-Split (TMAS)
algorithm to detect activities efficiently and robustly.
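For illustration, a Dice-style loss computed per spatial patch, in the spirit of the Patch-Dice loss, can be sketched as follows; the patch size and exact formulation here are assumptions, not the released code.

```python
# Hedged sketch: Dice loss computed per non-overlapping spatial patch, so that
# small actors are not dominated by large ones. Assumes H and W divide evenly
# by the patch size.
import torch
import torch.nn.functional as F

def patch_dice_loss(pred, target, patch=16, eps=1e-6):
    """pred, target: (B, 1, H, W) foreground probability and binary mask."""
    # split the maps into non-overlapping patches: (B, num_patches, patch*patch)
    p = F.unfold(pred, kernel_size=patch, stride=patch).transpose(1, 2)
    t = F.unfold(target, kernel_size=patch, stride=patch).transpose(1, 2)
    inter = (p * t).sum(-1)
    union = p.sum(-1) + t.sum(-1)
    dice = (2 * inter + eps) / (union + eps)   # Dice score per patch
    return 1 - dice.mean()
```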
This paper explores decoding and visualizing human thoughts through Brain Computer Interface (BCI)
research.
Using electroencephalogram (EEG) signals, the proposed conditional Generative Adversarial Network
(GAN) effectively synthesizes visual representations of specific thoughts, such as digits,
characters, or objects.
The study showcases the potential of extracting meaningful visualizations from limited EEG data,
demonstrating the explicit encoding of thoughts in brain signals for semantically relevant image
generation.
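A minimal sketch of the conditioning idea follows: a generator that maps noise plus an EEG-derived embedding to an image. The embedding size, layer widths, and output resolution are illustrative assumptions, not the paper's model.

```python
# Hedged sketch of a conditional generator driven by an EEG feature vector.
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, noise_dim=100, eeg_dim=128, img_size=64):
        super().__init__()
        self.img_size = img_size
        self.net = nn.Sequential(
            nn.Linear(noise_dim + eeg_dim, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, img_size * img_size),
            nn.Tanh(),
        )

    def forward(self, noise, eeg_embedding):
        # concatenate noise with the EEG feature so the generated image
        # reflects the decoded thought (digit, character, or object)
        z = torch.cat([noise, eeg_embedding], dim=1)
        img = self.net(z)
        return img.view(-1, 1, self.img_size, self.img_size)
```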
Research Scientist Intern, Amazon Inc., Palo Alto, California, USA. May 2023 - Nov 2023
Mentor: Jay Krishnan
Worked on representation learning for long-form video understanding with vision-language training. Explored
the idea of leveraging pre-trained Large Language Models (LLMs) to improve temporal understanding
of video models.
Research Scientist Intern Pinterest Inc., Remote, USA. May 2022 - Aug 2022
Mentor: Rex Wu
Worked on building a unified model for both image and video representation learning. Explored large-scale
self-supervised training to learn representations for multiple visual modalities. Obtained improved
performance over the in-house image-based model using multi-modal training.