I am a Ph.D. candidate at Mila starting Fall 2023, where I am advised by Prof. Aishwarya Agrawal. I completed my Master's at IIIT Hyderabad, where I was co-advised by Prof. Vineet Gandhi and Prof. K Madhava Krishna. I have worked on Visual Grounding, Language-Guided Autonomous Navigation, Multi-View Detection, and Multi-Object Tracking.
I am interested in the following research topics: learning from multiple data modalities, language understanding in autonomous systems during navigation, explainable deep learning, multi-object tracking, improving robustness to domain shifts and adversarial attacks, learning in low-data regimes, and ensemble learning.
We investigate the problem of reducing mistake severity for fine-grained classification. Our novel approach, Hierarchical Ensembles (HiE), uses the label hierarchy to improve fine-grained classification performance at test time using the coarse-grained predictions.
We introduce a novel instance-focused scene representation for indoor settings, enabling seamless language-based navigation across various environments. Our representation accommodates language commands that refer to specific instances within the environment.
We investigate the Vision-and-Language Navigation problem in the context of autonomous driving in outdoor settings. We explicitly ground the navigable regions corresponding to the textual command and use them directly as guidance for the navigation stack.
We find that existing state-of-the-art models show poor generalization by overfitting to a single scene and camera configuration. We formalize three critical forms of generalization and propose experiments to evaluate them.
We investigate Referring Image Segmentation, which outputs a segmentation map corresponding to a natural language description. We propose a novel architecture to effectively capture all forms of multi-modal interactions synchronously.
We propose a novel visual-grounding-based approach to language-guided navigation, which brings interpretability and explainability to the Vision-and-Language Navigation task.