I am a PhD student at the University of Illinois Urbana-Champaign, advised by Prof. Heng Ji. Previously, I was an AI Resident at IBM Research, New York, where I had the pleasure of working with Vittorio Castelli, Avirup Sil, and Salim Roukos.
I graduated from the Indian Institute of Technology Madras in 2018 with a bachelor's degree in computer science. While at IIT Madras, I worked with Prof. Mitesh Khapra and Prof. Balaraman Ravindran.
I'm interested in natural language processing and large language models, particularly in agentic search, embedding models and reranking, and information seeking for retrieval-augmented generation.
We introduce WiNELL, an agentic framework for automatic Wikipedia updating that continuously monitors online sources for recent facts, identifies updates relevant to the article under consideration, and generates well-formed edit suggestions.
We introduce SWERank, a code ranking framework for software issue localization, which identifies the relevant code that needs to be modified to fix a software issue.
We introduce CoRNStack, a large-scale, high-quality contrastive training dataset for code that spans multiple programming languages. We demonstrate that contrastive training of embedding models using CoRNStack leads to state-of-the-art performance across a variety of code retrieval tasks.
We introduce INFOGENT, a novel modular and feedback-driven framework for web information aggregation involving three distinct components: Navigator, Extractor, and Aggregator.
We propose to compute an improved vector representation of the query using supervision from the re-ranker at inference time, thereby improving the retriever's Recall@K. Our approach is parameter-free, lightweight, and can serve arbitrary retrieve-and-rerank pipelines, significantly improving retrieval recall in multiple domains, languages, and modalities.
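The update rule below is a minimal illustrative sketch, not the paper's exact procedure: it nudges the query vector toward a reranker-score-weighted average of the retrieved document vectors at inference time. All names, the weighting scheme, and the assumption of nonnegative reranker scores are my own for illustration.

```python
# Hedged sketch of reranker-supervised query refinement at inference time.
# Assumes plain list-of-floats vectors and nonnegative reranker scores.

def refine_query(query_vec, doc_vecs, reranker_scores, lr=0.1):
    """One refinement step: move the query toward a reranker-score-weighted
    average of the retrieved document vectors (parameter-free, inference-only)."""
    total = sum(reranker_scores)
    weights = [s / total for s in reranker_scores]
    # Score-weighted centroid of the retrieved documents.
    target = [sum(w * d[i] for w, d in zip(weights, doc_vecs))
              for i in range(len(query_vec))]
    # Interpolate the query toward that centroid.
    return [q + lr * (t - q) for q, t in zip(query_vec, target)]

q = [1.0, 0.0]
docs = [[0.0, 1.0], [1.0, 1.0]]
scores = [1.0, 3.0]  # hypothetical reranker scores
print(refine_query(q, docs, scores))  # [0.975, 0.1]
```

A second retrieval pass with the refined vector can then surface documents the original query embedding missed, improving Recall@K without touching model parameters.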
We introduce SmartBook, a generalizable automated framework designed to assist human analysts in real-time situation report generation from large news corpora, by generating a structured report with multiple hypotheses (claims) summarized and grounded with rich links to factual evidence.
We introduce FIRST, a novel listwise LLM reranking approach leveraging the output logits of the first generated identifier to directly obtain a ranked ordering of the candidates.
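The core idea of reading a ranking off the first decoding step can be sketched as follows. This is a simplified illustration: it assumes the LLM is prompted with a query and candidates labeled by single-token identifiers ("A", "B", "C"), and that we have already extracted the first-step logits over those identifier tokens; `first_step_logits` below is hypothetical example data, not real model output.

```python
# Hedged sketch of FIRST-style listwise reranking from a single decoding step.
# Instead of generating the full ranked sequence, sort candidate identifiers
# by the logit the model assigns each of them at the first generation step.

def rank_by_first_token_logits(first_step_logits):
    """Return candidate identifiers ordered by descending first-token logit."""
    return sorted(first_step_logits, key=first_step_logits.get, reverse=True)

# Hypothetical logits over identifier tokens at the first decoding step.
first_step_logits = {"A": 1.2, "B": 3.4, "C": 2.1}
print(rank_by_first_token_logits(first_step_logits))  # ['B', 'C', 'A']
```

Because only one forward step is needed rather than autoregressively decoding the whole permutation, this makes listwise LLM reranking substantially cheaper at inference time.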
We introduce any-granularity ranking, which leverages multi-vector embeddings to rank at varying levels of granularity while maintaining encoding at a single (coarser) level of granularity.
We propose to improve an OpenQA model's generalizability across different corpora and domains by mitigating the model's over-memorization of knowledge.
We introduce a framework that enhances the accuracy and context efficiency of retrieval-based LLM personalization through collaborative data refinement. The method also excels in cold-start scenarios.
We introduce the use of progressive response generation to integrate real-time web search results, where a preliminary response buys time for a detailed follow-up, ensuring a smooth user interaction. As a result, our method cuts user waiting times for voice-based chatbots by 50%.
To tackle passive conversations, we propose to integrate social commonsense reasoning for the generation of search queries in knowledge-powered conversations. We leverage a commonsense dialog system to establish connections related to the conversation topic, which subsequently guides an instruction-driven query generation model.
We propose the novel task of summarizing the reactions of different speakers with respect to a given event. We create a new multi-document summarization benchmark, SumREN, along with a pipeline-based framework for summarizing reported speech, which generates summaries that are more abstractive and factually consistent.
We present NewsClaims, a new benchmark for knowledge-aware claim detection that redefines the claim detection problem to include extraction of additional attributes related to the claim. NewsClaims benchmarks claim detection in emerging scenarios, comprising unseen topics with no training data.
We propose a targeted synthetic data generation method that identifies poorly attended entities and conditions generation on them, teaching neural IR models to attend more uniformly and robustly to all entities in a given passage.
We show that synthetic examples generated using a sequence-to-sequence generator can be effective in improving the robustness of neural IR, with gains in both in-domain and out-of-domain scenarios.
We propose a fine-grained claim detection framework that leverages zero-shot question answering using directed questions to solve a diverse set of sub-tasks such as topic filtering, claim object detection, and claimer detection.
We propose a new benchmark for multimedia question answering over news articles and introduce a novel data generation framework for generating questions that are grounded on objects in images and answered using the news body text.
We present COVID-19 Claim Radar, a system that automatically extracts claims relating to COVID-19 in news articles. We provide a comprehensive structured view of such claims, with rich attributes (such as claimers and their affiliations) and associated knowledge elements (such as events, relations and entities).
We explore using a synthetic example generation approach to improve the performance of state-of-the-art open-domain end-to-end question answering systems in a specialized domain, such as COVID-19.
While most previous work addresses fake news detection at the document level, we are the first to propose misinformation detection at the knowledge-element level, which both achieves higher detection accuracy and makes the results more explainable.
We introduce a neuro-symbolic question answering system that leverages AMR for question understanding and uses a pipeline-based approach involving a semantic parser, entity and relationship linkers and a neuro-symbolic reasoner.
We formulate synthetic pre-training tasks that can transfer to downstream tasks, by using structure in unlabeled text. We show considerable gains on multiple tasks in the IT domain: question answering, document ranking and duplicate question detection.
We propose an approach for correcting partial-match answers (EM=0, 0&lt;F1&lt;1) into exact matches (EM=1, F1=1), obtaining up to 1.3% improvement over a RoBERTa-based machine reading comprehension system in both monolingual and multilingual evaluation.
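For readers unfamiliar with the EM/F1 criterion above, here is a minimal sketch of the standard SQuAD-style metrics that define a partial match. This simplified version uses whitespace tokenization with no answer normalization, purely for illustration.

```python
# Hedged sketch of the exact-match (EM) and token-level F1 metrics used to
# identify partial-match answers (EM=0, 0<F1<1), the targets of answer correction.
from collections import Counter

def exact_match(pred, gold):
    """1 if the prediction string equals the gold answer exactly, else 0."""
    return int(pred == gold)

def token_f1(pred, gold):
    """Token-overlap F1 between prediction and gold answer."""
    p, g = pred.split(), gold.split()
    common = Counter(p) & Counter(g)       # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

# A partial-match answer that correction would turn into an exact match:
print(exact_match("the Eiffel Tower", "Eiffel Tower"))  # 0
print(token_f1("the Eiffel Tower", "Eiffel Tower"))     # 0.8
```

Answer correction then rewrites such predictions (e.g., trimming the spurious "the") so that both EM and F1 reach 1.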
We propose self-learning approaches to improve AMR parsers, via generation of synthetic text and synthetic AMR as well as refinement of actions from the oracle. We achieve state-of-the-art performance in AMR parsing on benchmark AMR 1.0 and AMR 2.0 datasets.
We design a novel multi-level memory architecture that retains the natural hierarchy of the knowledge base without breaking it into subject-relation-object triples. We use separate memories for the dialog context and the KB to learn different memory readers.
We design a modular network that uses depth-wise and 1D convolutions for visual reasoning on scientific plots. We achieve state-of-the-art accuracy on the FigureQA dataset, bettering Relation Networks by 7% with over an order of magnitude less training time.
We propose a graph generative model based on probabilistic edge replacement grammars. We design an algorithm to build graph grammars by capturing the statistically significant sub-graph patterns.