Portfolio


Unsupervised Image Captioning using Multi-agent RL

Submitted to AAAI 2026 (November 2024 - July 2025)

The progress of large-scale Vision-Language Models (VLMs) is fundamentally bottlenecked by the immense cost and labor required to create high-quality, human-annotated image-caption datasets. This “data ceiling” necessitates a shift towards unsupervised methods that can learn without paired data.

We reframed unsupervised image captioning as an emergent property of goal-oriented communication. We designed LoGIC (Lewis Communication Game for Image Captioning), a multi-agent reinforcement learning game in which a “Speaker” agent describes an image and a “Listener” agent must identify it from a set of candidates based on that description. Because the agents can only win by communicating successfully, the Speaker’s optimal strategy is to generate maximally informative, precise captions; high-quality image captioning emerges naturally from the agents’ shared goal.
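A minimal sketch of one round of the game (the speaker/listener interfaces and the batch-as-candidates setup are illustrative assumptions, not the paper’s exact implementation):

    import torch

    def play_round(speaker, listener, images, optimizer):
        # Speaker describes each image; log_probs are the caption token
        # log-likelihoods, shape (B, T). The generate/match interfaces are assumed.
        captions, log_probs = speaker.generate(images)
        # Listener tries to pick each caption's image out of the whole batch.
        scores = listener.match(captions, images)        # (B, B) similarity matrix
        guesses = scores.argmax(dim=-1)
        targets = torch.arange(len(images), device=scores.device)
        reward = (guesses == targets).float()            # 1 if communication succeeded
        # REINFORCE: reinforce caption tokens in proportion to communication success.
        loss = -(reward.detach() * log_probs.sum(dim=-1)).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return reward.mean()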

Fine-tuning a state-of-the-art VLM with LoGIC improved its BLEU score from 44 to 46 with zero additional labeled data. When trained from scratch, our system achieved a 31 BLEU score, surpassing the previous state of the art in fully unsupervised image captioning by a 10-point margin.


Offline Goal-Conditioned Visual RL with Large Language Models

Submitted to AAAI 2026 (May 2024 - July 2025)

Training embodied AI agents (e.g., robots) to follow natural language instructions using online Reinforcement Learning is prohibitively expensive. These methods require hundreds of millions of interactions with the environment and massive computational resources (e.g., multi-GPU nodes), making them impractical for real-world deployment and inaccessible to most researchers.

This work proposes a radical shift from an online to an offline learning strategy to solve this resource-intensive problem.

We developed CUVA (Contrastive Universal Value-function Approximation), a novel offline, hierarchical RL algorithm. CUVA adapts a pre-trained, frozen Large Language Model (LLM) to accept text instructions and visual observations and output actions. It learns to infer an implicit sequence of sub-goals from the instruction and uses contrastive learning to train a value function.

A new community resource: to enable this research, we collected and released a massive, high-quality offline dataset containing ~5 million environment interactions from expert, intermediate, and novice policies on the Language-Conditioned Rearrangement benchmark, totaling ~750GB of data.
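A minimal sketch of the contrastive value-learning idea, assuming an InfoNCE-style objective in which states are matched to the goals they eventually reach (the encoder names and the exact loss are assumptions):

    import torch
    import torch.nn.functional as F

    def contrastive_value_loss(state_enc, goal_enc, states, reached_goals, temperature=0.1):
        # state_enc / goal_enc are assumed heads on top of the frozen LLM backbone.
        s = F.normalize(state_enc(states), dim=-1)        # (B, D)
        g = F.normalize(goal_enc(reached_goals), dim=-1)  # (B, D)
        # Diagonal entries are positive (state, reached-goal) pairs; off-diagonal
        # entries are in-batch negatives pushed apart by the loss.
        logits = s @ g.T / temperature                    # (B, B)
        labels = torch.arange(len(states), device=logits.device)
        return F.cross_entropy(logits, labels)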

The offline approach not only proved viable but surpassed the online methods it was built upon. Using the collected offline data, CUVA achieved a 51.2% success rate on out-of-distribution generalization tasks, outperforming the 46.9% success rate of the online-trained policies that were used to generate the data in the first place. This state-of-the-art result was achieved with minimal computational cost, trainable on a single consumer-grade GPU, democratizing research in this domain.


Representation Learning for Visual RL using Fourier Neural Operators

In Proceedings of ICCV 2025 (December 2023 - March 2025)

In visual reinforcement learning, Convolutional Neural Networks (CNNs) are the de facto standard for encoding image observations. However, they are notoriously sensitive to hyperparameters and, more critically, to the dimensions of the input image. This brittleness means that a model trained on low-resolution images will fail when presented with high-resolution images during inference—a common domain shift problem—necessitating costly retraining or error-prone resizing.

This work introduces a paradigm shift by proposing a Fourier Neural Operator (FNO) as a drop-in replacement for the CNN encoder. Unlike a CNN, which operates on pixels, an FNO operates in the frequency domain and learns a resolution-invariant representation. It does this by approximating the underlying Partial Differential Equations (PDEs) that govern the environment’s dynamics. By learning the “physics” of the environment rather than just its visual patterns, the FNO-encoder is inherently robust to changes in discretization (i.e., image size).
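A minimal PyTorch sketch of the core spectral-convolution layer (simplified; a full FNO encoder stacks several such layers with pointwise convolutions). Because the learned weights live on a fixed number of low-frequency Fourier modes, the same layer applies unchanged to 48×48 or 224×224 inputs:

    import torch
    import torch.nn as nn

    class SpectralConv2d(nn.Module):
        def __init__(self, in_ch, out_ch, modes=12):
            super().__init__()
            self.modes = modes  # number of low-frequency Fourier modes kept
            scale = 1.0 / (in_ch * out_ch)
            self.w = nn.Parameter(
                scale * torch.randn(in_ch, out_ch, modes, modes, dtype=torch.cfloat))

        def forward(self, x):                       # x: (B, C, H, W), any H and W
            B, C, H, W = x.shape
            xf = torch.fft.rfft2(x)                 # to the frequency domain
            out = torch.zeros(B, self.w.shape[1], H, W // 2 + 1,
                              dtype=torch.cfloat, device=x.device)
            m = self.modes
            # Mix channels on the retained low-frequency modes only.
            out[:, :, :m, :m] = torch.einsum("bixy,ioxy->boxy", xf[:, :, :m, :m], self.w)
            return torch.fft.irfft2(out, s=(H, W))  # back to pixels at the input resolution

Note how the weight tensor’s shape depends only on the number of retained modes, never on H or W; this is what makes the encoder indifferent to input resolution.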

The FNO-encoder demonstrated superior performance and flexibility across multiple benchmarks. It achieved state-of-the-art results (in the model-free setting) on both the CARLA Autonomous Driving benchmark and the Atari 100k benchmark. It was successfully integrated into three distinct families of RL algorithms (PPO, A2C, and Rainbow), improving both final rewards and sample efficiency without requiring any additional hyperparameter tuning. Crucially, it demonstrated true zero-shot domain adaptation, where a policy trained on low-resolution images (48×48) maintained its high performance when evaluated on high-resolution images (224×224) without any fine-tuning.


Active Reinforcement Learning for Offline Policy Improvement

In Proceedings of AAAI 2025 (December 2021 - August 2024)

Reinforcement Learning agents often require vast amounts of interaction to learn, which is expensive or dangerous in real-world scenarios like robotics or autonomous driving. Offline RL learns from fixed datasets, but these are often incomplete. The key question is: with a limited budget for new interactions, how can an agent intelligently collect the most valuable data to improve its policy?

We developed a novel active RL method that directs exploration towards areas of highest uncertainty. By training an ensemble of state representation models, we use the disagreement between their outputs as a proxy for the agent’s uncertainty. The agent then preferentially collects new data from these high-uncertainty states, ensuring every new interaction is maximally informative and efficiently fills the most critical gaps in its knowledge.
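A minimal sketch of the disagreement signal, assuming an ensemble of independently trained representation models:

    import torch

    def disagreement(encoders, obs):
        # encoders: list of K independently trained state-representation models.
        with torch.no_grad():
            zs = torch.stack([enc(obs) for enc in encoders])  # (K, B, D)
        # High variance across ensemble members = high epistemic uncertainty;
        # the agent preferentially collects new data from these states.
        return zs.var(dim=0).mean(dim=-1)                     # (B,) per-state score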

Our method reduced the required amount of additional online interaction by up to 75% compared to standard fine-tuning approaches. The framework’s robustness was validated across diverse and complex environments, including MuJoCo, AntMaze, and high-fidelity simulators like CARLA (autonomous driving) and IsaacSim (robotics).


Causal Feature Alignment for Robust Computer Vision

In Proceedings of WACV 2024 (October 2022 - September 2023)

Deep neural networks often learn “shortcuts” by exploiting spurious correlations in training data (e.g., associating “cow” with “green pasture” instead of the cow’s actual features). This leads to poor out-of-distribution (OOD) generalization and catastrophic failures in real-world deployment.

We introduced Causal Feature Alignment (CFA), a method that forces a network to focus only on the causal foreground object. After initial training, the model is fine-tuned with an alignment objective: its internal representation of the full image must match the representation of a masked version containing only the foreground object. This penalizes the model for encoding background information. We made this practical by using the Segment Anything Model (SAM) to automate mask generation.
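A minimal sketch of the fine-tuning objective (the module names and the MSE form of the alignment term are illustrative assumptions):

    import torch.nn.functional as F

    def cfa_loss(model, image, foreground_mask, labels, alpha=1.0):
        # foreground_mask: SAM-generated binary mask; background pixels zeroed out.
        z_full = model.features(image)
        z_fg = model.features(image * foreground_mask)
        align = F.mse_loss(z_full, z_fg)   # penalize encoding background information
        task = F.cross_entropy(model.head(z_full), labels)
        return task + alpha * align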

On the challenging Waterbirds dataset, CFA improved the worst-group accuracy from a baseline of 74% to an unprecedented 93%, setting a new state of the art. The method’s effectiveness was confirmed with significant gains on other benchmarks like Colored-MNIST and the ImageNet Backgrounds Challenge.


Deep Representation Learning for Predicting Temporal Event Sets

In Proceedings of ACML 2023 (December 2021 - August 2022)

Many real-world phenomena occur not as single events, but as sets of events (e.g., a patient with multiple diagnoses, a customer buying a basket of items). Traditional forecasting models like Temporal Point Processes (TPPs) are designed for single events and struggle with the combinatorial explosion of modeling event sets.

We introduced TESET, the first principled framework to model and predict the intensity of event sets in continuous time. It uses a self-supervised contrastive learning objective to learn rich vector embeddings for individual events, where frequently co-occurring events are pulled closer together. The representation for an event set is then formed by aggregating these embeddings, and a transformer model processes the sequence of sets to learn temporal dynamics.
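A minimal sketch of the set-then-sequence architecture (the mean pooling and module layout are illustrative assumptions):

    import torch
    import torch.nn as nn

    class EventSetModel(nn.Module):
        def __init__(self, num_events, dim=128, heads=4, layers=2):
            super().__init__()
            self.event_emb = nn.Embedding(num_events, dim)  # pre-trained contrastively
            block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            self.temporal = nn.TransformerEncoder(block, layers)

        def forward(self, event_sets):  # list of LongTensors, one set per time step
            # Aggregate each set by mean-pooling its event embeddings.
            set_reprs = torch.stack([self.event_emb(s).mean(dim=0) for s in event_sets])
            # The transformer then models temporal dynamics over the sequence of sets.
            return self.temporal(set_reprs.unsqueeze(0))    # (1, T, dim)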

As a foundational work, TESET established a new, highly effective, and scalable methodology for a problem that previously lacked a principled solution. Experiments on multiple real-world datasets showed that TESET significantly outperforms existing methods in both prediction accuracy and computational efficiency.


Contextually Regularized Self-Supervised Hate Speech Detection (CRUSH)

In Proceedings of NAACL 2022 (July 2021 - November 2021)

Detecting hate speech is notoriously difficult, especially in code-mixed text where meaning is fluid and context-dependent. Models need to understand the relational context between posts, comments, and replies in a social network thread.

We developed CRUSH, a framework that integrates graph neural networks (GNNs) with pre-trained transformers. The GNNs explicitly model the conversational structure of social media, while transformers provide powerful language understanding. By training them end-to-end, the model learns to combine linguistic features with conversational context for superior classification.
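A minimal sketch of the architecture, assuming one graph node per post or comment and a torch_geometric GCN layer over the reply structure (the multilingual backbone is an assumption made here for code-mixed text):

    import torch.nn as nn
    from transformers import AutoModel
    from torch_geometric.nn import GCNConv

    class ThreadClassifier(nn.Module):
        def __init__(self, num_classes=2, hidden=256):
            super().__init__()
            self.lm = AutoModel.from_pretrained("bert-base-multilingual-cased")
            self.gnn = GCNConv(self.lm.config.hidden_size, hidden)
            self.head = nn.Linear(hidden, num_classes)

        def forward(self, input_ids, attention_mask, edge_index):
            # One node per post/comment: use the [CLS] embedding as the node feature.
            node_feats = self.lm(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
            # Propagate linguistic features over the thread's reply structure.
            node_feats = self.gnn(node_feats, edge_index).relu()
            return self.head(node_feats)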

CRUSH provides a more robust, nuanced, and context-aware detection system that moves beyond simple text classification to a more sophisticated form of contextual analysis, making it more effective in the complex environment of online social networks.


CoviHawkes: District-Wise Daily COVID-19 Case Prediction

In association with StatsML Group, IISc (March 2021 – June 2021)

I was part of the team that developed a temporal model for forecasting daily COVID-19 cases across Indian districts. We used graph-based models incorporating inter- and intra-district mobility features to predict case counts with low error. The project, which was featured on the IISc homepage, provided actionable forecasts to help policymakers enact targeted, localized lockdowns instead of disruptive nationwide measures. I also contributed to the public-facing website and poster designs.


Few-Shot Transfer Learning for Spine Segmentation

In collaboration with StatsML Group (IISc), MedImg Group (IISc), and GE-Healthcare (August 2020 – January 2021)

To address the high cost of medical image annotation, I worked on two methods for spine segmentation in a low-data regime. Our most successful approach involved pre-training a network on partially annotated data and fine-tuning it on a small, fully annotated dataset. This method achieved a Dice similarity coefficient of 0.86 using as few as 20 complete CT scans, demonstrating a substantial performance gain with minimal supervision.
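For reference, the Dice similarity coefficient behind the 0.86 figure measures mask overlap as 2|A∩B| / (|A| + |B|); a minimal implementation:

    import numpy as np

    def dice(pred, target, eps=1e-8):
        # pred, target: binary segmentation masks of the same shape.
        pred, target = pred.astype(bool), target.astype(bool)
        return 2.0 * np.logical_and(pred, target).sum() / (pred.sum() + target.sum() + eps)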


Actively Reducing Redundancies in Active Learning for NLP

In Proceedings of NAACL 2021 (August 2020 - November 2020)

Active Learning (AL) aims to reduce annotation costs, but many strategies suffer from redundancy, repeatedly selecting similar examples and wasting the annotation budget. This is especially problematic in tasks like Neural Machine Translation (NMT).

We developed an enhanced AL strategy that explicitly targets and eliminates redundancy. We used a Siamese network to map sentences to an embedding space where similar sentences are clustered together. The AL selection process then picks representative sentences from diverse clusters, ensuring the annotated data is both informative (high uncertainty) and diverse.
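A minimal sketch of the selection step, assuming k-means clustering over the Siamese embeddings (the specific clustering algorithm is an assumption):

    import numpy as np
    from sklearn.cluster import KMeans

    def select_batch(embeddings, uncertainties, budget):
        # Cluster the unlabeled pool so each pick comes from a distinct region.
        clusters = KMeans(n_clusters=budget, n_init=10).fit_predict(embeddings)
        picks = []
        for c in range(budget):
            members = np.where(clusters == c)[0]
            # Within each cluster, take the most uncertain sentence:
            # informative (high uncertainty) and diverse (one per cluster).
            picks.append(members[np.argmax(uncertainties[members])])
        return picks  # indices of sentences to send for annotation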

Our method demonstrated superior data efficiency, achieving higher performance on machine translation and other NLP tasks after consuming the same amount of labeled data as existing AL methods.


Explainable Classification of Mitochondria Microscope Images

In collaboration with StatsML Group (IISc) and CML (IISc/UNSW) (June 2020 – October 2020)

In biological sciences, explainability is as important as accuracy. I developed a novel, lightweight convolutional architecture (4M parameters) for classifying mitochondria images. The model not only achieved 93% accuracy but also produced highly coherent saliency maps, making its predictions transparent to researchers. This stood in contrast to a non-explainable Fast R-CNN model that achieved 96% accuracy, quantifying the trade-off between accuracy and explainability.
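As an illustration of the kind of saliency map involved, a vanilla gradient-saliency computation (a standard technique, not necessarily the exact method used in the project):

    import torch

    def saliency_map(model, image, target_class):
        # image: (C, H, W) tensor; gradients flow back to the input pixels.
        image = image.detach().clone().requires_grad_(True)
        score = model(image.unsqueeze(0))[0, target_class]
        score.backward()
        # Pixels whose perturbation most affects the class score light up.
        return image.grad.abs().max(dim=0).values  # (H, W) saliency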


Deepfake Detection in High-Compression Videos

Deep Learning for Computer Vision Course Project (February 2020 - June 2020)

Video compression artifacts make deepfake detection incredibly challenging. I developed a simple but effective CNN architecture (inspired by a DCGAN discriminator) that detected 92% of forged videos by sampling just 6-10 frames, outperforming existing methods by over 9%. Incorporating sequential information between frames pushed the accuracy to 95%.


Polymorphic Malware Detection Using Deep Learning

Systems Security Course Project (September 2019 - November 2019)

Polymorphic malware evades detection by constantly changing its signature. As a proof of concept, I explored a novel detection method by converting malware binaries into grayscale images and training a CNN classifier on them. By fine-tuning the model on a custom-built polymorphic malware sample, I demonstrated that this probabilistic deep learning approach could identify threats that deterministic methods would miss.
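The byte-to-pixel conversion at the heart of the approach, as a minimal sketch (the row width is an arbitrary assumption):

    import numpy as np

    def binary_to_image(path, width=256):
        # Each byte of the binary becomes one grayscale pixel intensity.
        with open(path, "rb") as f:
            data = np.frombuffer(f.read(), dtype=np.uint8)
        height = len(data) // width
        return data[: height * width].reshape(height, width)  # (H, W) uint8 "texture"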


Prediction of Colon Cancer Based on Genome Features

B.Tech. Final Year Project (September 2018 - May 2019)

This project tackled a classic bioinformatics challenge: a dataset with over 2000 genomic features but only 62 observations. My work focused on a comparative study of dimensionality reduction techniques (PCA, LDA, PLS, etc.) combined with various classifiers (SVM, Random Forest, etc.). The most effective pipeline involved reducing the data to 40 features with PLS and using a Linear SVM for classification.
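A scikit-learn reconstruction of the final pipeline (an illustrative equivalent, not the original code; the data here is a random stand-in):

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.svm import LinearSVC

    # Stand-in for the real data: 62 samples x 2000+ genomic features.
    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(62, 2000)), rng.integers(0, 2, size=62)

    pls = PLSRegression(n_components=40).fit(X, y)  # supervised projection to 40 components
    svm = LinearSVC().fit(pls.transform(X), y)      # linear SVM on the reduced features
    preds = svm.predict(pls.transform(X))

Unlike PCA, PLS uses the labels when choosing components, which matters when observations are this scarce relative to the feature count.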


Providing Relevant Answers to Search Engine Queries

Microsoft AI Challenge (November 2018 - December 2018)

The goal was to provide precise, direct answers to factual search queries. I developed a pipelined model using pre-trained transformers and a multi-hop attention network. The system first ranked relevant text passages from search results and then used a Machine Reading Comprehension (MRC) component to extract the exact answer span. Our model significantly outperformed most others in the competition.


Social Media Sentiment Analysis

B.Tech. Extra Credit Project (August 2018 - September 2018)

This was a comparative study on a dataset of 1.6 million tweets to benchmark the performance of various recurrent neural network models. I implemented and evaluated a range of architectures, from basic frequency-based classifiers to RNNs, LSTMs, GRUs, and finally attention-based models, which achieved the highest accuracy.


Swarm Intelligence for Drone Reconnaissance

DRDO Robotics and Unmanned Systems Exposition (October 2017 - March 2018)

I proposed an intelligent system to control a swarm of drones as a single, cohesive unit. Using Particle Swarm Optimization, individual drones could move collaboratively without collision. The architecture was designed with a “hive-mind” intelligence, enabling group reinforcement learning and shared computational resources, creating a scalable platform for advanced surveillance tasks.
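The textbook particle swarm update underlying the coordination scheme (illustrative; the collision-avoidance and hive-mind layers are omitted):

    import numpy as np

    def pso_step(pos, vel, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
        # pos, vel, pbest: (n_drones, dims); gbest: (dims,) swarm-wide best point.
        r1, r2 = np.random.rand(*pos.shape), np.random.rand(*pos.shape)
        # Each drone blends inertia, its own best position, and the swarm's best.
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        return pos + vel, vel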


MERN Full-Stack Web Development (Internship)

National Informatics Center (January 2018 - March 2018)

During this internship, I was involved in the ground-up development of a dynamic and responsive progressive web application. I worked across the entire MERN stack, building a RESTful API backend with NodeJS, ExpressJS, and MongoDB and a reactive single-page application frontend using ReactJS.


Vehicle Numberplate Recognition at Security Checkpoint

Project with Indian Oil Corporation Limited (September 2017 - November 2017)

I developed a system to automate security logging by reading vehicle number plates from live camera footage. The system used convolutional filters to detect the number-plate region, a Single Shot Detector (SSD) for OCR, and a modified locking protocol for database synchronization.


Autonomous ATV Movement Based on Live Camera Feed

ISRO’s National Student Space Challenge (June 2017 - September 2017)

For this challenge, I built an autonomous ATV that could navigate a terrain using only the feed from an overhead camera. The system used A* search for optimal pathfinding, various image processing techniques for real-time perception (detecting the ATV’s position, obstacles, and targets), and stepper motors for precise hardware control.
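A compact grid A* with a Manhattan heuristic, illustrating the pathfinding component (a generic reimplementation, not the original code):

    import heapq

    def astar(grid, start, goal):  # grid: 2D list, 0 = free cell, 1 = obstacle
        h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # Manhattan heuristic
        frontier = [(h(start), start)]
        came, cost = {start: None}, {start: 0}
        while frontier:
            _, cur = heapq.heappop(frontier)
            if cur == goal:  # walk parent links back to recover the path
                path = []
                while cur is not None:
                    path.append(cur)
                    cur = came[cur]
                return path[::-1]
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nxt = (cur[0] + dx, cur[1] + dy)
                if (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
                        and not grid[nxt[0]][nxt[1]]):
                    g = cost[cur] + 1
                    if g < cost.get(nxt, float("inf")):
                        cost[nxt], came[nxt] = g, cur
                        heapq.heappush(frontier, (g + h(nxt), nxt))
        return None  # goal unreachable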