Prior to MIT, I was an MS-Research student at the CMU Robotics Institute studying Artificial Intelligence and Robotics, advised by Prof. David Held. I also worked at Amazon as an Applied Scientist II.
My current research focuses on trustworthy AI and autonomous systems. Specifically, I design algorithms for machines to learn representations that generalize more robustly to the real world and admit better certifiability. My research revolves around learning-based perception and robotic systems.
We introduce CUPS, a novel method for learning sequence-to-sequence 3D human shapes and poses from RGB videos with uncertainty quantification. To improve upon prior work, we develop a method to score the multiple hypotheses proposed during training, effectively integrating uncertainty into the learning process. This yields a deep uncertainty function that is trained end-to-end with the 3D pose estimator. Post-training, the learned deep uncertainty model is used as the conformity score. Since the data in human pose-shape learning is not fully exchangeable, we also provide two practical bounds for the coverage gap in conformal prediction, giving theoretical backing for the uncertainty bound of our model.
We introduce CHAMP, a novel method for learning sequence-to-sequence, multi-hypothesis 3D human poses from 2D keypoints by leveraging a conditional distribution with a diffusion model. To predict a single output 3D pose sequence, we generate and aggregate multiple 3D pose hypotheses. For better aggregation results, we develop a method to score these hypotheses during training, effectively integrating conformal prediction into the learning process. This process results in a differentiable conformal predictor that is trained end-to-end with the 3D pose estimator. Post-training, the learned scoring model is used as the conformity score, and the 3D pose estimator is combined with a conformal predictor to select the most accurate hypotheses for downstream aggregation.
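The select-then-aggregate step can be sketched in a few lines. This is a simplified illustration under assumed shapes (20 sampled hypotheses of a 17-joint pose), with random numbers standing in for diffusion samples and learned conformity scores; the fallback rule is my own choice, not necessarily the paper's.

```python
import numpy as np

def aggregate_hypotheses(hypotheses, scores, tau):
    """Average the hypotheses whose conformity score passes the
    calibrated threshold tau; fall back to the single best-scoring
    hypothesis if none pass."""
    keep = scores <= tau
    if not keep.any():
        keep = scores == scores.min()
    return hypotheses[keep].mean(axis=0)

rng = np.random.default_rng(1)
# 20 sampled 3D-pose hypotheses for one frame (17 joints x 3 coords),
# standing in for samples from the conditional diffusion model.
hyps = rng.normal(size=(20, 17, 3))
# Illustrative conformity scores: lower means more conforming.
scores = rng.uniform(size=20)
pose = aggregate_hypotheses(hyps, scores, tau=0.3)
```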
We consider the problem of estimating object pose and shape from an RGB-D image. Our first contribution is to introduce CRISP, a category-agnostic object pose and shape estimation pipeline. The pipeline implements an encoder-decoder model for shape estimation. It uses FiLM-conditioning for implicit shape reconstruction and a DPT-based network for estimating pose-normalized points for pose estimation. As a second contribution, we propose an optimization-based pose and shape corrector that can correct estimation errors caused by a domain gap.
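FiLM (feature-wise linear modulation) itself is simple: each feature channel is scaled and shifted by parameters predicted from a conditioning code. Below is a minimal sketch with made-up dimensions (64 feature channels, a 16-dim shape latent) and plain linear maps standing in for the learned modulation networks.

```python
import numpy as np

def film(features, cond, W_gamma, W_beta):
    """Feature-wise linear modulation: scale and shift each feature
    channel by parameters predicted from a conditioning code."""
    gamma = cond @ W_gamma   # per-channel scale, shape (1, 64)
    beta = cond @ W_beta     # per-channel shift, shape (1, 64)
    return gamma * features + beta

rng = np.random.default_rng(3)
feat = rng.normal(size=(4, 64))    # e.g. per-query-point features
cond = rng.normal(size=(1, 16))    # e.g. a shape latent code
out = film(feat, cond,
           rng.normal(size=(16, 64)),
           rng.normal(size=(16, 64)))
```

In an implicit-shape decoder, conditioning each layer this way lets a single network represent many shapes, selected by the latent code.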
We investigate a variation of the 3D registration problem, named multi-model 3D registration. In the multi-model registration problem, we are given two point clouds picturing a set of objects at different poses (and possibly including points belonging to the background), and we want to simultaneously reconstruct how all objects moved between the two point clouds.
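One building block of any such pipeline is the single-object rigid fit: given corresponding points before and after motion, recover the rotation and translation in closed form (Arun's SVD method). The sketch below assumes the per-object correspondence groups are already known; the hard part of multi-model registration, assigning points to objects, is not shown.

```python
import numpy as np

def fit_rigid(P, Q):
    """Arun's method: least-squares R, t such that Q ~= P @ R.T + t."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Reflection correction keeps R a proper rotation (det = +1).
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    return R, cq - R @ cp

rng = np.random.default_rng(2)
motions = []
# Two objects in the same scene, each moving with its own rigid motion.
for _ in range(2):
    P = rng.normal(size=(50, 3))
    R_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(R_true) < 0:
        R_true[:, 0] *= -1             # force a proper rotation
    t_true = rng.normal(size=3)
    Q = P @ R_true.T + t_true
    motions.append((fit_rigid(P, Q), (R_true, t_true)))
```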
We explore another method to perceive and manipulate 3D articulated objects that generalizes to enable the robot to articulate unseen classes of objects.
We conjecture that the task-specific pose relationship between relevant parts of interacting objects is a generalizable notion of a manipulation task that can transfer to new objects. We call this task-specific pose relationship "cross-pose". We propose a vision-based system that learns to estimate the cross-pose between two objects for a given manipulation task.
We explore a novel method to perceive and manipulate 3D articulated objects that generalizes to enable the robot to articulate unseen classes of objects.
We present AVPLUG (Approach Vector PLanning for Unicontact Grasping), an algorithm that efficiently finds a grasp approach vector using an octree occupancy model and Minkowski sum computation to maximize information gain.
We propose a self-supervised learning framework that enables a UR5 robot to perform these three tasks. The framework finds a 3D apex point for the robot arm, which, together with a task-specific trajectory function, defines an arcing motion that dynamically manipulates the cable to perform tasks with varying obstacle and target locations.
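One simple way to turn an apex point into an arcing motion is a quadratic Bezier curve that starts and ends at given points and passes through the apex at its midpoint. This is only an illustrative parameterization; the paper's task-specific trajectory functions may differ.

```python
import numpy as np

def arc_trajectory(start, apex, end, n=51):
    """Quadratic Bezier arc from start to end whose control point is
    chosen so the curve passes through apex at t = 0.5."""
    start, apex, end = map(np.asarray, (start, apex, end))
    ctrl = 2.0 * apex - 0.5 * (start + end)
    t = np.linspace(0.0, 1.0, n)[:, None]
    return (1 - t) ** 2 * start + 2 * (1 - t) * t * ctrl + t ** 2 * end

# Hypothetical waypoints (meters): gripper start, 3D apex, target.
traj = arc_trajectory([0.0, 0.0, 0.0], [0.4, 0.0, 0.6], [0.8, 0.0, 0.0])
```

Varying only the apex point (and the task-specific trajectory shape) then adapts the same motion to different obstacle and target locations.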
We present a distributed pipeline, Dex-Net AR, that allows point clouds to be uploaded to a server in our lab, cleaned, and evaluated by the Dex-Net grasp planner to generate a grasp axis that is returned and displayed as an overlay on the object.