I am a fifth-year Ph.D. student in the Media Lab, Department of Electrical Engineering, at The City College of New York, CUNY, advised by Professor Ying-Li Tian.
My current research focuses on video analysis, including human action recognition and self-supervised video feature learning. Before joining CUNY, I received my Bachelor's degree from Central South University.
Our work on 3D semi-supervised learning has been accepted to WACV 2023.
Our work on monocular depth estimation has been accepted to ECCV 2022.
Our work on semi-supervised video classification has been accepted to CVPR 2022.
Our work on monocular long-range depth estimation has been accepted to ICLR 2022.
Our work on monocular 3D detection and tracking has been accepted to ICRA 2022.
Research
I am interested in computer vision, machine learning, and image processing. Most of my research concerns video analysis, including human action recognition, self-supervised video feature learning, and video feature learning from noisy data. I have also worked on weakly supervised semantic segmentation and on lung nodule segmentation in CT scans with Generative Adversarial Networks.
We propose a novel loss function, the cross-modal center loss, to learn modality-invariant features with minimal modality discrepancy for multi-modal data (a rough sketch of this style of objective follows the 3D items below).
We propose to jointly learn modality-invariant and view-invariant features for 3D data across different modalities, including images, point clouds, and meshes, using heterogeneous networks.
We propose a novel and effective self-supervised learning approach that jointly learns 2D image features and 3D point cloud features by exploiting cross-modality and cross-view correspondences, without using any human-annotated labels.
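For intuition, a center-loss-style objective shared across modalities can be sketched as follows; the notation is illustrative only, not taken from the paper:

```latex
% Illustrative form of a cross-modal center loss (notation assumed, not from the paper):
% f_m is the encoder for modality m (e.g. image, point cloud, or mesh),
% x_i^m is sample i observed in modality m, and c_{y_i} is a learnable class
% center for sample i's label, shared by all modalities.
L_{\mathrm{cmc}} \;=\; \frac{1}{2} \sum_{i=1}^{N} \sum_{m \in \mathcal{M}}
\left\lVert f_m\!\left(x_i^{m}\right) - c_{y_i} \right\rVert_2^{2}
```

Pulling every modality's embedding toward a single shared class center is what encourages the learned features to be modality-invariant.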
We present our solutions to the image-based vehicle re-identification track and the multi-camera vehicle tracking track of the AI City Challenge 2019 (AIC2019). Our proposed framework outperforms the state-of-the-art vehicle ReID method by 16.3% on the VeRi dataset.
We propose a novel deep-learning lung segmentation scheme for CT scans based on Generative Adversarial Networks (GANs), redesigning the discriminator's loss function to obtain more accurate results.
We propose an efficient and straightforward approach that captures the overall temporal dynamics of an entire video in a single pass for action recognition.
We propose an efficient and effective action recognition framework by combining multiple feature representations, derived from dynamic images, optical flow, and raw frames, with a 3D ConvNet.
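For context, one common way to summarize an entire video in a single pass is the "dynamic image": a temporally weighted sum of frames that a 2D network can then process like an ordinary image. The sketch below is a minimal, simplified illustration of that idea (linear approximate rank-pooling weights; the function name, shapes, and weighting scheme are my own illustration and not necessarily the exact formulation used in these papers).

```python
import numpy as np

def dynamic_image(frames):
    """Collapse a whole video into a single image via a simple
    rank-pooling-style weighted sum (a simplified 'dynamic image').

    Weights grow linearly with time, so later frames contribute
    positively and earlier frames negatively, encoding motion order.

    frames: ndarray of shape (T, H, W, C), e.g. decoded RGB frames.
    returns: ndarray of shape (H, W, C) summarizing the temporal dynamics.
    """
    T = frames.shape[0]
    # Linear temporal weights: alpha_t = 2t - T - 1 for t = 1..T
    alphas = 2.0 * np.arange(1, T + 1) - T - 1
    # Weighted sum over the time axis
    di = np.tensordot(alphas, frames.astype(np.float32), axes=(0, 0))
    # Rescale to [0, 255] so the result can be fed to a ConvNet like a frame
    di -= di.min()
    if di.max() > 0:
        di *= 255.0 / di.max()
    return di.astype(np.uint8)

# Example: 16 random frames of size 112x112x3
video = np.random.randint(0, 256, size=(16, 112, 112, 3), dtype=np.uint8)
summary = dynamic_image(video)
print(summary.shape)  # (112, 112, 3)
```

A summary image like this can be used alongside optical flow and raw frames as one of several input representations for an action recognition network.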