I am currently a research scientist at Meta. I graduated from the Department of Computer Science at the University of Maryland, College Park, where I was advised by Prof. Abhinav Shrivastava. I obtained my Bachelor's degree from the University of Chinese Academy of Sciences, China, in 2018.
I'm interested in video-related tasks, especially video understanding, video compression, and multimodal learning.
We propose a memory-augmented large multimodal model for efficient and effective long-term video understanding. Our model achieves state-of-the-art performance across multiple tasks, including long-video understanding, video question answering, and video captioning.
We seek to unify the output space of video understanding tasks by using language as labels and additionally introducing time and box tokens. This enables us to address various types of video tasks, including classification (such as action recognition), captioning (covering clip captioning, video question answering, and dense video captioning), and localization (such as visual object tracking), within a fully shared encoder-decoder architecture following a generative framework.
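As an illustration, here is a minimal sketch (hypothetical, not the released code) of how different task targets could be serialized into a single token sequence; the quantize helper, bin count, and the <time_k>/<box_k> token names are assumptions made here for clarity.

```python
# A minimal sketch (not the released code) of serializing different video tasks
# into one token sequence, assuming a hypothetical vocabulary with added
# <time_k> and <box_k> tokens for quantized timestamps and box coordinates.

def quantize(value, low, high, num_bins=1000):
    """Map a continuous value in [low, high] to a discrete bin index."""
    value = min(max(value, low), high)
    return int((value - low) / (high - low) * (num_bins - 1))

def time_token(t, duration):
    return f"<time_{quantize(t, 0, duration)}>"

def box_token(coord):
    return f"<box_{quantize(coord, 0.0, 1.0)}>"

# Classification (action recognition): the label is plain text.
target_cls = "playing guitar"

# Dense video captioning: interleave time tokens with caption text.
target_caption = f"{time_token(3.2, 60)} {time_token(11.8, 60)} a man strums a guitar"

# Visual object tracking: emit quantized box tokens for the query frame.
target_track = " ".join(box_token(c) for c in (0.21, 0.35, 0.58, 0.77))

print(target_cls)
print(target_caption)
print(target_track)
```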
We introduce a fine-grained visual instruction dataset, LVIS-Instruct4V, which contains 220K visually aligned and context-aware instructions produced by prompting the powerful GPT-4V with images from LVIS. Notably, by simply replacing LLaVA-Instruct with our LVIS-Instruct4V, we achieve better results than LLaVA on most challenging LMM benchmarks.
We focus on the task of cutting objects in different styles and the resulting object state changes. We propose a new benchmark suite, Chop & Learn, to accommodate the needs of learning objects and different cut styles from multiple viewpoints.
We propose D-NeRV, a novel implicit neural representation-based framework designed to encode large-scale and diverse videos. It achieves state-of-the-art performance on video compression.
We propose A2Summ, a novel supervised multimodal summarization framework that summarizes video frames and text sentences with time correspondence. We also collect a large-scale multimodal summarization dataset, BLiSS, which contains livestream videos and transcribed texts with annotated summaries.
We propose a neural visual representation with content-adaptive embedding, which combines the generalizability of autoencoders with the simplicity and compactness of implicit representations. We match the performance of NeRV, a state-of-the-art implicit neural representation, on reconstructing frames seen during training, while far surpassing it on unseen frames that are skipped during training.
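Below is a minimal PyTorch sketch of this hybrid design, assuming toy layer sizes and channel counts rather than the paper's actual architecture: a small encoder produces a compact, content-adaptive embedding per frame, and a decoder reconstructs the frame from it, so the stored representation is the decoder weights plus the tiny embeddings.

```python
# A minimal PyTorch sketch (illustrative only; layer sizes are assumptions, not the
# paper's architecture) of a hybrid video representation: a small encoder produces a
# compact, content-adaptive embedding per frame, and a decoder maps it back to the frame.
import torch
import torch.nn as nn

class HybridNeRVSketch(nn.Module):
    def __init__(self, embed_dim=16):
        super().__init__()
        # Content-adaptive encoder: frame -> tiny embedding (part of the stored representation).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=4), nn.GELU(),
            nn.Conv2d(32, embed_dim, 4, stride=4),
        )
        # Decoder: embedding -> reconstructed frame (upsampling conv stack).
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(embed_dim, 32, 4, stride=4), nn.GELU(),
            nn.ConvTranspose2d(32, 3, 4, stride=4), nn.Sigmoid(),
        )

    def forward(self, frame):
        emb = self.encoder(frame)      # compact per-frame embedding
        return self.decoder(emb)       # frame reconstruction

frames = torch.rand(4, 3, 128, 128)    # dummy video clip
model = HybridNeRVSketch()
recon = model(frames)
loss = nn.functional.mse_loss(recon, frames)
print(recon.shape, loss.item())
```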
We address the challenge of label sparsity in semantic correspondence by enriching supervision signals from sparse keypoint annotations. We first propose a teacher-student learning paradigm for generating dense pseudo-labels and then develop two novel strategies for denoising them. Our approach establishes a new state of the art on three challenging benchmarks for semantic correspondence.
We propose ASM-Loc, a novel weakly supervised temporal action localization framework that enables explicit, action-aware segment modeling beyond standard MIL-based methods. We establish a new state of the art on the THUMOS-14 and ActivityNet-v1.3 datasets.
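For intuition, here is a minimal sketch (hypothetical, not the paper's pipeline) of teacher-student pseudo-labeling for correspondence, where a simple cycle-consistency check stands in for the paper's denoising strategies.

```python
# A minimal sketch (hypothetical, not the paper's pipeline) of teacher-student
# pseudo-labeling for semantic correspondence: a teacher's correlation map yields
# dense matches, and a cycle-consistency check filters them before training a student.
import torch

def dense_matches(corr):
    """corr: [N_src, N_tgt] correlation map from a teacher model."""
    return corr.argmax(dim=1)             # best target index for every source location

def cycle_consistent_mask(corr, threshold=0):
    fwd = dense_matches(corr)             # src -> tgt
    bwd = dense_matches(corr.t())         # tgt -> src
    idx = torch.arange(corr.size(0))
    # Keep matches that map back (approximately) to where they started.
    return (bwd[fwd] - idx).abs() <= threshold

# Dummy teacher correlation between 64 source and 64 target locations.
corr = torch.randn(64, 64)
pseudo_labels = dense_matches(corr)
keep = cycle_consistent_mask(corr)
print(f"kept {keep.sum().item()} / {keep.numel()} pseudo-labels for the student")
```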
We propose NeRV, a novel image-wise implicit neural representation that encodes videos in neural networks: it takes a frame index as input and outputs the corresponding RGB image. Compared to pixel-wise implicit representations, NeRV improves encoding speed by 25× to 70× and decoding speed by 38× to 132×. It also shows comparable performance on video compression and denoising tasks.
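Below is a minimal PyTorch sketch of the idea with toy sizes (the positional encoding and block design here are simplified assumptions, not the paper's exact blocks): the network maps a normalized frame index to an RGB frame, so the video is stored in the network weights.

```python
# A minimal PyTorch sketch of the NeRV idea (toy sizes, simplified blocks): the network
# takes a normalized frame index, embeds it with sinusoidal features, and decodes an
# RGB frame, so fitting the network to a video stores that video in its weights.
import math
import torch
import torch.nn as nn

class TinyNeRV(nn.Module):
    def __init__(self, num_freqs=8, h=16, w=16):
        super().__init__()
        self.num_freqs, self.h, self.w = num_freqs, h, w
        self.mlp = nn.Sequential(
            nn.Linear(2 * num_freqs, 256), nn.GELU(),
            nn.Linear(256, 64 * h * w), nn.GELU(),
        )
        # Convolutional upsampling head: (64, h, w) -> (3, 4h, 4w) RGB frame.
        self.head = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=4), nn.GELU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, t):                      # t: [B] normalized frame indices in [0, 1]
        freqs = 2 ** torch.arange(self.num_freqs, device=t.device) * math.pi
        enc = torch.cat([torch.sin(t[:, None] * freqs), torch.cos(t[:, None] * freqs)], dim=1)
        feat = self.mlp(enc).view(-1, 64, self.h, self.w)
        return self.head(feat)                 # [B, 3, 4h, 4w] RGB frames

model = TinyNeRV()
t = torch.tensor([0.0, 0.5, 1.0])              # three frame indices
frames = model(t)
print(frames.shape)                            # torch.Size([3, 3, 64, 64])
```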
We introduce Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner. We apply GTA on both pixels and semantically similar regions to capture temporal relationships at different levels of spatial granularity.
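A minimal PyTorch sketch of the decoupled design (simplified; the head counts, dimensions, and plain MultiheadAttention modules are assumptions, not the paper's exact formulation): spatial attention within each frame, followed by global temporal attention across all frames at each location.

```python
# A minimal PyTorch sketch (simplified, not the paper's exact formulation) of decoupled
# spatial then global temporal attention over a video feature tensor [B, T, N, C],
# where N is the number of spatial locations (pixels or region tokens).
import torch
import torch.nn as nn

class DecoupledSpaceTimeAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: [B, T, N, C]
        b, t, n, c = x.shape
        # Spatial attention within each frame: tokens attend across locations.
        xs = x.reshape(b * t, n, c)
        xs = xs + self.spatial(xs, xs, xs)[0]
        # Global temporal attention: each location attends across all frames.
        xt = xs.reshape(b, t, n, c).permute(0, 2, 1, 3).reshape(b * n, t, c)
        xt = xt + self.temporal(xt, xt, xt)[0]
        return xt.reshape(b, n, t, c).permute(0, 2, 1, 3)   # back to [B, T, N, C]

x = torch.randn(2, 8, 49, 64)                   # 2 clips, 8 frames, 7x7 spatial grid
out = DecoupledSpaceTimeAttention()(x)
print(out.shape)                                # torch.Size([2, 8, 49, 64])
```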
Services
Program Committee/Reviewers: CVPR, ICCV, ECCV, AAAI, NeurIPS, TPAMI
Thanks to Dr. Jon Barron for sharing the source code of his personal page.