The Visual Understanding Team focuses on understanding, generating, and transforming multimedia content via computer vision and natural language processing techniques. We work on sign language translation, image/video captioning, visual dialogue, video grounding, and visual question answering (VQA). We have published 20+ journal articles and conference papers in venues including IEEE TPAMI, IEEE TIP, IEEE TMM, ACM TOMCCAP, CVPR, AAAI, IJCAI, and ACM MM.
This part covers research on sign language recognition, with a focus on continuous sign language translation (CSLT). To improve the recognition accuracy of isolated sign words, some of our early works designed an adaptive hidden Markov model (HMM) framework; these methods fully exploit the intrinsic properties of, and complementary relationships among, the hidden sign states. CSLT faces the challenge of hybrid semantics learning across sequential variations of visual representations, sign linguistics, and textual grammars... [Details]
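As illustrative background only (a generic sketch, not the team's adaptive HMM framework): an HMM-based isolated-sign recognizer scores the observation sequence of a video under each sign word's model and picks the word whose model scores highest, where the per-model score can be computed with the Viterbi algorithm. All states, symbols, and probabilities below are hypothetical toy values.

```python
import math

def viterbi_log_score(obs, start, trans, emit):
    """Log-probability of the best hidden-state path for one HMM.

    obs   : list of discrete observation symbols
    start : start[s]    = log P(first state = s)
    trans : trans[s][t] = log P(next state = t | state = s)
    emit  : emit[s][o]  = log P(observation = o | state = s)
    """
    # Initialize with the first observation.
    scores = {s: start[s] + emit[s][obs[0]] for s in start}
    # Dynamic-programming recursion over the remaining observations.
    for o in obs[1:]:
        scores = {
            t: max(scores[s] + trans[s][t] for s in scores) + emit[t][o]
            for t in trans[next(iter(trans))]
        }
    return max(scores.values())

log = math.log
# Toy two-state HMMs for two hypothetical sign words.
hmm_hello = {
    "start": {"s0": log(0.9), "s1": log(0.1)},
    "trans": {"s0": {"s0": log(0.6), "s1": log(0.4)},
              "s1": {"s0": log(0.1), "s1": log(0.9)}},
    "emit":  {"s0": {"up": log(0.8), "down": log(0.2)},
              "s1": {"up": log(0.3), "down": log(0.7)}},
}
hmm_bye = {
    "start": {"s0": log(0.5), "s1": log(0.5)},
    "trans": {"s0": {"s0": log(0.5), "s1": log(0.5)},
              "s1": {"s0": log(0.5), "s1": log(0.5)}},
    "emit":  {"s0": {"up": log(0.2), "down": log(0.8)},
              "s1": {"up": log(0.1), "down": log(0.9)}},
}

def classify(obs, models):
    """Return the sign word whose HMM gives the highest Viterbi score."""
    return max(models, key=lambda w: viterbi_log_score(
        obs, models[w]["start"], models[w]["trans"], models[w]["emit"]))

models = {"hello": hmm_hello, "bye": hmm_bye}
print(classify(["up", "up", "down"], models))  # → hello
```

In practice the observations are learned visual features rather than discrete symbols, and the adaptive framework adjusts the per-word state configurations; this sketch only shows the scoring-and-argmax recognition pattern.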
This part covers research on visual dialog and video question answering. Visual dialog is a multi-round extension of visual question answering (VQA): the interactions between the image and the multi-round question-answer pairs change progressively, and the relationships among objects in the image are influenced by the current question. The video question answering task aims... [Details]
This part covers research on visual captioning, including image captioning and video captioning. To relax the reliance on paired image-sentence data for training image captioning models, we explore unsupervised captioning with no annotations through two-stage memory mechanisms. A GAN-based method is proposed to explore the implicit semantic correlation between disjoint images and sentences by building a multimodal semantic-aware space... [Details]
This part covers research on visual understanding, focusing on Crowd Counting, Visual Grounding, Video Grounding, and Temporal Action Localization (TAL). Crowd Counting is the task of counting the people in an image. Unlike object detection, Crowd Counting aims to recognize arbitrarily sized targets across diverse situations, handling both sparse and cluttered scenes at the same time... [Details]
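As a hedged illustration of the density-map formulation commonly used in crowd counting (a generic sketch, not the DADNet model itself): each annotated head point is spread into a Gaussian kernel normalized to sum to one, so the sum over the resulting density map recovers the crowd count, and a network is trained to regress this map. The image size, head coordinates, and sigma below are hypothetical.

```python
import math

def density_map(points, h, w, sigma=1.5):
    """Build a ground-truth density map from annotated head points.

    Each point contributes a 2-D Gaussian normalized to sum to 1 over
    the map, so the total sum of the map equals the number of people.
    """
    dmap = [[0.0] * w for _ in range(h)]
    for (py, px) in points:
        # Unnormalized Gaussian centered on the head point.
        kernel = [[math.exp(-((y - py) ** 2 + (x - px) ** 2) / (2 * sigma ** 2))
                   for x in range(w)] for y in range(h)]
        total = sum(map(sum, kernel))
        # Normalize so this person contributes exactly 1 to the map's sum.
        for y in range(h):
            for x in range(w):
                dmap[y][x] += kernel[y][x] / total
    return dmap

heads = [(3, 4), (10, 12), (3, 12)]   # hypothetical (y, x) annotations
dm = density_map(heads, h=16, w=16)
count = sum(map(sum, dm))             # total mass ≈ number of annotated heads
print(round(count, 6))
```

A counting network then predicts such a map from the raw image, and the estimated count is simply the sum of the predicted map, which is what lets a single model handle both sparse and dense crowds.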
Resources
Conference Links: International Conferences on Machine Learning and Artificial Intelligence.
Conference papers:
Jinxing Zhou, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Meng Wang*, and Yiran Zhong*, "Audio-Visual Segmentation", European Conference on Computer Vision (ECCV), 2022. [Code]
Shengeng Tang, Richang Hong*, Dan Guo*, and Meng Wang, "Gloss Semantic-Enhanced Network with Online Back-Translation for Sign Language Production", ACM International Conference on Multimedia (ACM MM), 2022.
Hui Wang, Dan Guo*, Xiansheng Hua, and Meng Wang*, "Pairwise VLAD Interaction Network for Video Question Answering", ACM International Conference on Multimedia (ACM MM), 2021.
Kun Li, Dan Guo*, and Meng Wang*, "Proposal-Free Video Grounding with Contextual Pyramid Network", AAAI Conference on Artificial Intelligence (AAAI), 2021.
Dan Guo, Yang Wang*, Peipei Song*, and Meng Wang, "Recurrent Relational Memory Network for Unsupervised Image Captioning", International Joint Conference on Artificial Intelligence (IJCAI), 2020.
[Link][PDF][BibTeX]
Dan Guo, Hui Wang*, Hanwang Zhang, Zhengjun Zha, and Meng Wang*, "Iterative Context-Aware Graph Inference for Visual Dialog", Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
Fan Peng, Kun Li, Xueliang Liu, and Dan Guo, "AOPNet: Anchor Offset Prediction Network for Temporal Action Proposal Generation", International Conference on Signal Processing, Communications and Computing (ICSPCC), 2020.
Yuling Gui, Dan Guo, and Ye Zhao, "Semantic Enhanced Encoder-Decoder Network (SEN) for Video Captioning", Workshop on Multimedia for Accessible Human Computer Interfaces (MAHCI), 2019.
Xiankun Pei, Dan Guo, and Ye Zhao, "Continuous Sign Language Recognition Based on Pseudo-supervised Learning", Workshop on Multimedia for Accessible Human Computer Interfaces (MAHCI), 2019.
Peipei Song, Dan Guo, Haoran Xin, and Meng Wang, "Parallel Temporal Encoder For Sign Language Translation", IEEE International Conference on Image Processing (ICIP), 2019.
[Link][PDF][BibTeX]
Dan Guo, Kun Li*, and Meng Wang, "DADNet: Dilated-Attention-Deformable ConvNet for Crowd Counting", ACM International Conference on Multimedia (ACM MM), 2019.
Dan Guo, Shengeng Tang, and Meng Wang, "Connectionist Temporal Modeling of Video and Language: A Joint Model for Translation and Sign Labeling", International Joint Conference on Artificial Intelligence (IJCAI), 2019.
[Link][PDF][BibTeX]
Dan Guo, Shuo Wang, Qi Tian, and Meng Wang, "Dense Temporal Convolution Network for Sign Language Translation", International Joint Conference on Artificial Intelligence (IJCAI), 2019.
[Link][PDF][BibTeX]
Dan Guo, Hui Wang, and Meng Wang, "Dual Visual Attention Network for Visual Dialog", International Joint Conference on Artificial Intelligence (IJCAI), 2019.
Shuo Wang, Dan Guo*, Wengang Zhou, Zhengjun Zha, and Meng Wang, "Connectionist Temporal Fusion for Sign Language Translation", ACM International Conference on Multimedia (ACM MM), 2018.
[Link][PDF][BibTeX]
Dan Guo, Wengang Zhou, Houqiang Li, and Meng Wang, "Hierarchical LSTM for Sign Language Translation", AAAI Conference on Artificial Intelligence (AAAI), 2018.
[Link][PDF][BibTeX]
Dan Guo, Wengang Zhou, Houqiang Li, and Meng Wang, "Sign Language Recognition Based on Adaptive HMMs with Data Augmentation", IEEE International Conference on Image Processing (ICIP), 2016.
[Link][PDF][BibTeX]
Journal papers:
Kun Li, Jiaxiu Li, Dan Guo*, Xun Yang*, and Meng Wang, "Transformer-Based Visual Grounding with Cross-Modality Interaction", ACM Transactions on Multimedia Computing Communications and Applications (TOMCCAP), 2023. [Link]
Peipei Song, Dan Guo*, Jun Cheng, and Meng Wang*, "Contextual Attention Network for Emotional Video Captioning", IEEE Transactions on Multimedia (TMM), 2022.
Peipei Song, Dan Guo*, Jinxing Zhou, Mingliang Xu, and Meng Wang*, "Memorial GAN with Joint Semantic Optimization for Unpaired Image Captioning", IEEE Transactions on Cybernetics (TCYB), 2022.
Dan Guo, Hui Wang, and Meng Wang, "Context-Aware Graph Inference with Knowledge Distillation for Visual Dialog", IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021.[Link]
Shengeng Tang, Dan Guo*, Richang Hong*, and Meng Wang, "Graph-Based Multimodal Sequential Embedding for Sign Language Translation", IEEE Transactions on Multimedia (TMM), 2021.[Link][PDF][BibTeX]
Dan Guo, Hui Wang, Shuhui Wang, and Meng Wang*, "Textual-Visual Reference-Aware Attention Network for Visual Dialog", IEEE Transactions on Image Processing (TIP), 2020.
Dan Guo, Wengang Zhou*, Anyang Li, Houqiang Li, and Meng Wang*, "Hierarchical Recurrent Deep Fusion Using Adaptive Clip Summarization for Sign Language Translation", IEEE Transactions on Image Processing (TIP), 2020.
[Link][PDF][BibTeX]
Shuo Wang, Dan Guo*, Xin Xu, Li Zhuo, and Meng Wang, "Cross-Modality Retrieval by Joint Correlation Learning", ACM Transactions on Multimedia Computing Communications and Applications (TOMCCAP), 2019.
[Link][PDF][BibTeX]
Dan Guo, Wengang Zhou*, Houqiang Li*, and Meng Wang*, "Online Early-Late Fusion Based on Adaptive HMM for Sign Language Recognition", ACM Transactions on Multimedia Computing Communications and Applications (TOMCCAP), 2018.
[Link][PDF][BibTeX]
Dan Guo, Shengeng Tang, Richang Hong, and Meng Wang, "Review of Sign Language Recognition, Translation and Generation", Computer Science, 2021.[Link][PDF][BibTeX]
Chengxin Xiong, Dan Guo, and Xueliang Liu, "Temporal Proposal Optimization for Temporal Action Detection", Journal of Image and Graphics, 2020.[Link]
Zhihong Lu, Dan Guo*, and Meng Wang, "Motion-compensated Frame Interpolation Based on Weighted Motion Estimation and Vector Segmentation", Acta Automatica Sinica, 2015.[Link]