<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by OpenGVLab on Medium]]></title>
        <description><![CDATA[Stories by OpenGVLab on Medium]]></description>
        <link>https://medium.com/@opengvlab?source=rss-6247e25c46e0------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*Fb7ddCAlK0lnariMuoFtCQ.jpeg</url>
            <title>Stories by OpenGVLab on Medium</title>
            <link>https://medium.com/@opengvlab?source=rss-6247e25c46e0------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 05 Apr 2026 22:27:23 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@opengvlab/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Open-source vision-language model now comparable to GPT-4V]]></title>
            <link>https://ai.gopubby.com/open-source-vision-language-model-now-comparable-to-gpt-4v-33afe9587edd?source=rss-6247e25c46e0------2</link>
            <guid isPermaLink="false">https://medium.com/p/33afe9587edd</guid>
            <category><![CDATA[vision-language-model]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[gpt-4]]></category>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[multimodal-learning]]></category>
            <dc:creator><![CDATA[OpenGVLab]]></dc:creator>
            <pubDate>Thu, 02 May 2024 14:05:32 GMT</pubDate>
            <atom:updated>2024-05-02T14:05:32.737Z</atom:updated>
            <content:encoded><![CDATA[<p><strong>InternVL 1.5 vs. GPT-4V. Here are some real cases.</strong></p><p>Research and development in multimodal large models are advancing rapidly, but despite significant progress, there has always been a gap in capabilities between open-source and commercial models. The OpenGVLab team from Shanghai AI Laboratory, in partnership with Tsinghua University and SenseTime, recently released a new open-source multimodal large language model project called InternVL 1.5. This project aims to challenge the dominance of commercial model giants like GPT-4V and raises the question of how far the power of open source can go.</p><p>Today, InternVL is the top trending model on Hugging Face for both image feature extraction and VQA.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*LYCr_44kreB1OcC0NCABpw.png" /><figcaption>screenshot on 2nd May 2024 <a href="https://huggingface.co/models?pipeline_tag=image-feature-extraction&amp;sort=trending">https://huggingface.co/models?pipeline_tag=image-feature-extraction&amp;sort=trending</a></figcaption></figure><p>As the latest generation of open-source vision-language models (VLMs), also known as multimodal large language models (MLLMs), InternVL 1.5 achieves performance parity with top commercial models like GPT-4V and demonstrates outstanding technical advantages in the open-source domain. The power of the open-source community is once again affirmed, driving not only technological advancement but also the construction of a tech ecosystem that emphasizes sharing, cooperation, and innovation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/644/0*YJom6L8uKi8f4B69" /></figure><h3>Technical report:</h3><p><a href="https://arxiv.org/pdf/2404.16821">https://arxiv.org/pdf/2404.16821</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*dSLHcxTPvtNMbGC8" /></figure><h3>Demo (Try it!)</h3><p><a href="https://internvl.opengvlab.com">https://huggingface.co/spaces/OpenGVLab/InternVL</a></p><h3>Code:</h3><p><a href="https://github.com/OpenGVLab/InternVL">https://github.com/OpenGVLab/InternVL</a></p><h3>3 Highlights of InternVL 1.5</h3><h4><strong>Powerful Visual Encoder</strong></h4><p>InternVL 1.5, through its unique visual encoder, InternViT-6B, employs a continuous learning strategy to greatly enhance the depth and breadth of visual understanding. This strategy enables InternViT-6B to achieve seamless transfer and reuse among various large language models, strengthening the model’s ability to parse complex visual content and exhibit more precise recognition and interpretation capabilities in image-intensive tasks.</p><h4><strong>Dynamic High Resolution</strong></h4><p>InternVL 1.5 introduces a brand-new dynamic high-resolution strategy for image processing. This feature supports input resolutions of up to 4K, optimizing the presentation of image details and improving the model’s expressiveness and accuracy on high-resolution images, while also ensuring efficient computation. This significantly enhances overall image-processing performance.</p><h4><strong>High-Quality Bilingual Dataset</strong></h4><p>InternVL 1.5 integrates a wide range of high-quality bilingual datasets covering English and Chinese, significantly improving the model’s operational flexibility and accuracy in multilingual environments. In addition, through a data translation pipeline powered by open-source large language models, InternVL 1.5 can automatically expand to more languages, showing tremendous potential in global applications.</p>
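<p>For readers who want to try the model outside the hosted demo, it can also be loaded directly from Hugging Face. The sketch below is only a rough outline: the model id follows the release naming, but the image preprocessing helpers and the chat interface live in the checkpoint’s own remote code, so the commented call is an assumption to verify against the model card rather than a documented transformers API.</p><pre>import torch
from transformers import AutoModel, AutoTokenizer

# InternVL 1.5 ships its modeling code with the checkpoint, hence trust_remote_code=True.
path = "OpenGVLab/InternVL-Chat-V1-5"
model = AutoModel.from_pretrained(path, torch_dtype=torch.bfloat16,
                                  trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# pixel_values: an image tensor produced by the loader shown in the model card
# (hypothetical placeholder here); the chat() helper is defined by the remote code,
# so check its exact signature in the repository before running it.
# response = model.chat(tokenizer, pixel_values, "Describe this image.",
#                       dict(max_new_tokens=256))
</pre>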
<h3><strong>InternVL 1.5 vs. GPT-4V</strong></h3><h3>Benchmark Evaluation</h3><p>To evaluate InternVL 1.5’s performance, the research team conducted extensive evaluations on 18 multimodal benchmarks. These benchmarks cover various aspects, including OCR-related tasks, general multimodal QA, mathematics, and multi-turn dialogue. InternVL 1.5 performs well on multiple key dimensions, narrowing the gap between open-source models and GPT-4V, especially reaching the SOTA level in tasks such as OCR, MMB, SEED, and Math.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*OldFxptDLAq1bVek" /></figure><h3>Evaluation on real cases</h3><h4><strong>General QA</strong></h4><p>InternVL 1.5 was tested with everyday user questions to evaluate its performance in general question answering. In the first example, InternVL 1.5 demonstrated a good understanding of the image and correctly explained the movement in the image. It also distinguished the positioning of two individuals, slightly outperforming GPT-4V in detail. In the second example, InternVL 1.5 identified the action as imitating Bruce Lee’s posture, while GPT-4V failed to answer this question.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Of1YPP-ffaY_mIqD" /></figure><h4><strong>OCR-related QA</strong></h4><p>We compare the OCR abilities of InternVL 1.5 and GPT-4V on two small cases. The first test focuses on the models’ understanding of Chinese scenes. The results show that GPT-4V performs poorly in extracting all useful information from the image, while InternVL 1.5 can more accurately identify and parse details in the image. The second test focuses on graph understanding. GPT-4V and InternVL 1.5 can effectively parse graph data and structure, identifying the highest and lowest years, but GPT-4V fails to answer the final difference accurately. In contrast, InternVL 1.5 accurately answers the difference as “1580 billion RMB”.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*RmdZcgCfCpK72KL3" /></figure><h4><strong>Scientific Understanding QA</strong></h4><p>Multimodal large language models generally perform poorly in complex scenarios involving domain-specific knowledge and logical reasoning. However, in the first question shown in the figure below, the InternVL 1.5 model not only accurately analyzes the elements in the image but also provides precise answers, while GPT-4V demonstrates its unique insights in inferring trends in amino acid transport. In the second example, both models can accurately answer and provide in-depth analysis from the perspective of aerodynamics, demonstrating their efficient capabilities in handling scientific problems. These achievements also demonstrate the comparability of InternVL 1.5 and GPT-4V in scientific understanding and reasoning abilities.</p><h4><strong>Object Localization</strong></h4><p>Accurate object localization is crucial in multimodal tasks. In the following tests, the InternVL 1.5 model performs exceptionally well. It not only quickly and accurately identifies the specific position of the basketball player but also describes the posture of the corresponding player in detail.
It can also accurately interpret dynamic interactions in complex scenes, such as correctly identifying the man’s pointing gesture and inferring, from the actual scene, what he wants to convey.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*hscpyHTzxjNIqmfe" /></figure><h3>Conclusion</h3><p>InternVL 1.5 is not just a project but also an important contribution of the open-source community to high-level multimodal artificial intelligence technology. In the commercial field, multimodal models like GPT-4V have become industry benchmarks. In this context, InternVL 1.5 demonstrates that the open-source community can not only keep pace with the commercial field but also lead in some aspects.</p><p>InternVL 1.5 significantly improves the performance of open-source multimodal models and narrows the gap with commercial models. This open-source model is expected to drive the development of the multimodal field and deserves further exploration and optimization.</p><p>As a completely open-source project, InternVL 1.5 encourages everyone interested to participate and jointly promote the future development of AI. By visiting the online demo of InternVL 1.5 (https://internvl.opengvlab.com), researchers and developers worldwide can now experience the charm of this technology and participate in this exciting open-source project.</p><h3>Demo (Try it!)</h3><p><a href="https://internvl.opengvlab.com">https://huggingface.co/spaces/OpenGVLab/InternVL</a></p><blockquote>Open-source link: <a href="https://github.com/OpenGVLab/InternVL">https://github.com/OpenGVLab/InternVL</a></blockquote><blockquote>Follow us: <a href="https://twitter.com/opengvlab">https://twitter.com/opengvlab</a></blockquote><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=33afe9587edd" width="1" height="1" alt=""><hr><p><a href="https://ai.gopubby.com/open-source-vision-language-model-now-comparable-to-gpt-4v-33afe9587edd">Open-source vision-language model now comparable to GPT-4V</a> was originally published in <a href="https://ai.gopubby.com">AI Advances</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Does Your Multi-model LLM Truly See The Diagrams In Visual Math Problems?]]></title>
            <link>https://ai.gopubby.com/does-your-multi-model-llm-truly-see-the-diagrams-in-visual-math-problems-8c3b3fc5d832?source=rss-6247e25c46e0------2</link>
            <guid isPermaLink="false">https://medium.com/p/8c3b3fc5d832</guid>
            <category><![CDATA[math]]></category>
            <category><![CDATA[multimodal]]></category>
            <category><![CDATA[100-followers]]></category>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[OpenGVLab]]></dc:creator>
            <pubDate>Wed, 17 Apr 2024 13:05:44 GMT</pubDate>
            <atom:updated>2024-04-17T13:05:44.252Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*pEFovM6G3VzInSJ4LGMuJQ.jpeg" /></figure><h3><strong>1. Background</strong></h3><p>With the substantial advances in big data and computational power, large language models (LLMs) have emerged as a central point of interest in industry and academia. To broaden their applicability across diverse contexts, multimodal LLMs have recently become a fast-evolving field.</p><p>Concurrently, various evaluation benchmarks are curated to assess their visual comprehension performance across different domains. However, most of them only contain images from daily life. To deeply assess the multimodal logical thinking prowess of MLLMs, the capability to solve mathematical problems involving diagrams is a critical measure.</p><p>Nevertheless, an appropriate mathematical benchmark for MLLMs is still missing. Previous efforts in the field, e.g., GeoQA, MathVista, and MMMU, suffer from several issues under our analysis. Therefore, we present MathVerse, <strong>a holistic and specialized visual math benchmark crafted to evaluate the multimodal mathematical reasoning skills of MLLMs</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/865/1*7BDi87sesIZFTDEH4XeKZA.png" /></figure><p>Paper: <a href="https://arxiv.org/pdf/2403.14624.pdf">https://arxiv.org/pdf/2403.14624.pdf</a></p><p>Project: <a href="https://mathverse-cuhk.github.io/">https://mathverse-cuhk.github.io/</a></p><p>Code: <a href="https://github.com/ZrrSkywalker/MathVerse">https://github.com/ZrrSkywalker/MathVerse</a></p><p>Dataset: <a href="https://huggingface.co/datasets/AI4Math/MathVerse">https://huggingface.co/datasets/AI4Math/MathVerse</a></p><p><strong>PS: The work ranks 1st on Hugging Face Daily Papers🔥🔥🔥. It is also trending on X with over 10K views.</strong></p><h3><strong>2. Key Observation</strong></h3><p>Through our comprehensive observation and analysis, we identify three primary issues in current mathematical benchmarks for evaluating MLLMs:</p><h4>(1) <strong>Do MLLMs truly see the math diagrams in evaluation?</strong></h4><p>This is the most fundamental question concerning the assessment of visual math problem-solving. Figure 1 (a) showcases three examples from current benchmarks. We observe that their texts contain too much duplicate information (highlighted in red) that is also depicted in the diagram. This redundancy may provide MLLMs with a shortcut to resolve the problem by mostly reading the text rather than interpreting the diagram. Our hypothesis gains support from the experiment in Figure 1 (b). For 40 randomly sampled problems from each benchmark, we remove redundant texts from the question, challenging MLLMs to capture the corresponding information exclusively from visual inputs.</p><p><strong>The results reveal a significant drop in accuracy among most MLLMs (the blue column), even falling below the scores without taking diagrams as input (the grey column).</strong></p><p>This outcome suggests that <strong>MLLMs primarily depend on textual cues rather than the visual diagrams themselves to solve these problems in evaluation</strong>.
Given this, we argue that current visual math benchmarks are not sufficient to assess the genuine multimodal mathematical reasoning capabilities of MLLMs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fpC-yqFljNnOjMjvXq0yAg.png" /><figcaption>Figure 1: (a) We showcase three examples of Text Redundancy (highlighted in red) within existing visual math benchmarks. (b) We report an ablation study by respectively removing the redundant texts and input diagrams on 120 randomly selected problems, for closed-sourced and open-sourced MLLMs</figcaption></figure><h4><strong>(2)</strong> <strong>Is it equitable to assess solely by the final answer?</strong></h4><p>Most existing multimodal benchmarks directly compare model outputs with ground truths to derive a binary evaluation result. While this approach may suffice for general visual contexts, it falls short for math problems that require intricate step-by-step reasoning. In Figure 2, we examine three model outputs. <strong>Although they all arrive at incorrect answers in the end, they demonstrate varying levels of precision in the intermediate reasoning processes. Merely categorizing these outputs as Incorrect fails to capture the nuanced differences in the reasoning quality of MLLMs.</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Nt1sxkvd_75DxJY86Etu7A.png" /><figcaption>Figure 2. Comparison of Visual Mathematical Reasoning by Three MLLMs. Despite the incorrect final answer, GPT-4V, Gemini-Pro, and SPHINX-MoE exhibit different levels of quality in the intermediate reasoning process.</figcaption></figure><h4>(3) <strong>Do they specialize in mathematical reasoning evaluation?</strong></h4><p>GeoQA narrowly targets specific aspects of plane geometry. This limits the evaluation of broader mathematical capabilities, e.g., functions and solid geometry. In contrast, MathVista expands its scope by including a wide array of peripheral tasks (19 out of 28), encompassing natural images, statistical plots, and charts, which do not directly evaluate professional math skills. Furthermore, the math problems in MMMU are of college-level complexity with extensive domain-specific knowledge, potentially hindering MLLMs from fully demonstrating their reasoning capacity.</p><h3><strong>3. MathVerse Benchmark</strong></h3><h4><strong>(1)</strong> <strong>Data Composition and Categorization</strong></h4><p>MathVerse comprises 2,612 visual math problems, which yield about 15K test samples in total. These test samples cover 3 fundamental math subjects: plane geometry, solid geometry, and functions, as well as 12 fine-grained categories. With meticulous annotation and review, the high-quality data in MathVerse provides a robust and comprehensive benchmark for evaluating the capability of multimodal mathematical reasoning.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*BpPejXD_DxmMfDXHp5gxZw.png" /></figure><h4><strong>(2)</strong> <strong>How can the MLLMs’ capability of mathematical image understanding be assessed?</strong></h4><p>We first define three distinct categories for the textual information within the questions:</p><p>1) <strong>Descriptive Information </strong>refers to the diagram’s directly observable and clearly portrayed content. It depicts the elemental figure composition, spatial arrangement, and annotated entities, such as the presence of geometric shapes or intersection points of functions.
This information duplicates the visual components in the diagram and is thus regarded as redundant for problem-solving.</p><p>2) <strong>Implicit Property </strong>involves information that requires a higher level of visual perception but less mathematical knowledge to discern from the diagram. It signifies solid visual conditions for problem-solving, such as parallelism and perpendicularity between lines, similarity and congruence among triangles, and the category and periodicity of functions.</p><p>3) <strong>Essential Condition </strong>denotes the specific numerical or algebraic measurements, which are indispensable for deriving the solution and cannot be read from the diagram. This category encompasses precise values of angles, lengths, and function expressions.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DRkwBGGxXVUr4PBggtSZGA.png" /><figcaption>Figure 3: Three Categories of Question Texts in MATHVERSE. According to the significance for problem-solving, we categorize the question texts into three categories, and transform each problem into six versions for evaluation, with varying content in multi-modality. We present three examples in MATHVERSE for illustration</figcaption></figure><p>Based on the three categories, expert annotators systematically remove different textual information within questions and incrementally incorporate the critical elements into diagrams, resulting in six <strong>different question versions</strong>. <strong>This approach can progressively reduce textual redundancy and information content, thereby increasingly compelling MLLMs to capture mathematical conditions from the visual input.</strong> With this curated problem set, we can provide a holistic evaluation of the genuine visual comprehension of MLLMs and whether it can facilitate multimodal mathematical reasoning.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_5IvcEhyubgISwo5v1ecpg.png" /><figcaption>Figure 5: Six Versions of Each Problem in MATHVERSE. Expert annotators meticulously transform each visual math problem within MATHVERSE into six versions. They contain different vision language content for a holistic visual mathematical evaluation</figcaption></figure><h4><strong>(3)</strong> <strong>How can the intermediate solution steps of MLLMs be evaluated in a fine-grained manner?</strong></h4><p>Compared to visual question answering in general scenarios, solving mathematical problems with MLLMs requires nuanced, <strong>step-by-step</strong> <strong>CoT reasoning</strong>. To this end, we propose a CoT evaluation strategy to thoroughly assess their mathematical CoT skills in visual contexts. It involves two prompting phases with GPT-4 and GPT-4V: Key-step Extraction and Multi-step Scoring, as shown in the figure below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/929/1*n5tqtsnvaK5e8fUGkNYk8Q.png" /><figcaption>Figure 6: Examples of the CoT Evaluation Strategy for MATHVERSE. We present two outputs from Qwen-VL-Max with our CoT evaluation strategy, which assesses the fine-grained reasoning capabilities with a detailed explanation for error analysis</figcaption></figure><p>This evaluation strategy does not focus solely on the correctness of the final answer but <strong>emphasizes the logical coherence and depth of reasoning during the problem-solving process</strong>. Through this approach, we can more accurately reveal the true capabilities of MLLMs in solving complex mathematical problems, especially in how they construct problem-solving solutions step by step. This is crucial for understanding MLLMs’ thinking patterns, reasoning abilities, and how they process and integrate visual and mathematical information.</p>
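<p>To make the two phases concrete, the sketch below scripts one possible version of such a judge loop with a text-only LLM. The model name, prompts, and per-step scoring are illustrative assumptions rather than MathVerse’s released templates (the article notes GPT-4V is also used when visual context is needed); the official evaluation code is in the project repository.</p><pre>from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",  # judge model (illustrative choice)
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def cot_score(question: str, model_output: str) -> float:
    # Phase 1: key-step extraction from the model's free-form solution.
    steps_text = ask(
        "Extract the key solution steps from the answer below, one step per line.\n"
        f"Question: {question}\nAnswer: {model_output}"
    )
    steps = [s.strip() for s in steps_text.splitlines() if s.strip()]
    # Phase 2: multi-step scoring, one verdict (1 correct / 0 incorrect) per step.
    marks = []
    for step in steps:
        verdict = ask(
            f"Question: {question}\nProposed step: {step}\n"
            "Reply with a single character: 1 if the step is correct, 0 otherwise."
        )
        marks.append(1.0 if verdict.strip().startswith("1") else 0.0)
    return sum(marks) / max(len(marks), 1)
</pre>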
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*IHLU_nJPQTFFwFmIMo1uEQ.png" /></figure><h3><strong>4. Experiments and Conclusion</strong></h3><p>As shown in the table below, we conducted experiments with 17 MLLMs on MathVerse. ‘CoT-E’ and ‘w/o’ denote results with and without the proposed CoT evaluation strategy.</p><p>Based on the evaluation results, we come to the following conclusions:</p><p>(1) MLLMs rely more on textual information than on the mathematical images themselves for problem-solving.</p><p>(2) Apart from GPT-4V and ShareGPT4V, <strong>most MLLMs can achieve higher scores solely through text, without the input of images.</strong> This proves that low-quality visual encoding currently plays a negative role in problem-solving.</p><p>(3) MLLMs struggle to accurately interpret the primary conditions and questions from images.</p><p>(4) Closed-source models have better multimodal mathematical problem-solving capabilities than open-source models.</p><p>(5) Comparing G-LLaVA and LLaVA-1.5 shows that fine-tuning models with mathematical training data can enhance specific problem-solving abilities but also reduce their generalization capabilities.</p><p>(6) Compared to binary evaluation, the Chain-of-Thought (CoT) evaluation can more comprehensively reflect the model’s logical reasoning ability.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8c3b3fc5d832" width="1" height="1" alt=""><hr><p><a href="https://ai.gopubby.com/does-your-multi-model-llm-truly-see-the-diagrams-in-visual-math-problems-8c3b3fc5d832">Does Your Multi-model LLM Truly See The Diagrams In Visual Math Problems?</a> was originally published in <a href="https://ai.gopubby.com">AI Advances</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[InternVid: Video-Text Dataset to Empowering Video Creation and Understanding]]></title>
            <link>https://ai.gopubby.com/internvid-video-text-dataset-to-empowering-video-creation-and-understanding-19dd83749d13?source=rss-6247e25c46e0------2</link>
            <guid isPermaLink="false">https://medium.com/p/19dd83749d13</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[dataset]]></category>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[video-understanding]]></category>
            <category><![CDATA[100-followers]]></category>
            <dc:creator><![CDATA[OpenGVLab]]></dc:creator>
            <pubDate>Wed, 03 Apr 2024 12:31:04 GMT</pubDate>
            <atom:updated>2024-04-03T12:31:04.877Z</atom:updated>
            <content:encoded><![CDATA[<p>InternVid is a large-scale video-text dataset designed to advance video understanding and generation tasks. It comprises <strong>over 7 million videos totaling nearly 760,000 hours</strong>, accompanied by detailed captions. The release of InternVid is expected to propel progress in text-video multimodal understanding and video generation, offering new opportunities for related research and applications. Key features include:</p><ul><li><strong>Large Scale:</strong> InternVid is one of the largest publicly available video-text datasets, containing more than 7 million videos with a total duration of nearly 760,000 hours.</li><li><strong>Diverse Content:</strong> Video content spans various domains, including everyday life, sports, entertainment, education, and more, catering to the needs of different research and applications.</li><li><strong>High Quality:</strong> Both the videos and texts have been carefully selected and processed to ensure the high quality of the dataset, which provides rich captions, video-text similarity scores, and video aesthetics scores.</li></ul><h3>The Starting Point of InternVid</h3><p>Learning transferable video-text representations is crucial for video understanding, especially in practical applications such as autonomous driving, intelligent surveillance, and visual search. OpenAI’s release of the Sora model has recently marked significant progress in text-to-video generation. Sora not only breaks through the limitations of video coherence but also maintains consistency across multi-angle camera transitions and demonstrates a profound understanding of the logic of the real world. This breakthrough opens new possibilities for multimodal contrastive learning in the video-language domain.</p><p>Although Sora is not yet available for public use, its achievements in video generation, akin to GPT-3 in the text domain, suggest that the realization of general artificial intelligence may arrive sooner than anticipated. The development of InternVid is set against this backdrop, aiming to advance video understanding and generation by providing a large-scale dataset that can facilitate the training and evaluation of models capable of handling the complexities of real-world video-text interactions.</p><p>However, a key factor currently limiting exploration is the lack of high-quality video-language datasets for large-scale pre-training. Current research relies on datasets such as HowTo100M [1], HD-VILA [2], and YT-Temporal [3, 4], where the text is generated using Automatic Speech Recognition (ASR). Despite the vast scale of these datasets, the semantic relevance between the videos and their corresponding text descriptions is often low. Such data is not well-suited for generation tasks like text-to-video synthesis; conversely, enhancing this relevance (for example, by aligning videos with captions that better match them) would significantly benefit downstream tasks such as video retrieval and video question answering.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*k195Lwx_0BhKUmz-" /></figure><p>To address <strong>the challenge of scaling up video-language modeling</strong> while maintaining high video-text correspondence, we propose a large-scale video-centric dataset, InternVid, as shown in Figure 1. ASR transcripts barely describe the visual elements in the videos, whereas generated captions cover more visual content.
This dataset includes highly relevant video-text pairs, comprising over 7 million videos totaling <strong>760,000 hours, yielding 234 million video clips that cover 16 scenes and approximately 6,000 actions</strong>. We employ a multi-scale approach to generate captions to enhance the video-text match. On a coarse scale, we describe the middle frame of each video. On a fine scale, we generate frame-by-frame captions and summarize them with a language model.</p><p>Based on InternVid, we trained a video representation model called ViCLIP, which achieves strong zero-shot performance. For text-to-video generation, we have curated an aesthetic subset of InternVid, named <strong>InternVid-Aes, covering 18 million video clips</strong>. Together with WebVid-10M [5], InternVid can significantly enhance the generative capabilities of diffusion-based video generation models.</p><h3>II. Dataset Collection</h3><p>We have identified three critical factors for building this dataset: <strong>significant temporal information, rich and diverse semantics, and video-text correlation</strong>. We collected raw videos based on query terms for actions and activities to ensure high temporality. To enrich and diversify semantics, we crawled popular videos from various categories and deliberately increased the proportion of data from <strong>different countries and languages.</strong> To enhance video-text relevance, we used image description and language models to generate video descriptions from annotations of specific frames. Next, we will detail the dataset construction process and discuss its statistics and characteristics.</p><p><strong>Diversity of the Dataset.</strong> We crawled popular videos from <strong>16 popular categories</strong>, each contributing a share of the data. To ensure diversity, we selected videos from countries with different languages rather than relying on a dominant language environment. The countries sampled include the United Kingdom, the United States, Australia, Japan, South Korea, China, Russia, and France. In terms of duration, each video averages 351.9 seconds in length. Half (49%) of the videos are at most five minutes, while a quarter (26%) of the videos range from five to ten minutes. 8% of the videos are at most 20 minutes in duration. Among the curated videos, <strong>85% are of high resolution (720P),</strong> with the remaining 15% varying in resolution from 360P to 720P. Although videos of lower resolution may not perform as well as high-resolution videos in content generation tasks, they can still be used for video-language representation learning as long as appropriate captions accompany them.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*JAbOzF8OsN3UxHEn" /></figure><p>InternVid demonstrates diversity at the segmented clip level with varying clip durations and caption lengths. Aesthetic scores and video-text similarities are evenly distributed. Most clips are between 0 and 10 seconds long, accounting for 85% of all clips. Approximately half of the clip captions contain 10 to 20 words, while a third contain fewer than 10.
About 11% of the clips have long captions with more than 20 words.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*scqF_TaYB9P-CIhj" /></figure><p><strong>Original Data Collection.</strong> We collected videos from YouTube, considering the diversity and richness of its data and its availability for academic use. We obtained <strong>7 million YouTube video links</strong> with an average duration of 6.4 minutes. To ensure the uniqueness of our dataset, we created a database of YouTube IDs and excluded any videos already available in publicly available datasets before April 2023. On the one hand, we selected popular channels and their corresponding trending or highly-rated videos, obtaining 2 million videos from categories such as news and gaming. On the other hand, we created a list of verbs related to actions/activities. Using this list, we obtained another 5.1 million videos, selecting those at the top of the search results.</p><p><strong>Video Retrieval Keywords.</strong> We defined approximately <strong>6.1K action phrases</strong> from ATUS [6], public video datasets, and text corpora. These were then <strong>refined using a language model and manual review</strong>. We utilized ATUS actions from 2017 to 2022, merging them and removing duplicates. For reference to public video data, we leveraged datasets such as Kinetics [7], SomethingSomething [8,9], UCF101 [10], etc. This provided us with 1103 action labels. Additionally, we accessed several grounding datasets. We used language models to extract actions and their corresponding objects (where applicable) from the corpora, forming phrases, of which 5001 actions passed manual inspection. In total, we collected <strong>6104 video query terms for searching videos online</strong>.</p><p><strong>Collection Strategy.</strong> To ensure the quality of our dataset, we established specific crawling rules. We only collected videos between <strong>10 seconds and 30 minutes</strong>, with resolutions ranging from <strong>360P to 720P.</strong> In this process, we prioritized the highest available resolution. We collected videos along with their <strong>audio, captions, and titles to provide a comprehensive multimodal dataset.</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*XJ7wgyZWC-ECXChC" /></figure><p><strong>Multi-scale Video Caption Generation. </strong>To generate scalable, rich, and diverse video captions, we employed a multi-scale approach encompassing two distinct captioning strategies, as illustrated in Figure 4. At a finer scale, we streamlined the video captioning process by concentrating on common objects, actions, and scenes within video clips. We intentionally overlooked intricate details, such as nuanced facial expressions, movements, and other subtle elements. At a coarser scale, captions were crafted solely for the central frames of the videos. Considering our focus on short video clips filtered through scene segmentation (around 10 seconds), most clips predominantly feature consistent objects without significant visual changes. This approach circumvents the need to address identity preservation issues from an image perspective when handling videos. Technically, we utilized the lightweight image captioning model <strong>Tag2Text</strong> [11] <strong>for finer-scale captioning, which describes videos on a per-frame basis at low frames per second (fps)</strong>. For the coarser scale, we leveraged <strong>BLIP2</strong> [12] <strong>to caption the central frames of the clips</strong>. These individual frame captions were then <strong>aggregated into an integrated video caption using pre-trained language models</strong> [13,14].</p>
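<p>As a rough illustration of the coarse-scale step only (not the production pipeline, which also runs Tag2Text per frame and an LLM-based summarizer), the sketch below grabs a clip’s middle frame with OpenCV and captions it with BLIP-2 via the transformers library; model choice and generation settings are illustrative.</p><pre>import cv2
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

def caption_middle_frame(video_path: str) -> str:
    # Seek to the central frame of the clip, mirroring the coarse-scale strategy.
    cap = cv2.VideoCapture(video_path)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.set(cv2.CAP_PROP_POS_FRAMES, max(n_frames // 2, 0))
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError(f"could not read a frame from {video_path}")
    image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    # Caption the frame with BLIP-2.
    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True).strip()
</pre>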
<h3><strong>III. Multimodal Representation Model ViCLIP</strong></h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*FMnU9fZW6qPuYKx_" /></figure><p>Based on InternVid and CLIP[15], we propose <strong>ViCLIP</strong>, a video-text contrastive learning model based on ViT-L. This model adopts video masking and contrastive loss methods to learn transferable video-text representations, as shown in Figure 5. By introducing DeepSpeed and FlashAttention, ViCLIP was trained for 3 days on 64 NVIDIA A100 GPUs.</p><p>We train ViCLIP on five subsets of InternVid and evaluate its performance on multiple popular video-related benchmarks, including zero-shot and thoroughly fine-tuned settings. These five subsets are the randomly sampled InternVid-10M, InternVid-50M, and InternVid-200M, plus InternVid-10M-DIV (built on the assumption that the diversity of training video clips is more critical than their quantity) and InternVid-10M-FLT (built on the assumption that video-text match is more important). We compare InternVid-10M and InternVid-10M-DIV/-FLT with WebVid10M, while using InternVid-50M and InternVid-200M to further examine how video-language contrastive learning scales with data. The specific performance tables can be referred to in the original text, and Figure 6 lists the performance of the retrieval task.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*JCrNfOhaQrT6Ibo1" /></figure><h3><strong>IV. Exploration of Text-to-Video Generation</strong></h3><p>Additionally, we provide a subset filtered for specific video-text relationships and visual aesthetics, InternVid-Aes, which aids in generating high-resolution, watermark-free videos. Using InternVid-Aes can significantly enhance the visual and quantitative results of a simple text-to-video baseline model. These resources offer a powerful tool for researchers and practitioners interested in multimodal video understanding and generation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/995/0*uDJUvtOOpZ1ys9Hn" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*1TWbGJP8_iFpzb06" /></figure><p>To quantitatively evaluate our model, we conducted zero-shot text-to-video experiments. We randomly selected 2,020 videos from the UCF-101 dataset and 2,990 videos from the MSRVTT dataset, and measured the CLIPSIM, IS, FID, and FVD metrics, reported in Figure 8, which show significant improvements in these zero-shot settings over comparable previous models.</p><h3><strong>V. Download</strong></h3><p>The InternVid dataset can be accessed through the following:</p><p>🤗 Hugging Face 🤗</p><ul><li><strong>Dataset meta info:</strong> <a href="https://huggingface.co/datasets/OpenGVLab/InternVid">https://huggingface.co/datasets/OpenGVLab/InternVid</a></li><li><strong>ViCLIP weights:</strong> <a href="https://huggingface.co/OpenGVLab/ViCLIP">https://huggingface.co/OpenGVLab/ViCLIP</a></li></ul><p>Currently, on HuggingFace, the InternVid-10M-FLT (default), InternVid-10M-DIV, and InternVid-18M-Aes have been open-sourced. By selecting different subsets, you can preview them; they provide video IDs, start and end timestamps, video captions, aesthetic scores, and video-text match scores. You can download them in conjunction with yt-dlp or video2dataset.</p>
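<p>The metadata files can also be pulled directly from Python with the huggingface_hub client; a minimal sketch, assuming the package is installed and that your account has accepted the dataset’s terms on the Hub:</p><pre>from huggingface_hub import snapshot_download

# Fetch the InternVid metadata repository into a local folder
# (same target folder as the CLI example below).
local_dir = snapshot_download(
    repo_id="OpenGVLab/InternVid",
    repo_type="dataset",
    local_dir="InternVid_metainfo",
)
print("metadata downloaded to", local_dir)
</pre>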
<p>To download the metafile, you can also use the following command (please log in to the website to obtain permissions first):</p><pre>pip install -U huggingface_hub<br>huggingface-cli download --repo-type dataset --token hf_***(your_huggingface_token) --resume-download --local-dir-use-symlinks False OpenGVLab/InternVid --local-dir InternVid_metainfo</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ZpZ4BivhfBrgvcGI" /><figcaption>Figure 9: Preview of InternVid-10M-FLT, InternVid-10M-DIV, and InternVid-18M-Aes</figcaption></figure><p>We also provide information covering the source languages, which you can download at <a href="https://huggingface.co/datasets/OpenGVLab/InternVid-10M-FLT-INFO">https://huggingface.co/datasets/OpenGVLab/InternVid-10M-FLT-INFO</a>.</p><p><strong>Github:</strong></p><ul><li>Dataset information and Model Zoo: <a href="https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid">https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid</a></li><li>ViCLIP training code: <a href="https://github.com/OpenGVLab/ViCLIP">https://github.com/OpenGVLab/ViCLIP</a></li></ul><h3><strong>Citation</strong></h3><p>The release of InternVid will stimulate research and development in the field of video understanding and generation. In the future, the OpenGVLab team will continue to update and refine InternVid, and is committed to providing researchers and developers with higher quality datasets and tools.</p><p>If you have used the InternVid dataset or ViCLIP in your research or applications, we welcome you to cite the following references:</p><pre>@inproceedings{wang2023internvid,<br>  title={InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation},<br>  author={Wang, Yi and He, Yinan and Li, Yizhuo and Li, Kunchang and Yu, Jiashuo and Ma, Xin and Chen, Xinyuan and Wang, Yaohui and Luo, Ping and Liu, Ziwei and Wang, Yali and Wang, Limin and Qiao, Yu},<br>  booktitle={The Twelfth International Conference on Learning Representations},<br>  year={2023}<br>}</pre><h3>Contact us</h3><p>If you have any questions or suggestions regarding the InternVid dataset, please feel free to contact the OpenGVLab-InternVideo team:</p><ul><li>Email: <a href="mailto:gvx-sh@pjlab.org.cn">gvx-sh@pjlab.org.cn</a></li></ul><p>For any issues or inquiries, the team can provide further assistance and support. If you encounter any difficulties with the dataset or have ideas for improvements, don’t hesitate to reach out. Your feedback is valuable for the continuous development and enhancement of the dataset.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/900/1*JEIV-60mnoSwW6v1OF5t0Q.jpeg" /></figure><p>References</p><p>[1] Miech A, Zhukov D, Alayrac J B, et al. Howto100m: Learning a text-video embedding by watching a hundred million narrated video clips[C]//ICCV 2019: 2630–2640.</p><p>[2] Xue H, Hang T, Zeng Y, et al. Advancing high-resolution video-language representation with large-scale video transcriptions[C]//CVPR 2022: 5036–5045.</p><p>[3] Zellers R, Lu X, Hessel J, et al.
Merlot: Multimodal neural script knowledge models[J]. NeurIPS 2021, 34: 23634–23651.</p><p>[4] Zellers R, Lu J, Lu X, et al. Merlot reserve: Neural script knowledge through vision and language and sound[C]//CVPR 2022: 16375–16387.</p><p>[5] Bain M, Nagrani A, Varol G, et al. Frozen in time: A joint video and image encoder for end-to-end retrieval[C]//ICCV 2021: 1728–1738.</p><p>[6] <a href="https://www.bls.gov/tus/">https://www.bls.gov/tus/</a></p><p>[7] Carreira J, Zisserman A. Quo vadis, action recognition? a new model and the kinetics dataset[C]//CVPR 2017: 6299–6308.</p><p>[8] Goyal R, Ebrahimi Kahou S, Michalski V, et al. The “something something” video database for learning and evaluating visual common sense[C]//ICCV 2017: 5842–5850.</p><p>[9] Mahdisoltani F, Berger G, Gharbieh W, et al. On the effectiveness of task granularity for transfer learning[J]. arXiv preprint arXiv:1804.09235, 2018.</p><p>[10] Soomro K, Zamir A R, Shah M. UCF101: A dataset of 101 human actions classes from videos in the wild[J]. arXiv preprint arXiv:1212.0402, 2012.</p><p>[11] Huang X, Zhang Y, Ma J, et al. Tag2text: Guiding vision-language model via image tagging[J]. arXiv preprint arXiv:2303.05657, 2023.</p><p>[12] Zheng L, Chiang W L, Sheng Y, et al. Judging llm-as-a-judge with mt-bench and chatbot arena[J]. NeurIPS 2024, 36.</p><p>[13] Li J, Li D, Savarese S, et al. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models[C]//ICML 2023: 19730–19742.</p><p>[14] Raffel C, Shazeer N, Roberts A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer[J]. Journal of machine learning research, 2020, 21(140): 1–67.</p><p>[15] Radford A, Kim J W, Hallacy C, et al. Learning transferable visual models from natural language supervision[C]//ICML 2021: 8748–8763.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=19dd83749d13" width="1" height="1" alt=""><hr><p><a href="https://ai.gopubby.com/internvid-video-text-dataset-to-empowering-video-creation-and-understanding-19dd83749d13">InternVid: Video-Text Dataset to Empowering Video Creation and Understanding</a> was originally published in <a href="https://ai.gopubby.com">AI Advances</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[VideoMamba: State Space Model for Efficient Video Understanding]]></title>
            <link>https://ai.gopubby.com/videomamba-state-space-model-for-efficient-video-understanding-3c0f4fce6ae2?source=rss-6247e25c46e0------2</link>
            <guid isPermaLink="false">https://medium.com/p/3c0f4fce6ae2</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[video-understanding]]></category>
            <category><![CDATA[100-followers]]></category>
            <category><![CDATA[ai]]></category>
            <dc:creator><![CDATA[OpenGVLab]]></dc:creator>
            <pubDate>Wed, 27 Mar 2024 12:41:11 GMT</pubDate>
            <atom:updated>2024-03-27T12:41:11.402Z</atom:updated>
            <content:encoded><![CDATA[<p><strong><em>Welcome to follow us on Twitter: </em></strong><a href="https://twitter.com/opengvlab"><strong><em>https://twitter.com/opengvlab</em></strong></a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ZORu2CVfkWS1TlstZ_Y4ww.png" /><figcaption>Comparing two models: TimeSformer-Ti is built on DeiT-Ti with joint spatiotemporal attention. Input frames are 224x224. It was tested on an NVIDIA A100–80G GPU, using PyTorch 2.1 and CUDA 11.8, with a batch size of 128. Our VideoMamba model is better, faster, and cheaper for short-term and long-term video understanding.</figcaption></figure><blockquote>VideoMamba is a pioneering SSM-based model for video understanding🎥:<br>Visual Domain Scalability;<br>Short-term Action Sensitivity;<br>Long-term Video Superiority;<br>Modality Compatibility.</blockquote><h3>Motivation</h3><p>The core objective for video understanding lies in mastering spatiotemporal representations, which inherently presents two formidable challenges: the large spatiotemporal redundancy within short video clips and the complex spatiotemporal dependencies among long contexts. The once-dominant 3D CNNs and video transformers tackled these two issues with convolution and self-attention mechanisms, respectively. In our previous work, UniFormer<a href="https://github.com/Sense-X/UniFormer">[1]</a>, we attempted to combine convolution and self-attention seamlessly. It could simultaneously address both problems but fell short for long videos. The popularity of Gemini<a href="https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/">[2]</a> and Sora<a href="https://openai.com/sora">[3]</a> shifted the research focus towards understanding and generating long videos, urgently demanding more efficient representation learning.</p><p>Fortunately, the NLP field has seen the emergence of several efficient operators in the past two years, such as S4<a href="https://github.com/state-spaces/s4">[4]</a>, RWKV<a href="https://www.rwkv.com/">[5]</a>, and RetNet<a href="https://github.com/microsoft/unilm/tree/master/retnet">[6]</a>. Mamba<a href="https://github.com/state-spaces/mamba">[7]</a> stands out with its selective state space model (S6), which is capable of long-term dynamic modeling with linear complexity. This led to its adaptation to a series of vision tasks: works such as Vision Mamba<a href="https://github.com/hustvl/Vim">[8]</a> and VMamba<a href="https://github.com/MzeroMiko/VMamba">[9]</a> propose multi-directional SSM mechanisms for processing 2D images. These models compete with attention-based architectures and significantly reduce memory consumption.</p><p>Given videos’ inherently longer token sequences, the question arises: Is Mamba equally effective for video understanding? The answer is yes.</p><p>We proposed an efficient video understanding architecture, VideoMamba, which is solely based on the State Space Model (SSM), and conducted extensive experiments to show its excellent properties, including (1) Visual Domain Scalability, (2) Short-term Action Sensitivity, (3) Long-term Video Superiority, and (4) Modality Compatibility.
These enable VideoMamba to achieve remarkable results across a range of video benchmarks, especially in long-video benchmarks, providing an efficient solution for comprehensive video understanding in the future.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Bhhngm9xV79G7itJOZKOhg.png" /></figure><blockquote>Paper: <a href="https://arxiv.org/abs/2403.06977">https://arxiv.org/abs/2403.06977</a></blockquote><blockquote>Code: <a href="https://github.com/OpenGVLab/VideoMamba">https://github.com/OpenGVLab/VideoMamba</a></blockquote><blockquote>Model: <a href="https://huggingface.co/OpenGVLab/VideoMamba">https://huggingface.co/OpenGVLab/VideoMamba</a></blockquote><blockquote>Online Demo: <a href="https://huggingface.co/spaces/OpenGVLab/VideoMamba">https://huggingface.co/spaces/OpenGVLab/VideoMamba</a></blockquote><h3>Methods</h3><h4>Architecture</h4><p>Before formally introducing VideoMamba’s structure, let’s first examine the Mamba block for 1D sequences and the bidirectional Mamba block for vision tasks. We will not delve into the underlying principles of SSM and Mamba here, but those who are interested can learn through YouTube videos at <a href="https://www.youtube.com/watch?v=8Q_tqwpTpVU">https://www.youtube.com/watch?v=8Q_tqwpTpVU</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ijraJc4pOdTgKi9iQlGPDA.png" /></figure><p>The bidirectional Mamba (B-Mamba) block builds upon the normal Mamba by introducing an extra SSM for the reverse sequence, enabling better modeling of 2D sequences and thus enhancing the perceptual capability for visual inputs. Based on the B-Mamba block, our VideoMamba follows the design of ViT<a href="https://github.com/google-research/vision_transformer">[10]</a>, introducing the [CLS] token and spatial position encoding. Specifically for video modeling, we utilize 3D patch embedding and spatial position encoding.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*x91uRspmEPyGauiP8fcunQ.png" /></figure><p>To apply the B-Mamba block to spatiotemporal tokens, we expand the original 2D scanning to different bidirectional 3D scanning methods:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wxstKP43jpPmIoQE-pRMNQ.png" /></figure><p>Among them, the spatial-first scanning is the simplest and the most effective. Based on this, we propose three model sizes: VideoMamba-Ti, VideoMamba-S, and VideoMamba-M.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zwns5SjtXv5EdsoVSzzMGw.png" /></figure><p>In our experiments, scaling up VideoMamba is prone to overfitting, with larger models performing even worse than smaller ones. To address this, we proposed a Self-Distillation strategy, using the pretrained smaller model as a teacher to guide the training of the larger model, effectively avoiding model overfitting with only a minimal additional cost.</p>
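<p>As a generic sketch of this idea (the exact alignment target and loss weighting used in VideoMamba may differ; the KL-on-logits form below is a common, illustrative choice), a frozen, already-trained smaller model supervises the larger one alongside the usual label loss:</p><pre>import torch
import torch.nn.functional as F

def self_distill_loss(student_logits, teacher_logits, labels, alpha=0.5, tau=1.0):
    # Standard classification loss against the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    # Distillation term: match the (temperature-softened) teacher distribution.
    kd = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * (tau * tau)
    return (1 - alpha) * ce + alpha * kd

# Training step (teacher kept frozen, e.g. a pretrained VideoMamba-S guiding VideoMamba-M):
# with torch.no_grad():
#     teacher_logits = videomamba_s(video_batch)
# loss = self_distill_loss(videomamba_m(video_batch), teacher_logits, labels)
</pre>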
<h4>Masked Modeling</h4><p>Recently, VideoMAE<a href="https://github.com/MCG-NJU/VideoMAE">[11] </a>introduced masked modeling, significantly enhancing the model’s capacity for fine-grained temporal understanding. UMT[12] further proposes an efficient mask alignment strategy, significantly reducing training costs and enabling the model to handle various single-modality and multi-modality tasks robustly. To enhance VideoMamba’s temporal sensitivity and verify its compatibility with the text modality, we follow UMT to introduce CLIP-ViT as a teacher for two-stage training.</p><p>Unlike UMT’s multi-layer alignment, due to architectural differences between VideoMamba and ViT, we align only the last layer of the two models. Considering the Mamba block’s preference for continuous tokens, we design row masking strategies:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8Wc-eowbO_jFjZnVvnE2kw.png" /></figure><p>Additionally, we explore attention masking to preserve meaningful adjacency among tokens, leveraging the inherent strengths of the 1D convolution within the B-Mamba block for improved performance.</p><h3>Experiments</h3><h4>Scale Up</h4><p>We first conduct experiments on ImageNet as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*G6MTsrF-mCr5McDhfLGBeA.png" /></figure><p>It is evident that without Self-Distillation (SD), both VideoMamba-M and VideoMamba-B tend to overfit towards the end of the training, with VideoMamba-B experiencing particularly severe overfitting. However, with SD, VideoMamba-M converges as expected and is notably superior to the teacher model, VideoMamba-S. To prevent potential over-guidance by the teacher model, we also introduce an Early Stop strategy, i.e., removing the distillation guidance partway through training, but experiments show the results remain unsatisfactory. The complete ImageNet comparison is as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/872/1*h6dKZvIGIxSow1_ggtUkMg.png" /></figure><p>Compared with other isotropic models, VideoMamba outperforms the convolution-based ConvNeXt and the attention-based ViT. As the model scale and resolution increase, performance consistently improves.</p><h4>Short-term Video Understanding</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/720/1*rLvMio77AWDeeypFHw0qQg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/744/1*CmO8tlOFQmaI7HJYpm0eUw.png" /></figure><p>On K400 and SthSthV2, we also observe VideoMamba’s excellent scalability, significantly outperforming attention-based video models such as TimeSformer and ViViT and performing comparably to UniFormer, which combines convolution and self-attention. Furthermore, VideoMamba’s performance significantly improves after introducing masked training, notably surpassing the ViT-based UMT on the fine-grained action dataset SthSthV2.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*M7x2sL2v6ekJdSdRGh4FNA.png" /></figure><p>Further ablation experiments show that the spatial-first scanning approach yields the best results. Unlike on ImageNet, where performance gradually increases with resolution, on video datasets the resolution has a limited impact on performance, whereas the frame number significantly affects it. For masked modeling, row masking is superior to random masking strategies, and attention masking strategies are the most effective; aligning the last layer yields the best results; appropriate masking ratios and Droppath rates can significantly enhance training effectiveness.</p><h4>Long-term Video Understanding</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/819/1*bC1ZHRiOhgDI9_PPYk4zXg.png" /></figure><p>We evaluate VideoMamba’s understanding of long-duration videos on the Breakfast, COIN, and LVU datasets.
Compared to previous feature-based methods, VideoMamba requires only 32 to 64 sparse frames to achieve significantly superior performance, while the model size is smaller.</p><h4>Multi-modality Video Understanding</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*o0i_mYagmi52aA_igiGo2g.png" /></figure><p>We connected VideoMamba with BERT to construct a multimodal model, used large-scale multimodal data for pre-training, and evaluated its performance on several video-text retrieval tasks. Experiments reveal that VideoMamba can also serve as a multimodal visual encoder. Larger-scale pre-training data continuously improves its multimodal capacity. Compared with UMT based on ViT, VideoMamba excels on datasets containing long videos (ANet and DiDeMo) and more complex scenarios (LSMDC).</p><h3>Conclusion</h3><p>We propose VideoMamba for efficient video understanding based solely on the State Space Model. Comprehensive experiments have shown that VideoMamba possesses a series of excellent characteristics for video understanding. We hope it can pave the way for future representation learning of long videos.</p><blockquote>Paper: <a href="https://arxiv.org/abs/2403.06977">https://arxiv.org/abs/2403.06977</a></blockquote><blockquote>Code: <a href="https://github.com/OpenGVLab/VideoMamba">https://github.com/OpenGVLab/VideoMamba</a></blockquote><blockquote>Model: <a href="https://huggingface.co/OpenGVLab/VideoMamba">https://huggingface.co/OpenGVLab/VideoMamba</a></blockquote><blockquote>Online Demo: <a href="https://huggingface.co/spaces/OpenGVLab/VideoMamba">https://huggingface.co/spaces/OpenGVLab/VideoMamba</a></blockquote><p><strong><em>Welcome to follow us on Twitter: </em></strong><a href="https://twitter.com/opengvlab"><strong><em>https://twitter.com/opengvlab</em></strong></a></p><h3>References</h3><p>[1] UniFormer: <a href="https://github.com/Sense-X/UniFormer">https://github.com/Sense-X/UniFormer</a></p><p>[2] Gemini: <a href="https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/">https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/</a></p><p>[3] Sora: <a href="https://openai.com/sora">https://openai.com/sora</a></p><p>[4] S4: <a href="https://github.com/state-spaces/s4">https://github.com/state-spaces/s4</a></p><p>[5] RWKV: <a href="https://www.rwkv.com/">https://www.rwkv.com/</a></p><p>[6] RetNet: <a href="https://github.com/microsoft/unilm/tree/master/retnet">https://github.com/microsoft/unilm/tree/master/retnet</a></p><p>[7] Mamba: <a href="https://github.com/state-spaces/mamba">https://github.com/state-spaces/mamba</a></p><p>[8] Vision Mamba: <a href="https://github.com/hustvl/Vim">https://github.com/hustvl/Vim</a></p><p>[9] VMamba: <a href="https://github.com/MzeroMiko/VMamba">https://github.com/MzeroMiko/VMamba</a></p><p>[10] ViT: <a href="https://github.com/google-research/vision_transformer">https://github.com/google-research/vision_transformer</a></p><p>[11] VideoMAE: <a href="https://github.com/MCG-NJU/VideoMAE">https://github.com/MCG-NJU/VideoMAE</a></p><p>[12] UMT: <a href="https://github.com/OpenGVLab/unmasked_teacher">https://github.com/OpenGVLab/unmasked_teacher</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=3c0f4fce6ae2" width="1" height="1" alt=""><hr><p><a href="https://ai.gopubby.com/videomamba-state-space-model-for-efficient-video-understanding-3c0f4fce6ae2">VideoMamba: State Space Model for Efficient Video Understanding</a> was originally
published in <a href="https://ai.gopubby.com">AI Advances</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The All-Seeing Project: Towards Panoptic Visual Recognization and General Relation Comprehension…]]></title>
            <link>https://ai.gopubby.com/the-all-seeing-project-towards-panoptic-visual-recognization-and-general-relation-comprehension-f76c2bde3e2c?source=rss-6247e25c46e0------2</link>
            <guid isPermaLink="false">https://medium.com/p/f76c2bde3e2c</guid>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[computer-science]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[research]]></category>
            <dc:creator><![CDATA[OpenGVLab]]></dc:creator>
            <pubDate>Tue, 26 Mar 2024 12:56:23 GMT</pubDate>
            <atom:updated>2024-03-26T12:56:23.281Z</atom:updated>
<content:encoded><![CDATA[<h3>The All-Seeing Project: Towards Panoptic Visual Recognization and General Relation Comprehension of the Open World</h3><p><strong><em>Welcome to follow us on Twitter: </em></strong><a href="https://twitter.com/opengvlab"><strong><em>https://twitter.com/opengvlab</em></strong></a></p><p>The study of artificial general intelligence (AGI) systems that can match human intelligence and excel in any task across domains represents the ultimate goal in the field of artificial intelligence. Benefiting from the advancements of Large Language Models (LLMs), Multi-modal Large Language Models (MLLMs) have demonstrated impressive capabilities in a variety of Vision-Language tasks, suggesting new avenues for achieving AGI. However, most popular MLLMs are limited to understanding the image as a whole instead of a specific region or instance in it.</p><p>As an effective method to improve interaction efficiency, the capabilities of grounding and referring (i.e., adopting bounding boxes in responses) have attracted increasing attention and have been widely integrated into current Grounded MLLMs. Such capabilities empower users and models to directly point to objects or image regions without needing elaborate textual descriptions for reference, and they enable the model to provide visual responses (e.g., bounding boxes), supporting more vision-language tasks such as region captioning, referring expression comprehension, and referring question answering. However, without the support of large-scale instance-level visual understanding data, the generalization ability of these models is still limited. In addition, due to the lack of appropriate modeling methods and suitable training data for relation knowledge, these models also struggle to comprehend the inter-object relations within images accurately.</p><p>To address these challenges, we propose the All-Seeing project. From the data aspect, this project proposes (1) <strong>the All-Seeing Dataset (AS-1B)</strong>, which is constructed for pretraining and consists of over 1.2 billion region annotations in various formats, such as semantic tags, question-answering pairs, and detailed captions; (2) <strong>the All-Seeing Dataset v2 (AS-V2)</strong>, which is constructed for instruction tuning and consists of 127K high-quality samples, focusing on relation comprehension.</p><p>From the model perspective, we propose (1) <strong>the All-Seeing Model (ASM)</strong>, a unified location-aware image-text foundation model supporting both image-text retrieval and generation tasks, demonstrating impressive zero-shot capability; and (2) <strong>the All-Seeing Model v2 (ASMv2)</strong>, which unifies the formulation of text generation, object localization, and relation comprehension, demonstrating powerful performance across various tasks, including Open-ended Scene Graph Generation and other general image-level and region-level vision-language tasks.</p><p>Notably, ASMv2 achieves SoTA performance on a total of 20 multimodal benchmarks, including 13 image-level benchmarks and 7 region-level benchmarks.
In addition, benefiting from the ability to ground relations between objects, ASMv2 can also be applied to the Scene Graph Generation task, on which it demonstrates competitive performance.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2uXM0xsGam_AXiNwCMfOdA.png" /></figure><blockquote><strong>Paper</strong>: <a href="https://arxiv.org/abs/2308.01907">https://arxiv.org/abs/2308.01907</a></blockquote><blockquote><strong>Code</strong>: <a href="https://github.com/OpenGVLab/all-seeing/tree/main/all-seeing">https://github.com/OpenGVLab/all-seeing/tree/main/all-seeing</a></blockquote><blockquote><strong>Huggingface</strong>: <a href="https://huggingface.co/collections/OpenGVLab/all-seeing-project-65e19865ffaa2b2de9161c8c">https://huggingface.co/collections/OpenGVLab/all-seeing-project-65e19865ffaa2b2de9161c8c</a></blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/970/1*7GORuDE47_dj01EbBndHbA.png" /></figure><blockquote><strong>Paper</strong>: <a href="https://arxiv.org/abs/2402.19474">https://arxiv.org/abs/2402.19474</a></blockquote><blockquote><strong>Code</strong>: <a href="https://github.com/OpenGVLab/all-seeing/tree/main/all-seeing-v2">https://github.com/OpenGVLab/all-seeing/tree/main/all-seeing-v2</a></blockquote><blockquote><strong>Huggingface</strong>: <a href="https://huggingface.co/collections/OpenGVLab/all-seeing-project-65e19865ffaa2b2de9161c8c">https://huggingface.co/collections/OpenGVLab/all-seeing-project-65e19865ffaa2b2de9161c8c</a></blockquote><h3>Dataset</h3><h4>All-Seeing Dataset</h4><p>AS-1B is the first large-scale region-level vision-language dataset, consisting of over 1 billion region-text pairs, 3.5 million open-world concepts, and 100 billion tokens of region-related question answering and captions. Each region is annotated with comprehensive information, including semantic tags, question-answer pairs, and captions. Compared with previous visual recognition datasets such as ImageNet and COCO, and visual understanding datasets such as Visual Genome and LAION-5B, the proposed AS-1B dataset stands out for its rich and diverse instance-level annotations and the corresponding detailed object concepts and descriptions.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*F7CGJHiN326MqwMNXxG75g.png" /></figure><p>AS-1B is constructed based on a semi-automatic data engine, which greatly reduces the annotation cost.</p><p>Specifically, the process of the data engine begins by generating noisy pseudo data using well-trained off-the-shelf foundation models from diverse fields. Subsequently, a subset of these pseudo data are sampled to be verified and corrected by human annotators. After that, we pre-train our All-Seeing Model (ASM) with the noisy pseudo data and fine-tune it with the human-verified data. Then we re-annotate the data with the aid of ASM. The process of annotation, verification, and fine-tuning is repeated to iteratively refine the annotation quality. By employing this “data-human-model” cycle, we can generate a large number of region-level annotations with exceptional quality.</p><p>The fully verified subset of AS-1B, termed AS-Core, is released separately.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/894/1*6hnEseKGSZEDDofoAxF3GA.png" /></figure><h4>All-Seeing Dataset V2</h4><p>We propose a novel task, termed <strong>Relation Conversation (ReC)</strong>, which unifies the formulation of text generation, object localization, and relation comprehension.
Based on the unified formulation, we construct the AS-V2 dataset, which consists of 127K high-quality relation conversation samples, to unlock the ReC capability for Multi-modal Large Language Models (MLLMs).</p><p>As depicted in the Figure below, ReC requires the model to generate a text response while simultaneously linking all mentioned objects, as well as the subject and object of each predicate in the response, to the corresponding regions in the image.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*GtmiJP-zYnXUeFomKlawzQ.png" /></figure><p>We utilize GPT-4V to construct AS-V2 based on COCO images and their caption, location, and relation annotations. The key idea is to query GPT-4V to generate responses while linking the objects and predicates mentioned in the generated response to specific regions within the image, referring to the given location annotations and relation annotations.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/828/1*RkuVss3tOLhxQyRL_e7qWw.png" /></figure><h4>Circular-based Relation Probing Evaluation</h4><p>To evaluate the relation comprehension ability of existing MLLMs, we construct a benchmark called Circular-based Relation Probing Evaluation (CRPE), which is the first benchmark that covers all elements of the relation triplets (subject, predicate, object), providing a systematic platform for the evaluation of relation comprehension ability.</p><p>CRPE is formulated as single-choice questions and consists of four splits: Existence, Subject, Predicate, and Object. The Existence split evaluates the object recognition ability, while the remaining splits are designed to evaluate the relation comprehension capability.</p><p>Additionally, to evaluate the dependency on language priors, we include abnormal data in CRPE, which depict relations that are rare but reasonable in the real world.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/835/1*RLmGQDkC01e7oobSKDIxKg.png" /></figure><h3>Model</h3><h4>All-Seeing Model</h4><p>Our All-Seeing Model (ASM) is a unified location-aware image-text foundation model, which supports both generative tasks (e.g., Image Captioning and Region Captioning) and discriminative tasks (e.g., Region Recognition), and demonstrates powerful performance on both.</p><p>This model comprises three key designs:</p><p>(1) <strong>a location-aware image tokenizer</strong>, which extracts features at both the image level and the region level based on the input image and bounding box, respectively.</p><p>(2) <strong>a trainable task prompt</strong> that is incorporated at the beginning of the vision and text tokens to guide the model in distinguishing between discriminative and generative tasks.</p><p>(3) <strong>an LLM-based decoder</strong> that is utilized to extract vision and text features for discriminative tasks, as well as to auto-regressively generate response tokens for generative tasks.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/873/1*P0Nkib5Y8PhM23zsC84INw.png" /></figure><p>In the case of discriminative tasks, a trainable align token is appended to the input sequence to gather the overall representation, and its embedding is then used in the matching process.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/873/1*Zst-ELHD4GXqCFNbA-npdA.png" /></figure><h4>All-Seeing Model v2</h4><p>All-Seeing Model v2 (ASMv2) is a powerful Multi-modal Large Language Model (MLLM), which integrates the Relation Conversation (ReC) ability while maintaining strong
general capabilities. This model can deal with three types of tasks: (1) <strong>Relation Conversation</strong>, which requires the model to link all objects and predicates mentioned in the response to the corresponding regions in the image; (2) <strong>Open-ended Scene Graph Generation</strong>, which requires the model to generate a scene graph based on the given image in an open-ended manner; and (3) <strong>General Multimodal Conversation</strong>, which requires the model to generate a text response given the image and user query. Compared to ASM, ASMv2 places more emphasis on the recognition and comprehension of objects and the relations between them within images during multimodal dialogue processes.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/656/1*ZjelmP6Th2pXLbGYk4K1wQ.png" /></figure><p>ASMv2 follows the model architecture of LLaVA, comprising <strong>a vision encoder</strong> (CLIP-L-336), <strong>a vision-language connector</strong> (2-layer MLP), and <strong>a language model</strong> (Vicuna-13B). Unlike ASM, ASMv2 directly understands regions in images based on the bounding boxes provided in the text, eliminating the need for additional RoI Align operations to assist in understanding region information.</p><p>Specifically, our relation conversation marks the object and predicate in the sentence using &lt;ref&gt;&lt;/ref&gt; and &lt;pred&gt;&lt;/pred&gt;, respectively. Each marked object is followed by a bounding box, indicating its localization. Similarly, each predicate is followed by two bounding boxes, which refer to the subject and object of the predicate, respectively. All bounding boxes are normalized to integer values within the range [0, 1000) and formatted as &lt;box&gt;[[x1, y1, x2, y2]]&lt;/box&gt;. Notably, models trained on ReC can be naturally adapted to the Scene Graph Generation task: the grounded objects serve as the nodes of the scene graph, while the grounded predicates serve as the edges (see the parsing sketch in the Experiment section below).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UMXJL2ioSqUhK67cGgv_kA.png" /></figure><h3>Experiment</h3><p>The experimental results show that ASMv2 achieves state-of-the-art performance on a total of 20 multimodal benchmarks, including 13 image-level benchmarks and 7 region-level benchmarks. Specifically, ASMv2 outperforms LLaVA-1.5 by 6.7 points on MMBench and improves by 89.7 points on MME.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/701/1*gu-Z4L4e74LNWOu1wUWX8g.png" /></figure><p>On region-level benchmarks, ASMv2 demonstrates an average improvement of approximately 1.7 points over Qwen-VL in the Referring Expression Comprehension task. Additionally, ASMv2 achieves state-of-the-art performance on Region Captioning and Referring Question Answering tasks.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/703/1*ftNB2AhnlZfsk7dfeThfsw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*42W0sw8Z8lCwJepMhQjsXA.png" /></figure><p>ASMv2 also exhibits outstanding performance in the Scene Graph Generation task.
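</p><p>Because every object and predicate in a relation conversation is explicitly grounded, converting a ReC response into a scene graph is essentially a parsing exercise. The snippet below is a minimal, illustrative sketch (the response string, regular expressions, and exact spacing are hypothetical, not the official post-processing code) that extracts nodes and edges from a ReC-formatted answer:</p><pre>import re

# A hypothetical ReC-style response following the tag format described above
response = (
    "&lt;ref&gt;man&lt;/ref&gt;&lt;box&gt;[[100, 200, 400, 900]]&lt;/box&gt; "
    "&lt;pred&gt;riding&lt;/pred&gt;&lt;box&gt;[[100, 200, 400, 900]]&lt;/box&gt;&lt;box&gt;[[50, 600, 500, 980]]&lt;/box&gt; "
    "&lt;ref&gt;bicycle&lt;/ref&gt;&lt;box&gt;[[50, 600, 500, 980]]&lt;/box&gt;"
)

BOX = r"&lt;box&gt;\[\[(\d+), (\d+), (\d+), (\d+)\]\]&lt;/box&gt;"

# Nodes: every grounded object (&lt;ref&gt; followed by one box)
nodes = [(m.group(1), tuple(map(int, m.groups()[1:])))
         for m in re.finditer(r"&lt;ref&gt;(.*?)&lt;/ref&gt;" + BOX, response)]

# Edges: every grounded predicate (&lt;pred&gt; followed by subject and object boxes)
edges = [(m.group(1),
          tuple(map(int, m.groups()[1:5])),   # subject box
          tuple(map(int, m.groups()[5:9])))   # object box
         for m in re.finditer(r"&lt;pred&gt;(.*?)&lt;/pred&gt;" + BOX + BOX, response)]

print(nodes)  # [('man', (100, 200, 400, 900)), ('bicycle', (50, 600, 500, 980))]
print(edges)  # [('riding', (100, 200, 400, 900), (50, 600, 500, 980))]</pre><p>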
Notably, compared to traditional scene graph generation models, ASMv2 not only generates scene graphs in an open-ended manner, giving it the potential to generalize to previously unseen predicate labels, but also maintains the general multimodal capabilities of MLLMs, broadening its applicability in real-world scenarios.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/347/1*m9L2GPwN61tc5PpBFPUhIg.png" /></figure><p>On the CRPE benchmark, ASMv2 also demonstrates state-of-the-art performance. For example, ASMv2 achieves an overall accuracy of 52.04, which is significantly higher than the 43.14 of LLaVA-1.5 and the 27.27 of Qwen-VL. These results demonstrate that ASMv2 better comprehends the relations between objects within an image, benefiting from training on relation conversation data.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/456/1*l3W7y3YFLdGYR2Ra7svEMg.png" /></figure><blockquote><strong>Paper</strong>: <a href="https://arxiv.org/abs/2308.01907">https://arxiv.org/abs/2308.01907</a></blockquote><blockquote><strong>Code</strong>: <a href="https://github.com/OpenGVLab/all-seeing/tree/main/all-seeing">https://github.com/OpenGVLab/all-seeing</a></blockquote><p><strong><em>Welcome to follow us on Twitter! </em></strong><a href="https://twitter.com/opengvlab"><strong><em>https://twitter.com/opengvlab</em></strong></a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f76c2bde3e2c" width="1" height="1" alt=""><hr><p><a href="https://ai.gopubby.com/the-all-seeing-project-towards-panoptic-visual-recognization-and-general-relation-comprehension-f76c2bde3e2c">The All-Seeing Project: Towards Panoptic Visual Recognization and General Relation Comprehension…</a> was originally published in <a href="https://ai.gopubby.com">AI Advances</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[OmniQuant: Calibrated Quantization for LLMs, Has been Integrated with commercial APP]]></title>
            <link>https://medium.com/@opengvlab/omniquant-calibrated-quantization-for-llms-has-been-integrated-with-commercial-app-83019d10c465?source=rss-6247e25c46e0------2</link>
            <guid isPermaLink="false">https://medium.com/p/83019d10c465</guid>
            <category><![CDATA[model-quantization]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[100-followers]]></category>
            <category><![CDATA[large-language-models]]></category>
            <category><![CDATA[research]]></category>
            <dc:creator><![CDATA[OpenGVLab]]></dc:creator>
            <pubDate>Tue, 12 Mar 2024 07:59:14 GMT</pubDate>
            <atom:updated>2024-03-12T07:59:14.261Z</atom:updated>
            <cc:license>http://creativecommons.org/licenses/by/4.0/</cc:license>
<content:encoded><![CDATA[<p>Model quantization is a key technique for compressing and accelerating large language models: it quantizes model weights and activations to low-bit representations, allowing the model to occupy less memory and run faster at inference. It is even more critical for large language models with a vast number of parameters. For example, the GPT-3 model with 175B parameters requires 350GB of memory when loaded in FP16 format, necessitating at least five 80GB A100 GPUs. However, if the GPT-3 model weights could be quantized to 3-bit, it would be possible to load all model weights on a single A100–80GB GPU.</p><p>Existing post-training quantization algorithms for large language models rely on manually set quantization parameters and lack a gradient optimization process, leading to significant performance degradation under low-bit quantization. Although quantization-aware training is effective in determining the best quantization parameters, it requires a large amount of additional training overhead and data, and the computational burden of large language models further hinders its application to them. This raises the question: can we achieve the performance of quantization-aware training while maintaining the training time and data efficiency of post-training quantization?</p><p>To address the optimization of quantization parameters in post-training quantization of large language models, researchers from Shanghai AI Lab, The University of Hong Kong, and The Chinese University of Hong Kong have proposed OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models. This algorithm simultaneously supports the quantization of weights and activations in large language models and covers various quantization bit settings.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/875/1*djoJrzK4t0RUT74L7zEHFQ.png" /></figure><blockquote>arXiv Paper: <a href="https://arxiv.org/abs/2308.13137">https://arxiv.org/abs/2308.13137</a></blockquote><blockquote>OpenReview Paper: <a href="https://openreview.net/forum?id=8Wuvhh0LYW">https://openreview.net/forum?id=8Wuvhh0LYW</a></blockquote><blockquote>Code: <a href="https://github.com/OpenGVLab/OmniQuant">https://github.com/OpenGVLab/OmniQuant</a></blockquote><h3>Overall Framework</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/875/1*iuAAPvQGJ9VhlzmmFZSzLg.png" /></figure><p>As shown in the figure above, OmniQuant is a differentiable quantization technique for large language models (LLMs) that supports both weight and activation quantization. It achieves high-performance quantized models while maintaining the training-time and data efficiency of post-training quantization. For example, OmniQuant can complete the update of quantization parameters for LLaMA-7B to LLaMA-70B models on a single A100–40GB GPU within 1–16 hours. To achieve this goal, OmniQuant adopts a Block-wise quantization error minimization framework. At the same time, OmniQuant introduces two novel strategies that add learnable quantization parameters: learnable weight clipping (LWC) to reduce the difficulty of quantizing weights, and a learnable equivalent transformation (LET) to further shift the challenge of quantization from activations to weights.
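</p><p>To put the memory numbers quoted above in perspective, a few lines of back-of-the-envelope arithmetic (weights only, ignoring activations, the KV cache, and runtime overhead) reproduce them:</p><pre># Rough weight-memory estimate for a 175B-parameter model at different bit-widths
params = 175e9

def weight_memory_gb(n_params, bits):
    """Storage needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits / 8 / 1e9

print(weight_memory_gb(params, 16))  # 350.0 GB -> needs ~5 x A100-80GB in FP16
print(weight_memory_gb(params, 3))   # ~65.6 GB -> fits on a single A100-80GB</pre><p>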
Additionally, all learnable parameters introduced by OmniQuant can be fused and eliminated after quantization, allowing the quantized model to be deployed on multiple platforms, including GPUs, Android, iOS, etc., based on existing tools.</p><h3>Block-wise Quantization Error Minimization</h3><p>OmniQuant proposes a new optimization process that uses Block-wise quantization error minimization and optimizes the additional quantization parameters in a differentiable manner. The optimization objective is formulated as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/899/1*rJj0A8Iap9VVvUCGQK-aKA.png" /></figure><p>where <strong><em>F</em></strong> represents the mapping function of a transformer block in the LLM, <strong>W</strong> and <strong>X</strong> are the full-precision weights and activations, respectively, <strong>QW(·)</strong> and <strong>Qa(·)</strong> represent the weight and activation quantizers, respectively, and Θ1 and Θ2 are the learnable quantization parameters in learnable weight clipping (LWC) and the learnable equivalent transformation (LET), respectively. OmniQuant quantizes the parameters of one transformer block at a time and then moves on to the next.</p><h3>Learnable Weight Clipping (LWC)</h3><p>Equivalent transformation migrates scale between model weights and activations, so the learnable equivalent transformation used in OmniQuant causes the weight distribution to change continuously during optimization. Previous methods that directly learn weight clipping thresholds [1,2] are only suitable when the weight distribution does not change drastically; otherwise, they struggle to converge. To address this issue, rather than learning the clipping thresholds directly, LWC optimizes the clipping strengths in the following way:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jTcXPZQgDcwn9cHafX-erg.png" /></figure><p>where ⌊·⌉ denotes the rounding operation. <strong><em>N</em></strong> is the target bit-width. <strong><em>Wq</em></strong> and <strong>W</strong> represent the quantized and full-precision weights, respectively. <strong><em>h</em></strong> is the normalization factor for weights, and <strong><em>z</em></strong> is the zero-point value. The clipping (clamp) operation limits the quantized values to the range of <strong><em>N</em></strong>-bit integers, i.e., [0, 2^<em>N</em>−1]. In the above formula, γ ∈ [0, 1] and β ∈ [0, 1] are the learnable clipping strengths for the upper and lower bounds of the weights, respectively. Thus, the LWC parameters appearing in the optimization objective are:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/356/1*peSY31AOAKPOyzzSl8cBVw.png" /></figure><h3>Learnable Equivalent Transformation (LET)</h3><p>In addition to LWC, which makes weights easier to quantize through the clipping threshold, OmniQuant further reduces the difficulty of activation quantization through LET. Considering that outliers in LLM activations exist in specific channels, previous methods such as SmoothQuant [3] and Outlier Suppression+ [4] shift the difficulty of quantization from activations to weights through mathematically equivalent transformations. However, manually chosen or greedily searched equivalent transformation parameters limit the performance of quantized models. Thanks to the introduction of Block-wise quantization error minimization, OmniQuant’s LET can determine the optimal equivalent transformation parameters in a differentiable manner.
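</p><p>To see why such a transformation is mathematically equivalent, the small numerical check below (a generic sketch of channel-wise scaling and shifting in the spirit of SmoothQuant and Outlier Suppression+, not OmniQuant’s actual implementation) confirms that folding a per-channel scale <strong>s</strong> and shift <strong>δ</strong> out of the activations and into the weights and bias leaves a linear layer’s output unchanged:</p><pre>import torch

torch.manual_seed(0)
T, C_in, C_out = 4, 8, 6
X = torch.randn(T, C_in)        # activations (token sequence)
W = torch.randn(C_in, C_out)    # weights
B = torch.randn(C_out)          # bias

# Per-input-channel scale and shift applied to the activations
s = torch.rand(C_in) + 0.5      # keep scales away from zero
delta = torch.randn(C_in)

Y = X @ W + B                   # original output

X_t = (X - delta) / s           # transformed (easier-to-quantize) activations
W_t = s.unsqueeze(1) * W        # scale folded into the weights
B_t = B + delta @ W             # shift folded into the bias
Y_t = X_t @ W_t + B_t

print(torch.allclose(Y, Y_t, atol=1e-5))  # True: the transformation is lossless</pre><p>OmniQuant simply makes <strong>s</strong> and <strong>δ</strong> learnable and optimizes them against the block-wise quantization error instead of choosing them by hand.</p><p>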
Inspired by Outlier Suppression+ [4], it adopts channel-level scaling and channel-level shifting to manipulate the activation distribution, providing an effective solution for the outlier problem in activations. Specifically, OmniQuant explores equivalent transformations in linear layers and attention operations.</p><h4><strong><em>Equivalent Transformation in Linear Layers:</em></strong></h4><p>The linear layer accepts an input token sequence <strong>X</strong> ∈ ℝ^(T×C_in), where <strong><em>T</em></strong> is the token length, and computes its product with the weight matrix <strong>W</strong> ∈ ℝ^(C_in×C_out) plus the bias vector <strong>B</strong> ∈ ℝ^(1×C_out). The mathematically equivalent linear layer expression is:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/836/1*GH76L48avWaWlEfuOKy5YA.png" /></figure><p>Here, <strong>Y</strong> represents the output, <strong>s</strong> ∈ ℝ^(1×C_in) and <strong>δ</strong> ∈ ℝ^(1×C_in) are the channel-level scaling and shifting parameters, respectively, and the transformed <strong>X</strong>, <strong>W</strong>, and <strong>B</strong> are the equivalent activations, weights, and biases, with ⊘ and <strong>⊙</strong> representing element-wise division and multiplication. Through this equivalent transformation, activations are converted into a form more amenable to quantization, at the cost of increasing the difficulty of weight quantization. In this sense, LWC can improve the model quantization performance achieved by LET because it makes the weights more quantizable. Finally, OmniQuant quantizes the transformed activations and weights as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/426/1*PxuoZenVThPAZaaEkr8x-Q.png" /></figure><p>where <strong><em>Qa</em></strong> is a standard MinMax quantizer, and <strong><em>Qw</em></strong> is a MinMax quantizer with learnable weight clipping (i.e., the proposed LWC).</p><h4><strong>Equivalent Transformation in Attention Operations:</strong></h4><p>In addition to linear layers, attention operations also account for a significant portion of LLM computation. Moreover, the autoregressive inference mode of LLMs requires storing key-value (KV) caches for each token, leading to substantial memory demands for long sequences. Therefore, OmniQuant also considers quantizing the <strong>Q/K/V</strong> matrices in self-attention computations to low bits. Specifically, the learnable equivalent transformation in the self-attention matrix can be written as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*MgJ_VkXfPsv_0IPrhiZrkQ.png" /></figure><p>where s_a ∈ ℝ^(1×C_out) is the scaling factor. The quantized computation in self-attention is expressed as <strong>P</strong> = Softmax(<em>Qa</em>(<strong>Q</strong>)<em>Qa</em>(<strong>K</strong>ᵀ)). OmniQuant also uses the MinMax quantization scheme as <strong><em>Qa</em></strong> to quantize the <strong><em>Q</em></strong> and <strong><em>K</em></strong> matrices. Thus, the final optimization objective includes Θ2 = {δ, s, s_a}.</p><h3>Pseudocode</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hChBi6lrtndpPxT5ulFWEA.png" /><figcaption>The pseudo-algorithm for OmniQuant is shown in the figure.
Note that the additional parameters introduced by LWC and LET can be eliminated after model quantization, meaning that OmniQuant does not introduce any extra overhead to the quantized model.</figcaption></figure><h3>Experiments</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*c3tdkc2HTmrheYflvb1NqA.png" /><figcaption>Experimental results of OmniQuant on LLaMA models with weight-only quantization; more results for OPT models can be found in the original paper.</figcaption></figure><p>It can be seen that OmniQuant consistently outperforms previous LLM weight-only quantization methods across various LLM families (OPT, LLaMA-1, LLaMA-2) and diverse quantization configurations (including W2A16, W2A16g128, W2A16g64, W3A16, W3A16g128, W4A16, and W4A16g128). These experiments demonstrate the versatility of OmniQuant, which can adapt to various quantization configurations. For example, although AWQ [5] is particularly effective for group-wise quantization, OmniQuant shows superior performance in both channel-wise and group-wise quantization. Moreover, as the number of quantization bits decreases, the performance advantage of OmniQuant becomes more pronounced.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1021/1*PZeIluvkOaauK1rWG3dQaw.png" /></figure><p>In the weight-activation quantization settings, the experiments mainly focus on W6A6 and W4A4 quantization. The W8A8 setting is excluded because SmoothQuant already achieves nearly lossless W8A8 quantization compared to the full-precision model. The figure above shows the experimental results of OmniQuant on LLaMA models with both weights and activations quantized. Notably, in W4A4 quantization across different models, OmniQuant significantly improves the average accuracy, with an increase ranging from +4.99% to +11.80%. On the LLaMA-7B model in particular, OmniQuant even surpasses the recent quantization-aware training method LLM-QAT [6] by a significant margin of +6.22%. This improvement proves the effectiveness of introducing additional learnable parameters, which is more beneficial than the global weight adjustments used in quantization-aware training.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*gMvnW0eno6NKiI3pjTrxyg.png" /></figure><p>Furthermore, models quantized using OmniQuant can be seamlessly deployed on MLC-LLM [7]. The figure above shows the memory requirements and inference speeds of the quantized LLaMA-series models on the NVIDIA A100–80G. Weights Memory (WM) represents the storage of the quantized weights, while Running Memory (RM) indicates the memory used during inference, which is higher because some activations must be retained. Inference speed is measured by generating 512 tokens. It is evident that, compared to the 16-bit full-precision model, the quantized models significantly reduce memory usage. Moreover, W4A16g128 and W2A16g128 quantization almost double the inference speed.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ZNfNwZ_kaFHFkz4OscTarQ.png" /></figure><p>It is worth noting that MLC-LLM [7] also supports the deployment of OmniQuant-quantized models on other platforms, including Android and iOS phones.
As shown in the figure, the recent application <strong>Private LLM</strong> utilizes the OmniQuant algorithm to achieve memory-efficient deployment of LLMs on iPhones, iPads, macOS, and other platforms.</p><h3>Conclusion</h3><p>OmniQuant is an advanced algorithm for large language model quantization that pushes quantization down to low-bit formats. The core principle of OmniQuant is to retain the original full-precision weights while adding learnable quantization parameters. It improves the quantization compatibility of weights and activations using learnable weight clipping and learnable equivalent transformations. Because it optimizes only a small number of parameters in a block-wise manner, OmniQuant maintains training-time and data efficiency comparable to existing PTQ methods. Additionally, OmniQuant ensures hardware compatibility, as its added trainable parameters can be fused into the original model without any extra overhead.</p><h3>References</h3><p>[1] PACT: Parameterized clipping activation for quantized neural networks.</p><p>[2] LSQ: Learned step size quantization.</p><p>[3] SmoothQuant: Accurate and efficient post-training quantization for large language models.</p><p>[4] Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling.</p><p>[5] AWQ: Activation-aware weight quantization for LLM compression and acceleration.</p><p>[6] LLM-QAT: Data-free quantization aware training for large language models.</p><p>[7] MLC-LLM: <a href="https://github.com/mlc-ai/mlc-llm">https://github.com/mlc-ai/mlc-llm</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=83019d10c465" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>