<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by OpenMMLab on Medium]]></title>
        <description><![CDATA[Stories by OpenMMLab on Medium]]></description>
        <link>https://medium.com/@openmmlab?source=rss-7b857bec476d------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*GJ3FIwXgnItx-pY22aSbqg.png</url>
            <title>Stories by OpenMMLab on Medium</title>
            <link>https://medium.com/@openmmlab?source=rss-7b857bec476d------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Mon, 06 Apr 2026 00:22:52 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@openmmlab/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[InternLM3 Open Source: Achieving High-Performance Models with 4T Data]]></title>
            <link>https://openmmlab.medium.com/internlm3-open-source-achieving-high-performance-models-with-4t-data-c18e94bd5c6a?source=rss-7b857bec476d------2</link>
            <guid isPermaLink="false">https://medium.com/p/c18e94bd5c6a</guid>
            <dc:creator><![CDATA[OpenMMLab]]></dc:creator>
            <pubDate>Thu, 16 Jan 2025 10:33:58 GMT</pubDate>
            <atom:updated>2025-01-16T11:37:36.953Z</atom:updated>
<content:encoded><![CDATA[<blockquote>Written by InternLM Team</blockquote><p>On January 15, Shanghai AI Lab announced a major upgrade to the InternLM model with the release of InternLM3. By refining its data framework, InternLM3 significantly improved data efficiency and achieved a leap in IQPT (Intelligence Quality per Token).</p><p>The InternLM3-8B-Instruct, trained with only 4T of data, outperformed other open-source models of a similar scale while cutting training costs by over 75%. Additionally, for the first time, InternLM3 integrates routine conversational capabilities with deep thinking in a general-purpose model, enabling it to handle a wider range of real-world scenarios.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*kMe3eafKUBczk6w4" /></figure><p>Demo page: <a href="https://internlm-chat.intern-ai.org.cn">https://internlm-chat.intern-ai.org.cn</a></p><p>HuggingFace: <a href="https://huggingface.co/internlm/internlm3-8b-instruct">https://huggingface.co/internlm/internlm3-8b-instruct</a></p><p>GitHub: <a href="https://github.com/InternLM/InternLM">https://github.com/InternLM/InternLM</a></p><h3>High IQPT drives high-performance reasoning</h3><p>Data is a key driver for enhancing large model capabilities. Currently, most popular open-source models rely on expanding the scale of pretraining data to improve performance, with datasets typically approaching 20T tokens. This approach, however, leads to a linear increase in training costs and raises industry-wide concerns about data bottlenecks and the sustainability of the Scaling Law.</p><p>Our research team believes that improving data quality offers far greater benefits than merely increasing data volume. The core of data quality is the data’s IQPT (Intelligence Quality per Token): the logic, complexity, and inspiration embedded in the thinking process that the data captures.
To this end, we propose a large-scale data refinement framework, which substantially improves the quality of training data.</p><p>In practice, InternLM3 matched the performance of popular open-source models trained on 18T tokens while using only 4T tokens of pretraining data. Enhancing model performance by improving data IQPT offers a new research paradigm for breaking through the Scaling Law.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*w7Oz1sZi_a-JJf0G" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*CU-UF9k9i4BzeFsW" /></figure><p>To better evaluate the impact of data IQPT, we quantified the metric by defining IQPT as the ratio of a model’s average performance to the amount of training data. This provides a measure of the “return on investment” for large model training data. Compared to leading open-source models of a similar scale, using Llama3.1 as the baseline, the data IQPT of InternLM3 is more than 4 times higher.</p><p>Through the data refinement framework, we significantly improved data efficiency for InternLM3, achieving a substantial increase in IQPT. This framework consists of two key components:</p><ul><li><strong>Intelligent Data Processing</strong>: To enable fine-grained data handling, we divided the data into millions of domains. Using agent self-evolution techniques, we implemented large-scale automated quality checks, reflected on misclassifications, and applied customized processing to each domain.</li><li><strong>Synthesis of High-Value Data</strong>: By integrating general and specialized models, we rapidly iterated synthesis algorithms, then selected data to train specialized models.
Through material mining in vast amounts of natural data, improved tree-search strategies, and multi-dimensional quality validation, we synthesized a large volume of rich, high-quality data.</li></ul><p>Using the OpenCompass open-source evaluation framework, we conducted evaluations of InternLM3 and other models with a reproducible, unified method. The evaluation utilized over ten authoritative benchmark sets, including CMMLU, GPQA, and others, covering performance dimensions such as reasoning, mathematics, programming, instruction following, long texts, dialogue, and overall performance. The results show that InternLM3 outperforms similar open-source models on most benchmarks, with overall performance closely matching GPT-4o-mini.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*SHZU0lo4FUuFFMjb" /></figure><h3>Fusion of Deep Thinking and General Conversation</h3><p>Exploring general artificial intelligence through the “general-specialized integration” approach relies on synchronously enhancing deep reasoning and domain-generalization capabilities. With the release of InternLM3, deep reasoning and general conversational abilities have for the first time been integrated into a single general-purpose model, enabling it to handle a broader range of real-world scenarios.</p><p>Due to the significant differences in data styles between deep reasoning and general conversation, the industry often develops specialized models for reasoning tasks. Previously, the Shanghai AI Lab introduced InternThinker, a high-performing reasoning model capable of long-form reasoning, self-reflection, and correction during inference, outperforming o1-preview on mathematical competition benchmarks.</p><p>Following the “general-specialized integration” approach, we explored methods for training on a fusion of different data types. This enables the model to combine general conversational and deep reasoning abilities seamlessly.
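</p><p>In practice, this kind of fused behavior is typically driven by the system prompt sent with each request. A minimal sketch (the helper function and prompt wording below are hypothetical, not the official InternLM3 prompts):</p>

```python
# Hypothetical sketch: toggling a fused chat/reasoning model between modes
# by swapping the system prompt. Prompt texts are illustrative only.

def build_messages(question, deep_thinking=False):
    """Prepend a mode-selecting system prompt to a user question."""
    if deep_thinking:
        system = ("You are an expert reasoner. Think through the problem "
                  "step by step before giving your final answer.")
    else:
        system = "You are a helpful conversational assistant."
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

# The resulting list can be fed to any chat-template-aware runtime, e.g.
# tokenizer.apply_chat_template(messages, add_generation_prompt=True).
messages = build_messages("What is 13 * 17?", deep_thinking=True)
```

<p>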
By leveraging system prompts, users can switch the model between conversational and reasoning modes with a single command, endowing the general-purpose model with deep thinking capabilities.</p><p>In the post-training phase, we developed task-driven and knowledge-system-driven synthetic data strategies. This includes instruction annotation and synthesis based on the World Knowledge Tree and high-quality response generation using multi-agent techniques. By maximizing the potential of real and synthetic user instructions, we categorized multi-task scenarios with fine granularity, creating hundreds of thousands of high-quality fine-tuning instruction datasets, which significantly improved the model’s conversational experience.</p><p>As shown in the diagram below, during inference tasks, users can switch InternLM3 from general conversation mode to deep reasoning mode with a single click.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*LeF-6qXYyRj5-m-o" /></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c18e94bd5c6a" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Announcing XTuner: An Efficient Finetune Toolkit for LLM]]></title>
            <link>https://openmmlab.medium.com/announcing-xtuner-an-efficient-finetune-toolkit-for-llm-757ddc9587c3?source=rss-7b857bec476d------2</link>
            <guid isPermaLink="false">https://medium.com/p/757ddc9587c3</guid>
            <category><![CDATA[fine-tuning]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[deep-learning]]></category>
            <dc:creator><![CDATA[OpenMMLab]]></dc:creator>
            <pubDate>Fri, 01 Sep 2023 11:01:11 GMT</pubDate>
            <atom:updated>2023-09-01T11:01:11.267Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/900/1*Y7IBrQcAANYJNFRqcFwF0Q.png" /></figure><p>The advent of ChatGPT has provided a glimpse into the dawn of general artificial intelligence. Meanwhile, tech companies are open-sourcing their large language models one after another.</p><p>However, the high hardware costs associated with large models often pose a significant barrier to many researchers.</p><p>To democratize access to these powerful models and empower various industries, Shanghai Artificial Intelligence Laboratory has developed XTuner, a low-cost toolkit for training large models. <strong>With XTuner, 8GB of GPU memory is all you need to create your own AI assistant.</strong></p><p><a href="https://github.com/InternLM/xtuner">https://github.com/InternLM/xtuner</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*vopYSxEPrC_XrqO1" /></figure><p>XTuner offers a wealth of features that can be freely composed and matched. In addition to standalone functions, XTuner introduces three breakthrough technologies allowing developers to truly work “data-centric”:</p><ol><li><strong>Efficient Data Engine</strong>: XTuner supports several popular open-source dataset formats and allows for mixed usage:</li></ol><p>Alpaca format</p><p>MOSS format</p><p>Guanaco format</p><p>OpenAI format</p><p>…and more being added continuously</p><pre>pip install xtuner<br><br># Train on a mixture of the Alpaca and Guanaco datasets<br>xtuner train internlm_7b_qlora_alpaca_enzh_oasst1_e3</pre><p>XTuner also fully decouples the handling of different dataset formats, based on the characteristics of LLM training data. This allows fine-tuning without disrupting the dialogue templates of Chat models.</p><p><strong>2. Multiple Training Engines</strong>: XTuner pioneers the integration of HuggingFace and OpenMMLab, balancing usability and configurability. It supports both MMEngine Runner and HuggingFace Trainer.
Thus, developers with deep customization needs can flexibly configure training according to their own preferences.</p><pre>pip install xtuner<br><br># To use MMEngine Runner for training<br>xtuner train internlm_7b_qlora_oasst1_e3<br><br># To use HuggingFace Trainer for training<br>xtuner train internlm_7b_qlora_oasst1_e3_hf</pre><p><strong>3. One-click Training Start</strong>: XTuner integrates standard procedures for incremental pre-training and single-round &amp; multi-round dialogue instruction tuning, so developers only need to focus on the data. Furthermore, XTuner incorporates technologies like QLoRA, DeepSpeed, and FSDP, offering solutions for varying model sizes under different hardware specifications. Just 8GB of GPU memory is enough to fine-tune a 7B model.</p><pre>pip install &#39;xtuner[deepspeed]&#39;<br><br># Fine-tuning a Llama2-7B model with only 8GB GPU memory<br>xtuner train llama2_7b_qlora_oasst1_512_e3 --deepspeed deepspeed_zero2</pre><p>Developers can focus on the data and leave the rest to XTuner, freeing up more energy to explore the vast universe of large models!</p><p>With XTuner, developers can add plugins to large models to supplement their capabilities, or even gain some skills exclusive to ChatGPT. XTuner provides a rich selection of large model plugins on the HuggingFace Hub. Feel free to download and try them out!</p><p><a href="https://huggingface.co/xtuner">https://huggingface.co/xtuner</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*4Rb4mPvfEMbal-A8" /></figure><p>Inspired by Pac-Man, the XTuner logo reflects a playful spirit. The development team and the open-source community aim to have fun exploring large models and developing a variety of entertaining applications. Your ideas are always welcome in the XTuner discussions, where you can find like-minded partners.
The XTuner team will continue to replicate popular community projects at a low cost, inviting more people to ride the wave of large models.</p><p><a href="https://github.com/InternLM/xtuner/discussions">https://github.com/InternLM/xtuner/discussions</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=757ddc9587c3" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The AI New Era: How Should Large Models “Rack Their Brains”]]></title>
            <link>https://openmmlab.medium.com/the-ai-new-era-how-should-large-models-rack-their-brains-1ee6c86545f1?source=rss-7b857bec476d------2</link>
            <guid isPermaLink="false">https://medium.com/p/1ee6c86545f1</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[llm]]></category>
            <dc:creator><![CDATA[OpenMMLab]]></dc:creator>
            <pubDate>Thu, 24 Aug 2023 11:01:37 GMT</pubDate>
            <atom:updated>2023-08-24T11:01:37.311Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/900/1*bjshERc4SjpePzL4W7sdFw.png" /></figure><blockquote>Welcome to follow the official <a href="https://twitter.com/OpenMMLab">OpenMMLab Twitter</a> account to stay updated with our latest news.</blockquote><p>Since the launch of ChatGPT, Large Language Models (LLMs) have been garnering significant attention in the AI field. However, despite their impressive capabilities, LLMs still face substantial challenges in handling complex multi-step reasoning tasks, such as mathematical applications and common-sense reasoning. This has led to the inclusion of more complex reasoning datasets like GSM8k and MATH in the evaluations of large models.</p><p>To address the shortcomings of LLMs in complex reasoning, researchers have been actively developing innovative techniques. Among these attempts, the “Chain-of-Thought Prompting” technique has gained special attention. This approach aims to guide the model in breaking down complex multi-step problems into more manageable intermediate steps, aiding the model in better understanding and solving problems accurately. Practical results have shown significant advancements in various reasoning tasks, especially arithmetic reasoning, through the application of Chain-of-Thought Prompting.</p><h3>What is Chain of Thought?</h3><p>Imagine a scenario where a teacher presents a challenging reasoning problem to a student named Tom: &quot;A cage on a farm holds chickens and rabbits, 36 animals in total. Together they have 100 legs. How many chickens and how many rabbits are there?&quot; Suppose Tom doesn&#39;t have paper and pen and must provide an answer directly. He attempts: &quot;There are 20 chickens and 16 rabbits.&quot; ❌ Unfortunately, there&#39;s an error in his mental calculation.</p><p>The teacher gives Tom a second chance, allowing him to solve the problem step by step using paper and pen.
Tom records the intermediate steps:</p><pre>Let x be the number of chickens, and y be the number of rabbits.<br>x + y = 36 (total number of animals is 36)<br>2x + 4y = 100 (total number of legs is 100)<br>Solve the first equation for one variable, for example, x = 36 - y.<br>Substitute the value of x into the second equation:<br>2(36 - y) + 4y = 100<br>72 - 2y + 4y = 100<br>2y = 28<br>y = 14<br>Substitute the value of y back into the first equation to find x:<br>x = 36 - 14<br>x = 22<br>Hence, the answer is:<br>There are 22 chickens and 14 rabbits. ✅</pre><p>Through step-by-step reasoning, Tom successfully arrives at the correct answer and earns the teacher’s approval :&gt;</p><p>Similarly, when using large language models, you can prompt them step by step, just like the teacher helped Tom, guiding the model to solve complex problems. This is the essence of the earliest Chain of Thought: few-shot Chain of Thought. It involves prompting the model with a few-shot example that includes intermediate reasoning steps before the answer. For example:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/940/0*3YISGB1QUshtnIa6" /></figure><h3>Few-Shot CoT: Unveiling the Chain of Thought</h3><p><strong><em>Chain of Thought Prompting Elicits Reasoning in Large Language Models.</em></strong></p><p>The early version of Chain of Thought is suitable for Few-Shot prompts. Compared to standard Few-Shot prompts, Chain of Thought Few-Shot prompts only add reasoning steps before the answer. For instance, an original prompt might look like:</p><p><strong>Sample problem + Answer + Actual problem</strong></p><p>The input to the model directly yields the problem’s answer. With the addition of Chain of Thought to the prompt, it becomes:</p><p><strong>Sample problem + Sample reasoning steps + Answer + Actual problem</strong></p><p>The <strong>sample reasoning steps</strong> are the <strong>Chain of Thought</strong> used to solve the original problem.
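</p><p>As a toy illustration (the worked example below is ours, not from the paper), such a prompt can be assembled from plain strings:</p>

```python
# Toy few-shot CoT prompt: sample problem + sample reasoning steps + answer,
# followed by the actual problem. Wording is illustrative only.

def few_shot_cot_prompt(sample_q, sample_steps, sample_a, actual_q):
    return (
        f"Q: {sample_q}\n"
        f"A: {sample_steps} So the answer is {sample_a}.\n\n"
        f"Q: {actual_q}\n"
        "A:"
    )

prompt = few_shot_cot_prompt(
    "Tom has 3 apples and buys 2 more. How many apples does he have?",
    "Tom starts with 3 apples. Buying 2 more gives 3 + 2 = 5.",
    "5",
    "A cage holds 36 chickens and rabbits with 100 legs in total. "
    "How many chickens are there?",
)
```

<p>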
This guides the model to generate intermediate steps before outputting the answer, breaking down the original problem into multiple sub-problems and aiding the model’s “<strong>thinking</strong>”. Not only does this method significantly enhance the model’s reasoning ability without requiring model modifications, but it also yields immediate effects. With the PaLM-540B model, Chain of Thought brought an almost threefold improvement over the traditional approach of boosting model performance through fine-tuning. Chain of Thought can be seen as opening new doors for enhancing large model reasoning.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/886/0*YT5tnOJlq7ZF4QIB" /></figure><p>So, are there any simpler ways to implement the Chain of Thought, and is there a Zero-Shot version of the Chain of Thought implementation? Indeed, there is.</p><h3>Zero-Shot CoT: Simple yet Effective</h3><p><strong><em>Large Language Models Are Zero-Shot Reasoners</em></strong></p><p>Perhaps the simplest CoT method:</p><p>You can implement a Zero-Shot CoT prompt by simply adding “<strong>Let’s think step by step</strong>” after the question, without requiring any additional samples. It clearly instructs the model to think through the problem step by step, enhancing its problem-solving capabilities:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/944/0*w_0NOZfIFSajxsCN" /></figure><p>This is a simple and effective way to enhance the model’s reasoning abilities, akin to how the teacher helped Tom by breaking down and solving the problem step by step.
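</p><p>In code, the whole technique is a single line of string formatting (a sketch; the example question is ours):</p>

```python
# Zero-shot CoT: append the trigger phrase to the question; no examples needed.

def zero_shot_cot(question):
    return f"Q: {question}\nA: Let's think step by step."

prompt = zero_shot_cot(
    "If a train travels 60 km in 45 minutes, what is its speed in km/h?"
)
```

<p>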
When using large models, this approach serves as a tool to prompt the model to decompose and answer complex questions.</p><p>On the <strong>MultiArith</strong> dataset, the paper experimented with various similar prompts, with varying results.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/601/0*FWqSN3tO3afL-shB" /></figure><p>Thus, in practice, when employing the Zero-Shot CoT approach for reasoning tasks, it’s essential to experiment with various prompts tailored to the dataset’s characteristics.</p><h3>Self-Consistency (SC): Multi-Path Reasoning + Voting Mechanism</h3><p><strong><em>Self-Consistency Improves Chain of Thought Reasoning in Language Models.</em></strong> The SC method, developed by the Google Brain team, enhances reasoning accuracy by generating multiple different reasoning paths for a given problem and voting for the most frequent answer among these paths.</p><p>In the SC method, for a single problem, multiple CoT results are generated. This is equivalent to having the model generate reasoning steps and answers multiple times. The final answer is obtained through majority voting. For example, if k = 3, the generated paths and answers could be 18, 18, and 26. Taking the majority yields 18.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*j_Io0vEQmNSnncf9" /></figure><p>This method excels in complex reasoning tasks; however, compared to the regular CoT approach, it requires more time and resources. Does this mean that the more samples are taken, the better the effect?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Xs1lHnW4GFcvfNr8" /></figure><p>According to experimental results from the paper, the performance improvement of the SC method starts to plateau when the sampling count “k” ranges from 20 to 40 in various reasoning datasets. In most cases, the datasets tend to saturate at around 40 samples. However, conducting 40 samples requires significant resource consumption.
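</p><p>The voting step itself is simple; a sketch mirroring the k = 3 example above:</p>

```python
from collections import Counter

# Self-Consistency: sample k reasoning paths, keep only each path's final
# answer, and return the most frequent one (majority vote).

def self_consistency_vote(answers):
    """Majority vote over the final answers of k sampled CoT paths."""
    return Counter(answers).most_common(1)[0][0]

winner = self_consistency_vote(["18", "18", "26"])  # majority answer: "18"
```

<p>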
Therefore, when using the SC method, it’s important to choose an appropriate sampling count based on the specific needs and available resources. This allows for a balance between effect enhancement and resource utilization.</p><h3>Tree-of-Thoughts (ToT): Multi-dimensional Thinking for Comprehensive Problem Solving</h3><p><strong><em>Tree of Thoughts: Deliberate Problem Solving with Large Language Models</em></strong></p><p>The ToT method differs from traditional CoT approaches: it allows models to consider multiple different reasoning paths concurrently, evaluating partial reasoning processes and making global choices through foresight or backtracking when necessary. This results in a tree-like reasoning structure, as depicted on the right side of the following diagram:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*SYOuq8LDaJjLwIL4" /></figure><p>Specifically, it comprises the following four stages:</p><ol><li><strong>Thought Decomposition</strong> Based on the nature of the problem, the problem is broken down into multiple intermediate steps. Each step could be a phrase, equation, or writing plan, depending on the problem’s characteristics.</li><li><strong>Thought Generation</strong> Assuming that solving the problem requires “k” steps, there are two methods to generate the reasoning content for each step:</li></ol><blockquote>Independent Sampling: For each state, the model independently samples “k” reasoning contents from a CoT prompt, without conditioning on the other sampled contents.</blockquote><blockquote>Sequential Generation: Sequentially using prompts to guide the generation of reasoning content step by step, where each reasoning content might depend on the previous one.</blockquote><p><strong>3. Heuristic Evaluation</strong> Heuristic methods are used to evaluate the contribution of each generated reasoning content to problem solving.
This self-assessment is based on the language model’s self-feedback, achieved by designing prompts that allow the model to score multiple generated results.</p><p><strong>4. Search Algorithm</strong> Based on the methods for generating and evaluating reasoning content, an appropriate search algorithm is chosen. For example, breadth-first search (BFS) or depth-first search (DFS) can be used to systematically explore the tree of thoughts, including foresight and backtracking.</p><p>Using the example of the 24-Game, where the task is to determine if four given integer values can be combined using +, -, ×, and ÷ operations to yield a result of 24, the ToT method can be applied as follows:</p><ol><li>It can be divided into three steps, with each step representing an intermediate equation. For example, given the numbers 4 9 10 13, the following three steps solve the problem:</li></ol><p>13 - 9 = 4 (left: 4 4 10);</p><p>10 - 4 = 6 (left: 4 6);</p><p>4 * 6 = 24 (left: 24)</p><p>2. At each step, multiple candidates are generated using a few-shot prompt method (as shown in (a) below).</p><p>3. Each candidate at each step is evaluated by the model, following the process described in (b). This involves evaluating whether the remaining numbers (for example, 10, 13, and 13) can possibly reach 24; such a state is judged “impossible”. The next step begins with candidates that have higher scores.</p><p>4. Steps 2 and 3 are carried out for each step to generate intermediate equations and perform evaluations.</p><p>5. A breadth-first search (BFS) is conducted to sample feasible solution paths (shown in green).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Fbc9Ol9rWIYAI4OG" /></figure><p>Regarding the effectiveness of ToT, taking the 24-Game as an example, when using the GPT-4 model as the base model, the performance of ToT is significantly superior to general CoT methods.
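</p><p>For intuition about the search space ToT navigates, the 24-Game decomposition above can be explored with plain exhaustive search standing in for model-proposed thoughts (our sketch, not the paper's method):</p>

```python
from itertools import permutations

# Exhaustive stand-in for ToT on the 24-Game: repeatedly combine two numbers
# with an arithmetic operation (one "intermediate equation" per step) until a
# single value remains, backtracking when a branch cannot reach 24.

OPS = (
    lambda a, b: a + b,
    lambda a, b: a - b,
    lambda a, b: a * b,
    lambda a, b: a / b if b else None,  # guard against division by zero
)

def solve24(nums):
    if len(nums) == 1:
        return abs(nums[0] - 24) < 1e-6
    for a, b, *rest in permutations(nums):
        for op in OPS:
            value = op(a, b)
            if value is not None and solve24([value, *rest]):
                return True
    return False

print(solve24([4, 9, 10, 13]))  # True: 13 - 9 = 4, 10 - 4 = 6, 4 * 6 = 24
print(solve24([10, 13, 13]))    # False: the "impossible" state above
```

<p>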
In this scenario, SC and Few-Shot CoT achieve less than 10% accuracy on the task, while ToT achieves an accuracy of 74%:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/803/0*eDIEYMi4gqUpUoix" /></figure><p>However, in terms of task usability, employing the ToT method requires familiarity with the task and the ability to break it down into logical and manageable steps. Furthermore, it necessitates the design of corresponding generation and evaluation methods for each step of the task. Finally, DFS or BFS techniques are used for sampling solutions. Additionally, the base model must be strong at following prompt instructions, like the GPT-4 model used by the paper’s authors. If you can meet these requirements, ToT could serve as a powerful tool for solving complex problems.</p><h3>Make your model even stronger</h3><p><strong>OpenCompass</strong> is a comprehensive evaluation platform for large models launched by the Shanghai Artificial Intelligence Laboratory. OpenCompass currently supports a range of CoT (Chain-of-Thought) techniques, including those mentioned earlier, from <strong>Zero-Shot CoT</strong> to <strong>Tree-of-Thoughts</strong>.</p><p>Leveraging OpenCompass’ extensive evaluation capabilities, you can effortlessly conduct diverse CoT evaluations on over 300,000 questions from 50+ datasets for more than 20 open-source large models and OpenAI API models.
Below is an example of testing the SC method on the GSM8k dataset using OpenCompass:</p><pre># Configuration for SC version of gsm8k test can be found in:<br># opencompass.configs.datasets.gsm8k.gsm8k_gen_a3e34a.py.<br><br>gsm8k_infer_cfg = dict(<br>    inferencer=dict(<br>        type=SCInferencer, # Replace GenInferencer with SCInferencer<br>        # Set generation parameters to ensure diverse model outputs, currently applicable for models loaded from HuggingFace.<br>        generation_kwargs=dict(do_sample=True, temperature=0.7, top_k=40),<br>        sc_size=SAMPLE_SIZE # Number of SC paths to sample<br>    )<br>)<br><br>gsm8k_eval_cfg = dict(sc_size=SAMPLE_SIZE)</pre><p>In addition to implementing these methods, OpenCompass has introduced some new features. For instance, while the <a href="https://github.com/princeton-nlp/tree-of-thought-llm/tree/master">official ToT repository</a> currently only supports OpenAI API models,<strong> OpenCompass extends this support to common open-source large models. This makes it easy to experiment with customizations and classic datasets across models of different scales and types.</strong></p><p>Below is a comparison of SC and ToT evaluation results obtained using OpenCompass:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*8FSBXdJ9pbknWvVu" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*IDXowVNdaqnpP1XC" /></figure><p>OpenCompass aims to integrate the powerful tool of CoT to help the community unlock the immense potential of large language models across various tasks.
With more researchers and practitioners joining the effort, AI technology is expected to become smarter, more efficient, and more practical in the future.</p><p>OpenCompass Project Link:</p><p><a href="https://github.com/internLM/OpenCompass/">https://github.com/internLM/OpenCompass/</a></p><p>CoT Tutorial: <a href="https://opencompass.readthedocs.io/zh_CN/latest/prompt/chain_of_thought.html">https://opencompass.readthedocs.io/zh_CN/latest/prompt/chain_of_thought.html</a></p><p>Large Model Leaderboard: <a href="https://opencompass.org.cn/leaderboard-llm">https://opencompass.org.cn/leaderboard-llm</a></p><p>Everyone is welcome to submit evaluation applications on OpenCompass.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=1ee6c86545f1" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Benchmarking the multi-modal capability of Bard with MMBench]]></title>
            <link>https://openmmlab.medium.com/benchmarking-the-multi-modal-capability-of-bard-with-mmbench-383bdf0913b9?source=rss-7b857bec476d------2</link>
            <guid isPermaLink="false">https://medium.com/p/383bdf0913b9</guid>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[mmbench]]></category>
            <category><![CDATA[bard-ai]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[OpenMMLab]]></dc:creator>
            <pubDate>Fri, 18 Aug 2023 02:57:24 GMT</pubDate>
            <atom:updated>2023-08-18T02:57:24.066Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/900/1*CjAQtf1OKGC69K-qTBUDkQ.jpeg" /></figure><p>In March 2023, Google launched Bard, a lightweight and optimized version of LaMDA based on Transformer. Similar to ChatGPT, Bard is a closed-source model that provides service to users via a web UI. In July 2023, Google announced the latest update of Bard, which is capable of processing image input. In order to provide an overview of Bard’s multi-modal ability, we evaluate it on the test split of MMBench and compare it with other state-of-the-art VLMs.</p><p>Project: <a href="https://opencompass.org.cn/MMBench">https://opencompass.org.cn/MMBench</a></p><h3>Evaluation Setting</h3><p>The test split of MMBench includes 1798 questions. During testing, we find that Bard refuses to process images containing human faces. For a fair comparison, we remove questions that Bard refuses to answer and discard questions that evaluate four human-related capabilities (Image Emotion, Identity Reasoning, Social Relation, and Action Recognition) in the test split. After filtering, we build a subset of 1226 samples covering 16 leaf ability dimensions.</p><h3>Quantitative Results</h3><p>We compare Bard with two state-of-the-art VLMs that perform well on MMBench, namely Shikra and Otter-I. The result is shown in the figure below. Bard attains an impressive overall accuracy of 51%, positioning itself among the top-tier VLMs proposed to date. Notably, Bard excels in answering questions that involve common sense reasoning. It achieves 62.3% accuracy on Nature Relation questions and 45.2% accuracy on Physical Relation questions, outperforming its counterparts, <em>e.g.</em>, Otter-I and Shikra, by a substantial margin. Meanwhile, an analysis reveals that Bard’s performance is comparatively lower in tasks requiring spatial perception, such as Spatial Relationship and Object Localization.
This observation aligns with expectations, considering that Shikra incorporates visual grounding tasks into its training data to enhance its localization capabilities, a facet potentially not integrated into Bard’s training process.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/989/0*6ME6cWW2oqKDW5B_" /></figure><h3>Qualitative Results</h3><p>To complement the quantitative analysis, we also provide some qualitative examples of Bard. Some good cases are demonstrated in the figure below. In the left-hand example, Bard adeptly processes intricate scenes, distills key information, and arrives at a reasonable conclusion. Notably, the majority of VLMs subjected to our testing fail to deliver the correct response to this particular question. In the right-hand example, Bard recognizes the correct concept from the cartoon, sidestepping any potential confusion arising from the harmonious interaction between a snake and a mouse. This highlights Bard’s exceptional common sense reasoning ability.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*SXQ3udu5Vy2EnX5I" /></figure><p>In the next figure, we present illustrative examples that highlight Bard’s performance shortcomings. These instances originate from both image style and image quality tasks. The former requires the model to discern image categories, while the latter involves assessing visual attributes, such as brightness, across a pair of images. A characteristic shared by these tasks is that the image content is largely irrelevant to the task’s objective. Bard performs poorly on these two tasks, achieving 50% and 7% accuracy respectively. 
The accompanying tables within these cases visually demonstrate Bard’s tendency to excessively focus on semantic concepts and depicted objects within the provided text and image, leaving it struggling to effectively address inquiries regarding holistic styles and attributes.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*H08y6yG73nqhHnW0" /></figure><h3>Bard Provides well-structured Responses</h3><p>Last but not least, in all the aforementioned examples, Bard consistently delivers well-structured responses, frequently utilizing bullet-point lists and tables to enhance clarity. Moreover, across a majority of the questions, Bard adheres to a consistent response format: presenting the predicted option initially, subsequently offering a comprehensive rationale, and culminating by enumerating the reasons for the incorrectness of alternative choices. From the perspective of being a chatbot, Bard undeniably stands out as one of the most exceptional multi-modal chatbots.</p><h3>A Brief Intro of MMBench</h3><p>MMBench is a multi-modality benchmark released in early July 2023, which includes <strong>~3000 </strong>multiple choice questions to evaluate over <strong>20 </strong>different multi-modal capabilities. Since the benchmark release, more than <strong>200 </strong>submissions have been received, and the leaderboard has covered <strong>15 VLMs</strong> till now. More information:</p><p>Project: <a href="https://opencompass.org.cn/MMBench">https://opencompass.org.cn/MMBench</a></p><p>Paper: <a href="https://arxiv.org/pdf/2307.06281.pdf">https://arxiv.org/pdf/2307.06281.pdf</a></p><p>Codebase: <a href="https://github.com/InternLM/opencompass">https://github.com/InternLM/opencompass</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=383bdf0913b9" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Faster and More Efficient 4-bit quantized LLM Model Inference]]></title>
            <link>https://openmmlab.medium.com/faster-and-more-efficient-4-bit-quantized-llm-model-inference-a27d35a66c29?source=rss-7b857bec476d------2</link>
            <guid isPermaLink="false">https://medium.com/p/a27d35a66c29</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[deployment]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[ai]]></category>
            <dc:creator><![CDATA[OpenMMLab]]></dc:creator>
            <pubDate>Wed, 16 Aug 2023 03:53:05 GMT</pubDate>
            <atom:updated>2023-08-16T03:53:05.040Z</atom:updated>
            <content:encoded><![CDATA[<p><a href="https://github.com/InternLM/lmdeploy">LMDeploy</a> has released an exciting new feature: 4-bit quantization and inference. This not only trims down the model’s memory overhead to just 40% of what FP16 inference would take, but more importantly, thanks to extremely optimized kernels, inference performance has not been compromised. On the contrary, it is more than three times the speed of FP16 inference on a GeForce RTX 4090.</p><p>We conducted benchmarks on both Llama-2-7B-chat and Llama-2-13B-chat models, using 4-bit quantization and FP16 precision respectively. The throughput for generating completion tokens was measured by setting a single prompt token and generating 512 tokens in response. All results were measured for single-batch inference.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*9oWswTi8uI8-WHch" /></figure><p>As shown in the diagram, 4-bit inference with LMDeploy is 3.16 times faster than FP16 inference, and it outperforms other notable competitors by a margin of around 30% to 80%.</p><p>As for memory overhead, we tested scenarios with context window sizes of 1024, 2048 and 4096 respectively. 
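As a rough, weights-only sanity check on these figures (a back-of-the-envelope sketch; real usage also includes the k/v cache and activations, which is why the article quotes an overall ratio of 40% rather than the 25% seen here):

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory taken by the model weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 1024**3

fp16_gib = weight_memory_gib(7e9, 16)  # Llama-2-7B weights at FP16: ~13 GiB
w4_gib = weight_memory_gib(7e9, 4)     # the same weights at 4-bit: ~3.3 GiB
```

The 4-bit weights alone are a quarter of the FP16 footprint; the remaining overhead comes from the k/v cache, which grows with the context window sizes tested below.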
The 4-bit 7B model can easily be accommodated by a single GeForce RTX 3060.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*4WZstjPzpKw7yYrs" /></figure><p>For more detailed test results, please refer to the <a href="https://aicarrier.feishu.cn/docx/EwTpdiwloopogSxhnlycjQjLnEh#FnSAdijWbovRIbxCEZNcGRjxnsd">benchmark</a> section of this article.</p><h3>Quick Start</h3><h3>Installation</h3><p>The minimum requirement for performing 4-bit LLM model inference with LMDeploy on NVIDIA graphics cards is compute capability sm80 or above, which includes models such as the A10, A100, and GeForce RTX 30/40 series.</p><p>Before proceeding with the inference, please ensure that lmdeploy (&gt;=v0.0.5) is installed.</p><pre>pip install lmdeploy</pre><h3>Get 4-bit quantized model</h3><p>You can visit <a href="https://huggingface.co/lmdeploy">LMDeploy’s model zoo</a> to download pre-quantized 4-bit models.</p><pre>git-lfs install<br>git clone https://huggingface.co/lmdeploy/llama2-chat-7b-w4</pre><p>Alternatively, you can quantize the model weights to 4-bit by following the instructions presented in <a href="https://github.com/InternLM/lmdeploy#quantization">this</a> guide.</p><h3>Inference</h3><pre>## Convert the model&#39;s layout and store it in the default path, ./workspace.<br>python3 -m lmdeploy.serve.turbomind.deploy \<br>    --model-name llama2 \<br>    --model-path ./llama2-chat-7b-w4 \<br>    --model-format awq \<br>    --group-size 128<br><br>## inference<br>python3 -m lmdeploy.turbomind.chat ./workspace</pre><h3>Serve with gradio</h3><p>If you wish to interact with the model via a web UI, please launch the Gradio server as shown below:</p><pre>python3 -m lmdeploy.serve.gradio.app ./workspace --server-ip {ip_addr} --server-port {port}</pre><p>Subsequently, you can open the website http://{ip_addr}:{port} in your browser and interact with the model.</p><h3>Benchmark</h3><p>LMDeploy uses two evaluation metrics to measure the performance of the inference API, namely 
<strong>completion token throughput </strong>(also known as output token throughput) and <strong>request throughput</strong>. The former tests the speed of generating new tokens given a specified number of prompt and completion tokens, while the latter measures the number of requests processed per minute on real dialogue data.</p><h3>Completion token throughput</h3><p>We used lmdeploy’s <a href="https://github.com/InternLM/lmdeploy/blob/main/benchmark/profile_generation.py">profile_generation.py</a> to test the token generation throughput and memory usage of the 4-bit and 16-bit Llama-2-7B-chat models at different batch sizes on an A100-80G. The number of prompt tokens and completion tokens were set to 1 and 512, respectively.</p><p>The comparison results for throughput and GPU memory usage are as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*FkAV2GFrcWmIe8EL" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*NJCq5wEmV7v2xCLC" /></figure><h3>Request throughput</h3><p>We also tested the request throughput of the 4-bit and 16-bit Llama-2-7B-chat models with lmdeploy’s <a href="https://github.com/InternLM/lmdeploy/blob/main/benchmark/profile_throughput.py">profile_throughput.py</a> on an A100-80G GPU. The comparison results are as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*gd2I01EoOKs8B7iv" /></figure><h3>More</h3><p>In addition to 4-bit quantization, LMDeploy also supports int8 quantization for the k/v cache. We believe that the combination of the two will further improve inference performance. A more detailed performance evaluation will be reported soon. 
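As an aside, the two metrics defined above reduce to simple ratios (an illustrative sketch only, not lmdeploy's actual profiling code):

```python
def completion_token_throughput(completion_tokens: int, elapsed_s: float) -> float:
    """Output tokens generated per second, for a fixed prompt/completion length."""
    return completion_tokens / elapsed_s

def request_throughput(requests_done: int, elapsed_s: float) -> float:
    """Requests processed per minute on a dialogue workload."""
    return requests_done / elapsed_s * 60
```

For example, generating 512 completion tokens in 4 seconds corresponds to 128 tokens/s; the real scripts additionally handle warm-up, batching, and GPU memory sampling.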
Welcome to follow <a href="https://github.com/InternLM/lmdeploy">https://github.com/InternLM/lmdeploy</a> to stay up-to-date with the latest news!</p><p>The more you star, the more you get :)</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a27d35a66c29" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Thoroughly evaluate AX620A from the perspective of security industry]]></title>
            <link>https://openmmlab.medium.com/thoroughly-evaluate-ax620a-from-the-perspective-of-security-industry-d12c3175516a?source=rss-7b857bec476d------2</link>
            <guid isPermaLink="false">https://medium.com/p/d12c3175516a</guid>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[chips]]></category>
            <dc:creator><![CDATA[OpenMMLab]]></dc:creator>
            <pubDate>Thu, 10 Aug 2023 10:48:19 GMT</pubDate>
            <atom:updated>2023-08-10T11:00:04.116Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/proxy/0*VHuzwL6x3JSWRuoF" /></figure><p>We’ll thoroughly evaluate AX620A from the perspective of the security and defense business as a third-party <a href="https://platform.openmmlab.com/deploee">Deploee</a>. We’ll test and assess AX620A’s performance in various dimensions such as model design, inference, and SDK. We will also provide corresponding test source code and running logs. We hope this information will provide valuable insights for chip selection.</p><p><a href="https://platform.openmmlab.com/deploee">OpenMMLab Platform Link</a></p><h3>Product Introduction</h3><p>AX620A is the second-generation visual chip launched by AXERA, configured with a 4-core Cortex-A7 CPU and 3.6Tops@int8 NPU computing power.</p><p>In int4 mode, AX620A’s computing power rises to 14.4 Tops. However, int4 imposes specific requirements on model design, and the official open-source documentation does not clarify its usage, so we did not consider this part in this test.</p><p>Pure computing units cannot be tested practically. Therefore, we chose the Maix-III AXera-Pi, a development board designed by sipeed based on the AX620A chip. As of August 7, 2023, the retail price of the core board is less than $40. 
The single board (core board + baseboard) comes with a USB 3.0 baseboard, core board, Wi-Fi module, Ethernet interface, camera, and a 5-inch display screen, making it fully functional and convenient to work with.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*8xF4IopG1-m7YJ5d" /></figure><p>After acquiring the device, we first tested its CPU performance using megpeak, a testing tool that measures peak computational performance and supports ARM, x86, and OpenCL architectures.</p><p>In this test, we modified CMakeLists.txt; the gcc compilation parameters and their explanations are as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RZKi_OBSsVVwWfjyuCKLZw.png" /></figure><p>Here’s what we got:</p><pre>there are 4 cores, currently use core id :0<br><br>bandwidth: 1.861453 Gbps<br>padal throughput: 5.411672 ns 0.739143 GFlops latency: 6.405562 ns :<br>padd throughput: 1.509807 ns 2.649345 GFlops latency: 5.024015 ns :<br>mla_s32 throughput: 5.275521 ns 1.516438 GFlops latency: 5.290761 ns :<br>mlal_s8 throughput: 2.923057 ns 5.473721 GFlops latency: 5.025521 ns :<br>mlal_s16 throughput: 2.770953 ns 2.887093 GFlops latency: 5.106042 ns :<br>mlal_s16_lane throughput: 2.765276 ns 2.893020 GFlops latency: 5.027750 ns :<br>mla_f32 throughput: 5.354490 ns 1.494073 GFlops latency: 10.047442 ns :<br>mul_s32 throughput: 5.393568 ns 0.741624 GFlops latency: 5.274667 ns :<br>mul_f32 throughput: 5.387370 ns 0.742477 GFlops latency: 5.398443 ns :<br>cvt throughput: 5.377729 ns 0.743808 GFlops latency: 5.352896 ns :<br>qrdmulh throughput: 5.275443 ns 0.758230 GFlops latency: 5.353959 ns :<br>rshl throughput: 2.763766 ns 1.447301 GFlops latency: 5.023833 ns :</pre><p>The results we obtained are:</p><ul><li>The actual memory bandwidth is 1.86 Gbps. 
The actual memory provided by the device is 1.2 GB</li><li>For int8 multiplication, the 4-core CPU can provide a total of 22 GFLOPS of computing power, 3.7 times that of fp32 multiplication</li></ul><p>Users can estimate the execution time of image processing code based on these data (assuming memory movement and computation are well optimized and do not interfere with each other), such as dividing the amount of computation by 1.49 GFLOPS if there is a significant amount of fp32 computation, or dividing by the memory bandwidth if the main operation is memory copying.</p><p>Next, we begin evaluating NPU performance. AX620A supports 1_1 mode, where half of the computing power is used for night vision enhancement and the other half for AI computation. Considering that security scenarios usually require enhanced image quality at night, all our subsequent tests enable 1_1 mode. If you need to compare QPS with other similar chips, AX620A’s results need to be multiplied by 2.</p><h3>Single Operator Test</h3><p>AX620A’s model conversion tool is called pulsar, a Python script inside a Docker image. Currently, pulsar supports 46 kinds of onnx operators. The complete support list can be found here: <a href="https://pulsar-docs.readthedocs.io/zh_CN/latest/appendix/op_support_list.html">onnx support list</a>.</p><h3>Testing Process</h3><p>Although we cannot test every operator, just as GEMM can be broken down into multiple GEPP/GEBP calls, operators can also be broken down into basic operations. 
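The execution-time estimation rule just described can be sketched as follows (illustrative only, using the 1.49 GFLOPS fp32 figure and the 1.86 Gbps bandwidth measured with megpeak above):

```python
def estimate_runtime_s(flops: float = 0.0, bytes_moved: float = 0.0,
                       gflops: float = 1.49, bandwidth_gbps: float = 1.86) -> float:
    """Rough lower bound on runtime: the slower of compute and memory traffic.

    Assumes, as stated in the text, that computation and memory movement are
    well optimized and do not interfere with each other.
    """
    compute_s = flops / (gflops * 1e9)
    memory_s = bytes_moved * 8 / (bandwidth_gbps * 1e9)  # bandwidth is in gigabits/s
    return max(compute_s, memory_s)
```

A compute-heavy fp32 kernel is bounded by the first term; a plain memory copy is bounded by the second.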
Based on the calculation process, we classified these 46 onnx operators into 9 categories and selected one from each category for testing.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*vISPZ0VMLtDjxsbB4Xay9w.png" /></figure><p>Next, we used torch2onnx to generate corresponding onnx models for these 9 operators.</p><p>Because operators have numerous parameters, to avoid excessively long test times, we fixed the input shape of conv at 224x224, while the other models were tested at three different scales: 112x112, 384x256, and 1080x1920. Eventually, we obtained 170 single-operator onnx models. The torch code to generate these models can be found here: <a href="https://github.com/tpoisonooo/deploee-benchmark/tree/main/operator">github link</a>.</p><p>Then, we used pulsar to convert these onnx operators. We successfully converted 139 models, while 28 models produced clear error logs during the conversion process, including one softmax conversion that got stuck.</p><p>Finally, for the models that ran successfully, we measured each operator’s energy efficiency ratio, an indicator of the number of MACs (multiply-accumulates) completed per microsecond. 
We sorted by energy efficiency ratio, and here are some of the results:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ZIsCmsg4F7UIJQCcprRctA.png" /></figure><h3>Result Analysis</h3><p>The test results are consistent with the official Efficient Operator <a href="https://pulsar-docs.readthedocs.io/zh_CN/latest/appendix/efficient_op_design_guides.html">Design Guidelines</a>, and we observed further phenomena:</p><ul><li>When stride=2, the efficiency of conv7x7 surpassed conv3x3</li><li>Using conv1x1 instead of gemm can make better use of the NPU</li><li>The overhead of binary operations like add is not low</li></ul><p>The complete test results, including the conversion process logs, statistical tables, and execution scripts, can be found here: <a href="https://github.com/tpoisonooo/deploee-benchmark/blob/main/ax620a/opr_test.md">https://github.com/tpoisonooo/deploee-benchmark/blob/main/ax620a/opr_test.md</a></p><p>Users can adjust model parameters according to the mac_util and efficiency columns in the table.</p><h3>Model Testing</h3><p>The hardware model library contains 640 onnx models converted from OpenMMLab algorithms. These models cover various tasks such as 3D detection, segmentation, key point recognition, OCR, etc., making them very suitable for testing the completeness of a vision chip’s software stack.</p><p>Since AX620A does not support dynamic shapes, it cannot run models like mmdet3d-voxel, mmaction, and LLaMA. Therefore, we selected 318 fixed-input-size onnx models for testing.</p><p>However, some models failed to convert because certain operators are not implemented. Here are the cases where many operators are missing:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/767/1*IiE9H8XJGMlDtWzO9KE0UQ.png" /></figure><p>In the end, we successfully converted 60 models. 
Now let’s take a look at the running time of the resnet series models on AX620A:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/768/1*yMtLEEEYNgJv-yBRAk62-w.png" /></figure><p>Compared to resnet50’s runtime of over 100 ms on the Jetson Nano, the AXera-Pi is 5 times faster, and its retail price is only half that of the former, showing clear cost-effectiveness.</p><p>The onnx models used for testing can be searched and downloaded in the hardware model library. Execution logs and results have been published at the following link: <a href="https://github.com/tpoisonooo/deploee-benchmark/blob/main/ax620a/model_test.md">https://github.com/tpoisonooo/deploee-benchmark/blob/main/ax620a/model_test.md</a></p><h3>SDK Evaluation</h3><p>A visual SDK is usually composed of multiple pipelines. Its inputs are images or videos, and its outputs are structured data. Since image decoding is often not a performance bottleneck, when developing a visual SDK we need to consider video decoding, image operations, and compatibility with the pipeline.</p><h3>Decoding</h3><p>In the security field, the most commonly used video formats are h.264 and h.265, mainly with the main or high profile and a standard size of 1088x1920. Although the video frame rate used in our tests is 60 fps, this does not affect our final conclusion.</p><pre>$ ffmpeg -i 1088x1920.h264<br>..<br>    Stream #0:0: Video: h264 (High 10), yuv420p10le(progressive), 1920x1088 [SAR 136:135 DAR 16:9], 57 fps, 59.94 tbr, 1200k tbn, 119.88 tbc</pre><p>We made slight modifications to the ax-pipeline source code to test the peak decoding speed in different situations (such as output scaling, cropping, flipping, etc.). This is because, in practical business, the width and height of the video are not fixed. In some chip implementations, correcting the video output may decrease decoding speed, while using smaller video sizes can speed up decoding. 
Therefore, we need to take these situations into account when testing decoding speed. Here are the test results we obtained on the AX620A:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/784/1*WE5JGM10TJshaxtq6jEEoA.png" /></figure><p>As the results show, the video decoding speed of AX620A is almost stable at 60 fps and is not affected by image processing operations.</p><h3>Image Processing Support</h3><p>The second factor affecting pipeline throughput is image processing speed. Take common face recognition as an example. Before recognition, the face image must be rectified, so whether the CPU and NPU can efficiently perform a perspective transformation may determine the maximum throughput of the pipeline. The following table lists the image processing operators provided by AX620A in IVPS (Video Image Processing Subsystem). Users can consult the complete documentation at the following link: <a href="https://github.com/sipeed/axpi_bsp_sdk/tree/main/docs">https://github.com/sipeed/axpi_bsp_sdk/tree/main/docs</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/777/1*Cqw6HTH2SrAUpB1ux8m8Xg.png" /></figure><p>Because WarpPerspective and TopK are missing, deploying a complete face recognition pipeline may require adjusting the image rectification and feature engineering implementations.</p><h3>Conclusion</h3><p>From the perspective of the security industry, we have conducted a comprehensive test of AX620A, covering evaluation dimensions such as CPU, operators, models, and SDK, and have provided detailed run logs and scripts.</p><p>Based on the experimental results, we have reached the following conclusions:</p><ol><li>Cost Performance: ★★★★★</li></ol><p>AX620A has strong performance. 
Compared with its contemporary, the Jetson Nano, it achieves more than a 5-fold performance improvement at less than half the retail price, while using only 50% of its own computing power.</p><p>2. Usability: ★★★★☆</p><p>Pulsar is Docker-based, so users can run it directly without complicated installation and configuration. At the same time, AX620A provides complete samples and documentation, offering good support for users. However, the lack of chip architecture documentation makes AX620A slightly deficient in transparency.</p><p>3. Model Compatibility: ★★★☆☆</p><p>AX620A has limited support for dynamic shapes; a few models cannot be converted within 2 hours, reflecting room for improvement in compatibility.</p><p>4. CV Operators and Decoding Support: ★★★☆☆</p><p>AX620A meets basic computer vision operator and decoding needs, but the API has usage restrictions.</p><p>If AX620A can further optimize pulsar, improve model compatibility, etc., it has the potential to become an excellent vision NPU chip. We have learned from industry channels that these issues have been resolved in AXERA’s third-generation vision chip AX650N, and we are looking forward to its performance.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=d12c3175516a" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Deploy Llama-2 models easily with LMDeploy!]]></title>
            <link>https://openmmlab.medium.com/deploy-llama-2-models-easily-with-lmdeploy-1cb001d70290?source=rss-7b857bec476d------2</link>
            <guid isPermaLink="false">https://medium.com/p/1cb001d70290</guid>
            <dc:creator><![CDATA[OpenMMLab]]></dc:creator>
            <pubDate>Wed, 02 Aug 2023 11:09:57 GMT</pubDate>
            <atom:updated>2023-08-02T11:09:57.018Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/900/0*lTrvsc3gzGBzgeDy" /></figure><p>This article will guide you on how to quickly deploy the Llama-2 models with <a href="https://github.com/InternLM/lmdeploy">LMDeploy</a>.</p><p>There are 3 types of Llama-2 models that have been open-sourced so far: 7B, 13B, and 70B. Compared to Llama-1, the 7B and 13B structures remain unchanged, while the 70B adjusts the structure, replacing Multi-Head Attention with Grouped-Query Attention. Overall, it’s not too difficult to implement. Let’s get started!</p><h3>LMDeploy’s Journey with Llama-2</h3><h3>Getting Started: 7B/13B</h3><p>Meta provides Llama-2 7B and 13B conversation models with a context window size of 4096. As they have the same structure as Llama, all we need to do is add the Llama-2 chat template to LMDeploy.</p><p>Tip: LMDeploy can deploy any language model with the same structure as Llama or Llama-2. Feel free to submit PRs adding their chat templates to LMDeploy :)</p><p>The installation of LMDeploy is very simple:</p><pre>pip install lmdeploy</pre><p>By following the steps below, you will be able to interact with it via the command line:</p><pre>python3 -m lmdeploy.serve.turbomind.deploy llama2 //the/path/of/llama-2-chat-7b-hf<br>python3 -m lmdeploy.turbomind.chat ./workspace</pre><p>Launch the Triton inference server to serve the model:</p><pre>tritonserver --model-repository=./workspace/model_repository/ --allow-grpc=1 --grpc-port=33337</pre><p>If you want to use the web UI chat window, you can do the following:</p><pre>python3 -m lmdeploy.app {tritonserver_ip_addr}:33337</pre><p>Open the webpage <a href="https://localhost:6006/">https://localhost:6006</a> in your browser, and you can chat with the AI assistant online.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*FaIAz2s73DrwSLOW" /></figure><p>LMDeploy has outstanding inference performance, outperforming similar open-source projects in output token 
throughput and request throughput metrics. Among them, output token throughput measures the token generation speed under fixed input and output token counts, while request throughput tests the number of requests processed per minute in real conversation scenarios.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*kbYQp7tZx8hCmV7Q" /></figure><p>The above diagram shows output token throughput when the input and output tokens are (2048, 2048). It can be concluded that LMDeploy is about 5% to 15% higher than DeepSpeed overall and outperforms the official Facebook Llama-2 inference by up to 5x.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/854/0*_etv1XNXwXKRNGMp" /></figure><p>In terms of request throughput, LMDeploy is about 30% higher than vLLM.</p><h3>Advancing: 70B</h3><p>Llama-2 70B uses GQA (Grouped-Query Attention). As shown in the following diagram, GQA divides query heads into groups, each of which shares a single key head and value head. When the number of groups equals the number of query heads, it becomes MHA (Multi-Head Attention). When the number of groups is 1, it is MQA (Multi-Query Attention).</p><p>According to the literature, GQA is close to MHA in terms of model capability while being as efficient as MQA in terms of inference speed.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/813/0*hfTjRf_UAODMYaI-" /></figure><p>Auto-regressive models using the MHA structure maintain a large k/v cache during inference. Its memory overhead formula is:</p><pre>batch * max_seq_len * n_heads * head_dim * sizeof(half) * 2</pre><p>While for GQA, the formula becomes:</p><pre>batch * max_seq_len * n_kv_heads * head_dim * sizeof(half) * 2</pre><p>n_heads / n_kv_heads is the size of the group. As you can see, using GQA can reduce the k/v cache to 1/group of MHA. 
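The two formulas can be compared directly (a sketch using Llama-2-70B's published configuration: 64 query heads, 8 k/v heads, i.e. group size 8, with head_dim 128):

```python
def kv_cache_bytes(batch: int, max_seq_len: int, n_kv_heads: int,
                   head_dim: int, elem_bytes: int = 2) -> int:
    """k/v cache size in bytes; the final * 2 counts both keys and values."""
    return batch * max_seq_len * n_kv_heads * head_dim * elem_bytes * 2

# Llama-2-70B: 64 query heads but only 8 k/v heads (group size 8), head_dim 128
mha = kv_cache_bytes(batch=8, max_seq_len=4096, n_kv_heads=64, head_dim=128)
gqa = kv_cache_bytes(batch=8, max_seq_len=4096, n_kv_heads=8, head_dim=128)
```

With these settings GQA needs exactly 1/8 the cache of MHA, matching the 1/group reduction above.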
This is very beneficial for attention, which is a memory-intensive computation.</p><p>LMDeploy has implemented GQA and supports tensor parallelism. The deployment method is similar to that of 7B. You just need to set the tensor parallel parameter to 8 when converting the model structure. For more details, please refer to <a href="https://github.com/InternLM/lmdeploy/blob/main/docs/zh_cn/serving.md">serving</a>.</p><h3>LMDeploy’s Special Features</h3><h3>Interactive Mode Inference: No More Paying for Conversation History</h3><p>In multi-turn conversation scenarios, most inference engines require users to send the prompt as well as the past conversation history to the server. This means that users have to pay for the history in each round of the conversation. LMDeploy can cache all attention k/v of the conversation, thus avoiding repetitive processing of historical conversations. We call this procedure interactive mode, which can greatly reduce the latency of generating the first token, especially for long conversation histories.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*9NHUZi_J3N-Yni7R" /></figure><h3>Persistent Batch: The key to high throughput</h3><p>LMDeploy models the inference of a conversational LLM as a persistently running batch whose lifetime spans the entire serving process. To put it simply:</p><ul><li>The persistent batch has N pre-configured batch slots.</li><li>Requests join the batch when there are free slots available. A batch slot is released and can be reused once the generation of the requested tokens is finished.</li><li>The batch grows or shrinks automatically to minimize unnecessary computations.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/640/1*KdOc9uamQ9Ws7C3VL2mWcw.gif" /></figure><h3>Conclusion</h3><p>Other exciting features of LMDeploy are still under intense development. 
Welcome to follow our project at <a href="https://github.com/InternLM/lmdeploy">https://github.com/InternLM/lmdeploy</a> for the latest updates!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=1cb001d70290" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[It’s 2023. Is PyTorch’s FSDP the best choice for training large models?]]></title>
            <link>https://openmmlab.medium.com/its-2023-is-pytorch-s-fsdp-the-best-choice-for-training-large-models-fe8d2848832f?source=rss-7b857bec476d------2</link>
            <guid isPermaLink="false">https://medium.com/p/fe8d2848832f</guid>
            <dc:creator><![CDATA[OpenMMLab]]></dc:creator>
            <pubDate>Tue, 01 Aug 2023 09:03:47 GMT</pubDate>
            <atom:updated>2023-08-01T09:28:54.649Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/900/1*YSYEALyE_fAxjgp2UzY0yA.jpeg" /></figure><p>The wave of large model training initiated by ChatGPT has made many eager to try their hand at training large models. When looking for training baselines, you’ve surely noticed that the codebase for training large models tends to use frameworks like <a href="https://www.deepspeed.ai/">DeepSpeed</a> (<a href="https://github.com/open-mmlab/mmengine">MMEngine v0.8.0 </a>also supports it, allowing one-click switching for convenience!) or ColossalAI (MMEngine will support it in the next version!), with scant regard for PyTorch’s native FSDP (FullyShardedDataParallel). But why is this? Is FSDP not memory-efficient enough? Is it too slow for training? Or is it simply inconvenient to use? Read on, and I’m sure you’ll gain some insights.</p><h3>Background of FSDP</h3><p>FSDP’s implementation was inspired by FairScale. When developing large features, PyTorch typically creates a new library to provide some experimental support and collect user feedback, such as FairScale, Dynamo (the cornerstone of PyTorch 2.0), and torchdistx. Once the feature becomes more mature, it may be incorporated into PyTorch. Compared to the brief introduction of FSDP in PyTorch’s official tutorial, FairScale has done a much better job. Before we start the introduction, here is an introduction by FairScale, and it’s worth considering: do you really need FSDP? (This is also true for other large-scale training frameworks)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/731/0*lJPXccwMtoDkrkGd" /></figure><h3>Introduction to the ZeRO Series</h3><p>Having seen the above figure, you’ll notice that FairScale defines FSDP as ZeRO3. 
Considering that some may not be familiar with the ZeRO series of large model optimization strategies, let me give a brief introduction:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/988/0*NYSIXYtu-KIEiCae" /></figure><p>During model training, memory usage can be largely divided into four parts: activation values, model weights, gradients, and optimizer states. For vision models, activation values take up most of the memory, so mixed-precision (fp16) training can significantly reduce memory usage. However, for large language models or multimodal models, optimizing the memory usage of the latter three becomes more important.</p><p>Taking PyTorch as an example, when you use DistributedDataParallel, it allocates memory for model parameters, gradients, and optimizer states in each process and synchronously updates this data during training. Although this approach can speed up training through data parallelism, its memory allocation strategy is evidently poor. Since the parameters in each process are the same, why should each process save the complete set of parameters? Thus, ZeRO advocates that each process should only save a part of the parameters, gathering them across processes when needed. ZeRO has three stages of optimization strategies:</p><p>ZeRO1: Sharding only the optimizer state</p><p>ZeRO2: Sharding both the optimizer state and gradients</p><p>ZeRO3: Sharding optimizer state, gradients, and model parameters</p><p>Taking a model with 7.5B (φ) parameters as an example, let’s briefly calculate the memory usage of model parameters, gradients, and optimizer states:</p><p><strong>fp32 training:</strong></p><p>The model parameter size is φ, the gradient size is also φ, and in the case of using Adam, the optimizer state is 2φ. 
If it’s standard fp32 training, then the actual memory used is (1 + 1 + 2)φ * 4 = 16φ bytes (each fp32 value occupies 4 bytes).</p><p><strong>fp16 training:</strong></p><p>If mixed-precision training is enabled, to ensure the precision of parameter updates, the optimizer state needs to remain in fp32, and an additional fp32 copy of the model parameters needs to be stored. Therefore, memory usage is 2φ (model parameters) + 2φ (model gradients) + 8φ (optimizer state) + 4φ (fp32 copy of the model parameters kept by the optimizer in the DeepSpeed implementation) = 16φ bytes.</p><p>From this perspective, it’s clear why the memory usage of a 7.5B model can be as high as 120GB, and why the ZeRO series is so effective.</p><h3>FSDP — ZeRO3?</h3><p>Returning to the main topic, FairScale says that FSDP is equivalent to ZeRO3’s optimization. Let’s understand this through a simple example (in this example, the optimizer is SGD because PyTorch’s Adam has been heavily optimized and its actual memory usage is much higher than theoretical). 
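As a sanity check, the 16φ-byte accounting above can be reproduced in a few lines of plain Python. The helper names below are mine, a back-of-the-envelope sketch following this post's accounting rather than any library API:

```python
def fp32_training_bytes(num_params: int) -> int:
    """Plain fp32 + Adam: params (1x) + grads (1x) + Adam state (2x),
    each value stored as fp32 (4 bytes) -> (1 + 1 + 2) * 4 = 16 bytes/param."""
    return (1 + 1 + 2) * 4 * num_params

def zero3_training_bytes(num_params: int, world_size: int) -> int:
    """ZeRO3 shards parameters, gradients, and optimizer state evenly."""
    return fp32_training_bytes(num_params) // world_size

phi = 7_500_000_000  # the 7.5B-parameter example model
print(fp32_training_bytes(phi) / 1e9)      # 120.0 -> ~120 GB on a single device
print(zero3_training_bytes(phi, 8) / 1e9)  # 15.0  -> ~15 GB per rank on 8 GPUs
```

The mixed-precision breakdown (2φ + 2φ + 8φ + 4φ) lands on the same 16 bytes per parameter, which is why sharding, rather than lower precision alone, is what makes a 7.5B model trainable on ordinary GPUs.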
Before the formal comparison, let’s look at single-device fp32 training, single-device fp16 training, and DDP fp16 training:</p><h3>Single device fp16 + fp32</h3><pre>class Layer(nn.Module):<br>    def __init__(self):<br>        super().__init__()<br>        self.linear = nn.Sequential(<br>            *(nn.Linear(10000, 10000) for _ in range(10))<br>        )<br><br>    def forward(self, x):<br>        return self.linear(x)<br><br>def test_fp32():<br>    model = Layer().cuda()<br>    optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9)<br>    data = torch.ones(10000).cuda()<br>    for i in range(10):<br>        optimizer.zero_grad()<br>        output = model(data)<br>        loss = output.sum()<br>        loss.backward()<br>        optimizer.step()<br>        memory = max_memory_allocated()<br>        print(f&#39;step memory allocate: {memory / 1e9:.3f}G&#39;)<br><br>def test_fp16():<br>    torch.cuda.init()<br>    model = Layer().cuda()<br>    optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9)<br>    data = torch.ones(10000).cuda()<br>    for _ in range(10):<br>        with autocast(device_type=&#39;cuda&#39;):<br>            optimizer.zero_grad()<br>            output = model(data)<br>            loss = output.sum()<br>            loss.backward()<br>            optimizer.step()<br>        memory = max_memory_allocated()<br>        print(f&#39;memory allocated: {memory / 1e9:.3f}G&#39;)</pre><p>After running the code, we find that the memory usage is as follows:</p><p><strong>fp32: 12.035G</strong></p><p><strong>fp16: 14.035G</strong></p><p>What? Does amp use an additional 2G of memory? How is this calculated? This comes down to the implementation of amp. PyTorch’s amp doesn’t change the type of the model weights, so they’re still stored in fp32, but it converts the fp32 weights to fp16 around the forward/backward of whitelisted operators to compute the fp16 activations and gradients. 
The fp16 gradients are then converted back to fp32 to ensure the precision of parameter updates. But if both the weights and gradients remain in fp32 and the optimizer state is unchanged, why is an additional 2G used? The reason is that the fp16 weights used during the forward and backward passes are cached, which is implemented in amp’s C++ code. These cached fp16 weights are the source of the extra 2G.</p><p>To save this memory, pass cache_enabled=False to autocast.</p><pre>def test_fp16():<br>    torch.cuda.init()<br>    model = Layer().cuda()<br>    optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9)<br>    data = torch.ones(10000).cuda()<br>    for _ in range(10):<br>        with autocast(device_type=&#39;cuda&#39;, cache_enabled=False):<br>            optimizer.zero_grad()<br>            output = model(data)<br>            loss = output.sum()<br>            loss.backward()<br>            optimizer.step()<br>        memory = max_memory_allocated()<br>        print(f&#39;memory allocated: {memory / 1e9:.3f}G&#39;)</pre><p>As a result, the memory consumption is 12.235G, which is basically consistent with fp32 and meets expectations.</p><h3>DDP Training</h3><p>DDP just creates and updates the model in each process, so memory usage should still be around 12G, right?</p><pre>def _test_ddp_fp16():<br>    rank = dist.get_rank()<br>    model = DistributedDataParallel(Layer().cuda())<br>    optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9)<br>    data = torch.ones(10000).cuda()<br>    for _ in range(10):<br>        with autocast(device_type=&#39;cuda&#39;, cache_enabled=False):<br>            optimizer.zero_grad()<br>            output = model(data)<br>            loss = output.sum()<br>            loss.backward()<br>            optimizer.step()<br>        memory = max_memory_allocated()<br>        if rank == 0:<br>            print(f&#39;memory allocated: {memory / 1e9:.3f}G&#39;)</pre><p>However, the result 
is:</p><p><strong>16.036G</strong></p><p>The principle is simple. DDP requires a bucket for gradient computation and synchronization, and the bucket retains a copy of the gradients, so it consumes about 4G more memory.</p><h3>FSDP Training</h3><p>When using FSDP, we need to configure the auto_wrap_policy parameter to choose the model sharding strategy, otherwise the memory optimization can only reach the level of ZeRO-stage1/2. The configuration of auto_wrap_policy and its underlying principle will be explained in detail in the following sections.</p><pre>from functools import partial<br><br>from torch.distributed.fsdp.wrap import _module_wrap_policy<br><br>def _test_fsdp_fp16():<br>    rank = dist.get_rank()<br>    fsdp_model = FullyShardedDataParallel(<br>        module=Layer(), device_id=rank,<br>        auto_wrap_policy=partial(<br>            _module_wrap_policy,<br>            module_classes=nn.Linear))<br>    optimizer = SGD(fsdp_model.parameters(), lr=0.1, momentum=0.9)<br>    data = torch.ones(10000).cuda()<br>    for _ in range(10):<br>        optimizer.zero_grad()<br>        output = fsdp_model(data)<br>        loss = output.sum()<br>        loss.backward()<br>        optimizer.step()<br>        memory = max_memory_allocated()<br>        if rank == 0:<br>            print(f&#39;step memory allocate: {memory / 1e9:.3f}G&#39;)<br>        torch.cuda.reset_max_memory_allocated()</pre><p>The result is 1.524G, which is basically equivalent to the memory optimization effect of ZeRO3.</p><p>This analysis of memory usage should help you form rational expectations about memory optimization when switching from DDP to FSDP.</p><h3>FSDP Sharding Strategy</h3><p>In the previous section, we mentioned that we need to specify the model sharding strategy through the auto_wrap_policy. So how does this parameter work? 
And why does the optimization only reach ZeRO-stage1/2 without configuring this parameter?</p><p>Similar to DistributedDataParallel, FSDP also uses a model wrapper, <a href="https://pytorch.org/docs/stable/fsdp.html#module-torch.distributed.fsdp">FullyShardedDataParallel</a>, to implement the logic of parameter slicing. The wrapped module becomes the root fsdp module, which recursively wraps its submodules into child fsdp modules according to the user-defined auto_wrap_policy during construction:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*QQe11SjXjds8dksb" /></figure><p>Take the officially implemented _module_wrap_policy as an example, where the key parameter module_classes indicates which types of submodule should be wrapped into child fsdp modules.</p><pre>def _module_wrap_policy(<br>    module: nn.Module,<br>    recurse: bool,<br>    nonwrapped_numel: int,<br>    module_classes: Set[Type[nn.Module]],<br>) -&gt; bool:<br>    &quot;&quot;&quot;<br>    This auto wrap policy wraps every module that is an instance of any type in<br>    ``module_classes`` as its own FSDP instance. The root module given by<br>    ``module`` is always wrapped as an FSDP instance regardless. Since the<br>    wrapping proceeds bottom up, each FSDP instance manages the parameters in<br>    its subtree excluding any already managed by a child FSDP instance.<br><br>    Args:<br>        module (nn.Module): Current module being considered.<br>        recurse (bool): If ``False``, then this function must decide whether<br>            ``module`` should be wrapped as an FSDP instance or not. 
If<br>            ``True``, then the function is still recursing down the module<br>            tree as a part of the DFS.<br>        nonwrapped_numel (int): Parameter numel not yet wrapped.<br>        module_classes (Set[Type[nn.Module]]): Set of module classes that are<br>            wrapped as FSDP instances.<br><br>    Returns:<br>        ``True`` if ``recurse=True``, and whether ``module`` should be wrapped<br>        if ``recurse=False``.<br>    &quot;&quot;&quot;<br>    if recurse:<br>        return True  # always recurse<br>    if inspect.isclass(module_classes):<br>        module_classes = (module_classes, )<br>    return isinstance(module, tuple(module_classes))</pre><p>In the previous section, we specified it as nn.Linear, which means each nn.Linear will be wrapped into a child fsdp module.</p><p>All fsdp modules will trigger parameter unsharding (all gather) and sharding during the forward process.</p><ol><li>The forward of the root fsdp module will gather the parameters of different processes in the pre-forward stage and register some pre-backward-hook and post-backward-hook. Then it releases parameters that do not belong to the current rank in the post-forward stage. The pre-backward-hook will gather the parameters again before executing backward, and the post-backward-hook is responsible for implementing gradient reduce-scatter, that is, gradient synchronization + gradient distribution.</li></ol><p>It should be noted that the fsdp-module forward will not further gather the parameters of the child fsdp module.</p><p>Compared with the child fsdp module, the forward of the root fsdp module will also do some additional work such as cuda stream initialization, which is not further discussed here.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*sDwAl8yG1V32fWpr" /></figure><p>2. 
The forward of the child fsdp module</p><p>The main logic is basically the same as that of the root fsdp module.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Rpaw1RpaxWIjKl1t" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Rrj6yHiD-JEZtkkd" /></figure><p>It can be seen that each fsdp module only gathers part of the parameters at a time, which is in line with our expectations. So what if we don’t set auto_wrap_policy? That is, there are no child fsdp modules.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*vNGt-F3BFhcIo7gD" /></figure><p>During the forward stage of the root fsdp module, it will directly gather all the parameters, which means the memory saving from parameter slicing in ZeRO-stage3 cannot be achieved. However, the slicing of gradients and optimizer states in ZeRO1 and ZeRO2 can still be achieved. The reason is that the post-backward-hook is still registered during the forward stage, so the logic of gradient reduce-scatter still works. When building the optimizer, the parameters of the root fsdp module are passed in, so the optimizer directly updates the sliced parameters and records their state, which means the sharding of the optimizer state also takes effect.</p><p>auto_wrap_policy needs to follow a certain interface specification, that is, accept the following parameters:</p><ul><li><strong>module</strong>: the module visited while recursively traversing the submodules</li><li><strong>recurse</strong>: whether the traversal should keep recursing into the submodules of the current module</li><li><strong>nonwrapped_numel</strong>: the number of parameters in the current module that do not need to be sliced. What are the parameters that do not need to be sliced? 
Generally speaking, it includes two parts: the already sliced parameters and the parameters that the user specifies to be ignored (ignored_params). Based on this parameter, a size-based wrap policy can be implemented, such as the officially implemented size_based_auto_wrap_policy.</li></ul><p>FSDP gives users the right to configure the auto_wrap_policy parameter, which has indeed improved its flexibility, but it has also invisibly increased the learning cost of FSDP. For example, what effect does auto_wrap_policy have, and what do its input parameters mean? Users may feel puzzled when they get started with FSDP.</p><p>If the cost of using FSDP were limited to this, I believe everyone would still be willing to learn and use it. However, some implicit conventions and strange errors are really discouraging.</p><h3>The painful lessons learned from experimenting with FSDP</h3><h3>Risks of Replacing Submodules</h3><p>In the previous section, we mentioned that FSDP replaces submodules with child fsdp modules after wrapping. You might wonder: what happens if the parent module accesses some attributes or methods of the submodule? Will an AttributeError be raised?</p><pre>def __getattr__(self, name: str) -&gt; Any:<br>    &quot;&quot;&quot;Forward missing attributes to the wrapped module.&quot;&quot;&quot;<br>    try:<br>        return super().__getattr__(name)  # defer to nn.Module&#39;s logic<br>    except AttributeError:<br>        return getattr(self._fsdp_wrapped_module, name)</pre><p>This way, undefined attributes are looked up in the wrapped submodule. However, this still poses risks.</p><ol><li>If the attribute you access happens to have the same name as an attribute of the FSDP wrapper, you might access the wrong attribute.</li><li>If you directly access the submodule’s parameter and perform some operations on it. 
Since parameters are only gathered during the forward stage, what you get at this point is a sharded parameter, which will probably throw an error.</li><li>If you happen not to directly call the __call__ method of the child fsdp module, for example in this situation:</li></ol><pre>class Layer(nn.Module):<br>    def __init__(self, *args, **kwargs) -&gt; None:<br>        super().__init__(*args, **kwargs)<br>        self.processor = nn.Linear(1, 1)<br>        self.linear1 = nn.Linear(1, 1)<br>        self.linear2 = nn.Linear(1, 1)<br><br>    def forward(self, x):<br>        return self.linear1(x) + self.linear2(x)<br><br>class ToyModel(nn.Module):<br>    def __init__(self, *args, **kwargs) -&gt; None:<br>        super().__init__(*args, **kwargs)<br>        self.linear = nn.Linear(1, 1)<br>        self.layer = Layer()  # designated as a child fsdp module by the auto_wrap_policy<br><br>    def forward(self, x):<br>        y = self.linear(self.layer.processor(x))<br>        return self.layer(y)</pre><p>Suppose Layer is wrapped as an fsdp module and self.layer.processor is called directly by ToyModel.forward. An error will be raised, since Layer.forward has not been called and the parameters of processor still remain sharded.</p><p>Or in this case:</p><pre>class A:<br>    ...<br>    def loss(self, inputs: torch.Tensor, data_samples: List[DataSample]) -&gt; dict:<br>        feats = self.extract_feat(inputs)<br>        return self.head.loss(feats, data_samples)<br>    <br>class B:<br>    ...<br>    def loss(self, feats: Tuple[torch.Tensor], data_samples: List[DataSample], **kwargs) -&gt; dict:<br>        cls_score = self(feats)  # does not go through FSDP&#39;s forward<br>        losses = self._get_loss(cls_score, data_samples, **kwargs)<br>        return losses</pre><p>class B is the head submodule of class A, and A will call self.head.loss. 
If class B is wrapped as a child fsdp module, the sharded tensors will not be gathered when calling self.head.loss, and a corresponding error will be raised.</p><h3>Optimizer with Multiple Parameter Groups</h3><p>PyTorch’s optimizer supports setting different learning rates, momentum, and other hyperparameters for different parameters in the model. The setup process looks something like this:</p><pre>param_groups = []<br>for module in model.modules():<br>    if isinstance(module, nn.BatchNorm2d):<br>        param_groups.append({&#39;params&#39;: [module.weight], &#39;lr&#39;: 0.01})<br>        param_groups.append({&#39;params&#39;: [module.bias], &#39;lr&#39;: 0.1})<br>    ...  # other branches omitted<br><br>optimizer = SGD(param_groups, lr=0.1)</pre><p>However, the problem is that, prior to PyTorch 2.0, once the root and child fsdp modules are built, FSDP deletes the original parameters, such as bn.weight and bn.bias, and merges all the unsliced parameters under each fsdp module into one large flatten parameter.</p><p>In the example from the previous section, if no auto_wrap_policy is specified, only the outermost root fsdp module will be retained. 
Then, all the parameters of the linear layers will be reconstructed into a large flatten parameter placed under the root fsdp module:</p><pre>rank = dist.get_rank()<br>fsdp_model = FullyShardedDataParallel(<br>    module=Layer(), device_id=rank,<br>    # auto_wrap_policy=partial(<br>    #     _module_wrap_policy,<br>    #     module_classes=nn.Linear),<br>)<br>print(list(fsdp_model.parameters()))</pre><p>At this point, each rank will only print out one parameter:</p><pre>[Parameter containing:<br>Parameter(FlatParameter([-4.6519e-05, -6.2861e-03,  3.9519e-03,  ..., -3.2763e-03,<br>                7.1111e-04, -8.2136e-03], device=&#39;cuda:3&#39;, requires_grad=True))]</pre><p>Therefore, before PyTorch 2.0, once FSDP was used, it was difficult to set different learning rates for individual parameters, because multiple parameters would be merged into one after the fsdp wrap. The subsequent gradient sharding and parameter updates are also based on the flatten tensor.</p><p>Since parameter updates are based on the flatten tensor, FSDP requires every parameter under the same fsdp module to have consistent dtype and requires_grad attributes; otherwise the parameters cannot be composed into a single large flatten tensor.</p><p>PyTorch 2.0 added a use_orig_params parameter to FSDP. When this parameter is turned on, FSDP will not delete the original parameters during the wrap process; the original parameters’ storage points into regions of the flatten param.</p><p>This is a great update. Without introducing additional GPU memory consumption, users can still access the original parameters and set different optimizer hyperparameters for them. With the introduction of this parameter, in theory, the restriction on the uniformity of the requires_grad attribute of all parameters under the same fsdp module should also be lifted. 
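As an aside, the flatten-and-shard bookkeeping described in this section can be illustrated with a toy sketch in plain Python. This is my own illustration of the idea, not FSDP's actual implementation: all parameter tensors of an fsdp module are concatenated into one flat buffer, the buffer is padded so it divides evenly by the world size, and each rank owns one contiguous slice.

```python
import math

def shard_flat_param(param_numels, world_size, rank):
    """Toy flatten-and-shard: return (shard_len, (start, end)) for `rank`.

    `param_numels` are the element counts of the parameters being flattened.
    The flat buffer is padded so every rank owns an equal-length slice; the
    last rank's slice may be partly padding, so its end is clamped to `total`.
    """
    total = sum(param_numels)
    shard_len = math.ceil(total / world_size)  # equal slice per rank
    start = rank * shard_len
    end = min(start + shard_len, total)        # clamp away the padding
    return shard_len, (start, end)

# Ten 10000x10000 Linear weights, as in the toy Layer earlier: 1e9 elements.
numels = [10000 * 10000] * 10
shard_len, (start, end) = shard_flat_param(numels, world_size=4, rank=0)
print(shard_len)  # 250000000 elements held per rank
```

This also makes it clear why a uniform dtype and requires_grad are needed: every element of the flat buffer must be treatable identically by the sharded gradient and optimizer updates.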
Unfortunately, PyTorch 2.0 did not adjust this part of the logic, but the issue has been fixed on the main branch, and the upcoming PyTorch 2.1 should resolve this pain point.</p><h3>Stability of FSDP Interface</h3><p>Although FSDP has been a beta feature since as early as PyTorch 1.11, the FSDP module is still iterating rapidly to this day. In February 2023, the developers of FSDP initiated a discussion introducing some design concepts and internal restructuring.</p><p>In addition, the external interface of FSDP changes relatively quickly. When you open the API documentation of PyTorch FSDP, you will find that many interfaces are marked as deprecated. Overall, though, the new interface is much easier to use and more flexible than the old one. The integration of FSDP by <a href="https://github.com/open-mmlab/mmengine">MMEngine</a> is also based on the new interface.</p><h3>Conclusion</h3><ol><li>FSDP, in terms of memory savings, is indeed equivalent to ZeRO3, but note that when mixed-precision training (autocast) is enabled, cache_enabled needs to be set to False.</li><li>FSDP has a steeper learning curve. Users need to understand the logic of FSDP wrapping modules, the role of auto_wrap_policy, and some limitations. Unexpected errors are prone to happen if users do not have an overall understanding of FSDP, and the error messages may be only loosely related to the actual cause, making debugging difficult.</li><li>PyTorch 2.0 has greatly improved the usability of FSDP through the use_orig_params parameter, but the restriction on the uniformity of the requires_grad attribute still exists. To solve this problem, you can wait for the PyTorch 2.1 release and specify use_orig_params=True. If you want a temporary workaround, you need to make some changes to auto_wrap_policy. 
Since this relies on FSDP&#39;s internal conventions, it may not be very stable, so I won&#39;t go into details here.</li></ol><p>In general, FSDP leaves something to be desired in terms of ease of use, but in terms of flexibility, it gives users more room to maneuver. With the continuous iteration of PyTorch, FSDP is expected to become as easy to use as DDP. MMEngine will also closely follow the updates of FSDP, aiming to lower the entry threshold while maintaining flexibility, and to distill a set of simple, easy-to-configure best practices.</p><p>If you’re interested, feel free to cheer us on for more updates. If there’s an opportunity, we can further discuss the design philosophy of FSDP, the construction logic of flatten params, the rules for parameter slicing, and the parallel methods for gradient computation and synchronization in FSDP. Let’s exchange ideas on how to tackle the errors thrown by FSDP (hopefully with fewer rounds of debugging after PyTorch updates).</p><p>What? You also want a comprehensive analysis of DeepSpeed, ColossalAI, and FSDP? MMEngine also supports DeepSpeed from v0.8.0, and we will bring an introduction to DeepSpeed next time. Please keep an eye on MMEngine and give it a star. We believe that in the near future, you will be able to switch freely between FSDP, DeepSpeed, and ColossalAI with just a few lines of code and experience the pros and cons of various training frameworks yourself.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=fe8d2848832f" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Fine-tuning Llama2 takes less than 200 lines of code!]]></title>
            <link>https://openmmlab.medium.com/fine-tuning-llama2-takes-less-than-200-lines-of-code-b93dd91f9541?source=rss-7b857bec476d------2</link>
            <guid isPermaLink="false">https://medium.com/p/b93dd91f9541</guid>
            <dc:creator><![CDATA[OpenMMLab]]></dc:creator>
            <pubDate>Thu, 27 Jul 2023 02:51:02 GMT</pubDate>
            <atom:updated>2023-07-27T02:51:02.551Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/900/1*eqGFoVljTbxc2nKYZ3CdHg.png" /></figure><p>Last week, Meta AI released their next-generation large language model: <a href="https://ai.meta.com/llama/">Llama 2</a>. They open-sourced the model, training, and inference scripts, with the model even available in a Hugging Face version. They’ve really done an excellent job of thinking about regular users while open-sourcing conscientiously, which is incredibly cool. As soon as I heard the news, I rushed into the official repository, <a href="https://github.com/facebookresearch/llama-recipes/">llama2-recipes</a>, planning to experience the training process of Llama 2.</p><p>During my first experience, it was clear that the code release was somewhat rushed, and I ran into a few minor issues. The model stopped converging after only a few iterations. Upon carefully reviewing the code, I found a small mistake. They updated the epoch-based scheduler as if it were step-based, which led to the learning rate decreasing too rapidly. After only a few iterations, the learning rate was almost zero. So, I quickly reported this issue to the official team: <a href="https://github.com/facebookresearch/llama-recipes/issues/27">https://github.com/facebookresearch/llama-recipes/issues/27</a></p><p>The official response was incredibly fast and they fixed the issue the very next day: <a href="https://github.com/facebookresearch/llama-recipes/pull/28">https://github.com/facebookresearch/llama-recipes/pull/28</a></p><p>Anyone who has encountered similar issues can update their code to ensure the problem is resolved.</p><p>After solving this minor issue, Llama2 was able to train normally. Great work, Meta AI! star++</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/205/0*hjHkBlXBUZaBViBq" /></figure><p>After this minor incident, I got a good handle on the Llama2 training process. 
As stated in the paper, it is trained using FSDP. Wait, FSDP? Doesn’t <a href="https://github.com/open-mmlab/mmengine">MMEngine v0.8.0</a> also support FSDP training? So, I implemented the Llama2 training process based on the new features of MMEngine. See the complete training example at: <a href="https://github.com/open-mmlab/mmengine/tree/main/examples/llama2">https://github.com/open-mmlab/mmengine/tree/main/examples/llama2</a></p><h3>Implementing the Dataset</h3><p>We directly referred to the implementation of the <a href="https://github.com/facebookresearch/llama-recipes/blob/1e0f8a1fb77b9ddccf649970f632dd606a22bd06/ft_datasets/alpaca_dataset.py#L28">alpaca dataset</a> in llama-recipes.</p><h3>Building the FSDPStrategy</h3><p>The constructor of FSDPStrategy initializes the distributed environment, random seed, and other environment settings, so it needs to be constructed first. Strategy is a feature introduced in MMEngine v0.8.0, aimed at solving some issues with large model training. For a detailed explanation of Strategy, you can look forward to subsequent articles~</p><pre>strategy = FSDPStrategy(<br>    model_wrapper=dict(<br>        auto_wrap_policy=partial(<br>            transformer_auto_wrap_policy,<br>            transformer_layer_cls={LlamaDecoderLayer})),<br>    state_dict_cfg=&#39;full&#39;,<br>    env_kwargs=dict(randomness=dict(seed=42)))</pre><h3>Building the dataloader and model</h3><p>The configuration is completely copied from the official repo. 
It’s worth noting that the official repo by default enables bf16 training with full parameters, without the need for mixed precision training.</p><pre># Prepare model<br>tokenizer = LlamaTokenizer.from_pretrained(args.checkpoint)<br>tokenizer.add_special_tokens({&#39;pad_token&#39;: &#39;&lt;PAD&gt;&#39;})<br>model = LlamaForCausalLM.from_pretrained(args.checkpoint)<br>model.to(torch.bfloat16)<br>model.train()<br><br># Prepare dataset<br>train_dataset = AlpacaDataset(<br>    tokenizer=tokenizer, data_path=args.data_root)<br>train_dataloader = DataLoader(<br>    train_dataset,<br>    batch_size=args.batch_size,<br>    sampler=DefaultSampler(train_dataset, seed=0),<br>    collate_fn=default_data_collator,<br>    drop_last=True)</pre><h3>Preparing optimizer and scheduler</h3><p>The configuration aligns with the official repo, using AdamW and StepLR. The model, scheduler, and optimizer are then passed to the strategy to handle the FSDP related logic.</p><pre>optim_cfg = dict(<br>    optimizer=dict(type=AdamW, lr=1e-4, weight_decay=0.0),<br>    accumulative_counts=ORI_BATCH_SIZE / args.batch_size)<br>scheduler_cfgs = [dict(type=StepLR, step_size=1, gamma=0.85)]<br>model, optimizer, schedulers = strategy.prepare(<br>    model,<br>    optim_wrapper=optim_cfg,<br>    param_scheduler=scheduler_cfgs,<br>    dispatch_kwargs=dict(max_iters=max_iters, max_epochs=args.max_epoch))</pre><h3>Customizing the train-loop</h3><p>By using the strategy, we can break away from Runner and freely implement the training logic. 
Doesn’t it feel similar to native PyTorch?</p><pre>for epoch in range(args.max_epoch):<br>    for idx, inputs in enumerate(train_dataloader):<br>        # Convert inputs to target device.<br>        inputs = apply_to(inputs, lambda m: isinstance(m, torch.Tensor),<br>                          lambda m: m.cuda())<br><br>        loss = model(**inputs).loss<br>        optimizer.update_params(loss)<br><br>        max_memory = torch.cuda.max_memory_allocated()<br>        strategy.logger.info(f&#39;Epoch: {epoch+1}/{args.max_epoch}, &#39;<br>                             f&#39;Iter: {idx+1}/{epoch_length}, &#39;<br>                             f&#39;Loss: {loss.item():.3f}, &#39;<br>                             f&#39;Lr: {optimizer.get_lr()[&quot;lr&quot;][0]:.6f} &#39;<br>                             f&#39;Memory: {max_memory/1e9:.3f}G&#39;)<br>        visualizer.add_scalars({&#39;loss&#39;: loss.item()})<br><br>        torch.cuda.reset_peak_memory_stats()<br><br>    for scheduler in schedulers:<br>        scheduler.step()<br><br>    save_dir = f&#39;{args.output_dir}/epoch_{epoch+1}&#39;<br>    state_dict = model.state_dict()<br><br>    if is_main_process():<br>        model.save_pretrained(save_dir, state_dict=state_dict)<br>        tokenizer.save_pretrained(save_dir)</pre><p>However, leaving the Runner also has some drawbacks. We have to manually update the learning rate, print logs, record logs, and save weights.</p><h3>In conclusion</h3><p>Users who are interested can come to <a href="https://github.com/open-mmlab/mmengine">MMEngine</a> and try out the training examples. We welcome plenty of feedback. 
If you are interested in DeepSpeed and ColossalAI, we will also provide examples of fine-tuning with DeepSpeed and ColossalAI as soon as possible.</p><p><a href="https://github.com/open-mmlab/mmengine">MMEngine</a>: <a href="https://github.com/open-mmlab/mmengine">https://github.com/open-mmlab/mmengine</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/177/0*Iau1uSVkRbb38jIs" /></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b93dd91f9541" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Join OpenMMLab Codecamp: Harness Your Coding Skills and Shape the Future of Open Source!]]></title>
            <link>https://openmmlab.medium.com/join-openmmlab-codecamp-harness-your-coding-skills-and-shape-the-future-of-open-source-4741d9aa1f6c?source=rss-7b857bec476d------2</link>
            <guid isPermaLink="false">https://medium.com/p/4741d9aa1f6c</guid>
            <dc:creator><![CDATA[OpenMMLab]]></dc:creator>
            <pubDate>Mon, 24 Jul 2023 08:47:44 GMT</pubDate>
            <atom:updated>2023-07-24T08:47:44.897Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/734/1*D3n_Y9cPuVtnDOXGVeQ6nw.jpeg" /></figure><p>Want to improve your programming skills?</p><p>Want to contribute to open-source projects alongside global developers? Eager to understand the latest technology trends but lacking hands-on projects? Has it been ten years since AlexNet was released, yet you are still only practicing with local development? In today’s wave of open source, are you only using git clone to consume GitHub’s open-source libraries? Here comes your chance:<strong> the </strong><a href="https://openmmlab.com/activity/codecamp"><strong>OpenMMLab Code Camp </strong></a><strong>officially kicks off on July 20th!</strong></p><p>We offer<strong> 10 directions</strong> and <strong>150+ tasks</strong> of varying difficulty.</p><p>These include, but are not limited to, the foundational library (MMEngine), object detection (MMDetection), 3D object detection (MMDetection3D), pre-training + multimodal (MMPreTrain), AIGC (MMagic), deployment (MMDeploy), pose estimation (MMPose), semantic segmentation (MMSegmentation), and action recognition (MMAction2), among various other fields.</p><p>Moreover, we have released cooperation tasks with Seeed Studio and the Extreme Mart platform, which come with extra gifts (a Jetson Nano is waiting for you). 
Meanwhile, you also have the opportunity to design and create your own innovative applications under the DIY tasks.</p><p>Get started by contributing to open-source frameworks and experience the charm of openness.</p><p><strong>We sincerely invite all AI learners, researchers, and practitioners to participate in this event.</strong></p><p><strong>You will gain:</strong></p><ul><li>Deep involvement in building well-known open-source projects.</li><li>10+ fields with 150+ tasks of varying difficulty, providing a richer project development experience.</li><li>One-on-one guidance from repo maintainers and developers, helping you tackle challenges and accumulate project experience.</li><li>Attractive electronic prizes, OpenMMLab certificates, limited-edition merchandise, and fast-track interviews.</li><li>Remote participation, with the freedom to schedule your own time.</li></ul><p>Click the link, pick your tasks, and join us!</p><p><a href="https://openmmlab.com/activity/codecamp">https://openmmlab.com/activity/codecamp</a></p><p>Questions?</p><p>Join our Discord for discussion: <a href="https://discord.gg/KuWMWVbCcD">https://discord.gg/KuWMWVbCcD</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/734/1*mcmOeVdqx8MQYvqgxsUvPw.jpeg" /></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=4741d9aa1f6c" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>