
One-Click Enabling of Intel Neural Compressor Features in PyTorch Scripts

Automatic Quantization of PyTorch Models


Authors: Kai Yao, Haihao Shen, and Huma Abidi, Intel Corporation

Intel Neural Compressor is an open-source Python library for model compression that reduces the model size and increases deep learning (DL) inference performance on CPUs or GPUs. It supports post-training static and dynamic quantization of PyTorch models.

For PyTorch, correctly applying DL optimizations such as INT8 quantization in Python code is not always trivial. Users must not only insert the corresponding API calls in the right places, but also identify the variable names of the calibration dataloader and of the model to be quantized. They might also need to construct an evaluation function for tuning. To address this, Intel Neural Compressor v1.13 provides an experimental auto-quantization feature that allows users to enable quantization without coding. The feature leverages DL optimization rules and static program analysis (Python code syntax analysis, static type inference, and call graph parsing) to automatically insert the necessary API code into user scripts.
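
To make the burden concrete, here is roughly the boilerplate a user would otherwise write by hand with the Intel Neural Compressor v1.x experimental API (a hedged sketch; "conf.yaml", model, and calib_dataloader are user-supplied placeholders, not code from this article):

from neural_compressor.experimental import Quantization, common

# conf.yaml: a user-provided quantization configuration file.
quantizer = Quantization("conf.yaml")
quantizer.model = common.Model(model)          # the FP32 model to quantize
quantizer.calib_dataloader = calib_dataloader  # calibration data for static quantization
q_model = quantizer.fit()                      # tune and return the INT8 model

Writing this correctly requires knowing which variables in the script hold the model and the dataloader, which is exactly what the auto-quantization feature figures out automatically.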

How to One-Click Enable Your Model

Feature Highlight

We’re delighted to share Neural Coder, a toolkit recently released as part of Intel Neural Compressor. Neural Coder is a code-free solution that automatically enables quantization algorithms in a PyTorch model script and evaluates them for the best model performance. Supported features include post-training static quantization, post-training dynamic quantization, and mixed precision. See this user guide for more information.

The following example code shows how to enable quantization algorithms and performance evaluation on a pretrained ResNet50 model for ImageNet using Neural Coder:

from neural_coder import auto_quant

auto_quant(
    code="https://github.com/pytorch/examples/blob/main/imagenet/main.py",
    args="-a resnet50 --pretrained -e /path/to/imagenet/",
)

Python API for PyTorch Programmers

Neural Coder can be used as a standalone Python library. It offers one-click acceleration of DL scripts via automatic platform conversion and optimization code insertion. It then benchmarks the applicable optimizations it has enabled to determine which delivers the best performance.

This feature leverages static program analysis and heuristic DL optimization rules to simplify the use of DL optimization APIs, which improves developer productivity and facilitates DL acceleration. See this document for more information.
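
For instance, the library's enable API can patch a single optimization into a script (a hedged sketch based on the Neural Coder user guide; the script name and feature string here are illustrative):

from neural_coder import enable

# Patch a local script in place with one optimization.
# "example.py" and the feature name are illustrative; see the
# Neural Coder documentation for the full list of supported features.
enable(
    code="example.py",
    features=["pytorch_inc_dynamic_quant"],
    overwrite=True,  # write the inserted API code back into the script
)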

Intel Neural Compressor Bench GUI

We have also integrated Neural Coder into the Intel Neural Compressor Bench GUI for easy access. First, create a new project and upload a PyTorch script (Figure 1).

Figure 1. Open the GUI and upload a PyTorch script

Next, choose an optimization approach (Figure 2).

Figure 2. Choose an optimization approach for the PyTorch script

Post-training dynamic and static quantization (with the FX backend) are supported in the current Intel Neural Compressor release, with new features under development (Figures 3 and 4). Currently, the evaluation function is constructed as a placeholder, so while the performance boost from the optimization can be demonstrated, accuracy-driven tuning is bypassed at this stage. However, we will soon support the construction of “real” evaluation functions for the most popular model zoos.

Figure 3. Original PyTorch code for model inference

Figure 4. Static quantization (FX) optimization result as a patch
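
As noted above, tuning currently uses a placeholder evaluation function. For context, an evaluation function in the Intel Neural Compressor v1.x API takes a candidate model and returns a scalar metric for the tuner; a “real” one might look roughly like this sketch (val_loader is an assumed, user-provided validation dataloader):

import torch

# Hedged sketch of an accuracy evaluation function for tuning.
# In the v1.x API it would be assigned via quantizer.eval_func.
def eval_func(model):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in val_loader:  # val_loader: assumed user data
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total  # scalar metric the tuner tries to preserve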

Static quantization quantizes both the model weights and activations, and allows activations to be fused into preceding layers where possible. Unlike dynamic quantization, where scales and zero points are computed during inference, static quantization determines scales and zero points before inference using a calibration dataset. Static quantization is therefore typically faster than dynamic quantization, which makes statically quantized models better suited for inference.
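
For illustration, stock PyTorch’s FX-based post-training static quantization follows this prepare/calibrate/convert pattern (a minimal self-contained sketch, not the exact code Neural Coder inserts; the toy model and random calibration data stand in for real ones, and the API details vary somewhat across PyTorch versions):

import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

# A toy float model standing in for the user's network.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 3, 224, 224),)

# Step 1: insert observers that record activation statistics.
qconfig_mapping = get_default_qconfig_mapping("fbgemm")  # x86 CPU backend
prepared = prepare_fx(model, qconfig_mapping, example_inputs)

# Step 2: calibrate with representative data so scales and zero
# points are fixed before inference (unlike dynamic quantization).
with torch.no_grad():
    for _ in range(10):
        prepared(torch.randn(1, 3, 224, 224))

# Step 3: convert to an INT8 model.
quantized = convert_fx(prepared)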

Dynamic quantization stores the weights of the neural network as integers, while activations are quantized dynamically during inference (Figure 5). Compared to floating-point neural networks, a dynamically quantized model is much smaller because the weights are stored as low-bit-width integers. Compared to other quantization techniques, dynamic quantization does not require any data for calibration or fine-tuning.

Figure 5. Dynamic quantization optimization result as a patch
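
For reference, post-training dynamic quantization in stock PyTorch is a one-liner (a minimal self-contained sketch; the toy model stands in for a real network):

import torch

# A toy float model standing in for the user's network.
model = torch.nn.Sequential(torch.nn.Linear(16, 8), torch.nn.ReLU())

# Weights of Linear layers are converted to INT8 ahead of time;
# activation scales are computed on the fly during inference.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
out = quantized_model(torch.randn(2, 16))  # runs with INT8 weights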

Users can benchmark the optimizations against the original input model to compare performance before and after a specific optimization (Figure 6). The GUI design is the same as for running benchmarks on TensorFlow/ONNX models, so users who have used that functionality before will find the experience familiar.

Figure 6. Perform benchmarking on the model

Future Work

We welcome any feedback on Neural Coder. More features are planned and will be integrated in the next Intel Neural Compressor release, e.g., enabling stock PyTorch INT8 quantization (similar to the Intel Neural Compressor INT8 enabling) and support for arbitrary models. We also encourage you to check out Intel’s other AI Tools and Framework optimizations and learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel’s AI Software Portfolio.
