Speed up NLP inference with ONNX Runtime on AWS Graviton processors

ONNX is an open supply machine studying (ML) framework that gives interoperability throughout a variety of frameworks, working methods, and {hardware} platforms. ONNX Runtime is the runtime engine used for mannequin inference and coaching with ONNX.

AWS Graviton3 processors are optimized for ML workloads, together with help for bfloat16, Scalable Vector Extension (SVE), and Matrix Multiplication (MMLA) directions. Bfloat16 accelerated SGEMM kernels and int8 MMLA accelerated Quantized GEMM (QGEMM) kernels in ONNX have improved inference efficiency by as much as 65% for fp32 inference and as much as 30% for int8 quantized inference for a number of pure language processing (NLP) fashions on AWS Graviton3-based Amazon Elastic Compute Cloud (Amazon EC2) situations. Beginning model v1.17.0, the ONNX Runtime helps these optimized kernels.

On this put up, we present how one can run ONNX Runtime inference on AWS Graviton3-based EC2 situations and how one can configure them to make use of optimized GEMM kernels. We additionally show the ensuing speedup by way of benchmarking.

Optimized GEMM kernels

ONNX Runtime helps the Microsoft Linear Algebra Subroutine (MLAS) backend because the default Execution Supplier (EP) for deep studying operators. AWS Graviton3-based EC2 situations (c7g, m7g, r7g, c7gn, and Hpc7g situations) help bfloat16 format and MMLA directions for the deep studying operator acceleration. These directions enhance the SIMD {hardware} utilization and cut back the end-to-end inference latency by as much as 1.65 instances in comparison with the armv8 DOT product instruction-based kernels.

The AWS staff carried out MLAS kernels for bfloat16 quick math and int8 quantized Normal Matrix Multiply (GEMM) utilizing BFMMLA, SMMLA, and UMMLA directions, which have greater matrix multiplication throughput in comparison with DOT directions. The bfloat16 help permits environment friendly deployment of fashions skilled utilizing bfloat16, fp32, and automated blended precision (AMP) with out the necessity for quantization. As proven within the following diagrams, the optimized GEMM kernels are built-in into the ONNX Runtime CPU EP as MLAS kernels.

The primary determine illustrates the ONNX software program stack, highlighting (in orange) the elements optimized for inference efficiency enchancment on the AWS Graviton3 platform.

The next diagram illustrates the ONNX Runtime EP move, highlighting (in orange) the elements optimized for inference efficiency enchancment on the AWS Graviton3 platform.

Allow the optimizations

The optimizations are a part of the ONNX Runtime 1.17.0 launch, and can be found beginning with onnxruntime-1.17.0 python wheels and conda-1.17.0 packages. Optimized int8 kernels are enabled by default, and will likely be picked up routinely for AWS Graviton3 Processors. Bfloat16 quick math kernels, alternatively, are usually not enabled by default and want the next session choices in ONNX Runtime to allow them:

# For C++ purposes

SessionOptions so; 
so.config_options.AddConfigEntry( kOrtSessionOptionsMlasGemmFastMathArm64Bfloat16, "1");

# For Python purposes

sess_options = onnxruntime.SessionOptions()
sess_options.add_session_config_entry("mlas.enable_gemm_fastmath_arm64_bfloat16", "1")

Benchmark outcomes

We began with measuring the inference throughput, in queries per second, for the fp32 mannequin with none of our optimizations (utilizing ONNX Runtime 1.16.0), which is marked at 1.0 with the purple dotted line within the following graph. Then we in contrast the enhancements from bfloat16 quick math kernels from ONNX Runtime 1.17.1 for a similar fp32 mannequin inference. The normalized outcomes are plotted within the graph. You may see that for the BERT, RoBERTa, and GPT2 fashions, the throughput enchancment is as much as 65%. Related enhancements are noticed for the inference latency.

Much like the previous fp32 inference comparability graph, we began with measuring the inference throughput, in queries per second, for the int8 quantized mannequin with none of our optimizations (utilizing ONNX Runtime 1.16.0), which is marked at 1.0 with the purple dotted line within the following graph. Then we in contrast the enhancements from the optimized MMLA kernels from ONNX Runtime 1.17.1 for a similar mannequin inference. The normalized outcomes are plotted within the graph. You may see that for the BERT, RoBERTa, and GPT2 fashions, the throughput enchancment is as much as 30%. Related enhancements are noticed for the inference latency.

Benchmark setup

We used an AWS Graviton3-based c7g.4xl EC2 occasion with Ubuntu 22.04 primarily based AMI to show the efficiency enhancements with the optimized GEMM kernels from ONNX Runtime. The occasion and the AMI particulars are talked about within the following snippet:

Occasion: c7g.4xl occasion
Area: us-west-2
AMI: ami-0a24e6e101933d294 (Ubuntu 22.04/Jammy with 6.5.0-1014-aws kernel)

The ONNX Runtime repo supplies inference benchmarking scripts for transformers-based language fashions. The scripts help a variety of fashions, frameworks, and codecs. We picked PyTorch-based BERT, RoBERTa, and GPT fashions to cowl the widespread language duties like textual content classification, sentiment evaluation, and predicting the masked phrase. The fashions cowl each encoder and decoder transformers structure.

The next code lists the steps to run inference for the fp32 mannequin with bfloat16 quick math mode and int8 quantized mode utilizing the ONNX Runtime benchmarking script. The script downloads the fashions, exports them to ONNX format, quantizes them into int8 for int8 inference, and runs inference for various sequence lengths and batch sizes. Upon profitable completion of the script, it’ll print the inference throughput in queries/sec (QPS) and latency in msec together with the system configuration. Confer with the ONNX Runtime Benchmarking script for extra particulars.

# Set up Python
sudo apt-get replace
sudo apt-get set up -y python3 python3-pip

# Improve pip3 to the newest model
python3 -m pip set up --upgrade pip

# Set up onnx and onnx runtime
# NOTE: We used 1.17.1 as a substitute of 1.17.0 because it was the newest
# model accessible whereas amassing knowledge for this put up
python3 -m pip set up onnx==1.15.0 onnxruntime==1.17.1

# Set up the dependencies
python3 -m pip set up transformers==4.38.1 torch==2.2.1 psutil==5.9.8

# Clone onnxruntime repo to get the benchmarking scripts
git clone --recursive https://github.com/microsoft/onnxruntime.git
cd onnxruntime
git checkout 430a086f22684ad0020819dc3e7712f36fe9f016
cd onnxruntime/python/instruments/transformers

# To run bert-large fp32 inference with bfloat16 quick math mode
python3 benchmark.py -m bert-large-uncased -p fp32 --enable_arm64_bfloat16_fastmath_mlas_gemm

# To run bert-base  fp32 inference with bfloat16 quick math mode
python3 benchmark.py -m bert-base-cased -p fp32 --enable_arm64_bfloat16_fastmath_mlas_gemm

# To run roberta-base  fp32 inference with bfloat16 quick math mode
python3 benchmark.py -m roberta-base -p fp32 --enable_arm64_bfloat16_fastmath_mlas_gemm

# To run gpt2  fp32 inference with bfloat16 quick math mode
python3 benchmark.py -m gpt2 -p fp32 --enable_arm64_bfloat16_fastmath_mlas_gemm

# To run bert-large int8 quantized inference
python3 benchmark.py -m bert-large-uncased -p int8

# To run bert-base int8 quantized inference
python3 benchmark.py -m bert-base-cased -p int8

# To run roberta-base int8 quantized inference
python3 benchmark.py -m roberta-base -p int8

# To run gpt2 int8 quantized inference
python3 benchmark.py -m gpt2 -p int8

Conclusion

On this put up, we mentioned how one can run ONNX Runtime inference on an AWS Graviton3-based EC2 occasion and how one can configure the occasion to make use of optimized GEMM kernels. We additionally demonstrated the ensuing speedups. We hope that you’ll give it a strive!

When you discover use instances the place comparable efficiency beneficial properties are usually not noticed on AWS Graviton, please open a problem on the AWS Graviton Technical Information GitHub to tell us about it.

In regards to the Writer

Sunita Nadampalli is a Software program Improvement Supervisor at AWS. She leads Graviton software program efficiency optimizations for Machine Studying and HPC workloads. She is enthusiastic about open supply software program improvement and delivering high-performance and sustainable software program options with Arm SoCs.