PyTorch originally used an eager mode in which every PyTorch operation that forms the model is run independently as soon as it's reached. PyTorch 2.0 introduced torch.compile to speed up PyTorch code over the default eager mode. In contrast to eager mode, torch.compile pre-compiles the entire model into a single graph in a manner that's optimal for running on a given hardware platform. AWS optimized the PyTorch torch.compile feature for AWS Graviton3 processors. This optimization results in up to 2x better performance for Hugging Face model inference (based on the geomean of the performance improvement for 33 models) and up to 1.35x better performance for TorchBench model inference (geomean of the performance improvement for 45 models) compared to the default eager mode inference across several natural language processing (NLP), computer vision (CV), and recommendation models on AWS Graviton3-based Amazon EC2 instances. Starting with PyTorch 2.3.1, the optimizations are available in the torch Python wheels and the AWS Graviton PyTorch deep learning container (DLC).
In this blog post, we show how we optimized torch.compile performance on AWS Graviton3-based EC2 instances, how to use the optimizations to improve inference performance, and the resulting speedups.
Why torch.compile and what is the goal?
In eager mode, operators in a model are run immediately as they are encountered. It's easier to use and more suitable for machine learning (ML) researchers, and hence is the default mode. However, eager mode incurs runtime overhead because of redundant kernel launches and memory reads. In torch compile mode, operators are first synthesized into a graph, wherein one operator is merged with another to reduce and localize memory reads and total kernel launch overhead.
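As a minimal illustration of the difference (a sketch, not the benchmarking code used in this post, and it assumes torchvision is installed), the following runs the same model in eager mode and then wrapped with torch.compile, which uses the inductor backend by default:

```python
import torch
import torchvision.models as models

# Eager mode: each operator runs as soon as it is reached.
model = models.resnet50(weights=None).eval()
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    eager_out = model(x)

# torch.compile mode: the model is traced and compiled into an optimized
# graph; the first call pays the compilation cost, subsequent calls
# reuse the compiled graph.
compiled_model = torch.compile(model)
with torch.no_grad():
    compiled_out = compiled_model(x)

# The two modes should agree within floating-point tolerance.
print(torch.allclose(eager_out, compiled_out, atol=1e-4))
```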
The goal for the AWS Graviton team was to optimize the torch.compile backend for Graviton3 processors. PyTorch eager mode was already optimized for Graviton3 processors with Arm Compute Library (ACL) kernels via oneDNN (also known as MKLDNN). So the question was: how do we reuse those kernels in torch.compile mode to get the best of graph compilation and the optimized kernel performance together?
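You can check whether your torch build includes oneDNN (MKLDNN) support, which is the path through which the ACL kernels are reached, with a quick check like the following; whether ACL itself appears in the build string depends on how the wheel was built:

```python
import torch

# True if the wheel was built with oneDNN (MKLDNN) support.
print(torch.backends.mkldnn.is_available())

# Full build configuration string; Graviton-optimized wheels typically
# list USE_MKLDNN=ON here.
print(torch.__config__.show())
```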
Results
The AWS Graviton team extended the torch inductor and oneDNN primitives to reuse the ACL kernels and optimize compile mode performance on Graviton3 processors. Starting with PyTorch 2.3.1, the optimizations are available in the torch Python wheels and the AWS Graviton DLC. See the Running an inference section that follows for instructions on installation, runtime configuration, and how to run the tests.
To demonstrate the performance improvements, we used NLP, CV, and recommendation models from TorchBench and the most downloaded NLP models from Hugging Face across Question Answering, Text Classification, Token Classification, Translation, Zero-Shot Classification, Summarization, Feature Extraction, Text Generation, Text2Text Generation, Fill-Mask, and Sentence Similarity tasks to cover a wide variety of customer use cases.
We started by measuring TorchBench model inference latency, in milliseconds (msec), for eager mode, which is marked 1.0 with a red dotted line in the following graph. We then compared the improvements from torch.compile for the same model inference; the normalized results are plotted in the graph. You can see that for the 45 models we benchmarked, there is a 1.35x latency improvement (geomean for the 45 models).
Image 1: PyTorch model inference performance improvement with torch.compile on an AWS Graviton3-based c7g instance using the TorchBench framework. The reference eager mode performance is marked as 1.0. (higher is better)
Similar to the preceding TorchBench inference performance graph, we started by measuring the Hugging Face NLP model inference latency, in msec, for eager mode, which is marked 1.0 with a red dotted line in the following graph. We then compared the improvements from torch.compile for the same model inference; the normalized results are plotted in the graph. You can see that for the 33 models we benchmarked, there is around a 2x performance improvement (geomean for the 33 models).
Image 2: Hugging Face NLP model inference performance improvement with torch.compile on an AWS Graviton3-based c7g instance using Hugging Face example scripts. The reference eager mode performance is marked as 1.0. (higher is better)
Running an inference
Starting with PyTorch 2.3.1, the optimizations are available in the torch Python wheel and in the AWS Graviton PyTorch DLC. This section shows how to run inference in eager and torch.compile modes using torch Python wheels and benchmarking scripts from the Hugging Face and TorchBench repos.
To successfully run the scripts and reproduce the speedup numbers mentioned in this post, you need an instance from the Graviton3 family (c7g/r7g/m7g/hpc7g) of hardware. For this post, we used the c7g.4xl (16 vCPU) instance. The instance, the AMI details, and the required torch library versions are mentioned in the following snippet.
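As a quick sanity check of the environment (this is not the full setup snippet, which also lists the instance and AMI details), you can confirm the torch version and that you are on an Arm64 host:

```python
import platform
import torch

# Expect torch >= 2.3.1 for these optimizations and an aarch64 (Graviton3) machine.
print("torch version:", torch.__version__)
print("architecture :", platform.machine())
```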
The generic runtime tunings implemented for eager mode inference are equally applicable to torch.compile mode, so we set the following environment variables to further improve torch.compile performance on AWS Graviton3 processors.
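A minimal sketch of those settings is shown below. The variable names follow the AWS Graviton PyTorch tuning guidance for eager mode; setting them from Python before importing torch is our choice here, and they can equally be exported in the shell before launching the workload.

```python
import os

# Enable oneDNN fast-math mode so fp32 GEMMs can use bfloat16 kernels on Graviton3.
os.environ["DNNL_DEFAULT_FPMATH_MODE"] = "BF16"

# Enable transparent huge page (THP) allocations for large tensors to
# reduce memory-allocation latency.
os.environ["THP_MEM_ALLOC_ENABLE"] = "1"

# Cache oneDNN primitives to avoid redundant re-creation across iterations.
os.environ["LRU_CACHE_CAPACITY"] = "1024"

import torch  # import after the environment is configured
```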
TorchBench benchmarking scripts
TorchBench is a collection of open source benchmarks used to evaluate PyTorch performance. We benchmarked 45 models using the scripts from the TorchBench repo. The following code shows how to run the scripts for eager mode and for compile mode with the inductor backend.
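The sketch below drives the TorchBench CPU userbenchmark from Python. It assumes the benchmark repo is already cloned and installed, and that the cpu userbenchmark accepts a --torchdynamo flag to select the inductor backend; treat the flag names and the model list as assumptions and check the TorchBench repo for the exact interface.

```python
import os
import subprocess

# Use all vCPUs of the instance (16 on c7g.4xl) for intra-op parallelism.
env = dict(os.environ, OMP_NUM_THREADS="16")

models = "BERT_pytorch,hf_Bert,resnet50"  # subset for illustration

# Eager mode run; results are written as JSON under .userbenchmark/cpu/.
subprocess.run(
    ["python3", "run_benchmark.py", "cpu",
     "--model", models, "--test", "eval",
     "--metrics", "latencies"],
    cwd="benchmark", env=env, check=True)

# torch.compile run with the inductor backend.
subprocess.run(
    ["python3", "run_benchmark.py", "cpu",
     "--model", models, "--test", "eval",
     "--torchdynamo", "inductor",
     "--metrics", "latencies"],
    cwd="benchmark", env=env, check=True)
```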
On successful completion of the inference runs, the script stores the results in JSON format. The following is a sample output:
Hugging Face benchmarking scripts
The Google T5 Small text translation model is one of the around 30 Hugging Face models we benchmarked. We use it as a sample model to demonstrate how to run inference in eager and compile modes. The additional configurations and APIs required to run it in compile mode are highlighted in BOLD. Save the following script as google_t5_small_text_translation.py.
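The original script is not reproduced verbatim here; the following is a minimal sketch of an equivalent google_t5_small_text_translation.py, assuming the Hugging Face transformers T5 API and torch.profiler. The lines needed only for compile mode are marked with comments.

```python
# google_t5_small_text_translation.py (minimal sketch)
import argparse

import torch
from torch.profiler import ProfilerActivity, profile, record_function
from transformers import T5Model, T5Tokenizer  # requires sentencepiece


def test_inference(mode: str, num_iter: int = 100) -> None:
    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5Model.from_pretrained("t5-small").eval()

    input_ids = tokenizer(
        "Studies have shown that owning a dog is good for you",
        return_tensors="pt",
    ).input_ids
    decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids

    # Compile mode only: wrap the model with torch.compile (inductor backend).
    if mode == "compile":
        model = torch.compile(model)

    with torch.no_grad():
        # Warm-up iterations; in compile mode this also triggers compilation.
        for _ in range(10):
            model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)

        # Profile the steady-state iterations and print the per-operator breakdown.
        with profile(activities=[ProfilerActivity.CPU]) as prof:
            with record_function("model_inference"):
                for _ in range(num_iter):
                    model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)

    print(prof.key_averages().table(sort_by="self_cpu_time_total"))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-m", "--mode", choices=["eager", "compile"], default="eager")
    args = parser.parse_args()
    test_inference(args.mode)
```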
Run the script with the following steps.
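With the sketch above (its -m flag is an assumption of that sketch, not necessarily the original script's interface), the two runs look like the following; set OMP_NUM_THREADS to the number of threads you want the script to use:

```python
import os
import subprocess

# Intra-op thread count per script instance; tune for your instance size.
env = dict(os.environ, OMP_NUM_THREADS="4")

# Eager mode run.
subprocess.run(
    ["python3", "google_t5_small_text_translation.py", "-m", "eager"],
    env=env, check=True)

# torch.compile mode run.
subprocess.run(
    ["python3", "google_t5_small_text_translation.py", "-m", "compile"],
    env=env, check=True)
```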
On successful completion of the inference runs, the script prints the torch profiler output with the latency breakdown for the torch operators. The following is a sample output from the torch profiler:
What's next
Next, we are extending torch inductor CPU backend support to compile the Llama model, and adding support for fused GEMM kernels to enable the torch inductor operator fusion optimization on AWS Graviton3 processors.
Conclusion
In this tutorial, we covered how we optimized torch.compile performance on AWS Graviton3-based EC2 instances, how to use the optimizations to improve PyTorch model inference performance, and demonstrated the resulting speedups. We hope you will give it a try! If you need any support with ML software on Graviton, please open an issue on the AWS Graviton Technical Guide GitHub.
About the Author
Sunita Nadampalli is a Software Development Manager and AI/ML expert at AWS. She leads AWS Graviton software performance optimizations for AI/ML and HPC workloads. She is passionate about open source software development and delivering high-performance and sustainable software solutions for SoCs based on the Arm ISA.