Optimization is an important part of writing high-performance code, no matter if you are writing a web server or computational fluid dynamics simulation software. Profiling allows you to make informed decisions regarding your code. In a sense, optimizing without profiling is like flying blind: mostly fine for seasoned professionals with expert knowledge and fine-tuned intuition, but a recipe for disaster for almost everybody else.

Following my initial series CUDA by Numba Examples (see parts 1, 2, 3, and 4), we will study a comparison between unoptimized, single-stream code and a slightly better version which uses stream concurrency and other optimizations. We will learn, from the ground up, how to use NVIDIA Nsight Systems to profile and analyze CUDA code. All the code in this tutorial can also be found in the repo cako/profiling-cuda-nsight-systems.
NVIDIA recommends as best practice to follow the APOD framework (Assess, Parallelize, Optimize, Deploy). There is a variety of proprietary, open-source, free, and commercial software for different types of assessments and profiling. Veteran Python users may be familiar with basic profilers such as cProfile, line_profiler, memory_profiler (sadly, unmaintained as of 2024) and more advanced tools like PyInstrument and Memray. These profilers target specific aspects of the "host", such as CPU and RAM usage.
However, profiling "device" (e.g., GPU) code and its interactions with the host requires specialized tools provided by the device vendor. For NVIDIA GPUs, Nsight Systems, Nsight Compute and Nsight Graphics are available for profiling different aspects of computation. In this tutorial we will focus on using Nsight Systems, which is a system-wide profiler. We will use it to profile Python code which interacts with the GPU via Numba CUDA.
To get started, you will need the Nsight Systems CLI and GUI. The CLI can be installed separately and will be used to profile the code on a GPGPU-capable system. The full version includes both the CLI and the GUI. Note that both versions can be installed on a system without a GPU. Grab the version(s) you need from the NVIDIA website.
To make it easier to visualize code sections in the GUI, NVIDIA also provides the pip- and conda-installable Python library nvtx, which we will use to annotate sections of our code. More on this later.
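For reference, installing it is a one-liner with either tool (the conda-forge channel below is an assumption; adjust to your environment):

$ pip install nvtx
$ conda install -c conda-forge nvtx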
In this section we will set up our development and profiling environment. Below are two very simple Python scripts: kernels.py and run_v1.py. The former will contain all CUDA kernels, and the latter will serve as the entry point to run the example. In this example we are following the "reduce" pattern introduced in CUDA by Numba Examples Part 3: Streams and Events to compute the sum of an array.
#%%writefile kernels.py
import numba
from numba import cuda

THREADS_PER_BLOCK = 256
BLOCKS_PER_GRID = 32 * 40


@cuda.jit
def partial_reduce(array, partial_reduction):
    i_start = cuda.grid(1)
    threads_per_grid = cuda.blockDim.x * cuda.gridDim.x
    s_thread = numba.float32(0.0)
    for i_arr in range(i_start, array.size, threads_per_grid):
        s_thread += array[i_arr]

    s_block = cuda.shared.array((THREADS_PER_BLOCK,), numba.float32)
    tid = cuda.threadIdx.x
    s_block[tid] = s_thread
    cuda.syncthreads()

    i = cuda.blockDim.x // 2
    while i > 0:
        if tid < i:
            s_block[tid] += s_block[tid + i]
        cuda.syncthreads()
        i //= 2

    if tid == 0:
        partial_reduction[cuda.blockIdx.x] = s_block[0]


@cuda.jit
def single_thread_sum(partial_reduction, sum):
    sum[0] = numba.float32(0.0)
    for element in partial_reduction:
        sum[0] += element


@cuda.jit
def divide_by(array, val_array):
    i_start = cuda.grid(1)
    threads_per_grid = cuda.gridsize(1)
    for i in range(i_start, array.size, threads_per_grid):
        array[i] /= val_array[0]
#%%writefile run_v1.py
import argparse
import warnings

import numpy as np
from numba import cuda
from numba.core.errors import NumbaPerformanceWarning

from kernels import (
    BLOCKS_PER_GRID,
    THREADS_PER_BLOCK,
    divide_by,
    partial_reduce,
    single_thread_sum,
)

# Ignore NumbaPerformanceWarning
warnings.simplefilter("ignore", category=NumbaPerformanceWarning)


def run(size):
    # Define host array
    a = np.ones(size, dtype=np.float32)
    print(f"Old sum: {a.sum():.3f}")

    # Array copy to device and array creation on the device.
    dev_a = cuda.to_device(a)
    dev_a_reduce = cuda.device_array((BLOCKS_PER_GRID,), dtype=dev_a.dtype)
    dev_a_sum = cuda.device_array((1,), dtype=dev_a.dtype)

    # Launching kernels to normalize array
    partial_reduce[BLOCKS_PER_GRID, THREADS_PER_BLOCK](dev_a, dev_a_reduce)
    single_thread_sum[1, 1](dev_a_reduce, dev_a_sum)
    divide_by[BLOCKS_PER_GRID, THREADS_PER_BLOCK](dev_a, dev_a_sum)

    # Array copy to host
    dev_a.copy_to_host(a)
    cuda.synchronize()
    print(f"New sum: {a.sum():.3f}")


def main():
    parser = argparse.ArgumentParser(description="Simple Example v1")
    parser.add_argument(
        "-n",
        "--array-size",
        type=int,
        default=100_000_000,
        metavar="N",
        help="Array size",
    )
    args = parser.parse_args()
    run(size=args.array_size)


if __name__ == "__main__":
    main()
This is a simple script that can just be run with:

$ python run_v1.py
Old sum: 100000000.000
New sum: 1.000

We also run this code through our profiler, which simply entails calling nsys with some options before the call to our script:
$ nsys profile \
   --trace cuda,osrt,nvtx \
   --gpu-metrics-device=all \
   --cuda-memory-usage true \
   --force-overwrite true \
   --output profile_run_v1 \
   python run_v1.py

GPU 0: General Metrics for NVIDIA TU10x (any frequency)
Old sum: 100000000.000
New sum: 1.000
Generating '/tmp/nsys-report-fb78.qdstrm'
[1/1] [========================100%] profile_run_v1.nsys-rep
Generated:
    /content/profile_run_v1.nsys-rep
You can consult the Nsight CLI docs for all the available options of the nsys CLI. For this tutorial we will always use the ones above. Let's dissect this command:

- profile puts nsys in profile mode. There are many other modes, such as export and launch.
- --trace cuda,osrt,nvtx ensures we "listen" to all CUDA calls (cuda), OS runtime library calls (osrt) and nvtx annotations (none in this example). There are many more trace options, such as cublas, cudnn, mpi, dx11 and several others. Check the docs for all options.
- --gpu-metrics-device=all records GPU metrics for all GPUs, including Tensor Core utilization.
- --cuda-memory-usage tracks GPU memory usage of kernels. It may significantly slow down execution and requires --trace=cuda. We use it because our scripts are fairly fast anyhow.
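Pro tip: you do not always need the GUI for a first impression. The nsys stats subcommand post-processes a report into summary tables (CUDA API calls, kernel times, memory transfers) right in the terminal:

$ nsys stats profile_run_v1.nsys-rep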
If the command exited successfully, we will have a profile_run_v1.nsys-rep file in the current folder. We will open this file by launching the Nsight Systems GUI and going to File > Open. The initial view is slightly confusing, so we will start by decluttering: resize the Events View pane to the bottom, and minimize CPU, GPU and Processes under the Timeline View pane. Now expand only Processes > python > CUDA HW. See Figures 1a and 1b.
First up, let's find our kernels. On the CUDA HW line, you will find green and red blobs, and very small slivers of light blue (see Figure 1b). If you hover over these you will see tooltips saying "CUDA Memory operations in progress" for red and green, and "CUDA Kernel Running (89.7%)" for the light blue ones. These are going to be the bread and butter of our profiling. On this line, we can tell when and how memory is being transferred (red and green) and when and how our kernels are running (light blue).

Let's dig in a little more on our kernels. You should see three very small blue slivers, each representing a kernel call. We will zoom into the region by clicking and dragging the mouse from just before the start of the first kernel call to just after the end of the last one, and then pressing Shift + Z. See Figure 2.
Now that we have found our kernels, let's look at some metrics. We open the GPU > GPU Metrics tab for that. In this panel, we can find the "Warp Occupancy" (beige) for compute kernels. One way to optimize CUDA code is to ensure that the warp occupancy is as close to 100% as possible for as long as possible, which means that our GPU is not idling. We notice that this is happening for the first and last kernels, but not for the middle kernel. That is expected, as the middle kernel launches a single thread. One final thing to note in this section is the GPU > GPU Metrics > SMs Active > Tensor Active / FP16 Active line. This line shows whether the tensor cores are being used. In this case, you should verify that they are not.
Now let's briefly look at the Events View. Right click Processes > python > CUDA HW and click "Show in Events View". Then sort the events by descending duration. In Figure 3, we see that the slowest events are two pageable memory transfers. We have seen in CUDA by Numba Examples Part 3: Streams and Events that pageable memory transfers can be suboptimal, and we should favor page-locked or "pinned" memory transfers. If we have slow memory transfers due to the use of pageable memory, the Events View is a great place to identify where these slow transfers occur. Pro tip: you can isolate memory transfers by right clicking Processes > python > CUDA HW > XX% Memory instead.
In this section we learned how to profile a Python program which uses CUDA, and how to visualize basic information about this program in the Nsight Systems GUI. We also noticed that in this simple program we are using pageable instead of pinned memory, that one of our kernels is not occupying all warps, that the GPU is idle for quite some time between kernel runs, and that we are not using tensor cores.
In this section we will learn how to improve our profiling experience by annotating sections in Nsight Systems with NVTX. NVTX allows us to mark different regions of the code. It can mark ranges and instantaneous events. For a deeper look, check the docs. Below we create run_v2.py, which, in addition to annotating run_v1.py, also changes this line:

a = np.ones(size, dtype=np.float32)

to these:

a = cuda.pinned_array(size, dtype=np.float32)
a[...] = 1.0

Therefore, in addition to the annotations, we are now using pinned memory. If you want to learn more about the different types of memory that CUDA supports, see the CUDA C++ Programming Guide. It is worth noting that this is not the only way to pin an array in Numba: a previously created NumPy array can also be pinned with a context manager, as explained in the Numba documentation.
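As a quick illustration of that alternative, here is a minimal sketch (not code from this article's repo) using Numba's cuda.pinned context manager, which temporarily page-locks an existing NumPy array:

import numpy as np
from numba import cuda

a = np.ones(100_000_000, dtype=np.float32)

# `a` is page-locked only for the duration of this block
with cuda.pinned(a):
    stream = cuda.stream()
    dev_a = cuda.to_device(a, stream=stream)  # H2D copy from pinned memory
    dev_a.copy_to_host(a, stream=stream)      # D2H copy back into pinned memory
    stream.synchronize()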
#%%writefile run_v2.py
import argparse
import warnings

import numpy as np
import nvtx
from numba import cuda
from numba.core.errors import NumbaPerformanceWarning

from kernels import (
    BLOCKS_PER_GRID,
    THREADS_PER_BLOCK,
    divide_by,
    partial_reduce,
    single_thread_sum,
)

# Ignore NumbaPerformanceWarning
warnings.simplefilter("ignore", category=NumbaPerformanceWarning)


def run(size):
    with nvtx.annotate("Compilation", color="red"):
        dev_a = cuda.device_array((BLOCKS_PER_GRID,), dtype=np.float32)
        dev_a_reduce = cuda.device_array((BLOCKS_PER_GRID,), dtype=dev_a.dtype)
        dev_a_sum = cuda.device_array((1,), dtype=dev_a.dtype)
        partial_reduce[BLOCKS_PER_GRID, THREADS_PER_BLOCK](dev_a, dev_a_reduce)
        single_thread_sum[1, 1](dev_a_reduce, dev_a_sum)
        divide_by[BLOCKS_PER_GRID, THREADS_PER_BLOCK](dev_a, dev_a_sum)

    # Define host array
    a = cuda.pinned_array(size, dtype=np.float32)
    a[...] = 1.0
    print(f"Old sum: {a.sum():.3f}")

    # Array copy to device and array creation on the device.
    with nvtx.annotate("H2D Memory", color="yellow"):
        dev_a = cuda.to_device(a)
        dev_a_reduce = cuda.device_array((BLOCKS_PER_GRID,), dtype=dev_a.dtype)
        dev_a_sum = cuda.device_array((1,), dtype=dev_a.dtype)

    # Launching kernels to normalize array
    with nvtx.annotate("Kernels", color="green"):
        partial_reduce[BLOCKS_PER_GRID, THREADS_PER_BLOCK](dev_a, dev_a_reduce)
        single_thread_sum[1, 1](dev_a_reduce, dev_a_sum)
        divide_by[BLOCKS_PER_GRID, THREADS_PER_BLOCK](dev_a, dev_a_sum)

    # Array copy to host
    with nvtx.annotate("D2H Memory", color="orange"):
        dev_a.copy_to_host(a)
    cuda.synchronize()
    print(f"New sum: {a.sum():.3f}")


def main():
    parser = argparse.ArgumentParser(description="Simple Example v2")
    parser.add_argument(
        "-n",
        "--array-size",
        type=int,
        default=100_000_000,
        metavar="N",
        help="Array size",
    )
    args = parser.parse_args()
    run(size=args.array_size)


if __name__ == "__main__":
    main()
Comparing the two files, you can see it is as simple as wrapping some GPU kernel calls with

with nvtx.annotate("Region Name", color="red"):
    ...

Pro tip: you can also annotate functions by placing the @nvtx.annotate decorator above their definition, automatically annotate everything by calling your script with python -m nvtx run_v2.py, or apply the autoannotator selectively in your code by enabling or disabling nvtx.Profile(). See the docs!
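For example, the decorator and autoannotator variants look roughly like this (a sketch following the nvtx docs; the region name and color are arbitrary):

import nvtx

@nvtx.annotate("run", color="blue")  # every call to run() becomes an NVTX range
def run(size):
    ...

# Selective auto-annotation: functions called between enable() and disable()
# are annotated automatically.
pr = nvtx.Profile()
pr.enable()
run(100)
pr.disable()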
Let's run this new script and open the results in Nsight Systems.

$ nsys profile \
   --trace cuda,osrt,nvtx \
   --gpu-metrics-device=all \
   --cuda-memory-usage true \
   --force-overwrite true \
   --output profile_run_v2 \
   python run_v2.py

GPU 0: General Metrics for NVIDIA TU10x (any frequency)
Old sum: 100000000.000
New sum: 1.000
Generating '/tmp/nsys-report-69ab.qdstrm'
[1/1] [========================100%] profile_run_v2.nsys-rep
Generated:
    /content/profile_run_v2.nsys-rep
Again, we start by minimizing everything, leaving only Processes > python > CUDA HW open. See Figure 4. Notice that we now have a new line, NVTX. On this line of the timeline window we should see differently colored blocks corresponding to the annotation regions that we created in the code. These are Compilation, H2D Memory, Kernels and D2H Memory. Some of these may be too small to read, but will become legible if you zoom into the region.
The profiler confirms that this memory is pinned, ensuring that our code is truly using pinned memory. In addition, H2D Memory and D2H Memory are now taking less than half of the time they were taking before. Generally we can expect better performance using pinned memory or prefetched mapped arrays (the latter not supported by Numba).
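If you want to corroborate this outside the profiler, a rough wall-clock comparison of pageable versus pinned transfers could look like the sketch below (timings are illustrative and machine-dependent, not results from this article):

import time

import numpy as np
from numba import cuda

size = 100_000_000
a_pageable = np.ones(size, dtype=np.float32)          # regular host memory
a_pinned = cuda.pinned_array(size, dtype=np.float32)  # page-locked host memory
a_pinned[...] = 1.0

for name, host_array in [("pageable", a_pageable), ("pinned", a_pinned)]:
    cuda.synchronize()
    t0 = time.perf_counter()
    dev_a = cuda.to_device(host_array)  # H2D transfer
    cuda.synchronize()
    print(f"{name}: {time.perf_counter() - t0:.4f} s")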
Now we will check whether we can improve this code by introducing streams. The idea is that while memory transfers are occurring, the GPU can already start processing the data. This allows a level of concurrency, which hopefully will ensure that we are occupying our warps as fully as possible.

In the code below we will split the processing of our array into roughly equal parts. Each part will run in a separate stream, which transfers its data and computes the sum of its chunk. Then, we synchronize all streams and sum their partial sums. At that point we can launch normalization kernels for each stream independently.

We want to answer a few questions:
- Will the code below actually create concurrency? Could we be introducing a bug?
- Is it faster than the code which uses a single stream?
- Is the warp occupancy better?
#%%writefile run_v3_bug.py
import argparse
import warnings
from math import ceil

import numpy as np
import nvtx
from numba import cuda
from numba.core.errors import NumbaPerformanceWarning

from kernels import (
    BLOCKS_PER_GRID,
    THREADS_PER_BLOCK,
    divide_by,
    partial_reduce,
    single_thread_sum,
)

# Ignore NumbaPerformanceWarning
warnings.simplefilter("ignore", category=NumbaPerformanceWarning)


def run(size, nstreams):
    with nvtx.annotate("Compilation", color="red"):
        dev_a = cuda.device_array((BLOCKS_PER_GRID,), dtype=np.float32)
        dev_a_reduce = cuda.device_array((BLOCKS_PER_GRID,), dtype=dev_a.dtype)
        dev_a_sum = cuda.device_array((1,), dtype=dev_a.dtype)
        partial_reduce[BLOCKS_PER_GRID, THREADS_PER_BLOCK](dev_a, dev_a_reduce)
        single_thread_sum[1, 1](dev_a_reduce, dev_a_sum)
        divide_by[BLOCKS_PER_GRID, THREADS_PER_BLOCK](dev_a, dev_a_sum)

    # Define host array
    a = cuda.pinned_array(size, dtype=np.float32)
    a[...] = 1.0

    # Define regions for streams
    step = ceil(size / nstreams)
    starts = [i * step for i in range(nstreams)]
    ends = [min(s + step, size) for s in starts]
    print(f"Old sum: {a.sum():.3f}")

    # Create streams
    streams = [cuda.stream()] * nstreams

    cpu_sums = [cuda.pinned_array(1, dtype=np.float32) for _ in range(nstreams)]
    devs_a = []
    with cuda.defer_cleanup():
        for i, (stream, start, end) in enumerate(zip(streams, starts, ends)):
            cpu_sums[i][...] = np.nan

            # Array copy to device and array creation on the device.
            with nvtx.annotate(f"H2D Memory Stream {i}", color="yellow"):
                dev_a = cuda.to_device(a[start:end], stream=stream)
                dev_a_reduce = cuda.device_array(
                    (BLOCKS_PER_GRID,), dtype=dev_a.dtype, stream=stream
                )
                dev_a_sum = cuda.device_array((1,), dtype=dev_a.dtype, stream=stream)
            devs_a.append(dev_a)

            # Launching kernels to sum array
            with nvtx.annotate(f"Sum Kernels Stream {i}", color="green"):
                for _ in range(50):  # Make it spend more time in compute
                    partial_reduce[BLOCKS_PER_GRID, THREADS_PER_BLOCK, stream](
                        dev_a, dev_a_reduce
                    )
                single_thread_sum[1, 1, stream](dev_a_reduce, dev_a_sum)
            with nvtx.annotate(f"D2H Memory Stream {i}", color="orange"):
                dev_a_sum.copy_to_host(cpu_sums[i], stream=stream)

        # Ensure all streams are caught up
        cuda.synchronize()

        # Aggregate all 1D arrays into a single 1D array
        a_sum_all = sum(cpu_sums)

        # Send it to the GPU
        with cuda.pinned(a_sum_all):
            with nvtx.annotate("H2D Memory Default Stream", color="orange"):
                dev_a_sum_all = cuda.to_device(a_sum_all)

        # Normalize via streams
        for i, (stream, start, end, dev_a) in enumerate(
            zip(streams, starts, ends, devs_a)
        ):
            with nvtx.annotate(f"Divide Kernel Stream {i}", color="green"):
                divide_by[BLOCKS_PER_GRID, THREADS_PER_BLOCK, stream](
                    dev_a, dev_a_sum_all
                )
            # Array copy to host
            with nvtx.annotate(f"D2H Memory Stream {i}", color="orange"):
                dev_a.copy_to_host(a[start:end], stream=stream)

    cuda.synchronize()
    print(f"New sum: {a.sum():.3f}")


def main():
    parser = argparse.ArgumentParser(description="Simple Example v3")
    parser.add_argument(
        "-n",
        "--array-size",
        type=int,
        default=100_000_000,
        metavar="N",
        help="Array size",
    )
    parser.add_argument(
        "-s",
        "--streams",
        type=int,
        default=4,
        metavar="N",
        help="Number of streams",
    )
    args = parser.parse_args()
    run(size=args.array_size, nstreams=args.streams)


if __name__ == "__main__":
    main()
Let's run the code and collect results.

$ nsys profile \
   --trace cuda,osrt,nvtx \
   --gpu-metrics-device=all \
   --cuda-memory-usage true \
   --force-overwrite true \
   --output profile_run_v3_bug_4streams \
   python run_v3_bug.py -s 4

GPU 0: General Metrics for NVIDIA TU10x (any frequency)
Old sum: 100000000.000
New sum: 1.000
Generating '/tmp/nsys-report-a666.qdstrm'
[1/1] [========================100%] profile_run_v3_bug_4streams.nsys-rep
Generated:
    /content/profile_run_v3_bug_4streams.nsys-rep
The program ran and yielded the correct answer. But when we open the profiling file (see Figure 6), we notice that there are two streams instead of 4! And one is basically completely idle! What is going on here?

There is a bug in the creation of the streams. By doing

streams = [cuda.stream()] * nstreams

we are actually creating a single stream and repeating it nstreams times. So why are we seeing two streams instead of one? The fact that one is not doing much computation should be an indicator that there is a stream we are not using. This stream is the default stream, which we are not using at all in our code since all GPU interactions are given a stream, namely the one stream we created.
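This is the classic Python aliasing pitfall, not anything CUDA-specific. A pure-Python sketch makes it obvious:

streams = [object()] * 4                # ONE object, four references to it
print(len({id(s) for s in streams}))    # 1

streams = [object() for _ in range(4)]  # four distinct objects
print(len({id(s) for s in streams}))    # 4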
We can fix this bug with:

streams = [cuda.stream() for _ in range(nstreams)]
# Ensure they are all different
assert all(s1.handle != s2.handle for s1, s2 in zip(streams[:-1], streams[1:]))

The code above will also ensure they are really different streams, so it would have caught the bug had we still had it in the code. It does so by checking the stream pointer values.
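Note that comparing only adjacent pairs is enough to catch this particular bug (every element is the very same object), but it does not prove pairwise distinctness in general. A stricter check collects all handles into a set; this assumes each stream handle exposes an integer .value, which holds for Numba's ctypes-backed handles at the time of writing but is an implementation detail:

# Hypothetical stricter check: all stream handles must be pairwise distinct
handles = {s.handle.value for s in streams}
assert len(handles) == len(streams), "duplicate CUDA streams detected"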
Now we can run the fixed code with 1 stream and 8 streams for comparison. See Figures 7 and 8, respectively.
$ nsys profile \
   --trace cuda,osrt,nvtx \
   --gpu-metrics-device=all \
   --cuda-memory-usage true \
   --force-overwrite true \
   --output profile_run_v3_1stream \
   python run_v3.py -s 1

GPU 0: General Metrics for NVIDIA TU10x (any frequency)
Old sum: 100000000.000
New sum: 1.000
Generating '/tmp/nsys-report-de65.qdstrm'
[1/1] [========================100%] profile_run_v3_1stream.nsys-rep
Generated:
    /content/profile_run_v3_1stream.nsys-rep

$ nsys profile \
   --trace cuda,osrt,nvtx \
   --gpu-metrics-device=all \
   --cuda-memory-usage true \
   --force-overwrite true \
   --output profile_run_v3_8streams \
   python run_v3.py -s 8

GPU 0: General Metrics for NVIDIA TU10x (any frequency)
Old sum: 100000000.000
New sum: 1.000
Generating '/tmp/nsys-report-1fb7.qdstrm'
[1/1] [========================100%] profile_run_v3_8streams.nsys-rep
Generated:
    /content/profile_run_v3_8streams.nsys-rep
Again, both give correct results. Opening the one with 8 streams, we see that yes, the bug has been fixed (Figure 8): we now see 9 streams (8 created + default). In addition, we see that they are working at the same time! So we have achieved concurrency!

Unfortunately, if we dig a bit deeper we find that the concurrent code is not necessarily faster. On my machine the critical section of both versions, from the start of the memory transfer to the last GPU-to-CPU copy, takes around 160 ms.

A likely culprit is the warp occupancy. We notice that the warp occupancy is significantly better in the single-stream version. The gains in compute we are getting in this example are likely being lost by not occupying our GPU as efficiently. This is probably related to the structure of the code, which (artificially) calls way too many kernels. In addition, if all warps are filled by a single stream, there is no gain from concurrency, since the other streams have to sit idle until resources free up.

This example is important because it shows that our preconceived notions of performance are just hypotheses. They need to be verified.
At this point in APOD, we have assessed and parallelized (both through threads and through concurrency), so the next step would be to deploy. We also noticed a slight performance regression with concurrency, so for this example a single-stream version would likely be the one deployed. In production, the next step would be to pick the piece of code which is next best suited for parallelization and restart APOD.
In this article we saw how to set up, use and interpret the results of profiling Python code with NVIDIA Nsight Systems. C and C++ code can be analyzed very similarly, and indeed most of the material out there uses C and C++ examples.

We also showed how profiling allows us to catch bugs and performance-test our programs, ensuring that the features we introduce truly improve performance, and, if they do not, why.