How can I dissect different latencies with Nsight Systems?

Please take a look at the screenshot I’ve attached of a profiling result from Nsight Systems for a PyTorch program. There, you can see that a single launch of an element-wise operation is highlighted.

There are many types of latencies and overheads when a deep learning framework uses cuDNN/cuBLAS. To name the ones I know:

  1. kernel launch overhead (the latency of launching a grid),
  2. memory overhead (the latency of reading the input tensors and writing output to DRAM),
  3. CPU & GPU communication overhead (overhead of CPU enquqing a job to GPU)
  4. CPU overhead (just the sheer overhead of the wrappers around a CUDA kernel on the host code).

And I have questions regarding how to assign each type of latency to different parts of the kernel launch shown in the screenshot.

  1. Other than the 4 types of latencies mentioned, is there any other significant source of latency?
  2. How can I tell which proportion of the total duration of the kernel was due to which latency (even approximately would be fine)? For example, the blue bar in the GPU section indicates that for that duration actual computation was going on. Does this include memory read & write operations performed by the kernel, or is it purely the duration of arithmetic?
  3. In the CUDA API row, the “KernelPointwis…” bar is pretty long. What type of latency is included in this? Is it just the CPU & GPU communication latency?
  4. There is a big chunk of empty space after the computation is over. What is this latency?

Hi isaaclee2313,

I’m confirming a few definitions. I’ll get back to you ASAP.

Sorry for the delay.
Bob

Hey Bob,

Thanks so much! I’m planning on including this detail in my tutorial for the upcoming GTC 2020, so I would really appreciate it if I could get it sometime before 3/22.

Thanks again

There’s a lot to unpack in your definitions of latency and overhead in CUDA. It’s a complicated topic. I think about them as follows:

I would define “overhead” as the time it takes to perform some operation which you’d ideally want to take zero time, and this ends up limiting the rate at which you can do that operation. For example, consider the overhead of the CPU code for launching a CUDA kernel. If the launch API call takes 10 µs on your system, you will only be able to launch at most 100,000 kernels per second. So “CPU overhead” (your #4) would be the full duration of the kernel launch API call (the blue ranges on the CUDA API row). Also, “memory overhead” (your #2) would be the time range from the API call to enqueue the commands to copy the input data to the GPU’s memory until the copy is finished. There really isn’t any reason to treat “CPU/GPU communication” (#3) separately from “CPU overhead” (#4), because the preparation of the command buffer is most of the CPU overhead. Enqueuing the buffer into the FIFO for the GPU to dequeue takes very little time, and on Linux the GPU snoops on the FIFO memory and gets interrupted immediately to consume commands as they are enqueued.

Your #1 isn’t something I’d call overhead. I would define “latency” as the time between requesting an asynchronous task and beginning to execute it. For example, CUDA kernel launch latency could be defined as the time range from the beginning of the launch API call to the beginning of the kernel execution. I see about 20 µs of launch latency in your screenshot between the beginning of the launch call (in the “CUDA API” row) and the beginning of the kernel execution (in the “CUDA Tesla V100-SXM” row). Note that this definition includes the full overhead of the launch API call. We could also define launch latency as starting from the time inside the driver API call near the end where the command is actually enqueued for the GPU – this removes the CPU overhead from the definition of launch latency. These are both reasonable ways to define launch latency, so it’s important when discussing “latency” to be clear about which two events you are treating as the beginning and end of the latency.
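If you want to put rough numbers on these two measurements outside of the profiler, here is a minimal sketch (my own illustrative code, not from the original post; the kernel and sizes are made up). A host-side timer around the launch call approximates the CPU overhead of the launch, and CUDA events in the same stream approximate the GPU-side kernel duration. The launch latency itself, from the CPU timestamp to the GPU start, needs correlated CPU/GPU clocks, which is exactly what the Nsight Systems timeline gives you.

```
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;   // stand-in for an element-wise op
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);   // GPU timestamp just before the kernel in this stream

    // CPU overhead: wall-clock duration of the launch API call itself (it returns
    // long before the kernel finishes, because the launch is asynchronous).
    auto t0 = std::chrono::high_resolution_clock::now();
    dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    auto t1 = std::chrono::high_resolution_clock::now();

    cudaEventRecord(stop);    // GPU timestamp just after the kernel
    cudaEventSynchronize(stop);

    float kernelMs = 0.0f;
    cudaEventElapsedTime(&kernelMs, start, stop);  // approximate GPU-side kernel duration
    double launchUs = std::chrono::duration<double, std::micro>(t1 - t0).count();

    printf("launch API call (CPU overhead): %.2f us\n", launchUs);
    printf("kernel execution on the GPU:    %.3f ms\n", kernelMs);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```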

Latency is not always bad in asynchronous systems. Imagine you make 10 short API calls to launch a dependent sequence of 10 large kernels. You’d expect minimal latency from the first API call to the first kernel execution. But the 10th kernel in the sequence might have a very large launch latency by our definition, because the CPU enqueued the 10 launch commands and returned very quickly, while the GPU has to complete the first 9 kernels before it can start the 10th. The long gap between the 10th API call and the start of the 10th kernel’s execution would be seen as a high latency, but an intentional one. The point of using asynchronous launches from the CPU is to allow the CPU to send commands to the GPU and then perform other tasks while the GPU executes the commands. Note that the “memory overhead” can be pipelined with kernel launches, because the GPU can simultaneously be executing a kernel while uploading the input data for the next kernel and downloading the output from the previous kernel.
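Here is a minimal sketch of that scenario (illustrative only; the `step` kernel and sizes are made up, not from the post): the CPU enqueues ten dependent kernels into one stream and returns immediately, so the later kernels show a large launch latency on the timeline even though nothing is wrong.

```
#include <cuda_runtime.h>

__global__ void step(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;                  // stand-in for real work
}

int main() {
    const int n = 1 << 22;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // All 10 launch calls return in microseconds. On the Nsight Systems timeline the
    // 10th kernel starts only after the first 9 finish, so its "launch latency" is
    // large by the first definition above, yet the CPU was free the whole time.
    for (int k = 0; k < 10; ++k) {
        step<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
    }

    // ... the CPU could do unrelated work here while the GPU drains the stream ...

    cudaStreamSynchronize(stream);               // block only when the results are needed
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    return 0;
}
```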

  1. Other than the 4 types of latencies mentioned, is there any other significant source of latency?

CUDA CPU overhead – the CUDA code that enqueues a launch command for the GPU

  1. If the CUDA runtime API is used, it forwards the call to the CUDA driver API.
  2. The driver validates parameters, constructs command buffers containing parameter values and the launch command, and enqueues the commands into the GPU's FIFO.
  3. Any mutex-lock contention that occurs in the driver when launching from multiple threads.
  4. If the GPU's command FIFO is full, the launch API call blocks until there is room.
  5. Nsight Systems adds a bit of overhead in order to capture trace data. Events may appear longer in the timeline than they would actually take when the app runs without the tool.

Launch blockage – the time it takes for the GPU to retrieve the command from the FIFO and begin executing it

  1. The GPU may have a different context active, and require a context switch to get back to working on your application, for example if it is rendering the PC's desktop.
  2. GPU FIFO consumption may be blocked if it previously had a command to wait for outstanding work to complete.
  3. CUDA provides "streams" so you can asynchronously launch a sequence of kernels and memcpys that must execute in order. The GPU automatically waits for the prior item in a stream to complete before starting the next one.
  4. The GPU has an internal priority queue and may need to finish higher-priority kernels before it can start a lower-priority kernel (see the stream-priority sketch after this list).
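
If you want to experiment with the stream-priority case in item 4, a minimal sketch using the CUDA runtime’s stream-priority API might look like this (the kernel and sizes are illustrative, not from the original post):

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busy(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = p[i] * 1.0001f;
}

int main() {
    int leastPri, greatestPri;   // numerically, the "greatest" priority is the lowest value
    cudaDeviceGetStreamPriorityRange(&leastPri, &greatestPri);

    cudaStream_t lowPri, highPri;
    cudaStreamCreateWithPriority(&lowPri,  cudaStreamNonBlocking, leastPri);
    cudaStreamCreateWithPriority(&highPri, cudaStreamNonBlocking, greatestPri);

    const int n = 1 << 24;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    // When both streams have work pending, the scheduler prefers blocks from the
    // high-priority stream, so low-priority work may wait longer to start.
    busy<<<(n + 255) / 256, 256, 0, lowPri>>>(a, n);
    busy<<<(n + 255) / 256, 256, 0, highPri>>>(b, n);

    cudaDeviceSynchronize();
    printf("priority range: least=%d greatest=%d\n", leastPri, greatestPri);

    cudaStreamDestroy(lowPri);
    cudaStreamDestroy(highPri);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```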

GPU overhead – the time it takes for the GPU to begin executing the kernel once it is ready to run (this is typically negligible)

  1. Distribution of CUDA blocks across all the cores on the chip.
  2. Loading of the first instructions into instruction-cache and constants into constant-cache.

  2. How can I tell which proportion of the total duration of the kernel was due to which latency (even approximately would be fine)? For example, the blue bar in the GPU section indicates that for that duration actual computation was going on. Does this include memory read & write operations performed by the kernel, or is it purely the duration of arithmetic?

Nsight Systems is just a tool for seeing when events start and end on a timeline. All the work done by the kernel, i.e. arithmetic and memory access instructions, occurs within the blue bar. In order to understand what is happening inside the kernel, use Nsight Compute. That tool isolates individual kernels and does a deep-dive analysis on them – it will take longer to run, but you get more detailed information.

  3. In the CUDA API row, the “KernelPointwis…” bar is pretty long. What type of latency is included in this? Is it just the CPU & GPU communication latency?

This is CUDA CPU overhead – see the description of #1 above. You can see if you are hitting context-switch delays by collecting “GPU Context Switch” data – the context switch timeline row will be green when your context is active and other colors when it’s not. You can see if you are hitting mutex contention within the driver by collecting “OS Runtime” data, which will show the pthread_mutex_lock calls. If you submit tiny kernels in a very fast loop, you might fill up the GPU command FIFO and have to block until there is space available. If you are not experiencing any of those issues and the launch API call is still taking longer than expected, it is likely just tool overhead from capturing all the trace data.
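If you suspect the FIFO-full case, one rough way to check outside the profiler is to time every launch call in a tight loop of tiny kernels. A minimal sketch (illustrative, with a made-up empty kernel) is below: most calls should take a few microseconds, and occasional much longer calls suggest the launch was blocking.

```
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void tiny() {}   // does nothing; only the launch overhead matters here

int main() {
    const int launches = 100000;
    double worstUs = 0.0, totalUs = 0.0;

    for (int i = 0; i < launches; ++i) {
        auto t0 = std::chrono::high_resolution_clock::now();
        tiny<<<1, 1>>>();
        auto t1 = std::chrono::high_resolution_clock::now();
        double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
        totalUs += us;
        if (us > worstUs) worstUs = us;   // a blocked launch shows up as a large outlier
    }
    cudaDeviceSynchronize();

    printf("average launch call: %.2f us, worst: %.2f us\n", totalUs / launches, worstUs);
    return 0;
}
```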

  4. There is a big chunk of empty space after the computation is over. What is this latency?
    This is code in the higher-level libraries doing work in between making CUDA API calls. Any code in the cuDNN or cuBLAS layers could be filling this time. Also, given that you appear to be using Python, keep in mind that any transitions between Python and C will be quite slow.

Another example of a gap on the CPU timeline can be found when profiling the vectorAdd CUDA Toolkit example. See the screenshot from Nsight Systems below.

Just above the CUDA API timeline, the thread’s state indicates that it is busy, so the CPU is executing some other operations. In Nsight Systems, CPU sampling, OS Runtime API tracing, or adding NVTX instrumentation and tracing NVTX can help you figure out what the CPU is doing between CUDA API calls.

On the GPU timeline, the GPU might be executing another context (i.e. it might have context switched) or it might be idle while it waits for more work to be scheduled. Context switch data was collected in this report and shows that the GPU did not context switch during the time shown in this screenshot.

Sampling data was also collected. Notice the orange/yellow marks below the thread state timeline. Each mark represents the point when a CPU IP/backtrace sample was collected. When this screenshot was captured, the mouse (not shown) was hovering on the sampling mark just above the left side of the tooltip. The tooltip shows the CPU IP/backtrace for that thread at that moment. Looking at the vectorAdd source code, you can easily see the application was checking the results of the GPU’s calculation. To learn more about the NVIDIA Tools eXtension (NVTX) API, check out this link.
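As a concrete example of the NVTX instrumentation mentioned above, here is a minimal sketch in the spirit of vectorAdd (my own illustrative code, not the actual sample; the usual `-lnvToolsExt` link flag is assumed). The named ranges show up as rows in the Nsight Systems timeline, so the CPU-side result checking no longer appears as an unexplained gap between CUDA API calls.

```
#include <nvToolsExt.h>    // NVTX range API (header name may be nvtx3/nvToolsExt.h on newer toolkits)
#include <cuda_runtime.h>
#include <vector>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> hA(n, 1.0f), hB(n, 2.0f), hC(n);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, n * sizeof(float));
    cudaMalloc(&dB, n * sizeof(float));
    cudaMalloc(&dC, n * sizeof(float));

    nvtxRangePushA("upload inputs");
    cudaMemcpy(dA, hA.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    nvtxRangePop();

    nvtxRangePushA("vecAdd kernel");
    vecAdd<<<(n + 255) / 256, 256>>>(dA, dB, dC, n);
    nvtxRangePop();

    cudaMemcpy(hC.data(), dC, n * sizeof(float), cudaMemcpyDeviceToHost);

    // This CPU-only verification is the kind of work that shows up as a "gap" between
    // CUDA API calls; the NVTX range makes it visible and named on the timeline.
    nvtxRangePushA("verify results on CPU");
    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (hC[i] == 3.0f);
    nvtxRangePop();

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return ok ? 0 : 1;
}
```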

Below are some screenshots of the Nsight Systems timeline showing the important part of the vectorAdd CUDA Toolkit sample. Note the “dGPU (Quadro GP100)” row shows GPU context switching. During this time, the consistently-green “Run” range indicates the GPU did not switch away from vectorAdd’s context.

The picture below illustrates HtoD memory overhead in vectorAdd. Even though the cudaMemcpy API was used, this was an asynchronous HtoD copy from pageable memory, because the CPU function returns before the GPU work completes. Using asynchronous copy operations and overlapping kernel launches with copies allows you to hide the latency of the copies. “Latency hiding” is an important technique in GPU programming. Check out this link for more information about CUDA memcpy behavior.
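A minimal sketch of that latency-hiding pattern might look like the following (illustrative only; the kernel, sizes, and two-stream split are my own choices, not from the sample). Pinned host memory plus cudaMemcpyAsync lets the copy for one chunk overlap the kernel for the other chunk.

```
#include <cuda_runtime.h>

__global__ void scale(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 2.0f;
}

int main() {
    const int n = 1 << 22;
    const int half = n / 2;

    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));   // pinned memory: needed for truly asynchronous copies
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    // Each stream uploads its half, runs the kernel, and downloads the result.
    // The upload in one stream can overlap the kernel in the other stream.
    for (int k = 0; k < 2; ++k) {
        float *hChunk = h + k * half;
        float *dChunk = d + k * half;
        cudaMemcpyAsync(dChunk, hChunk, half * sizeof(float), cudaMemcpyHostToDevice, s[k]);
        scale<<<(half + 255) / 256, 256, 0, s[k]>>>(dChunk, half);
        cudaMemcpyAsync(hChunk, dChunk, half * sizeof(float), cudaMemcpyDeviceToHost, s[k]);
    }

    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);

    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```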

The picture below, which shows the gap between the kernel launch API call and the execution of the kernel on the GPU, illustrates Nsight Systems overhead, typically less than a microsecond as seen here. It also shows that in this case the GPU takes about a microsecond to begin executing the kernel after the launch API call has finished.

The picture below illustrates CPU overhead – the full duration of the launch API call.

The next picture shows the matrixMul CUDA toolkit sample, where the CPU enqueues a memcpy (in red) and kernel (in blue) into a stream, and then calls cudaStreamSynchronize (in green) to block until the enqueued work in that stream completes. Note that kernel launch latency here is due to the GPU having to execute the stream tasks in order, so the kernel execution doesn’t begin until the preceding memcpy finishes. Also note that the long call to cudaStreamSynchronize shows the CPU waiting for the GPU work to complete. The CPU could have done other work before calling cudaStreamSynchronize, enabling the CPU and GPU to execute in parallel.
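A minimal sketch of that restructuring (illustrative names, not the actual matrixMul source): enqueue the copy and the kernel, do independent CPU work, and only then block in cudaStreamSynchronize.

```
#include <cuda_runtime.h>

__global__ void compute(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = p[i] * p[i];
}

void unrelatedCpuWork() { /* e.g. prepare the next batch of input data */ }

int main() {
    const int n = 1 << 22;
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));           // pinned, so the async copy really is asynchronous
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, stream);  // the "red" range
    compute<<<(n + 255) / 256, 256, 0, stream>>>(d, n);                        // the "blue" range

    // If we synchronized right here, the timeline would show a long cudaStreamSynchronize
    // with the CPU idle while the GPU works.
    unrelatedCpuWork();                              // instead, overlap CPU work with the GPU work

    cudaStreamSynchronize(stream);                   // the "green" range: now much shorter
    cudaStreamDestroy(stream);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```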

Regarding the kernel launch latency, matrixMul also shows the CPU launching a sequence of kernels asynchronously without waiting for them, and the GPU executing them in order. The next three pictures show how the kernel launch latency increases for each launch, because the CPU API calls are short, and the GPU kernel executions are longer. This increasing kernel launch latency does not indicate inefficiency – it shows the CPU getting ahead of the GPU, so the CPU will be free to do other tasks while the GPU executes work in the queue. These pictures show that matrixMul is doing a good job of keeping work queued up for the GPU – there are no gaps in GPU work for that part of the timeline. The first challenge of GPU programming is avoiding CPU bottlenecks and keeping the GPU busy – you want the GPU to be the bottleneck. Once you have optimized the CPU part of your program to where Nsight Systems shows the GPU is the bottleneck, it’s time to use Nsight Compute to profile individual kernels and investigate how to make them more efficient, or to use faster/more GPUs to make the GPU part of the program execute faster.



Special thanks to my colleague JasonC for help in answering these questions.
