CUDA kernels of a PyTorch model on AGX Orin have huge time gaps

Hello! I have been profiling a deep learning model with Nsight Systems on the AGX Orin (latest JetPack installed). A snapshot of the results is below. The convolutions prior to the Forward_rest_sliced NVTX region look good and utilize the GPU seamlessly. However, in the second half of the Forward_rest_sliced region, where I run multiple convolutions one after another without any blocking calls, the gaps between the kernels are huge. I did the profiling without setting the CUDA_LAUNCH_BLOCKING flag, but I manually synchronize CUDA in the code before and after the mentioned NVTX region. What could be the cause of this, and how can I solve it?
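In case it helps, the pattern in my code looks roughly like this. This is a simplified sketch; the actual network, input sizes, and slicing logic are different.

```python
import torch

# Stand-in for the convolutions launched inside the region; the real network is larger.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1),
    torch.nn.Conv2d(16, 16, 3, padding=1),
).cuda().eval()
x = torch.randn(1, 3, 544, 960, device="cuda")

torch.cuda.synchronize()                        # manual sync before the region
torch.cuda.nvtx.range_push("Forward_rest_sliced")
with torch.no_grad():
    out = model(x)                              # convolutions launched back to back, no blocking calls
torch.cuda.nvtx.range_pop()
torch.cuda.synchronize()                        # manual sync after the region
```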

Would it be possible for you to give us the .nsys-rep file associated with the screenshot? It is a little hard to tell much from a screenshot alone. If not, we will see what we can.

Sure, here it is. By the way, there is an update: I realized that there are cudaStreamSynchronize calls happening in that region. I am not sure why that is happening; PyTorch shouldn’t do that as far as I know.
report1.nsys-rep (45.0 MB)

@skottapalli could you look into this?

Thanks for sharing the report, Ahmet. From what I can see in the report file, there is a cudaMemcpyAsync call on the CPU just before the second half of the Forward_rest_sliced region. The call is supposed to be asynchronous and return within a few microseconds, but in your report it takes 4.3 ms because it is waiting for the memcpy operation to complete on the GPU (see screenshot 1). This happens because the device-to-host memcpy targets pageable memory (see screenshot 2). That slows the CPU thread down in feeding work to the GPU, i.e., it causes GPU starvation. See the “Pinned host memory” section under How to Optimize Data Transfers in CUDA C/C++ | NVIDIA Technical Blog for more explanation of why using pageable memory causes this.

You could change the PyTorch settings to use pinned memory and see if that helps your CPU thread avoid starving the GPU. See When to set pin_memory to true? - vision - PyTorch Forums
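As a rough illustration of the difference (not your exact code), the point is to copy into a pinned host buffer so the device-to-host transfer can stay asynchronous with respect to the CPU thread:

```python
import torch

src = torch.randn(1, 64, 128, 128, device="cuda")

# Pageable destination: the underlying cudaMemcpyAsync can block the CPU
# thread until the copy completes.
pageable_dst = src.to("cpu")

# Pinned destination: the copy can proceed asynchronously while the CPU
# thread keeps launching GPU work.
pinned_dst = torch.empty_like(src, device="cpu").pin_memory()
pinned_dst.copy_(src, non_blocking=True)
# ... queue more GPU work here ...
torch.cuda.synchronize()  # ensure the copy has landed before reading pinned_dst
```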

Hi! Thanks for the feedback. I changed the code and no longer see such a memory copy happening, but the problem persists. Could you take a look at the newly attached file? I don’t understand why there is a considerable time gap between the kernels executed in the Forward_rest_sliced region. I call torch.cuda.synchronize() before and after executing that NVTX region, and CUDA_LAUNCH_BLOCKING is set to 0.
problem.nsys-rep (24.3 MB)

I can’t tell from the report what the OS threads are doing between kernel launches in the NVTX region Forward_rest_sliced. Can you try turning on OSRT tracing and enabling the NVTX annotations in PyTorch so that we get more visibility into what is happening on the threads?

Are the synchronize calls before and after the NVTX region necessary? They look like they may be contributing to the gaps on the GPU during the region. If you are doing this just to map the NVTX region to the kernels it launched, there is a better way to see that: check the NVTX row among the GPU timeline rows.

The CUDA kernel coverage reaches close to the ideal 99% just before the synchronize call. The sync call forces the CPU thread to wait for all work to complete on the GPU; only then does it start feeding work to the GPU again, and it is not launching kernels fast enough to fill the GPU’s capacity.

See Automatic differentiation package - torch.autograd — PyTorch 1.12 documentation for turning on the NVTX annotations that are built into PyTorch. The page talks about the old tool nvprof; disregard that and use nsys instead. Turning this on does add noticeable overhead because of the numerous annotations, but it should give more insight into what the CPU threads are doing between kernel launches.
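For reference, a minimal way to turn this on is the torch.autograd.profiler.emit_nvtx context manager; the nsys invocation in the comment is just one example of the trace options mentioned above.

```python
import torch

model = torch.nn.Conv2d(3, 16, 3, padding=1).cuda().eval()
x = torch.randn(1, 3, 544, 960, device="cuda")

# Every op PyTorch executes inside this context gets its own NVTX range,
# so the gaps on the CPU thread become attributable to specific ops.
with torch.no_grad(), torch.autograd.profiler.emit_nvtx():
    out = model(x)

# Then profile with something like:
#   nsys profile --trace=cuda,nvtx,osrt python script.py
```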

Thank you so much! I did what you recommended, and the time gaps disappeared when I removed the synchronization calls before and after the Forward_rest_sliced region. I had added that synchronization so I could view the CUDA HW row and the NVTX row in a synchronized manner. nsys cannot trace the program when I enable either OSRT or cuDNN tracing, but the mystery is solved nevertheless. I also learned that masking operations force CPU-GPU synchronization.
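For anyone landing here later, the masking point is easy to reproduce: boolean-mask indexing has a data-dependent output size, so PyTorch has to synchronize with the GPU to learn it. A small sketch (not from the original model):

```python
import torch

x = torch.randn(1_000_000, device="cuda")
mask = x > 0

# The number of selected elements is only known on the GPU, so this indexing
# implicitly synchronizes the CPU with the device before returning.
selected = x[mask]

# A shape-preserving alternative such as torch.where keeps the launch asynchronous.
zeroed = torch.where(mask, x, torch.zeros_like(x))
```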
