Awkward CUDA function call patterns under Nsight timeline

sediloh426 · March 24, 2021, 2:41am

Hi All,

My setup is a 1080 Ti, CUDA 11.1, TensorFlow 1.15, and Nsight Visual Studio Edition 2020.3.

I was investigating a difference in inference speed between CUDA 10.2 and CUDA 11.1 (preliminary tests showed CUDA 11.1 to be almost 50% slower) by looking at the timelines of a simple TensorFlow model inference in Nsight Visual Studio Edition. Along the way I noticed some odd behavior that shows up with both versions.

Here are the observations common to both CUDA 10.2 and 11.1:

1. The “Compute” row seems to show the time spent at the GPU device level. Yet there are idle periods in which no thread or device in the profile is busy. For example, in the attached screenshot there is an idle gap of about 0.84 ms between the kernels “EigenMetaKernel” and “ShuffleInTensor3Simple”.
2. There seem to be delays between work executing at the “Compute” level and the levels above it. cudaLaunchKernel calls at the “Runtime API” level show up at the “Compute” level after long lags, and “Driver API” memory copy calls are executed before the “Compute” level operations are complete.
3. The “Runtime API” row also has many idle periods in which no thread or device in the whole profile is busy.
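
To put numbers on idle gaps like these, something along the following lines could be run over kernel start/end times exported from the timeline. This is only a sketch: the `(name, start_ms, end_ms)` tuple layout and the `idle_gaps` helper are assumptions for illustration, not an Nsight export format, and the timestamps in the example are made up so that the gap matches the roughly 0.84 ms seen in the screenshot.

```python
from typing import List, Tuple

# Hypothetical record layout: (kernel name, start in ms, end in ms).
Event = Tuple[str, float, float]


def idle_gaps(events: List[Event], min_gap_ms: float = 0.0) -> List[Tuple[str, str, float]]:
    """Return (previous kernel, next kernel, gap in ms) for each idle stretch.

    Overlapping events are handled by tracking the latest end time seen so far,
    so a gap is only reported when nothing at all is running.
    """
    if not events:
        return []
    ordered = sorted(events, key=lambda e: e[1])  # sort by start time
    last_name, _, busy_until = ordered[0]
    gaps = []
    for name, start, end in ordered[1:]:
        gap = start - busy_until
        if gap > min_gap_ms:
            gaps.append((last_name, name, gap))
        if end > busy_until:
            # This event now defines the end of the busy period.
            busy_until = end
            last_name = name
    return gaps


# Made-up timestamps; only the kernel names and the ~0.84 ms gap come from the profile.
example_gaps = idle_gaps([
    ("EigenMetaKernel", 0.00, 0.10),
    ("ShuffleInTensor3Simple", 0.94, 1.00),
])
for prev_kernel, next_kernel, gap_ms in example_gaps:
    print(f"{prev_kernel} -> {next_kernel}: idle {gap_ms:.2f} ms")
```

Summing the gaps per run for CUDA 10.2 and 11.1 would show whether the extra time sits in the gaps or in the kernels themselves.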

I cannot explain these observations, even when cross-checking the timelines against TensorFlow’s RunMetadata logging. Am I misinterpreting the profile? Could anyone help explain some of these behaviors?

njuffa · March 24, 2021, 5:26am

A reproducible application-level slowdown of existing code by 20% or more due solely to a change in CUDA version is an excellent reason to file a bug with NVIDIA. The reproducer code should be as simple as possible, to minimize the back-and-forth between NVIDIA’s “intake team” and the filer over the repro.
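
As a rough illustration of what the timing part of such a reproducer might look like, here is a minimal sketch. The `time_inference` helper and the `run_inference` callable are hypothetical names, not part of TensorFlow or CUDA. The callable stands in for whatever executes one inference (for example a wrapper around a session run), and it is assumed to block until the results are back on the host; otherwise the wall-clock numbers would not include the GPU work.

```python
import statistics
import time
from typing import Callable, Dict


def time_inference(run_inference: Callable[[], object],
                   warmup: int = 10,
                   iters: int = 100) -> Dict[str, float]:
    """Time a blocking inference callable and return summary stats in ms."""
    # Warm-up runs absorb one-time costs (context creation, autotuning, caches).
    for _ in range(warmup):
        run_inference()

    samples_ms = []
    for _ in range(iters):
        t0 = time.perf_counter()
        run_inference()  # assumed to return only once the result is available
        samples_ms.append((time.perf_counter() - t0) * 1e3)

    return {
        "median_ms": statistics.median(samples_ms),
        "min_ms": min(samples_ms),
        "max_ms": max(samples_ms),
    }


if __name__ == "__main__":
    # Placeholder workload so the sketch runs on its own; swap in the real model call.
    stats = time_inference(lambda: sum(range(10_000)), warmup=5, iters=50)
    print(stats)
```

Running the same script unchanged under both CUDA versions and reporting the medians side by side gives the bug report a single number to compare.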
