Awkward CUDA function call patterns under Nsight timeline

Hi All,

My setup is a 1080 Ti, CUDA 11.1, TensorFlow 1.15, and Nsight Visual Studio Edition 2020.3.

I was investigating an inference speed difference between CUDA 10.2 and CUDA 11.1 (preliminary tests showed CUDA 11.1 to be almost 50% slower) by examining the timelines of a simple TensorFlow model inference in Nsight Visual Studio Edition. In the process I noticed some behaviors I cannot explain.

Here are some of the observations I found common in both CUDA 10.2 and 11.1:

  1. The “Compute” row seems to be the timeline of activity at the GPU device level. Somehow it contains idle periods during which no thread or device anywhere in the profile is busy. For example, in the attached screenshot there is an idle gap of about 0.84 ms between the “EigenMetaKernel” and “ShuffleInTensor3Simple” kernels.
  2. There seem to be delays between functions at the “Compute” level and the levels above it. cudaLaunchKernel calls at the “Runtime API” level show up at the “Compute” level only after long lags, and “Driver API” memory copy calls start before the corresponding “Compute”-level operations are complete.
  3. The “Runtime API” timeline also has many idle periods during which no thread or device in the whole profile is busy.

I cannot explain these observations, even after cross-checking the timelines against TensorFlow’s RunMetadata logging. Am I misinterpreting the profile? Could anyone help explain these behaviors?
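For context on how I measured the gaps: idle periods like the 0.84 ms one can be computed directly from a Chrome-trace JSON timeline (the format tf.python.client.timeline exports from RunMetadata). A minimal stdlib-only sketch — the event fields `ph`, `ts`, `dur` (microseconds) follow the Chrome trace format:

```python
def idle_gaps(trace_events, min_gap_us=100.0):
    """Find idle gaps between complete ('ph': 'X') events in a Chrome
    trace, e.g. json.load(open(path))["traceEvents"] for a timeline
    exported from RunMetadata. Times are in microseconds; returns a
    list of (gap_start_us, gap_length_us) tuples."""
    # Keep complete events that carry a start ('ts') and duration ('dur').
    spans = sorted(
        (e["ts"], e["ts"] + e["dur"])
        for e in trace_events
        if e.get("ph") == "X" and "dur" in e
    )
    gaps, busy_until = [], None
    for start, end in spans:
        if busy_until is not None and start - busy_until >= min_gap_us:
            gaps.append((busy_until, start - busy_until))
        busy_until = end if busy_until is None else max(busy_until, end)
    return gaps

# Example with made-up timestamps mimicking observation 1:
events = [
    {"ph": "X", "name": "EigenMetaKernel", "ts": 0, "dur": 100},
    {"ph": "X", "name": "ShuffleInTensor3Simple", "ts": 940, "dur": 50},
]
print(idle_gaps(events))  # [(100, 840)] -> an 840 us (0.84 ms) idle gap
```

This treats the union of all event spans as “busy”, so a reported gap really is a period in which nothing in the trace is running.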

A reproducible application-level slowdown of 20% or more in existing code, due solely to a change in CUDA version, is an excellent reason to file a bug with NVIDIA. Keep the reproducer code as simple as possible to minimize the back-and-forth between NVIDIA’s “intake team” and the filer on the repro.
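As a starting point for such a reproducer, a minimal stdlib-only timing harness like the sketch below (names are placeholders; `fn` stands in for whatever session.run call is being measured) lets you report per-iteration latency from identical code under both CUDA versions:

```python
import time
import statistics

def benchmark(fn, warmup=10, iters=100):
    """Time fn() over `iters` runs after `warmup` untimed runs, and
    return (median_ms, stdev_ms). fn is a placeholder for the inference
    call being measured, e.g. a closure over session.run."""
    for _ in range(warmup):
        fn()  # untimed: JIT, autotuning, lazy CUDA context creation, etc.
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)  # ms
    return statistics.median(samples), statistics.stdev(samples)
```

Running the same harness under CUDA 10.2 and 11.1 gives a like-for-like comparison to attach to the bug report, and the median is less sensitive than the mean to one-off stalls like the idle gaps described above.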