Missing kernels in NSight Profiling

Some kernels do not seem to be profiled properly when profiling with NSight. They do not show up in the kernel launches, source view, etc. when running “Profile CUDA Application”. The output console does report that it is profiling these kernels, and the same kernels do appear in the kernel launches of “Trace Application”. Using a filter (Kernels to Profile) makes no difference. To me this looks like a bug or a CUDA compatibility issue. Has anyone else experienced this?

Windows 7 64 bit Enterprise
Visual Studio 2013 (64 bit C++ project)
CUDA 7.5
Tesla K40

Does it behave as described in: https://devtalk.nvidia.com/default/topic/831171/kernel-profiling-missing/#4611770?

If so, there was a bug identified and they were working on it. As stated in the posting, it seems some of the symptoms might be avoidable.

Thanks for your reply. I might have misunderstood the problem you are experiencing, but we do not have any asynchronous GPU code, nor do we use multiple streams. We had this issue with the previous CUDA version we used (CUDA 5.5) as well, and it is still present after updating to CUDA 7.5. It is odd that the console reports profiling for each kernel that is profiled, yet those kernels do not appear in the NSight results (other than in Trace Application).

My issue may not be related. I am not sure that asynchronous code or multiple streams are necessary to trigger it. It was discovered that profile data could be missing within any of the streams.

The workaround I found came down mostly to how I was using the profiler. Our application takes a long time to reach kernel execution (up to minutes to load and get started). My habit was to launch the profiler, immediately cancel the profile (to avoid minutes of empty profile data), load the rest of the program, and then select start to continue capturing profile data.

It turns out that if I just let the profiler run (and do not “pause” it), most of the profile data is recorded and reported.

The actual behaviour I was seeing was as follows: I had a sequence of kernels (ignore the fact that it was asynchronous or used multiple streams): A, B, C, D, E, F

The launch behaviour for a given stream would be as follows:

A, B, C, D, E, F,     A, B, C, D, E, F,     A, B, C, D, E, F,     A, B, C, D, E, F,     A, B, C, D, E, F,     A, B, C, D, E, F, ...

but the profile report would show something like this (or even much sparser):

A,  ,  ,  , E,  ,      ,  ,  ,  ,  , F,      ,  ,  ,  ,  ,  ,     A, B, C,    , F,      ,  ,  , D,  , F,      ,  ,  ,  ,  , F, ...

These issues are not related. The code for Profile mode (abort’s problem) is completely different than for Trace mode (krazanmp’s problem).

Abort, I’ve never seen Profiling mode miss kernels like this, i.e. you are seeing the console show that profiling is in fact running on them, but they are missing from the report. A few questions:

  • Are you getting any kernel launches in the report at all? Or none?
  • Just to confirm, you say when you run Trace mode you see the kernels show up in the CUDA Kernel Launches page, but when you run Profiling mode they do not?
  • That’s good that you see the console output showing profiling progress. If you are seeing some kernels in the report but not all, can you confirm you ARE seeing console output for kernels that DO NOT show up in the report? The console output includes the kernel’s name.
  • Are you sure your kernels are working? That is, they are not causing a launch failure or a failed assert() call?
  • Can you send me your report? If you right-click the tab for the report document in Visual Studio and hit Open Containing Folder, you can zip up all the files in that directory and attach the zip file here, and I will be able to step-debug Nsight loading your report.
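
On the launch-failure question above, here is a minimal sketch of one way to check it from the host side. It is not taken from the application discussed in this thread: exampleKernel and the CHECK_CUDA macro are hypothetical names used only for illustration, and the kernel body is a placeholder. The idea is to call cudaGetLastError() right after the launch and then cudaDeviceSynchronize(), so that both launch-time errors and errors raised while the kernel runs get reported.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for one of the kernels missing from the report.
__global__ void exampleKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = 2.0f * data[i];
    }
}

// Hypothetical helper: print the CUDA error string and exit if a runtime call fails.
#define CHECK_CUDA(call)                                           \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s failed: %s\n",              \
                    __FILE__, __LINE__, #call,                     \
                    cudaGetErrorString(err_));                     \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

int main()
{
    const int n = 1 << 20;
    float *d_data = nullptr;
    CHECK_CUDA(cudaMalloc(&d_data, n * sizeof(float)));
    CHECK_CUDA(cudaMemset(d_data, 0, n * sizeof(float)));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    exampleKernel<<<blocks, threads>>>(d_data, n);

    // Reports launch-time problems such as an invalid launch configuration.
    CHECK_CUDA(cudaGetLastError());
    // Waits for the kernel and reports errors raised during execution
    // (for example an illegal memory access or a failed device-side assert()).
    CHECK_CUDA(cudaDeviceSynchronize());

    CHECK_CUDA(cudaFree(d_data));
    printf("exampleKernel completed without errors\n");
    return 0;
}
```

If a check like this passes for every kernel that is missing from the report, a failing kernel is probably not the explanation. Note that the synchronize call after each launch is for diagnosis only, since it serializes host and device.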