Nsight Performance Analysis stalls when using cudaStreamSynchronize

Hi All

I am trying to profile my program using the Nsight Performance Analysis tool.

My system is:
HP ZBook with P2000 GPU
CUDA 10.1
Visual Studio 2019 ver. 16.4.4
Nsight 2019.3
Display Driver Ver. 442.19

Running my application on its own, or from inside VS2019 without the profiler, works perfectly and behaves as expected.

I run the analyzer with the System and CUDA trace settings enabled, using just the default sub-selections.

The problem occurs when my program calls cudaStreamSynchronize. If I enable the trace setting under CUDA->Kernel Launches and Memory Operations, the profiler stalls my program. Sometimes it logs the call to the stream-synchronize function, but mostly it does not.

Partial workarounds:

  1. Don’t log the kernels and memory operations (but I really need those).
  2. Replace cudaStreamSynchronize with polling: call cudaStreamQuery, then std::this_thread::sleep_for(100us), until the stream is done (this impacts performance!).
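
For reference, the second workaround can be sketched roughly as follows. This is only a minimal illustration, not my exact code: poll_with_sleep is a hypothetical helper name, and the "is the stream done?" check is passed in as a callable so the snippet compiles without the CUDA headers. In the real program the callable wraps cudaStreamQuery, which returns cudaErrorNotReady while work is still pending.

```cpp
#include <chrono>
#include <cstddef>
#include <thread>

// Hypothetical helper (not part of CUDA): polls `is_done` until it returns
// true, sleeping for `interval` between polls instead of blocking inside
// cudaStreamSynchronize. Returns the number of polls that were made.
template <typename Query>
std::size_t poll_with_sleep(Query is_done, std::chrono::microseconds interval)
{
    std::size_t polls = 1;
    while (!is_done()) {
        // Give up the CPU for a while; this is the step that costs performance.
        std::this_thread::sleep_for(interval);
        ++polls;
    }
    return polls;
}

// Assumed usage with the CUDA runtime (needs <cuda_runtime.h> and a valid stream):
//
//   poll_with_sleep(
//       [&] { return cudaStreamQuery(stream) != cudaErrorNotReady; },
//       std::chrono::microseconds(100));
```

The comparison against cudaErrorNotReady (rather than cudaSuccess) is a choice made for this sketch so that the loop also ends if the stream reports an error; the error itself would still need to be checked afterwards.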

Non-workarounds (these did not help):

  1. Replace cudaStreamSynchronize with cudaStreamQuery followed by std::this_thread::yield(). That this does not help is quite surprising.
  2. Run Visual Studio as administrator.

My suspicion is that the profiler needs the thread running the program to “let go”, but I don’t know why that would be.

Can anybody help me with this?

EDIT: If I don’t log at start, this works as expected!