I have an application where multiple CUDA streams are used to achieve more concurrency.
Nsight Systems doesn’t show any information about kernels running in the streams (nor in the default stream); it reports that memory operations (memset in my case) take 100% of the time, which is obviously wrong.
Nsight Systems shows the following (note that there are no warnings or errors):
The output is the same whether I profile via the Nsight Systems GUI or with the “nsys profile -t cuda ./myapp” command and then import the report file into the GUI.
Ubuntu 18.04, GeForce RTX 2070 (the same situation is on Tesla V100), Driver Version: 418.67, CUDA Version: 10.1.
UPDATE: the same situation occurs with an app that uses the default stream for all the calculations (one of the older versions of the app). So multiple streams are not the cause; kernels are simply not being traced.
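For reference, here is a minimal sketch of the pattern my app uses (kernel and buffer names are made up for illustration, not the actual application code): an async memset followed by a kernel launch in each stream. The memset shows up in the Nsight Systems timeline, but the kernels launched right after it do not.

```cuda
// Minimal multi-stream repro sketch. Names are illustrative only.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void dummyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    const int kStreams = 4;
    cudaStream_t streams[kStreams];
    float *buf[kStreams];

    for (int s = 0; s < kStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buf[s], n * sizeof(float));
        // This memset appears in the timeline...
        cudaMemsetAsync(buf[s], 0, n * sizeof(float), streams[s]);
        // ...but this kernel, queued in the same stream, is not traced.
        dummyKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(buf[s], n);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < kStreams; ++s) {
        cudaFree(buf[s]);
        cudaStreamDestroy(streams[s]);
    }
    printf("done\n");
    return 0;
}
```

Profiled with “nsys profile -t cuda ./repro”; only the memset rows are populated.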