When profiling a CUDA application with NSIGHT, we find that the activity in some streams is not always shown on the timeline. Depending on the run, it may or may not appear.
We use events with timing disabled to synchronize different streams without blocking the CPU thread, in an iterative application. We repeat the same set of kernels and transfers every N milliseconds.
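For reference, the synchronization pattern looks roughly like the sketch below. It is a minimal illustration and not our production code: the kernels `producer` and `consumer`, the stream names, and the buffer size are placeholders chosen for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernels (hypothetical); the real application runs other work.
__global__ void producer(int *data) { data[threadIdx.x] = threadIdx.x + 1; }
__global__ void consumer(int *data) { data[threadIdx.x] *= 2; }

int main() {
    const int n = 32;  // assumed buffer size, for illustration only
    int *d_data = NULL;
    int h_data[n];
    cudaMalloc(&d_data, n * sizeof(int));

    cudaStream_t streamA, streamB;
    cudaStreamCreate(&streamA);
    cudaStreamCreate(&streamB);

    // The event is used only for ordering, so timing is disabled.
    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

    producer<<<1, n, 0, streamA>>>(d_data);
    cudaEventRecord(done, streamA);

    // streamB waits for the event on the GPU side;
    // the CPU thread returns immediately from this call.
    cudaStreamWaitEvent(streamB, done, 0);
    consumer<<<1, n, 0, streamB>>>(d_data);

    // Download the result in streamB, then wait for that stream only.
    cudaMemcpyAsync(h_data, d_data, n * sizeof(int),
                    cudaMemcpyDeviceToHost, streamB);
    cudaStreamSynchronize(streamB);
    printf("h_data[0] = %d\n", h_data[0]);

    cudaEventDestroy(done);
    cudaStreamDestroy(streamA);
    cudaStreamDestroy(streamB);
    cudaFree(d_data);
    return 0;
}
```

The only host-side wait is the final `cudaStreamSynchronize`; all ordering between the two streams goes through the event.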
In the image above, streams 10 and 12 are executing kernels, but they are not shown on the timeline.
We know that the kernels are executing because of the following:
1 We use a device pointer with 13 values that we set to 0 at every iteration with a cudaMemset.
2 This device pointer is only modified by the kernels that we don’t see on the NSIGHT timeline.
3 Each of these kernels writes to a different position of the array.
4 After the kernel execution and synchronization, we download the results to a host pointer.
5 After downloading, we print all the results that are equal to 0.
6 Zeros are printed only for the first two iterations, which makes sense because the algorithm is starting up. In all the remaining iterations, none of the values are 0.
7 The only way these values can be different from 0 is if the kernels that NSIGHT is not showing are executing and modifying them.
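The check described in the steps above can be sketched as follows. This is a simplified stand-in under stated assumptions, not our actual code: the kernel `writeSlot`, the use of two worker streams, the three iterations, and the extra `cleared` event that orders the memset before the kernels are all hypothetical choices made for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NUM_VALUES 13   // size of the device array described above
#define NUM_STREAMS 2   // assumption: two worker streams
#define NUM_ITERS 3     // assumption: a few iterations for the demo

// Placeholder kernel (hypothetical): writes a non-zero value to one position.
__global__ void writeSlot(int *values, int slot) { values[slot] = slot + 1; }

int main() {
    int *d_values = NULL;
    int h_values[NUM_VALUES];
    const size_t bytes = NUM_VALUES * sizeof(int);
    cudaMalloc(&d_values, bytes);

    cudaStream_t mainStream, workers[NUM_STREAMS];
    cudaEvent_t cleared, done[NUM_STREAMS];
    cudaStreamCreate(&mainStream);
    cudaEventCreateWithFlags(&cleared, cudaEventDisableTiming);
    for (int s = 0; s < NUM_STREAMS; ++s) {
        cudaStreamCreate(&workers[s]);
        cudaEventCreateWithFlags(&done[s], cudaEventDisableTiming);
    }

    for (int iter = 0; iter < NUM_ITERS; ++iter) {
        // Step 1: reset the array to 0.
        cudaMemsetAsync(d_values, 0, bytes, mainStream);
        cudaEventRecord(cleared, mainStream);

        // Steps 2-3: each kernel writes a different position,
        // spread over the worker streams, after the reset has finished.
        for (int s = 0; s < NUM_STREAMS; ++s)
            cudaStreamWaitEvent(workers[s], cleared, 0);
        for (int i = 0; i < NUM_VALUES; ++i)
            writeSlot<<<1, 1, 0, workers[i % NUM_STREAMS]>>>(d_values, i);

        // Step 4: wait for the workers on the GPU side, then download.
        for (int s = 0; s < NUM_STREAMS; ++s) {
            cudaEventRecord(done[s], workers[s]);
            cudaStreamWaitEvent(mainStream, done[s], 0);
        }
        cudaMemcpyAsync(h_values, d_values, bytes,
                        cudaMemcpyDeviceToHost, mainStream);
        cudaStreamSynchronize(mainStream);

        // Step 5: print every value that is still 0.
        for (int i = 0; i < NUM_VALUES; ++i)
            if (h_values[i] == 0)
                printf("iteration %d: value %d is 0\n", iter, i);
    }

    for (int s = 0; s < NUM_STREAMS; ++s) {
        cudaEventDestroy(done[s]);
        cudaStreamDestroy(workers[s]);
    }
    cudaEventDestroy(cleared);
    cudaStreamDestroy(mainStream);
    cudaFree(d_values);
    return 0;
}
```

In this simplified version every position is written in every iteration, so any printed zero would mean that a kernel in a worker stream did not run.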
These kernels do not appear in the kernel launch section of NSIGHT either.
NSIGHT does visualize the kernel launches if we execute all of them in a single stream.
We would like to know if this issue is known.
Our development system has the following configuration:
Windows 10 Pro, Version 1607, OS Build 14393.693
Visual Studio 2012
GPU Driver version 376.62
CUDA SDK 7.5
Two NVIDIA Quadro M4000
The issue is happening on the second GPU, which has no screens plugged in and no rendering active.