I am using Nsight Visual Studio Edition 3.0 for remote analysis. The target system runs Windows Server 2012 and contains two Tesla K10 cards. The host system runs Windows 7 with Visual Studio 2008. Both target and host are 64-bit.
To analyse my kernel, I ran one of my test apps on the remote machine and selected 12 experiments to be run on the second (in call order) of my two kernels. I select that kernel using the “Kernels to Profile” input field in the experiment settings of the nvact window. One of the experiments is “Instruction Count” from the “Source-Level Experiments” group. The activity type was “Profile CUDA Application”.
The analysis runs fine, but the results contain something strange: the “CUDA Instruction Count” experiment reports code lines that should never be reached by the kernel being analysed, only by the first kernel. The code line in question is the return statement of a device function that is called by the first kernel.
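A minimal sketch of the situation described above may make it easier to reproduce. It is not the actual code: the names `scale`, `firstKernel`, and `secondKernel` are hypothetical, and it only assumes that the device function is called by the first kernel and not by the profiled one.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical device function, called only by firstKernel.
__device__ float scale(float x)
{
    float y = 2.0f * x;
    return y;  // the kind of line that shows up in the results for secondKernel
}

// First kernel in call order; not selected in "Kernels to Profile".
__global__ void firstKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = scale(data[i]);
}

// Second kernel in call order; the one being profiled. It never calls scale().
__global__ void secondKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

int main()
{
    const int n = 256;
    float host[n];
    for (int i = 0; i < n; ++i)
        host[i] = static_cast<float>(i);

    float* dev = 0;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    // Launch both kernels in the same order as in the test app.
    firstKernel<<<(n + 127) / 128, 128>>>(dev, n);
    secondKernel<<<(n + 127) / 128, 128>>>(dev, n);

    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("host[1] = %f\n", host[1]);
    return 0;
}
```

Compiled for sm_30 only and profiled with the filter set to the second kernel, the expectation would be that no line inside `scale` appears in the instruction count results.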
I assume this is a bug in Nsight. Or does it point to a problem in my code?
Edit: I am using CUDA 5.0 and compiling solely for sm_30.