I am using NSight 3.0 VSE for a remote analysis. The target system is a Windows Server 2012 machine with two Tesla K10 cards. The host system is Windows 7 with VS 2008. Both target and host are 64-bit.
To analyse my kernel, I ran one of my test apps on the remote machine and selected 12 Experiments to be run on the second (in terms of call order) of my two kernels (I filter it using the “Kernels to Profile” input field in the experiment settings of the nvact window). One of the experiments is “Instruction Count” from the “Source-Level Experiments” group. The activity type was “Profile CUDA Application”.
The analysis runs fine, but the results show something strange: the “CUDA Instruction Count” experiment lists source lines that the analysed kernel should never reach (only the first kernel should reach them). One such line is the return statement of a device function that is called by the first kernel.
I assume this is a bug, or does it point to a problem in my code?
Georg
Edit: I am using CUDA 5.0 and compiling solely for sm_30.
The Source-Level Experiments collect information per SASS instruction and roll that information up to PTX and C source code. If the kernel has no line information, or if optimization significantly modified the code, the tool cannot provide a good roll-up of SASS instruction statistics to higher-level source lines.
In your case, I’m not sure whether your compiler is generating poor line information for your kernel or whether line information was not generated at all.
Did you enable “Generate Line Information” in your Release configuration?
If you run the experiment on a Debug kernel, you should get very accurate line information. This can be useful for certain types of debugging, such as looking at control flow statistics, examining memory access patterns, and finding the source lines that generate double-precision instructions (your other question).
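To compare the two build variants outside Visual Studio, here is a minimal sketch. It assumes that the “Generate Line Information” project setting corresponds to nvcc's `-lineinfo` flag and that a Debug configuration corresponds to `-G`. The file name, kernel names, and the `scale` helper are hypothetical stand-ins for your two kernels and the device function called only by the first one.

```cuda
// lineinfo_demo.cu (hypothetical file name)
//
// Optimized build with line information only (assumed equivalent of
// "Generate Line Information" in a Release configuration):
//   nvcc -arch=sm_30 -lineinfo -o lineinfo_demo lineinfo_demo.cu
// Debug build (device debug info, device optimizations disabled):
//   nvcc -arch=sm_30 -G -o lineinfo_demo lineinfo_demo.cu
// sm_30 matches the setup in the question; newer toolkits need a newer -arch.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
static void check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Device function that only firstKernel calls.
__device__ float scale(float x, float factor) {
    return x * factor;
}

// First kernel in call order: uses the device function.
__global__ void firstKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = scale(data[i], 2.0f);
}

// Second kernel in call order: the one selected via "Kernels to Profile".
// It never calls scale().
__global__ void secondKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 256;
    float host[n];
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    float* dev = NULL;
    check(cudaMalloc((void**)&dev, n * sizeof(float)), "cudaMalloc");
    check(cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice),
          "cudaMemcpy (host to device)");

    firstKernel<<<1, n>>>(dev, n);
    check(cudaGetLastError(), "firstKernel launch");
    secondKernel<<<1, n>>>(dev, n);
    check(cudaGetLastError(), "secondKernel launch");

    // This blocking copy also waits for both kernels to finish.
    check(cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost),
          "cudaMemcpy (device to host)");
    check(cudaFree(dev), "cudaFree");

    std::printf("host[0] = %f\n", host[0]);
    return 0;
}
```

Profiling `secondKernel` in both builds and comparing which source lines the Instruction Count experiment reports should show whether the unexpected lines come from the optimized build's line information. If they disappear in the `-G` build, that points to the roll-up rather than to a problem in your code.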