Incorrect variable assignment to unsigned long long

When I use nSight to examine variables in this kernel, the result of the assignment to s[0] is shown incorrectly.

__global__ void Check(void)
{
	unsigned long long s[8];
	s[0] = 0x1000;

Locals window shows s[0]=0 after stepping past the assignment above. I expect 0x1000.

-	s	0x0000000000fffc10  {0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, ...}	unsigned long long[8] __local__
		[0]	0x0000000000000000	__local__ unsigned long long&

The nvcc command line for the build is

1>..."C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\bin\nvcc.exe" -gencode=arch=compute_30,code=\"sm_30,compute_30\" --use-local-env --cl-version 2017 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio017\Community\VC\Tools\MSVC4.11.25503\bin\HostX86\x64" -x cu  ... -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\include"  -G  --keep --keep-dir x64\Debug -maxrregcount=0  --machine 64 --compile -cudart static  -g   -DWIN64 -D_WINDOWS -D_DEBUG -DTablebaseBuild -Xcompiler "/EHsc /W3 /nologo /Od /FS /Zi /RTC1 /MTd " -o x64\Debug\PosX_CUDA.cu.obj "C:\ProjectDir\PosX_CUDA.cu"

(I have elided some library includes for other parts of the C++ code)
This is a debug build.

Disassembly of the CUDA code corresponding to this line looks reasonable

0x002c98b0  [0052] tmp3:  
0x002c98b0  [0053] mov.b32 %r3, %r8;  
0x002c98c8  [0054] tmp4:  
0x002c98c8  [0055] mov.u32 %r28, %r3;  
0x002c98d8  [0056] tmp5:  
0x002c98d8  [0058] BB0_1:  
0x002c98d8  [0060] mov.u32 %r4, %r28;  
0x002c98f0  [0061] tmp6:  
0x002c98f0  [0062] setp.lt.s32	%p1, %r4, 64;  
0x002c98f8  [0063] not.pred %p2, %p1;  
0x002c9908  [0064] @%p2 bra BB0_4;  
0x002c9910  [0065] bra.uni BB0_2;  
0x002c9918  [0067] BB0_2:  
0x002c9918  [0068] mov.u64 %rd1, 4096;

Am I mis-using nSight, is there some error in the build, or the code? I expect to see s[0]=0x1000.

GeForce GTX 1050 Ti / CUSD runtime 9.0 / nSight 5.4 / Visual Studio Community 2017 15.3 / 64-bit Windows 7

  1. locals can be optimized into registers. In the case of a local array, this is also/still true if the array indexing can be deduced by the compiler.

  2. If the local is optimized into a register, I’m not sure what the behavior of the locals window would be. Ideally it would be to show the register value, but I’m not 100% sure of that.

  3. The code you are showing is PTX code, not SASS. I generally would consider PTX to be unreliable for this sort of work. Having said that, this line:

mov.u64 %rd1, 4096;

seems obviously corresponding to the source line in question:

s[0] = 0x1000;

And it seems to be only loading a register (no local store operation, yet) So I would refer you back to item 2 above.

A few other comments:

  • it’s hard to be certain, but it appears to me you may be viewing the local memory window as opposed to the local variable window. A careful read of the relevant doc section:

http://docs.nvidia.com/gameworks/index.html#developertools/desktop/nsight/view_memory.htm%3FTocPath%3DDeveloper%2520Tools%7CDesktop%2520Developer%2520Tools%7CNVIDIA%2520Nsight%2520Visual%2520Studio%2520Edition%7CNVIDIA%2520Nsight%2520Visual%2520Studio%2520Edition%25205.4%7CCUDA%2520Debugger%7CInspect%2520State%7C_____2

suggests to me there may be a difference in behavior. If you are looking at the local memory window instead of the local variable window, you might check the other. (in the doc section above, compare “View Memory” vs. “View Variables”)

  • For detailed questions about nsight VSE behavior, there is also a separate forum section:

https://devtalk.nvidia.com/default/board/84/nsight-visual-studio-edition/

@txbob: The disassembly I showed was produced by the nSight’s Visual Studio disassembler.

In my real code, I was modifying variables in global memory, and looking at the values after copying back to the host. Those results were incorrect, which led me to the local array “s” test that I showed.

When using Visual Studio to debug optimized C++ code, the locals window tells you when a variable has been optimized away, or not in a place where the debugger can find it. So I was expecting the same from nSight, but perhaps that’s asking too much. It was a debug build, and for the C++ compiler, all variables are available to the debugger. But I don’t know about how nvcc optimizes, or whether that can be turned off.

As far as memory vs locals window, they both show the same thing. The difference is that the locals window has the variable name and an interpretation of the memory corresponding to the type of variable, while the memory window just shows raw bytes. I ran the pgm again and looked at the corresponding memory and saw that the memory listed as the array s in the locals window was not changed. However, other memory 0x218 bytes away does change from 0 to 0x1000. There are 64 threads running, although the nSight manual says that stepping freezes the other threads.

I will check the nSight VSE forum. NVidia’s documentation is pretty sketchy.

The problem may be the interaction of focus and breakpoints. nSight is telling me the focus is (0,0,0) (0,0,0). I assumed the debugger would set the focus to the thread that reaches a breakpoint, but that may not be true.

Using the CUDA Info window set to Warps gives me some information, but I don’t understand what it means.