"Local" memory statistics

When I compile my application with the verbose option on, the compiler reports that I use no local memory.

However, when I profile with the memory statistics experiments on, both the “overview” tab and the “local” tab show a lot of local memory traffic.
In the “caches” tab, the “local hit rate” is 100% for loads and 99.95% for stores. I would have expected it to be 0%.

Does anyone have an idea where this traffic would be coming from?

Would you mind posting the output from the compiler and your kernel source? Are you by chance using any arrays in your kernels? Arrays whose indices cannot be resolved at compile time are placed in local memory.
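To illustrate the point about array indexing, here is a minimal sketch contrasting a runtime index with compile-time indices. The kernel names, the host helpers run_dynamic and run_static, and the input values are all hypothetical and only serve the illustration. Whether the first kernel really ends up in local memory depends on the compiler version and target, so treat it as a likely outcome rather than a guarantee.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Runtime index: the compiler cannot resolve coeff[sel] to a fixed register,
// so coeff[] is a candidate for placement in local memory.
__global__ void dynamic_index(const float *in, const int *sel, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float coeff[4];
    for (int i = 0; i < 4; ++i)
        coeff[i] = in[4 * tid + i];
    out[tid] = coeff[sel[tid] & 3];   // index only known at run time
}

// Compile-time indices: after unrolling, every access uses a constant index,
// so coeff[] can live entirely in registers.
__global__ void static_index(const float *in, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float coeff[4];
    #pragma unroll
    for (int i = 0; i < 4; ++i)
        coeff[i] = in[4 * tid + i];
    out[tid] = coeff[0] + coeff[1] + coeff[2] + coeff[3];
}

// Illustrative input for a single thread (hypothetical values).
static const float kInput[4] = {1.0f, 2.0f, 3.0f, 4.0f};

// Hypothetical host helper: runs dynamic_index with one thread.
// Error checking is omitted to keep the sketch short.
float run_dynamic(int sel)
{
    float *d_in = 0, *d_out = 0;
    int *d_sel = 0;
    float result = 0.0f;
    cudaMalloc(&d_in, sizeof(kInput));
    cudaMalloc(&d_sel, sizeof(int));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, kInput, sizeof(kInput), cudaMemcpyHostToDevice);
    cudaMemcpy(d_sel, &sel, sizeof(int), cudaMemcpyHostToDevice);
    dynamic_index<<<1, 1>>>(d_in, d_sel, d_out);
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_sel);
    cudaFree(d_out);
    return result;
}

// Hypothetical host helper: runs static_index with one thread.
float run_static()
{
    float *d_in = 0, *d_out = 0;
    float result = 0.0f;
    cudaMalloc(&d_in, sizeof(kInput));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, kInput, sizeof(kInput), cudaMemcpyHostToDevice);
    static_index<<<1, 1>>>(d_in, d_out);
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}

int main()
{
    printf("dynamic_index result: %f\n", run_dynamic(2));
    printf("static_index result : %f\n", run_static());
    return 0;
}
```

Compiling this with nvcc -Xptxas -v prints, per kernel, the register count plus the stack frame and spill store/load byte counts, which is one way to compare the two kernels without a profiler.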

There are several likely cases:

  1. The local variables declared by the kernel need more space than the available registers, so some of them are stored in local memory.
  2. The kernel needs many temporary registers, which forces register spilling.
  3. (1) or (2) occurred because you executed a debug build of the kernel, which backs every local variable with a store to local memory.

Hi, and thanks for the reply.
Unfortunately this is commercial code, so I’m not at liberty to copy anything. I realize this limits your ability to analyze the problem.

Register usage is at 46 (with maxregister set to 0) with a block size of 128, so there should be no spilling (?).

The kernel does not use any arrays, but it has one struct with 4 members and another with 12 members. Are those considered indexable arrays?

Also, the “local memory per thread” column in the Nsight profile “CUDA Launches” window reports 0 bytes.
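As a cross-check on the figures reported by the compiler and the profiler, the same per-kernel numbers can be read at run time with cudaFuncGetAttributes. The sketch below is a minimal example under assumptions: my_kernel is a hypothetical placeholder for the real kernel, and query_kernel is just a thin illustrative wrapper.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel (hypothetical); substitute the kernel under investigation.
__global__ void my_kernel(float *out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f;
}

// Fetches the attributes the runtime recorded for my_kernel.
cudaError_t query_kernel(cudaFuncAttributes *attr)
{
    return cudaFuncGetAttributes(attr, my_kernel);
}

int main()
{
    cudaFuncAttributes attr;
    cudaError_t err = query_kernel(&attr);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    // localSizeBytes is the per-thread local memory the kernel was compiled
    // with; numRegs is the per-thread register count.
    printf("registers per thread  : %d\n", attr.numRegs);
    printf("local bytes per thread: %zu\n", attr.localSizeBytes);
    printf("static shared bytes   : %zu\n", attr.sharedSizeBytes);
    return 0;
}
```

If localSizeBytes is 0 here while the profiler still shows local traffic, it is worth checking that the profiled binary is the same build (for example, not a debug build) as the one being queried.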

Hmm, at 46, your registers shouldn’t be spilling into local memory (although it could still occur as Greg noted above). I was unable to reproduce this myself. Would you mind posting screenshots of the overview, local, and caches tabs? If this is too sensitive for this public forum, I can send you a private message with my contact information.

Which GPU, Nsight version, and GPU driver are you using? And which CUDA Compute Capability version are you targeting?

-Jeff

Thanks for the reply.
Target hardware is sm_20 (Tesla C2075).
Nsight is 3.0.0.12296
Driver is 306.14

http://qupload.com/images/profilersc.png

Ailleur,

Nsight VSE 3.0 will show all memory transactions correlated to your source code. In order to see this information you have to capture the CUDA Memory Transactions experiment. This was captured in your linked screenshots.

  1. Navigate to the CUDA Launches page
  2. In the top table select the kernel of interest
  3. In the bottom left pane select CUDA Source Profiler\CUDA Memory Transactions
  4. In the bottom right pane, in the Memory Transactions table, click the filter icon on the Memory Type column and select Local and Generic, Local.
  5. The table will now show all local memory accesses. You can click on the Line # cell to jump to the source line in the CUDA Source View to investigate what parts of your code are accessing the local memory.

Right, I had seen that, but I only had a view of the SASS (?) code, and it's a few thousand lines long, so I didn't feel up to the challenge! But once I compiled in debug mode and did a new trace, I had access to the original source code.
By the way, that is pretty damn cool.

Back to business.
I have local stores in the signature of the kernel. The source code line associated with the signature is linked to 33 local stores. That, I do not understand.
A line as simple as

unsigned int kvoxel(0u);

is associated with a local load.

float3 onevariable = othervariable.float3member;

is associated with 3 local loads and 3 local stores.

By the way, when you right-click the source view window, Visual Studio crashes.

But right during that crash, I even had a local store operation within the math_functions.cu file.

float invlen = 1.0f/sqrtf(v.x*v.x+v.y*v.y+v.z*v.z);

is associated with 6 local loads.

int myMedium(-1);

is accounted for by 1 local load and 7 lines of SASS code (unoptimized, but still…)
http://qupload.com/images/simple.png
http://qupload.com/images/simpleyyy.png

So, I do not understand!

Well, now I somewhat understand.

That code was using the old ‘volatile’ trick, which the programming guide clearly says could hurt performance “in the future” :) After removing volatile, the local memory usage is gone.

However, I still do not fully understand, since the examples above had nothing to do with the volatile variable (which I used for the idx variable). I’m also quite sure I wasn’t reading 2.18 GB worth of idx values, as per the screenshot a few messages up.
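For anyone wanting to reproduce the comparison, here is a minimal sketch of the pattern described above, with the volatile qualifier on idx toggled by a macro. The kernel scale, the USE_VOLATILE macro and the host helper run_scale are hypothetical names for this illustration, not the original code, and whether the volatile variant actually produces local memory traffic will depend on the compiler version, target architecture and build type.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Build with -DUSE_VOLATILE to declare idx volatile (the old "trick");
// build without it to get a plain local variable.
#ifdef USE_VOLATILE
#define IDX_QUALIFIER volatile
#else
#define IDX_QUALIFIER
#endif

__global__ void scale(const float *in, float *out, unsigned int n)
{
    IDX_QUALIFIER unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = 2.0f * in[idx];
}

// Hypothetical host helper: scales a single value on the GPU.
// Error checking is omitted to keep the sketch short.
float run_scale(float x)
{
    float *d_in = 0, *d_out = 0;
    float result = 0.0f;
    cudaMalloc(&d_in, sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, &x, sizeof(float), cudaMemcpyHostToDevice);
    scale<<<1, 32>>>(d_in, d_out, 1u);
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}

int main()
{
    printf("scaled value: %f\n", run_scale(1.5f));
    return 0;
}
```

Compiling once with and once without -DUSE_VOLATILE, each time adding -Xptxas -v, and then filtering the Memory Transactions table on Local for both builds should show whether the qualifier alone accounts for the traffic.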

In any case, thanks for your help.