Why does my kernel use local memory?

The latest profiler (nvvp 7.5) showed me hot spots attributed to memory dependency. After checking the assembly code, I found that “LDL” instructions were used in some of the register-variable operations.

I then compiled with the -Xptxas -v,-abi=no option in nvcc to print the local memory info, and got the following report:

ptxas warning : 'option -abi=no' might get deprecated in future
ptxas info    : 0 bytes gmem, 18704 bytes cmem[2]
ptxas info    : Compiling entry function '_Z13mcx_main_loopPhPfS0_PjP6float4S3_S3_S0_S1_S0_S0_S0_S0_' for 'sm_20'
ptxas info    : Used 59 registers, 136 bytes cmem[0], 64 bytes cmem[16], 96 bytes lmem

Using -Xptxas -v alone, I also see that there was no register spilling:

ptxas info    : 0 bytes gmem, 18704 bytes cmem[2]
ptxas info    : Compiling entry function '_Z13mcx_main_loopPhPfS0_PjP6float4S3_S3_S0_S1_S0_S0_S0_S0_' for 'sm_20'
ptxas info    : Function properties for _Z13mcx_main_loopPhPfS0_PjP6float4S3_S3_S0_S1_S0_S0_S0_S0_
    96 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 61 registers, 136 bytes cmem[0], 60 bytes cmem[16]

If I compile it for sm_52, the register count increases to 64.

From what I read online, the maximum registers per thread for sm_20 and sm_52 are higher than 59 and 64, respectively.

So my question is: why does nvcc use lmem to store some of the registers in my kernel (hence the LDL instructions)? How can I find more details about this?

This cannot be diagnosed with any certainty without seeing the source code and the nvcc command line used to build it.

As you noted, one of the uses of local memory is storage for spilled registers. Thread-local arrays above a certain size, or arrays whose indexing is not compile-time constant, also need to be stored in local memory, since registers are not indexable and are a scarce resource. The ABI’s function calling conventions may involve passing function arguments on the stack, with the stack allocated in local memory. You would likely see this with functions that take many arguments and are either compiled separately, or for which inlining has been suppressed.
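To illustrate the indexing case, here is a minimal host-side C sketch (names are made up, not from MCX; the same access patterns apply in device code). The first function uses a run-time index and would typically force the array into local memory on the GPU; the second uses a compile-time-constant index and can keep the array entirely in registers.

```c
#include <assert.h>

/* Hypothetical illustration; names are made up. In CUDA device code,
 * a run-time index into a thread-local array usually forces the array
 * into local memory, because registers cannot be indexed dynamically. */
static float pick_dynamic(const float in[4], int idx) {
    float a[4];
    for (int i = 0; i < 4; ++i)
        a[i] = in[i];
    return a[idx];     /* idx unknown at compile time -> local memory */
}

static float pick_constant(const float in[4]) {
    float a[4];
    for (int i = 0; i < 4; ++i)
        a[i] = in[i];
    return a[2];       /* constant index -> can stay in registers */
}
```

Note that the compiler can sometimes still promote small dynamically indexed arrays to registers if it can resolve the index or fully unroll the surrounding loop, so the disassembly is the final arbiter.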

I am not sure how the two snippets of compiler output belong together, but I note that the compiler reports a 96-byte stack frame in one and 96 bytes of lmem usage in the other. Since the stack is located in local memory, it stands to reason that these are the same 96 bytes.

The software I am working on is open source. The full source code can be downloaded via SVN; see the instructions here:

http://mcx.sourceforge.net/cgi-bin/index.cgi?Download#Anonymous_SVN_Access

or use the “Download Snapshot” link on this page

http://sourceforge.net/p/mcx/svn/HEAD/tree/mcextreme_cuda/trunk/

After downloading, cd into the mcx/src folder and run “make”. This should compile the program and print the lmem info automatically.

The compilation info I attached above was from a local branch, so the output from the svn code is slightly different (but similar):

fangq@wazu:mcx/src$ make
nvcc -c -Xptxas -v,-abi=no -g  -m64 -Xcompiler -fopenmp -DUSE_ATOMIC -use_fast_math -DSAVE_DETECTORS -DUSE_CACHEBOX -use_fast_math -arch=sm_20 -DMCX_TARGET_NAME='"Fermi MCX"' -o mcx_core.o  mcx_core.cu
ptxas warning : 'option -abi=no' might get deprecated in future
ptxas info    : 0 bytes gmem, 18704 bytes cmem[2]
ptxas info    : Compiling entry function '_Z13mcx_main_loopPhPfS0_PjP6float4S3_S3_S0_S1_S0_S0_S0_S0_' for 'sm_20'
ptxas info    : Used 63 registers, 136 bytes cmem[0], 72 bytes cmem[16], 68 bytes lmem
ptxas info    : Compiling entry function '_Z12mcx_test_rngPfPj' for 'sm_20'
ptxas info    : Used 20 registers, 48 bytes cmem[0], 8 bytes cmem[16], 20 bytes lmem

I am only interested in the “mcx_main_loop” kernel.

http://sourceforge.net/p/mcx/svn/HEAD/tree/mcextreme_cuda/trunk/src/mcx_core.cu#l484

Any help interpreting the lmem usage here is appreciated!

Also, if you want to test this in the profiler, open a new profiling session, select “mcx/bin/mcx” as the executable, set “mcx/examples/quicktest/” as the working folder, and use

-A -g 10 -n 1e7 -f qtest.inp -s qtest -r 1 -a 0 -b 0 -G 1

as the arguments. The number after -G specifies which GPU to use; run “mcx -L” to list all available GPUs.

Thanks for the comment. I also read about this while exploring this topic. Is a struct considered a non-constant-indexed array?

what about this line?

The v here is a struct with 4 float members. Does accessing its members by index force v into local memory?

You wrote MCX?

I rewrote this code about 6 months ago and added fluorescence for multiple fluorophores, as well as simple shapes like cylinders, spheres, cubes, etc. I also got about a 50-100x speedup over the current MCX version for most cases using the same Maxwell GPU.

Why did you not use cuRAND?

I discovered some nice optimizations; feel free to PM me if you are interested.

Thanks for making this open source.

yes

wow, 50-100x sounds like a lot! Have you published this work? I’d love to know more about it.

it’s on my TODO list

https://github.com/fangq/mcx/issues/1

but MCX’s built-in logistic-lattice RNG is lightweight and fast. I know it has some minor quality issues, but I wasn’t motivated to upgrade.

I will follow up with a PM. Glad to learn more about what you have discovered.

When I referred to an array, I meant an array, not a struct. With a bit of work you should be able to find out which of your code’s variables wind up in local memory. You can then form a hypothesis as to why they were placed there.
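For example (assuming the CUDA toolkit is on your PATH, and using the same flags as in your make output above), you can combine the per-kernel ptxas report with a disassembly search for local-memory loads and stores:

```shell
# per-kernel register/lmem report at compile time
nvcc -arch=sm_20 -Xptxas -v -c mcx_core.cu -o mcx_core.o
# disassemble the machine code and look for local loads/stores (LDL/STL)
cuobjdump -sass mcx_core.o | grep -E 'LDL|STL'
```

Matching the LDL/STL offsets against the surrounding SASS (or against line info compiled in with -lineinfo) narrows down which source variables are involved.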

The higher level question is: Does it matter where those variables are allocated?

After testing various settings, I found that the array-cast approach is indeed what puts the struct into local memory. One example mentioned earlier is:

MCXdir v; // v.{x,y,z,nscat}, all members are float
float Icos=fabs(((float*)v)[__float2int_rn(flipdir)-1]);

In this case, the variable v is forced into local memory.

I converted several of these structs to shared memory, and now I see LDS instead of LDL in the assembly. The PC-sampling profiling results indicate that memory dependency was reduced from 35% to 11%. However, the overall speedup was only about 10% (on Maxwell; no improvement on Fermi), not as much as I had hoped. Still, it is somewhat encouraging.

A related question: based on the slides below (pp. 45-47), there is a penalty for using shared memory versus a real register.

http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf

Given that my hardware still has room for more registers (a 980Ti allows 255 registers/thread), is there a way to convert the array-cast access into constant-indexed array/struct accesses? Something like this?

float Icos= (flipdir==1.f ? v->x : ((flipdir==2.f) ? v->y : v->z));

What about this more complex case? Do I have to use branches?

((float*)(v))[flipdir]*=-1.f;

I suppose local memory is the slowest; shared memory is also relatively slow compared to registers, based on the above slides. So, if at all possible, I’d like to maximize my use of registers.
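For what it’s worth, here is a minimal C sketch of that rewrite (MCXdir’s layout is assumed from the comment earlier in the thread, and the float flipdir is reduced to a 0-based int selector for illustration). Every access uses a compile-time-constant member, so v remains a register candidate:

```c
#include <assert.h>

/* Sketch only: MCXdir layout assumed from the earlier comment; the real
 * code uses a float flipdir, here reduced to a 0-based int selector. */
typedef struct { float x, y, z, nscat; } MCXdir;

/* Equivalent of ((float*)(v))[flipdir] *= -1.f, but with only
 * compile-time-constant member accesses. */
static void flip_component(MCXdir *v, int flipdir) {
    switch (flipdir) {
        case 0:  v->x = -v->x; break;
        case 1:  v->y = -v->y; break;
        default: v->z = -v->z; break;
    }
}
```

On the GPU, the compiler can often turn such small branches into predicated or select instructions rather than divergent branches; the SASS output would confirm whether that happens here.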

Local memory (a more precise name would be “thread-local memory”) is simply a per-thread mapped portion of the GPU’s global memory, which means it has high access latency. Not every use of local memory has a measurable performance impact at the application level, however; it depends very much on where in the code those accesses occur.

It seems that you have now been able to establish which variables were placed in local memory, and characterized the performance impact of that across different GPU architectures. Sounds like good progress. In general, I would suggest letting the profiler guide you as to where to expend optimization effort.