Access to auto variables spilled to local memory not shown in profiler local {load,store}=0 even ami


Question: Do the “local load” and “local store” counters of the CUDA profiler include accesses to register values that were spilled to local memory?

Background info:
I have a CUDA program that I am sure causes register spilling to local memory, since

# of registers/thread * # of threads/block * # of blocks per SM (i.e. CTAs launched) > number of registers per SM

However, the “local load” and “local store” columns of the profiler show 0.
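
To make the register arithmetic above concrete, here is a minimal sketch of the check I am doing. The kernel `placeholder_kernel`, the helper `exceeds_register_file`, and the launch numbers (256 threads per block, 8 blocks per SM) are hypothetical stand-ins, not my real code. The register counts are read back with `cudaFuncGetAttributes` and `cudaGetDeviceProperties`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel.
__global__ void placeholder_kernel(float *data, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) data[tid] = data[tid] * 2.0f + 1.0f;
}

// The inequality from the post:
// regs/thread * threads/block * blocks/SM > regs/SM
bool exceeds_register_file(int regsPerThread, int threadsPerBlock,
                           int blocksPerSM, int regsPerSM)
{
    long long needed = 1LL * regsPerThread * threadsPerBlock * blocksPerSM;
    return needed > regsPerSM;
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("no CUDA device found\n");
        return 1;
    }

    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, placeholder_kernel) != cudaSuccess) {
        printf("could not query kernel attributes\n");
        return 1;
    }

    // Assumed launch configuration (hypothetical numbers).
    const int threadsPerBlock = 256;
    const int blocksPerSM = 8;

    printf("registers per thread       : %d\n", attr.numRegs);
    printf("local memory per thread (B): %zu\n", attr.localSizeBytes);
    printf("registers per SM           : %d\n", prop.regsPerMultiprocessor);
    printf("product exceeds regs per SM: %s\n",
           exceeds_register_file(attr.numRegs, threadsPerBlock,
                                 blocksPerSM, prop.regsPerMultiprocessor)
               ? "yes" : "no");
    return 0;
}
```

The `localSizeBytes` field is printed alongside the register count because it reports how much local memory the compiled kernel uses per thread, which seems like a useful cross-check against the profiler columns.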

Then I put a dummy array in the kernel code and forced the kernel to traverse it. With that change, the “local load” and “local store” columns show non-zero values, so the profiler and its performance counters do appear to work for local memory accesses.
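
For reference, the dummy-array test looks roughly like the sketch below. This is not my actual kernel: the names `dummy_local_array`, `expected_sum`, and `run_and_check`, and the array size of 64, are made up for illustration. The idea is that a per-thread array indexed with a value known only at run time typically cannot be kept in registers, so the compiler places it in local memory.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kScratch = 64;  // hypothetical per-thread array size (power of two)

__global__ void dummy_local_array(const int *idx, int *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    int scratch[kScratch];                 // per-thread (auto) array
    for (int i = 0; i < kScratch; ++i)
        scratch[i] = tid + i;              // writes to the per-thread array

    int sum = 0;
    for (int i = 0; i < kScratch; ++i)     // run-time index forces real array accesses
        sum += scratch[(idx[tid] + i) & (kScratch - 1)];

    out[tid] = sum;
}

// Each thread reads every entry exactly once, whatever idx holds.
int expected_sum(int tid)
{
    return kScratch * tid + kScratch * (kScratch - 1) / 2;
}

bool run_and_check(int n)
{
    std::vector<int> h_idx(n), h_out(n, 0);
    for (int i = 0; i < n; ++i) h_idx[i] = i % kScratch;

    int *d_idx = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_idx, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) {
        cudaFree(d_idx);
        return false;
    }
    cudaMemcpy(d_idx, h_idx.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;
    dummy_local_array<<<blocks, threads>>>(d_idx, d_out, n);

    // The blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_idx);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (h_out[i] != expected_sum(i)) return false;
    return true;
}

int main()
{
    bool ok = run_and_check(1024);
    printf("dummy array kernel %s\n", ok ? "ran correctly" : "failed");
    return ok ? 0 : 1;
}
```

Building this with `nvcc -Xptxas -v` makes ptxas print the per-kernel register and local memory usage, which can be compared with what the profiler columns report.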

Why are accesses to register values that have been spilled to local memory not shown?

Thank you