Question: Do the “local load” and “local store” counters of the CUDA profiler include accesses to register values that were spilled to local memory?
I have a CUDA program that I am sure spills registers to local memory, since
(# of registers/thread) * (# of threads/block) * (# of blocks per SM, i.e. CTAs launched) > # of registers per SM
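For reference, the inequality above could be evaluated with something like the following minimal sketch. `myKernel`, `registersDemanded`, `threadsPerBlock` and `blocksPerSM` are hypothetical placeholders, not names from my actual program. The sketch also prints `localSizeBytes` from `cudaFuncGetAttributes`, which is the per-thread local memory size the runtime reports for the compiled kernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel.
__global__ void myKernel(float *out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f;
}

// Left-hand side of the inequality:
// registers/thread * threads/block * blocks per SM.
long long registersDemanded(int regsPerThread, int threadsPerBlock, int blocksPerSM)
{
    return static_cast<long long>(regsPerThread) * threadsPerBlock * blocksPerSM;
}

int main()
{
    const int threadsPerBlock = 256;  // assumed launch configuration
    const int blocksPerSM     = 4;    // assumed number of CTAs per SM

    // Per-kernel figures recorded at compile time.
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, myKernel) != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed\n");
        return 1;
    }

    // Per-device figures for device 0.
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed\n");
        return 1;
    }

    printf("registers per thread       : %d\n", attr.numRegs);
    printf("local memory per thread    : %zu bytes\n", attr.localSizeBytes);
    printf("registers demanded per SM  : %lld\n",
           registersDemanded(attr.numRegs, threadsPerBlock, blocksPerSM));
    printf("registers available per SM : %d\n", prop.regsPerMultiprocessor);
    return 0;
}
```

Compiling with `nvcc -Xptxas -v` should additionally print the register count and any spill stores/loads per kernel, which can be compared against these numbers.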
However, the “local load” and “local store” columns of the profiler show 0.
As a check, I added a dummy array to the kernel and forced the kernel to traverse it. The “local load” and “local store” columns then showed non-zero values, so the profiler and its performance counters do appear to work for local memory accesses.
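The dummy-array test was along the lines of the sketch below. This is a reconstruction, not my exact code: `dummyArrayKernel`, `runDummyArrayKernel` and `SCRATCH_SIZE` are made-up names, and it assumes that indexing a per-thread array with a value only known at run time is what makes the compiler keep the array in local memory (which is typically, but not necessarily, the case).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define SCRATCH_SIZE 64

// Each thread fills a private array and then reads it back with an index
// that depends on the runtime argument n, so the accesses cannot be
// resolved to fixed registers at compile time.
__global__ void dummyArrayKernel(int *out, int n)
{
    int scratch[SCRATCH_SIZE];
    for (int i = 0; i < SCRATCH_SIZE; ++i)
        scratch[i] = i * n + threadIdx.x;           // stores to the array

    int sum = 0;
    for (int i = 0; i < SCRATCH_SIZE; ++i)
        sum += scratch[(i + n) % SCRATCH_SIZE];     // loads from the array

    out[blockIdx.x * blockDim.x + threadIdx.x] = sum;
}

// Launches one block of `threads` threads (n must be non-negative) and
// copies the per-thread sums back into h_out. Returns false on any error.
bool runDummyArrayKernel(int n, int threads, int *h_out)
{
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, threads * sizeof(int)) != cudaSuccess) return false;

    dummyArrayKernel<<<1, threads>>>(d_out, n);
    bool ok = (cudaGetLastError() == cudaSuccess) &&
              (cudaDeviceSynchronize() == cudaSuccess) &&
              (cudaMemcpy(h_out, d_out, threads * sizeof(int),
                          cudaMemcpyDeviceToHost) == cudaSuccess);
    cudaFree(d_out);
    return ok;
}

int main()
{
    int h_out[32];
    if (!runDummyArrayKernel(1, 32, h_out)) {
        fprintf(stderr, "kernel run failed\n");
        return 1;
    }
    printf("thread 0 sum = %d\n", h_out[0]);
    return 0;
}
```

Running this under the profiler is what I would expect to produce non-zero “local load” and “local store” counts, as long as the compiler does not optimize the array away.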
Why are the accesses to spilled register data in local memory not shown?