Problems with local memory and shared memory

Hi guys. My program uses too many registers, so many variables end up in local memory and the program runs too slowly. I planned to move some temporary variables into shared memory to free up the scarce registers and keep the speed high.
However, whenever I use shared memory (static or dynamic) to hold temporary results or data read from global memory, the “local load” and “local store” counts in the Profiler always increase, so the program runs even slower.
Do you guys know why this happens? Do reads and writes to shared memory need additional registers, so that the register limit is exceeded and the excess spills into local memory? Thank you!

You should dump the resource usage first with:

nvcc -Xptxas -v -arch=sm_20 [source code]

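I suspect your kernel is still using local memory. The output lists the register count and any local memory (spill) usage for each kernel, so you can compare the numbers before and after your shared memory change.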
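If you would rather check the same numbers from inside the program, something like the sketch below might help. It is only an illustration, not your code: `my_kernel`, `uses_local_memory` and `report_kernel_resources` are made-up names, and the kernel body is a stand-in for your real one. It relies on `cudaFuncGetAttributes` from the CUDA runtime, which fills a `cudaFuncAttributes` struct with the per-thread register count and local memory size.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the register-heavy kernel in question.
__global__ void my_kernel(float *out, const float *in, int n)
{
    extern __shared__ float scratch[];  // dynamic shared memory, one slot per thread
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        scratch[threadIdx.x] = in[i] * 2.0f;   // stage a temporary in shared memory
        out[i] = scratch[threadIdx.x] + 1.0f;  // read it back
    }
}

// Hypothetical helper: true when the compiler reserved per-thread local memory
// for the kernel (register spills or locally addressed arrays).
bool uses_local_memory(const cudaFuncAttributes &attr)
{
    return attr.localSizeBytes > 0;
}

// Hypothetical helper: query and print the resource usage of my_kernel.
cudaError_t report_kernel_resources()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, my_kernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return err;
    }
    printf("registers per thread   : %d\n", attr.numRegs);
    printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    // sharedSizeBytes covers statically declared shared memory only;
    // the dynamic amount is whatever you pass at launch.
    printf("static shared per block: %zu bytes\n", attr.sharedSizeBytes);
    printf("uses local memory      : %s\n",
           uses_local_memory(attr) ? "yes" : "no");
    return cudaSuccess;
}
```

Call `report_kernel_resources()` once from your host code, build the version with and without the shared memory change, and compare the two printouts. If the local memory size grows in the shared memory version, the extra local loads and stores in the Profiler are most likely coming from spills.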