CUDA ray tracing speed

I use CUDA for ray tracing.
The traversal algorithm I use is kd-tree short-stack.
I found that keeping the stack in local memory is faster than keeping it in shared memory.
Why is that?
Is local memory faster than shared memory?
Sorry for my poor English.

No, it isn’t. Local memory is as slow as global memory. Perhaps by using less shared memory you allowed more threads to be resident at once, and the extra parallelism outweighed the memory access latency. Or perhaps something was wrong in your shared-memory version the first time. Many things could have happened.
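To make the "more threads at once" explanation concrete, here is a rough host-side sketch in plain C++ that estimates how many threads can be resident on one multiprocessor for a given shared-memory footprint. The function name `residentThreadsPerSM` and all of its parameters are hypothetical, made up for illustration. The model only considers shared memory and the resident-thread cap, and it ignores registers and the per-multiprocessor block limit, so treat it as an approximation rather than a real occupancy calculator.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Rough estimate of how many threads can be resident on one multiprocessor.
// Only two limits are modelled: shared memory and the resident-thread cap.
// Register usage and the per-multiprocessor block cap are ignored (assumption).
int residentThreadsPerSM(int threadsPerBlock,
                         std::size_t sharedBytesPerThread,
                         std::size_t sharedBytesPerSM,
                         int maxThreadsPerSM)
{
    assert(threadsPerBlock > 0 && threadsPerBlock <= maxThreadsPerSM);

    // Number of blocks that fit under the resident-thread limit.
    int blocksByThreads = maxThreadsPerSM / threadsPerBlock;

    // Number of blocks that fit in shared memory.
    // If the kernel uses no shared memory, this limit does not apply.
    int blocksByShared = blocksByThreads;
    std::size_t sharedBytesPerBlock =
        sharedBytesPerThread * static_cast<std::size_t>(threadsPerBlock);
    if (sharedBytesPerBlock > 0) {
        blocksByShared = static_cast<int>(sharedBytesPerSM / sharedBytesPerBlock);
    }

    // The tighter of the two limits decides how many blocks are resident.
    return std::min(blocksByThreads, blocksByShared) * threadsPerBlock;
}
```

As an example under assumed figures (16 KB of shared memory and at most 768 resident threads per multiprocessor, 128 threads per block, and a 16-entry stack of 4-byte entries per thread), `residentThreadsPerSM(128, 64, 16384, 768)` gives 256 threads when the stack lives in shared memory, while `residentThreadsPerSM(128, 0, 16384, 768)` gives 768 when it lives in local memory. Tripling the number of resident threads gives the scheduler far more work to hide memory latency with, which may be why the local-memory version ran faster. Check the actual limits of your own device before relying on these numbers.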