My code that uses local memory seems to run faster than the version that only uses global memory.
However, as far as I remember from the CUDA manual, local memory accesses cost the same as global memory accesses…
Is it that some of the local memory is actually kept in registers?
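
To make the question concrete, here is a minimal sketch of the two situations I have in mind. It is not my actual code: the kernel names, the array size of 4, and the helper `countMismatches` are all made up for illustration. My guess, based on my reading of the programming guide, is that a small per-thread array indexed only with compile-time constants can be kept in registers, while one indexed with a run-time value is likely to be placed in local memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: after unrolling, every index into tmp is a
// compile-time constant, so the compiler can usually keep tmp in registers.
__global__ void staticIndex(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float tmp[4];
    #pragma unroll
    for (int k = 0; k < 4; ++k)
        tmp[k] = in[i] * (k + 1);
    out[i] = tmp[0] + tmp[1] + tmp[2] + tmp[3];
}

// Hypothetical kernel: the index into tmp is only known at run time,
// so tmp is likely to be placed in local memory (assumption, check with ptxas).
__global__ void dynamicIndex(const float *in, const int *idx, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float tmp[4];
    #pragma unroll
    for (int k = 0; k < 4; ++k)
        tmp[k] = in[i] * (k + 1);
    out[i] = tmp[idx[i] & 3];
}

// Runs both kernels on n elements and returns the number of results that
// differ from a host-side reference, or -1 if a CUDA call fails.
int countMismatches(int n)
{
    float *in = nullptr, *outS = nullptr, *outD = nullptr;
    int *idx = nullptr;
    if (cudaMallocManaged(&in, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMallocManaged(&outS, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMallocManaged(&outD, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMallocManaged(&idx, n * sizeof(int)) != cudaSuccess) return -1;

    for (int i = 0; i < n; ++i) {
        in[i] = (float)(i % 7);   // small integers, exact in float
        idx[i] = i;
    }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    staticIndex<<<blocks, threads>>>(in, outS, n);
    dynamicIndex<<<blocks, threads>>>(in, idx, outD, n);
    if (cudaDeviceSynchronize() != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i) {
        if (outS[i] != 10.0f * in[i]) ++bad;                 // 1+2+3+4 = 10
        if (outD[i] != in[i] * ((idx[i] & 3) + 1)) ++bad;
    }

    cudaFree(in); cudaFree(outS); cudaFree(outD); cudaFree(idx);
    return bad;
}

int main()
{
    printf("mismatches: %d\n", countMismatches(1024));
    return 0;
}
```

Compiling with `nvcc -Xptxas -v` should print the register count and any local memory (stack frame or spill) usage per kernel, which would show whether the two arrays are really treated differently.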