I have two kernels that do exactly the same thing. One allocates its shared memory statically, while the other allocates it dynamically at run time. I use the shared memory as a 2D array, so for the dynamic version I have a macro that computes the memory location. The results produced by the two kernels are exactly the same. However, their run times differ by a factor of 3! The statically allocated version is much faster. I am sorry that I can’t post any of my code. Can someone explain this?
I can’t think of any reason why you would need a macro to compute the memory location. The code in the macro is most likely what’s slowing your kernel down.
There is very little difference between static and dynamic shared memory. When a kernel has only a single shared memory symbol, whether it’s static or dynamic makes no difference at all.
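Since the original code can't be posted, here is a minimal sketch of the two variants for comparison. The kernel names (`transposeStatic`, `transposeDynamic`), the `TILE` size, and the `width` parameter are all made up for illustration; they are not taken from the original kernels. The only differences between the two are the declaration, the third launch parameter, and the hand-written index `y * width + x`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 16  // assumed tile width; the original post gives no sizes

// Static allocation: size fixed at compile time, so the compiler
// generates the 2D index arithmetic itself.
__global__ void transposeStatic(const float* in, float* out)
{
    __shared__ float tile[TILE][TILE];
    int x = threadIdx.x, y = threadIdx.y;
    tile[y][x] = in[y * TILE + x];
    __syncthreads();
    out[y * TILE + x] = tile[x][y];
}

// Dynamic allocation: size supplied at launch, so the 2D index has to be
// computed by hand from a run-time width.
__global__ void transposeDynamic(const float* in, float* out, int width)
{
    extern __shared__ float tile[];
    int x = threadIdx.x, y = threadIdx.y;
    tile[y * width + x] = in[y * width + x];
    __syncthreads();
    out[y * width + x] = tile[x * width + y];
}

int main()
{
    const int n = TILE * TILE;
    const size_t bytes = n * sizeof(float);
    float h_in[n], h_a[n], h_b[n];
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in, *d_a, *d_b;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    transposeStatic<<<1, block>>>(d_in, d_a);
    // The third launch parameter is the dynamic shared memory size in bytes.
    transposeDynamic<<<1, block, bytes>>>(d_in, d_b, TILE);

    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b, d_b, bytes, cudaMemcpyDeviceToHost);

    // Both kernels should produce identical output.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_a[i] != h_b[i]) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_in);
    cudaFree(d_a);
    cudaFree(d_b);
    return mismatches != 0;
}
```

If two kernels written like this still show a large timing gap, the index computation is the first place I would look: with a compile-time width the compiler can fold the arithmetic into constants, whereas a run-time width (or a macro that does extra work) has to be evaluated on every access.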