I've implemented two small functions on the GPU that do the same thing. The only difference is the thread layout: the first uses a 1D thread block of 256 threads with a 1D shared memory array (256 elements), while the second uses a 2D thread block of 16 × 16 threads with a 2D shared memory array (16 × 16).
The second function takes almost twice as long as the first. Why?
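To make the two configurations concrete, here is a minimal sketch of what such a pair of kernels could look like. The kernel bodies are placeholders (a simple scale-by-two staged through shared memory), and the names `scale1D`, `scale2D` and `count_mismatches` are hypothetical, chosen only for illustration; they are not the actual functions being asked about. Only the block shapes (256 vs. 16 × 16) and the shared memory shapes come from the description above.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for the first function:
// 1D block of 256 threads, 1D shared memory array of 256 elements.
__global__ void scale1D(const float* in, float* out, int n)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();
    if (i < n) out[i] = 2.0f * tile[threadIdx.x];
}

// Hypothetical stand-in for the second function:
// 2D block of 16 x 16 threads, 2D shared memory array of 16 x 16 elements.
__global__ void scale2D(const float* in, float* out, int width, int height)
{
    __shared__ float tile[16][16];
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    bool inside = (x < width) && (y < height);
    if (inside) tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();
    if (inside) out[y * width + x] = 2.0f * tile[threadIdx.y][threadIdx.x];
}

// Runs both kernels on the same input and counts the elements
// where their outputs disagree.
int count_mismatches(int width, int height)
{
    int n = width * height;
    size_t bytes = n * sizeof(float);
    std::vector<float> h_in(n), h_a(n), h_b(n);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i % 97);

    float *d_in = nullptr, *d_a = nullptr, *d_b = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    // 1D launch: 256 threads per block over the flattened array.
    scale1D<<<(n + 255) / 256, 256>>>(d_in, d_a, n);

    // 2D launch: 16 x 16 threads per block over the width x height grid.
    dim3 block(16, 16);
    dim3 grid((width + 15) / 16, (height + 15) / 16);
    scale2D<<<grid, block>>>(d_in, d_b, width, height);

    // These copies run on the default stream, so they wait for the kernels.
    cudaMemcpy(h_a.data(), d_a, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b.data(), d_b, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_a);
    cudaFree(d_b);

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_a[i] != h_b[i]) ++mismatches;
    return mismatches;
}
```

Calling `count_mismatches(64, 64)` from `main` only checks that the two variants agree on their output; timing each launch separately (for example with `cudaEventRecord` and `cudaEventElapsedTime`) would be the way to reproduce the comparison on a given device.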