Performance difference between 1D and 2D thread blocks

I’ve implemented two small GPU functions with the same functionality. The difference is that the first uses a 1D thread block of 256 threads per block, while the second uses a 2D thread block of 16 x 16 threads per block. Likewise, the first uses a 1D shared memory array (256 elements) and the second uses a 2D shared memory array (16 x 16).

The second function takes almost twice as long as the first. Why?

It would be easier to comment if you could post some code.
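Without the actual code this is only a guess, but one common cause of a slowdown like this is the index order in the 2D version. Within a warp, threadIdx.x varies fastest, so it should normally select the column (the contiguous dimension) of both the global array and the shared tile. If x and y are swapped, neighbouring threads access addresses that are a whole row apart, which breaks coalescing in global memory and can cause shared memory bank conflicts. Below is a minimal sketch of the three cases. All names here (scale1D, scale2D, scale2DSwapped, checkKernels, TILE) and the "multiply by 2" workload are hypothetical and only for illustration; they are not taken from the original poster's code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 16  // 16 x 16 = 256 threads per block, as in the question

// Hypothetical 1D version: 256 threads per block, 1D shared buffer.
__global__ void scale1D(const float* in, float* out, int n)
{
    __shared__ float buf[TILE * TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        buf[threadIdx.x] = in[i];          // each thread only reads its own slot
        out[i] = 2.0f * buf[threadIdx.x];
    }
}

// Hypothetical 2D version where threadIdx.x selects the column:
// neighbouring threads touch neighbouring addresses.
__global__ void scale2D(const float* in, float* out, int width, int height)
{
    __shared__ float tile[TILE][TILE];
    int col = blockIdx.x * TILE + threadIdx.x;
    int row = blockIdx.y * TILE + threadIdx.y;
    if (col < width && row < height) {
        tile[threadIdx.y][threadIdx.x] = in[row * width + col];
        out[row * width + col] = 2.0f * tile[threadIdx.y][threadIdx.x];
    }
}

// Same work with x and y swapped: neighbouring threads are now
// `width` floats apart in global memory and TILE floats apart in the tile.
__global__ void scale2DSwapped(const float* in, float* out, int width, int height)
{
    __shared__ float tile[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.x;
    int col = blockIdx.x * TILE + threadIdx.y;
    if (col < width && row < height) {
        tile[threadIdx.x][threadIdx.y] = in[row * width + col];
        out[row * width + col] = 2.0f * tile[threadIdx.x][threadIdx.y];
    }
}

// Host helper: runs all three kernels and checks that each one doubles the input.
bool checkKernels(int width, int height)
{
    int n = width * height;
    float *in = nullptr, *out = nullptr;
    if (cudaMallocManaged(&in, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMallocManaged(&out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(in);
        return false;
    }
    for (int i = 0; i < n; ++i) in[i] = (float)i;

    dim3 block2D(TILE, TILE);
    dim3 grid2D((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
    int grid1D = (n + TILE * TILE - 1) / (TILE * TILE);

    bool ok = true;
    for (int variant = 0; variant < 3 && ok; ++variant) {
        for (int i = 0; i < n; ++i) out[i] = 0.0f;
        if (variant == 0)      scale1D<<<grid1D, TILE * TILE>>>(in, out, n);
        else if (variant == 1) scale2D<<<grid2D, block2D>>>(in, out, width, height);
        else                   scale2DSwapped<<<grid2D, block2D>>>(in, out, width, height);
        ok = (cudaDeviceSynchronize() == cudaSuccess);
        for (int i = 0; i < n && ok; ++i) ok = (out[i] == 2.0f * in[i]);
    }
    cudaFree(in);
    cudaFree(out);
    return ok;
}
```

All three kernels produce the same result, so a correctness check like checkKernels(1024, 1024) called from main will not reveal the problem; only timing (for example with cudaEvent timers or a profiler) will. If your 2D kernel looks like scale2DSwapped, swapping the roles of threadIdx.x and threadIdx.y is the first thing to try.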