Thread block size of 32x16 gets better performance than 16x32; also one case where 16x16 is better than both

I have a CUDA algorithm for processing 3D volumes. I work with volume sizes of 384x384x384, 256x256x256, 192x192x192 and 128x128x128. The program works just fine, but I get different performance depending on the thread block size. I have tested block sizes of 32x16, 16x32 and 16x16.
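The launch looks roughly like this (a simplified sketch; `processVolume`, `d_in` and `d_out` are placeholder names, and I'm assuming here that the kernel's single loop walks the z dimension):

```
dim3 block(32, 16);                       // also tried (16, 32) and (16, 16)
dim3 grid(384 / block.x, 384 / block.y);  // one 2D slab of threads for the 384^3 case
processVolume<<<grid, block>>>(d_in, d_out, 384);
```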

1. In all cases I get better performance with 32x16 than with 16x32. If they are the very same size, what is the difference?

2. With every volume size I get better performance using 32x16 blocks, except with the 128x128x128 volume, where the 16x16 block size is better.

As I decrease the volume size, the performance difference between 32x16 and 16x16 also shrinks, until at 128x128x128 the 16x16 blocks win.

The kernel function uses 29 registers, uses only global memory, has no branches and just one loop, and the memory accesses are correctly coalesced. The card is an NVIDIA GTX 295. Can anyone explain the performance differences?
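For reference, the access pattern looks roughly like this (a minimal sketch, not the real kernel; the computation is a placeholder):

```
__global__ void processVolume(const float* in, float* out, int n)
{
    // x varies fastest in memory, so threadIdx.x maps to consecutive
    // addresses and each row of the block reads a contiguous segment.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // The single loop walks the volume along z.
    for (int z = 0; z < n; ++z) {
        int idx = (z * n + y) * n + x;
        out[idx] = 0.5f * in[idx];  // placeholder computation
    }
}
```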

Max of 32 warps per multiprocessor, just my guess.

As far as I understand, 32x16 and 16x32 end up with the very same block size (just not the same shape). How is that related to the warp size?
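To make "shape" concrete: threads in a block are linearized in row-major order (tid = threadIdx.y * blockDim.x + threadIdx.x) and then grouped into warps of 32, so a quick host-side sketch shows which threads end up sharing warp 0 in each case:

```
#include <cstdio>

int main()
{
    const int shapes[2][2] = { {32, 16}, {16, 32} };  // {blockDim.x, blockDim.y}
    for (int s = 0; s < 2; ++s) {
        const int bx = shapes[s][0];
        int xmax = 0, ymax = 0;
        for (int lane = 0; lane < 32; ++lane) {  // the 32 lanes of warp 0
            int x = lane % bx;                   // threadIdx.x of this lane
            int y = lane / bx;                   // threadIdx.y of this lane
            if (x > xmax) xmax = x;
            if (y > ymax) ymax = y;
        }
        printf("block %2dx%2d: warp 0 covers x = 0..%d, y = 0..%d\n",
               bx, shapes[s][1], xmax, ymax);
    }
    // Prints:
    //   block 32x16: warp 0 covers x = 0..31, y = 0..0   (one full row)
    //   block 16x32: warp 0 covers x = 0..15, y = 0..1   (two half rows)
    return 0;
}
```

So with 32x16 every warp reads one contiguous 32-element run, while with 16x32 each warp touches two 16-element runs a full volume row apart. On the GTX 295, though, coalescing is decided per half-warp, and each half-warp is still contiguous in both cases, which is why I don't see how the shape alone explains the difference.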