Some block sizes cause a huge performance loss. Explanation?

Hi,

I evaluated one of my kernels just for fun.
It does a 2D texture fetch from a cudaArray and writes the result into global memory.
The cudaArray is 3872x2592 and the texture is of type float.
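
Roughly, the kernel looks like this (a simplified sketch, not my exact code; the names are just illustrative):

// Texture reference, bound to the 3872x2592 cudaArray via cudaBindTextureToArray.
texture<float, 2, cudaReadModeElementType> tex;

__global__ void copyTexToGlobal(float* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
    {
        // One 2D texture fetch, one write to linear (row-major) global memory.
        out[y * width + x] = tex2D(tex, x + 0.5f, y + 0.5f);
    }
}

The grid is sized so that each block size listed below covers the whole 3872x2592 image.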

I just tried different block sizes and measured the performance.

Can somebody explain the huge performance differences between some block sizes, especially when the x and y dimensions are swapped, e.g. 32x1 vs. 1x32?
I’m just curious :)

Here are the results:

X/Y-Dim    Time in ms
1x64       27.426782
64x1        2.543981
1x128      37.969942
128x1       2.422973
1x256      44.608599
256x1       2.476707
1x512      53.529848
512x1       3.162500
2x32       18.674770
32x2        2.423880
2x64       19.494978
64x2        2.356946
2x128      22.072669
128x2       2.391780
2x256      26.533794
256x2       3.040203
4x16       15.643937
16x4        2.426352
4x32       17.852010
32x4        2.324094
4x64       17.716129
64x4        2.362986
4x128      16.990306
128x4       3.001795
8x8        15.620242
8x16       16.027394
16x8        2.377708
8x32       16.674307
32x8        2.366984
8x64       16.616040
64x8        3.069006
16x16       2.466906
16x32       3.245966
32x16       3.059281

Thank you!

Wide, flat blocks (e.g. 64x2) seem to be faster than narrow, tall ones (like 2x64). I remember someone from NVIDIA stating this as well.

In this case, isn’t it due to global memory coalescing? And maybe the texture cache is larger in X than in Y direction.
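
Just to make the coalescing argument concrete, here is a small host-side illustration of the byte offsets one warp touches in a row-major float array of width 3872 (only a sketch of the arithmetic, nothing more):

#include <stdio.h>

int main(void)
{
    const int width = 3872;   /* image width from the original post */
    for (int lane = 0; lane < 32; ++lane)
    {
        /* 64x1 block: lane i has threadIdx.x = i, so consecutive 4-byte floats */
        long wideByte = 4L * lane;
        /* 1x64 block: lane i has threadIdx.y = i, so offsets a full row apart */
        long tallByte = 4L * lane * width;
        printf("lane %2d: 64x1 -> byte %6ld, 1x64 -> byte %8ld\n",
               lane, wideByte, tallByte);
    }
    return 0;
}

With 64x1 the 32 lanes of a warp cover 128 contiguous bytes, which the hardware can combine into very few transactions; with 1x64 consecutive lanes are 15488 bytes apart, so every access ends up in its own transaction, and the texture fetches are spread over many rows as well, which hurts 2D cache locality.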

Thanks, wumpus.
Sounds plausible.
I couldn’t find anything about it in the programming guide, so I’m hoping an NVIDIA engineer can verify this.

And this is not only the case for texture accesses:
I get similar performance differences when simply reading from an array.
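
What I mean is a plain copy along these lines (again just a sketch, not my exact code):

__global__ void copyGlobal(const float* in, float* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
    {
        // With Nx1 blocks a warp reads/writes consecutive elements of a row;
        // with 1xN blocks the accesses are a full row apart, so neither the
        // loads nor the stores coalesce.
        out[y * width + x] = in[y * width + x];
    }
}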