3D transpose ... and memory coalescing


I have to transpose a 3D volume along a specified direction.
You can think of it as a cube that has to be rotated to the left, or to the front.

I’m using the same implementation as the 2D transpose in the SDK, with blocks of 8x8x8 threads (= 512, which is the max).

As the blocks are only 8 threads wide, I was wondering whether the reads and writes are coalesced, and if not, whether there is a way to coalesce them?

Another general question:

If the width of a 2D image is not a multiple of 16, the beginning of the memory a block accesses won’t be (begin + n*16) but (begin + n*16 + m*width), so HalfWarpBaseAddress - BaseAddress won’t be a multiple of 16, is that right? If so, will the reads and writes still be coalesced?


This is what cudaMallocPitch (cuMemAllocPitch in the driver API) is for. If you do a pitch allocation, the function will pad each row of your image to a width in bytes that guarantees coalescing as you move from one row to the next.

It slightly complicates the image addressing, but the performance benefits are worth it.
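To make the addressing concrete, here is a minimal sketch of a pitched allocation. The kernel name addOne and the host helper pitchedRoundTrip are made up for illustration, and the 16x16 block size is just an assumption; only cudaMallocPitch, cudaMemcpy2D and cudaFree are real runtime calls. The point is that the pitch is in bytes, so each row pointer is computed through a char* before indexing the pixel.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Adds 1.0f to every pixel of a pitched image.
// Row y starts at base + y * pitch, where pitch is in BYTES, so every row
// start keeps the alignment chosen by cudaMallocPitch whatever the width is.
__global__ void addOne(float* img, size_t pitch, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        float* row = (float*)((char*)img + y * pitch);
        row[x] += 1.0f;
    }
}

// Hypothetical helper: sends a width x height image through a pitched
// allocation and returns true if every pixel comes back incremented by one.
bool pitchedRoundTrip(int width, int height)
{
    std::vector<float> host(static_cast<size_t>(width) * height);
    for (size_t i = 0; i < host.size(); ++i) host[i] = static_cast<float>(i);

    const size_t rowBytes = width * sizeof(float);  // unpadded row size
    float* dev = nullptr;
    size_t pitch = 0;                               // padded row size, set by the runtime
    if (cudaMallocPitch((void**)&dev, &pitch, rowBytes, height) != cudaSuccess)
        return false;

    // cudaMemcpy2D converts between the tightly packed host rows and the
    // padded device rows.
    cudaError_t err = cudaMemcpy2D(dev, pitch, host.data(), rowBytes,
                                   rowBytes, height, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        dim3 block(16, 16);
        dim3 grid((width + 15) / 16, (height + 15) / 16);
        addOne<<<grid, block>>>(dev, pitch, width, height);
        err = cudaMemcpy2D(host.data(), rowBytes, dev, pitch,
                           rowBytes, height, cudaMemcpyDeviceToHost);
    }
    cudaFree(dev);
    if (err != cudaSuccess) return false;

    for (size_t i = 0; i < host.size(); ++i)
        if (host[i] != static_cast<float>(i) + 1.0f) return false;
    return true;
}
```

Calling pitchedRoundTrip(100, 37) from main exercises a width that is not a multiple of 16; the returned pitch will normally be larger than 100 * sizeof(float).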

There is a lot of precedent for this concept: “rowBytes” on Mac, “pitch” in DirectX, “image stride” in IPP.

I think that you can express a 3D transpose as a set of 2D transposes of the slices of the volume. In that case you might use 2D blocks. It also depends on which axis corresponds to contiguous addresses.
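A rough sketch of that idea, under a few assumptions of mine: x is the contiguous axis, the volume is tightly packed (no pitch), the transpose swaps x and y inside each z slice, and the hardware supports a 3D grid so that blockIdx.z can pick the slice (on hardware limited to 2D grids you would fold the slice index into blockIdx.y instead). The names transposeXY and checkTransposeXY are invented for the example. It uses 16x16 thread blocks with a shared-memory tile, in the spirit of the SDK 2D transpose, so that both the read and the write run along the contiguous axis.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 16

// in : nz slices, each ny rows of nx floats (x contiguous)
// out: nz slices, each nx rows of ny floats (y contiguous)
__global__ void transposeXY(const float* in, float* out, int nx, int ny)
{
    // +1 column of padding avoids shared-memory bank conflicts on the
    // transposed read below.
    __shared__ float tile[TILE][TILE + 1];

    size_t slice = (size_t)blockIdx.z * nx * ny;  // offset of this z slice

    // Read: threadIdx.x runs along x, the contiguous axis of the input.
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < nx && y < ny)
        tile[threadIdx.y][threadIdx.x] = in[slice + (size_t)y * nx + x];

    __syncthreads();

    // Write: block indices are swapped so that threadIdx.x now runs along
    // the input y axis, which is the contiguous axis of the output.
    int ox = blockIdx.y * TILE + threadIdx.x;  // index along input y
    int oy = blockIdx.x * TILE + threadIdx.y;  // index along input x
    if (ox < ny && oy < nx)
        out[slice + (size_t)oy * ny + ox] = tile[threadIdx.x][threadIdx.y];
}

// Hypothetical helper: transposes a test volume on the GPU and compares
// against the definition out[z][x][y] == in[z][y][x].
bool checkTransposeXY(int nx, int ny, int nz)
{
    const size_t n = (size_t)nx * ny * nz;
    std::vector<float> hin(n), hout(n, -1.0f);
    for (size_t i = 0; i < n; ++i) hin[i] = static_cast<float>(i);

    float *din = nullptr, *dout = nullptr;
    if (cudaMalloc((void**)&din, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc((void**)&dout, n * sizeof(float)) != cudaSuccess) {
        cudaFree(din);
        return false;
    }

    cudaError_t err = cudaMemcpy(din, hin.data(), n * sizeof(float),
                                 cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        dim3 block(TILE, TILE);
        dim3 grid((nx + TILE - 1) / TILE, (ny + TILE - 1) / TILE, nz);
        transposeXY<<<grid, block>>>(din, dout, nx, ny);
        err = cudaMemcpy(hout.data(), dout, n * sizeof(float),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(din);
    cudaFree(dout);
    if (err != cudaSuccess) return false;

    for (int z = 0; z < nz; ++z)
        for (int y = 0; y < ny; ++y)
            for (int x = 0; x < nx; ++x)
                if (hout[((size_t)z * nx + x) * ny + y] !=
                    hin[((size_t)z * ny + y) * nx + x])
                    return false;
    return true;
}
```

If the direction you need swaps the two non-contiguous axes (y and z here), no tile is needed at all: whole x rows just move to a different place, so a plain row copy is already coalesced. A swap involving x and z can reuse the same kernel pattern with the tile spanning x and z instead of x and y.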

The same thing was discussed a while ago: http://forums.nvidia.com/index.php?showtopic=50446&hl=