Has anyone developed a 3D transpose kernel?
I need to write to an array in XYZ order, which is not coalesced, but if I could write it in ZYX order it would be.
So what I need is to write in ZYX order, then do a fast coalesced transpose to bring it back into XYZ form so that the rest of the program can execute.
I realize it's analogous to the transpose in the SDK, but with blockDim.z <= 64 and gridDim.z fixed at 1, it looks like address calculation hell.
So if anyone has already got one working, would you be so kind as to post it?
Edit: after thinking about it for more than 2 seconds, I think Y separate 2D transposes (one per Y slice) would accomplish what I'm looking for.
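
For anyone landing here later, a minimal sketch of that idea: for each fixed y, swapping X and Z is just a 2D transpose of one plane, so the SDK-style shared-memory tile approach carries over. The layouts are my assumption, since the post does not spell them out: "ZYX" means z is the fastest-varying index and "XYZ" means x is. The names transposeXZ, transposeZYXtoXYZ and TILE are made up for illustration. The y slice is folded into blockIdx.y so that gridDim.z is never needed.

```cuda
#include <cuda_runtime.h>

#define TILE 16

// Assumed layouts (not stated in the post):
//   in  is "ZYX": z fastest, index = z + nz * (y + ny * x)
//   out is "XYZ": x fastest, index = x + nx * (y + ny * z)
__global__ void transposeXZ(const float* in, float* out, int nx, int ny, int nz)
{
    // +1 column of padding avoids shared-memory bank conflicts
    __shared__ float tile[TILE][TILE + 1];

    // blockIdx.y encodes both the x-tile index and the y slice,
    // so the kernel works with gridDim.z == 1.
    int tilesX = (nx + TILE - 1) / TILE;
    int y  = blockIdx.y / tilesX;
    int x0 = (blockIdx.y % tilesX) * TILE;
    int z0 = blockIdx.x * TILE;

    // Coalesced read: threadIdx.x runs along z, which is contiguous in `in`.
    int z = z0 + threadIdx.x;
    int x = x0 + threadIdx.y;
    if (z < nz && x < nx)
        tile[threadIdx.y][threadIdx.x] = in[z + nz * (y + ny * x)];

    __syncthreads();

    // Coalesced write: threadIdx.x now runs along x, contiguous in `out`.
    x = x0 + threadIdx.x;
    z = z0 + threadIdx.y;
    if (x < nx && z < nz)
        out[x + nx * (y + ny * z)] = tile[threadIdx.x][threadIdx.y];
}

// Hypothetical host helper: copy in, launch, copy out.
// Error checking is omitted to keep the sketch short.
void transposeZYXtoXYZ(const float* h_in, float* h_out, int nx, int ny, int nz)
{
    size_t bytes = sizeof(float) * nx * ny * nz;
    float *d_in = 0, *d_out = 0;
    cudaMalloc((void**)&d_in, bytes);
    cudaMalloc((void**)&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    // grid.y = (number of x tiles) * ny; this must stay within the
    // gridDim.y limit of the device (65535).
    dim3 grid((nz + TILE - 1) / TILE, ((nx + TILE - 1) / TILE) * ny);
    transposeXZ<<<grid, block>>>(d_in, d_out, nx, ny, nz);

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}
```

In a real program the data would already live on the device, so only the kernel launch matters; the host helper is there just to make the sketch easy to try. Indices are plain int, which is fine as long as nx * ny * nz fits in 32 bits.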