I have implemented a 2D/3D rotation for cubes from 32^3 to 512^3, but I have some problems with performance.
Is there any faster method than a normal rotation (meaning: for each pixel, do the multiplication and the interpolation with the 6 connected pixels)? At the moment I need roughly 1.51 sec for one rotation of a 256x256x256 float cube, and I think that is really slow. I am on a Fedora Core 6 machine with a Quadro FX 5600 in it.
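For reference, the per-voxel approach described above can be sketched on the CPU as follows. This is only a minimal sketch under assumptions: the names `idx`, `sampleTrilinear` and `rotateVolume` are hypothetical, the volume is stored with x varying fastest, the rotation is about the cube centre, and the interpolation is the usual 8-corner trilinear blend rather than the 6-neighbour scheme mentioned in the post. It loops over output voxels and maps each one back into the source (inverse mapping), so every output voxel is written exactly once.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: flat index into an n*n*n volume, x varying fastest.
inline std::size_t idx(int x, int y, int z, int n) {
    const std::size_t N = static_cast<std::size_t>(n);
    return static_cast<std::size_t>(x) +
           N * (static_cast<std::size_t>(y) + N * static_cast<std::size_t>(z));
}

// Trilinear sample at (fx, fy, fz); positions outside the volume read as 0.
float sampleTrilinear(const std::vector<float>& v, int n,
                      float fx, float fy, float fz) {
    const float hi = static_cast<float>(n - 1);
    if (fx < 0.0f || fy < 0.0f || fz < 0.0f || fx > hi || fy > hi || fz > hi)
        return 0.0f;
    // Coordinates are non-negative here, so truncation equals floor.
    const int x0 = static_cast<int>(fx);
    const int y0 = static_cast<int>(fy);
    const int z0 = static_cast<int>(fz);
    const int x1 = std::min(x0 + 1, n - 1);
    const int y1 = std::min(y0 + 1, n - 1);
    const int z1 = std::min(z0 + 1, n - 1);
    const float tx = fx - x0, ty = fy - y0, tz = fz - z0;
    auto lerp = [](float a, float b, float t) { return a + t * (b - a); };
    // Blend along x on the four cube edges, then along y, then along z.
    const float c00 = lerp(v[idx(x0, y0, z0, n)], v[idx(x1, y0, z0, n)], tx);
    const float c10 = lerp(v[idx(x0, y1, z0, n)], v[idx(x1, y1, z0, n)], tx);
    const float c01 = lerp(v[idx(x0, y0, z1, n)], v[idx(x1, y0, z1, n)], tx);
    const float c11 = lerp(v[idx(x0, y1, z1, n)], v[idx(x1, y1, z1, n)], tx);
    return lerp(lerp(c00, c10, ty), lerp(c01, c11, ty), tz);
}

// Rotate src by the row-major 3x3 rotation matrix r about the cube centre.
std::vector<float> rotateVolume(const std::vector<float>& src, int n,
                                const std::array<float, 9>& r) {
    assert(src.size() == static_cast<std::size_t>(n) * n * n);
    std::vector<float> dst(src.size(), 0.0f);
    const float c = 0.5f * static_cast<float>(n - 1);
    for (int z = 0; z < n; ++z)
        for (int y = 0; y < n; ++y)
            for (int x = 0; x < n; ++x) {
                const float dx = x - c, dy = y - c, dz = z - c;
                // The inverse of a rotation matrix is its transpose.
                const float sx = r[0] * dx + r[3] * dy + r[6] * dz + c;
                const float sy = r[1] * dx + r[4] * dy + r[7] * dz + c;
                const float sz = r[2] * dx + r[5] * dy + r[8] * dz + c;
                dst[idx(x, y, z, n)] = sampleTrilinear(src, n, sx, sy, sz);
            }
    return dst;
}
```

On a GPU the three loops would typically become the thread grid, with one thread per output voxel; the sketch is meant as a correctness reference, not as a statement about how the timings in this thread were obtained.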
Maybe the article “A Fast Algorithm for General Raster Rotation” from the book Graphics Gems I helps. The papers at http://www-users.cs.umn.edu/~baoquan/papers/rot.pdf and http://www-users.cs.umn.edu/~baoquan/papers/rot2p.pdf could also help.
1.51sec is a lot of time. I suspect all your memory reads and/or writes are scattered (not coalesced). Maybe optimizing the algorithm towards coalesced memory access will increase your speed.
Also it is said that one of the next CUDA releases will support 3D textures (which as I understand it should support the interpolation implicitly). That’s another option.
I’ll have a look at them, but this looks like a normal cube rotation, so let me explain my problem a little more.
I have a cube of 512x512x512 floats. The cube holds my data points (512x512x512 of them), and this data should be rotated. Right now I start threads and each thread calculates one line of the cube. The question is whether there is a better way. If I do this with the shearing method, I think I have to hold a cube bigger than 512x512x512 so that I do not lose too much data, or am I wrong?
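With one thread per line, the axis the line runs along decides whether neighbouring threads touch neighbouring addresses when they write the output cube. The small sketch below illustrates that, assuming a flat layout with x varying fastest; the function names are hypothetical and the post does not say which way its lines actually run.

```cpp
#include <cstddef>

// Thread t owns the line y = t and walks along x: at step k it touches
// element t * n + k of an n-wide slice.
constexpr std::size_t offsetLineAlongX(std::size_t t, std::size_t k,
                                       std::size_t n) {
    return t * n + k;
}

// Thread t owns the column x = t and walks along y: at step k it touches
// element k * n + t of the same slice.
constexpr std::size_t offsetLineAcrossX(std::size_t t, std::size_t k,
                                        std::size_t n) {
    return k * n + t;
}

// At the same step, threads 0 and 1 are n floats apart in the first mapping
// and exactly one float apart in the second.
static_assert(offsetLineAlongX(1, 0, 256) - offsetLineAlongX(0, 0, 256) == 256,
              "lines along x: neighbouring threads are a full row apart");
static_assert(offsetLineAcrossX(1, 0, 256) - offsetLineAcrossX(0, 0, 256) == 1,
              "lines across x: neighbouring threads touch adjacent floats");
```

The second mapping is the access pattern that coalescing needs for the writes. The reads from the source cube follow the rotated coordinates, so they stay scattered under either mapping.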
After some brainstorming I got it to run in 0.004 sec for 64x64x64 and 0.204 sec for 256x256x256. Are these good times, or should it be possible to run faster than that? (rotation + trilinear interpolation)
OK, I have done some work on it and now I have 0.004 sec for 64x64x64 and 0.204 sec for 256x256x256. The 512 cube is out because cuFFT does not handle such a big cube on one FX 5600.
Looks better to me. Do you think there could be some improvement?