I’m going for it!
The Background:
Considering I learned C last friday (I’m fluent in MATLAB and FORTRAN), I think I’ve made a lot of progress… just please bear with me when I do the occasional stupid thing! I implemented a 3D CUFFT routine and have wrapped it all up into a matlab mexfile, and made it do what I want.
However, it’s too slow  I’m doing many fairly small FFTs (64^3) so although the CUFFT routine itself is parallelised within the GPU, I get little benefit because the execution is essentially serial. Comparably, FFTW works faster with a single core implementation  I presume because the size of the data is small enough to fit on the CPU’s L2 cache.
So, I’ve realised that what I need is a Batched 3D FFT routine, so I can do my many 64^3 FFTs. I’ve also realised that I’m going to have to manup and write it myself!
I’ve learned a lot from Jim Hardwick’s [topic=“34241”]excellent Batched 2D CUFFT implementation[/topic] and the various discussions on transposing data (e.g. [topic=“87178”]LSChien’s 3d indexing posts[/topic] ).
The Big Ask:
It’d be really great if there are any experienced users who can cast their eyes over my ‘plan’ (2 minutes!!) to see if there are any major problems ahead, before I dive in.
Also, I’m getting hung up on the 3D array transpose operations. Has anybody got any code to do them already? What do people think of my tiling idea? I’ve got a decent grasp of 2D transforms (with the help of transposeNew in the SDK) but I’m worried about efficient ways of indexing the 3D data into 2D slices before making a series of 2D transposes.
The Plan:

Interleave data so I’m doing complextocomplex transforms. This isn’t quite what I want, but making things complex makes it simpler!! :blink:(!!) .

Create a linear array of the data, N1N2N3*Nbatches

Execute N2N3Nbatch batched 1D FFT transforms, inplace, along the first dimension (individual fft size is N1).

Transpose (permute) the 3D arrays An(i,j,k) > An(j,k,i) using Nbatches tiles  so that the batches of data remain separate, where a tile is of size N1N2N3

Execute N1N3Nbatch batched 1D FFT transforms, inplace, along the ‘first’ dimension (the j direction after the first transpose). Individual FFT size is N2.

Transpose (permute) the 3D arrays An(j,k,i) > An(k,i,j) again using tiling.

Execute N1N2Nbatch batched 1D FFT transforms, inplace, along the ‘first’ dimension (the k direction after the second transpose). Individual FFT size is N3.

Transpose (permute) the 3D arrays An(k,i,j) > An(i,j,k) again using tiling to return result to the correctly ordered output.
The Goal:
When finished, I aim to post code here (I’m not commercially constrained at all) for a batched 3D CUFFT implementation, with an API consistent with CUFFT.
This will include code for reasonably efficient 3D array permutation, which could be very useful for others who aren’t interested in the FFT bit.
– Thanks in advance for any comments… Am I loopy? Is there anything you would do differently? Will this be useful for others?
Cheers!
Tom Clark
Edit 1: 1359 GMT, 11 Dec 2009
I’d messed up my description of the ‘direction’ of the ffts. Revised to make sense.