To answer the first question: not entirely. I have replaced the cuFFT calls with calls to Volkov's FFTxxx functions, and performance improved significantly. My code, which is a sequence of 3 x (kernel, FFT), ran in 15.8 ms using cuFFT and in 8.9 ms using Volkov's FFT. (The job of the kernels is to shuffle data around in order to build the arrays of input vectors for the batched FFTs.) The three FFTs by themselves (60x8x8 FFT512 + 512x60x8 FFT8 + 512x60x8 FFT8) now take 1.4 ms, compared to 8 ms using cuFFT.

My understanding is that a call to FFTxxx( float2 *work, int batch ) distributes the FFTs of the batch of length-xxx vectors over its own threads and blocks, using an FFTxxx_device( float2 *work ) kernel. In this kernel, the work pointer is recalculated and then some magic happens that is beyond my comprehension. For example, I do not quite see how the FFT of a given vector from the batch is distributed across the threads, or which threads work on that vector.
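To make the question concrete, the sketch below spells out one plausible thread-to-data mapping for FFT512. It is an assumption, not something I have confirmed in Volkov's source: one 64-thread block per input vector, the block index selecting the vector, and each thread holding 8 elements read with a stride of 64. It is plain C++ index arithmetic so it can be checked on the host; `block` and `tid` stand in for the block index and `threadIdx.x`, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// ASSUMED launch shape for a batch of 512-point FFTs (hypothetical, not
// verified against Volkov's code): one block per input vector, 64 threads
// per block, 8 complex elements held in registers by each thread.
constexpr int kFftLen = 512;
constexpr int kThreadsPerBlock = 64;
constexpr int kElemsPerThread = kFftLen / kThreadsPerBlock;  // 8

// Global offset read by register j (0..7) of thread `tid` in block `block`:
// the pointer is first moved to the start of this block's vector plus tid,
// then the 8 elements are read with a stride of 64.
int global_offset(int block, int tid, int j) {
    assert(block >= 0);
    assert(tid >= 0 && tid < kThreadsPerBlock);
    assert(j >= 0 && j < kElemsPerThread);
    return block * kFftLen + tid + j * kThreadsPerBlock;
}

// Which vector of the batch, and which element inside it, an offset addresses.
int vector_of(int offset) { return offset / kFftLen; }
int element_of(int offset) { return offset % kFftLen; }

// True if, for `batch` blocks, every element of every vector is read exactly
// once, and only by the block whose index equals the vector index.
bool mapping_is_partition(int batch) {
    std::vector<int> hits(static_cast<std::size_t>(batch) * kFftLen, 0);
    for (int b = 0; b < batch; ++b)
        for (int t = 0; t < kThreadsPerBlock; ++t)
            for (int j = 0; j < kElemsPerThread; ++j) {
                const int off = global_offset(b, t, j);
                if (vector_of(off) != b) return false;
                ++hits[static_cast<std::size_t>(off)];
            }
    for (int h : hits)
        if (h != 1) return false;
    return true;
}
```

Under this assumption, vector b of the batch belongs entirely to block b, and thread t starts with elements t, t + 64, ..., t + 448 of it. How the kernel then exchanges values between its threads (presumably through shared memory) is not modeled here.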

The reason I need further improvement is that I have to get the total processing time below 4 ms. In essence, I have a three-dimensional data structure of size KxMxN, say a volume in x-y-z. I need to perform an FFT 1) on all MxN vectors of length K (in the x direction), followed by 2) on all KxM vectors of length N (in the z direction), followed by 3) on all KxN vectors of length M (in the y direction). In order to do that, I need to rearrange the data, and my data-shuffling kernels take 6.1 ms, which is why I need to either:

A) Include the FFTxxx in my kernel:

Take the body of the kernel for a particular FFTxxx and paste it into my own kernel, whose execution configuration (number of threads and blocks) is determined by my data structures. But I have the impression that this does not work.

B) Include the data shuffle in the FFTxxx kernels:

Here I would have to understand how the FFTxxx kernel assembles each input vector from memory, or how to make it gather a given vector, which the FFT expects to be contiguous, directly from my global data structure.
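A minimal sketch of how options A and B could combine for a length-8 pass, assuming a K x M x N volume stored with x fastest (element (x, y, z) at x + K*(y + M*z)) and N = 8. The FFT is written as a function on a small local array, so the calling code, not the FFT, decides which elements to load; here it gathers one z-line directly with a stride of K*M, transforms it, and writes it back in place, with no separate shuffle. This is host-side C++ standing in for CUDA; `fft8`, `Volume` and `fft_z_line` are hypothetical names, and `fft8` is a textbook radix-2 FFT, not Volkov's FFT8.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using cfloat = std::complex<float>;

// Forward, unnormalized 8-point FFT on a thread-local array (in CUDA this
// would be a __device__ function on float2 a[8] kept in registers).
// Radix-2, decimation in time. Hypothetical helper, not Volkov's FFT8.
void fft8(cfloat a[8]) {
    std::swap(a[1], a[4]);  // bit-reversal permutation for 3 bits
    std::swap(a[3], a[6]);
    const float pi = 3.14159265358979f;
    for (int len = 2; len <= 8; len *= 2)
        for (int start = 0; start < 8; start += len)
            for (int k = 0; k < len / 2; ++k) {
                const float ang = -2.0f * pi * k / len;
                const cfloat w(std::cos(ang), std::sin(ang));
                const cfloat u = a[start + k];
                const cfloat t = w * a[start + k + len / 2];
                a[start + k] = u + t;
                a[start + k + len / 2] = u - t;
            }
}

// ASSUMED layout (hypothetical): K x M x N volume, x fastest, then y, then z.
struct Volume {
    int K, M, N;
    std::vector<cfloat> data;
    Volume(int k, int m, int n)
        : K(k), M(m), N(n), data(static_cast<std::size_t>(k) * m * n) {}
    std::size_t index(int x, int y, int z) const {
        return x + static_cast<std::size_t>(K) * (y + static_cast<std::size_t>(M) * z);
    }
};

// Fused "shuffle + FFT" for one z-line (requires N == 8): gather the 8
// elements at (x, y, 0..7), which lie K*M apart, transform them locally and
// scatter the result back in place. In a CUDA kernel each thread could do
// this for its own (x, y), so no shuffling kernel or temporary array is needed.
void fft_z_line(Volume& v, int x, int y) {
    assert(v.N == 8);
    const std::size_t base = v.index(x, y, 0);
    const std::size_t stride = static_cast<std::size_t>(v.K) * v.M;
    cfloat a[8];
    for (int i = 0; i < 8; ++i) a[i] = v.data[base + i * stride];  // gather
    fft8(a);
    for (int i = 0; i < 8; ++i) v.data[base + i * stride] = a[i];  // scatter
}
```

In a CUDA version, giving consecutive threads consecutive x values should keep each of the eight strided reads coalesced across a warp, because neighboring x values are adjacent in memory under this layout. The longer transforms (for example FFT512) are a different matter: if they rely on a fixed number of threads cooperating through shared memory, the surrounding kernel would have to provide that block shape as well.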

Conclusion: I need to understand Volkov's code, or, as the Austrian writer Egon Friedell put it: if you steal a racehorse and want to ride it, your riding skills have to match those of the person who trained it.

Regards,

peter