CUDA FFT running faster than simple copies and magnitude squared

We have two very simple kernels: one fills some data on the device (for the FFT to process) and another calculates the magnitude squared of the FFT output. What's odd is that our kernel routines are taking 50% longer than the FFT itself. My only suspicion is in how we chose the number of threads per block and the number of blocks.

int tpb = 1024;        // threads per block
int fftSize = 1024;    // no relationship to tpb ... just coincidence
int numBatches = 40000;
int dataLength = fftSize * numBatches; // e.g. 1024 * 40000

// Allocate the ComplexShortDevice and cufftComplexDevice buffers of size dataLength
// Allocate floatDeviceBuffer for mag squared output of size dataLength
// Copy complex short host data to the complexShortDeviceBuffer
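For reference, the allocation and copy steps described in the comments above could be sketched roughly as follows. The `ComplexShort` layout and the host pointer name `hostComplexShortData` are assumptions, not taken from the original code:

```cuda
// Assumed layout of the 16-bit complex samples.
struct ComplexShort { short x, y; };

ComplexShort *complexShortDeviceBuffer;
cufftComplex *cufftComplexDeviceBuffer;
float        *floatDeviceBuffer;

// Device buffers, all of length dataLength.
cudaMalloc(&complexShortDeviceBuffer, dataLength * sizeof(ComplexShort));
cudaMalloc(&cufftComplexDeviceBuffer, dataLength * sizeof(cufftComplex));
cudaMalloc(&floatDeviceBuffer,        dataLength * sizeof(float));

// Copy the complex short host data to the device (hostComplexShortData is
// a hypothetical name for the host buffer).
cudaMemcpyAsync(complexShortDeviceBuffer, hostComplexShortData,
                dataLength * sizeof(ComplexShort),
                cudaMemcpyHostToDevice, stream1);
```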

// fill the array – note that num blocks = dataLength/tpb = 40000 in this example
fillFFTArray<<<(dataLength + tpb - 1) / tpb, tpb, 0, stream1>>>(cufftComplexDeviceBuffer, complexShortDeviceBuffer, dataLength);

// Run the fft
cufftExecC2C(plan1, cufftComplexDeviceBuffer, cufftComplexDeviceBuffer, CUFFT_FORWARD);

// Calc mag squared
calcMagnitudeSquared<<<(dataLength + tpb - 1) / tpb, tpb, 0, stream1>>>(floatDeviceBuffer, cufftComplexDeviceBuffer, dataLength);

__global__ void fillFFTArray(cufftComplex *fltData, ComplexShort *shData, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        fltData[i].x = shData[i].x;
        fltData[i].y = shData[i].y;
    }
}

__global__ void calcMagnitudeSquared(float *realData, cufftComplex *cmplxData, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        realData[i] = (cmplxData[i].x * cmplxData[i].x) + (cmplxData[i].y * cmplxData[i].y);
    }
}


How are you measuring the timing?

cudaEventRecord before and after each call. But I also ran NVIDIA's Visual Profiler and it came up with nearly identical results ... more time spent in the two simple kernel routines than in the FFT.
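For anyone following along, a minimal sketch of event-based timing around one of the launches might look like this (the buffer and stream names are taken from the code above; the launch configuration is the usual round-up):

```cuda
// Sketch: time a single kernel launch with CUDA events.
// Events must be recorded into the same stream as the work being timed.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

int numBlocks = (dataLength + tpb - 1) / tpb;

cudaEventRecord(start, stream1);
calcMagnitudeSquared<<<numBlocks, tpb, 0, stream1>>>(floatDeviceBuffer,
                                                     cufftComplexDeviceBuffer,
                                                     dataLength);
cudaEventRecord(stop, stream1);

cudaEventSynchronize(stop);               // block host until stop has occurred
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```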

An FFT can be pretty much bandwidth-limited, which means that if your kernels, which are reading and writing the same amount of data as the FFT, are not well optimized for bandwidth utilization, it is plausible that they run slower than the FFT.

Certainly some suggestions could be made about how to slightly improve the kernels you have shown, but I don't know what, if any, effect they might have without trying it.
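As one example of such a tweak (untested, and only a sketch): the magnitude-squared kernel could process two complex values per thread via a single 128-bit load and write both results with one 64-bit store, which tends to improve bandwidth utilization. This assumes `cmplxData` is 16-byte aligned (cudaMalloc guarantees this for the buffer start) and `count` is even; a real version would need a tail case for odd counts:

```cuda
// Hypothetical widened variant of calcMagnitudeSquared: each thread handles
// two consecutive cufftComplex values read as one float4.
__global__ void calcMagnitudeSquared2(float *realData,
                                      const cufftComplex *cmplxData,
                                      int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // index into pairs
    if (2 * i + 1 < count) {
        // One 16-byte load = two complex values: (x, y) and (z, w).
        float4 v = reinterpret_cast<const float4 *>(cmplxData)[i];
        // One 8-byte store = both magnitude-squared results.
        reinterpret_cast<float2 *>(realData)[i] =
            make_float2(v.x * v.x + v.y * v.y,
                        v.z * v.z + v.w * v.w);
    }
}
```

It would be launched with half as many blocks, e.g. `(dataLength / 2 + tpb - 1) / tpb`. Whether this actually beats the simple version here would have to be measured.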

Your FFT may also be “benefited” by the fact that its data may already be cached, which may not be the case for some of the data that your kernels touch.

If you want to provide a short, complete code, that demonstrates your time calculation as well, that someone could copy, paste, compile and run, without having to add anything or change anything, then perhaps someone would be willing to take a deeper look. Just a suggestion.