Batched 2D FFT implementation

A batched transform would almost certainly be faster for an FFT of that size, but each of the 3096 arrays must be completely independent, since the batch executes the transforms in parallel.

Looping over CUFFT calls is trivial:

[codebox]
float2* arrays;
cudaMalloc((void**)&arrays, sizeof(float2) * 64 * 64 * 3096);

// One 64x64 plan, reused for every array in the loop
cufftHandle fftplan;
cufftPlan2d(&fftplan, 64, 64, CUFFT_C2C);

float2* currentArray = arrays;
for (int i = 0; i < 3096; i++)
{
    // In-place forward transform of the i-th 64x64 array
    cufftExecC2C(fftplan, (cufftComplex*)currentArray,
                 (cufftComplex*)currentArray, CUFFT_FORWARD);
    currentArray += 64 * 64;
}
[/codebox]
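Each execution returns a cufftResult, so it is worth checking for CUFFT_SUCCESS instead of letting an error pass silently. A minimal sketch of what the call inside the loop might look like with a check added:

[codebox]
// Sketch: check the cufftResult returned by each execution
cufftResult res = cufftExecC2C(fftplan, (cufftComplex*)currentArray,
                               (cufftComplex*)currentArray, CUFFT_FORWARD);
if (res != CUFFT_SUCCESS)
{
    fprintf(stderr, "cufftExecC2C failed on array %d: error %d\n", i, res);
}
[/codebox]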

Does anyone know of any hard limits for 2D FFT calls in a loop? I am also calling a 2D FFT 3096 times in a loop, and the values are wrong compared with the original C program. I have scaled the number of calls down to 43 and am still getting incorrect values. I am calling a 4096-point (64x64) FFT, as JimH mentioned above.

[codebox]
int Nx = 64;
int Ny = 64;
int sizez = 3096;

cufftComplex *inz1_d, *outz1_d;
cudaMalloc((void**)&inz1_d,  sizeof(cufftComplex) * Nx * Ny * sizez);
cudaMalloc((void**)&outz1_d, sizeof(cufftComplex) * Nx * Ny * sizez);

// FFT1
cufftHandle plan_forwardz1;
cufftPlan2d(&plan_forwardz1, Nx, Ny, CUFFT_C2C);

// Advance cursor pointers instead of inz1_d/outz1_d themselves, so the
// original allocation pointers stay valid for cudaFree() later
cufftComplex *in_cur  = inz1_d;
cufftComplex *out_cur = outz1_d;
for (int i = 0; i < 43; ++i)
{
    cufftExecC2C(plan_forwardz1, in_cur, out_cur, CUFFT_FORWARD);
    in_cur  += Nx * Ny;  // 64 * 64 == 4096 points per transform
    out_cur += Nx * Ny;
}
[/codebox]

Hi, I just wanted to take a look at jimh’s batched FFT code, but somehow I can’t download it. When I click on either of the links I just get an empty page.

Does anyone know what could be the problem?

Thanks and cheers,

Szilard

Sorry for necroposting.

I used this batch FFT code recently and, since I needed a batched 3D transform too, I updated it slightly. Maybe someone will find it useful (I still cannot understand why nVidia did not include batched 2D and 3D transforms in CUFFT).

Changes (the resulting interface is sketched after this list):

  • added the ability to create 3D batched plans
  • fixed the batchfftDestroy() signature to match cufftDestroy() (it used to take a pointer to the plan as a parameter; now it takes the plan itself)
  • replaced the void* pointers in batchfftExecute() with typed ones (it only takes cufftComplex arrays anyway)
  • moved the transpose() logic to a separate file (it is useful on its own)
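For reference, the resulting interface would look roughly like this. This is a sketch reconstructed from the change list above, not the actual header; the return types and the definition of batchfftHandle are assumptions:

[codebox]
// batchfft.h (sketch reconstructed from the change list, not the real header)
#include <cufft.h>

typedef struct batchfftplan_t* batchfftHandle;  // opaque handle; exact type is an assumption

// 2D and 3D batched plan creation
cufftResult batchfftPlan2d(batchfftHandle* plan, int nx, int ny,
                           cufftType type, int batch);
cufftResult batchfftPlan3d(batchfftHandle* plan, int nx, int ny, int nz,
                           cufftType type, int batch);

// Typed pointers instead of void* (the routine only handles cufftComplex data)
cufftResult batchfftExecute(batchfftHandle plan, cufftComplex* idata,
                            cufftComplex* odata, int sign);

// Takes the plan itself, matching cufftDestroy()
cufftResult batchfftDestroy(batchfftHandle plan);
[/codebox]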

I tried this batched 2D FFT on CUDA 2.2 and a Quadro FX 4600, and I found some errors. My data set is 500 matrices (each one 256x256).
I only did a forward FFT followed by an inverse FFT, then compared the results with the original data set; they should be exactly the same.
I ran this program 32 times on the same data set, and sometimes I got correct results and sometimes not. In other words, the results changed even for the same input data.

I think the problem may be in the transpose function. I tried the first version of the transpose function (the slow version without shared memory), ran the same tests, and I got correct results.

Thanks.

This is not the case for the (64x64) 4096-point 2D FFT, but I think __umul24(n, size) won’t work for big problem sizes, since __umul24() only multiplies the low 24 bits of each operand. Can someone check?
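For illustration, here is a minimal standalone check of that truncation. The kernel and variable names are mine, not from the batch FFT code:

[codebox]
#include <cstdio>
#include <cuda_runtime.h>

// __umul24(a, b) multiplies only the low 24 bits of each operand, so the
// result diverges from a full 32-bit multiply once an operand reaches 2^24.
__global__ void umul24_check(unsigned int a, unsigned int b, unsigned int* out)
{
    out[0] = __umul24(a, b);  // uses only the low 24 bits of a and b
    out[1] = a * b;           // full 32-bit multiply
}

int main()
{
    unsigned int *d_out, h_out[2];
    cudaMalloc((void**)&d_out, 2 * sizeof(unsigned int));
    umul24_check<<<1, 1>>>(1u << 24, 2u, d_out);  // a == 2^24: its low 24 bits are zero
    cudaMemcpy(h_out, d_out, 2 * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    printf("__umul24: %u  full multiply: %u\n", h_out[0], h_out[1]);  // 0 vs 33554432
    cudaFree(d_out);
    return 0;
}
[/codebox]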

I am seeing some discrepancies when using the batched 2D FFT implementation. I am using three batched 2D FFT transforms (two forward and one inverse).

The inverse FFT always comes out wrong.

I am calling the inverse transform like this:

[codebox]
batchfftHandle batchplan3;
batchfftPlan2d(&batchplan3, pix1, pix2, CUFFT_C2C, count);
batchfftExecute(batchplan3, f3_d, out1_d, -1);
batchfftDestroy(&batchplan3);
[/codebox]

Ideal output should look like:

[codebox]
xcout[0]  131552918896640.000000  -23512.000000
xcout[1]  131532299698176.000000  -30208.000000
xcout[2]  131527501414400.000000  -43808.000000
xcout[3]  131527518191616.000000  -61582.000000
xcout[4]  131527585300480.000000  -95952.000000
xcout[5]  131527618854912.000000  -124826.000000
xcout[6]  131527652409344.000000  -129283.500000
xcout[7]  131527618854912.000000  -113349.000000
xcout[8]  131527602077696.000000  -54648.000000
[/codebox]

but the inverse FFT using the batch call looks like this:

[codebox]
xcout[0]  0.000000  4.496094
xcout[1]  0.000000  4.498047
xcout[2]  0.000000  4.495117
xcout[3]  0.000000  4.495117
xcout[4]  0.000000  4.496094
xcout[5]  0.000000  4.495117
xcout[6]  0.000000  4.495117
xcout[7]  0.000000  4.498047
xcout[8]  0.000000  4.495117
[/codebox]

Any suggestions?


I haven’t checked it very thoroughly, but putting a __syncthreads() in the transpose kernel after the odata values are written might prevent the shared-memory values from being overwritten before they have all been written out to odata.
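For reference, here is a minimal tiled transpose with the barriers in place. This is a sketch, not jimh’s actual kernel; the second __syncthreads() is the one suggested above, and it only matters when the shared tile is reused (e.g. in a loop over tiles):

[codebox]
#define BLOCK_DIM 16

// Tiled transpose sketch: load a tile into shared memory, then write it back
// transposed. Assumes width and height are multiples of BLOCK_DIM.
__global__ void transpose(float2* odata, const float2* idata,
                          int width, int height)
{
    __shared__ float2 tile[BLOCK_DIM][BLOCK_DIM + 1];  // +1 avoids bank conflicts

    int x = blockIdx.x * BLOCK_DIM + threadIdx.x;
    int y = blockIdx.y * BLOCK_DIM + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = idata[y * width + x];

    __syncthreads();  // tile must be fully loaded before any thread reads it

    x = blockIdx.y * BLOCK_DIM + threadIdx.x;
    y = blockIdx.x * BLOCK_DIM + threadIdx.y;
    odata[y * height + x] = tile[threadIdx.x][threadIdx.y];

    __syncthreads();  // the suggested barrier: keep the tile intact until
                      // every thread has written its odata value
}
[/codebox]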

Perhaps you can’t use it, but just in case: the new version of CUFFT has built-in support for batched 2D (and 3D) FFTs.
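Assuming that refers to the cufftPlanMany() interface, a minimal sketch with the 64x64-by-3096 dimensions from the earlier posts (error checking omitted):

[codebox]
#include <cufft.h>

// Batched 2D C2C: 3096 independent 64x64 FFTs in a single call
int n[2] = { 64, 64 };   // dimensions of each transform
int batch = 3096;

cufftComplex* data;
cudaMalloc((void**)&data, sizeof(cufftComplex) * 64 * 64 * batch);

cufftHandle plan;
cufftPlanMany(&plan, 2, n,
              NULL, 1, 64 * 64,   // inembed NULL: default contiguous layout
              NULL, 1, 64 * 64,   // onembed NULL: default contiguous layout
              CUFFT_C2C, batch);

cufftExecC2C(plan, data, data, CUFFT_FORWARD);  // in-place, all 3096 at once
cufftDestroy(plan);
cudaFree(data);
[/codebox]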

Hi All,
I’ve managed to use Jim’s batched FFT routines to implement my Maximum Entropy image reconstructions on a GTX 280 and a Tesla C1060. Initially I was hoping to get a speedup over the CPU by having an FPGA designed, but the GPGPUs provided orders-of-magnitude improvements, which I’m really happy with!

However, since using this code I’ve wondered whether it could be adapted so that the user passes not only the number of images and their dimensions but also a vector of pointers/addresses, one per image. Currently the image data has to be stored contiguously. Because the Maximum Entropy calculation is iterative, not all images converge at the same iteration. For example, if I divide a 128x128 image into 64 sub-images of 16x16 pixels, and 10 of the sub-images converge at iteration 12, then the remaining 54 sub-images have to be packed consecutively to use the batch FFT routines here.

If one could pass the 54 start locations of the remaining sub-images within the original block of 64 consecutive images, that would save a lot of processing time, I think. Then one wouldn’t have to re-pack the 54 sub-images.
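For context, the re-packing step being avoided might look roughly like this. The buffer and index names are hypothetical:

[codebox]
// Re-pack the not-yet-converged sub-images into a contiguous buffer so the
// batched FFT can run on them (sketch; names are hypothetical)
int subSize = 16 * 16;  // elements per sub-image

for (int k = 0; k < num_remaining; ++k)
{
    // remaining[k] is the index of a sub-image that has not converged
    cudaMemcpy(packed_d + k * subSize,
               images_d + remaining[k] * subSize,
               subSize * sizeof(cufftComplex),
               cudaMemcpyDeviceToDevice);
}
[/codebox]

With a pointer-vector interface the batch routines could read each sub-image in place, and this copy loop would disappear.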

Since the cufft batch routines don’t have this capability, I presume this is not readily achievable, but it might be possible. Any comments appreciated.

Shawn