A batch would almost certainly be faster for that size FFT. But each of the 3096 arrays must be completely independent since the FFT is run in parallel.
Does anyone know of any hard limits on calling a 2D FFT in a loop? I am also calling a 2D FFT 3096 times in a loop, and the values are wrong compared with the original C program. I have scaled the number of calls down to 43 and I am still getting incorrect values. I am calling a 4096-point FFT, as JimH mentioned above.
[codebox]
int Nx = 64;
int Ny = 64;
int sizez = 3096;
int i;
cufftComplex *inz1_d, *outz1_d;
cudaMalloc((void**) &inz1_d, sizeof(cufftComplex) * Nx * Ny * sizez);
cudaMalloc((void**) &outz1_d, sizeof(cufftComplex) * Nx * Ny * sizez);
// FFT1
cufftHandle plan_forwardz1;
cufftPlan2d(&plan_forwardz1, Nx, Ny, CUFFT_C2C);
for (i = 0; i < 43; ++i)
{
    cufftExecC2C(plan_forwardz1, inz1_d, outz1_d, CUFFT_FORWARD);
    inz1_d += 4096;   // advance by Nx * Ny elements; note this loses the
    outz1_d += 4096;  // original pointers needed later for cudaFree()
}
[/codebox]
Hi, I just wanted to take a look at jimh’s batched FFT code, but somehow I can’t download it. When I click on either of the links I just get an empty page.
I used this batch FFT code recently and, since I needed a 3D batch transform too, I updated it slightly. Maybe someone will find it useful (I still cannot understand why nVidia did not include batched 2D and 3D transforms in CUFFT).
Changes:
- added the ability to create 3D batched plans
- fixed the batchfftDestroy() signature to match cufftDestroy() (it took a pointer to the plan as a parameter; now it takes the plan itself)
- replaced the void* pointers with typed ones in batchfftExecute() (it only takes cufftComplex arrays anyway)
- moved the transpose() logic to a separate file (it is useful on its own)
I tried this batched 2D FFT on CUDA 2.2 and a Quadro FX4600, and I found some errors. My data set is 500 matrices (each one 256x256).
I only did a forward FFT followed by an inverse FFT, then compared the results with the original data. They should be exactly the same.
I ran this program 32 times on the same data set, and sometimes I got correct results and sometimes not. In other words, the results changed even for identical input.
The problem may be in the transpose function. I tried the first version of the transpose function (the slow version without shared memory), ran the same tests, and got correct results.
I haven’t checked it very thoroughly, but putting a __syncthreads() in the transpose kernel after writing the odata array values might prevent the computed shared-memory values from being overwritten before odata is updated.
Hi All,
I’ve managed to use Jim’s batched FFT routines to implement my Maximum Entropy image reconstructions on a GTX 280 and a Tesla C1060. Initially I was hoping to get a speedup over the CPU by having an FPGA designed, but the GPGPUs provided orders-of-magnitude improvements, which I’m really happy with!
However, since using this code I’ve wondered whether it could be adapted so that the user passes the number of images and their dimensions, but additionally passes a vector of pointers/addresses, one per image. Currently the image data has to be stored contiguously. Because the Maximum Entropy calculation is iterative, not all images converge at the same iteration number. For example, if I divide a 128x128 image into 64 sub-images of 16x16 pixels, and say 10 subimages’ calculations converge at iteration 12, then the remaining 54 subimages have to be packed consecutively to use the batch FFT routines here.
If one could instead pass the 54 start locations of the remaining subimages within the original block of 64 consecutive images, that would save much processing time, I think!? Then one wouldn’t have to re-pack the 54 subimages, etc…
Since the cufft batch routines don’t have this capability, I presume it is not readily achievable, but it might be possible, I guess? Any comments appreciated.