CUFFT error: 3D batched C2R transforms, with simple test code

Hi,

I’m having problems executing 3D batched C2R transforms with CUFFT under some circumstances. I have written some simple code to reproduce the problem.

The goal is to compute 2000 transforms of size 14x14x256. I get a CUFFT_EXEC_FAILED error every time cufftExecC2R is executed, but if I change the ‘z’ dimension from 256 to 258 everything runs fine.

The test code is as follows; nothing else is executed beforehand.

#include <iostream>
#include <cuda_runtime.h>
#include <cufft.h>

cufftHandle plan;

int x = 14, y = 14, z = 256;
int fftDims[] = {x, y, z};

cufftComplex* idata;
cufftReal* odata;

// Element counts per transform: complex input is x*y*(z/2+1), real output is x*y*z.
int idataEls = x * y * (z / 2 + 1);
int odataEls = x * y * z;
int batch = 2000;

// Allocate the complex input for all batches.
cudaMalloc(&idata, idataEls * sizeof(cufftComplex) * batch);
cudaError_t cErr = cudaThreadSynchronize();
if (cErr != cudaSuccess) {
    std::cout << "Error allocating gpu memory\n";
    exit(-1);
}

// Allocate the real output for all batches.
cudaMalloc(&odata, odataEls * sizeof(cufftReal) * batch);
cErr = cudaThreadSynchronize();
if (cErr != cudaSuccess) {
    std::cout << "Error allocating gpu memory\n";
    exit(-1);
}

std::cout << "Memory used: "
          << idataEls * sizeof(cufftComplex) * batch + odataEls * sizeof(cufftReal) * batch
          << " bytes\n";

// Fill the input with a dummy byte pattern.
cudaMemset(idata, 2, idataEls * sizeof(cufftComplex) * batch);
cErr = cudaThreadSynchronize();
if (cErr != cudaSuccess) {
    std::cout << "Error with cudaMemset\n";
    exit(-1);
}

// Batched 3D C2R plan with the default (contiguous) data layout.
cufftResult fftError = cufftPlanMany(&plan, 3, fftDims, NULL, 1, 0, NULL, 1, 0, CUFFT_C2R, batch);
cErr = cudaThreadSynchronize();
if (fftError != CUFFT_SUCCESS || cErr != cudaSuccess) {
    std::cout << "Error creating gpu FFT plan\n";
    exit(-1);
}

fftError = cufftSetCompatibilityMode(plan, CUFFT_COMPATIBILITY_FFTW_ALL);
cErr = cudaThreadSynchronize();
if (fftError != CUFFT_SUCCESS || cErr != cudaSuccess) {
    std::cout << "Error setting gpu FFT plan compatibility\n";
    exit(-1);
}

// This call fails with CUFFT_EXEC_FAILED for z=256 but succeeds for z=258.
fftError = cufftExecC2R(plan, idata, odata);
cErr = cudaThreadSynchronize();
if (fftError != CUFFT_SUCCESS || cErr != cudaSuccess) {
    std::cout << "Error executing gpu FFT plan\n";
    exit(-1);
}

I am working with a GeForce GTX 470 card (1248 MB of RAM) and CUDA 4.0.

Regards.

Can you check whether an in-place transform works (after disabling the allocation for odata):

fftError = cufftExecC2R(plan, idata, (cufftReal*)idata);

Checked.

It uses half the memory, but the behavior remains the same: it fails for z=256 and works for z=258.

What if you first try batch=1 with no CUFFT_COMPATIBILITY_FFTW_ALL, then change batch back to 2000? I have iterative code in both 2D and 3D that works without problems, but I cannot see the error in your code.

I’ve tried disabling FFTW compatibility:

For batch=1 it runs fine for every size, but then it isn’t a batched operation, just a single 3D transform.
For batch=2000 the behavior remains the same, working only for z=258.

I don’t think it’s related to FFTW compatibility: with batch=1 and FFTW compatibility enabled, it also works fine for every size.

Note that the CUFFT library uses some temporary workspace, which is allocated at planning time, and its size varies with the size of the transform. For the problem size you are trying (14x14x256), the temporary space is almost as large as the input data, so altogether it fills the 1.2 GB of memory. However, you mentioned that the in-place transform still fails despite the additional free GPU memory (assuming one of the cudaMallocs was removed).
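If a newer CUFFT is available, the planner’s workspace requirement can also be queried up front. A minimal sketch using cufftEstimateMany, which belongs to the workspace-query API of later CUFFT releases and may not be available in the CUFFT shipped with CUDA 4.0:

// Ask CUFFT for a workspace estimate for the same batched 3D C2R transform,
// using the default (contiguous) layout, before committing any memory.
size_t workSize = 0;
cufftResult estErr = cufftEstimateMany(3, fftDims,
                                       NULL, 1, 0,   // default input layout
                                       NULL, 1, 0,   // default output layout
                                       CUFFT_C2R, batch, &workSize);
if (estErr == CUFFT_SUCCESS)
    std::cout << "Estimated CUFFT workspace: " << workSize << " bytes\n";

With that estimate you can compare against cudaMemGetInfo before planning instead of relying on the return code of cufftPlanMany.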

You can work around the failure by splitting the work into two half-size batches and calling the execute function twice, as follows:
cufftResult fftError = cufftPlanMany(&plan, 3, fftDims, NULL, 1, 0, NULL, 1, 0, CUFFT_C2R, batch/2);
fftError = cufftExecC2R(plan, idata, odata);
// advance idata and odata to point at the second half, then call again
fftError = cufftExecC2R(plan, idata, odata);
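A minimal self-contained sketch of that workaround, assuming the buffers and element counts (idata, odata, idataEls, odataEls, fftDims, batch) from the original test code and the default contiguous layout:

// Split-batch workaround: one half-size plan reused for both halves.
cufftHandle halfPlan;
int half = batch / 2;                                   // assumes batch is even
cufftResult err = cufftPlanMany(&halfPlan, 3, fftDims,
                                NULL, 1, 0,             // default input layout
                                NULL, 1, 0,             // default output layout
                                CUFFT_C2R, half);
if (err == CUFFT_SUCCESS) {
    // First half of the batch.
    err = cufftExecC2R(halfPlan, idata, odata);
    // Second half: advance each pointer by half a batch worth of elements.
    err = cufftExecC2R(halfPlan,
                       idata + (size_t)idataEls * half,
                       odata + (size_t)odataEls * half);
    cufftDestroy(halfPlan);
}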

In any case, you should file a bug for this to be tracked by NVIDIA.

For some of my cases the program will not run even though it appears that everything fits in GPU memory.

Does it run for some other values of batch?

That is right. It’s related to the amount of free memory. I’ve run some experiments with cudaMemGetInfo(…) and in-place transforms, and I got these results:

batch → 2000
Total memory: 1309081600 bytes
idata: 404544000 bytes
Free memory before planning: 775577600 bytes
Free memory after planning: 775577600 bytes
cufftExecC2R fails to execute, but the planning call returns CUFFT_SUCCESS instead of CUFFT_ALLOC_FAILED.

batch → 1800
Total memory: 1309081600 bytes
idata: 364089600 bytes
Free memory before planning: 816209920 bytes
Free memory after planning: 93609984 bytes
Planning with this batch size returns CUFFT_SUCCESS, and cufftExecC2R also executes fine. The temporary space is almost twice the size of idata, but it works.
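For reference, a sketch of how these numbers can be gathered: sample free memory with cudaMemGetInfo around the planning call (using the plan, fftDims, and batch variables from the test code above):

size_t freeBefore = 0, freeAfter = 0, total = 0;
cudaMemGetInfo(&freeBefore, &total);
cufftResult planErr = cufftPlanMany(&plan, 3, fftDims,
                                    NULL, 1, 0, NULL, 1, 0,
                                    CUFFT_C2R, batch);
cudaMemGetInfo(&freeAfter, &total);
std::cout << "Free memory before planning: " << freeBefore << " bytes\n"
          << "Free memory after planning:  " << freeAfter  << " bytes\n"
          << "Workspace reserved:          " << (freeBefore - freeAfter) << " bytes\n";
// When freeBefore == freeAfter the planner has reserved nothing, which is
// exactly the case where cufftExecC2R later fails with CUFFT_EXEC_FAILED.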

So it seems cufftPlanMany returns success even though it failed to allocate its workspace…

I think this solves my problem, but it looks like there is some kind of bug in cufftPlanMany.

Thank you very much for your support.

What about 2001? :) I guess there is a limit on the batch size.
Free memory before planning: 775577600 bytes
Free memory after planning: 775577600 bytes
Nothing happens here for 2000.

Right, nothing happens with batch = 2000. cufftPlanMany should return an error because the workspace is never allocated, but it returns success instead.

With batch = 2001 I got this:

idata: 404746272 bytes
Free memory before planning: 737284096 bytes
Free memory after planning: 737284096 bytes

There is not enough free space for the planner to allocate the memory it needs (roughly twice the size of idata).