I have cuda 5.0, a quadro3000M and am doing C2C forward FFT with cuFFT. I have a CPU reference implementation.
Up to 2^14, I am fine. 2^15, I have trouble. I’m allocating a buffer of 100 * 32768 * sizeof(cuComplex) to perform 100 2^15 FFTs. I fill all the .x’s with 1, all the .y with 0.
I’ve verified the data on the device is present (if I remove my cuFFT call and pull the data back to host I see it is all there.)
AND, there are no errors with my allocations reported.
printf("Details of this test\n");
printf(" FFT size chosen: %d (logn %d)\n", fft_n_elems, fft_logn);
printf(" \n");
printf(" max mem available: %d\n", global_mem_bytes);
printf(" %d buffers will have %d bytes each\n", num_buffers, bytes_per_buffer);
printf(" which is %d distinct fft's performed\n", ffts_per_buffer);
gpuDeviceInit(0); //(init call from SDK)
checkCudaErrors(cufftPlan1d(&plan, fft_n_elems, CUFFT_C2C, ffts_per_buffer));
cufftExecC2C(plan, d_in, d_in, FFT_FORWARD);
cudaDeviceSynchronize();
checkCudaErrors(cufftDestroy(plan));
output
Details of this test
FFT size chosen: 32768 (logn 15)
max mem available: 2146631680
1 buffers will have 26214400 bytes each
which is 100 distinct fft's performed
Function GPU Malloc Input executed: 222.608994
Function Copy input cpu->gpu executed: 5406.423004
Function GPU FFT executed: 11396.368988
GPU implementation time per = 113.963690
CUDA error at nvidia_fft_cuda.cpp:184 code=4(cudaErrorLaunchFailure) "cudaMemcpy(h_out, d_in, bytes_per_buffer, cudaMemcpyDeviceToHost)"
Function Transfer data back to host executed: 23.687988
If I remove the cufft use, the memcpy off the device does not fail, and I can read the data from d_in moved back to host just fine. As is, I return nothing back from the memcpy…
Does the code check the status of every single CUDA and CUFFT API call? Is it possible that there is an out-of-memory condition further upstream in the code that is currently not caught, that is, an unchecked cudaMalloc() call? This could cause the kernel to fail due to out-of-bounds access. Since kernel launches are asynchronous by default, an error inside the kernel would be reported at the next synchronous CUDA API call, which is cudaMemcpy() in this case.
I don’t use CUFFT; I believe it also makes some memory allocations internally. I would expect CUFFT to report an out-of-memory condition through an appropriate error status, rather than crashing the actual kernels kicked off by a CUFFT API call. So it is important to also check the status of every CUFFT API call.
I run checkCudaErrors (from the SDK) on all the cudaMallocs and cudamemcpy’s, also on all the cuFFT calls (planning and exec). I have no errors during anything until I run a memory copy back to the host at the end. I’m only allocating about 26MB of space for my test (100327688byte-per-point).
I’m kind of suspecting drivers just because I’m forced into using RedHat 6.1. I have the 304.54 driver, but i even tried backing out to 295.33 and cuda 4 to no avail.
I’ll condense my code into a simple test sample and post it, maybe someone can run it quick.
I must just be doing something dumb with my pointer arithmetic, but I don’t see it.
Here’s the critical section of my dumb downed code. can I not increment the pointers as such with cuFFT? It only appears to start failing at large sizes. I get erratic results even on small FFT’s at the second iteration now that I look at it closer.
// create GPU buffers
if( cudaMalloc((void**)&d_in,bytes_per_buffer) ) { printf("there was an error in malloc
"); }
// Move initialized data to GPU
if ( cudaMemcpy(d_in, h_in, bytes_per_buffer, cudaMemcpyHostToDevice) ) { printf("there was an error in memcpy to GPU
"); }
cudaDeviceSynchronize();
// Planning
if ( cufftPlan1d(&plan, fft_n_elems, CUFFT_C2C, 1) ) { printf("there was an error in cufftplan
"); }
// testing
cufftResult_t err;
for (int i=0; i<100; i++)
{
err = cufftExecC2C(plan, d_in+fft_n_elems*i, d_in+fft_n_elems*i, CUFFT_FORWARD);
if (err) { printf("there was an error during fft %d
", i); exit (0); }
cudaDeviceSynchronize();
}
//Retrieve result, which goes to h_out regardless..
if ( cudaMemcpy(h_out, d_in, bytes_per_buffer, cudaMemcpyDeviceToHost) ) { printf("there was an error in memcpy back to host
"); } // get result
cudaDeviceSynchronize();