I have CUDA 5.0 and a Quadro 3000M, and I am doing a C2C forward FFT with cuFFT. I have a CPU reference implementation.
Up to 2^14 points I am fine; at 2^15 I run into trouble. I'm allocating a buffer of 100 * 32768 * sizeof(cuComplex) bytes to perform 100 FFTs of size 2^15. I fill every .x with 1 and every .y with 0.
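For reference, here is a minimal host-only sketch of the buffer sizing and test pattern described above. It is not my actual code: the `Complex` struct is a stand-in I am assuming for cuComplex (two packed floats) so that it builds without the CUDA headers, and the helper names `batch_bytes`, `make_input` and `expected_output` are made up for illustration. The expected result relies on the fact that an unnormalized forward DFT of an all-ones signal of length N is N in bin 0 and 0 in every other bin.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in for cuComplex (assumed: two packed floats).
struct Complex {
    float x;  // real part
    float y;  // imaginary part
};
static_assert(sizeof(Complex) == 2 * sizeof(float), "Complex must be two packed floats");

// Bytes needed for `batch` back-to-back transforms of `n` points each.
std::size_t batch_bytes(std::size_t n, std::size_t batch) {
    return n * batch * sizeof(Complex);
}

// Test input: every element is 1 + 0i.
std::vector<Complex> make_input(std::size_t n, std::size_t batch) {
    return std::vector<Complex>(n * batch, Complex{1.0f, 0.0f});
}

// Expected unnormalized forward DFT of the all-ones input:
// each transform has n in bin 0 and 0 in all other bins.
std::vector<Complex> expected_output(std::size_t n, std::size_t batch) {
    assert(n > 0);
    std::vector<Complex> out(n * batch, Complex{0.0f, 0.0f});
    for (std::size_t b = 0; b < batch; ++b) {
        out[b * n].x = static_cast<float>(n);
    }
    return out;
}
```

With n = 32768 and batch = 100, `batch_bytes` gives 26214400, which matches the bytes-per-buffer figure printed in the log further down.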
I've verified that the data is present on the device (if I remove my cuFFT call and copy the data back to the host, it is all there), and no errors are reported for my allocations.
printf("Details of this test\n");
printf(" FFT size chosen: %d (logn %d)\n", fft_n_elems, fft_logn);
printf(" \n");
printf(" max mem available: %d\n", global_mem_bytes);
printf(" %d buffers will have %d bytes each\n", num_buffers, bytes_per_buffer);
printf(" which is %d distinct fft's performed\n", ffts_per_buffer);

gpuDeviceInit(0); // (init call from SDK)
checkCudaErrors(cufftPlan1d(&plan, fft_n_elems, CUFFT_C2C, ffts_per_buffer));
cufftExecC2C(plan, d_in, d_in, FFT_FORWARD);
cudaDeviceSynchronize();
checkCudaErrors(cufftDestroy(plan));
Details of this test
 FFT size chosen: 32768 (logn 15)
 max mem available: 2146631680
 1 buffers will have 26214400 bytes each
 which is 100 distinct fft's performed
Function GPU Malloc Input executed: 222.608994
Function Copy input cpu->gpu executed: 5406.423004
Function GPU FFT executed: 11396.368988
GPU implementation time per = 113.963690
CUDA error at nvidia_fft_cuda.cpp:184 code=4(cudaErrorLaunchFailure) "cudaMemcpy(h_out, d_in, bytes_per_buffer, cudaMemcpyDeviceToHost)"
Function Transfer data back to host executed: 23.687988
If I remove the cuFFT call, the memcpy from the device does not fail, and I can read the data in d_in back on the host just fine. As is, I get nothing back from the memcpy…