cuFFT 2^15+ issues?

I have cuda 5.0, a quadro3000M and am doing C2C forward FFT with cuFFT. I have a CPU reference implementation.

Up to 2^14, I am fine. 2^15, I have trouble. I’m allocating a buffer of 100 * 32768 * sizeof(cuComplex) to perform 100 2^15 FFTs. I fill all the .x’s with 1, all the .y with 0.
I’ve verified the data on the device is present (if I remove my cuFFT call and pull the data back to host I see it is all there.)

AND, there are no errors with my allocations reported.

printf("Details of this test\n");
    	printf("  FFT size chosen: %d (logn %d)\n", fft_n_elems, fft_logn);
    	printf("  \n");
    	printf("  max mem available: %d\n", global_mem_bytes);
    	printf("  %d buffers will have %d bytes each\n", num_buffers, bytes_per_buffer);
    	printf("  which is %d distinct fft's performed\n", ffts_per_buffer);

 gpuDeviceInit(0); //(init call from SDK)
checkCudaErrors(cufftPlan1d(&plan, fft_n_elems, CUFFT_C2C, ffts_per_buffer));
cufftExecC2C(plan, d_in, d_in, FFT_FORWARD);


Details of this test
  FFT size chosen: 32768 (logn 15)
  max mem available: 2146631680
  1 buffers will have 26214400 bytes each
  which is 100 distinct fft's performed
Function GPU Malloc Input executed: 222.608994
Function Copy input cpu->gpu executed: 5406.423004
Function GPU FFT executed: 11396.368988
GPU implementation time per = 113.963690
CUDA error at nvidia_fft_cuda.cpp:184 code=4(cudaErrorLaunchFailure) "cudaMemcpy(h_out, d_in, bytes_per_buffer, cudaMemcpyDeviceToHost)" 
Function Transfer data back to host executed: 23.687988

If I remove the cufft use, the memcpy off the device does not fail, and I can read the data from d_in moved back to host just fine. As is, I return nothing back from the memcpy…

Does the code check the status of every single CUDA and CUFFT API call? Is it possible that there is an out-of-memory condition further upstream in the code that is currently not caught, that is, an unchecked cudaMalloc() call? This could cause the kernel to fail due to out-of-bounds access. Since kernel launches are asynchronous by default, an error inside the kernel would be reported at the next synchronous CUDA API call, which is cudaMemcpy() in this case.

I don’t use CUFFT; I believe it also makes some memory allocations internally. I would expect CUFFT to report an out-of-memory condition through an appropriate error status, rather than crashing the actual kernels kicked off by a CUFFT API call. So it is important to also check the status of every CUFFT API call.

I run checkCudaErrors (from the SDK) on all the cudaMallocs and cudamemcpy’s, also on all the cuFFT calls (planning and exec). I have no errors during anything until I run a memory copy back to the host at the end. I’m only allocating about 26MB of space for my test (100327688byte-per-point).

I’m kind of suspecting drivers just because I’m forced into using RedHat 6.1. I have the 304.54 driver, but i even tried backing out to 295.33 and cuda 4 to no avail.

I’ll condense my code into a simple test sample and post it, maybe someone can run it quick.

I must just be doing something dumb with my pointer arithmetic, but I don’t see it.

Here’s the critical section of my dumb downed code. can I not increment the pointers as such with cuFFT? It only appears to start failing at large sizes. I get erratic results even on small FFT’s at the second iteration now that I look at it closer.

// create GPU buffers
	if( cudaMalloc((void**)&d_in,bytes_per_buffer) ) { printf("there was an error in malloc 
"); }

	// Move initialized data to GPU
	if ( cudaMemcpy(d_in, h_in, bytes_per_buffer, cudaMemcpyHostToDevice) ) { printf("there was an error in memcpy to GPU
"); }

	// Planning
	if ( cufftPlan1d(&plan, fft_n_elems, CUFFT_C2C, 1) ) { printf("there was an error in cufftplan 
"); }

	// testing
	cufftResult_t err;
	for (int i=0; i<100; i++)
		err = cufftExecC2C(plan, d_in+fft_n_elems*i, d_in+fft_n_elems*i, CUFFT_FORWARD);
		if (err) { printf("there was an error during fft %d
", i); exit (0); }

	//Retrieve result, which goes to h_out regardless..
	if ( cudaMemcpy(h_out, d_in, bytes_per_buffer, cudaMemcpyDeviceToHost) ) { printf("there was an error in memcpy back to host
"); } // get result