CUFFT: issue with double vs. float precision when using cufftPlanMany

Hi,

This is my first post so let me know if I have to edit to make my problem clear.
I am writing a program that has to compute hundreds of FFTs.
I am setting up the plan using the cufftPlanMany call.
I run into a problem when my BATCH is large, but it only occurs with double precision.

I was wondering if someone has experienced something similar and knows how to prevent it.
This is the smallest version of the code. It is based on the simpleCUFFT sample from the samples folder
and the examples available in the CUFFT manual PDF.
This is just a 1D FFT to be done over several batches.

The following code is the single precision version. It can easily be changed to double precision:
float -> double
cuComplex -> cuDoubleComplex
cufftComplex -> cufftDoubleComplex
CUFFT_C2C -> CUFFT_Z2Z
cufftExecC2C -> cufftExecZ2Z

I compile for a GeForce GTX 480.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <assert.h>

#include <cuda_runtime.h>
#include <cufft.h>
#include <helper_functions.h>
#include <helper_cuda.h>

#define NX 1024
#define BATCH 16
#define NRANK 1

void runTest(int argc, char **argv);

/* Based on the SDK compare function, but made for Complex */
bool sdkCompareCOMPLEX( const cuComplex* reference, const cuComplex* data,
        const unsigned int len, const float epsilon );

int main(int argc, char **argv)
{
    runTest(argc, argv);
}

void runTest(int argc, char **argv){
    cufftHandle plan;
    cufftComplex *data;
    cufftComplex *data2;
    cufftComplex *data3;
    cuComplex *h_data;
    cuComplex *h_data3;
    int n[NRANK] = {NX};
    bool bTestResult;

    int ii, ll, ind, idist;

    h_data  = (cuComplex*)malloc(sizeof(cuComplex)*NX*BATCH); 
    h_data3 = (cuComplex*)malloc(sizeof(cuComplex)*NX*BATCH); 

   /* Create cuda arrays on GPU*/
    cudaMalloc(( void** )&data,sizeof(cufftComplex)*NX*BATCH);
    if (cudaGetLastError() != cudaSuccess){
        printf("Cuda error: Failed to allocate \n");
        return;
    } 

    cudaMalloc(( void** )&data2,sizeof(cufftComplex)*NX*BATCH);
    if (cudaGetLastError() != cudaSuccess){
        printf("Cuda error: Failed to allocate \n");
        return;
    } 

    cudaMalloc(( void** )&data3,sizeof(cufftComplex)*NX*BATCH);
    if (cudaGetLastError() != cudaSuccess){
        printf("Cuda error: Failed to allocate \n");
        return;
    }

    /* Fill the data */
    idist=NX;
    for(ll=0; ll<BATCH; ll++){
    for(ii=0; ii<NX; ii++){
              ind=ll*idist + ii*1;
              h_data[ind].x=rand()/ (float) RAND_MAX;
              h_data[ind].y=rand()/ (float) RAND_MAX;
              h_data3[ind].x=rand()/ (float) RAND_MAX;
              h_data3[ind].y=rand()/ (float) RAND_MAX;
    }
    }
   

    /* Print the data */
    // ...

    /* Create a batched 1D FFT plan. */
    cufftResult ans;
    ans=cufftPlanMany (&plan,NRANK,n ,
                NULL,1,NX,
                NULL,1,NX,
                CUFFT_C2C,BATCH);

    if (ans!= CUFFT_SUCCESS){
        fprintf(stderr,"CUFFT error: Plan creation failed");
        return;
    }

    /* Copy the data */
    cudaMemcpy(data,h_data, 
                sizeof(cuComplex)*NX*BATCH,cudaMemcpyHostToDevice);

    /* Use the CUFFT plan to transform the signal out of place (data -> data2). */
    if (cufftExecC2C(plan,data,data2,CUFFT_FORWARD)!= CUFFT_SUCCESS){
       fprintf(stderr,"CUFFT error: ExecC2C Forward failed");
       return;
    }

    /* Inverse transform the signal out of place (data2 -> data3). */
    if (cufftExecC2C(plan,data2,data3,CUFFT_INVERSE)!= CUFFT_SUCCESS){
       fprintf(stderr,"CUFFT error: ExecC2C Inverse failed");
       return;
    }

    /* Copy the data back  */
    cudaMemcpy(h_data3,data3, 
                sizeof(cuComplex)*NX*BATCH,cudaMemcpyDeviceToHost);


    if (cudaDeviceSynchronize() != cudaSuccess){
       fprintf(stderr,"Cuda error: Failed to synchronize \n");
       return;
    }

    /* Print the data */
    // ...


    /* Normalize Data */
    idist=NX;
    for(ll=0; ll<BATCH; ll++){
    for(ii=0; ii<NX; ii++){
              ind=ll*idist + ii*1;
              h_data3[ind].x /= (NX);
              h_data3[ind].y /= (NX);
    }
    }

    /* check result */
    bTestResult = sdkCompareCOMPLEX((cuComplex *)h_data, (cuComplex *)h_data3, 
    NX*BATCH, 1e-5f);

    if (bTestResult==true)
    	printf("Worked well \n");

    if (bTestResult==false)
    	printf("Something went wrong \n");

    /* Destroy the CUFFT Plan. */
    cufftDestroy(plan);
    cudaFree(data);
    cudaFree(data2);
    cudaFree(data3);
    free(h_data);
    free(h_data3);
}



bool sdkCompareCOMPLEX( const cuComplex* reference, const cuComplex* data,
        const unsigned int len, const float epsilon )
{
    assert( epsilon >= 0);

    float error = 0.0;
    float ref = 0.0;
    float diff;
    float normRef;
    float normError;
    bool result;

    for( unsigned int i = 0; i < len; ++i) {

        diff = reference[i].x - data[i].x;
        error += diff * diff;
        diff = reference[i].y - data[i].y;
        error += diff * diff;
        ref += reference[i].x * reference[i].x;
        ref += reference[i].y * reference[i].y;
    }

    normRef = sqrtf(ref);
    if (fabs(ref) < 1e-7) {
        return false;
    }
    normError = sqrtf(error);
    error = normError / normRef;
    result = error < epsilon;

    return result;

}

The float version of the code still hasn’t failed me… maybe I haven’t tried all the combinations yet.

The issue presents itself with BATCH set to 16 and N=1024=2^10:
the code tells me there is an error in one of the batches,
and it is normally the same batch on every run.

I thought it had something to do with memory, but I don’t seem to have reached an allocation limit yet.
If I run with BATCH set to 8 and N=2^11, the double precision code works without errors.
Note that the amount of memory should be about the same.
But if I lower BATCH once again to 4, with N=2^12,
the code fails consistently again, always in the same batch.
So it is hard for me to predict whether cufftPlanMany is doing the right thing or not.
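
To make the "about the same memory" point concrete, here is the arithmetic as a small C sketch. buffer_bytes is a helper name made up for this post, and it only counts the buffers I allocate myself (data, data2, data3); whatever work area CUFFT allocates internally for the plan is not included.

```c
#include <stddef.h>

/* Bytes needed for one device buffer holding `batch` transforms of
   length `nx`. `elem_bytes` is the size of one complex element:
   8 for cufftComplex, 16 for cufftDoubleComplex.
   Hypothetical helper; CUFFT's internal work area is NOT counted. */
size_t buffer_bytes(size_t nx, size_t batch, size_t elem_bytes)
{
    return nx * batch * elem_bytes;
}

/* Total for the three buffers used in the test (data, data2, data3). */
size_t three_buffers_bytes(size_t nx, size_t batch, size_t elem_bytes)
{
    return 3 * buffer_bytes(nx, batch, elem_bytes);
}
```

All three configurations (N=2^10 with BATCH 16, N=2^11 with BATCH 8, N=2^12 with BATCH 4) come to 16384 complex elements, so 262144 bytes (256 KiB) per buffer in double precision and 786432 bytes for the three buffers together. The footprint of my own allocations is therefore identical in the failing and the working cases.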

How can I tell the amount of memory necessary for an FFT run?
How can I know which BATCH number is correct for a given N?

PS: I am running this with the CUDA 5.5 toolkit, but I will try to move this code to another computer with the newer 6.0 library and a better card. Results pending…