3D CUFFT fails

Hi folks, first time poster here so I’ll try to give sufficient info.

I'm using a split-step FFT method to evaluate the ground state of a quantum system. The code runs successfully with a grid of 128x256x256 (8388608 cufftDoubleComplex elements), but I have been unable to go any higher. This was running on a 1 GB GTX 560 Ti, so we figured it might be a memory issue and upgraded to a GTX 580 with 3 GB. However, the same issue arises. The code computes a forward Z2Z FFT, multiplies by complex values, then performs an inverse Z2Z FFT. The forward FFT appears to execute successfully, but the inverse FFT returns error code 6 (CUFFT_EXEC_FAILED). I am currently stumped by this, and any insight would be greatly appreciated.

I have included some code and sample output:

Note: wfc is the "signal" on the host, and fft, buffer1, buffer2 are three allocated memory blocks on the device; gSize is the total size of the grid (i.e. 128*256*256).

...

for(step=0; step<MAX_STEPS; step++){

        local_sum = 0.0; sum = 0.0;

        cudaMemGetInfo(&avail, &total);
        used = total - avail;
        printf("9 MEMORY U/A/T: %zu / %zu / %zu\n", used, avail, total); // avail/total are size_t, so %zu

        cudaMemcpy(fft, wfc, sizeof(cufftDoubleComplex)*alloc_local, cudaMemcpyHostToDevice);

        //Complex to Complex Transform (Forward) 1
        result = cufftExecZ2Z(plan_f, fft, fft, CUFFT_FORWARD);

        cudaMemGetInfo(&avail, &total);
        used = total - avail;
        printf("10 MEMORY U/A/T: %zu / %zu / %zu\n", used, avail, total);

        isError(result, "Z2Z 1");

        scalarDiv<<<alloc_local/256,256>>>(fft, pow(gSize,0.5), fft); //Normalise
        cMult<<<alloc_local/256,256>>>(buffer1, fft, fft);            //EKp complex mult

        //Complex to Complex Transform (Inverse) 2
        result = cufftExecZ2Z(plan_b, fft, fft, CUFFT_INVERSE);
        isError(result, "Z2Z 2");

        scalarDiv<<<alloc_local/256,256>>>(fft, pow(gSize,0.5), fft); //Normalise
        cMult<<<alloc_local/256,256>>>(buffer2, fft, fft);            //EVr complex mult

        //Complex to Complex Transform (Forward) 3
        result = cufftExecZ2Z(plan_f, fft, fft, CUFFT_FORWARD);
        isError(result, "Z2Z 3");

        scalarDiv<<<alloc_local/256,256>>>(fft, pow(gSize,0.5), fft); //Normalise
        cMult<<<alloc_local/256,256>>>(buffer1, fft, fft);

        //Complex to Complex Transform (Inverse) 4
        result = cufftExecZ2Z(plan_b, fft, fft, CUFFT_INVERSE);
        isError(result, "Z2Z 4");

        scalarDiv<<<alloc_local/256,256>>>(fft, pow(gSize,0.5), fft); //Normalise

        cudaMemcpy(wfc, fft, sizeof(cufftDoubleComplex)*alloc_local, cudaMemcpyDeviceToHost);
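(isError just checks the CUFFT return code; the actual helper is not shown here, but a minimal version consistent with the output below would be:)

void isError(cufftResult result, const char* method){
        if(result != CUFFT_SUCCESS)
                printf("Error has occurred for method %s with return type %d\n", method, (int)result);
}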

Aside:

For a grid size of 256x256x256, the GTX 580 gives:

9 MEMORY U/A/T: 1695776768 / 1524989952 / 3220766720

10 MEMORY U/A/T: 1695776768 / 1524989952 / 3220766720

Error has occurred for method Z2Z 2 with return type 6

whereas the GTX 560 Ti gives:

9 MEMORY U/A/T: 1004765184 / 68517888 / 1073283072

10 MEMORY U/A/T: 1004765184 / 68517888 / 1073283072

Error has occurred for method Z2Z 2 with return type 6
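(For what it's worth, at 256^3 each cufftDoubleComplex array is 16777216 * 16 B = 256 MiB, so fft, buffer1 and buffer2 together take 768 MiB; adding workspace for the forward and inverse plans roughly accounts for the ~1.6 GB in use on the 580, and 1.5 GB is still free, so raw memory does not look like the limiting factor.)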

Thanks in advance,

Lee.

The error is probably in:

scalarDiv<<<alloc_local/256,256>>>(fft, pow(gSize,0.5), fft); //Normalise
cMult<<<alloc_local/256,256>>>(buffer1, fft, fft);            //EKp complex mult

With a grid of 128x256x256 this works: alloc_local/256 = 32768 blocks, which is within the limit. With a 256^3 grid, however, alloc_local/256 = 65536 blocks, which exceeds the maximum grid x-dimension of 65535 on your cards, so the launches fail; the pending launch error is then most likely picked up by the next CUFFT call, which reports it as return code 6 (CUFFT_EXEC_FAILED).

Could you post the two kernels?

It would also be a good idea to combine them into a single kernel: they are bandwidth-limited, and you are reading and writing the same array twice.

Thanks for the response. Here are the two kernels:

__global__ void scalarDiv(double2* in, double factor, double2* out){
        double2 result;
        int tid = blockIdx.y*gridDim.x + blockIdx.x*blockDim.x + threadIdx.x;
        result.x = in[tid].x/factor;
        result.y = in[tid].y/factor;
        out[tid] = result;
}

__global__ void cMult(double2* in1, double2* in2, double2* out){
        double2 result;
        int tid = blockIdx.y*gridDim.x + blockIdx.x*blockDim.x + threadIdx.x;
        result.x = in1[tid].x*in2[tid].x - in1[tid].y*in2[tid].y;
        result.y = in1[tid].x*in2[tid].y + in1[tid].y*in2[tid].x;
        out[tid] = result;
}

If they are exceeding the maximum allowed dimension, would the cause be my method of determining the thread ID (tid) above?

Combining them is certainly possible, thanks. I can take care of that immediately.

I think I have found the source of my problems: I was launching with a single scalar block count (alloc_local/256), not a dim3 specifying the other axes. I will give this a test run and hope for the best. Thanks again.

Your indexing seems wrong.

To handle a generic case, pass the length of the array to the kernel and check that you are still in bounds, e.g. with a grid-stride loop (note the stride is gridDim.x * blockDim.x, the total number of threads in the grid):

__global__ void scalarDiv(double2* in, double factor, double2* out, int N){
        double2 result;
        int idx = threadIdx.x + blockIdx.x * blockDim.x;
        for (int i = idx; i < N; i += gridDim.x * blockDim.x){
                result.x = in[i].x/factor;
                result.y = in[i].y/factor;
                out[i] = result;
        }
}
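The same pattern also lets you fuse the normalisation and the complex multiply into a single pass, as suggested earlier. A sketch (untested, assuming you always call cMult right after scalarDiv on the same array, as in your loop):

__global__ void scalarDivMult(double2* coef, double2* data, double factor, int N){
        int idx = threadIdx.x + blockIdx.x * blockDim.x;
        for (int i = idx; i < N; i += gridDim.x * blockDim.x){
                double2 v = data[i];
                v.x /= factor;               // normalise
                v.y /= factor;
                double2 c = coef[i];
                double2 r;
                r.x = c.x*v.x - c.y*v.y;     // complex multiply
                r.y = c.x*v.y + c.y*v.x;
                data[i] = r;
        }
}

Launched with a fixed, in-range grid, e.g. scalarDivMult<<<1024,256>>>(buffer1, fft, sqrt((double)gSize), gSize);, it reads and writes the array once instead of twice.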

Hello,

You can launch the kernels with <<<blocks,threads>>>, where blocks is a dim3 variable (bx,by,bz); on your cards each dimension is limited to 65535. If you write <<<alloc_local/256,256>>>, that is equivalent to blocks = dim3(alloc_local/256, 1, 1). You can set the values individually:

blocks.x = bx;

blocks.y = by;

blocks.z = bz;

or construct it directly with blocks = dim3(bx, by, bz), which also works.
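For example, for 256^3 elements with 256 threads per block you could split the 65536 blocks over two grid dimensions. A sketch (the index computation inside the kernel must match this layout):

dim3 threads(256, 1, 1);
dim3 blocks(256, 256, 1);  // 256*256 = 65536 blocks, each dimension within the 65535 limit

scalarDiv<<<blocks, threads>>>(fft, pow(gSize, 0.5), fft);

// and inside the kernel:
int tid = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;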

Just a note: when you run into problems like this you can debug the program quickly, before upgrading the card:

  • use CUDA safe calls to identify the exact call that fails (see the sketch below)

  • check the amount of memory available on the card after your allocations

  • use the CUDA debugger (cuda-gdb) and cuda-memcheck

I spent many hours looking for a bug in CUFFT until I realized I was doing something wrong in between the CUFFT calls.
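By "CUDA safe calls" I mean wrapping every runtime call and checking after every kernel launch. A minimal sketch:

#define CUDA_SAFE_CALL(call)                                            \
        do {                                                            \
                cudaError_t err = (call);                               \
                if (err != cudaSuccess) {                               \
                        printf("CUDA error: %s at %s:%d\n",             \
                               cudaGetErrorString(err),                 \
                               __FILE__, __LINE__);                     \
                        exit(1);                                        \
                }                                                       \
        } while (0)

// after each kernel launch:
scalarDiv<<<blocks, threads>>>(fft, factor, fft);
CUDA_SAFE_CALL(cudaGetLastError());       // catches invalid launch configurations
CUDA_SAFE_CALL(cudaDeviceSynchronize());  // catches errors raised while the kernel runs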