Reusing CUFFT plan yields incorrect results

Hi,

When I reuse CUFFT plans, sometimes I get incorrect results.

The following code computes the same thing 5 times, but the CUFFT_R2C results are different on each iteration.

For some values of NZ (e.g. 2, 4), the computation is always wrong.

For some values of NZ (e.g. 8, 16), the computation is always correct.

Is this a bug, or did I misunderstand something?

I’m using CUDA 4.0 (V0.2.1221) on M1060 and M2070 GPUs.
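One way to make "incorrect" concrete: CUFFT transforms are unnormalized, so an R2C transform followed by a C2R transform should return the input multiplied by NX*NY*NZ. The following is a minimal host-side sketch of that check, not part of the program below. The helper name `max_roundtrip_error` is hypothetical, and it assumes the data is stored as float (which is what `cufftReal` is) in rows of `row_len` values, of which only the first `nz` are real data.

```c
#include <assert.h>
#include <stddef.h>

/* Largest absolute difference between roundtrip[i] / (nx*ny*nz) and
 * original[i]. Only the first nz values of each padded row are compared;
 * the padding slots are ignored. (Hypothetical helper, for illustration.) */
float max_roundtrip_error(const float *original, const float *roundtrip,
                          size_t nx, size_t ny, size_t nz, size_t row_len) {
    assert(nz > 0 && row_len >= nz);
    const float scale = (float)(nx * ny * nz);
    float worst = 0.0f;
    for (size_t row = 0; row < nx * ny; row++) {
        for (size_t z = 0; z < nz; z++) {
            size_t i = row * row_len + z;
            float err = roundtrip[i] / scale - original[i];
            if (err < 0.0f) err = -err;   /* absolute value */
            if (err > worst) worst = err;
        }
    }
    return worst;
}
```

To use this with the program below, the original input would have to be kept in a second array, because `print_device()` overwrites `host` with the device contents.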

#include <stdio.h>
#include <cuda_runtime.h>
#include "cufft.h"

const int NX = 2;
const int NY = 2;
const int NZ = 4;
const int N = NX * NY * (NZ/2+1);

cufftReal host[NX][NY][NZ+2];
cufftComplex *device;

void print_host() {
    for (int x=0; x<NX; x++) {
        for (int y=0; y<NY; y++) {
            for (int z=0; z<NZ+2; z++) {
                printf("%g ", host[x][y][z]);
            }
            printf("\n");
        }
        printf("\n");
    }
    printf("\n");
}

void print_device() {
    // Copy the device buffer back into host and print it
    cudaMemcpy(**host, device, sizeof(cufftComplex)*N, cudaMemcpyDeviceToHost);
    print_host();
}

int main() {
    // Create plans
    cufftHandle plan_r2c, plan_c2r;
    cufftPlan3d(&plan_r2c, NX, NY, NZ, CUFFT_R2C);
    cufftPlan3d(&plan_c2r, NX, NY, NZ, CUFFT_C2R);

    for (int iter=0; iter<5; iter++) {
        // Input data
        for (int x=0; x<NX; x++) {
            for (int y=0; y<NY; y++) {
                for (int z=0; z<NZ+2; z++) {
                    // z == NZ and NZ+1 are the padding required by CUFFT_R2C.
                    // It shouldn't matter what values I put there.
                    host[x][y][z] = (z<NZ) ? x+y+z : 1.E30;
                }
            }
        }
        print_host();

        // Copy to device
        cudaMalloc((void**)& device, sizeof(cufftComplex)*N);
        cudaMemcpy(device, **host, sizeof(cufftComplex)*N, cudaMemcpyHostToDevice);

        // In-place transform
        cufftExecR2C(plan_r2c, (cufftReal*)device, device);
        print_device();

        cufftExecC2R(plan_c2r, device, (cufftReal*)device);
        print_device();

        printf("-----------------------\n\n");
        cudaFree(device);
    }

    // Free up resources
    cufftDestroy(plan_r2c);
    cufftDestroy(plan_c2r);
}
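The padding arithmetic that the program relies on can be written out separately. This is only a sketch in plain C (it also compiles as host code in a CUDA file), and the function names are hypothetical. It assumes the in-place R2C layout in which the last dimension holds NZ/2+1 complex values, so each row of real input needs 2*(NZ/2+1) slots.

```c
#include <assert.h>
#include <stddef.h>

/* Number of complex values along the last dimension of an R2C result. */
size_t r2c_complex_len(size_t nz) {
    assert(nz > 0);
    return nz / 2 + 1;
}

/* Number of real slots per row for an in-place transform:
 * each complex value occupies two reals. */
size_t r2c_padded_real_len(size_t nz) {
    return 2 * r2c_complex_len(nz);
}

/* Flat index of real element (x, y, z) in the padded in-place buffer. */
size_t r2c_real_index(size_t x, size_t y, size_t z, size_t ny, size_t nz) {
    assert(z < nz);
    return (x * ny + y) * r2c_padded_real_len(nz) + z;
}
```

For NZ = 4 this gives 3 complex values and 6 real slots per row, which matches the `NZ+2` used for `host`. For odd NZ the padded length is NZ+1 rather than NZ+2, so `NZ+2` is only the right row length when NZ is even, as it is for all the values tried above.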