CUDA bug for 0 thread size: launch a kernel with 0 threads and CUDA FFT will fail

I believe I have uncovered a bug with CUDA / CUDA FFT. It turns out if you launch a kernel with 0 threads, the CUDA FFT routine will fail.

For a variety of reasons, I typically launch a kernel whose thread count is an integral product of block and grid sizes, and then launch whatever doesn’t fit as a second kernel with a ‘residual’ size. That residual size is zero often enough, namely whenever the product of block and grid sizes equals my problem size.

This has worked for me so far with no complaints about the empty kernel calls until I tried launching an FFT. An empty kernel call ruins the FFT.
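To make the launch pattern described above concrete, here is a minimal host-side sketch of how the full and residual sizes could be computed. The names `LaunchSplit` and `split_launch` are hypothetical and are not part of CUDA or of my actual code; the sketch only shows the arithmetic.

```cpp
#include <cassert>

// Hypothetical helper type: one full launch plus an optional residual launch.
struct LaunchSplit {
    unsigned full_blocks;      // grid size for the full launch
    unsigned residual_threads; // block size for the residual launch (may be 0)
    unsigned residual_offset;  // index of the first element left for the residual launch
};

// Split n elements into whole blocks of block_size threads plus a remainder.
inline LaunchSplit split_launch(unsigned n, unsigned block_size)
{
    assert(block_size > 0);    // avoid division by zero
    LaunchSplit s;
    s.full_blocks      = n / block_size;
    s.residual_threads = n % block_size;
    s.residual_offset  = s.full_blocks * block_size;
    return s;
}
```

With `n = 256` and `block_size = 256` this gives one full block and a residual of zero threads, which is exactly the case that triggers the problem. A caller that wants to sidestep the issue could skip the second launch whenever `residual_threads` is 0.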

Here is a simple code example of my fft.cu.

[codebox]
#include <stdio.h>
#include <math_functions.h>
#include "cublas.h"
#include "cufft.h"
#include "cutil.h"

#define NX 256
#define BATCH 1

/* Set each complex element (stored as interleaved floats) to 1 + 0i. */
__global__ void vec_set(float *v, unsigned off)
{
    unsigned j = threadIdx.x + blockIdx.x * blockDim.x + off ;
    unsigned j2 = j << 1 ;
    v[j2] = 1.0 ;
    v[j2+1] = 0.0 ;
}

int main(int argc, char *argv[])
{
    char ch ;

    cublasInit() ;

    dim3 block ;
    dim3 grid ;
    dim3 residual ;

    block.x = NX ;
    block.y = BATCH ;
    grid.x = 1 ;
    residual.x = 0 ;        /* nothing left over: NX == block.x * grid.x */
    residual.y = block.y ;

    cufftHandle plan ;
    cufftComplex *data ;
    cudaMalloc((void**)&data, sizeof(cufftComplex)*NX*BATCH);

    /* Create a 1D FFT plan. */
    cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH);

    /* run a kernel */
    vec_set<<<grid,block>>>((float *)data, 0) ;

    /* The next line breaks the fft code. It launches a kernel with 0 threads */
    vec_set<<<1,residual>>>((float *)data, block.x * grid.x) ;

    /* Use the CUFFT plan to transform the signal in place. */
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);

    /* Inverse transform the signal in place. */
    cufftExecC2C(plan, data, data, CUFFT_INVERSE);

    cublasShutdown() ;

    printf("Goodbye Cruel World\n") ;
    ch = getchar() ;
    return 0 ;
}
[/codebox]

Launching a kernel with zero block size or grid size should cause nothing to get executed and certainly shouldn’t break the FFT code.

It’s a bug!

Here is the error output:

cufft: ERROR: D:/Bld/rel/gpgpu/toolkit/r2.1/cufft/src/execute.cu, line 1070

cufft: ERROR: CUFFT_EXEC_FAILED

cufft: ERROR: D:/Bld/rel/gpgpu/toolkit/r2.1/cufft/src/cufft.cu, line 151

cufft: ERROR: CUFFT_EXEC_FAILED

Check the return values of cudaGetLastError() and cudaThreadSynchronize() immediately after the second kernel launch. I am unwilling to say “bug” quite yet.
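A minimal sketch of that check follows. The helpers `report_launch` and `check_zero_thread_launch` are hypothetical names introduced for illustration, and `cudaDeviceSynchronize()` is used as the current replacement for the deprecated `cudaThreadSynchronize()`. One possible explanation, not confirmed here, is that the zero-thread launch is rejected with a launch error that stays pending until something reads it, and the next library call then trips over it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same shape as the kernel in the post: writes 1 + 0i into interleaved complex data.
__global__ void vec_set(float *v, unsigned off)
{
    unsigned j = threadIdx.x + blockIdx.x * blockDim.x + off;
    v[2 * j]     = 1.0f;
    v[2 * j + 1] = 0.0f;
}

// Hypothetical helper: print and return the status of the most recent launch.
static cudaError_t report_launch(const char *label)
{
    cudaError_t launch_err = cudaGetLastError();      // launch-time error; reading it resets it
    cudaError_t sync_err   = cudaDeviceSynchronize(); // error raised while the kernel ran
    printf("%s: launch=%s, sync=%s\n", label,
           cudaGetErrorString(launch_err), cudaGetErrorString(sync_err));
    return launch_err != cudaSuccess ? launch_err : sync_err;
}

// Hypothetical helper: run a full launch, then a zero-thread launch, on a
// device buffer of 2 * nx floats. nx must not exceed the device's maximum
// block size. Returns the status of the zero-thread launch.
cudaError_t check_zero_thread_launch(float *data, unsigned nx)
{
    vec_set<<<1, nx>>>(data, 0);
    report_launch("full launch");

    dim3 residual(0, 1, 1);                           // zero threads, as in the post
    vec_set<<<1, residual>>>(data, nx);
    return report_launch("zero-thread launch");
}
```

Calling `check_zero_thread_launch(data, 256)` on a buffer allocated with `cudaMalloc` prints the status of both launches. A non-success code from the second one (an invalid launch configuration is normally reported as `cudaErrorInvalidConfiguration`) would mean the empty launch itself fails, rather than silently doing nothing.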