I’m by no means an experienced programmer and am relatively new to CUDA, so go easy on me :) I’ve gotten a lot of answers from this forum on other problems I’ve had, but this is my first post.
I’ve been working on a project for a while now, and recently decided to expand its functionality. The new functionality includes taking multiple 1D Fourier transforms using the cuFFT library, which in turn needs the complex data to be stored in an interleaved pattern (cufftComplex = float2 as far as I know). My original data is not stored in this pattern; it is basically an indexed array of size 2*N where all the real parts are stored in the first N indices and the imaginary parts in the last N. Therefore (at least for now) I have a kernel which takes this indexed array as input and outputs its corresponding cufftComplex representation, so that I can use the cuFFT library.
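To make the two layouts concrete, here is a minimal host-side sketch of the split-to-interleaved conversion described above. It is plain C++ meant as a CPU reference to compare a kernel's output against, not the kernel itself. The names `Complex2` and `make_interleaved_host` are hypothetical, and `Complex2` is only a stand-in for `cufftComplex`/`float2` so the sketch builds without the CUDA headers.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for cufftComplex / float2 (x = real, y = imaginary).
struct Complex2 {
    float x;
    float y;
};

// Convert a split layout (N real parts followed by N imaginary parts)
// into an interleaved layout (re0, im0, re1, im1, ...).
// Assumes in.size() >= 2 * N.
std::vector<Complex2> make_interleaved_host(const std::vector<float>& in,
                                            std::size_t N) {
    std::vector<Complex2> out(N);
    for (std::size_t i = 0; i < N; ++i) {
        out[i].x = in[i];      // real part from the first half
        out[i].y = in[i + N];  // imaginary part from the second half
    }
    return out;
}
```

Copying the device output back to the host and comparing it element by element against this reference is one way to tell whether the conversion step itself produces wrong values.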
However, when I run the program with the code operating on the cufftComplex datatype enabled, NaN values pop up at seemingly random places in the program. For now the cufftComplex results are not used anywhere, and the functions that manipulate them only read from (they do not write to) values that are actually used in the program. I’ve turned off all compiler optimizations.
The only logical explanation I’ve come up with is that some indices go out of bounds somewhere and start overwriting other parts of allocated memory. However, I’ve double, triple and quadruple checked all the indices and I see no indication that this could happen. In addition, if I comment out the “rearranger function” and only call cufftExec1d, NaN values still appear. The same happens if I comment out cufftExec1d and only call the rearranger function.
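When narrowing down which step first introduces the NaNs, it can help to copy a buffer back to the host after each stage and scan it. The helper below is a minimal sketch of such a scan; `first_nan_index` is a hypothetical name, and it assumes the data has already been copied into a host-side `std::vector<float>`.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Return the index of the first NaN in a host-side buffer, or -1 if none.
long first_nan_index(const std::vector<float>& data) {
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (std::isnan(data[i])) {
            return static_cast<long>(i);
        }
    }
    return -1;
}
```

Running this on the same buffer before and after each kernel or library call shows at which step, and at which offset, the first NaN appears.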
Is there something I’m missing regarding the cufftComplex datatype?
Allocation is done through:
cudaMalloc((void **)&inverse, sizeof(cufftComplex)*N);
cudaMalloc((void **)&p, sizeof(float)*2*N);
and the “rearranger function” is:
__global__ void make_interleaved(cufftComplex *out, float *in)
{
    int tw, bx, by;
    tw = threadIdx.x;
    bx = blockIdx.x; by = blockIdx.y;
    int nw, nx, ny, N;
    nw = device->nw; nx = device->nx; ny = device->ny; N = device->N;
    // One thread per complex element: flat index over (tw, bx, by).
    unsigned int index = tw + bx*nw + by*nw*nx;
    float2 num;
    num.x = in[index];      // real part from the first half
    num.y = in[index + N];  // imaginary part from the second half
    out[index] = num;
}
This kernel is executed with gridDim = (nx, ny, 1) and blockDim = ( nw, 1, 1).
N = nw*nx*ny, and the first complex number is hence given by z0 = p[0] + i*p[N].
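The index arithmetic can be checked on the host by enumerating every (tw, bx, by) combination for the launch shape gridDim = (nx, ny, 1), blockDim = (nw, 1, 1). The sketch below does that in plain C++; `indices_cover_range` is a hypothetical helper. It only tests the formula itself, under the assumption that the nw, nx, ny and N values read from `device` inside the kernel match the launch configuration and the allocation sizes.

```cpp
#include <cstddef>
#include <vector>

// Return true if index = tw + bx*nw + by*nw*nx visits every value in [0, N)
// exactly once, where N = nw*nx*ny.
bool indices_cover_range(unsigned nw, unsigned nx, unsigned ny) {
    const std::size_t N = static_cast<std::size_t>(nw) * nx * ny;
    std::vector<int> hits(N, 0);
    for (unsigned by = 0; by < ny; ++by) {
        for (unsigned bx = 0; bx < nx; ++bx) {
            for (unsigned tw = 0; tw < nw; ++tw) {
                const std::size_t index = tw
                    + static_cast<std::size_t>(bx) * nw
                    + static_cast<std::size_t>(by) * nw * nx;
                if (index >= N) {
                    return false;  // would read in[index + N] past 2*N
                }
                ++hits[index];
            }
        }
    }
    for (int h : hits) {
        if (h != 1) {
            return false;  // an element was skipped or written twice
        }
    }
    return true;
}
```

If this returns true for the dimensions in use, the formula covers [0, N) exactly once, so any out-of-bounds access would have to come from a mismatch between the values stored in `device` and the actual launch or allocation sizes, or from a different part of the program.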
As an inexperienced programmer (and forum user, for that matter), I would highly appreciate any input.