Hello developers,

I am stuck on a index calculation problem and I am not able to see the source of the problem. The code works fine for blockSize = [16,16,1] or [32,16,1] or [16,16,2]

but for example it doesn’t work for blockSize = [8,8,8] or [16,4,4] or [16,8,4] etc. If somebody can help me it would be perfect!

The program calculates the FFT of a 3D volume with a ramp filter. The following code runs a complex multiplication in the frequency domain. The fftDimension is 513x484x16 and the kernelDimension is 513x1:

// Thread Block Dimensions

int tBlock_x = 8;

int tBlock_y = 8;

int tBlock_z = 8;

int blocksInX = (fftDimension.x - 1) / tBlock_x + 1;

int blocksInY = (fftDimension.y - 1) / tBlock_y + 1;

int blocksInZ = (fftDimension.z - 1) / tBlock_z + 1;

// Cuda capability < 2.0 -> Grid Dimension = 2

dim3 dimGrid = dim3(blocksInX, blocksInY*blocksInZ);

dim3 dimBlock = dim3(tBlock_x, tBlock_y, tBlock_z);

multiply_kernel <<< dimGrid, dimBlock >>> ( deviceProjectionFFT,

fftDimension,

deviceKernelFFT,

blocksInY,

1.0f/(float)blocksInY );

******* K E R N E L *******

**global**

void

multiply_kernel(cufftComplex *projFFT, int3 fftDimension, cufftComplex *kernelFFT, unsigned int Blocks_Y, float invBlocks_Y)

{

unsigned int blockIdx_z = __float2uint_rd(blockIdx.y * invBlocks_Y);

unsigned int blockIdx_y = blockIdx.y - __umul24(blockIdx_z, Blocks_Y);

unsigned int i = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;

unsigned int j = __umul24(blockIdx_y, blockDim.y) + threadIdx.y;

unsigned int k = __umul24(blockIdx_z, blockDim.z) + threadIdx.z;

if (i >= fftDimension.x || j >= fftDimension.y || k >= fftDimension.z)

return;

long int proj_idx = i + (j + k * fftDimension.y ) * fftDimension.x;

cufftComplex result;

result.x = projFFT[proj_idx].x * kernelFFT[i].x - projFFT[proj_idx].y * kernelFFT[i].y;

result.y = projFFT[proj_idx].y * kernelFFT[i].x + projFFT[proj_idx].x * kernelFFT[i].y;

projFFT[proj_idx] = result;

}

Thank you very much! ;-)