Index calculation problem! GridSize or/and BlockSize?!?!

Hello developers,

I am stuck on a index calculation problem and I am not able to see the source of the problem. The code works fine for blockSize = [16,16,1] or [32,16,1] or [16,16,2]

but for example it doesn’t work for blockSize = [8,8,8] or [16,4,4] or [16,8,4] etc. If somebody can help me it would be perfect!

The program calculates the FFT of a 3D volume with a ramp filter. The following code runs a complex multiplication in the frequency domain. The fftDimension is 513x484x16 and the kernelDimension is 513x1:

// Thread Block Dimensions
int tBlock_x = 8;
int tBlock_y = 8;
int tBlock_z = 8;
int blocksInX = (fftDimension.x - 1) / tBlock_x + 1;
int blocksInY = (fftDimension.y - 1) / tBlock_y + 1;
int blocksInZ = (fftDimension.z - 1) / tBlock_z + 1;

// Cuda capability < 2.0 → Grid Dimension = 2
dim3 dimGrid = dim3(blocksInX, blocksInY*blocksInZ);
dim3 dimBlock = dim3(tBlock_x, tBlock_y, tBlock_z);

multiply_kernel <<< dimGrid, dimBlock >>> ( deviceProjectionFFT,
fftDimension,
deviceKernelFFT,
blocksInY,
1.0f/(float)blocksInY );

******* K E R N E L *******

global
void
multiply_kernel(cufftComplex *projFFT, int3 fftDimension, cufftComplex *kernelFFT, unsigned int Blocks_Y, float invBlocks_Y)
{
unsigned int blockIdx_z = __float2uint_rd(blockIdx.y * invBlocks_Y);
unsigned int blockIdx_y = blockIdx.y - __umul24(blockIdx_z, Blocks_Y);
unsigned int i = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
unsigned int j = __umul24(blockIdx_y, blockDim.y) + threadIdx.y;
unsigned int k = __umul24(blockIdx_z, blockDim.z) + threadIdx.z;

if (i >= fftDimension.x || j >= fftDimension.y || k >= fftDimension.z)
return;

long int proj_idx = i + (j + k * fftDimension.y ) * fftDimension.x;

cufftComplex result;
result.x = projFFT[proj_idx].x * kernelFFT[i].x - projFFT[proj_idx].y * kernelFFT[i].y;
result.y = projFFT[proj_idx].y * kernelFFT[i].x + projFFT[proj_idx].x * kernelFFT[i].y;
projFFT[proj_idx] = result;
}

Thank you very much! ;-)

Does it work if you replace your fragile construct for approximate integer division

__float2uint_rd(blockIdx.y * invBlocks_Y)

by a real integer div?

tera, thank very much!

Your tip solved the problem.

Cheers

BTW tera, I was thinking about your fix, but I do not see the reason why it should change the result. Could you share your knowledge with me?

Thanks in advance!

To be precise, the construct is not fragile, but just wrong, because it rounds to the closest integer. While some rounding is needed to make up for the approximate nature of the floating point inverse, you only want to round upwards if you are already within invBlocks_Y/2 of the next integer.

I.e. you want this expression:

unsigned int blockIdx_z = int((blockIdx.y +0.5f) * invBlocks_Y);