Hello developers,
I am stuck on an index calculation problem and cannot find its source. The code works fine for blockSize = [16,16,1], [32,16,1] or [16,16,2],
but it does not work for blockSize = [8,8,8], [16,4,4] or [16,8,4], for example. Any help would be much appreciated!
The program applies a ramp filter to a 3D volume in the frequency domain using an FFT. The following code performs the complex multiplication in the frequency domain. The fftDimension is 513x484x16 and the kernelDimension is 513x1:
// Thread Block Dimensions
int tBlock_x = 8;
int tBlock_y = 8;
int tBlock_z = 8;
int blocksInX = (fftDimension.x - 1) / tBlock_x + 1;
int blocksInY = (fftDimension.y - 1) / tBlock_y + 1;
int blocksInZ = (fftDimension.z - 1) / tBlock_z + 1;
// Compute capability < 2.0: grids are limited to 2 dimensions, so the z blocks are folded into grid.y
dim3 dimGrid = dim3(blocksInX, blocksInY*blocksInZ);
dim3 dimBlock = dim3(tBlock_x, tBlock_y, tBlock_z);
multiply_kernel <<< dimGrid, dimBlock >>> ( deviceProjectionFFT,
                                            fftDimension,
                                            deviceKernelFFT,
                                            blocksInY,
                                            1.0f/(float)blocksInY );
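
As a sanity check on the launch configuration above, the block counts can be reproduced on the host. This is only a sketch in plain host-side C++ (it should also compile inside a .cu file). The names ceil_div, FoldedGrid, folded_grid and fits_cc1x are hypothetical helpers introduced here, not part of the program. The limits in fits_cc1x are the compute capability 1.x limits as I understand them: 512 threads per block, block z dimension up to 64, and 65535 blocks per grid dimension.

```cpp
#include <cassert>

// Hypothetical helper: ceiling division, same arithmetic as (n - 1) / t + 1.
inline int ceil_div(int n, int t) {
    assert(n > 0 && t > 0);
    return (n - 1) / t + 1;
}

// Hypothetical plain struct standing in for the launch configuration.
struct FoldedGrid {
    int blocksInX, blocksInY, blocksInZ;
    int gridX, gridY;  // gridY = blocksInY * blocksInZ (z folded into y)
};

// Computes the block counts for a volume of nx * ny * nz elements and a
// thread block of tx * ty * tz threads, folding the z blocks into grid.y.
inline FoldedGrid folded_grid(int nx, int ny, int nz, int tx, int ty, int tz) {
    FoldedGrid g;
    g.blocksInX = ceil_div(nx, tx);
    g.blocksInY = ceil_div(ny, ty);
    g.blocksInZ = ceil_div(nz, tz);
    g.gridX = g.blocksInX;
    g.gridY = g.blocksInY * g.blocksInZ;
    return g;
}

// Assumed compute capability 1.x limits: at most 512 threads per block,
// block z dimension at most 64, at most 65535 blocks per grid dimension.
inline bool fits_cc1x(const FoldedGrid& g, int tx, int ty, int tz) {
    return tx * ty * tz <= 512 && tz <= 64 &&
           g.gridX <= 65535 && g.gridY <= 65535;
}
```

For the 513x484x16 volume this gives a grid of (65, 122) for [8,8,8], (33, 496) for [16,16,1] and (33, 484) for [16,4,4], all within the assumed limits. The working block sizes all produce blocksInY = 31, while the failing ones produce blocksInY = 61 or 121.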
******* K E R N E L *******
__global__
void
multiply_kernel(cufftComplex *projFFT, int3 fftDimension, cufftComplex *kernelFFT, unsigned int Blocks_Y, float invBlocks_Y)
{
    // Recover the (y, z) block indices that were folded into blockIdx.y
    unsigned int blockIdx_z = __float2uint_rd(blockIdx.y * invBlocks_Y);
    unsigned int blockIdx_y = blockIdx.y - __umul24(blockIdx_z, Blocks_Y);
    // Global element coordinates of this thread
    unsigned int i = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
    unsigned int j = __umul24(blockIdx_y, blockDim.y) + threadIdx.y;
    unsigned int k = __umul24(blockIdx_z, blockDim.z) + threadIdx.z;
    // Skip threads that fall outside the volume
    if (i >= fftDimension.x || j >= fftDimension.y || k >= fftDimension.z)
        return;
    long int proj_idx = i + (j + k * fftDimension.y ) * fftDimension.x;
    // Complex multiplication with the 1D kernel spectrum along x
    cufftComplex result;
    result.x = projFFT[proj_idx].x * kernelFFT[i].x - projFFT[proj_idx].y * kernelFFT[i].y;
    result.y = projFFT[proj_idx].y * kernelFFT[i].x + projFFT[proj_idx].x * kernelFFT[i].y;
    projFFT[proj_idx] = result;
}
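
One thing that may be worth checking is the step that recovers blockIdx_z by multiplying blockIdx.y with a single-precision reciprocal and rounding down. The sketch below emulates that decode on the host in plain C++ and compares it with exact integer division. It assumes the GPU multiply rounds like an IEEE single-precision multiply on the host, which I have not verified for every device. The names BlockYZ, decode_float, decode_int and first_mismatch are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical pair of unfolded block indices.
struct BlockYZ { unsigned y, z; };

// Mirrors the kernel's decode: z = floor(by * (1.0f / blocksInY)),
// y = by - z * blocksInY, with the product rounded to single precision.
inline BlockYZ decode_float(unsigned by, unsigned blocksInY) {
    const float inv = 1.0f / static_cast<float>(blocksInY);
    const float prod = static_cast<float>(by) * inv;
    const unsigned z = static_cast<unsigned>(std::floor(prod));
    return { by - z * blocksInY, z };
}

// Reference decode using exact integer division and remainder.
inline BlockYZ decode_int(unsigned by, unsigned blocksInY) {
    assert(blocksInY > 0);
    return { by % blocksInY, by / blocksInY };
}

// Returns the first folded index where the two decodes disagree,
// or -1 if they agree over the whole folded grid.y range.
inline int first_mismatch(unsigned blocksInY, unsigned blocksInZ) {
    for (unsigned by = 0; by < blocksInY * blocksInZ; ++by) {
        const BlockYZ f = decode_float(by, blocksInY);
        const BlockYZ i = decode_int(by, blocksInY);
        if (f.y != i.y || f.z != i.z) return static_cast<int>(by);
    }
    return -1;
}
```

By my arithmetic, 1.0f/61 rounds slightly below 1/61, so 61 * (1.0f/61) lands just under 1 and rounds down to 0. The decode then yields blockIdx_z = 0 and blockIdx_y = 61, which puts j at 488 or more, so the whole block returns early without touching its elements. The same happens for 121, whereas 31 happens to round to exactly 1. That would match the pattern of working and failing block sizes. If this is the cause, a possible fix in the kernel would be blockIdx.y / Blocks_Y and blockIdx.y % Blocks_Y, at the cost of one integer division per thread.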
Thank you very much! ;-)