Hello,
I am having an issue indexing into a 3D volume… I have tried everything and the behavior I am experiencing is surprising (to me at least).
The code below is a stripped down version of what I am working with – I have removed any code that is not relevant to the issue.
Also… the indexing into the volume works under emulation mode, but not on the hardware ::shrug::
I am at a complete loss at this point.
Kernel looks something like this:
__global__
void kernel_1 (float *dev_vol, unsigned int Blocks_Y, float invBlocks_Y)
{
unsigned int blockIdx_z = __float2uint_rd(blockIdx.y * invBlocks_Y);
unsigned int blockIdx_y = blockIdx.y - __umul24(blockIdx_z, Blocks_Y);
unsigned int i = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
unsigned int j = __umul24(blockIdx_y, blockDim.y) + threadIdx.y;
unsigned int k = __umul24(blockIdx_z, blockDim.z) + threadIdx.z;
/* // (Method 1) Index per block
long int one_block = blockDim.x * blockDim.y * blockDim.z;
long int offset_x = blockIdx.x * (one_block);
long long int offset_y = blockIdx.y * (one_block * gridDim.x);
long int block_index = threadIdx.x + (blockDim.x * threadIdx.y) + (blockDim.x * blockDim.y * threadIdx.z);
long int vol_idx = offset_x + offset_y + block_index;
*/
// (Method 2) Index per grid
long int vol_idx = i + ( j*(gridDim.x)*blockDim.x ) + ( k*(gridDim.x)*blockDim.x*(Blocks_Y)*blockDim.y );
// Test indexing
dev_vol[vol_idx] = (float)vol_idx;
}
Calling stub function looks something like this:
int CUDA_stub (Volume *vol, myOptions *options)
{
// Thead Block Dimensions
int tBlock_x = 5;
int tBlock_y = 4;
int tBlock_z = 4;
// Used to build "3D Grid"
int blocksInX; dim3 dimGrid;
int blocksInY; dim3 dimBlock;
int blocksInZ;
FILE *debug_file;
// CUDA device pointers
float* dev_vol; // Holds voxels on device
// First, we need to allocate memory on the host device
// for the 3D volume of voxels that will hold our reconstruction.
cudaMalloc( (void**)&dev_vol, vol->dim[0] * vol->dim[1] * vol->dim[2] * sizeof(float) );
cudaMemset( (void *) dev_vol, 0, vol->dim[0]*vol->dim[1]*vol->dim[2]*sizeof(float));
// Each element in the volume (each voxel) gets 1 thread
blocksInX = (vol->dim[0]+tBlock_x-1)/tBlock_x;
blocksInY = (vol->dim[1]+tBlock_y-1)/tBlock_y;
blocksInZ = (vol->dim[2]+tBlock_z-1)/tBlock_z;
dimGrid = dim3(blocksInX, blocksInY*blocksInZ);
dimBlock = dim3(tBlock_x, tBlock_y, tBlock_z);
kernel_1<<< dimGrid, dimBlock >>>(dev_vol, blocksInY, 1.0f/(float)blocksInY);
// Copy reconstructed volume from device to host
cudaMemcpy( vol->img, dev_vol, vol->dim[0] * vol->dim[1] * vol->dim[2] * sizeof(float), cudaMemcpyDeviceToHost );
// Debug
printf("Writing debug file...\n");
debug_file = fopen("debug_out.txt", "a+");
int i;
float* fimg = (float *)vol->img;
for(i=0; i < vol->dim[0]*vol->dim[1]*vol->dim[2]; i++)
{
if(fimg[i] != 0)
fprintf(debug_file, "[%i]\t%f\n",i , fimg[i]);
}
// Cleanup
cudaFree( dev_vol );
return 0;
}
For some reason, vol_idx always returns zero for me.
However, the individual parts of the expression seem to return the correct values when evaluated independently… in other words:
long int vol_idx = i;
returns the expected index
long int vol_idx = ( j*(gridDim.x)*blockDim.x );
also returns the expected offset
long int vol_idx = ( k*(gridDim.x)*blockDim.x*(Blocks_Y)*blockDim.y );
also seems to return the expected offset.
When I add them all together to get the absolute index, however, all I get is zeros for each and every combination of i, j, & k.
The commented out scheme (in the kernel) that indexes block by block exhibits similar behavior.
Can anybody see where I have gone wrong?
Best regards,
tshack