Indexing into a 3D volume: simply can't find my mistake.


I am having an issue indexing into a 3D volume… I have tried everything and the behavior I am experiencing is surprising (to me at least).

The code below is a stripped-down version of what I am working with; I have removed any code that is not relevant to the issue.

Also… the indexing into the volume works in emulation mode, but not on the hardware ::shrug::

I am at a complete loss at this point.

Kernel looks something like this:


__global__ void kernel_1 (float *dev_vol, unsigned int Blocks_Y, float invBlocks_Y)
{
	// Decompose the flattened 2D grid back into 3D block coordinates
	unsigned int blockIdx_z = __float2uint_rd(blockIdx.y * invBlocks_Y);
	unsigned int blockIdx_y = blockIdx.y - __umul24(blockIdx_z, Blocks_Y);

	unsigned int i = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
	unsigned int j = __umul24(blockIdx_y, blockDim.y) + threadIdx.y;
	unsigned int k = __umul24(blockIdx_z, blockDim.z) + threadIdx.z;

/*	// (Method 1) Index per block
	long int one_block   = blockDim.x * blockDim.y * blockDim.z;
	long int offset_x    = blockIdx.x * one_block;
	long int offset_y    = blockIdx.y * (one_block * gridDim.x);
	long int block_index = threadIdx.x + (blockDim.x * threadIdx.y) + (blockDim.x * blockDim.y * threadIdx.z);
	long int vol_idx     = offset_x + offset_y + block_index;
*/
	// (Method 2) Index per grid
	long int vol_idx = i + ( j * gridDim.x * blockDim.x ) + ( k * gridDim.x * blockDim.x * Blocks_Y * blockDim.y );

	// Test indexing
	dev_vol[vol_idx] = (float)vol_idx;
}


Calling stub function looks something like this:

int CUDA_stub (Volume *vol, myOptions *options)
{
	// Thread block dimensions
	int tBlock_x = 5;
	int tBlock_y = 4;
	int tBlock_z = 4;

	// Used to build the "3D grid"
	int blocksInX;
	int blocksInY;
	int blocksInZ;
	dim3 dimGrid;
	dim3 dimBlock;

	FILE *debug_file;

	// CUDA device pointer
	float* dev_vol;		// Holds voxels on the device

	// First, allocate memory on the device for the 3D volume
	// of voxels that will hold our reconstruction.
	cudaMalloc( (void**)&dev_vol, vol->dim[0] * vol->dim[1] * vol->dim[2] * sizeof(float) );
	cudaMemset( (void *) dev_vol, 0, vol->dim[0] * vol->dim[1] * vol->dim[2] * sizeof(float) );

	// Each element in the volume (each voxel) gets 1 thread
	blocksInX = (vol->dim[0] + tBlock_x - 1) / tBlock_x;
	blocksInY = (vol->dim[1] + tBlock_y - 1) / tBlock_y;
	blocksInZ = (vol->dim[2] + tBlock_z - 1) / tBlock_z;

	dimGrid  = dim3(blocksInX, blocksInY * blocksInZ);
	dimBlock = dim3(tBlock_x, tBlock_y, tBlock_z);

	kernel_1<<< dimGrid, dimBlock >>>(dev_vol, blocksInY, 1.0f / (float)blocksInY);

	// Copy reconstructed volume from device to host
	cudaMemcpy( vol->img, dev_vol, vol->dim[0] * vol->dim[1] * vol->dim[2] * sizeof(float), cudaMemcpyDeviceToHost );

	// Debug
	printf("Writing debug file...\n");
	debug_file = fopen("debug_out.txt", "a+");
	int i;
	float* fimg = (float *)vol->img;
	for (i = 0; i < vol->dim[0] * vol->dim[1] * vol->dim[2]; i++) {
		if (fimg[i] != 0)
			fprintf(debug_file, "[%i]\t%f\n", i, fimg[i]);
	}
	fclose(debug_file);

	// Cleanup
	cudaFree( dev_vol );
	return 0;
}


For some reason, vol_idx always returns zero for me.

However, the individual parts of the expression seem to return the correct values when evaluated independently… in other words:

long int vol_idx = i;

returns the expected index

long int vol_idx = ( j*(gridDim.x)*blockDim.x );

also returns the expected offset

long int vol_idx = ( k*(gridDim.x)*blockDim.x*(Blocks_Y)*blockDim.y );

also seems to return the expected offset.

When I add them all together to get the absolute index, however, all I get is zeros for each and every combination of i, j, & k.

The commented-out scheme in the kernel, which indexes block by block, exhibits similar behavior.

Can anybody see where I have gone wrong?

Best regards,


I honestly haven't read through all that; this is what I use:

const int idx = (blockIdx.y * blockDim.x * gridDim.x) + blockIdx.x * blockDim.x + threadIdx.x;
const int z = idx / (DATA_W * DATA_H);
const int y = (idx - z * DATA_W * DATA_H) / DATA_W;
const int x = (idx - z * DATA_W * DATA_H - y * DATA_W);

The downside is that it takes a bunch of integer ops, but it works.

If your volume is not an exact multiple of the thread block size, you probably need something like this to avoid indexing outside the bounds of the volume:

unsigned int i = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
unsigned int j = __umul24(blockIdx_y, blockDim.y) + threadIdx.y;
unsigned int k = __umul24(blockIdx_z, blockDim.z) + threadIdx.z;

if ((i >= volumeSize.x) || (j >= volumeSize.y) || (k >= volumeSize.z))
	return;

I had exactly the same bug.

Hi Simon,

Thanks! You hit it right on the nose.

I had ended up getting it to work by increasing the size of my malloc to match the number of threads as opposed to matching the number of voxels in the volume.

I like your solution more since it doesn’t result in running extra calculations that will just get thrown away.

BTW, I am fairly new at CUDA (obviously)… does introducing the conditional you propose result in serialization of the thread block if some of the threads in the block are inside the volume and some are outside the volume?

Best Regards,


Yes, this will cause some thread divergence. I haven't measured it, but since it only happens for a small percentage of the blocks, I don't think it will cause much performance degradation.