Any Kernel Optimization Suggestions? Is there anything to improve?

I have implemented a Kernel that computes the smallest value for a voxel from its neighbours (18 Pixels arround the one pixel). This is depending on a mask which controls if a calculation is necessary or not. So I have 36 (actual distance and difference between two pixels) texture fetches for each pixel and 18 if else conditions. I have abbreviated the code to 3 calls of each “type”.

Only a small fraction of voxels need an update and since the mask is itself updated during compution I don’t know the access pattern to the memory, so shared memory is not beneficial. Instead I’m using tex1Dfetch for data that will change during computation and tex3D (cudaArray) for that will be only read.

Are there further optimzations that are worth spending the time or I’m out of options.

The CPU version of this code is quite fast and it took me only a fraction of time to implement it.

The CUDA implementation is only about 2.5 times faster the the CPU version which uses OpenMP on a Xeon 3540 (quad core) with

about 3 GHz and 8 MB cache. I using only the Microsoft compiler and not the faster Intel compiler.

__global__ void

min(float *src_dist, float *dst_dist, unsigned char *src_mask, unsigned char *dst_mask, unsigned char *input_img,

					   		 unsigned int dim_x, unsigned int dim_y, unsigned int dim_z)

{

	unsigned int ix = blockIdx.x*blockDim.x + threadIdx.x;

	unsigned int iy = blockIdx.y*blockDim.y + threadIdx.y;

	unsigned int stride_z = __umul24(dim_x,dim_y);

	unsigned int index;

	unsigned int min_index;

	float min_value;

	float input_image_111;

	if(ix > 0 && ix < dim_x && iy > 0 && iy < dim_y)

	{

		for(unsigned int iz = RADIUS; iz < dim_z-RADIUS; iz++)

		{

			index = __umul24(iz,stride_z) + __umul24(iy,dim_x) + ix;

			if(src_mask[index] == 1)

			{

			src_mask[index] = 0;

			min_index = index;

			min_value = tex1Dfetch(tex_dist_2, index);

			input_image_111 = tex3D(tex_image_3D, ix, iy, iz);

			float dist_100=tex1Dfetch(tex_dist_2, index-dim_x-1)+abs(input_image_111-tex3D(tex_image_3D, ix-1, iy-1, iz));

			min_value = min(min_value, dist_100);

			float dist_201=tex1Dfetch(tex_dist_2, index+stride_z-dim_x)+abs(input_image_111-tex3D(tex_image_3D, ix, iy-1, iz+1));

			min_value = min(min_value, dist_201);

			float dist_021=tex1Dfetch(tex_dist_2, index-stride_z+dim_x)+abs(input_image_111-tex3D(tex_image_3D, ix, iy+1, iz-1));

			min_value = min(min_value, dist_021);

			if(min_value < dist_100){dst_mask[index-dim_x-1] = 1;}else{min_index = index-dim_x-1;}

															 ...

			if(min_value < dist_201){dst_mask[index+stride_z-dim_x] = 1;}else{min_index = index+stride_z-dim_x;}

												...

			if(min_value < dist_021){dst_mask[index-stride_z+dim_x] = 1;}else{min_index = index-stride_z+dim_x;}

			}

		}

	}

}

I think that your kernel is purely memory-bound, and each entry is used only once.

1-D memory fetch is coalesced, I have no comment on how to optimized memory access but

for computation of “index”, I suggest

unsigned int iz = RADIUS;

index = __umul24(iz,stride_z) + __umul24(iy,dim_x) + ix;

for(; iz < dim_z-RADIUS; iz++, index+=stride_z )

{

	if(src_mask[index] == 1)

	{

		src_mask[index] = 0;

		....

}

Thank you for your suggestion!!!

I will try that on Monday.