Volumetric data processing with CUDA

Hi, I am new to CUDA. I implemented a 3D volume smoothing filter with CUDA, but it seems slower than my OpenMP version.
Is this due to too many global memory accesses on the GPU? Or is this simply not a suitable problem for GPU computation? Can anyone please help me understand this?
Thanks in advance.

It likely is due to your memory access patterns. Make sure to produce coalesced memory accesses and to reuse data read from global memory as much as possible.

On 2.x devices the cache helps with both of these points. But the cache is quite small compared to the large number of threads it serves, so don’t just rely on the cache.

Post some general details of the volumetric algorithm/code, otherwise we probably can’t help you see what might be wrong…

Thanks for your response. Below is what I did for the test. I took the simplest volume data process as an example. My volume data size is normally no larger than 512×512×512.

on host:

int volumeSmoothingMeanWrapper(unsigned char *a, int windowSize, int x_size, int y_size, int z_size)
{
	int ret = 0;

	dim3 grid(x_size, y_size, 1);	// one block per (x, y) column
	dim3 block(z_size, 1, 1);	// one thread per z index

	unsigned char *D_s, *D_t;
	int t_size = x_size * y_size * z_size * sizeof(unsigned char);

	cudaError_t err = cudaMalloc((void **)&D_s, t_size);
	err = cudaMalloc((void **)&D_t, t_size);

	err = cudaMemcpy(D_s, a, t_size, cudaMemcpyHostToDevice);
	err = cudaMemcpy(D_t, a, t_size, cudaMemcpyHostToDevice);

	volumeSmoothingMean<<<grid, block>>>(D_s, D_t, windowSize, x_size, y_size, z_size);

	err = cudaThreadSynchronize();
	err = cudaMemcpy(a, D_t, t_size, cudaMemcpyDeviceToHost);
	err = cudaGetLastError();
	ret = (int)err;

	cudaFree(D_s);
	cudaFree(D_t);

	return ret;
}

on device:

__global__ void volumeSmoothingMean(unsigned char *D_s, unsigned char *D_t, int windowSize, int x_size, int y_size, int z_size)
{
	// Clamp the window to the volume bounds. End indices are exclusive,
	// so the +1 keeps the window centred on the current voxel.
	int start_x = blockIdx.x - windowSize/2;
	if (start_x < 0)
		start_x = 0;
	int end_x = blockIdx.x + windowSize/2 + 1;
	if (end_x > x_size)
		end_x = x_size;

	int start_y = blockIdx.y - windowSize/2;
	if (start_y < 0)
		start_y = 0;
	int end_y = blockIdx.y + windowSize/2 + 1;
	if (end_y > y_size)
		end_y = y_size;

	int start_z = threadIdx.x - windowSize/2;
	if (start_z < 0)
		start_z = 0;
	int end_z = threadIdx.x + windowSize/2 + 1;
	if (end_z > z_size)
		end_z = z_size;

	float total = 0.f;
	float window_voxels = 0.f;

	for (int i = start_x; i < end_x; i++)
		for (int j = start_y; j < end_y; j++)
			for (int k = start_z; k < end_z; k++)
			{
				total += (float)D_s[gridDim.x*gridDim.y*k + gridDim.x*j + i];
				window_voxels += 1.f;
			}

	D_t[gridDim.x*gridDim.y*threadIdx.x + gridDim.x*blockIdx.y + blockIdx.x] = (unsigned char)(total/window_voxels + 0.5f);
}

Thanks for your help. Another problem is that it sometimes crashes when windowSize goes up to 7 or 9 (I only have one GPU for both rendering and computing). I did some searching here; it looks like this happens when a single GPU is used for both rendering and computing. If it requires putting in another graphics card for computing only, I really need to think twice about whether we should go with CUDA. Thanks again. By the way, my tests were on a GeForce GTX 280 and a GT 550M.

CUDA Programming Guide has more on this.

Launching a kernel requires specifying what is called an “execution configuration”.

It works like this:

<<< GridSize in dim3 format, BlockSize in dim3 format, PerBlockDynamicSharedMemoryInBytes in size_t format (optional, defaults to zero), CUDA stream (optional, defaults to zero) >>>

So the grid size specifies how many blocks there are in the grid.
So the block size specifies how many threads there are per block.

So far your kernel call seems ok.

Perhaps you should give us an example of call code.

The number of threads per block should be high, and the number of blocks in the grid as low as possible, I think.

Perhaps you should try swapping these:

dim3 grid(x_size, y_size, 1);
dim3 block(z_size,1,1);

into this:

dim3 grid(1, 1, 1);
dim3 block(x_size,y_size,z_size);

if that’s not possible then try:

dim3 grid(z_size, 1, 1);
dim3 block(x_size,y_size,1);

if that’s not possible then try:

dim3 grid(y_size, z_size, 1);
dim3 block(x_size,1,1);

So you should try to stuff as many dimensions into a block as possible and then put the rest in the grid.

At least that seems somewhat logical to me ;)

This advice swaps the order in which the volume is traversed… perhaps this re-ordering will give more performance.

You should definitely try it… let us know how it works out! Quite interesting! ;) =D

Hi Skybuck, thanks for your response and your interest in this topic.

I don’t think

dim3 grid(1, 1, 1);

dim3 block(x_size, y_size, z_size);

is going to work, since the third block dimension is limited to 64. I will try some of the other configurations you mentioned.

Actually I agree with Tera. It’s about memory coalescing, but I have not figured out a better way to produce coalesced memory accesses for this problem. Thanks.

Also, perhaps it’s better to do these calculations outside of the kernel:

windowSize/2

To me it would seem this gets done by every thread, every time? I could be wrong though ;)

Anybody know? ;)

No. You want enough blocks in the grid to fully load all available SMs, so the number of blocks should not be low. Please try to master the relevant section of the Programming Guide before repeatedly giving bad advice to novice users.

Your memory coalescing seems fine for devices of compute capability 1.2 or higher. Your main problem is that you repeatedly read the same data from global memory.

A band-aid fix for the GTX 280 would be to use a texture, so that the accesses at least get cached. However, I guess that with volumetric data the small cache will overflow rather quickly (which would also explain why the GT 550M with its cached global memory space doesn’t perform better). So I think you would be best off reading data into shared memory and manually managing its reuse. This will require some thinking about the best strategy, but will probably show why automatic caching does not work well.

You might find that there is not sufficient shared memory on the device to fully reuse the data fetched from global memory. In that case, a tiling strategy probably gives the best partial reuse of data.

Thanks Tera for your advice. I am going to try shared memory first.

Strange thing to say… since this is a volume with plenty of data for pretty much everything.

So starting by filling up threads first and then proceeding to blocks/grids seems logical to me.

I have no idea how many units are inside his GPU.

The point is that if he doesn’t fill up the threads first, there will be memory gaps.

He must avoid the memory gaps first, and the way his code looks, it seems he’s doing memory access exactly the wrong way around.
