Volumetric data processing with CUDA

Hi, I am new to CUDA. I implemented a 3D volume smoothing filter with CUDA, but it seems slower than my OpenMP version.
Is this due to too many global memory accesses on the GPU? Or is this simply not a suitable problem for GPU computation? Can anyone please help me understand this?
Thanks in advance.

It likely is due to your memory access patterns. Make sure to produce coalesced memory accesses and to reuse data read from global memory as much as possible.

On 2.x devices the cache helps with both of these points. But the cache is quite small compared to the large number of threads it serves, so don’t just rely on the cache.

Post some general details of the volumetric algorithm/code, otherwise we probably can’t help you see what might be wrong…

Thanks for your response. Below is what I did for the test. I took the simplest volume data process as an example. My volume data size is normally no larger than 512×512×512.

on host:

int volumeSmoothingMeanWrapper(unsigned char *a, int windowSize, int x_size, int y_size, int z_size)
{
	int ret = 0;

	dim3 grid(x_size, y_size, 1);	// one block per (x, y) column
	dim3 block(z_size, 1, 1);	// one thread per z index

	unsigned char *D_s, *D_t;
	int t_size = x_size * y_size * z_size * sizeof(unsigned char);

	cudaError_t err = cudaMalloc((void **)&D_s, t_size);
	err = cudaMalloc((void **)&D_t, t_size);

	err = cudaMemcpy(D_s, a, t_size, cudaMemcpyHostToDevice);
	err = cudaMemcpy(D_t, a, t_size, cudaMemcpyHostToDevice);

	volumeSmoothingMean<<<grid, block>>>(D_s, D_t, windowSize, x_size, y_size, z_size);

	err = cudaThreadSynchronize();
	err = cudaMemcpy(a, D_t, t_size, cudaMemcpyDeviceToHost);
	err = cudaGetLastError();
	ret = (int)err;

	cudaFree(D_s);
	cudaFree(D_t);

	return ret;
}

on device:

__global__ void volumeSmoothingMean(unsigned char *D_s, unsigned char *D_t, int windowSize, int x_size, int y_size, int z_size)
{
	// Clamp the window to the volume bounds. End indices are exclusive,
	// so the +1 keeps the window centred on the current voxel.
	int start_x = blockIdx.x - windowSize/2;
	if (start_x < 0)
		start_x = 0;
	int end_x = blockIdx.x + windowSize/2 + 1;
	if (end_x > x_size)
		end_x = x_size;

	int start_y = blockIdx.y - windowSize/2;
	if (start_y < 0)
		start_y = 0;
	int end_y = blockIdx.y + windowSize/2 + 1;
	if (end_y > y_size)
		end_y = y_size;

	int start_z = threadIdx.x - windowSize/2;
	if (start_z < 0)
		start_z = 0;
	int end_z = threadIdx.x + windowSize/2 + 1;
	if (end_z > z_size)
		end_z = z_size;

	float total = 0.f;
	float window_voxels = 0.f;

	for (int i = start_x; i < end_x; i++)
		for (int j = start_y; j < end_y; j++)
			for (int k = start_z; k < end_z; k++)
			{
				total += (float)D_s[gridDim.x*gridDim.y*k + gridDim.x*j + i];
				window_voxels += 1.f;
			}

	D_t[gridDim.x*gridDim.y*threadIdx.x + gridDim.x*blockIdx.y + blockIdx.x] = (unsigned char)(total/window_voxels + 0.5f);
}

Thanks for your help. Another problem is that it sometimes crashes when windowSize goes up to 7 or 9 (I only have one GPU for both rendering and computing). I did some searching here; it looks like this happens when a single GPU is used for both rendering and computing. If it requires putting in another graphics card for computing only, I really need to think twice about whether we should go with CUDA. Thanks again. By the way, my tests were on a GeForce GTX 280 and a GT 550M.

CUDA Programming Guide has more on this.

Launching a kernel requires specifying what is called an “execution configuration”.

It works like this:

<<< GridSize in dim3 format, BlockSize in dim3 format, PerBlockDynamicSharedMemoryInBytes in size_t format (optional, defaults to zero), CUDA stream (optional, defaults to zero) >>>

So the grid size specifies how many blocks there are in the grid.
So the block size specifies how many threads there are per block.

So far your kernel call seems ok.

Perhaps you should give us an example of call code.

The number of threads per block should be high, and the number of blocks in the grid as low as possible, I think.

Perhaps you should try swapping these:

dim3 grid(x_size, y_size, 1);
dim3 block(z_size,1,1);

into this:

dim3 grid(1, 1, 1);
dim3 block(x_size,y_size,z_size);

if that’s not possible then try:

dim3 grid(z_size, 1, 1);
dim3 block(x_size,y_size,1);

if that’s not possible then try:

dim3 grid(y_size, z_size, 1);
dim3 block(x_size,1,1);

So you should try to stuff as many dimensions into a block as possible and then put the rest in the grid.

At least that seems somewhat logical to me ;)

This advice swaps the order in which the volume is traversed… perhaps this re-ordering will give more performance.

You should definitely try it… let us know how it works out! Quite interesting! ;) =D

Hi Skybuck, thanks for your response and your interest in this topic.

I don’t think

dim3 grid(1, 1, 1);

dim3 block(x_size, y_size, z_size);

is going to work, since the third block dimension is limited to 64. I will try some of the other configurations you mentioned.

Actually I agree with Tera. It’s about memory coalescing, but I have not figured out a better way to produce coalesced memory accesses for this problem. Thanks.

Also, perhaps it’s better to do these calculations outside of the kernel:

windowSize/2

To me it would seem this gets done by every thread, every time? I could be wrong though ;)

Anybody know? ;)

No. You want enough blocks in the grid to fully load all available SMs, so the number of blocks should not be low. Please try to master the relevant section of the Programming Guide before repeatedly giving bad advice to novice users.

Your memory coalescing seems fine for devices of compute capability 1.2 or higher. Your main problem is that you repeatedly read the same data from global memory.

A band-aid fix for the GTX 280 would be to use a texture, so that the accesses at least get cached. However, I guess that with volumetric data the small cache will overflow rather quickly (which would also explain why the GT 550M with its cached global memory space doesn’t perform better). So I think you would be best off reading data into shared memory and manually managing its reuse. This will require some thinking about the best strategy, but will probably show why automatic caching does not work well.

You might find that there is not sufficient shared memory on the device to fully reuse the data fetched from global memory. In that case, a tiling strategy probably gives the best partial reuse of data.

Thanks Tera for your advice. I am going to try shared memory first.

Strange thing to say… since this is a volume with plenty of data for pretty much everything.

So starting by filling up threads first and then proceeding to blocks/grids seems logical to me.

I have no idea how many units are inside his GPU.

The point is that if he doesn’t fill up the threads first, there will be memory gaps.

He must avoid the memory gaps first, and the way his code looks, it seems he’s doing memory access exactly the wrong way around.
