Filtering of volumetric datasets with CUDA: beginner asking for design advice

Being fairly new to CUDA (but experienced in C), I'd like to ask you for some advice on designing a CUDA program:
So far I've written a framework for handling volumetric datasets (i.e. downloading them, transferring them to the device, applying some basic filtering and returning the results to the host).

Let's assume I just want to sum up the 3x3x3 neighbourhood of each voxel and divide the value by 27, i.e. implement a simple "box filter" on the dataset:
How would you design such a program? The datasets are up to 1 GB in size. How should I map this task onto CUDA's kernel/block/grid model? Where should I put the data, constant or global memory? Should I transfer sub-volumes to shared memory? I actually do have a working solution, I just don't know whether it is efficient…
Should one kernel instance process one new value of the smoothed dataset?

I'm just curious what tricks and hints you can give me, and whether you could point out possible problems or best practices.
Maybe just a few lines of pseudocode, or how you'd arrange the steps, would give me a clue whether I'm on track in terms of "thinking CUDA".

thanks a lot! :)

Since the ratio of arithmetic operations to memory accesses is very low, it’s indeed critical to get the memory right.

One way to do it would be to transfer sub-volumes to shared memory. The more you reuse data once it's fetched, the fewer trips to global memory you have to make.

The way I'd do it: use blocks with the dimensions 32x4x4. Each thread loads the single data element corresponding to its position into shared memory, so the whole block conceptually loads a cuboid-shaped slice out of your domain (the 3D grid). Now the "inner" threads (0 < x < 31, 0 < y < 3, 0 < z < 3) all have their neighbours in shared memory and can do the filtering. Blocks need to overlap to cover the margins, since the "outer" threads don't compute (they don't have all their neighbours within shared memory). Your grid will thus be slightly larger than the domain; remember to take that into account when indexing the dataset.
This requires 512 * dataElementSize bytes of shared memory per block (float and float2 fit; if you filter rgba32 colours or other complex values, you'll need to do a couple of sequential loads/passes). Since the x dimension is aligned to 32, the global accesses will be coalesced.
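A minimal sketch of that tile scheme (all names, the border clamping and the indexing are my assumptions, not from the thread; on hardware without 3D grids, the z blocks would have to be folded into the grid's y dimension):

```cuda
#define BX 32
#define BY 4
#define BZ 4

__global__ void boxFilterShared(const float *volume, float *filtered,
                                int dimX, int dimY, int dimZ)
{
    __shared__ float tile[BZ][BY][BX];

    // Blocks overlap by 2 in each dimension so the inner threads of
    // all blocks together cover the whole domain.
    int x = blockIdx.x * (BX - 2) + threadIdx.x - 1;
    int y = blockIdx.y * (BY - 2) + threadIdx.y - 1;
    int z = blockIdx.z * (BZ - 2) + threadIdx.z - 1;

    // Clamp to the volume borders so edge blocks stay in bounds.
    int cx = min(max(x, 0), dimX - 1);
    int cy = min(max(y, 0), dimY - 1);
    int cz = min(max(z, 0), dimZ - 1);

    // Every thread loads exactly one element into the shared tile.
    tile[threadIdx.z][threadIdx.y][threadIdx.x] =
        volume[(cz * dimY + cy) * dimX + cx];
    __syncthreads();

    // Only the inner threads have a full 3x3x3 neighbourhood in the tile.
    if (threadIdx.x > 0 && threadIdx.x < BX - 1 &&
        threadIdx.y > 0 && threadIdx.y < BY - 1 &&
        threadIdx.z > 0 && threadIdx.z < BZ - 1 &&
        x < dimX && y < dimY && z < dimZ)
    {
        float sum = 0.0f;
        for (int dz = -1; dz <= 1; ++dz)
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    sum += tile[threadIdx.z + dz][threadIdx.y + dy][threadIdx.x + dx];
        filtered[(z * dimY + y) * dimX + x] = sum / 27.0f;
    }
}
```

The grid would be sized to ceil(dimX/30) x ceil(dimY/2) x ceil(dimZ/2) blocks, since each block only produces the 30x2x2 inner results.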

You could also try using 3D textures; they attempt to do the caching for you. You'd just have each thread read the data of its neighbours straight from a 3D texture and hope the caching works well enough.

Did you look into the boxFilter example from the SDK?

Hey, thanks a lot for this very detailed explanation.

I had already read about the (arithmetic operations / memory accesses) ratio in the documentation; that was exactly what I thought. I wasn't sure how to resolve it, so loading chunks into shared memory is the way to go? You actually pointed out quite well how you'd do it, thanks!

:"> Haven't seen that yet… uh… I'll take a look.

Actually I will implement a bit more than a box filter: convolution with a Gaussian kernel, and that requires a lot more operations. On the other hand, I will need more than a 3x3 neighbourhood. But I guess the key is to develop some kind of memory scheme, splitting the problem into chunks and transferring them to shared memory, right? Maybe this sounds funny to you, but as I said, I'm new to CUDA programming.

For a classical filtering problem, is the general layout of "one kernel thread per voxel" the way to go?

Usually you'd process a volume by iterating over all the voxels in x, y, z and doing whatever you have to; so in parallel programming, the payload of the three nested for loops translates to what a single kernel thread does (launched X*Y*Z times)?

one question about your texture remark…

Do you mean that storing the data in a texture will have caching benefits, i.e. the spatial closeness of my operations will lead to better performance, as the GPU is able to use some internal caching techniques? Or would the texture do the "filtering", in that I could read values between two points and the texture hardware "interpolates" them?

In the former case: I guess textures can be quite big, i.e. bigger than what I could transfer to shared memory. A 3D texture actually is a volume to me (although I have grey values only, not RGB), so I could just work with it as if it were in global memory, and the caching is done by the GPU?

Yep. The bigger the neighbourhood, however, the worse relative performance you can expect (because the "margins" are larger).

One thread per voxel is the most intuitive way. That's just naively splitting/parallelizing the loops over x, y, z. So instead of

for (x in X)
  for (y in Y)
    for (z in Z)
      voxel[x][y][z] = voxel[x+1][y][z] + voxel[x-1][y][z] + ...

you launch a kernel on a 3D grid of dimensions X*Y*Z and have


  x = threadIdx.x + ...
  y = threadIdx.y + ...
  z = threadIdx.z + ...
  voxel[x][y][z] = voxel[x+1][y][z] + voxel[x-1][y][z] + ...

The problem with this is that for each element you also load its neighbours: 8 memory accesses per voxel, so for a block of 512 threads there are 4096 memory reads. If you use shared memory, you reduce the number of memory accesses because the data is reused. You load a whole batch of voxels once (512 memory reads), but only the "inner" threads can work, because only they have all their neighbours in shared memory. So only 120 threads do the computation, but 512 reads for 120 results is a better ratio than 4096 reads for 512 results.

Actually, textures give both caching and built-in filtering. The caching is, in that case, optimized for spatial locality, so it reduces the number of trips to global memory for nearby voxels in a similar way. It's easier to code and you can try it as a first attempt. Optionally, you get "free" linear filtering of a certain, lower precision (done in hardware). Read the Programming Guide for more info.

You would allocate your whole domain as a texture (textures can be as big as global memory) and you'd write something like


  x = threadIdx.x + ...
  y = threadIdx.y + ...
  z = threadIdx.z + ...
  voxel[x][y][z] = tex3D(voxelTex, x+1, y, z) + tex3D(voxelTex, x-1, y, z) + ...
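For completeness, the host-side setup that snippet assumes would look roughly like this, a sketch using the classic texture-reference API of that era (the names voxelTex and uploadVolume are my assumptions; newer CUDA versions replace this with texture objects):

```cuda
// Assumes a float volume of size w * h * d in a linear host buffer.
texture<float, 3, cudaReadModeElementType> voxelTex;

void uploadVolume(const float *hostData, int w, int h, int d)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaExtent extent = make_cudaExtent(w, h, d);

    // 3D textures are backed by a CUDA array, not by plain global memory.
    cudaArray *volArray;
    cudaMalloc3DArray(&volArray, &desc, extent);

    // Copy the linear host buffer into the 3D CUDA array.
    cudaMemcpy3DParms copy = {0};
    copy.srcPtr = make_cudaPitchedPtr((void *)hostData,
                                      w * sizeof(float), w, h);
    copy.dstArray = volArray;
    copy.extent = extent;
    copy.kind = cudaMemcpyHostToDevice;
    cudaMemcpy3D(&copy);

    // Bind the array to the texture reference used by the kernel;
    // voxelTex.filterMode = cudaFilterModeLinear would enable the
    // hardware interpolation mentioned above.
    cudaBindTextureToArray(voxelTex, volArray, desc);
}
```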

Hey Big_Mac! The credit for this detailed information goes to Poland, thanks a lot!
I will proceed to implement a first Gaussian neighbourhood filter based on a 3D texture; I guess I will benefit from the built-in spatial caching in the x, y, z directions, and at first glance it seems as if I don't have to think a lot about shared memory. As I can see from your sketched prototype, and without having read about the textures yet, I guess they are read-only (?) and the memory (voxel) I write to is global?

Another simple question:
Instead of each thread writing its result to global memory: would it be more efficient if all threads wrote to a block of shared memory, then synchronized, and one of the threads wrote the whole shared memory block to global memory?

Yes and yes.

No, it's much better if each thread writes to global memory on its own; see Programming Guide chapter 5.1.2, e.g. the image on page 84.