Volumetric data processing with CUDA

Also, the smoothing filter algorithm is not clear to me… what’s it doing? Give some pseudocode?

Other advice would be: try splitting the code up into subroutines, that’s what they are for (especially if they can be inlined, that would be nice). Try splitting up the index calculations for more clarity… not sure how far that’s possible in C/C++/CUDA C/macros… (I have yet to write my own kernels.)

Start with a SIMD routine for 32 threads as the lowest basic unit.

Apply that along the width, and make a routine for that.

Then multiply that by the height.

And finally the depth.

Seems logical… however, perhaps the algorithm needs to do something other than a simple 3D traversal?

Therefore, give the algorithm in pseudocode…

For now as far as I understand your filter algorithm should be something like:

for z
    for y
        for x
            Smooth surrounding pixels ?
        end
    end
end

Hi SkyBuck,

Volume data processing can be a little sophisticated, but my example starts with the simplest case. I think you got the point. In my example, I just wanted to take the average value over the window defined by WindowSize as the smoothed value. I guess this is the easiest one, and I wanted to work this out so that I can put other things in. I have heard a lot about CUDA, but I was also wondering why I couldn’t get any performance out of such a simple example. Thanks
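For reference, a plain CPU version of that window mean might look like the following (just a sketch; I am assuming WindowSize is the half-width of a cubic window, and the usual Height * Width * Z + Width * Y + X indexing):

void smoothMeanCPU(const unsigned char* in, unsigned char* out,
                   int W, int H, int D, int windowSize)
{
    for (int z = 0; z < D; z++)
        for (int y = 0; y < H; y++)
            for (int x = 0; x < W; x++)
            {
                float total = 0.f;
                float count = 0.f;
                // average over the cube of half-width windowSize around (x,y,z),
                // skipping neighbours that fall outside the volume
                for (int k = z - windowSize; k <= z + windowSize; k++)
                    for (int j = y - windowSize; j <= y + windowSize; j++)
                        for (int i = x - windowSize; i <= x + windowSize; i++)
                            if (i >= 0 && i < W && j >= 0 && j < H && k >= 0 && k < D)
                            {
                                total += in[H*W*k + W*j + i];
                                count += 1.f;
                            }
                out[H*W*z + W*y + x] = (unsigned char)(total / count);
            }
}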

For somebody claiming that I should read the guide better, you should take that advice yourself.

The term “sm” is not mentioned once in the guide…

There are only compile options which mention those two letters.

So the person who is actually giving bad advice is you yourself!

Stick to the terms in the guide or bug off.

In principle, the code should be really easy:

KernelLaunch<<< dim3(1,1,1), dim3(512,512,512) >>>( … )

But I am starting to suspect that CUDA is pretty messed up and can’t simply divide the work by itself; this creates problems which you as the programmer have to solve, and that makes it unnecessarily difficult.

So what you probably need is some routine which will take these dimensions and divide them across the built-in indexes so that it obeys the constraints like “max threads per block”.

I am not sure what violating those constraints means… does it mean the kernel won’t launch?! I think that’s what the guide said… it won’t launch…
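For what it’s worth, that is easy to check: an oversized block makes the launch fail immediately, and the error is visible via cudaGetLastError. A minimal sketch (DummyKernel is a made-up name; the 512x512x512 block is far beyond the per-block limit, so the launch should report an “invalid configuration argument” error):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void DummyKernel() { }

int main()
{
    // far more than the maximum number of threads per block
    DummyKernel<<< dim3(1,1,1), dim3(512,512,512) >>>();
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));
    return 0;
}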

In principle, accessing a volume should be really easy:

// access volumetric cells with the following formula:
Cell[ Height * Width * Z + Width * Y + X ]
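As a device helper, that formula could be wrapped up like so (a sketch; the function name is my own):

__device__ __forceinline__ int CellIndex(int x, int y, int z, int width, int height)
{
    // linear index into a volume stored x-major, then rows, then slices
    return height * width * z + width * y + x;
}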

The indexes would be the built-in indexes. However, the problem now might be which ones to select.

Simply selecting thread.x, thread.y, thread.z or block.x, block.y, block.z might not work well.

It depends on your hardware, but my guess would be as follows, for a volume limited to 512x512x512 or maybe even 1024x1024x1024:

Set X to thread.index.x; that seems logical, since this will at least process X in SIMD-like fashion with good warps.

Set Y to block.index.y, to not confuse things further.

Set Z to block.index.z, to not confuse things further.

This probably should be:

Set Y to grid.index.y, to not confuse things further.

Set Z to grid.index.z, to not confuse things further.

Perhaps it’s then best to launch as follows:

BlockDim[0] = VolumeWidth
BlockDim[1] = VolumeHeight
BlockDim[2] = VolumeDepth

I now see a problem.

What do block.x and thread.x mean? Are they the same thing?

If not, then how can a kernel launch possibly specify threads?! There is no launch option for that.

Anyway, if there are multiprocessors in there then perhaps it’s necessary to use the grid as well… but I would first try without that.

Anyway from the cuda.h comments it seems: BlockDim is the total number of threads in a block.

So what the manual fails to explain is that a block can actually be 3D… all manuals I have seen so far assume it’s 2D.

In reality, however, since blocks have a certain size limit, 3D blocks are unlikely to occur for a while.

Nonetheless, it’s time to start thinking of a block as something that can be 3D…

So a block holds a 3D arrangement of threads.

The grid is what holds the blocks.

Anyway, this is all somewhat confusing.

I have no CUDA C reference, and the guides which I have seen so far do a pretty bad job of explaining what these indexes are…

Also, if block.x is exactly the same as thread.x, then that just confuses things unnecessarily.

I’ll go look into this some more.

OK, I see now. First of all, the index names are really FUBARed; I hate that:

They’re called:

blockIdx

and

threadIdx

Why they think it’s a good idea to save typing two letters is beyond me?!

Would it really be so bad if it was:

blockIndex

and

threadIndex?!

^ Delphi nice, C bad, but OK, let’s not go into that; learn to copy & paste if you don’t like typing.

Anyway, now it’s obvious to me, at least, how the volume should be traversed:

X = threadIdx.x;
Y = blockIdx.y;
Z = blockIdx.z;

This assumes the number of threads in a block can be large enough.

An easier way could also be:

X = (blockIdx.x * 32) + threadIdx.x;
Y = blockIdx.y;
Z = blockIdx.z;

This divides the width across blocks, and x is assumed to be divisible by the warp size, so that the threads can traverse it linearly.

GridDim.x = VolumeWidth / 32;
GridDim.y = VolumeHeight;
GridDim.z = VolumeDepth;

BlockDim.x = 32;
BlockDim.y = 1;
BlockDim.z = 1;

^ These are a bit confusing… but the grid dim gives the number of blocks on the grid, and the block dim gives the number of threads in each block.

You could also further divide the volume width to get more warps.

Something like that.
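Putting those pieces together, the launch might look like this (a sketch; SmoothKernel and the Volume*/InputVolume/OutputVolume names are placeholders, and the width is assumed to be a multiple of 32):

dim3 GridDim(VolumeWidth / 32, VolumeHeight, VolumeDepth);  // blocks on the grid
dim3 BlockDim(32, 1, 1);                                    // threads per block: one warp
SmoothKernel<<< GridDim, BlockDim >>>(InputVolume, OutputVolume,
                                      VolumeWidth, VolumeHeight, VolumeDepth);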

You should probably also use an input volume and an output volume, so the kernel can simply read from the input without having to worry about synchronization issues when writing, and everything can proceed in parallel.

Actually smoothing a voxel with its surrounding voxels seems pretty straightforward:

x1 = X;
y1 = Y;
z1 = Z;

color_sum = 0;
for z2 = -1 to 1
    for y2 = -1 to 1
        for x2 = -1 to 1
            color_sum = color_sum + InputVolume[ apply formula(x1+x2, y1+y2, z1+z2) ].color;
        end
    end
end

color_sum = color_sum / 27;

OutputVolume[ apply formula(X,Y,Z) ] = color_sum;
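Written out as a CUDA kernel, with the index mapping from before and repeating the CellIndex helper so the snippet is self-contained, it might look like this (a sketch; clamping neighbours at the volume borders is my own choice, and the kernel name is made up):

__device__ __forceinline__ int CellIndex(int x, int y, int z, int width, int height)
{
    return height * width * z + width * y + x;
}

__global__ void Smooth3x3x3(const unsigned char* InputVolume, unsigned char* OutputVolume,
                            int W, int H, int D)
{
    int X = blockIdx.x * blockDim.x + threadIdx.x;
    int Y = blockIdx.y;
    int Z = blockIdx.z;
    if (X >= W) return;

    float color_sum = 0.0f;
    for (int z2 = -1; z2 <= 1; z2++)
        for (int y2 = -1; y2 <= 1; y2++)
            for (int x2 = -1; x2 <= 1; x2++)
            {
                // clamp neighbours to the volume borders
                int x = min(max(X + x2, 0), W - 1);
                int y = min(max(Y + y2, 0), H - 1);
                int z = min(max(Z + z2, 0), D - 1);
                color_sum += InputVolume[ CellIndex(x, y, z, W, H) ];
            }

    OutputVolume[ CellIndex(X, Y, Z, W, H) ] = (unsigned char)(color_sum / 27.0f);
}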

Bye,
Skybuck.

on device:

__global__ void volumeSmoothingMean(unsigned char *D_s, unsigned char *D_t, int windowSize, int x_size, int y_size, int z_size)
{
    float total = 0.f;
    float window_voxels = 0.f;

    // (start_x/end_x etc. are presumably computed from windowSize in the part
    // of the kernel that the quote cut off)
    for (int i = start_x; i < end_x; i++)
        for (int j = start_y; j < end_y; j++)
            for (int k = start_z; k < end_z; k++)
            {
                total += (float)D_s[gridDim.x*gridDim.y*k + gridDim.x*j + i];
                window_voxels += 1.f;
            }
}


Also, this code looks to me like you are trying to sum up a whole lot of values in each thread, pretty much for each block and each thread… it’s almost like summing the whole volume per pixel… well, it’s not that bad, because of the start and end bounds… but that’s not how it should be done.

If you want to sum up a whole lot of values, you should first of all read from and write to different buffers, so an input volume and an output volume.

You should also look into pre-scan and the like if you really only want to sum a few values… but it seems you want an average over an entire area.

In that case, look at the previous code and adjust the loops from -1 to 1 to whatever range you want to sum over.

But I guess that’s the same as reducing a volume from X,Y,Z to X/2, Y/2, Z/2 or so…

So perhaps you want a large volume to be turned into a small, smoothed volume?

(Or perhaps just a large volume as output, with some anti-aliasing effect.)

To go from large to small volumes, it’s probably interesting to study pre-scan or so…

Perhaps running multiple kernels would be fastest.

Each time, the kernel is advanced a bit.

So let’s say the volume is 1024x1024x1024 and has to be reduced to 1024/10 x 1024/10 x 1024/10.

Then this would involve 10 passes to sum in the x direction.

Then ×10 passes to sum in the x and y directions.

Then ×10 passes to sum in the x, y and z directions.

Each pass would add the input to a location in the output, which would be slightly adjusted based on the pass.

So the total number of passes is 10x10x10 = 1000 passes.

The question is whether it would be faster to do it in one pass…

However, then each “large index triple” would need to be divided down to a “small index triple”.

Something like:

SmallX = LargeX / 10;
SmallY = LargeY / 10;
SmallZ = LargeZ / 10;

^ That’s a lot of divisions! But then again, CUDA can do that, and memory access takes long… so divisions take maybe 3 times longer than memory accesses; perhaps it will be compute bound and not access bound. It depends on how well CUDA can hide latency and proceed with the next divisions… the divisions outweigh the memory accesses, though not completely… perhaps it’s even perfectly balanced: one write, one read and another read.

SmallVolume[SmallZ,SmallY,SmallX] = SmallVolume[SmallZ,SmallY,SmallX] + LargeVolume[LargeZ,LargeY,LargeX];

Finally, one last pass would be needed, or a branch like so:

if ((LargeX % 10 == 0) && (LargeY % 10 == 0) && (LargeZ % 10 == 0))
{
    SmallVolume[SmallZ,SmallY,SmallX] = SmallVolume[SmallZ,SmallY,SmallX] / 1000;
}

^ Perhaps only one thread per small cell will actually take that branch, in which case an atomic operation would not be needed, but I am not sure about that…
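For what it’s worth, here is a sketch of the one-pass idea as a kernel (my own arrangement, with made-up names; since many large-volume threads map onto the same small cell, the accumulation has to be atomic, which assumes a GPU that supports atomicAdd on floats, and the division by 1000 is folded into the add so no final pass is needed):

__global__ void Downsample10(const unsigned char* LargeVolume, float* SmallVolume,
                             int LargeW, int LargeH, int SmallW, int SmallH)
{
    int LargeX = blockIdx.x * blockDim.x + threadIdx.x;
    int LargeY = blockIdx.y;
    int LargeZ = blockIdx.z;

    int SmallX = LargeX / 10;
    int SmallY = LargeY / 10;
    int SmallZ = LargeZ / 10;

    // pre-divide each contribution so no separate "/ 1000" pass is needed
    float v = LargeVolume[LargeH*LargeW*LargeZ + LargeW*LargeY + LargeX] / 1000.0f;
    atomicAdd(&SmallVolume[SmallH*SmallW*SmallZ + SmallW*SmallY + SmallX], v);
}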

Ok, I am now going to stop posting, because it’s still not clear to me what kind of smoothing you actually want… these are just some ideas.

That’s three letters, but close enough; now I understand.

Anyway, I notice something strange in the guide…

At the start it shows pictures.

It shows block 0 and block 4 being executed by the same multiprocessor on a 4-core GPU.

Shouldn’t core 0 execute blocks 0 to 3 and core 1 blocks 4 to 7 instead? According to a chapter later on it does do that; at least it’s implicitly implied by the index formulas.

So the picture seems to contradict the index formulas.

Anyway, later on the guide mentions that blocks can be executed in any order on any multiprocessor? So perhaps the logical ordering/traversal of the indexes doesn’t apply to blocks.

But it should at least apply to threads/warps?! All threads within a single block are executed on a single multiprocessor, and I would assume thread 0 gets executed before thread 32, which gets executed before thread 64… or is thread/warp execution order also unspecified?!

Hmmm…

If so, then perhaps “memory access order” doesn’t matter for a parallel system?!

There is some “caching” of memory requests… like cache lines… each memory access retrieves something like 32 or 64 bytes, even if it only needs 1 byte… so I would assume that some in-order processing would still be fastest for that reason.
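That lines up with what is usually called coalescing: when consecutive threads in a warp touch consecutive addresses, one such 32/64-byte transaction serves the whole warp. A small sketch contrasting the two access patterns (the kernel names are made up):

__global__ void CoalescedCopy(const unsigned char* In, unsigned char* Out)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    // neighbouring threads read neighbouring bytes: few transactions per warp
    Out[x] = In[x];
}

__global__ void StridedCopy(const unsigned char* In, unsigned char* Out, int stride)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    // each thread lands on a different cache line: many transactions per warp
    Out[x] = In[x * stride];
}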
