Is my problem suitable for implementation in CUDA?

Hello all, my first post.

CUDA seems very interesting and I’d like to learn more by using it for my current project, but I’m unsure if the algorithm is suitable for CUDA implementation. Here’s what I’m doing:

  1. I grab a frame from a firewire camera.
  2. I compute the contrast in a subimage of the image (typically 16x16 pixels).
  3. I draw a square in Direct3D with a color according to the computed contrast and alpha blend it on top of the grayscale image from the camera.
  4. I move over to the next subimage and repeat.

This problem is extremely parallel: the contrast calculation for one subimage is independent of every other, so they can all be computed asynchronously. My current implementation is threaded across multiple cores, and each thread makes use of SSE/SSE3 instructions. However, I can’t get the frame rate I need. Is this something well suited to CUDA?

Thanks in advance for any input.

It sounds like it would be workable in CUDA. It also sounds like a simple process, though I’m probably mistaken …

  • how many FPS do you need?
  • how complex is the contrast calculation?

You’re correct - the contrast calculation is very simple. Consider a neighborhood of pixels (say 16x16) in an image. First, compute the average:

float averageVal = 0.0f;

for (int row = 0; row < 16; row++)
{
    for (int column = 0; column < 16; column++)
    {
        averageVal += SubImage[row][column];
    }
}

averageVal = averageVal / (16.0f * 16.0f);

Then, compute the standard deviation over the same neighborhood:

float StdDev = 0.0f;

for (int row = 0; row < 16; row++)
{
    for (int column = 0; column < 16; column++)
    {
        // Squared deviation of this pixel from the neighborhood average
        float diff = SubImage[row][column] - averageVal;
        StdDev += diff * diff;
    }
}

// sqrtf is declared in <math.h>
StdDev = sqrtf(StdDev / (16.0f * 16.0f));

float Contrast = StdDev / averageVal;

I’m looking for about 20 FPS for an 8-bit camera with resolution 1380 x 1024 (preferably greater), with the contrast computed for a 16x16 subimage every 4 pixels. This translates to doing the neighborhood contrast calculation 88320 times per frame. I cannot achieve that now: with my parallel CPU implementation (threads + SIMD) I can only get the frame rate I need if I do the contrast calculation every 16 pixels, not every 4 like I would prefer. In other words, it would seem that I need a 16X increase in speed. Comments, anyone?
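For anyone checking the arithmetic, here is a small sketch of where 88320 comes from. The function names are made up for illustration. The first function counts every window origin on a 4-pixel grid, which matches the figure above; the second counts only windows that fit completely inside the image, which gives a slightly smaller number.

```cpp
#include <cassert>

// Number of window origins along one axis when a window starts every
// `stride` pixels, counting every origin that lies inside the image.
int originsPerAxis(int extent, int stride) {
    assert(extent > 0 && stride > 0);
    return extent / stride;
}

// Same idea, but counting only windows of size `window` that fit
// entirely inside the image (no window hangs over the border).
int fullWindowsPerAxis(int extent, int window, int stride) {
    assert(window > 0 && extent >= window && stride > 0);
    return (extent - window) / stride + 1;
}
```

With a 1380 x 1024 image and a stride of 4, the first count is 345 x 256 = 88320, and the border-safe count is 342 x 253 = 86526.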

You can have a blocksize of 16x16 threads, and a gridsize as big as N_pix_x * N_pix_y for which you want to calculate this. Put your image in a 2D texture, let each thread read the appropriate value, do a reduction to get the average, then calculate the deviation for each pixel and do another reduction to get the sigma. Your writes will be uncoalesced since you will write 1 value per block, but it will be as fast as it gets and should get you close to bandwidth-limited performance (so about 80 Gbyte/s on an 8800GTX).
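To make the decomposition concrete, here is a rough CPU-side sketch in plain C++ (not CUDA, and not tested on a GPU). The loops over `by`/`bx` stand in for the grid of blocks, and the loops over `ty`/`tx` stand in for the 16x16 threads of one block. All names (`blockContrast`, `contrastMap`, `kWindow`) are invented for this sketch, the image is assumed to be a flat row-major 8-bit buffer, and the zero-mean guard is my own addition.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

constexpr int kWindow = 16;  // window edge; also the 16x16 "thread block" shape

// Work of one block: each (ty, tx) pair plays the role of one thread that
// reads one pixel of the window whose top-left corner is (x0, y0).
float blockContrast(const std::vector<std::uint8_t>& image, int width, int x0, int y0) {
    float values[kWindow * kWindow];
    for (int ty = 0; ty < kWindow; ++ty)
        for (int tx = 0; tx < kWindow; ++tx)
            values[ty * kWindow + tx] = image[(y0 + ty) * width + (x0 + tx)];

    const float n = float(kWindow * kWindow);
    // First reduction: sum of the pixels, giving the mean.
    const float mean = std::accumulate(values, values + kWindow * kWindow, 0.0f) / n;
    // Each "thread" squares its own deviation; the second reduction sums them.
    float sumSq = 0.0f;
    for (float v : values) sumSq += (v - mean) * (v - mean);
    const float sigma = std::sqrt(sumSq / n);
    return mean > 0.0f ? sigma / mean : 0.0f;  // guard for an all-black window (assumption)
}

// The "grid": one block per window origin, one output value per block.
std::vector<float> contrastMap(const std::vector<std::uint8_t>& image,
                               int width, int height, int stride) {
    assert(width >= kWindow && height >= kWindow && stride > 0);
    assert(image.size() == std::size_t(width) * height);
    const int blocksX = (width - kWindow) / stride + 1;
    const int blocksY = (height - kWindow) / stride + 1;
    std::vector<float> out(std::size_t(blocksX) * blocksY);
    for (int by = 0; by < blocksY; ++by)
        for (int bx = 0; bx < blocksX; ++bx)
            out[by * blocksX + bx] = blockContrast(image, width, bx * stride, by * stride);
    return out;
}
```

In a real kernel the two sums would be done cooperatively by the threads of the block rather than by a serial loop, and the pixel reads would go through the texture.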

20 * 1380 * 1024 = 28 Mbyte per second of data, so getting the data to your GPU should not be a problem either. (It may be best to get a G92 or later for Compute Capability 1.1, where you can transfer the next frame to the GPU while the current one is being processed.)
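The data-rate estimate can be written down as a tiny helper, which makes it easy to redo for a bigger camera. This is only a back-of-the-envelope sketch: the function names are hypothetical, and the transfer rate passed to the second function is a placeholder that has to be measured on the actual system.

```cpp
#include <cassert>

// Bytes per second needed to stream 8-bit frames of the given size.
long long requiredBytesPerSecond(int width, int height, int fps) {
    assert(width > 0 && height > 0 && fps > 0);
    return 1LL * width * height * fps;
}

// Milliseconds needed to copy one 8-bit frame at a given sustained rate.
// `bytesPerSecond` is a placeholder input, not a known hardware figure.
double copyMillisPerFrame(int width, int height, double bytesPerSecond) {
    assert(width > 0 && height > 0 && bytesPerSecond > 0.0);
    return 1000.0 * width * height / bytesPerSecond;
}
```

`requiredBytesPerSecond(1380, 1024, 20)` gives 28262400, i.e. the 28 Mbyte per second above. At 20 FPS the whole budget per frame is 50 ms, so the copy time from the second function should be compared against that.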

Someone correct me if I’m wrong but you seem to be using the slowest algorithm imaginable.

In general, to get averages, you should use a sliding window, adding pixels on one end and subtracting from the other end.
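A minimal sketch of the sliding-window idea along one row, assuming 8-bit pixels in a `std::vector` (the function name `slidingSums` is just for illustration). Only the first window is summed in full; every later window costs one add and one subtract.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sums of every `window`-wide span of `row`, one per starting position.
std::vector<std::uint32_t> slidingSums(const std::vector<std::uint8_t>& row,
                                       std::size_t window) {
    assert(window > 0 && row.size() >= window);
    std::vector<std::uint32_t> sums;
    sums.reserve(row.size() - window + 1);

    std::uint32_t running = 0;
    for (std::size_t i = 0; i < window; ++i) running += row[i];  // first window: full sum
    sums.push_back(running);

    for (std::size_t i = window; i < row.size(); ++i) {
        running += row[i];           // pixel entering on the right
        running -= row[i - window];  // pixel leaving on the left
        sums.push_back(running);
    }
    return sums;
}
```

For the row {1, 2, 3, 4, 5} and a window of 3 this returns {6, 9, 12}. The same trick should extend to 2D by sliding over per-column sums, and to a stride of 4 by adding and removing 4 pixels per step.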

Everything should be in integers, not floats, if at all possible. This allows you to use psadbw to add 8 pixels (16 with SSE2) in one go.
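For illustration, this is roughly how the psadbw trick looks with the SSE2 intrinsic `_mm_sad_epu8`: taking the sum of absolute differences against zero simply adds the bytes. The `HAVE_SSE2` macro and the function name `sum16` are made up for this sketch, and the scalar branch is only there so it also builds on non-x86 machines.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

// Sum of 16 consecutive 8-bit pixels, kept entirely in integers.
std::uint32_t sum16(const std::uint8_t* p) {
    assert(p != nullptr);
#ifdef HAVE_SSE2
    const __m128i pixels = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
    // psadbw against zero: |pixel - 0| summed per 8-byte half,
    // leaving one partial sum in each 64-bit lane.
    const __m128i sad = _mm_sad_epu8(pixels, _mm_setzero_si128());
    // Low-lane sum is in 16-bit word 0, high-lane sum in 16-bit word 4.
    return std::uint32_t(_mm_cvtsi128_si32(sad)) +
           std::uint32_t(_mm_extract_epi16(sad, 4));
#else
    std::uint32_t s = 0;  // scalar fallback for non-x86 builds
    for (int i = 0; i < 16; ++i) s += p[i];
    return s;
#endif
}
```

One row of a 16x16 window is exactly 16 bytes, so a window sum is 16 calls of this kind.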

Finally, maths tells us that variance = const1 * sum(data^2) - const2 * (sum(data))^2, so everything can be calculated in one loop rather than two. See…or_the_variance

Hopefully that helps a bit. By all means try CUDA, but use the “practical” variance formula, not the brute-force one :)
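As a sketch of the one-loop version: for the population variance over N pixels the constants are const1 = 1/N and const2 = 1/N^2, so N^2 * variance = N * sum(data^2) - (sum(data))^2, which can be evaluated exactly in integers. The function name and the choice to return 0 for an all-black window are assumptions of this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Contrast (sigma / mean) of n 8-bit pixels using one pass and integer sums.
double contrastOnePass(const std::uint8_t* pixels, int n) {
    assert(pixels != nullptr && n > 0);
    std::uint64_t sum = 0, sumSq = 0;
    for (int i = 0; i < n; ++i) {
        sum += pixels[i];
        sumSq += std::uint32_t(pixels[i]) * pixels[i];
    }
    if (sum == 0) return 0.0;  // all-black window: contrast taken as 0 (assumption)

    // n * sumSq - sum^2 equals n^2 * variance and is never negative,
    // so the subtraction is exact in unsigned integers.
    const std::uint64_t scaled = std::uint64_t(n) * sumSq - sum * sum;
    const double sigma = std::sqrt(double(scaled)) / n;
    const double mean = double(sum) / n;
    return sigma / mean;
}
```

Only the final square root and division need floating point; the two running sums are also exactly what a sliding window or psadbw-style accumulation produces.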

Right you are sysKin. I used the above just to show the simple logic, not the actual algorithm. I’m using most of the optimizations you point out.

It’s very encouraging to hear that CUDA can give me the boost I’m looking for. I’ll start learning as soon as the Vista toolkit is released. Anyone know how soon this will be?

E.D., could you please elaborate a little on this sentence? I’m unclear as to what you mean by “reduction”. Thanks so much.

Check the reduction example from the SDK ;) A reduction will do things like:

  • compute the sum of all values of an array
  • compute the min or max of all values of an array
  • etc.
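To give a feel for the pattern, here is a serial C++ sketch of a pairwise (tree) reduction. It is not the SDK sample, just the same idea in miniature, and `treeReduce` is a made-up name. On a GPU, each iteration of the inner loop would be handled by a different thread, so the whole array collapses in about log2(N) steps.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Pairwise (tree) reduction: each pass combines element i with element
// i + half, halving the active range until one value is left.
template <typename T, typename Op>
T treeReduce(std::vector<T> data, Op op) {
    assert(!data.empty());
    std::size_t active = data.size();
    while (active > 1) {
        const std::size_t half = (active + 1) / 2;  // round up so odd sizes work
        for (std::size_t i = 0; i + half < active; ++i)
            data[i] = op(data[i], data[i + half]);
        active = half;
    }
    return data[0];
}
```

Passing `std::plus<int>()` gives the sum, while a lambda wrapping `std::min` or `std::max` gives the minimum or maximum. The operator needs to be associative and commutative, since the elements are combined in a different order than a plain left-to-right loop.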

It’s an interesting project, especially as it seems like a nice, simple fit to CUDA. Be sure to keep us posted!

You should check if the overhead associated with copying frames to device memory, launching the kernel etc. won’t hurt the overall performance too much. Working with realtime data streams can be trickier with GPUs because they don’t have immediate access to peripherals like the CPU does.