100x Resampling->Sobel->Histogram over Angles: How do I get CUDA to do it fast?

Hello everybody.

For quite some time now I have been thinking about how best to start implementing my problem in CUDA, and the more I experiment, the more problems I run into.
I need to do the following:

1.) Convert the camera image (640x480, RGB) to grayscale
2.) 100 regions of interest in the grayscale image are resampled to images of 24x24 pixels (bilinear scaling; I use CUDA’s texture access to get “cheap” bilinear filtering on the fly).
3.) For all resampled images, Sobel X and Sobel Y are computed to get the angles of the edges in these regions of interest
4.) For subwindows of size 8x8, histograms with 9 bins are filled over the computed angles
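
For reference, steps 3 and 4 for a single 24x24 patch can be sketched on the CPU as below. This is only a rough baseline to check a kernel against, not the original poster's code. Several details are assumptions the post does not specify: edge pixels are handled by clamping, angles are folded to an unsigned 0..180 degree range, each pixel votes with its gradient magnitude, and the names (cellHistograms, pixelAt, the k* constants) are made up for illustration.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr int kPatch = 24;                     // resampled ROI size (from the post)
constexpr int kCell = 8;                       // histogram subwindow size (from the post)
constexpr int kBins = 9;                       // bins per histogram (from the post)
constexpr int kCellsPerSide = kPatch / kCell;  // 3 cells per row and per column

using Histogram = std::array<float, kBins>;

// Assumption: clamp-to-edge addressing at the patch border.
inline int clampIndex(int v) {
    return v < 0 ? 0 : (v >= kPatch ? kPatch - 1 : v);
}

inline float pixelAt(const std::vector<float>& img, int x, int y) {
    return img[static_cast<std::size_t>(clampIndex(y) * kPatch + clampIndex(x))];
}

// Returns 9 histograms (3x3 cells, row-major), each with 9 orientation bins.
std::vector<Histogram> cellHistograms(const std::vector<float>& patch) {
    assert(patch.size() == static_cast<std::size_t>(kPatch * kPatch));
    std::vector<Histogram> hists(kCellsPerSide * kCellsPerSide, Histogram{});
    const float kPi = 3.14159265f;

    for (int y = 0; y < kPatch; ++y) {
        for (int x = 0; x < kPatch; ++x) {
            // 3x3 Sobel X and Sobel Y.
            const float gx =
                (pixelAt(patch, x + 1, y - 1) + 2.0f * pixelAt(patch, x + 1, y) +
                 pixelAt(patch, x + 1, y + 1)) -
                (pixelAt(patch, x - 1, y - 1) + 2.0f * pixelAt(patch, x - 1, y) +
                 pixelAt(patch, x - 1, y + 1));
            const float gy =
                (pixelAt(patch, x - 1, y + 1) + 2.0f * pixelAt(patch, x, y + 1) +
                 pixelAt(patch, x + 1, y + 1)) -
                (pixelAt(patch, x - 1, y - 1) + 2.0f * pixelAt(patch, x, y - 1) +
                 pixelAt(patch, x + 1, y - 1));

            const float mag = std::sqrt(gx * gx + gy * gy);

            // Assumption: unsigned orientation, so fold (-pi, pi] into [0, pi].
            float angle = std::atan2(gy, gx);
            if (angle < 0.0f) angle += kPi;
            int bin = static_cast<int>(angle / kPi * kBins);
            if (bin < 0 || bin >= kBins) bin = 0;  // an angle of exactly pi wraps to bin 0

            // Assumption: magnitude-weighted vote, no interpolation between bins.
            hists[(y / kCell) * kCellsPerSide + (x / kCell)][bin] += mag;
        }
    }
    return hists;
}
```

On a GPU, one plausible mapping is one block per region of interest with one thread per pixel of the 24x24 patch, but whether that beats the CPU depends on the hardware, as the replies below discuss.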

And I need to do that for four (!) camera views each.

No matter how I approach this scheme, I end up slower than on the CPU. I would really appreciate any help on how to start. It seems the many memory accesses make things rather slow. Should I do the resizing first, then the Sobel, and split each step into threads, or do the resizing and Sobel in one pass with one thread per pixel? There are so many possibilities and my head is spinning :-/

How would you guys start and cope with it?

For what it’s worth, my company has been working on similar DSP/image processing problems - with very similar algorithms.

Quite simply, it’s not possible for CUDA to even touch the speeds of a fully optimized SSE/MMX convolution - from my experience.
The latency of CUDA kernels alone is enough time to run a 5x5 convolution on a 320x240 image… heh - don’t even get me started on the time it takes to allocate/free memory, or copy results back from the GPU.

The ‘fastest’ kernel I’ve ever been able to write takes ~500us for a 320x240 image (and that’s simply comparing two images, returning white for identical intensities, black otherwise), yet we can run ~4 convolutions on 640x480 images on the CPU in that same time.

The only reasons the company I work for is bothering with CUDA right now are a) less CPU demand, and b) CUDA/OpenCL will potentially scale better in the long term.

I am not a big fan of CUDA in image processing either, but it seems strange to get worse timings than on a CPU. Here are my rules of thumb on the image-processing-in-CUDA story:

  • you input the image in tiles (one tile per block). If your tiles must overlap (to cover edge effects, when you need more input pixels than output pixels), then pass the image into the kernel as a 2D texture. Otherwise, read it coalesced (make sure your tiles are a multiple of 16 pixels wide).

  • add as much processing as you can to a single kernel, to minimize kernel I/O. One quote from NVIDIA that I heard at NVISION 08 was that if you compare the throughput of the compute portion and the I/O portion of a kernel, the first is almost free. In my experience, registers are the first thing you run out of as you keep adding to your kernel. If your occupancy (use the CUDA occupancy calculator) becomes really low, it’s probably time to add another kernel. It’s perhaps best to keep it over 33% (my own psychological threshold) to give the hardware more opportunity to cover I/O with processing: when the current threads wait for data, other threads can run.

  • if you have more operators that require overlaps (cascaded convolutions, for example), it is best to separate them into different kernels because otherwise the overlaps will need to be compounded.

  • make sure you’re free of smem bank conflicts and all gmem access is coalesced. Use the visual profiler to watch for these things, which are by far the most important inefficiency factors.

  • try to limit the number of divergent threads (if-then-else statements that will push threads in the same warp in different directions).
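
To put rough numbers on the tiling and overlap rules above, here is a small host-side sketch. It assumes square tiles and square filters described only by their radius (1 for a 3x3 Sobel, 2 for a 5x5), and the function names are hypothetical helpers, not part of any CUDA API.

```cpp
#include <cassert>
#include <vector>

// Round a width up to the next multiple (16 by default), so that rows can be
// read in aligned groups of 16 pixels.
int paddedWidth(int width, int multiple = 16) {
    assert(width > 0 && multiple > 0);
    return ((width + multiple - 1) / multiple) * multiple;
}

// Side length of the input tile one block must load to produce an
// outTile x outTile result when the given filters are applied back to back
// inside a single kernel: the overlaps add up.
int fusedInputTileSide(int outTile, const std::vector<int>& radii) {
    assert(outTile > 0);
    int halo = 0;
    for (int r : radii) {
        assert(r >= 0);
        halo += r;
    }
    return outTile + 2 * halo;
}

// Pixels loaded per pixel produced for such a fused kernel.
double loadOverhead(int outTile, const std::vector<int>& radii) {
    const double side = fusedInputTileSide(outTile, radii);
    return (side * side) / (static_cast<double>(outTile) * outTile);
}
```

For a 16x16 output tile, a single 3x3 filter needs an 18x18 input tile, and two cascaded 3x3 filters in one kernel need 20x20, i.e. 400 loads for 256 outputs. That growth is the compounding the third rule refers to.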

I’m also a beginner; others may offer more substantial advice. Also, these rules of thumb may not apply to all cases. Just to get you started…

I’d be very interested to know if you’ve been able to beat IPP in terms of performance (execution time), for say - image convolutions w/ 3x3 kernels.

I’ve managed to do it in a few very select cases, but the overhead involved in allocating gpu memory and/or copying results back later when the final result is required by CPU calculations tends to kill any performance gains on those few kernels.

I haven’t tried and I’ll take your word for it. It is plausible that small operators are best kept on the CPU; what I am doing, though, employs large operators and numerous passes, and the improvement over the CPU is significant. With regard to mikey79s’ algo, it just seemed to me that it is intensive enough to benefit from a GPU.

More than anything that would depend on the video card he’s running imo…

I should note that when I said that it’s “not possible”, I should’ve added that I’m speaking in terms of the hardware I currently have to work with (Core2 E8400 (3.0 GHz) vs. Quadro FX 570, which is more or less equivalent to an 8600). While my video card has more theoretical potential, the fact that the images I’m working with all use integer pixel formats (in most cases 8bpp), and that I only have a Compute 1.1 card (thus very limited memory coalescing), is quite probably why I’m having such a hard time getting our CUDA implementation to catch up to our CPU implementation in terms of speed. And of course, IPP is written by people who know their processors inside out, while I certainly don’t know as much about CUDA/nVidia cards as their engineers would.

So: in theory it’s possible for low-end graphics cards to reach speeds similar to a current CPU for most DSP algorithms; in practice it’s quite probable you can get increased speeds (although not the kind of 100x speed increases you see being flaunted by the nVidia marketing crew, I don’t think) using later video cards with higher bandwidth (e.g. the 200 series).

OP: If you don’t mind the memory footprint and/or hardware requirements, you can quite probably get your resampling->sobel->histogram over angles working faster than the CPU implementation (by how much really does depend on your hardware, though).

If you were using mostly floating point arithmetic & pixel formats, you’d probably see some very decent performance improvements.

One other thing to note: don’t take nVidia’s SDK samples as a good indicator of potential speed. My DSP kernels generally run at double (sometimes quite a lot more) the speed of theirs, mostly because the samples are ‘samples’ (supposed to be easy to understand/read, not heavily optimized).

Transfer times to/from the GPU should be negligible for such small images; do you use page-locked memory? If you get such slow timings, most probably you haven’t written your code in the most efficient way for CUDA. It also depends on the kind of graphics card / processor, of course…

Memory allocation times shouldn’t even factor into the equation; it is best to allocate everything once and then keep reusing it. If you reallocate buffers every frame it’s bound to be very, very slow. Use your own memory management…
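
A minimal sketch of that allocate-once idea, written as plain C++ so it runs anywhere. The ReusableBuffer class is hypothetical, and the new[] call merely stands in for whatever device or page-locked allocation you actually use; the point is only that the allocation happens when the buffer has to grow, not once per frame.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hands out a scratch buffer and only reallocates when a request is larger
// than anything seen so far.
class ReusableBuffer {
public:
    // Returns a pointer to at least `bytes` bytes of storage.
    // Contents are not preserved across a growth.
    unsigned char* acquire(std::size_t bytes) {
        assert(bytes > 0);
        if (bytes > capacity_) {
            // Stand-in for the expensive allocation (device or pinned memory).
            data_.reset(new unsigned char[bytes]);
            capacity_ = bytes;
            ++allocations_;
        }
        return data_.get();
    }

    // Number of real allocations performed so far.
    std::size_t allocations() const { return allocations_; }

private:
    std::unique_ptr<unsigned char[]> data_;
    std::size_t capacity_ = 0;
    std::size_t allocations_ = 0;
};
```

Calling acquire(640 * 480 * 3) once per frame for a fixed-size camera image then costs a single allocation for the whole run, however many frames are processed.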

Edit: so you are comparing against a low/mid-end card. Yes, those really are a lot slower for CUDA than the high-end ones, especially when compared against a higher-end CPU…