Optimizing my kernel to escape the watchdog timer

Hi all, I’m trying to implement a bilateral filter kernel using the brute-force method and texture fetches. The attached code works fairly well for small/medium images (under 1000x1000 pixels), but I tested it with a very big picture (2000x1414 pixels) and the kernel was aborted after about 12 seconds by the watchdog timer on Windows XP. That test used a kernel radius of 32, i.e. a 65x65 filter window. Can you please give me some advice on speeding up my code? I thought about shared memory at first, but with such a large filter (I’m eventually aiming for a 129x129 window, i.e. a radius of 64) I think I’d end up exceeding the per-block shared memory limit. Any suggestion is welcome!

A.
cuda_bilateral.txt (3.02 KB)

The timeout problem is really easy to solve… add a small parameter that lets your kernel run over a tile of the image instead of the whole image, then launch the kernel many times. A tile size of 256x256, for example, still gives you enough blocks per launch for efficiency, and even your 2000x1414 image only needs about 50 launches. The overhead of kernel launches is really quite small.
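
Something along these lines; this is only a sketch, since I haven’t opened your attachment: bilateralKernel and its whole argument list are placeholders, with two extra parameters (tileX, tileY) that offset the output pixel coordinates.

// Placeholder signature for your kernel, plus two tile-origin parameters.
__global__ void bilateralKernel(float4 *out, int width, int height,
                                int radius, float sigmad, float sigmar,
                                int tileX, int tileY);

// Host side: one launch per 256x256 tile instead of one launch per image.
void runTiled(float4 *d_out, int width, int height,
              int radius, float sigmad, float sigmar)
{
    const int TILE = 256;
    dim3 block(16, 16);
    dim3 grid(TILE / block.x, TILE / block.y);   // 256 blocks per tile

    for (int ty = 0; ty < height; ty += TILE)
        for (int tx = 0; tx < width; tx += TILE)
            // inside the kernel, pixels whose (tileX + x, tileY + y)
            // fall outside the image are simply skipped
            bilateralKernel<<<grid, block>>>(d_out, width, height,
                                             radius, sigmad, sigmar, tx, ty);
}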

As with many problems, your likely bottleneck is memory, especially memory latency. There’s tons you can do.
I suspect right now you’re latency bound: you’re fetching data from the texture and immediately using it for only a small amount of compute.

You’re right that you can’t fit a whole tile into shared memory at once, but that doesn’t mean you can’t fit a small tile and “roll” it.
A warp might load 64 RGB values into a “minirow” and start “panning” the thread queries over it. Once you reach the end of the first 32 values, you load the next 32 values in (overwriting the now used-up 32 values at the front) and continue. This is also where you can prefetch the next values, and you’ll probably get much closer to compute limited.
You COULD extend this to 2D as well and coordinate the threads, but the simple horizontal-only per-warp preloading will likely give you the best bang for the buck for now.
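
Here’s a rough sketch of the rolling pattern, boiled down to a single-channel horizontal box average so the buffer bookkeeping is easy to see. Every name here is made up, and it assumes the old texture-reference API with clamp addressing and one 32-thread warp per block, launched as <<<dim3((width+31)/32, height), 32>>>. The bilateral weights would go where the sum is accumulated.

texture<float, 2, cudaReadModeElementType> tex;   // bound to the input image

__global__ void rollingRowAverage(float *out, int width, int height, int R)
{
    __shared__ float buf[64];             // circular 64-sample "minirow"

    const int lane = threadIdx.x;         // 0..31, one warp per block
    const int y    = blockIdx.y;
    const int x0   = blockIdx.x * 32;     // first output pixel of this warp

    // Preload samples 0..63 of the input span [x0 - R, x0 + 31 + R]
    buf[lane]      = tex2D(tex, x0 - R + lane + 0.5f,      y + 0.5f);
    buf[lane + 32] = tex2D(tex, x0 - R + lane + 32 + 0.5f, y + 0.5f);
    __syncthreads();

    float sum = 0.0f;
    for (int k = 0; k <= 2 * R; ++k) {
        // After step 31 the first 32 samples are dead, so overwrite that
        // half of the buffer with the next 32 samples (and again at every
        // later multiple of 32).  k is uniform, so the syncs are safe.
        if (k > 0 && (k & 31) == 0) {
            __syncthreads();              // everyone is done with the old half
            buf[(k + 32 + lane) & 63] =
                tex2D(tex, x0 - R + k + 32 + lane + 0.5f, y + 0.5f);
            __syncthreads();
        }
        // Lane `lane` consumes sample (lane + k) of the span at this step;
        // this is where the spatial/range weights would be applied.
        sum += buf[(lane + k) & 63];
    }

    const int x = x0 + lane;
    if (x < width)
        out[y * width + x] = sum / (2 * R + 1);   // plain box average
}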

You might think “why bother loading it, it will be in texture cache!” but that’s no guarantee, and you’re smarter than a cache. And the cache is merely a way to lower BANDWIDTH… I suspect you are LATENCY bound in many cases.

You could also boost your kernel compute speed enormously by making a filter response lookup table (in shared memory). Normally GPUs don’t like lookup tables, but in this case the table would only be 256 long (*3 for RGB) and save you about 30 fp ops, and also three divides and three expf()s. Worth trying. You’ll get bank conflicts when reading the table, but experiments will show if this is better than all that math or not.
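
A sketch of what the table setup could look like, assuming 8-bit channels and made-up names (if you only have a single range sigma, one 256-entry table looked up three times is enough):

__global__ void bilateralWithLUT(/* ...image arguments as before... */ float3 sigma)
{
    // Entry [c][d] holds expf(-(d*d) / (2*sigma_c*sigma_c)) for an
    // intensity difference d of 0..255 in channel c.
    __shared__ float rangeLUT[3][256];

    const int tid      = threadIdx.y * blockDim.x + threadIdx.x;
    const int nThreads = blockDim.x * blockDim.y;

    // Fill the 768 entries cooperatively (3 writes per thread for a 16x16 block)
    for (int i = tid; i < 3 * 256; i += nThreads) {
        int   c = i / 256;
        float d = (float)(i & 255);
        float s = (c == 0) ? sigma.x : (c == 1) ? sigma.y : sigma.z;
        rangeLUT[c][i & 255] = expf(-(d * d) / (2.0f * s * s));
    }
    __syncthreads();

    // ...brute-force loops as before, but the range weight for channel c
    //    becomes rangeLUT[c][abs(centerByte[c] - neighborByte[c])],
    //    replacing the per-tap divide and expf()...
}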

Anyway, there’s lots and lots left to do in your kernel… it will take some experimentation and reordering.

BTW you have a bug:

/ (2 * sigmar * sigmar));

is repeated 3 times. The last two should use sigmag and sigmab instead of sigmar.

I believe the r in sigmar is for radius, not red :)

Also, this is minor, but I’m wondering why you’re using i and not ix in the texture fetches. You’ll save a cast, since you’re adding a float and an int in the texture fetches.

Otherwise, what SPWorley said! If you experiment with the lookup table, let us know the results; I’d be interested.

How big is d_kernel? Could it fit in shared memory? Or constant?

–I’m blind, the kernel size is in the original post. You could try constant memory; I doubt it’ll be great since all the threads won’t be accessing the same element, but it doesn’t cost much to try! Maybe the constant cache will help you.
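
Something like this, assuming d_kernel is the precomputed 2-D table of spatial weights (names guessed). A 65x65 float table is only ~17 KB, so it fits the 64 KB constant bank easily; a 129x129 table (~65 KB) just overflows it, unless you store the kernel separably as a single row.

#define KERNEL_RADIUS 32
#define KERNEL_W      (2 * KERNEL_RADIUS + 1)

__constant__ float c_kernel[KERNEL_W * KERNEL_W];

// Host side: upload the precomputed weights once, before the first launch.
void uploadSpatialKernel(const float *h_kernel)
{
    cudaMemcpyToSymbol(c_kernel, h_kernel,
                       KERNEL_W * KERNEL_W * sizeof(float));
}

// Device side: read c_kernel[i] inside the filter loop instead of d_kernel[i].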

If you have no other way to solve the problem, I think your program could run under Linux without the timeout occurring (as long as that GPU isn’t driving a display).

No CUDA program is ever bound by latency unless it’s not running enough threads.

Anyway, everything else SPWorley said is correct. Break up the problem to get past the timeout limit, and use smem by actively feeding in data as it’s used. You definitely should use smem, because the same data gets reused by many threads.