Median Filter

What would be the best way to go about writing a median filter in CUDA? For simplicity, just assume it’s a 3x3 filter.

I’m fairly new to CUDA, and was wondering if putting a tile into shared memory from global memory and processing it is better/worse than going through the texture API to directly read pixels when required?

Shared memory has much lower latency than texture fetches, so if you are going to reuse the data (as a median filter would) then staging into shared memory makes sense. Whether you use texture or global loads to read the data depends on the application.

3x3 median reuses the input data quite a bit, so shared memory is useful.

Thanks for your response. I’ve got two questions:

  1. When is it better to use global reads as opposed to texture reads? It seems texture reads are cached and locality based. Is there any overhead for this? Does it make sense to always use a texture for reading?

  2. What is the best way to load a 1-pixel apron around a 16x16 thread block into shared memory? Would it make sense to have every thread load its own pixel of the 16x16 block and have some threads load a second pixel to cover the full 18x18 tile?
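The loading scheme described in question 2 could be sketched roughly as follows, assuming an 8-bit single-channel image stored row-major in global memory and a 16x16 thread block. The names `median3x3`, `median9`, `sort2`, `TILE` and `SMEM` are made up for illustration, and clamping at the image border is just one possible choice. The 256 threads walk the 324 tile entries with a stride of 256, so every thread loads one pixel and the first 68 load a second. The median uses a fixed sequence of 19 compare-and-swap steps, a common trick for 3x3 windows rather than anything prescribed in this thread.

```cuda
#include <cuda_runtime.h>

#define TILE 16            // 16x16 threads per block, as in the question
#define SMEM (TILE + 2)    // 18x18 tile: the block plus a 1-pixel apron

// Order a pair so that a <= b afterwards.
__host__ __device__ inline void sort2(unsigned char &a, unsigned char &b)
{
    if (a > b) { unsigned char t = a; a = b; b = t; }
}

// Median of 9 values with a fixed exchange network (p is modified).
__host__ __device__ inline unsigned char median9(unsigned char *p)
{
    // Sort each group of three: p[0..2], p[3..5], p[6..8].
    sort2(p[1], p[2]); sort2(p[4], p[5]); sort2(p[7], p[8]);
    sort2(p[0], p[1]); sort2(p[3], p[4]); sort2(p[6], p[7]);
    sort2(p[1], p[2]); sort2(p[4], p[5]); sort2(p[7], p[8]);
    // p[6] becomes the largest of the three minima.
    sort2(p[0], p[3]); sort2(p[3], p[6]);
    // p[2] becomes the smallest of the three maxima.
    sort2(p[5], p[8]); sort2(p[2], p[5]);
    // p[4] becomes the median of the three middle values.
    sort2(p[4], p[7]); sort2(p[1], p[4]); sort2(p[4], p[7]);
    // The overall median is the median of p[2], p[4] and p[6].
    sort2(p[4], p[2]); sort2(p[6], p[4]); sort2(p[4], p[2]);
    return p[4];
}

// Assumes blockDim is exactly (TILE, TILE).
__global__ void median3x3(const unsigned char *in, unsigned char *out,
                          int width, int height)
{
    __shared__ unsigned char tile[SMEM][SMEM];

    const int tid = threadIdx.y * TILE + threadIdx.x;   // 0..255
    const int x0 = blockIdx.x * TILE - 1;   // image coordinates of tile[0][0]
    const int y0 = blockIdx.y * TILE - 1;

    // Strided load of all 324 tile pixels by 256 threads.
    for (int i = tid; i < SMEM * SMEM; i += TILE * TILE) {
        const int tx = i % SMEM;
        const int ty = i / SMEM;
        const int gx = min(max(x0 + tx, 0), width - 1);    // clamp at borders
        const int gy = min(max(y0 + ty, 0), height - 1);
        tile[ty][tx] = in[gy * width + gx];
    }
    __syncthreads();

    const int x = blockIdx.x * TILE + threadIdx.x;
    const int y = blockIdx.y * TILE + threadIdx.y;
    if (x >= width || y >= height) return;

    // Gather the 3x3 neighbourhood from shared memory and take its median.
    unsigned char v[9];
    for (int dy = 0; dy < 3; ++dy)
        for (int dx = 0; dx < 3; ++dx)
            v[dy * 3 + dx] = tile[threadIdx.y + dy][threadIdx.x + dx];
    out[y * width + x] = median9(v);
}
```

It would be launched with `dim3 block(16, 16)` and `dim3 grid((width + 15) / 16, (height + 15) / 16)`. Whether this beats reading through a texture is something to measure rather than assume.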

Global reads are better when you can predict the access pattern and keep the loads coalesced. Texture reads are the way to go when you sample your data ‘randomly’. But I would advise testing both approaches and benchmarking the results. Especially with 2D locality, you might even win with a texture.

I would even try the following first:

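Skip the copy to shared memory. The texture cache should fit an 18x18 tile of int4s (about 5 KB against a cache of roughly 8 KB). A texture lookup that hits the cache saves memory bandwidth, but as far as I know it does not bring the latency down to shared-memory levels, so measure before relying on it.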
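A rough sketch of the texture-only variant, for comparison in a benchmark. It assumes the texture object API (`cudaTextureObject_t`), which is newer than the texture references available when this thread was written, and it assumes `tex` was created over 8-bit pixels with `cudaReadModeElementType`, `cudaFilterModePoint`, `cudaAddressModeClamp` and unnormalized coordinates. The names `median3x3_tex` and `median9_sorted` are hypothetical, and the insertion sort is chosen for brevity, not speed.

```cuda
#include <cuda_runtime.h>

// Median of 9 values by insertion sort (v is modified); simple, not fast.
__host__ __device__ inline unsigned char median9_sorted(unsigned char *v)
{
    for (int i = 1; i < 9; ++i) {
        const unsigned char key = v[i];
        int j = i - 1;
        while (j >= 0 && v[j] > key) { v[j + 1] = v[j]; --j; }
        v[j + 1] = key;
    }
    return v[4];   // middle element of the sorted window
}

// No shared memory: each thread fetches its nine neighbours through the
// texture path and lets the texture cache handle the reuse.
__global__ void median3x3_tex(cudaTextureObject_t tex, unsigned char *out,
                              int width, int height)
{
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    unsigned char v[9];
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            // +0.5f addresses the texel centre; clamp mode covers the borders.
            v[(dy + 1) * 3 + (dx + 1)] =
                tex2D<unsigned char>(tex, x + dx + 0.5f, y + dy + 0.5f);

    out[y * width + x] = median9_sorted(v);
}
```

The kernel body is shorter than a shared-memory version because the clamp address mode removes the explicit border handling, which is part of the appeal of trying this first.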

I spent some time looking at median filters in CUDA.

Shared memory helps, but the tricky part is finding the median efficiently. For large radius filters you are better off building a histogram in local memory and scanning through it to find the median (or doing a binary search), rather than explicitly sorting the values.
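The histogram idea can be sketched like this for 8-bit pixels, assuming a row-major single-channel image and clamped borders. The names `histogramMedian`, `medianLargeRadius` and `clampi` are invented for the example. Each thread builds a 256-bin histogram of its window in a per-thread array (which the compiler will typically place in local memory) and scans it until the running count reaches the median rank. This is the plain linear scan, not the binary-search variant, and it rebuilds the histogram from scratch for every pixel.

```cuda
#include <cuda_runtime.h>

__host__ __device__ inline int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

// Median of the (2r+1)x(2r+1) window centred on (x, y), borders clamped.
// unsigned short bins are enough for windows of up to 65535 pixels.
__host__ __device__ inline unsigned char
histogramMedian(const unsigned char *img, int width, int height,
                int x, int y, int r)
{
    unsigned short hist[256];
    for (int i = 0; i < 256; ++i) hist[i] = 0;

    // Count every pixel value in the window.
    for (int dy = -r; dy <= r; ++dy) {
        const int yy = clampi(y + dy, 0, height - 1);
        for (int dx = -r; dx <= r; ++dx) {
            const int xx = clampi(x + dx, 0, width - 1);
            ++hist[img[yy * width + xx]];
        }
    }

    // The window size is odd, so the median has 1-based rank n/2 + 1.
    const int n = (2 * r + 1) * (2 * r + 1);
    const int target = n / 2 + 1;
    int count = 0;
    for (int v = 0; v < 256; ++v) {
        count += hist[v];
        if (count >= target) return (unsigned char)v;
    }
    return 255;   // not reached: the counts always sum to n
}

__global__ void medianLargeRadius(const unsigned char *in, unsigned char *out,
                                  int width, int height, int r)
{
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    out[y * width + x] = histogramMedian(in, width, height, x, y, r);
}
```

The scan cost is fixed at 256 steps regardless of radius, which is why this approach starts to pay off only once the window is large enough that sorting it explicitly would cost more.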

You get bonus points if you can implement this algorithm in CUDA: