Median Filter on a large image

I own a QuadroFX 4600 GPU.
I intend to write a median filter program in CUDA C for a large (12k x 12k) image using a 3x3 window. My intent is to compare its performance with a multithreaded program running on the CPU. I have already completed the CPU part and am now working on the CUDA version. The image is a 12k x 12k raw binary image that uses short int (2 bytes) to represent gray levels.

In my program I keep a buffer of three lines. After reading three lines from the input image, I apply the median filter to that buffer and then write the result to the output image. In this way I scan the complete image, and the filter works correctly, but it takes a long time on the CPU (approx. 22 seconds with a single thread).
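For reference, the three-line scheme described above can be sketched in plain C roughly as follows. This is only an illustration, not the poster's actual program: the function names are made up, file reading and writing are left out, and copying the first and last pixel of each row unchanged is an assumed border policy.

```c
#include <stddef.h>
#include <string.h>

/* Median of nine values via insertion sort (simple, not tuned). */
static short median9_sorted(const short *v)
{
    short s[9];
    memcpy(s, v, sizeof s);
    for (int i = 1; i < 9; ++i) {
        short key = s[i];
        int j = i - 1;
        while (j >= 0 && s[j] > key) { s[j + 1] = s[j]; --j; }
        s[j + 1] = key;
    }
    return s[4]; /* middle element of the sorted window */
}

/* Produce one output row from three consecutive input rows.
   Assumption: the first and last pixel of the row are copied unchanged. */
static void median_row_3x3(const short *above, const short *cur,
                           const short *below, short *out, size_t width)
{
    if (width == 0) return;
    out[0] = cur[0];
    out[width - 1] = cur[width - 1];
    for (size_t x = 1; x + 1 < width; ++x) {
        short w[9] = {
            above[x - 1], above[x], above[x + 1],
            cur[x - 1],   cur[x],   cur[x + 1],
            below[x - 1], below[x], below[x + 1]
        };
        out[x] = median9_sorted(w);
    }
}
```

The caller would rotate the three row pointers after each output row, so only one new line has to be read per step.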

I want to know whether it is a good idea to keep the whole image (288 MB) in device memory and then operate on it. If not, should I implement the whole median function as a kernel, or just the sorting part of it? And which sort should I use for 9 elements?

Also, let me know if there is already any discussion, code, or explanation of this on this forum or elsewhere.

Thanks in advance.

Keep it all in device memory, use shared memory, and work in blocks rather than strips. If transfer times are a concern, you might try zero copy. Do the whole operation on the GPU, not just the sort.

Thanks for the reply :)

Can you tell me which sort would be best for 9 elements? (Bitonic sort requires m = 2^r elements.) I am using a simple bubble sort as of now.
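For exactly nine elements, one option that avoids a general-purpose sort is a fixed compare-exchange network that only computes the median. The sketch below uses a widely circulated 19-exchange arrangement; the names `CSWAP` and `median9_network` are made up for illustration, and it has not been benchmarked on this GPU. Because the sequence of comparisons does not depend on the data, a function of this shape is also a plausible candidate for a CUDA device function.

```c
/* Compare-exchange: afterwards a <= b. */
#define CSWAP(a, b) \
    do { if ((a) > (b)) { short t_ = (a); (a) = (b); (b) = t_; } } while (0)

/* Median of nine values using a fixed network of 19 compare-exchanges.
   The array p is used as scratch space and gets reordered. */
static short median9_network(short p[9])
{
    /* Sort each group of three: (0,1,2), (3,4,5), (6,7,8). */
    CSWAP(p[1], p[2]); CSWAP(p[4], p[5]); CSWAP(p[7], p[8]);
    CSWAP(p[0], p[1]); CSWAP(p[3], p[4]); CSWAP(p[6], p[7]);
    CSWAP(p[1], p[2]); CSWAP(p[4], p[5]); CSWAP(p[7], p[8]);
    /* p[6] becomes the largest of the three group minima. */
    CSWAP(p[0], p[3]); CSWAP(p[5], p[8]); CSWAP(p[4], p[7]);
    CSWAP(p[3], p[6]);
    /* p[4] becomes the median of the group middles,
       p[2] the smallest of the three group maxima. */
    CSWAP(p[1], p[4]); CSWAP(p[2], p[5]); CSWAP(p[4], p[7]);
    /* Median of p[2], p[4], p[6] is the median of all nine. */
    CSWAP(p[4], p[2]); CSWAP(p[6], p[4]); CSWAP(p[4], p[2]);
    return p[4];
}
```

Whether this beats bubble sort in practice would still have to be measured, but it replaces the nested loops with straight-line code.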

Also, if my image size increases in the future (and it will!), the program will fail to allocate enough memory on my device for the image, right?

My device memory is 784 MB, so what I am looking for needs to be a scalable solution.

About the sort, I don't know; the best thing would be to test it out. For a median filter the scaling part is really easy: you just divide the image into blocks and work on one at a time, or use zero copy. With the block approach there is very little overlapping data to transfer, and you can easily move data and do computation at the same time, so even the transfer overhead is mitigated.
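To make the block idea concrete, here is a small C sketch of the bookkeeping along one image axis: which output range a block produces and which input range, including the overlap, has to be uploaded for it. The names `SpanPlan` and `plan_span` are hypothetical, and the block size is an assumption you would pick from the available device memory. Applying the same function to rows and to columns gives 2D blocks.

```c
#include <stddef.h>

/* One block along a single axis (rows or columns). */
typedef struct {
    size_t out_first, out_count; /* output range this block produces */
    size_t in_first,  in_count;  /* input range to upload, halo included */
} SpanPlan;

/* Plan block number `index` along an axis of length `extent`, with `tile`
   outputs per block and a filter radius `halo` (1 for a 3x3 window).
   Returns 0 once `index` is past the last block, 1 otherwise. */
static int plan_span(size_t extent, size_t tile, size_t halo,
                     size_t index, SpanPlan *s)
{
    if (tile == 0) return 0;
    size_t first = index * tile;
    if (first >= extent) return 0;

    size_t count = extent - first < tile ? extent - first : tile;
    /* Extend by the halo on both sides, clamped to the image. */
    size_t in_first = first >= halo ? first - halo : 0;
    size_t in_end = first + count + halo; /* exclusive */
    if (in_end > extent) in_end = extent;

    s->out_first = first;
    s->out_count = count;
    s->in_first = in_first;
    s->in_count = in_end - in_first;
    return 1;
}
```

With a 3x3 window each interior block needs only two extra rows or columns, so the redundant transfer stays small compared with the block itself, and the same loop keeps working when the image no longer fits in device memory.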

good luck :)

Thanks. I will return with the solution :)

You can implement the median filter by using a histogram or binary search (no need to actually sort per pixel). There’s an example in our OpenCL SDK which you could port to CUDA quite easily.
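As a rough illustration of the binary-search idea (this is not the SDK sample, just a sketch with a made-up function name), the median of a 3x3 window of 16-bit values is the smallest value v for which at least five of the nine pixels are less than or equal to v. Searching the value range for that v takes at most 16 counting passes and never reorders the window:

```c
#include <limits.h>

/* Median of nine 16-bit values without sorting: binary search over the
   value range for the smallest v with at least five inputs <= v. */
static short median9_bsearch(const short w[9])
{
    int lo = SHRT_MIN, hi = SHRT_MAX;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        int count = 0;
        for (int i = 0; i < 9; ++i)
            count += (w[i] <= mid); /* how many pixels are <= mid */
        if (count >= 5)
            hi = mid;      /* median is mid or smaller */
        else
            lo = mid + 1;  /* median is larger than mid */
    }
    return (short)lo;
}
```

If the gray levels are known to occupy fewer bits, narrowing the initial `lo` and `hi` bounds reduces the number of passes accordingly.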

I thought I saw one somewhere :D