I own a Quadro FX 4600 GPU.
I intend to write a median filter program in CUDA C for a large (12k x 12k) image using a 3x3 window. My goal is to compare its performance with a multithreaded program running on the CPU. I have already completed the CPU part and am now working on the CUDA version. The image is a 12k x 12k raw binary image that uses short int (2 bytes) to represent gray levels.
In my program I keep a buffer of three lines. After reading three lines from the input image, I apply the median filter to that buffer and write the result to the output image. I scan the complete image this way, and the filter works correctly, but it takes a long time on the CPU (approx. 22 seconds with a single thread).
Is it a good idea to keep the whole image (288 MB) in device memory and operate on it there? If not, should I implement the whole median function as a kernel, or just the sorting part of it? And which sort should I use for 9 elements?
Also, please let me know if there is already any discussion, code, or explanation of this on this forum or elsewhere.
About the sort I don't know; the best thing would be to test it out. For a median filter the rest is really easy: divide the image into blocks and work on one block at a time, or use zero copy. With the block approach there is very little overlapping data to transfer, and you can move data and do computation at the same time, so even the transfer overhead is mitigated.
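To make the two ideas above a bit more concrete, here is a rough sketch, not a tuned implementation. It is written as plain C so it can be checked on the CPU first; the `HOST_DEVICE` macro is my own convention and expands to `__host__ __device__` under nvcc, so `median9` could be called from a kernel with one thread per pixel. All names (`median9`, `median3x3_row`, `Strip`, `strip_count`, `make_strip`, `strip_input_bytes`) are hypothetical. For the 9-element sort, one option worth testing is a fixed sequence of 19 compare-exchanges instead of a general sort. For the blocks, I assume horizontal strips with a one-row halo above and below, and I assume border pixels are simply copied. Before relying on overlapped transfers (`cudaMemcpyAsync` on a stream) or zero copy, check `deviceOverlap` and `canMapHostMemory` in `cudaDeviceProp`, since older cards may not support either.

```c
#include <assert.h>
#include <stddef.h>

/* Lets the same source build with a plain C compiler or with nvcc. */
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

/* Compare-exchange: afterwards a <= b. */
#define CSWAP(a, b) \
    do { if ((a) > (b)) { short t_ = (a); (a) = (b); (b) = t_; } } while (0)

/* Median of 9 values using 19 fixed compare-exchanges; reorders p in place. */
HOST_DEVICE short median9(short p[9])
{
    /* Sort each triple (0,1,2), (3,4,5), (6,7,8). */
    CSWAP(p[1], p[2]); CSWAP(p[4], p[5]); CSWAP(p[7], p[8]);
    CSWAP(p[0], p[1]); CSWAP(p[3], p[4]); CSWAP(p[6], p[7]);
    CSWAP(p[1], p[2]); CSWAP(p[4], p[5]); CSWAP(p[7], p[8]);
    /* p[6] = largest of the three minima, p[2] = smallest of the three maxima. */
    CSWAP(p[0], p[3]); CSWAP(p[3], p[6]);
    CSWAP(p[5], p[8]); CSWAP(p[2], p[5]);
    /* p[4] = median of the three middle values. */
    CSWAP(p[4], p[7]); CSWAP(p[1], p[4]); CSWAP(p[4], p[7]);
    /* The median of p[2], p[4], p[6] ends up in p[4]. */
    CSWAP(p[4], p[2]); CSWAP(p[6], p[4]); CSWAP(p[4], p[2]);
    return p[4];
}

/* CPU reference: filters one output row from a three-line buffer.
   Border pixels are copied unchanged (an assumption of this sketch). */
void median3x3_row(const short *above, const short *cur, const short *below,
                   short *out, size_t width)
{
    size_t x, i;
    assert(above && cur && below && out);
    for (x = 0; x < width; ++x) {
        short w[9];
        if (x == 0 || x + 1 == width) { out[x] = cur[x]; continue; }
        for (i = 0; i < 3; ++i) {
            w[i]     = above[x - 1 + i];
            w[3 + i] = cur[x - 1 + i];
            w[6 + i] = below[x - 1 + i];
        }
        out[x] = median9(w);
    }
}

/* One horizontal strip of the image (hypothetical helper type). */
typedef struct {
    size_t out_begin, out_end; /* rows this strip produces: [begin, end) */
    size_t in_begin, in_end;   /* rows to upload, including the halo     */
} Strip;

/* Number of strips needed to cover `height` rows. */
size_t strip_count(size_t height, size_t rows_per_strip)
{
    assert(rows_per_strip > 0);
    return (height + rows_per_strip - 1) / rows_per_strip;
}

/* Row ranges for strip `index`; the halo is clamped at the image edges. */
Strip make_strip(size_t index, size_t height, size_t rows_per_strip)
{
    Strip s;
    assert(index < strip_count(height, rows_per_strip));
    s.out_begin = index * rows_per_strip;
    s.out_end = s.out_begin + rows_per_strip;
    if (s.out_end > height) s.out_end = height;
    s.in_begin = (s.out_begin > 0) ? s.out_begin - 1 : 0;
    s.in_end = (s.out_end < height) ? s.out_end + 1 : height;
    return s;
}

/* Bytes to transfer for one strip of a `width`-pixel-wide short image. */
size_t strip_input_bytes(Strip s, size_t width)
{
    return (s.in_end - s.in_begin) * width * sizeof(short);
}
```

With 1000-row strips on the 12000 x 12000 image, an interior strip uploads 1002 rows (about 24 MB with 2-byte pixels), so the halo adds roughly 0.2% to the transfer. That is the "very little overlap" part; the strip size itself is something to tune by measurement.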