Your basic algorithm is sound, but this isn’t how we’d recommend doing image processing in CUDA.
The main problem is that you're not using shared memory at all. Reading from global memory is slow: hundreds of clock cycles of latency, versus one or two cycles for shared memory.
The method I would recommend is to load a tile of the image into shared memory, do your filtering operations on it there, and then write the results back out to global memory.
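To make that concrete, here's a minimal sketch of the tile approach for a 5x5 convolution. All the names (`conv5x5`, `TILE_W`, `d_kernel`, etc.) are made up for illustration, and border handling is done by simple clamping - you'd tune the tile size and apron loading for your own hardware:

```cuda
#define TILE_W 16
#define TILE_H 16
#define RADIUS 2                     // 5x5 filter -> radius of 2

__constant__ float d_kernel[5 * 5];  // filter coefficients, set by the host

__global__ void conv5x5(const float *in, float *out, int width, int height)
{
    // Tile plus an apron of RADIUS pixels on each side
    __shared__ float tile[TILE_H + 2 * RADIUS][TILE_W + 2 * RADIUS];

    int x = blockIdx.x * TILE_W + threadIdx.x;
    int y = blockIdx.y * TILE_H + threadIdx.y;

    // Cooperatively load the tile + apron, clamping at the image border
    for (int dy = threadIdx.y; dy < TILE_H + 2 * RADIUS; dy += TILE_H)
        for (int dx = threadIdx.x; dx < TILE_W + 2 * RADIUS; dx += TILE_W) {
            int gx = min(max(blockIdx.x * TILE_W + dx - RADIUS, 0), width  - 1);
            int gy = min(max(blockIdx.y * TILE_H + dy - RADIUS, 0), height - 1);
            tile[dy][dx] = in[gy * width + gx];
        }
    __syncthreads();

    // Filter entirely out of shared memory, one global write per pixel
    if (x < width && y < height) {
        float sum = 0.0f;
        for (int ky = -RADIUS; ky <= RADIUS; ++ky)
            for (int kx = -RADIUS; kx <= RADIUS; ++kx)
                sum += tile[threadIdx.y + RADIUS + ky][threadIdx.x + RADIUS + kx]
                     * d_kernel[(ky + RADIUS) * 5 + (kx + RADIUS)];
        out[y * width + x] = sum;
    }
}
```

The point is that each input pixel is fetched from global memory roughly once (plus the apron overlap) instead of 25 times.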
Alternatively, you could try using texture fetches - this will also remove many of the addressing calculations you’re performing.
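For comparison, here's what the texture version of the same 5x5 filter might look like - again a sketch, with made-up names, using a 2D texture reference bound by the host. The texture hardware does the 2D addressing and out-of-range clamping for you:

```cuda
texture<float, 2, cudaReadModeElementType> texImage;  // bound to the image by host code

__constant__ float d_kernel[5 * 5];  // filter coefficients, set by the host

__global__ void conv5x5_tex(float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int ky = -2; ky <= 2; ++ky)
        for (int kx = -2; kx <= 2; ++kx)
            // tex2D handles border addressing; +0.5f samples pixel centers
            sum += tex2D(texImage, x + kx + 0.5f, y + ky + 0.5f)
                 * d_kernel[(ky + 2) * 5 + (kx + 2)];
    out[y * width + x] = sum;
}
```

You trade explicit control over data reuse for the texture cache and free boundary handling.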
From what I understand there are basically two ways to do image processing in CUDA:
I) tiles in shared-mem
II) texture fetches
For large images (from 640x480 up to 1280x1024), which is the right way to go - I or II - when implementing CUDA code, e.g. for convolution or a Hough transform?
I believe Mark Harris said (somewhere) that nvidia’s shared mem implementation can do >2 GPixel/s for 5x5 convolutions. (I find this number somewhat curious as BLAS3 is supposed to get up to 100 GFlops/s and CUFFT about 50 GFlops/s; does anyone have performance numbers for image processing on 8800GTX? Maybe ‘ymxie’ for his example?)
Another question, the SDK release notes say that a 2D image convolution example (5x5 convolution) is included; which source file is this?
Would it be possible - Mark Harris or other nvidia people? - to make available some more code examples for image processing?
Why do you find it curious? 5x5 = 25 fmadds per pixel, so at 2 GPixel/s that's 50 G fmadds/s, i.e. 50 or 100 GFlops/s depending on how you count an fmad. On those terms that number seems a reasonable comparison to BLAS, unless you're worried that memory bandwidth wouldn't be able to keep up?