CUDA Image Processing Demo & Source Code & Tutorials

CUDA Image Processing Demo & Tutorials :)

Source code: … :)

If you have any comments (I hope you do!), please share… :hug:

Thanks for posting the source code.

Your basic algorithm is sound, but this isn’t how we’d recommend doing image processing in CUDA.

The main problem is that you’re not using shared memory at all. Reading from global memory is relatively slow - hundreds of clock cycles, compared to 1 or 2 for reading from shared memory.

The method I would recommend is to load a tile of the image into shared memory, do your filtering operations on it there, and then write the results back out to global memory.
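As a rough sketch of what that might look like (this is my own illustrative 5x5 convolution, assuming 16x16 thread blocks and a filter in constant memory, not the poster's actual code):

```cuda
#include <cuda_runtime.h>

#define TILE_W 16
#define TILE_H 16
#define RADIUS 2   // 5x5 filter => radius of 2

// Hypothetical filter coefficients, uploaded via cudaMemcpyToSymbol.
__constant__ float d_filter[(2 * RADIUS + 1) * (2 * RADIUS + 1)];

// Launch with dim3(TILE_W, TILE_H) thread blocks.
__global__ void conv5x5_shared(const float *in, float *out, int w, int h)
{
    // Shared tile with an apron of RADIUS pixels on each side.
    __shared__ float tile[TILE_H + 2 * RADIUS][TILE_W + 2 * RADIUS];

    int gx = blockIdx.x * TILE_W + threadIdx.x;   // global pixel coords
    int gy = blockIdx.y * TILE_H + threadIdx.y;

    // Cooperatively load the tile (including apron) from global memory,
    // clamping coordinates at the image borders.
    for (int dy = threadIdx.y; dy < TILE_H + 2 * RADIUS; dy += TILE_H)
        for (int dx = threadIdx.x; dx < TILE_W + 2 * RADIUS; dx += TILE_W) {
            int sx = min(max((int)(blockIdx.x * TILE_W) + dx - RADIUS, 0), w - 1);
            int sy = min(max((int)(blockIdx.y * TILE_H) + dy - RADIUS, 0), h - 1);
            tile[dy][dx] = in[sy * w + sx];
        }
    __syncthreads();   // tile fully loaded before anyone reads it

    if (gx >= w || gy >= h) return;

    // All 25 reads now come from fast shared memory, not global memory.
    float sum = 0.0f;
    for (int ky = 0; ky < 2 * RADIUS + 1; ++ky)
        for (int kx = 0; kx < 2 * RADIUS + 1; ++kx)
            sum += d_filter[ky * (2 * RADIUS + 1) + kx] *
                   tile[threadIdx.y + ky][threadIdx.x + kx];
    out[gy * w + gx] = sum;
}
```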

Alternatively, you could try using texture fetches - this will also remove many of the addressing calculations you’re performing.
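For comparison, here is a sketch of the texture approach (using the texture reference API from the early CUDA releases; the names and setup are illustrative, and binding the image to the texture is omitted):

```cuda
#include <cuda_runtime.h>

// 2D texture the input image would be bound to before launch.
texture<float, 2, cudaReadModeElementType> texImage;
__constant__ float c_filter[25];   // hypothetical 5x5 filter

__global__ void conv5x5_tex(float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    float sum = 0.0f;
    for (int ky = -2; ky <= 2; ++ky)
        for (int kx = -2; kx <= 2; ++kx)
            // The texture unit handles out-of-range coordinates through
            // its address mode (e.g. clamp), so no explicit border checks
            // or address arithmetic are needed here.
            sum += c_filter[(ky + 2) * 5 + (kx + 2)] *
                   tex2D(texImage, x + kx + 0.5f, y + ky + 0.5f);
    out[y * w + x] = sum;
}
```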


Thanks for the code and comments.

From what I understand, there are basically two ways to do image processing in CUDA:

I) tiles in shared-mem

II) texture fetches

For large images (from 640x480 up to 1280x1024), which is the right way to go, I or II, for implementing CUDA code, e.g. for convolution or a Hough transform?

I believe Mark Harris said (somewhere) that NVIDIA’s shared-mem implementation can do >2 GPixel/s for 5x5 convolutions. (I find this number somewhat curious, as BLAS3 is supposed to get up to 100 GFlop/s and CUFFT about 50 GFlop/s; does anyone have performance numbers for image processing on the 8800 GTX? Maybe ‘ymxie’ for his example?)

Another question, the SDK release notes say that a 2D image convolution example (5x5 convolution) is included; which source file is this?

Would it be possible - Mark Harris or other nvidia people? - to make available some more code examples for image processing?



Using shared memory should be faster than using texture, assuming your data fits in the 16KB and there’s plenty of data re-use.

The image convolution sample will be in the next release of the SDK (due out soon).

For the record, I don’t remember ever giving image processing performance numbers like that. :)

Note that you can use both texture and shared memory. In some cases this might be better than using one alone.


Why do you find it curious? 5x5 = 25 fmadds per pixel, so at 2 GPixel/s that’s 50 or 100 GFlop/s, depending on how you count an fmad. On those terms that number seems a reasonable comparison to BLAS, unless you’re worried that memory bandwidth wouldn’t be able to keep up?

thanks, and thanks also to Mark for the hint to use shared mem and texture in parallel.

About when will the next release of the SDK be out?

Mark, you are right, sorry for wrongly connecting you with the performance number.

Actually, Simon stated the performance for a 5x5 convolution at >1 GPixel/s, see related post…ndpost&p=164233