What is the best way to implement a convolution with CUDA?

I’m trying to figure out the best way to implement image-space convolution on the GPU. I’m guessing quite a few people have done this already, so I was hoping for some pointers.

I’m assuming that the kernel size is not a multiple of 16, and I’m not sure what the practical maximum is.
Should I load the kernel into constant, shared, or texture memory?
It seems to me that it’s better to assign each thread to an output pixel (each thread applies the whole kernel to an image block) than to let each thread handle a single kernel pixel. That avoids needing synchronization on write, and thus probably extra shared memory for the output.
It sounds to me like a good option is to load the kernel into constant memory and the image into shared memory. Is that a good idea?
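
To make the idea concrete, here is a rough sketch of what I have in mind: one thread per output pixel, filter coefficients in constant memory, and an image tile (plus a border of one kernel radius) staged in shared memory. All the names (conv2dTiled, convolve2D, d_filter, TILE, MAX_RADIUS) are made up for illustration, and the 16x16 tile, the radius limit of 8, and zero padding at the image edges are just assumptions, not tuned choices.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 16                          // output pixels per block side (assumed)
#define MAX_RADIUS 8                     // assumed upper bound on the kernel radius
#define MAX_KSIZE (2 * MAX_RADIUS + 1)

// Filter coefficients in constant memory: all threads of a warp read the same
// coefficient in the same iteration, so the constant cache can broadcast it.
__constant__ float d_filter[MAX_KSIZE * MAX_KSIZE];

// One thread per output pixel. Must be launched with TILE x TILE blocks.
__global__ void conv2dTiled(const float* in, float* out,
                            int width, int height, int radius)
{
    // TILE x TILE outputs plus a border ("apron") of up to MAX_RADIUS pixels.
    __shared__ float tile[TILE + 2 * MAX_RADIUS][TILE + 2 * MAX_RADIUS];

    const int tileW = TILE + 2 * radius;
    const int x0 = blockIdx.x * TILE - radius;   // image coords of tile origin
    const int y0 = blockIdx.y * TILE - radius;

    // Cooperative load: threads stride over the tile, apron included.
    for (int ty = threadIdx.y; ty < tileW; ty += TILE) {
        for (int tx = threadIdx.x; tx < tileW; tx += TILE) {
            const int gx = x0 + tx, gy = y0 + ty;
            const bool inside = gx >= 0 && gx < width && gy >= 0 && gy < height;
            tile[ty][tx] = inside ? in[gy * width + gx] : 0.0f;  // zero padding
        }
    }
    __syncthreads();   // the only synchronization: reads follow the tile load

    const int x = blockIdx.x * TILE + threadIdx.x;
    const int y = blockIdx.y * TILE + threadIdx.y;
    if (x >= width || y >= height) return;

    const int k = 2 * radius + 1;
    float sum = 0.0f;
    for (int fy = 0; fy < k; ++fy)
        for (int fx = 0; fx < k; ++fx)
            // Filter is indexed in reverse so this is a convolution,
            // not a correlation.
            sum += d_filter[(k - 1 - fy) * k + (k - 1 - fx)]
                 * tile[threadIdx.y + fy][threadIdx.x + fx];

    out[y * width + x] = sum;   // each thread writes only its own pixel
}

// Hypothetical host wrapper. Returns an empty vector on bad arguments or on
// any CUDA error.
std::vector<float> convolve2D(const std::vector<float>& img, int width, int height,
                              const std::vector<float>& filter, int radius)
{
    const int k = 2 * radius + 1;
    if (width <= 0 || height <= 0 || radius < 0 || radius > MAX_RADIUS) return {};
    if ((int)filter.size() != k * k || (int)img.size() != width * height) return {};

    const size_t bytes = img.size() * sizeof(float);
    float *dIn = nullptr, *dOut = nullptr;
    std::vector<float> result(img.size());

    bool ok = cudaMalloc(&dIn, bytes) == cudaSuccess
           && cudaMalloc(&dOut, bytes) == cudaSuccess
           && cudaMemcpy(dIn, img.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpyToSymbol(d_filter, filter.data(), k * k * sizeof(float)) == cudaSuccess;
    if (ok) {
        const dim3 block(TILE, TILE);
        const dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
        conv2dTiled<<<grid, block>>>(dIn, dOut, width, height, radius);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dIn);
    cudaFree(dOut);
    if (!ok) result.clear();
    return result;
}
```

With this layout the kernel size does not have to be a multiple of 16, since the block size is fixed and only the apron width depends on the radius. The catch is that the apron grows with the radius, so for large kernels a lot of the shared-memory loads are border pixels shared with neighboring blocks.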

Thanks

Joe Stam had a good talk at last year’s GTC on this subject.

Streaming video: http://nvidia.fullviewmedia.com/GPU2009/10…ornia-1401.html
Talk slides: http://www.nvidia.com/content/GTC/documents/1401_GTC09.pdf