CUDA OpenGL post-processing example

A few people have asked about this, so I’m attaching a simple example that demonstrates how to transfer image data back and forth between OpenGL and CUDA.

It performs a 2D convolution on an image of a simple 3D scene rendered by OpenGL.

Note that the image processing code is not really optimized for performance.

Instead of shared memory, a texture could have been used for the convolution image as well, right? Since convolution is done per pixel and the pixels in a block are spatially close, I assumed using a texture would provide an easy implementation together with automatic caching.

That’s true, although unfortunately there is currently no way to read directly from a texture allocated in OpenGL.

You can use texture lookups in CUDA, but only into arrays allocated by CUDA itself.

We hope to lift this restriction in a future release.

ah ok :)

Do you perhaps know the difference in execution time between a texture-based convolution and a shared-memory-based convolution?

Using a texture would also not provide the benefits of data re-use that shared memory affords. The larger the filter kernel, the more threads will re-use the same data elements.
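To put a rough number on that re-use, here is a small C++ sketch (plain host code, not taken from the attached example) that counts global-memory reads for one thread block. It assumes a square block of B x B threads, a K x K filter, and a shared-memory version that loads each pixel of the (B + K - 1) x (B + K - 1) input tile exactly once. The function names are made up for illustration.

```cpp
// Reads issued by one B x B thread block when every thread fetches all
// K x K inputs itself (no sharing between threads).
long long naive_reads(int block, int kernel) {
    return 1LL * block * block * kernel * kernel;
}

// Reads issued when the block first stages its input tile in shared memory,
// including the (K - 1)-pixel apron around it: each pixel is read once.
long long tiled_reads(int block, int kernel) {
    long long side = block + kernel - 1;
    return side * side;
}

// Ratio of the two counts: how many times each staged pixel is re-used
// on average under the assumptions above.
double reuse_factor(int block, int kernel) {
    return static_cast<double>(naive_reads(block, kernel)) /
           static_cast<double>(tiled_reads(block, kernel));
}
```

For a 16 x 16 block and a 5 x 5 filter this gives 6400 reads without sharing versus 400 with a staged tile, so each loaded pixel is used 16 times on average. The ratio keeps growing as the filter gets larger.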


In our tests, CUDA-based convolution using shared memory is about 2x faster than the equivalent OpenGL texture-based solution.

For example, it’s possible to do a 5x5 convolution at more than 1 Gpixel/sec.

How about a convolution with a really big kernel, for example 64x64 (and the FFT approach is not possible)?

The shared memory approach is a problem here, since shared memory is not big enough to hold the kernel constants. How can this be solved without cache misses?

Would it for example be more efficient to implement 4 shared memory convolutions with 32x32 kernels than to do a big texture/constant cache based convolution with a 64x64 kernel? I think it is, but perhaps someone has already implemented this.

The kernel constants should be stored in constant memory, so that is not a problem. However, you are correct that the pixel data would be hard to fit in shared memory for arbitrarily large filters.
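A quick back-of-the-envelope check of both points, written as a C++ sketch. The sizes are assumptions, not figures from this thread: 64 KB of constant memory and 16 KB of shared memory per multiprocessor (typical of the first CUDA-capable GPUs), 4-byte float data, and a 16 x 16 thread block whose input tile carries a (K - 1)-pixel apron. The names are hypothetical.

```cpp
#include <cstddef>

// Assumed hardware limits (first-generation CUDA GPUs); adjust for your device.
constexpr std::size_t kConstantMemBytes = 64 * 1024;
constexpr std::size_t kSharedMemBytes   = 16 * 1024;

// Bytes needed to hold a K x K filter of floats.
constexpr std::size_t filter_bytes(std::size_t k) {
    return k * k * sizeof(float);
}

// Bytes needed to hold the input tile for a B x B block of output pixels:
// the block itself plus the (K - 1)-pixel apron the filter reaches into.
constexpr std::size_t tile_bytes(std::size_t block, std::size_t k) {
    return (block + k - 1) * (block + k - 1) * sizeof(float);
}

// True if the filter coefficients fit in the assumed constant memory.
constexpr bool filter_fits_constant(std::size_t k) {
    return filter_bytes(k) <= kConstantMemBytes;
}

// True if one block's input tile fits in the assumed shared memory.
constexpr bool tile_fits_shared(std::size_t block, std::size_t k) {
    return tile_bytes(block, k) <= kSharedMemBytes;
}
```

Under these assumptions a 64 x 64 float filter takes 16 KB and fits in constant memory, while the 79 x 79 tile for a 16 x 16 block takes 24,964 bytes and does not fit in 16 KB of shared memory. A 32 x 32 filter needs a 47 x 47 tile (8,836 bytes), which does fit.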

Note also that some filters, such as Gaussian blur kernels, are separable. Large separable filters fit in shared memory much more easily. We will have an example of such a convolution in the next release of the SDK.


Hi Simon,

Thanks for this estimate; the example in the SDK uses a separable kernel. Could you possibly make the full 5x5 convolution code available?

Thanks and regards,


When will you guys release it? We are all eagerly waiting :playball: