Efficient 24-bit RGB image access


Firstly, let me say that I'm having a lot of success with image processing in CUDA. I have a framework which transparently marshals between framegrabbers, device memory and OpenGL displays in real-time. It's all great and proving very powerful - so far there is little I haven't been able to achieve! Great work NVidia.

But a question comes up here quite often:

Is there a really efficient way to access 24-bit RGB colour images in CUDA? At the moment I am forced to convert from 24-bit RGB to 32-bit RGBX on the host. NVidia neatly sidesteps the problem in all of their image processing demos by pre-loading images in 32-bit RGBX format - ignoring the 24-bit problem and the conversion overhead entirely.

32-bit RGBA can be optimised quite nicely, since its 4-byte alignment permits coalesced memory access. Unfortunately, reading 3 bytes per pixel of RGB is slower, since coalescing does not work well. To get around this I tried texture access, but that is no good either. It would seem that texture references HAVE to be either 1, 2 or 4 channels per pixel, i.e. you cannot call cuArrayCreate with NumChannels = 3.

Also, whilst a 4-channel texture reference such as

texture<uchar4, 2, cudaReadModeNormalizedFloat> texrgba;

will work, a 3-channel texture reference such as

texture<uchar3, 2, cudaReadModeNormalizedFloat> texrgb;

will not. There is no tex2D overload for uchar3, and the compiler barfs an error.
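For reference, the 4-channel path that does compile can be exercised like this (a minimal sketch - the texture name, kernel name and launch geometry are illustrative, and the texture must be bound to a uchar4 cudaArray beforehand):

```cuda
// 4-channel texture reference: this compiles, the uchar3 equivalent does not.
texture<uchar4, 2, cudaReadModeNormalizedFloat> texrgba;

// Copy the texture into a float4 buffer; channels arrive normalised to [0,1].
__global__ void fetchRgba(float4 *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // +0.5f addresses the texel centre when using unnormalised coordinates
    out[y * width + x] = tex2D(texrgba, x + 0.5f, y + 0.5f);
}
```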

Currently, I am forced to treat RGB images as single-channel images (1 byte per pixel) and do the RGB addressing in the kernel, which ends up with a lot of misaligned memory accesses.

If anybody has got a really neat way to deal with RGB images then please share. Many thanks!

Oh, and one final thing: in the Runtime API it is possible to make a cudaChannelFormatDesc which describes a 24-bit image, but you cannot write a kernel that uses a 3-channel tex2D fetch. What's going on there?

Why not write a small kernel that reads eight 24-bit pixels (so you get reasonable coalescing), then writes them back to global memory as 32-bit values like (R, G, B, 0)? (Obviously, make sure you allocate the output array to match the increased size.)

That'd be a lot faster than doing it on the host, and you'd actually be saving some host->device transfer bandwidth (which may or may not be valuable, depending on your application).

EDIT: The kernel would be even faster if each thread read in 16 pixels, since that would ensure perfectly coalesced memory access. And don't use any loops - it's a pretty small kernel, and looping would probably make it slower.
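A minimal sketch of that conversion kernel, with four pixels per thread for brevity (the 16-pixel version just repeats the same pattern with 12 word loads). It assumes the pixel count is padded up to a multiple of 4 and the source buffer is 4-byte aligned; the kernel name is illustrative:

```cuda
// Repack 24-bit RGB into 32-bit RGBX on the device.
// Each thread consumes 3 aligned 32-bit words (= 4 RGB pixels) and
// emits 4 uchar4 values, so both the loads and the stores coalesce.
__global__ void rgb24ToRgbx32(const unsigned int *src, uchar4 *dst, int numPixels)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per 4 pixels
    if (t * 4 >= numPixels) return;

    // Little-endian words; comments list the bytes low to high:
    unsigned int w0 = src[t * 3 + 0];   // R0 G0 B0 R1
    unsigned int w1 = src[t * 3 + 1];   // G1 B1 R2 G2
    unsigned int w2 = src[t * 3 + 2];   // B2 R3 G3 B3

    dst[t * 4 + 0] = make_uchar4( w0        & 0xff, (w0 >>  8) & 0xff, (w0 >> 16) & 0xff, 0);
    dst[t * 4 + 1] = make_uchar4( w0 >> 24,          w1        & 0xff, (w1 >>  8) & 0xff, 0);
    dst[t * 4 + 2] = make_uchar4((w1 >> 16) & 0xff,  w1 >> 24,          w2        & 0xff, 0);
    dst[t * 4 + 3] = make_uchar4((w2 >>  8) & 0xff, (w2 >> 16) & 0xff,  w2 >> 24,         0);
}
```

Launch with one thread per four pixels, e.g. 256 threads per block and (numPixels / 4 + 255) / 256 blocks.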