CUDA for real-time video processing?

I am interested in using CUDA to do various real-time video effects. Ideally, I want to use the video decompression hardware on my NVIDIA card to do MPEG-4 decoding of high-definition movie frames into a buffer that can be accessed with CUDA, and then finally output to an OpenGL texture to be rendered and sent out the Quadro FX SDI connector as HD video. Has anyone already successfully done this? If so, what approach did you use? Otherwise, what approach would you recommend to accomplish this? I was considering using the DirectShow IAMVideoAccelerator interface to get frames, since it appears to support DirectX Video Acceleration to decode frames into a DirectX buffer. I would then map the buffer using CUDA to run the effect into another buffer to be transferred to an OpenGL texture. Is this a reasonable approach, or is it simply too complicated, mixing too many APIs that may not work well together?

– Mark

Using any of the below…

  • VMR9 (DirectShow, DXVA, XP/Vista)

  • EVR (DirectShow/Media Foundation, DXVA2.0, Vista)

  • DXVA2.0 directly (Vista)

… or indeed any other way or version of DXVA, you’ll always end up with a Direct3DSurface9 in a FourCC format like YUY2, UYVY or NV12. You can fairly easily convert this to a Direct3DTexture9 with an RGB type in order to process it with pixel shaders directly in DirectX. But CUDA currently only supports interoperability with Direct3DVertexBuffer9 objects, which means that direct CUDA access to DXVA-decoded video is currently impossible. Hopefully just for now, as CUDA interoperability with Direct3DSurface9 objects using YUV format types is just around the corner… Anyone at NVIDIA care to comment?
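To make that limitation concrete, here’s a minimal sketch of the only D3D9 interop path CUDA currently exposes, using the cudaD3D9* entry points from the toolkit’s D3D9 interop header. The header name and exact signatures are from my reading of the current toolkit, so treat the details as an assumption:

    #include <d3d9.h>
    #include <cuda_runtime.h>
    #include <cuda_d3d9_interop.h>

    // Register a D3D9 vertex buffer with CUDA, map it to a device
    // pointer, and run a kernel on it. Surfaces and textures cannot
    // be registered this way, which is the limitation above.
    void runEffectOnVB(IDirect3DDevice9* pDevice, IDirect3DVertexBuffer9* pVB)
    {
        cudaD3D9Begin(pDevice);               // bind CUDA to this D3D device
        cudaD3D9RegisterVertexBuffer(pVB);    // one-time registration

        void* dptr = 0;
        cudaD3D9MapVertexBuffer(&dptr, pVB);  // device pointer into the VB
        // ... launch your CUDA kernel on dptr here ...
        cudaD3D9UnmapVertexBuffer(pVB);

        cudaD3D9UnregisterVertexBuffer(pVB);
        cudaD3D9End();
    }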

So in order to transfer the frame to OpenGL and process it there with CUDA, you’ll have to copy the surface via system memory to an OpenGL object (a buffer or texture) which is accessible with CUDA. But moving uncompressed HD video across the PCI Express bus twice is very demanding. So far I’ve only successfully been able to access uncompressed YUV surfaces by locking them and reading directly from graphics memory with the CPU, not by using the faster API readback methods. Those methods seem to work only on RGB surfaces, which are 33-50% larger in memory and require the GPU to transform the frame to RGB as well.
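In code, that CPU round-trip looks roughly like the sketch below. It assumes an NV12 surface (luma plane followed by interleaved chroma at the same pitch) and an already-created GL pixel buffer object; cudaGLRegisterBufferObject/cudaGLMapBufferObject are the current CUDA/OpenGL interop calls, everything else is simplified:

    #include <d3d9.h>
    #include <GL/glew.h>
    #include <cuda_runtime.h>
    #include <cuda_gl_interop.h>
    #include <cstring>

    // Lock the decoded YUV surface, copy it through system memory
    // into a GL pixel buffer object, then hand that buffer to CUDA.
    void copySurfaceToPBO(IDirect3DSurface9* pSurf, GLuint pbo,
                          int width, int height)
    {
        D3DLOCKED_RECT lr;
        if (FAILED(pSurf->LockRect(&lr, NULL, D3DLOCK_READONLY)))
            return;

        // NV12: height rows of luma plus height/2 rows of chroma
        const int rows = height + height / 2;
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo);
        unsigned char* dst = (unsigned char*)
            glMapBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, GL_WRITE_ONLY);
        for (int y = 0; y < rows; ++y)        // copy row by row, pitch != width
            memcpy(dst + y * width,
                   (unsigned char*)lr.pBits + y * lr.Pitch, width);
        glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER_ARB);
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0);

        pSurf->UnlockRect();

        // The frame is now CUDA-accessible via the GL interop
        void* dptr = 0;
        cudaGLRegisterBufferObject(pbo);      // once per buffer in real code
        cudaGLMapBufferObject(&dptr, pbo);
        // ... launch a kernel on dptr ...
        cudaGLUnmapBufferObject(pbo);
    }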

If you skip DXVA-accelerated decoding and do the decoding in software instead, you could create a (DirectShow) video renderer that renders the data directly onto an OpenGL texture, which you can then access through CUDA to create your effect.

Looking at the VideoFilter sample from the NVIDIA SDK 9.5, you should be able to build an OpenGL-based DirectShow renderer. Microsoft has a simpler sample in the Windows Vista SDK [*] which also renders video to a texture. Replacing anything DirectX-related with OpenGL should get you started; see the sketch below.
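The core of such a renderer ends up looking something like this, modeled on the Texture3D9 sample but writing into a GL pixel buffer object instead of a D3D texture. The class name, the m_pbo member and the RGB24-only format check are my own simplifications, not taken from the samples:

    #include <streams.h>   // DirectShow base classes
    #include <GL/glew.h>

    class CGLTextureRenderer : public CBaseVideoRenderer
    {
    public:
        CGLTextureRenderer(LPUNKNOWN pUnk, HRESULT* phr)
          : CBaseVideoRenderer(CLSID_NULL,   // substitute your own CLSID
                               NAME("GL Texture Renderer"), pUnk, phr) {}

        // Accept only RGB24 video here; add whatever your decoder outputs
        HRESULT CheckMediaType(const CMediaType* pmt)
        {
            if (!IsEqualGUID(*pmt->Type(), MEDIATYPE_Video) ||
                !IsEqualGUID(*pmt->Subtype(), MEDIASUBTYPE_RGB24))
                return E_FAIL;
            return S_OK;
        }

        // Called once per decoded frame: copy the bits into a GL
        // pixel buffer object that CUDA can map later.
        HRESULT DoRenderSample(IMediaSample* pSample)
        {
            BYTE* pData = 0;
            pSample->GetPointer(&pData);
            long len = pSample->GetActualDataLength();

            glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, m_pbo);
            glBufferSubData(GL_PIXEL_UNPACK_BUFFER_ARB, 0, len, pData);
            glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
            return S_OK;
        }

        GLuint m_pbo;   // created elsewhere with glGenBuffers/glBufferData
    };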

For the next step, look at the Post-Process in OpenGL sample in the CUDA SDK to gain the knowledge needed on how to interact with the data residing in the OpenGL texture through CUDA.
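Condensed, the per-frame pattern from that sample is: map the buffer object into CUDA, launch the effect kernel, unmap, then stream the result into the texture that gets rendered. The kernel (a trivial invert) and the function names here are placeholders of mine:

    #include <GL/glew.h>
    #include <cuda_runtime.h>
    #include <cuda_gl_interop.h>

    // Example effect: invert each RGBA pixel in place
    __global__ void effectKernel(uchar4* pixels, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;
        uchar4 p = pixels[y * width + x];
        pixels[y * width + x] = make_uchar4(255 - p.x, 255 - p.y,
                                            255 - p.z, p.w);
    }

    void processFrame(GLuint pbo, GLuint tex, int width, int height)
    {
        uchar4* dptr = 0;
        cudaGLMapBufferObject((void**)&dptr, pbo);  // pbo registered at startup

        dim3 block(16, 16);
        dim3 grid((width + 15) / 16, (height + 15) / 16);
        effectKernel<<<grid, block>>>(dptr, width, height);

        cudaGLUnmapBufferObject(pbo);

        // Use the PBO as the pixel source for the on-screen texture
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo);
        glBindTexture(GL_TEXTURE_2D, tex);
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                        GL_RGBA, GL_UNSIGNED_BYTE, 0);  // 0 = offset into PBO
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
    }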

Mac OS X provides limited support for PureVideo, so hardware-accelerated decoding may also be available through QuickTime on Windows. With QuickTime you can get output directly to an OpenGL texture. (I’ve only been studying the docs so far.)

Regards

Mikkel Haugstrup

[*] The Windows Vista SDK is the newer Platform SDK and supports XP as well; the sample mentioned is located in the Samples\Multimedia\DirectShow\Players\Texture3D9 folder.