OpenGL interop performance problems

I’m working on a project that involves taking the rendered image from OpenGL and comparing it to another image. For simplicity’s sake, I’m using PSNR to evaluate the difference between the rendered image and the reference image.
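
For reference, the metric itself is small. Below is a minimal CPU-side sketch of PSNR over two 8-bit buffers, assuming both hold the same number of samples (for example RGBA bytes as returned by glReadPixels). The name psnr8 and the use of std::vector are illustrative assumptions, not the code from my project.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper: PSNR in dB between two 8-bit sample buffers.
// Returns NaN for empty or mismatched buffers and +infinity when the
// buffers are identical (mean squared error of zero).
double psnr8(const std::vector<std::uint8_t>& rendered,
             const std::vector<std::uint8_t>& reference)
{
    if (rendered.empty() || rendered.size() != reference.size())
        return std::numeric_limits<double>::quiet_NaN();

    // Sum of squared per-sample differences. On the GPU, this is the part
    // that would become a parallel reduction.
    double sumSq = 0.0;
    for (std::size_t i = 0; i < rendered.size(); ++i) {
        const double d = static_cast<double>(rendered[i]) -
                         static_cast<double>(reference[i]);
        sumSq += d * d;
    }

    const double mse = sumSq / static_cast<double>(rendered.size());
    if (mse == 0.0)
        return std::numeric_limits<double>::infinity();

    // The peak value for 8-bit samples is 255.
    return 10.0 * std::log10((255.0 * 255.0) / mse);
}
```

Only the sum of squared differences depends on the pixel data, so a GPU version would need to hand back just one number per frame rather than the whole image.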

Previously, I was reading the pixels out of the back buffer using glReadPixels and performing the PSNR calculation on the CPU. Since then, I’ve discovered that this part of my code is the largest bottleneck and have moved that calculation to the GPU using CUDA and the OpenGL interop APIs. My hope was to speed up this easily parallelized calculation by using CUDA and minimizing the amount of data that I need to transfer from the graphics card. I am using the same technique as the “Post-Process in OpenGL” sample to get the pixel data into CUDA (glReadPixels to a PBO, register & map PBO, run kernel, unmap & unregister PBO).

However, I’ve noticed that my program runs slower with the CUDA implementation. The problem occurs on both Windows XP and Ubuntu 9.04. Between the CUDA Visual Profiler and in-program timers, I’ve learned that the actual calculation is faster in CUDA than in the CPU implementation. This leaves the OpenGL-to-CUDA mapping as the suspect for the slowdown.

Does anybody know why the OpenGL interop APIs are so slow? I’m using the CUDA 2.3 toolkit and the recommended 190.38 and 190.53 drivers on Windows XP and Ubuntu, respectively. Thanks in advance.

The newest stable driver is version 196.21. It supports CUDA 3.0.1, but it’s backwards compatible with earlier versions too. I haven’t looked at the release notes for all of the versions between 190.38 and 196.21, but your problem may already have been fixed in a newer version. You’ll have to try it out and see for yourself, though.

How much time is the interop taking? It sounds to me like you might be timing things incorrectly. Can you post your code and timings?