I’m working on a project that involves taking the rendered image from OpenGL and comparing it to another image. For simplicity’s sake, I’m using PSNR to evaluate the difference between the rendered image and the reference image.
Previously, I was reading the pixels out of the back buffer using glReadPixels and performing the PSNR calculation on the CPU. Since then, I’ve discovered that this part of my code is the largest bottleneck and have moved the calculation to the GPU using CUDA and the OpenGL interop APIs. My hope was to speed up this easily parallelized calculation with CUDA while minimizing the amount of data I need to transfer from the graphics card. I am using the same technique as the “Post-Process in OpenGL” SDK sample to get the pixel data into CUDA (glReadPixels to a PBO, register & map the PBO, run the kernel, unmap & unregister the PBO).
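Roughly, my per-frame path looks like this with the CUDA 2.3 runtime API (the kernel name, launch configuration, and wrapper function are illustrative placeholders, not my exact code):

```cuda
#include <cuda_gl_interop.h>
#include <cuda_runtime.h>

// Placeholder kernel: accumulates per-pixel squared differences for PSNR.
__global__ void psnr_partial_kernel(const uchar4 *pixels, float *partial, int n);

// Called once per frame, after glReadPixels has filled the PBO.
void run_psnr_on_pbo(GLuint pbo, int width, int height, float *d_partial)
{
    int n = width * height;

    // Step 1: register the PBO with CUDA.
    cudaGLRegisterBufferObject(pbo);

    // Step 2: map the PBO into CUDA's address space.
    uchar4 *d_pixels = NULL;
    cudaGLMapBufferObject((void **)&d_pixels, pbo);

    // Step 3: run the kernel on the mapped pixel data.
    psnr_partial_kernel<<<(n + 255) / 256, 256>>>(d_pixels, d_partial, n);

    // Step 4: unmap and unregister.
    cudaGLUnmapBufferObject(pbo);
    cudaGLUnregisterBufferObject(pbo);
}
```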
However, I’ve noticed that my program runs slower with the CUDA implementation. The problem occurs on both Windows XP and Ubuntu 9.04. Between the CUDA Visual Profiler and in-program timers, I’ve learned that the actual calculation is faster in CUDA than in the CPU implementation. This leaves the OpenGL-to-CUDA mapping as the suspect for the slowdown.
Does anybody know why the OpenGL interop APIs are so slow? I’m using the CUDA 2.3 toolkit and the recommended 190.38/190.53 drivers on Windows XP/Ubuntu respectively. Thanks in advance.