I’ve been working on integrating CUDA with an OpenGL app. I’d previously been using CUDA to fill some GL buffer objects, so I decided to see if I could apply CUDA as a post-processing step. The approach was to copy the render targets to PBOs, map the PBOs in CUDA, and do my processing there. Even with a trivial kernel (just copying one of the PBOs to the output buffer), frame time went from 14 ms to over 70 ms. I played around with different configurations, ping-pong PBOs and so on, and still couldn’t get the performance back up.
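In case the details matter, here’s a stripped-down sketch of the per-frame path (driver API; the g_* names are placeholders for my actual globals, and error checking is removed):

```c
#include <GL/glew.h>   /* glBindBuffer / GL_PIXEL_PACK_BUFFER on Windows */
#include <cuda.h>
#include <cudaGL.h>

extern GLuint      g_pbo;        /* registered with cuGLRegisterBufferObject */
extern CUdeviceptr g_output;     /* from cuMemAlloc */
extern CUfunction  g_copyKernel; /* the trivial copy kernel, from my cubin */
extern int         g_width, g_height;

void postProcessFrame(void)
{
    /* Copy the render target into the PBO (stays on the GPU). */
    glBindBuffer(GL_PIXEL_PACK_BUFFER, g_pbo);
    glReadPixels(0, 0, g_width, g_height, GL_BGRA, GL_UNSIGNED_BYTE, 0);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

    /* Map the PBO into the CUDA address space and run the kernel on it. */
    CUdeviceptr src;
    unsigned int size;
    cuGLMapBufferObject(&src, &size, g_pbo);

    cuFuncSetBlockShape(g_copyKernel, 16, 16, 1);
    cuParamSeti(g_copyKernel, 0, src);      /* CUdeviceptr is 32-bit in 2.x */
    cuParamSeti(g_copyKernel, 4, g_output);
    cuParamSetSize(g_copyKernel, 8);
    cuLaunchGrid(g_copyKernel, g_width / 16, g_height / 16);

    cuGLUnmapBufferObject(g_pbo);
}
```

The per-frame map/unmap is where I assumed the cost was, which is what led me to the ping-pong experiments.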
Then I stumbled across a forum thread indicating that, to get full interop performance, I had to use cuGLCtxCreate instead of cuCtxCreate. I tried that, and now I get CUDA_ERROR_OUT_OF_MEMORY when I try to register a buffer object, with cuCtxCreate → cuGLCtxCreate being the only change. I can only assume I’m missing some critical step to make the GL interop work right. What I currently do is (sketched in code after the list):
Create GL context
Initialize CUDA device
Query some capabilities on the device
cuCtxCreate() / cuGLInit()
Do some other setup (allocate device memory for some kernels, create GL buffer objects / textures, etc.)
Register GL buffers.
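In code, that sequence boils down to something like this (again placeholder names, error checks stripped; the commented-out line is the only thing I changed):

```c
#include <GL/glew.h>
#include <cuda.h>
#include <cudaGL.h>

/* Initialization order, roughly. The GL context already exists
   (created by the windowing code) and is current on this thread. */
void initCudaInterop(GLuint pbo)
{
    CUdevice  dev;
    CUcontext ctx;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    /* ... cuDeviceGetAttribute() capability queries here ... */

    /* Original path -- interop works, but the per-frame mapping is slow: */
    cuCtxCreate(&ctx, 0, dev);
    cuGLInit();

    /* New path -- used instead of the two calls above: */
    /* cuGLCtxCreate(&ctx, 0, dev); */

    /* ... cuMemAlloc() scratch buffers, glGenBuffers()/glBufferData(),
       texture creation, etc. ... */

    /* With cuGLCtxCreate, this is the call that returns
       CUDA_ERROR_OUT_OF_MEMORY: */
    cuGLRegisterBufferObject(pbo);
}
```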
What am I missing? I’m on a GeForce 8800 GTS 512 MB, Win7 x64, with 190.38 (and now 190.62) drivers. With a “regular” CUDA context everything works (but interop is slow); with a “GL” CUDA context, I can’t register buffer objects at all.