OpenGL / CUDA interop: Can't register GL buffers after cuGLCtxCreate

I’ve been working on integrating CUDA with an OpenGL app. I’ve previously used CUDA to fill some GL buffer objects, so I decided to see whether I could also apply CUDA as a post-processing step. My approach was to copy the render targets to PBOs, map the PBOs in CUDA, and do my processing there. When I tried this with a trivial kernel (just copying one of the PBOs to the output buffer), performance went from 14 ms/frame to more than 70 ms/frame. I played around with different configurations, ping-pong PBOs and so on, and still could not get the performance up.
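For reference, the per-frame register/map/process/unmap cycle I'm doing looks roughly like this (a minimal sketch using the driver-API GL interop calls; `pbo` and `launchProcessKernel` are placeholders for my actual buffer handle and kernel launch):

```c
#include <cuda.h>
#include <cudaGL.h>
#include <GL/gl.h>

/* Sketch of the post-processing path. 'pbo' is a GL pixel buffer
   object already filled from the render target; 'launchProcessKernel'
   stands in for setting up and launching my CUDA kernel.
   Error checking stripped for brevity. */
extern GLuint pbo;                         /* created via glGenBuffers */
extern void launchProcessKernel(CUdeviceptr dptr, unsigned int size);

void postProcessFrame(void)
{
    CUdeviceptr dptr;
    unsigned int size;

    cuGLMapBufferObject(&dptr, &size, pbo); /* device pointer to the PBO  */
    launchProcessKernel(dptr, size);        /* trivial copy kernel so far */
    cuGLUnmapBufferObject(pbo);             /* hand the buffer back to GL */
}

/* Once, after context creation:  cuGLRegisterBufferObject(pbo);
   Once, at shutdown:             cuGLUnregisterBufferObject(pbo); */
```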

Then I stumbled across a thread indicating that, to get full interop performance, I had to use cuGLCtxCreate instead of cuCtxCreate. I tried that, and now I get CUDA_ERROR_OUT_OF_MEMORY when I try to register a buffer object, with the only change being cuCtxCreate → cuGLCtxCreate. I can only assume I am missing some critical step to make the GL interop work correctly. What I currently do is:

Create GL context
Initialize CUDA device
Query some capabilities on the device
cuCtxCreate() / cuGLInit()

Do some other stuff (allocate some device memory for some kernels, create some GL buffer objects / textures / etc)

Register GL buffers.
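The steps above, after switching to the GL-aware context, can be sketched like this (as I understand it, cuGLCtxCreate must be called while the GL context is current on the calling thread, which it is in my app; `pbo` is a placeholder):

```c
#include <cuda.h>
#include <cudaGL.h>
#include <GL/gl.h>

/* Sketch of my initialization order. Error checking stripped;
   assumes the GL context has already been created and made
   current on this thread. */
extern GLuint pbo;          /* placeholder for a GL buffer object */

void initCuda(void)
{
    CUdevice  dev;
    CUcontext ctx;

    cuInit(0);
    cuDeviceGet(&dev, 0);   /* capability queries happen on 'dev' here */

    /* Previously: cuCtxCreate(&ctx, 0, dev); followed by cuGLInit();
       Now: */
    cuGLCtxCreate(&ctx, 0, dev);

    /* ... allocate device memory, create GL buffers / textures ... */

    cuGLRegisterBufferObject(pbo);  /* now returns CUDA_ERROR_OUT_OF_MEMORY */
}
```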

What am I missing? I’ve got a GeForce 8800 GTS 512 MB, on Win7 x64 with the 190.38 (and now 190.62) drivers. If I create a “regular” CUDA context, everything works (but interop is slow); if I create a “GL” CUDA context, I can’t register buffer objects.

I did some more digging: the 32-bit version of the app works on Win XP, but the same executable generates CUDA out-of-memory errors when registering buffer objects on Vista 32-bit. It appears I’m hitting some sort of hiccup with the Vista driver model, but I really don’t know what is going on.