The title describes what I’d like to do. My CUDA application renders on top of an already populated OpenGL frame buffer (with a depth component). I cannot assume anything about the frame buffer; in general it may be the default frame buffer (I don’t create it myself).
For the quite common case of a 24-bit depth and 8-bit stencil buffer, I would like to use CUDA/GL interop to make the depth buffer available to the CUDA kernel without going through host memory. So I query glGetFramebufferAttachmentParameteriv() to check whether the frame buffer actually has those properties, and in that case I do the following:
1.) Create an OpenGL PBO
2.) Register it with a CUDA graphics resource
3.) glReadPixels() to the PBO with GL_DEPTH_STENCIL and GL_UNSIGNED_INT_24_8
4.) Map the graphics resource and obtain a device pointer
5.) Call my rendering kernel with the device pointer (my code basically marches rays through a volume and stops short if the ray origin is “behind” the depth value stored for that pixel)
6.) Display the composited image with OpenGL and perform cleanup
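To make step 5 concrete: each 32-bit value read with GL_DEPTH_STENCIL and GL_UNSIGNED_INT_24_8 stores the depth in the upper 24 bits and the stencil in the lower 8 bits, so the kernel has to unpack it before comparing against the ray. Below is a minimal sketch of that decoding. The function names and the HOST_DEVICE macro are made up for illustration and are not taken from my actual code; the macro just lets the same functions compile as plain C++ or inside a CUDA kernel.

```cpp
#include <cstdint>

// Hypothetical helper macro: expands to CUDA qualifiers only when compiled by nvcc.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Depth occupies the upper 24 bits of a GL_UNSIGNED_INT_24_8 value.
HOST_DEVICE inline std::uint32_t unpack_depth_raw(std::uint32_t packed)
{
    return packed >> 8;
}

// Stencil occupies the lower 8 bits.
HOST_DEVICE inline std::uint8_t unpack_stencil(std::uint32_t packed)
{
    return static_cast<std::uint8_t>(packed & 0xFFu);
}

// Normalized depth in [0, 1]; 16777215 == 2^24 - 1 is the largest 24-bit value.
HOST_DEVICE inline float unpack_depth(std::uint32_t packed)
{
    return static_cast<float>(unpack_depth_raw(packed)) / 16777215.0f;
}
```

In a kernel, each thread would call unpack_depth() on the element of the mapped device pointer that belongs to its pixel. The result is a window-space depth, so it still has to be related to the ray parameter using the projection in use.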
Transferring ownership, however, seemed achingly slow to me, so I enabled GL_KHR_debug output to check for issues. Indeed, the driver reports that it schedules a device-to-host transfer for the PBO in question:
Buffer performance warning: Buffer object 1 (bound to GL_PIXEL_PACK_BUFFER_ARB, usage hint is GL_STREAM_COPY) is being copied/moved from VIDEO memory to HOST memory.
I assumed that my code must be flawed somehow, so I tried the very same commands but transferred the color buffer to the kernel instead (which of course gives incorrect results; the format and type passed to glReadPixels() were GL_BGRA and GL_UNSIGNED_BYTE). This, however, didn’t produce any performance warnings and was as fast as I expected.
For the fun of it, I then used glCopyPixels() with GL_DEPTH_STENCIL_TO_RGBA_NV to copy the depth buffer into the currently active color buffer, and read the depth values back from the color buffer with glReadPixels(). This is fast and gives me the correct depth buffer, but of course it overwrites the color buffer (not an option for me).
I hope that someone can have a look at my code and point me in the right direction, or maybe confirm that this is an issue. To that end, I assembled a minimal example that reproduces the problem:
https://gist.github.com/szellmann/4a2f44f254af31e795c5b368d6f38423
To compile the example, you will need a GLUT implementation that supports debug contexts (e.g. freeglut) and GLEW with support for GL_KHR_debug (tested with Ubuntu 14.04 and CUDA 7.5). Build instructions can be found in the comments, along with instructions on how to modify the code to test the variants described above.
For completeness’ sake and maybe to clarify some things, here’s a link to the source file I’d like to optimize:
The pixel transfer is done with a class from a library that can be found here: