Map OpenGL depth buffer in CUDA kernel


The title describes what I’d like to do. My CUDA application renders on top of an already populated OpenGL frame buffer (with a depth component). I cannot assume anything about the frame buffer; in general it may be the default frame buffer (I don’t create it myself).

For the quite common case of a 24-bit depth and 8-bit stencil buffer, I would like to use CUDA/GL interop to map the depth buffer in the CUDA kernel without going through host memory. So I ask glGetFramebufferAttachmentParameteriv() whether the frame buffer actually has those properties, and if so:

1.) Create an OpenGL PBO
2.) Register it with a CUDA graphics resource
3.) glReadPixels() to the PBO with GL_DEPTH_STENCIL and GL_UNSIGNED_INT_24_8
4.) Map the graphics resource and obtain a device pointer
5.) Call my rendering kernel with the device pointer (my code basically marches rays through a volume and stops short if the ray origin is “behind” the depth item)
6.) Display the composited image with OpenGL and perform cleanup
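To make the steps above concrete, here is a minimal sketch of the interop sequence (the function name, width/height parameters, and the commented kernel launch are my placeholders; it assumes a current GL context and error checking is omitted):

```cpp
#include <GL/glew.h>
#include <cuda_gl_interop.h>

// Sketch of steps 1.)-4.) and the cleanup; not a drop-in implementation.
void mapDepthForCuda(int width, int height)
{
    // 1.) Create a PBO sized for one GLuint (24-bit depth + 8-bit stencil) per pixel.
    GLuint pbo = 0;
    glGenBuffers(1, &pbo);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    glBufferData(GL_PIXEL_PACK_BUFFER, width * height * sizeof(GLuint),
                 nullptr, GL_STREAM_COPY);

    // 2.) Register the PBO with CUDA; the kernel only reads from it.
    cudaGraphicsResource* res = nullptr;
    cudaGraphicsGLRegisterBuffer(&res, pbo, cudaGraphicsRegisterFlagsReadOnly);

    // 3.) Pack the combined depth/stencil attachment into the bound PACK buffer.
    glReadPixels(0, 0, width, height, GL_DEPTH_STENCIL, GL_UNSIGNED_INT_24_8,
                 nullptr /* offset into the bound PBO */);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

    // 4.) Map the resource and obtain a device pointer for the kernel.
    cudaGraphicsMapResources(1, &res);
    void*  devPtr = nullptr;
    size_t bytes  = 0;
    cudaGraphicsResourceGetMappedPointer(&devPtr, &bytes, res);

    // 5.) Launch the ray-marching kernel with the packed depth/stencil buffer:
    // renderKernel<<<grid, block>>>(static_cast<const GLuint*>(devPtr), ...);

    // 6.) Unmap before OpenGL touches the PBO again, then clean up.
    cudaGraphicsUnmapResources(1, &res);
    cudaGraphicsUnregisterResource(res);
    glDeleteBuffers(1, &pbo);
}
```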

Transferring ownership, however, seemed achingly slow to me, so I asked GL_KHR_debug whether there were any issues. And indeed I was told that the driver schedules a device-to-host transfer for the PBO in question:
Buffer performance warning: Buffer object 1 (bound to GL_PIXEL_PACK_BUFFER_ARB, usage hint is GL_STREAM_COPY) is being copied/moved from VIDEO memory to HOST memory.

I assumed that my code must be flawed somehow, so I tried the very same commands but transferred the color buffer to the kernel instead (of course giving incorrect results; the format and type passed to glReadPixels() were GL_BGRA and GL_UNSIGNED_BYTE). This, however, didn’t result in performance warnings and was as fast as I expected.

For the fun of it, I then tried to glCopyPixels() the depth buffer to the currently active color buffer with GL_DEPTH_STENCIL_TO_RGBA_NV and read the depth buffer with glReadPixels() from the color buffer. This is fast, provides me with the correct depth buffer, but of course invalidates the color buffer (not an option for me).

I thus tried to assemble a minimal example to reproduce the issue: . You will need a GLUT implementation supporting debug contexts (e.g. freeglut) and a GLEW with support for GL_KHR_debug to compile the example (tested on Ubuntu 14.04 with CUDA 7.5). Instructions on how to compile it can be found in the comments, along with instructions on how to modify the code to test the various modalities that I tried and described above. I hope that someone can have a look at my code and point me in the right direction, or maybe confirm that this is an issue.

For completeness’ sake and maybe to clarify some things, here’s a link to the source file I’d like to optimize:

The pixel transfer is done with a class from a library that can be found here:


Reiterating this because there has been no answer to my question so far. I had hoped for some official statement (or a pointer to the section in the docs that I overlooked?) on whether interop with a GL depth buffer for reading is supported or not.

Is this a correct TL;DR summary of your original post: “Transferring depth buffers from OpenGL to CUDA is slow, compared to the transfer of RGBA buffers of the same size”?

If so, consider filing a request for enhancement, via the bug reporting form linked from the CUDA registered developer website (prefix the bug synopsis with “RFE:” to mark it as an enhancement request rather than a functional bug).

My last interaction with OpenGL dates to 2005, and I have vague recollections that reading depth buffers was not a performance-optimized path, so the slowness you observe may well be a function of the OpenGL driver rather than the CUDA driver.

Yes, that’s basically it. In addition, I know that it is slow precisely because the transfer goes through host memory.


Yes, thanks for the hint. Before filing an RFE, it is probably best to check the performance of depth buffer transfers between two GL FBOs or the like.
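A pure GL-side baseline could be measured with a depth blit between two FBOs wrapped in a timer query, something like the following sketch (fboSrc, fboDst, width, and height are my placeholders; assumes GL 3.3+ for GL_TIME_ELAPSED, and that both FBOs have matching depth attachments):

```cpp
#include <GL/glew.h>
#include <cstdio>

// Time one depth-only blit between two already-created FBOs.
void timeDepthBlit(GLuint fboSrc, GLuint fboDst, int width, int height)
{
    GLuint query = 0;
    glGenQueries(1, &query);

    glBindFramebuffer(GL_READ_FRAMEBUFFER, fboSrc);
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER, fboDst);

    glBeginQuery(GL_TIME_ELAPSED, query);
    // Depth blits require GL_NEAREST filtering.
    glBlitFramebuffer(0, 0, width, height, 0, 0, width, height,
                      GL_DEPTH_BUFFER_BIT, GL_NEAREST);
    glEndQuery(GL_TIME_ELAPSED);

    GLuint64 ns = 0;
    glGetQueryObjectui64v(query, GL_QUERY_RESULT, &ns); // blocks until available
    std::printf("depth blit: %.3f ms\n", ns / 1e6);
    glDeleteQueries(1, &query);
}
```

If this blit stays on the device and is fast, that would point at the PBO readback path rather than depth transfers in general.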

I think there can be several sources of slow transfers, especially if you read the depth buffer in a format other than the one the GLX visual maintains. Something along the lines of glReadPixels() requesting 32-bit float depth from a 24-bit depth buffer will be slow for sure.