Map OpenGL depth buffer in CUDA kernel

szellmann · July 1, 2016, 3:23pm

Hi,

The title describes what I’d like to do. My CUDA application renders on top of an already populated OpenGL frame buffer (with a depth component). I cannot assume anything about the frame buffer; in general it may be the default frame buffer (I don’t create it myself).

For the quite common case of a 24-bit depth and 8-bit stencil buffer, I would like to use CUDA/GL interop to map the depth buffer in the CUDA kernel without going through host memory. So I query glGetFramebufferAttachmentParameteriv() to check whether the frame buffer actually has those properties, and in that case:

1.) Create an OpenGL PBO
2.) Register it with a CUDA graphics resource
3.) glReadPixels() to the PBO with GL_DEPTH_STENCIL and GL_UNSIGNED_INT_24_8
4.) Map the graphics resource and obtain a device pointer
5.) Call my rendering kernel with the device pointer (my code basically marches rays through a volume and stops short if the ray origin is “behind” the depth item)
6.) Display the composited image with OpenGL and perform cleanup
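
A minimal sketch of how the values read in step 3 could be decoded inside the kernel from step 5. With GL_DEPTH_STENCIL and GL_UNSIGNED_INT_24_8, each pixel is one 32-bit word with depth in the upper 24 bits and stencil in the lower 8 bits. The helper names and the HOST_DEVICE macro are hypothetical and not taken from the linked code; the macro only lets the same functions compile as plain C++ or inside a .cu file.

```cpp
#include <cstdint>

// Hypothetical convenience macro: expands to CUDA qualifiers only when
// compiled with nvcc, so the helpers also build as ordinary C++.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Extract the 24-bit depth value (upper 24 bits of the packed word).
HOST_DEVICE inline std::uint32_t depth24(std::uint32_t packed)
{
    return packed >> 8;
}

// Extract the 8-bit stencil value (lower 8 bits of the packed word).
HOST_DEVICE inline std::uint8_t stencil8(std::uint32_t packed)
{
    return static_cast<std::uint8_t>(packed & 0xFFu);
}

// Map the 24-bit depth to [0, 1], the range window-space depth uses
// for a fixed-point depth buffer.
HOST_DEVICE inline float normalized_depth(std::uint32_t packed)
{
    return static_cast<float>(depth24(packed)) / 16777215.0f; // 2^24 - 1
}
```

In a kernel, the mapped device pointer from step 4 would be treated as an array of 32-bit words, indexed per pixel, and passed through normalized_depth() before comparing against the depth of the current ray sample.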

Transferring ownership, however, seemed achingly slow to me, so I asked GL_KHR_debug whether there were any issues. And indeed I was told that the driver schedules a device-to-host transfer for the PBO in question:
Buffer performance warning: Buffer object 1 (bound to GL_PIXEL_PACK_BUFFER_ARB, usage hint is GL_STREAM_COPY) is being copied/moved from VIDEO memory to HOST memory.

I assumed that my code must be flawed somehow, so I tried the very same commands but transferred the color buffer to the kernel instead (format and type passed to glReadPixels() were GL_BGRA and GL_UNSIGNED_BYTE; this of course gives incorrect rendering results). This, however, didn’t result in performance warnings and was as fast as I expected.

For the fun of it, I then tried to glCopyPixels() the depth buffer to the currently active color buffer with GL_DEPTH_STENCIL_TO_RGBA_NV and read the depth buffer with glReadPixels() from the color buffer. This is fast, provides me with the correct depth buffer, but of course invalidates the color buffer (not an option for me).

I hope that someone can have a look at my code and point me in the right direction or maybe confirm that this is an issue. I thus tried to assemble a minimal example to reproduce the issue:
(Minimal) OpenGL to CUDA PBO example, purpose of this example is to evaluate why depth transfer is so slow · GitHub

You will need a GLUT implementation supporting debug contexts (e.g. freeglut) and GLEW with support for GL_KHR_debug to compile the example (tested with Ubuntu 14.04 and CUDA 7.5). Instructions on how to compile it can be found in the comments. There you will also find instructions on how to modify the code to test the various modalities that I tried and described above.

For completeness’ sake and maybe to clarify some things, here’s a link to the source file I’d like to optimize:

github.com

deskvox/deskvox/blob/master/virvo/virvo/vvraycaster.cpp


The pixel transfer is done with a class from a library that can be found here:

github.com

szellmann/visionaray/blob/master/include/visionaray/cuda/pixel_pack_buffer.h

// This file is distributed under the MIT license.
// See the LICENSE file for details.

#pragma once

#ifndef VSNRAY_CUDA_PIXEL_PACK_BUFFER_H
#define VSNRAY_CUDA_PIXEL_PACK_BUFFER_H 1

#include <memory>

#include <visionaray/math/forward.h>
#include <visionaray/math/rectangle.h>
#include <visionaray/pixel_format.h>

namespace visionaray
{
namespace cuda
{

class pixel_pack_buffer


https://github.com/szellmann/visionaray/blob/master/include/visionaray/cuda/detail/pixel_pack_buffer.inl

Cheers,
Stefan

szellmann · September 8, 2016, 7:40pm

Reiterating this because there has been no answer to my question so far. I had hoped for some official statement (or a pointer to a section in the docs that I overlooked?) on whether interop with a GL depth buffer for reading is supported or not.

njuffa · September 8, 2016, 8:46pm

Is this a correct TL;DR summary of your original post: “Transferring depth buffers from OpenGL to CUDA is slow, compared to the transfer of RGBA buffers of the same size”?

If so, consider filing a request for enhancement, via the bug reporting form linked from the CUDA registered developer website (prefix the bug synopsis with “RFE:” to mark it as an enhancement request rather than a functional bug).

My last interaction with OpenGL dates to 2005, and I have vague recollections that reading depth buffers was not a performance-optimized path, so the slowness you observe may well be a function of the OpenGL driver rather than the CUDA driver.

szellmann · September 8, 2016, 9:23pm

Yes, that’s basically it, and in addition I know that it is slow because the transfer goes through host memory.

Yes, thanks for the hint. So it is probably best, before filing an RFE, to check the performance of depth buffer transfers between two GL FBOs.

I think there can be several sources of slow transfers, especially if you read depth buffers in a format other than the one the GLX visual maintains. Something along the lines of calling glReadPixels() to request 32-bit float depth from a 24-bit depth buffer will be slow for sure.
