How to do zero-copy mapping from decompressor output to OpenGL (not ES)?

Questions below are on lines starting with ***:

We are porting an application that is deployed on desktop Linux to also run on the Tegra TX2 platform. The application is written in OpenGL (not ES) and uses CUDA to perform image analysis. This application reads in video streams (streams of JPEG images, H.264 and H.265), stitches them and renders them at 30fps. To maintain these rates, we do predictive decompression ahead of the current rendering time on multiple streams.

To maintain sufficient throughput, the desktop application uses multiple simultaneous decompression threads sharing the decoding hardware, which produce YUV output buffers. We then use CUDA on the main (rendering) thread to convert these to RGB values stored in pixel-buffer objects, map each PBO to a texture, and render from the textures. In steady state, the frames being rendered from texture were decompressed in earlier iterations. On desktop and laptop systems (GeForce GTX 1060 and above), we can decompress and convert at 30fps for two 4K streams on a laptop and five 4K streams on a desktop with a GTX 1080 card.
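
For context, the conversion step is conceptually a kernel like the one below. This is only a sketch, not our production code: it assumes NV12 decoder output (full-resolution Y plane followed by an interleaved half-resolution UV plane), full-range BT.601 coefficients, and illustrative names.

    // Sketch: NV12 -> RGBA8 conversion (assumed NV12 layout, full-range BT.601).
    // 'y' and 'uv' point at the decoder output planes; 'rgba' is the mapped PBO pointer.
    __global__ void nv12_to_rgba(const unsigned char* y, const unsigned char* uv,
                                 unsigned char* rgba, int width, int height, int pitch)
    {
        int px = blockIdx.x * blockDim.x + threadIdx.x;
        int py = blockIdx.y * blockDim.y + threadIdx.y;
        if (px >= width || py >= height) return;

        float Y = y[py * pitch + px];
        // UV plane is half resolution, interleaved U then V, same pitch as Y.
        int uvIdx = (py / 2) * pitch + (px / 2) * 2;
        float U = uv[uvIdx]     - 128.0f;
        float V = uv[uvIdx + 1] - 128.0f;

        float r = Y + 1.402f * V;
        float g = Y - 0.344f * U - 0.714f * V;
        float b = Y + 1.772f * U;

        unsigned char* out = rgba + (py * width + px) * 4;
        out[0] = (unsigned char)fminf(fmaxf(r, 0.0f), 255.0f);
        out[1] = (unsigned char)fminf(fmaxf(g, 0.0f), 255.0f);
        out[2] = (unsigned char)fminf(fmaxf(b, 0.0f), 255.0f);
        out[3] = 255;
    }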

We have been able to get the OpenGL rendering code working on the Tegra and have everything running at rate except the predictive decompression into textures. We’ve tried several paths to get there, each of which gets us most of the way:

  1. EGLImage to CPU memory, PBO copy: We can use pixel-buffer objects to copy the CPU-mapped EGL buffers into textures. This works, but (as happens in the desktop implementation when we do CPU-based decompression) the transfer of the uncompressed RGB buffers takes the bulk of the time. On the desktop we’ve switched to passing the compressed data to GPU memory and never bringing it back, which lets us run at rate; with the memory-to-memory copy we cannot decompress fast enough. This was the first approach we implemented (because it was easiest), but it was not expected to run at rate; we were only hoping that the shared memory space on the Tegra might make the copy fast enough.

  2. EGLImage to texture: The NvEglRenderer code in the Tegra Multimedia API sample programs performs zero-copy updates to textures from EGLImage structures using the GL_TEXTURE_EXTERNAL_OES target and glEGLImageTargetTexture2DOES, which are only available in GLES (GLES2/gl2ext.h) and not in OpenGL on the Tegras (and seem unavailable in OpenGL generally, according to https://www.khronos.org/registry/OpenGL/extensions/OES/OES_EGL_image.txt). This blocks us from using this approach to get the images into OpenGL textures.
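
For reference, this is roughly what that GLES-only path looks like. It is a sketch modeled loosely on NvEglRenderer, not code we can actually use from desktop GL; the EGLImageKHR is assumed to have already been created from the decoder output (e.g. via NvEGLImageFromFd), and the helper name is ours.

    // Sketch of the GLES external-image binding that is unavailable to us in desktop OpenGL.
    #include <EGL/egl.h>
    #include <EGL/eglext.h>
    #include <GLES2/gl2.h>
    #include <GLES2/gl2ext.h>

    static PFNGLEGLIMAGETARGETTEXTURE2DOESPROC glEGLImageTargetTexture2DOES_ptr;

    void bindExternalImage(GLuint tex, EGLImageKHR eglImage)
    {
        if (!glEGLImageTargetTexture2DOES_ptr) {
            glEGLImageTargetTexture2DOES_ptr = (PFNGLEGLIMAGETARGETTEXTURE2DOESPROC)
                eglGetProcAddress("glEGLImageTargetTexture2DOES");
        }
        glBindTexture(GL_TEXTURE_EXTERNAL_OES, tex);
        // Zero-copy: the texture now samples directly from the EGLImage storage.
        glEGLImageTargetTexture2DOES_ptr(GL_TEXTURE_EXTERNAL_OES, (GLeglImageOES)eglImage);
        glTexParameteri(GL_TEXTURE_EXTERNAL_OES, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
        glTexParameteri(GL_TEXTURE_EXTERNAL_OES, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    }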

*** The ideal solution would enable us to map this decompressed RGB data into an OpenGL texture; is there a way to do that?

  3. CUDA format conversion: On the desktop systems, the cuvid hardware decoders write YUV output into a GPU buffer and we use a CUDA kernel to convert it to RGBA, placing the result into a GPU pixel-buffer object. That lets us render at rate. On the Tegra, we implemented the same approach: CUDA code that registers and binds both the EGLImage and the pixel-buffer object (keeping a cache of each set of registrations to avoid slowing down each frame). When we run this without synchronization, we can operate at the theoretical limit for JPEG images, but we end up displaying textures that have not yet been completely filled in; there seems to be no implicit synchronization between CUDA and OpenGL on the Tegra.

     On the desktop, the synchronization seems to happen implicitly: we do not synchronize the threads after writing each frame into its PBO, yet the rendered images do not exhibit the tearing we see on the Tegras (we followed the example in videoDecodeGL.cpp from the cuda-8.0/samples directory, which does no explicit synchronization). We use the same operations in the same order on both desktop and Tegra: cuGLMapBufferObject, kernel <<< grid, block >>>, cuGLUnmapBufferObject, then glBindBuffer of the PBO to GL_PIXEL_UNPACK_BUFFER, glBindTexture of the target texture, and glTexSubImage2D. The texture IDs are stored in a cache and bound to texture units for use in shaders some time later (unless decompression is not keeping up, in which case they are used right away). It seems like cuGLUnmapBufferObject should be performing this synchronization (per http://horacio9573.no-ip.org/cuda/group__CUDART__OPENGL__DEPRECATED_ge0b087bcfe521b66fe21d5845f46e59f.html). Switching from the deprecated cuGLMap/Unmap calls to the cudaGraphicsResource API did not change the behavior. A sketch of the per-frame sequence follows.
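
The sketch below reconstructs that sequence with the non-deprecated cudaGraphics* API; the function and variable names (convertFrame, pboResource, and the nv12_to_rgba kernel sketched earlier) are illustrative rather than our actual code, and error checking is omitted.

    // Per-frame sketch of the CUDA -> PBO -> texture path (illustrative identifiers).
    // Assumes <cuda_gl_interop.h> plus a GL loader; 'pboResource' was registered once
    // with cudaGraphicsGLRegisterBuffer() and cached, as described above.
    void convertFrame(cudaGraphicsResource_t pboResource, GLuint pbo, GLuint tex,
                      const unsigned char* yPlane, const unsigned char* uvPlane,
                      int width, int height, int pitch)
    {
        unsigned char* devPtr = nullptr;
        size_t size = 0;

        cudaGraphicsMapResources(1, &pboResource, 0);
        cudaGraphicsResourceGetMappedPointer((void**)&devPtr, &size, pboResource);

        dim3 block(16, 16);
        dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
        nv12_to_rgba<<<grid, block>>>(yPlane, uvPlane, devPtr, width, height, pitch);

        cudaGraphicsUnmapResources(1, &pboResource, 0);   // we expected this call to synchronize

        // Back on the GL side: stream the PBO contents into the cached texture.
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
        glBindTexture(GL_TEXTURE_2D, tex);
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0);
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
    }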

*** Is there a way to enable implicit, per-buffer, CUDA->OpenGL synchronization on the Tegras as happens on the desktop? We’re rendering new images at 30fps and decompressing at 75fps, so in steady state there should be plenty of time to get ahead with the decompression.

When we add cuCtxSynchronize() after each batch of conversions for a given frame, the conversion rate drops sharply and the synchronization becomes the limiting step. Because we do this final decompression step in a single thread (required because that is the rendering thread owning the OpenGL context the PBOs and textures live in), it slows down rendering too much.
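
A finer-grained alternative would be to replace the whole-context cuCtxSynchronize() with one CUDA event per cached PBO, recorded after that buffer's conversion and checked just before its texture is used, so the rendering thread blocks on at most one frame. This is only a sketch with illustrative names; we have not verified that it restores full rate on the TX2.

    // Sketch: per-buffer synchronization with CUDA events instead of cuCtxSynchronize().
    // One event per cached PBO, created with cudaEventCreateWithFlags(&ev, cudaEventDisableTiming).

    // Right after launching the conversion kernel for this frame's PBO:
    cudaEventRecord(frameDoneEvent, 0);

    // Later, on the rendering thread, just before binding this frame's texture:
    if (cudaEventQuery(frameDoneEvent) == cudaErrorNotReady) {
        // Not finished yet: either skip the frame this iteration or wait on just this one event.
        cudaEventSynchronize(frameDoneEvent);
    }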

Failing a solution to either question above, the only remaining approach seems to be to do the decoding in separate threads, so that each context synchronization blocks only one thread. That requires multi-threaded writing to the OpenGL pixel-buffer objects, which presumably means mapping each CUDA context to its own OpenGL context and giving each thread its own OpenGL context with textures shared between them.
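
Roughly, the context setup for that fallback would look like the following (GLX shown purely for illustration; this is an untested sketch, not something we have running).

    // Sketch: one GL context per decode thread, sharing objects with the render context.
    // 'dpy' and 'vis' come from the existing window setup; untested on the TX2.
    #include <GL/glx.h>

    void createSharedContexts(Display* dpy, XVisualInfo* vis,
                              GLXContext* renderCtx, GLXContext* decodeCtx)
    {
        *renderCtx = glXCreateContext(dpy, vis, nullptr, True);      // main/render thread
        *decodeCtx = glXCreateContext(dpy, vis, *renderCtx, True);   // shares textures/PBOs with it
    }

    // In each decode thread:
    //   glXMakeCurrent(dpy, decodeDrawable, decodeCtx);
    //   register PBOs with CUDA, run conversions; cuCtxSynchronize() then blocks only this thread.
    // The rendering thread keeps renderCtx current and consumes the shared texture IDs.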

Hi,
We are working on unifying the interfaces of desktop GPUs and Tegra SoCs, but it will take significant time and effort to achieve that goal. As of now there can be issues running OpenGL. There are some other posts discussing this:
https://devtalk.nvidia.com/default/topic/1025021/jetson-tx1/screen-tearing-when-dual-monitor/post/5218402/#5218402

The current status is that OpenGL does not run well on Tegra, and we suggest using EGL/GLES.

Thank you for the clarity of your answer. I’m looking forward to seeing you achieve your goal of unified interfaces, and of full OpenGL support on Tegras.

@DaneLLL In trying to port our code to GLES, I’m realizing that we’re using a lot of GLES3 features. I see that there is a GLES3 include directory and I see from other postings that you can include those and link against the GLES2 library, but the GL_TEXTURE_EXTERNAL_OES definition and related function pointers only seem to be available in GLES2’s header files and not in 3’s.

I’m assuming that this means I’m going to have to back-port to only GLES2 functionality during the OpenGL-GLES port (yuck!). If there is another way to use GLES3 and still get the required functions to pull EGL buffers into textures, I’m all ears!

Hi nvidia4v64h,

Please refer to the SW section of the L4T documentation for the GLES versions we support.

https://developer.nvidia.com/embedded/dlc/l4t-documentation-28-1

Separately: Why do you need CUDA to do YUV -> RGB conversion?
Can you upload the YUV texture data without conversion, and convert in the GL(ES) shader?
That seems like it would cut out an entire step.
(Especially if you get planar YUV, in which case you can sample three separate textures and still use hardware texture filtering!)
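
Something like this, for example (an untested sketch, assuming full-range planar YUV in three single-channel textures, BT.601 coefficients, and GLES3):

    // Sketch: fragment shader doing the YUV -> RGB conversion at sample time.
    // Assumes Y, U, V planes bound as three textures on units 0/1/2.
    static const char* kYuvFragmentShader = R"(
        #version 300 es
        precision mediump float;
        uniform sampler2D texY;
        uniform sampler2D texU;
        uniform sampler2D texV;
        in vec2 uv;
        out vec4 fragColor;
        void main() {
            float y = texture(texY, uv).r;
            float u = texture(texU, uv).r - 0.5;
            float v = texture(texV, uv).r - 0.5;
            fragColor = vec4(y + 1.402 * v,
                             y - 0.344 * u - 0.714 * v,
                             y + 1.772 * u,
                             1.0);
        }
    )";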

@snarky There is actually special hardware on the TX2 that will do the format conversion. The hard part for us is getting the image (in any format) from the decoder output in GPU space into a PBO or other GPU object usable by OpenGL (not ES), because that is what our code is written in. Failing that, we can use the existing GLES 2.0 examples and port all of our code to GLES for the Tegra while keeping OpenGL for desktop and laptop systems. Given that the TX2 supports OpenGL, and our code already compiles and runs on it, we were hoping for a solution directly compatible with OpenGL that avoided copying from CPU space back into GPU space.