Questions below are on lines starting with ***:
We are porting an application that is deployed on desktop Linux to also run on the Tegra TX2 platform. The application is written in OpenGL (not ES) and uses CUDA to perform image analysis. It reads in video streams (streams of JPEG images, H.264, and H.265), stitches them, and renders the result at 30 fps. To maintain this rate, we do predictive decompression ahead of the current rendering time on multiple streams.
To maintain sufficient throughput, the desktop application uses multiple simultaneous decompression threads that share the decoding hardware and produce YUV output buffers. We then use CUDA on the main (rendering) thread to convert these to RGB stored in pixel-buffer objects (PBOs), transfer each PBO into a texture, and render from the textures. In steady state, rendering from a texture uses frames that were decompressed in previous iterations. On desktop and laptop systems (GeForce 1060 and up), we can decompress and convert two 4K frames at 30 fps on a laptop and five 4K frames at 30 fps on a desktop with a GTX 1080 card.
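For concreteness, the conversion step is conceptually like the sketch below (simplified, not our production kernel; it assumes NV12 input and approximate BT.601 coefficients, and the output pointer is the PBO mapped into CUDA):

// Simplified sketch of the YUV -> RGBA conversion step (illustrative only).
// Assumes NV12 layout; 'rgba' is the device pointer obtained by mapping the PBO.
#include <cuda_runtime.h>
#include <cstdint>

__device__ __forceinline__ unsigned char clamp_u8(float v)
{
    return (unsigned char)fminf(fmaxf(v, 0.0f), 255.0f);
}

__global__ void nv12_to_rgba(const uint8_t* __restrict__ y_plane,
                             const uint8_t* __restrict__ uv_plane,
                             size_t y_pitch, size_t uv_pitch,
                             uchar4* __restrict__ rgba,
                             int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float Y = (float)y_plane[y * y_pitch + x];
    const uint8_t* uv = uv_plane + (y / 2) * uv_pitch + (x / 2) * 2;
    float U = (float)uv[0] - 128.0f;
    float V = (float)uv[1] - 128.0f;

    rgba[y * width + x] = make_uchar4(clamp_u8(Y + 1.402f * V),              // R
                                      clamp_u8(Y - 0.344f * U - 0.714f * V), // G
                                      clamp_u8(Y + 1.772f * U),              // B
                                      255);                                  // A
}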
We have the OpenGL rendering code working on the Tegra and everything running at rate except the predictive decompression into textures. We have tried several paths to get there, each of which gets us most of the way:
- EGLImage to CPU memory, PBO copy: We can use pixel-buffer objects to copy the CPU-mapped EGL buffers into textures. This works, but (as happens on the desktop implementation when we do CPU-based decompression) the data transfer time for the uncompressed RGB buffers takes the bulk of the time. For the desktop case we switched to passing the compressed data to GPU memory and never bringing it back, which lets us run at rate; with the memory-to-memory copy we cannot decompress fast enough. This was expected, but we had hoped the shared memory space on the Tegra might make it faster. This approach was the first we implemented (because it was easiest) but was never expected to run at rate.
- EGLImage to texture: The NvEglRenderer code in the Tegra Multimedia API sample programs performs zero-copy updates to textures from EGLImage structures using the GL_TEXTURE_EXTERNAL_OES and glEGLImageTargetTexture2DOES texture binding, which is only available in GLES (GLES/gl2ext.h) and not in OpenGL on the Tegras (and seems unavailable in OpenGL generally, according to https://www.khronos.org/registry/OpenGL/extensions/OES/OES_EGL_image.txt). This blocks us from using this approach to get the images into OpenGL textures.
*** The ideal solution would enable us to map this decompressed RGB data into an OpenGL texture; is there a way to do that?
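For reference, the zero-copy binding we mean is roughly the following GLES-only sequence, modeled on the NvEglRenderer sample (a sketch; here eglImage stands for the EGLImageKHR wrapping the decoder's output buffer). This is exactly what we cannot express in desktop OpenGL on the TX2:

// GLES-only zero-copy binding of an EGLImage to a texture (sketch based on the
// NvEglRenderer sample; eglImage wraps the decoder's output buffer).
#include <GLES2/gl2.h>
#include <GLES2/gl2ext.h>
#include <EGL/egl.h>
#include <EGL/eglext.h>

void bind_decoded_frame(EGLImageKHR eglImage, GLuint tex)
{
    // The entry point must be resolved at runtime through EGL.
    static PFNGLEGLIMAGETARGETTEXTURE2DOESPROC pglEGLImageTargetTexture2DOES =
        (PFNGLEGLIMAGETARGETTEXTURE2DOESPROC)eglGetProcAddress("glEGLImageTargetTexture2DOES");

    glBindTexture(GL_TEXTURE_EXTERNAL_OES, tex);
    pglEGLImageTargetTexture2DOES(GL_TEXTURE_EXTERNAL_OES, (GLeglImageOES)eglImage);
    // The texture is then sampled through a samplerExternalOES uniform in the shader.
}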
- CUDA format conversion: On the desktop systems, the cuvid hardware decoders write YUV output into a GPU buffer, and we use a CUDA kernel to convert it to RGBA and place the result into a GPU pixel-buffer object; this lets us render at rate. On the Tegra we implemented the same approach: CUDA code that registers and maps both the EGLImage and the pixel-buffer object (keeping a cache of each set of registrations so we do not pay that cost every frame). When we run this without synchronization, we can operate at the theoretical limit for JPEG images, but we end up displaying textures that have not yet been completely filled in; there appears to be no implicit synchronization between CUDA and OpenGL on the Tegra. On the desktop the synchronization seems to happen implicitly: we do not synchronize the threads after each kernel writes into its PBO, yet the rendered images do not exhibit the tearing we see on the Tegras (we followed the example in videoDecodeGL.cpp from the cuda-8.0/samples directory, which does no explicit synchronization). We use the same operations in the same order on both desktop and Tegra: cuGLMapBufferObject, kernel <<< grid, block >>>, cuGLUnmapBufferObject, then glBindBuffer of the PBO to GL_PIXEL_UNPACK_BUFFER, glBindTexture of the texture, and glTexSubImage2D. The texture IDs are stored in a cache and some time later are bound to texture units for use in shaders (unless decompression is not keeping up, in which case they are used right away). It seems like cuGLUnmapBufferObject should be performing this synchronization (per http://horacio9573.no-ip.org/cuda/group__CUDART__OPENGL__DEPRECATED_ge0b087bcfe521b66fe21d5845f46e59f.html). Switching from the deprecated cuGLMap/Unmap calls to the cudaGraphicsResource API did not change the behavior.
*** Is there a way to enable implicit, per-buffer CUDA->OpenGL synchronization on the Tegras, as happens on the desktop? We're rendering new images at 30 fps and decompressing at 75 fps, so in steady state there should be plenty of time to get ahead with the decompression.
When we add cuCtxSynchronize() after each batch of conversions for a given frame, the conversion rate drops sharply and the synchronization becomes the limiting step. Because we do the final decompression step in a single thread (required because it is the rendering thread that owns the OpenGL context the PBOs and textures live in), this slows rendering down too much.
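To make the ordering concrete, the per-frame sequence we run looks roughly like the sketch below, shown with the non-deprecated interop API (hypothetical names, no error checking; the resource, PBO, and texture handles come from our registration cache). The comment in step 3 marks where the synchronization we are trying to avoid currently has to go:

// Per-frame conversion and upload (sketch; the deprecated cuGLMapBufferObject
// path behaves identically for us).
#define GL_GLEXT_PROTOTYPES
#include <GL/gl.h>
#include <GL/glext.h>
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

void convert_and_upload(cudaGraphicsResource_t pboResource, GLuint pbo, GLuint tex,
                        int width, int height)
{
    // 1. Map the registered PBO into CUDA and get a device pointer.
    uchar4* rgbaDevPtr = nullptr;
    size_t size = 0;
    cudaGraphicsMapResources(1, &pboResource, 0);
    cudaGraphicsResourceGetMappedPointer((void**)&rgbaDevPtr, &size, pboResource);

    // 2. Launch the YUV -> RGBA kernel into the PBO (reading from the EGLImage
    //    that was registered/mapped into CUDA), e.g.:
    // nv12_to_rgba<<<grid, block>>>(yPlane, uvPlane, yPitch, uvPitch,
    //                               rgbaDevPtr, width, height);

    // 3. Unmap. On the desktop this appears to be enough to order the kernel
    //    before the later GL reads of the buffer; on the TX2 the texture can be
    //    consumed before the kernel finishes unless we also synchronize here
    //    (cuCtxSynchronize()/cudaDeviceSynchronize()), which is what kills our rate.
    cudaGraphicsUnmapResources(1, &pboResource, 0);

    // 4. Upload from the PBO into the cached texture.
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                    GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
}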
Failing a solution to either question above, the only approach that seems to remain is to do the decoding in separate threads, so that each context synchronization blocks only one thread. That would require writing to the OpenGL pixel-buffer objects from multiple threads, which presumably means giving each CUDA context its own OpenGL context, one per thread, with the textures shared between them.
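If we end up going that route, the context setup we have in mind looks roughly like the sketch below (GLX share lists; the function is hypothetical glue code, not something we have running, and the attribute values are illustrative):

// Sketch: one OpenGL context per decode thread, created with the render thread's
// context as the share list so texture and buffer names are visible to both.
#include <GL/glx.h>

GLXContext create_worker_context(Display* dpy, GLXFBConfig fbconfig,
                                 GLXContext renderContext)
{
    const int attribs[] = {
        GLX_CONTEXT_MAJOR_VERSION_ARB, 4,
        GLX_CONTEXT_MINOR_VERSION_ARB, 5,
        None
    };
    PFNGLXCREATECONTEXTATTRIBSARBPROC pglXCreateContextAttribsARB =
        (PFNGLXCREATECONTEXTATTRIBSARBPROC)glXGetProcAddress(
            (const GLubyte*)"glXCreateContextAttribsARB");

    // renderContext is passed as the share-list context, so PBOs and textures
    // filled by this worker are usable from the render thread's context.
    return pglXCreateContextAttribsARB(dpy, fbconfig, renderContext,
                                       True /* direct rendering */, attribs);
}

Each decode thread would make its own context current, run its CUDA conversion, and call cuCtxSynchronize() there, so the stall is confined to that worker rather than the rendering thread; we would rather avoid the extra contexts if either question above has a better answer.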