Encoding + displaying not keeping up (NVENC, OpenGL, CUDA)

I’m writing a Windows application that processes, encodes to H.264 and displays live images acquired from a camera at 60Hz. I’m using a Quadro P400. I need the latency from acquisition to display to be minimal.

The images uploaded to the GPU are 1920x1080, Bayer format. I’m using the NPP primitives to debayer and upscale the images to 4K (part of my requirements). I’m using OpenGL for the display.

Functionally the app is correct (images are displayed and encoded correctly) but the frame period is around 22ms, meaning it’s dropping frames.

I’m looking for some advice on how to restructure the application in order to improve performance.

I have profiled the app using the Nsight profiler in Visual Studio and I can see that the kernels run at 100% occupancy. The memcpy to the OpenGL texture (cudaMemcpyToArray()) is substantially slower than the plain cudaMemcpy() calls. Can this be improved?

There are a number of GPU idle periods that I don’t understand. Profiler trace available here: https://we.tl/DdDMU6bbw3. Look at time 5.2 ms for the beginning of a frame that takes 21.7 ms (it starts with an H->D memcpy).

Commenting out the display makes the GPU idle times go away. Is the interaction with OpenGL causing the GPU to stall for some reason?

All the code runs in a single thread and there’s only one CUDA Stream. Would using more Streams and / or more threads help?

Removing the call to SwapBuffers() allows the GPU to be 100% utilized. Is there a fundamental problem with the CUDA / GL interop causing this?

Pseudo-code for the main loop of the app is shown below.

  • The host buffer is pinned memory.
  • nv_enc.Encode() takes a copy of the input RGBA image and then calls NVENC::nvEncEncodePicture(). The NVENC API internally converts the image from RGBA to YUV before sending it to the encoder.
  • gl_window.SetImageFromCUDA copies the image into texture memory using cudaMemcpyToArray(), renders it onto a quad and calls SwapBuffers(). The cudaArray pointer has been retrieved using the cuGraphicsMapResources() function as shown in this example https://github.com/nvpro-samples/gl_cuda_interop_pingpong_st.
  • The display is synced to VSync.

      for (int i = 0; i < N; ++i) {
         // Copy Lena to GPU (h_lena is pinned host memory)
         __cuda(cudaMemcpy(d_lena, h_lena, lena_size, cudaMemcpyHostToDevice));

         // Debayer using NPP
         __npp(nppiCFAToRGBA_8u_C1AC4R(d_lena, nSrcStep, oSrcSize, oSrcROI, d_rgba, nDstStep, eGrid, eInterpolation, nAlpha));

         // Upscale to 4K using NPP
         __npp(nppiResize_8u_C4R(d_rgba, nSrcStep, oSrcSize, oSrcROI, d_upscaled_rgba, nDstStep, oDstSize, oDstROI, eInterpolation));

         // Encode (copies the RGBA image, then calls NVENC::nvEncEncodePicture())
         nv_enc.Encode(d_upscaled_rgba);

         // Display (cudaMemcpyToArray() into the GL texture, render, SwapBuffers())
         gl_window.SetImageFromCUDA(d_upscaled_rgba);
      }

      // lStart / lEnd and lFreq come from QueryPerformanceCounter() / QueryPerformanceFrequency()
      double elapsedTime = (double)(lEnd - lStart);
      printf("Average Processing Time : %6.2fms\n", ((elapsedTime * 1000.0) / numFramesEncoded) / lFreq);

Hi Jean-Phillipe,

SwapBuffers() will block until the frame can be queued for display. With vsync enabled this causes idle time, since SwapBuffers() only returns once the frame has actually been displayed. Even with vsync disabled, the driver will only buffer a limited number of frames ahead, so you can still see the GPU go idle while the call blocks.

I think a clean approach would be to have a separate rendering thread which will alternate between rendering two textures. The texture that is not currently being rendered can then be updated from your main thread using cudaMemcpyToArray(). You’ll need some basic signalling to make sure the rendering and main thread are synchronized and not attempting to access resources that are currently in use by the other thread.

On the memcpy vs. memcpyToArray issue: the sample you referenced uses cudaMemcpyToArray() because it needs to address a volume texture. From my understanding you are using a simple 2D texture for display, so you should be fine registering a pixel buffer object (PBO) instead and using cuGraphicsResourceGetMappedPointer and a plain cudaMemcpy. Check out the SimpleTexture3D sample of the CUDA SDK for an example of the interop pattern.
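A sketch of that PBO path, using the runtime-API equivalents (cudaGraphicsResourceGetMappedPointer() instead of the driver-API cuGraphicsResourceGetMappedPointer). This is a fragment, not a compilable program: it assumes an existing GL context, a 2D texture `tex`, a buffer `pbo` registered once with cudaGraphicsGLRegisterBuffer(), and the `d_upscaled_rgba`/`width`/`height` names from your loop; error checking is omitted.

```cpp
// Per frame: map the registered PBO and copy with a plain linear memcpy
// instead of cudaMemcpyToArray().
cudaGraphicsMapResources(1, &res, 0);
void*  d_pbo = nullptr;
size_t sz    = 0;
cudaGraphicsResourceGetMappedPointer(&d_pbo, &sz, res);
cudaMemcpy(d_pbo, d_upscaled_rgba, width * height * 4, cudaMemcpyDeviceToDevice);
cudaGraphicsUnmapResources(1, &res, 0);

// Then let GL pull the pixels from the PBO into the texture
// (the nullptr data pointer means "read from the bound unpack buffer").
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBindTexture(GL_TEXTURE_2D, tex);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
```

The device-to-device copy into a linear buffer avoids the slow path you measured with cudaMemcpyToArray(); the PBO-to-texture transfer then happens on the GL side.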