I’m writing a Windows application that processes, H.264-encodes, and displays live images acquired from a camera at 60 Hz, running on a Quadro P400. I need the latency from acquisition to display to be minimal.
The images uploaded to the GPU are 1920x1080 in Bayer format. I’m using the NPP primitives to debayer the images and upscale them to 4K (part of my requirement), and OpenGL for the display.
Functionally the app is correct (images are displayed and encoded correctly), but the frame period is around 22ms, well over the 16.7ms budget at 60Hz, so it’s dropping frames.
I’m looking for some advice on how to restructure the application in order to improve performance.
I have profiled the app using the Nsight profiler in Visual Studio and I can see that the kernels run at 100% occupancy. However, the memcpy into the OpenGL texture (cudaMemcpyToArray()) is substantially slower than the plain cudaMemcpy() calls. Can this be improved?
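One thing I could try on the copy side: cudaMemcpyToArray() is deprecated in recent CUDA releases, and the documented replacement is cudaMemcpy2DToArray(), which takes an explicit row pitch. A minimal sketch of what that call would look like for my 4K RGBA frame (the array/pointer/pitch names here are placeholders for my buffers):

```cpp
// Sketch: replacing the deprecated cudaMemcpyToArray() with the 2D variant.
// dstArray is the cudaArray mapped from the GL texture; d_upscaled_rgba and
// srcPitch stand in for my device buffer and its row pitch in bytes.
cudaMemcpy2DToArray(dstArray,
                    0, 0,                      // x/y offset into the array
                    d_upscaled_rgba,           // source device pointer
                    srcPitch,                  // source row pitch in bytes
                    3840 * 4,                  // row width in bytes (4K RGBA8)
                    2160,                      // number of rows
                    cudaMemcpyDeviceToDevice);
```

The async variant, cudaMemcpy2DToArrayAsync(), would additionally let the copy queue into my stream instead of potentially synchronizing.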
There are a number of idle times for the GPU which I do not understand. Profiler trace available here: https://we.tl/DdDMU6bbw3. Look at time 5.2ms for the beginning of a frame which takes 21.7ms (starts with a H->D memcpy).
Commenting out the display makes the GPU idle times go away. Is the interaction with OpenGL causing the GPU to stall for some reason?
All the code runs in a single thread and there’s only one CUDA stream. Would using more streams and/or more threads help?
EDIT
Removing the call to SwapBuffers() allows the GPU to be 100% utilized. Is there a fundamental problem with the CUDA/GL interop causing this?
Pseudo-code for the main loop of the app is shown below.
- The host buffer is pinned memory.
- nv_enc.Encode() takes a copy of the input RGBA image and then calls NVENC::nvEncEncodePicture(). The NVENC API internally converts the image from RGBA to YUV before sending it to the encoder.
- gl_window.SetImageFromCUDA() copies the image into texture memory using cudaMemcpyToArray(), renders it onto a quad and calls SwapBuffers(). The cudaArray pointer was retrieved using cuGraphicsMapResources(), as shown in this example: https://github.com/nvpro-samples/gl_cuda_interop_pingpong_st.
- The display is synced to VSync.
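For clarity, SetImageFromCUDA() follows the pattern from the linked sample. Roughly, with error checks removed and names mine (I’ve written it with the runtime-API names, which mirror the cu* driver calls I actually use):

```cpp
// Sketch of SetImageFromCUDA(). 'resource' was registered once with
// cudaGraphicsGLRegisterImage() against the GL texture.
cudaGraphicsMapResources(1, &resource, 0);
cudaArray_t texArray = nullptr;
cudaGraphicsSubResourceGetMappedArray(&texArray, resource, 0, 0);
// This is the copy that shows up as slow in the Nsight trace:
cudaMemcpyToArray(texArray, 0, 0, d_upscaled_rgba,
                  3840 * 2160 * 4, cudaMemcpyDeviceToDevice);
cudaGraphicsUnmapResources(1, &resource, 0);
RenderQuad();      // draw the textured quad
SwapBuffers(hdc);  // blocks when VSync is on
```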
NvQueryPerformanceCounter(&lStart);
for (int i = 0; i < N; ++i) {
    // Copy Lena to GPU
    __cuda(cudaMemcpy(d_lena, h_lena, lena_size, cudaMemcpyHostToDevice));
    // Debayer using NPP
    __npp(nppiCFAToRGBA_8u_C1AC4R(d_lena, nSrcStep, oSrcSize, oSrcROI, d_rgba, nDstStep, eGrid, eInterpolation, nAlpha));
    // Upscale to 4K using NPP
    __npp(nppiResize_8u_C4R(d_rgba, nSrcStep, oSrcSize, oSrcROI, d_upscaled_rgba, nDstStep, oDstSize, oDstROI, eInterpolation));
    // Encode
    nv_enc.Encode(d_upscaled_rgba);
    // Display
    gl_window.SetImageFromCUDA(d_upscaled_rgba);
    ++numFramesEncoded;
}
NvQueryPerformanceCounter(&lEnd);
NvQueryPerformanceFrequency(&lFreq);
double elapsedTicks = (double)(lEnd - lStart);
// ticks / freq = seconds; * 1000 = ms; / frames = ms per frame
printf("Average Processing Time : %6.2fms\n", ((elapsedTicks * 1000.0) / numFramesEncoded) / lFreq);