I’m writing a Windows application that processes, H.264-encodes, and displays live images acquired from a camera at 60 Hz, running on a Quadro P400. I need the latency from acquisition to display to be minimal.
The images uploaded to the GPU are 1920x1080 in Bayer format. I’m using the NPP primitives to debayer the images and upscale them to 4K (part of my requirement), and OpenGL for the display.
Functionally the app is correct (images are displayed and encoded correctly), but the frame period is around 22ms, well over the 16.7ms budget at 60Hz, so it’s dropping frames.
I’m looking for some advice on how to restructure the application in order to improve performance.
I have profiled the app using the Nsight profiler in Visual Studio and I can see that the kernels run at 100% occupancy. However, the memcpy into the OpenGL texture (cudaMemcpyToArray()) is substantially slower than the plain cudaMemcpy() calls. Can this be improved?
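One thing I could try on the copy side: cudaMemcpyToArray() is deprecated in recent CUDA releases, and the documented replacement is cudaMemcpy2DToArray(), which takes an explicit row pitch. A minimal sketch of what that call would look like for my 4K RGBA frame (the array/pointer/pitch names here are placeholders for my buffers):

```cpp
// Sketch: replacing the deprecated cudaMemcpyToArray() with the 2D variant.
// dstArray is the cudaArray mapped from the GL texture; d_upscaled_rgba and
// srcPitch stand in for my device buffer and its row pitch in bytes.
cudaMemcpy2DToArray(dstArray,
                    0, 0,                      // x/y offset into the array
                    d_upscaled_rgba,           // source device pointer
                    srcPitch,                  // source row pitch in bytes
                    3840 * 4,                  // row width in bytes (4K RGBA8)
                    2160,                      // number of rows
                    cudaMemcpyDeviceToDevice);
```

The async variant, cudaMemcpy2DToArrayAsync(), would additionally let the copy queue into my stream instead of potentially synchronizing.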
There are a number of idle times for the GPU which I do not understand. Profiler trace available here: https://we.tl/DdDMU6bbw3. Look at time 5.2ms for the beginning of a frame which takes 21.7ms (starts with a H->D memcpy).
Commenting out the display makes the GPU idle times go away. Is the interaction with OpenGL causing the GPU to stall for some reason?
All the code runs in a single thread and there’s only one CUDA stream. Would using more streams and/or more threads help?
EDIT
Removing the call to SwapBuffers() allows the GPU to be 100% utilized. Is there a fundamental problem with the CUDA/GL interop causing this?
Pseudo-code for the main loop of the app is shown below.
- The host buffer is pinned memory.
- nv_enc.Encode() takes a copy of the input RGBA image and then calls NVENC::nvEncEncodePicture(). The NVENC API internally converts the image from RGBA to YUV before sending it to the encoder.
- gl_window.SetImageFromCUDA() copies the image into texture memory using cudaMemcpyToArray(), renders it onto a quad and calls SwapBuffers(). The cudaArray pointer was retrieved using cuGraphicsMapResources(), as shown in this example: https://github.com/nvpro-samples/gl_cuda_interop_pingpong_st.
- The display is synced to VSync.
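For clarity, SetImageFromCUDA() follows the pattern from the linked sample. Roughly, with error checks removed and names mine (I’ve written it with the runtime-API names, which mirror the cu* driver calls I actually use):

```cpp
// Sketch of SetImageFromCUDA(). 'resource' was registered once with
// cudaGraphicsGLRegisterImage() against the GL texture.
cudaGraphicsMapResources(1, &resource, 0);
cudaArray_t texArray = nullptr;
cudaGraphicsSubResourceGetMappedArray(&texArray, resource, 0, 0);
// This is the copy that shows up as slow in the Nsight trace:
cudaMemcpyToArray(texArray, 0, 0, d_upscaled_rgba,
                  3840 * 2160 * 4, cudaMemcpyDeviceToDevice);
cudaGraphicsUnmapResources(1, &resource, 0);
RenderQuad();      // draw the textured quad
SwapBuffers(hdc);  // blocks when VSync is on
```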
NvQueryPerformanceCounter(&lStart);
for (int i = 0; i < N; ++i) {
    // Copy Lena to GPU
    __cuda(cudaMemcpy(d_lena, h_lena, lena_size, cudaMemcpyHostToDevice));
    // Debayer using NPP
    __npp(nppiCFAToRGBA_8u_C1AC4R(d_lena, nSrcStep, oSrcSize, oSrcROI, d_rgba, nDstStep, eGrid, eInterpolation, nAlpha));
    // Upscale to 4K using NPP
    __npp(nppiResize_8u_C4R(d_rgba, nSrcStep, oSrcSize, oSrcROI, d_upscaled_rgba, nDstStep, oDstSize, oDstROI, eInterpolation));
    // Encode
    nv_enc.Encode(d_upscaled_rgba);
    // Display
    gl_window.SetImageFromCUDA(d_upscaled_rgba);
    ++numFramesEncoded;
}
NvQueryPerformanceCounter(&lEnd);
NvQueryPerformanceFrequency(&lFreq);
double elapsedTicks = (double)(lEnd - lStart);
// ticks / freq = seconds; * 1000 = ms; / frames = ms per frame
printf("Average Processing Time : %6.2fms\n", ((elapsedTicks * 1000.0) / numFramesEncoded) / lFreq);