NVDEC - Post decode performance issue

Hi,
I’m trying to integrate HW decoding capabilities into my video player, so I’m using CUDA 10.1 and the latest Video Codec SDK.
Currently I’m able to decode H264/H265 video streams, but I’m not satisfied with the CPU performance.

My test case is a matrix of 40 independent video players, where each instance plays the same live stream: H264 704x576 @ 15 FPS.

In that situation, the CPU usage climbs up to 90% of my i7-8750H.

As far as I understand, there’s nothing wrong with the decoding procedure itself. In fact, if I comment out the “Picture display” stage (NV12 CUDA_MEMCPY2D/Nv12ToColor32), the CPU usage drops to 12-15%.

Any ideas? Thank you!

Can anyone help with my request, please?

Hi fcetrini,
We are trying to replicate this behavior internally. Can you provide more details: OS, GPU used, driver version, etc.? Have you tried profiling the application to see whether any specific API/function call is consuming higher CPU cycles?

For future reference, this issue is tracked internally as 200614071.

Thanks.

Hi mandar_godse, thank you for your reply.

My dev machine: i7-8750H / GeForce GTX 1050 / 16 GB RAM
I’m on the 445.87 driver at the moment; I tried older ones before with the same results.

Let me describe my “Picture display” stage in more detail:

  1. cuGraphicsMapResources + cuGraphicsSubResourceGetMappedArray
  2. cuMemAlloc + cuMemcpy2D (3 planes)
  3. cuMemAlloc + Nv12ToColor32 + cuMemcpy2D to mapped array

These are my results (40 instances of H264 704*576 @ 15 FPS stream):

  • Decoding and rendering: 80% CPU
  • Commenting out step 3: 55% CPU
  • Commenting out steps 3+2: 17% CPU
  • Commenting out steps 3+2+1: 11% CPU

Is there something I’m doing sooo wrong?
Thanks.

cuMemAlloc + Nv12ToColor32 + cuMemcpy2D to mapped array

Are you actually allocating new buffers per frame, and then using the synchronous copy functions? You should avoid doing either: reuse pre-allocated buffers, and use async operations on a dedicated CUDA stream (created with CU_STREAM_NON_BLOCKING), with only a single GPU-to-CPU sync point via an explicit cuEventCreate(…, CU_EVENT_BLOCKING_SYNC) -> cuEventRecord -> cuEventSynchronize after the last async copy operation.

Furthermore, ensure you have created your CUDA context with CU_CTX_SCHED_BLOCKING_SYNC; the default of CU_CTX_SCHED_AUTO (which resolves to CU_CTX_SCHED_SPIN) yields horrible performance in any scenario where you have more streams than CPU cores.
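Put together, the setup could look roughly like this (a minimal sketch, not your decoder pipeline: the buffer sizes and the device-to-device copy are placeholders for your actual NV12 frames, and error checking is omitted):

```cpp
#include <cuda.h>
#include <cstddef>

int main() {
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);

    // Context with blocking sync instead of the default spin-wait.
    CUcontext ctx;
    cuCtxCreate(&ctx, CU_CTX_SCHED_BLOCKING_SYNC, dev);

    // Dedicated non-blocking stream plus one reusable "frame done" event.
    CUstream stream;
    CUevent evDone;
    cuStreamCreate(&stream, CU_STREAM_NON_BLOCKING);
    cuEventCreate(&evDone, CU_EVENT_BLOCKING_SYNC | CU_EVENT_DISABLE_TIMING);

    // Allocate ONCE, outside the per-frame loop (sizes are placeholders).
    const size_t pitch = 704, height = 576 * 3 / 2;  // NV12: luma rows + UV rows
    CUdeviceptr dSrc, dDst;
    cuMemAlloc(&dSrc, pitch * height);
    cuMemAlloc(&dDst, pitch * height);

    // Per frame: enqueue all copies asynchronously on the stream...
    CUDA_MEMCPY2D m = {};
    m.srcMemoryType = CU_MEMORYTYPE_DEVICE;
    m.srcDevice = dSrc;
    m.srcPitch = pitch;
    m.dstMemoryType = CU_MEMORYTYPE_DEVICE;
    m.dstDevice = dDst;
    m.dstPitch = pitch;
    m.WidthInBytes = pitch;
    m.Height = height;
    cuMemcpy2DAsync(&m, stream);

    // ...then a single CPU sync point after the last async operation.
    cuEventRecord(evDone, stream);
    cuEventSynchronize(evDone);  // blocks the thread instead of spinning

    cuMemFree(dSrc);
    cuMemFree(dDst);
    cuEventDestroy(evDone);
    cuStreamDestroy(stream);
    cuCtxDestroy(ctx);
    return 0;
}
```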

Finally, there is an issue with cuGraphicsMapResources: it unfortunately introduces an undesirable sync point between 3D and CUDA. You won’t notice it in terms of CPU utilization, though, just a stalled 3D context. You said your video players are independent, but using cuGraphicsMapResources to map resources in batches can give a huge win when displaying streams in parallel.
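For illustration, the batching could look something like this (a sketch only; `resources` holding every player’s registered resource, and the shared `stream`, are names of my own):

```cpp
#include <cuda.h>
#include <vector>

// Sketch: map ALL players' registered D3D11 resources with a single call,
// instead of one cuGraphicsMapResources call per player per frame.
void copyAllFrames(std::vector<CUgraphicsResource>& resources, CUstream stream) {
    cuGraphicsMapResources((unsigned)resources.size(), resources.data(), stream);

    for (CUgraphicsResource res : resources) {
        CUarray dstArray;
        cuGraphicsSubResourceGetMappedArray(&dstArray, res, 0, 0);
        // ... per-player cuMemcpy2DAsync into dstArray goes here ...
    }

    cuGraphicsUnmapResources((unsigned)resources.size(), resources.data(), stream);
}
```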

Hi Ext3h, thanks for your reply.

Yes, my video players are independent; each instance has its own rendering pipeline, sharing only the same CUDA context (as the documentation suggests). And indeed, the context is created with the CU_CTX_SCHED_BLOCKING_SYNC flag.

That said, can you please explain your suggestion about using a dedicated CUDA stream in more detail?
I’ve never used one before… are there any docs/examples on the subject?

About your last point, what do you mean by “using cuGraphicsMapResources to map resources in batches”?

Thank you again!

Hi again. I’m making some modifications to my code, and I think I’m on the right track.

In my previous code, I was allocating new buffers per frame, copying the NV12, converting NV12->RGB and finally copying the RGB into my back buffer: it was a total mess!

Instead, I’m now creating a single NV12 texture on the rendering side and registering it with cuGraphicsD3D11RegisterResource. After decode/cuvidMapVideoFrame, I try a direct copy of the decoded frame into the NV12 DirectX texture, and then let the pixel shader do its job.

What I’m not managing to do is copy the ENTIRE NV12 frame to the D3D11 texture array in a single shot.
This is the code:

cuGraphicsMapResources(1, &this->cuGraphicResource, this->cuStream);
CUarray dstArray;
cuGraphicsSubResourceGetMappedArray(&dstArray, this->cuGraphicResource, 0, 0);

CUDA_MEMCPY2D m = { 0 };
m.srcMemoryType = CU_MEMORYTYPE_DEVICE;
m.srcDevice = dpSrcFrame;
m.srcPitch = nSrcPitch;
m.dstMemoryType = CU_MEMORYTYPE_ARRAY;
m.dstArray = dstArray;
m.dstY = 0;
m.WidthInBytes = m_nWidth;
m.Height = m_nLumaHeight;
cuMemcpy2D(&m);

cuGraphicsUnmapResources(1, &this->cuGraphicResource, this->cuStream);

That way I’m obviously getting a greenish frame on screen, because I’m copying only the luminance plane, and the CPU usage drops dramatically.
I’ve tried playing with the CUDA_MEMCPY2D parameters, without success.
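For example, one variant I tried adds a second copy for the interleaved UV plane, assuming the chroma rows start right after the luma rows in the mapped array and at nSrcPitch * m_nSurfaceHeight past dpSrcFrame in the decoded frame (both offsets are guesses on my part):

```cpp
// Hypothetical second copy for the interleaved UV plane of the NV12 frame.
// Both the source offset and dstY below are my assumptions, not verified.
CUDA_MEMCPY2D c = { 0 };
c.srcMemoryType = CU_MEMORYTYPE_DEVICE;
c.srcDevice = dpSrcFrame + (size_t)nSrcPitch * m_nSurfaceHeight;  // assumed chroma offset
c.srcPitch = nSrcPitch;
c.dstMemoryType = CU_MEMORYTYPE_ARRAY;
c.dstArray = dstArray;
c.dstY = m_nLumaHeight;           // assumed: UV rows follow the luma rows
c.WidthInBytes = m_nWidth;        // interleaved UV has the same byte width as luma
c.Height = m_nLumaHeight / 2;
cuMemcpy2D(&c);
```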

Any help? Thank you.