NVDEC - Post decode performance issue

Hi,
I’m trying to integrate HW decoding capabilities into my video player, so I’m using CUDA 10.1 and the latest Video Codec SDK.
Currently I’m able to decode H264/H265 video streams, but I’m not satisfied with the CPU performance.

My test case is a matrix of 40 independent video players, where each instance plays the same live stream: H264 704x576 @ 15 FPS.

In that situation, the CPU usage climbs up to 90% of my i7-8750H.

As far as I understand, there’s nothing wrong with the decoding procedure itself. In fact, if I comment out the “Picture display” stage (NV12 CUDA_MEMCPY2D/Nv12ToColor32), the CPU usage drops to 12-15%.

Any ideas? Thank you!

Can anyone help with my request, please?

Hi fcetrini,
We are trying to replicate this behavior internally. Can you provide more details: OS, GPU used, driver version, etc.? Have you tried profiling the application to see whether any specific API/function call is consuming higher CPU cycles?

For future reference, this issue is tracked internally as 200614071.

Thanks.

Hi mandar_godse, thank you for your reply.

My dev machine: i7-8750H / GeForce GTX 1050 / 16 GB RAM
I’m on the 445.87 driver at the moment; I tried older ones before with the same results.

Let me describe my “Picture display” stage in more detail:

  1. cuGraphicsMapResources + cuGraphicsSubResourceGetMappedArray
  2. cuMemAlloc + cuMemcpy2D (3 planes)
  3. cuMemAlloc + Nv12ToColor32 + cuMemcpy2D to mapped array

These are my results (40 instances of H264 704*576 @ 15 FPS stream):

  • Decoding and rendering: 80% CPU
  • Commenting out step 3: 55% CPU
  • Commenting out steps 3+2: 17% CPU
  • Commenting out steps 3+2+1: 11% CPU

Is there something I’m doing sooo wrong?
Thanks.

cuMemAlloc + Nv12ToColor32 + cuMemcpy2D to mapped array

Are you actually allocating new buffers per frame, and then using the synchronous copy functions? You should avoid doing either: reuse pre-allocated buffers, and use async operations on a dedicated CUDA stream (created with CU_STREAM_NON_BLOCKING), with only a single GPU-to-CPU sync point via an explicit cuEventCreate(…, CU_EVENT_BLOCKING_SYNC) -> cuEventRecord -> cuEventSynchronize after the last async copy operation.

Furthermore, ensure you have created your CUDA context with CU_CTX_SCHED_BLOCKING_SYNC; the default of CU_CTX_SCHED_AUTO (which resolves to CU_CTX_SCHED_SPIN) yields horrible performance in any scenario where you have more streams than CPU cores.
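Put together, the setup could look roughly like this (a minimal sketch, not your decoder pipeline: the buffer sizes and the device-to-device copy are placeholders for your actual NV12 frames, and error checking is omitted):

```cpp
#include <cuda.h>
#include <cstddef>

int main() {
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);

    // Context with blocking sync instead of the default spin-wait.
    CUcontext ctx;
    cuCtxCreate(&ctx, CU_CTX_SCHED_BLOCKING_SYNC, dev);

    // Dedicated non-blocking stream plus one reusable "frame done" event.
    CUstream stream;
    CUevent evDone;
    cuStreamCreate(&stream, CU_STREAM_NON_BLOCKING);
    cuEventCreate(&evDone, CU_EVENT_BLOCKING_SYNC | CU_EVENT_DISABLE_TIMING);

    // Allocate ONCE, outside the per-frame loop (sizes are placeholders).
    const size_t pitch = 704, height = 576 * 3 / 2;  // NV12: luma rows + UV rows
    CUdeviceptr dSrc, dDst;
    cuMemAlloc(&dSrc, pitch * height);
    cuMemAlloc(&dDst, pitch * height);

    // Per frame: enqueue all copies asynchronously on the stream...
    CUDA_MEMCPY2D m = {};
    m.srcMemoryType = CU_MEMORYTYPE_DEVICE;
    m.srcDevice = dSrc;
    m.srcPitch = pitch;
    m.dstMemoryType = CU_MEMORYTYPE_DEVICE;
    m.dstDevice = dDst;
    m.dstPitch = pitch;
    m.WidthInBytes = pitch;
    m.Height = height;
    cuMemcpy2DAsync(&m, stream);

    // ...then a single CPU sync point after the last async operation.
    cuEventRecord(evDone, stream);
    cuEventSynchronize(evDone);  // blocks the thread instead of spinning

    cuMemFree(dSrc);
    cuMemFree(dDst);
    cuEventDestroy(evDone);
    cuStreamDestroy(stream);
    cuCtxDestroy(ctx);
    return 0;
}
```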

Finally, there is an issue with cuGraphicsMapResources: it unfortunately introduces an undesirable sync point between 3D and CUDA. You won’t notice it in terms of CPU utilization, though, just a stalled 3D context. You said your video players are independent, but using cuGraphicsMapResources to map resources in batches can give a huge win when displaying streams in parallel.
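For illustration, the batching could look something like this (a sketch only; `resources` holding every player’s registered resource, and the shared `stream`, are names of my own):

```cpp
#include <cuda.h>
#include <vector>

// Sketch: map ALL players' registered D3D11 resources with a single call,
// instead of one cuGraphicsMapResources call per player per frame.
void copyAllFrames(std::vector<CUgraphicsResource>& resources, CUstream stream) {
    cuGraphicsMapResources((unsigned)resources.size(), resources.data(), stream);

    for (CUgraphicsResource res : resources) {
        CUarray dstArray;
        cuGraphicsSubResourceGetMappedArray(&dstArray, res, 0, 0);
        // ... per-player cuMemcpy2DAsync into dstArray goes here ...
    }

    cuGraphicsUnmapResources((unsigned)resources.size(), resources.data(), stream);
}
```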

Hi Ext3h, thanks for your reply.

Yes, my video players are independent; each instance has its own rendering pipeline, sharing only the same CUDA context (as the documentation suggests). And indeed, the context is created with the CU_CTX_SCHED_BLOCKING_SYNC flag.

That said, can you please explain your suggestion about using a dedicated CUDA stream in more detail?
I’ve never used one before… are there any docs/examples on the subject?

About your last point, what do you mean by “using cuGraphicsMapResources to map resources in batches”?

Thank you again!

Hi again. I’m making some modifications to my code, and I think I’m on the right track.

In my previous code, I was allocating new buffers per frame, copying the NV12, converting NV12->RGB and finally copying the RGB into my back buffer: it was a total mess!

Instead, I’m now creating a single NV12 texture on the rendering side and registering it with cuGraphicsD3D11RegisterResource. After decode/cuvidMapVideoFrame, I try a direct copy of the decoded frame into the NV12 DirectX texture, and then let the pixel shader do its job.

What I’m not managing to do is copy the ENTIRE NV12 frame to the D3D11 texture array in a single shot.
This is the code:

cuGraphicsMapResources(1, &this->cuGraphicResource, this->cuStream);
CUarray dstArray;
cuGraphicsSubResourceGetMappedArray(&dstArray, this->cuGraphicResource, 0, 0);

CUDA_MEMCPY2D m = { 0 };
m.srcMemoryType = CU_MEMORYTYPE_DEVICE;
m.srcDevice = dpSrcFrame;
m.srcPitch = nSrcPitch;
m.dstMemoryType = CU_MEMORYTYPE_ARRAY;
m.dstArray = dstArray;
m.dstY = 0;
m.WidthInBytes = m_nWidth;
m.Height = m_nLumaHeight;
cuMemcpy2D(&m);

cuGraphicsUnmapResources(1, &this->cuGraphicResource, this->cuStream);

That way I’m obviously getting a greenish frame on screen, because I’m copying only the luminance plane, and the CPU usage drops dramatically.
I’ve tried playing with the CUDA_MEMCPY2D parameters, without success.
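For example, one variant I tried adds a second copy for the interleaved UV plane, assuming the chroma rows start right after the luma rows in the mapped array and at nSrcPitch * m_nSurfaceHeight past dpSrcFrame in the decoded frame (both offsets are guesses on my part):

```cpp
// Hypothetical second copy for the interleaved UV plane of the NV12 frame.
// Both the source offset and dstY below are my assumptions, not verified.
CUDA_MEMCPY2D c = { 0 };
c.srcMemoryType = CU_MEMORYTYPE_DEVICE;
c.srcDevice = dpSrcFrame + (size_t)nSrcPitch * m_nSurfaceHeight;  // assumed chroma offset
c.srcPitch = nSrcPitch;
c.dstMemoryType = CU_MEMORYTYPE_ARRAY;
c.dstArray = dstArray;
c.dstY = m_nLumaHeight;           // assumed: UV rows follow the luma rows
c.WidthInBytes = m_nWidth;        // interleaved UV has the same byte width as luma
c.Height = m_nLumaHeight / 2;
cuMemcpy2D(&c);
```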

Any help? Thank you.