Nvcuvid Decoding Slower with v537.42 Drivers

Hello Everyone,

We are seeing a performance degradation in h264 video decoding after upgrading from the NVIDIA v431.70 drivers to v537.42. We are running on a Quadro RTX 5000 under Windows 10 21H2.

Of note:

A. The time taken by cuvidParseVideoData starts off well (sub-ms), but then slowly degrades linearly, up to tens of ms per call. This excludes any time spent in callbacks.
On the v431.70 drivers, by comparison, cuvidParseVideoData consistently runs in under a millisecond.

B. The time taken to copy the decoded frame buffers (1080p NV12) to a CUdeviceptr with cuvidMapVideoFrame and cuMemcpy2D, is taking 10-20ms on v431.70, while taking only ~1ms on v431.70.

Does anyone know of anything that’s changed in the drivers that might affect this?

The decoding process roughly follows:

  • main:

    • cuMemAlloc target_buffers
  • Decode thread per video stream (x7):

    • cuvidParseVideoData (linear growth)
      • pfnDecodePicture callback
        • cuvidDecodePicture (constant time)
      • pfnDisplayPicture callback
        • push CUVIDPARSERDISPINFO
    • pop CUVIDPARSERDISPINFO
    • copy to CUdeviceptr (order of magnitude slower)
      • cuvidMapVideoFrame CUVIDPARSERDISPINFO
      • cuMemcpy2D Y → target_buffer
      • cuMemcpy2D UV → target_buffer
      • cuvidUnmapVideoFrame
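
For readers unfamiliar with the copy step, the two cuMemcpy2D calls above amount to two pitched plane copies: the full-height Y plane, then the half-height interleaved UV plane. The following is only a host-side sketch of that layout, using memcpy as a stand-in for cuMemcpy2D. The names Nv12Surface, copyPlane, packedNv12Size and packNv12 are hypothetical, and it assumes the chroma plane starts at pitch * surfaceHeight in the mapped frame, which should be checked against the decoder configuration actually in use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical description of a pitched NV12 surface, standing in for the
// pointer and pitch that cuvidMapVideoFrame would return.
struct Nv12Surface {
    const std::uint8_t* base;   // start of the luma (Y) plane
    std::size_t pitch;          // bytes per row, >= width
    std::size_t width;          // visible width in pixels
    std::size_t height;         // visible height in pixels (even)
    std::size_t surfaceHeight;  // allocated luma rows (ASSUMPTION: chroma
                                // begins at base + pitch * surfaceHeight)
};

// Copies `rows` rows of `widthBytes` bytes between two pitched buffers.
// Host-side stand-in for a single cuMemcpy2D call.
inline void copyPlane(std::uint8_t* dst, std::size_t dstPitch,
                      const std::uint8_t* src, std::size_t srcPitch,
                      std::size_t widthBytes, std::size_t rows) {
    assert(widthBytes <= dstPitch && widthBytes <= srcPitch);
    for (std::size_t r = 0; r < rows; ++r) {
        std::memcpy(dst + r * dstPitch, src + r * srcPitch, widthBytes);
    }
}

// Size in bytes of a tightly packed NV12 frame: Y plane plus a half-height
// interleaved UV plane of the same width in bytes.
inline std::size_t packedNv12Size(std::size_t width, std::size_t height) {
    return width * height + width * (height / 2);
}

// Packs a pitched NV12 surface into a contiguous buffer with two plane copies.
inline std::vector<std::uint8_t> packNv12(const Nv12Surface& s) {
    std::vector<std::uint8_t> out(packedNv12Size(s.width, s.height));
    // Y plane: `height` rows of `width` bytes.
    copyPlane(out.data(), s.width, s.base, s.pitch, s.width, s.height);
    // UV plane: `height / 2` rows of `width` bytes (U and V interleaved).
    copyPlane(out.data() + s.width * s.height, s.width,
              s.base + s.pitch * s.surfaceHeight, s.pitch,
              s.width, s.height / 2);
    return out;
}
```

For a 1080p frame this moves 1920 x 1080 bytes of luma plus 1920 x 540 bytes of chroma, about 3 MB per frame, so the copy volume itself is modest compared with the slowdown reported in B.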

Thanks in advance,
~Edgar

Update:

The 552.22 (R550) drivers don’t have the issue, but the latest LTSB drivers (538.62, R535, at the time of writing) still show the CPU usage ramp-up.

The issue can be reproduced by simply calling cuvidParseVideoData with an h264 RTP video stream (i.e. the CPU usage ramp-up occurs even without further processing on the GPU via cuvidDecodePicture).

Sample code and data have been provided to NVIDIA via Incident: 240517-000386 / Bug 4652908.

is taking 10-20ms on v431.70, while taking only ~1ms on v431.70.

Typo. Is this issue fixed?

Sorry, that should have been:

is taking 10-20ms on v537.42, while taking only ~1ms on v431.70.

The issue appears fixed on the R550 branch, but not on the R535 LTSB.

That is a third issue? Can you describe it? I mean more CPU use is good, no?

What I call the “CPU ramp-up” is issue A: cuvidParseVideoData starts off running well (sub-ms), but then slowly degrades linearly, up to tens of ms per call.

So at a fixed frame rate, the CPU use increases each time cuvidParseVideoData is called on the same CUvideoparser instance. That means more CPU load for the same amount of work. Not good ;)
In real time this ramp-up is slow, taking multiple hours.

Or if you look at it another way, when pushing in frames as fast as possible the number of frames parsed per second decreases over time. In this case CPU use is constant, but less work is done.
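
One way to quantify this kind of ramp-up is to record the wall-clock time of each call and fit a straight line through the samples; a slope clearly above zero indicates per-call cost growing with the call index. The following is a minimal sketch of that measurement, not the code submitted with the bug report. The helpers timeCalls and latencySlope are hypothetical names, and the callable passed in is assumed to wrap a single cuvidParseVideoData call in a real test.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Times each invocation of `call(i)` and returns the per-call durations in
// milliseconds. In a real test, `call` is assumed to wrap one
// cuvidParseVideoData call on a single CUvideoparser instance.
template <typename Fn>
std::vector<double> timeCalls(Fn&& call, std::size_t iterations) {
    std::vector<double> ms;
    ms.reserve(iterations);
    for (std::size_t i = 0; i < iterations; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        call(i);
        const auto t1 = std::chrono::steady_clock::now();
        ms.push_back(
            std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return ms;
}

// Ordinary least-squares slope of y[i] against the call index i, in
// milliseconds per call. A flat latency profile gives a slope near zero.
inline double latencySlope(const std::vector<double>& y) {
    assert(y.size() >= 2);
    const double n = static_cast<double>(y.size());
    const double meanX = (n - 1.0) / 2.0;
    double meanY = 0.0;
    for (double v : y) meanY += v;
    meanY /= n;

    double num = 0.0;
    double den = 0.0;
    for (std::size_t i = 0; i < y.size(); ++i) {
        const double dx = static_cast<double>(i) - meanX;
        num += dx * (y[i] - meanY);
        den += dx * dx;
    }
    return num / den;
}
```

Because the ramp-up takes hours at real-time frame rates, a run that pushes packets as fast as possible and compares the slope between driver versions is likely to be the quicker check.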