Nvcuvid Decoding Slower with v537.42 Drivers

Hello Everyone,

We are seeing a performance degradation in H.264 video decoding after upgrading from NVIDIA v431.70 to v537.42 drivers. Running on a Quadro RTX 5000 under Windows 10 21H2.

Of note:

A. Calls to cuvidParseVideoData start off fast (sub-millisecond), but their duration then degrades roughly linearly, up to tens of milliseconds per call. This excludes any time spent in callbacks.
On the v431.70 drivers, by comparison, cuvidParseVideoData happily stays sub-millisecond.

B. Copying the decoded frame buffers (1080p NV12) to a CUdeviceptr with cuvidMapVideoFrame and cuMemcpy2D takes 10-20ms on v537.42, while taking only ~1ms on v431.70.

Does anyone know of anything that’s changed in the drivers that might affect this?

The decoding process roughly follows:

  • main:

    • cuMemAlloc target_buffers
  • Decode thread per video stream (x7):

    • cuvidParseVideoData (linear growth)
      • pfnDecodePicture callback
        • cuvidDecodePicture (constant time)
      • pfnDisplayPicture callback
        • push CUVIDPARSERDISPINFO
    • pop CUVIDPARSERDISPINFO
    • copy to CUdeviceptr (order of magnitude slower)
      • cuvidMapVideoFrame CUVIDPARSERDISPINFO
      • cuMemcpy2D Y → target_buffer
      • cuMemcpy2D UV → target_buffer
      • cuvidUnmapVideoFrame
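For reference, the map-and-copy step above looks roughly like this. This is a sketch only: names (`decoder`, `disp`, `target_buffer`) and the `CopyDecodedFrame` wrapper are illustrative, error checking is omitted, and the chroma-plane offset assumes the mapped surface height equals the coded height.

```cpp
#include <cuda.h>
#include <nvcuvid.h>

// Sketch of the copy step: map the decoded surface, copy both NV12
// planes to a pre-allocated device buffer, then unmap. Assumes a valid
// CUvideodecoder `decoder`, a popped CUVIDPARSERDISPINFO `disp`, and a
// 1080p NV12 CUdeviceptr `target_buffer` from cuMemAlloc in main.
void CopyDecodedFrame(CUvideodecoder decoder, const CUVIDPARSERDISPINFO& disp,
                      CUdeviceptr target_buffer)
{
    CUdeviceptr src_frame = 0;
    unsigned int src_pitch = 0;

    CUVIDPROCPARAMS proc = {};
    proc.progressive_frame = disp.progressive_frame;
    proc.top_field_first   = disp.top_field_first;
    cuvidMapVideoFrame(decoder, disp.picture_index, &src_frame, &src_pitch, &proc);

    // Y plane: 1920 x 1080 bytes.
    CUDA_MEMCPY2D y = {};
    y.srcMemoryType = CU_MEMORYTYPE_DEVICE;
    y.srcDevice     = src_frame;
    y.srcPitch      = src_pitch;
    y.dstMemoryType = CU_MEMORYTYPE_DEVICE;
    y.dstDevice     = target_buffer;
    y.dstPitch      = 1920;
    y.WidthInBytes  = 1920;
    y.Height        = 1080;
    cuMemcpy2D(&y);

    // Interleaved UV plane: 1920 x 540 bytes. The source offset assumes
    // an unaligned surface height; drivers may align it, so query the
    // decoder's actual surface height if in doubt.
    CUDA_MEMCPY2D uv = y;
    uv.srcDevice = src_frame + (size_t)src_pitch * 1080;
    uv.dstDevice = target_buffer + 1920 * 1080;
    uv.Height    = 540;
    cuMemcpy2D(&uv);

    cuvidUnmapVideoFrame(decoder, src_frame);
}
```

The slowdown in (B) is in this map/copy sequence; the cuvidDecodePicture call itself stays constant-time on both driver versions.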

Thanks in advance,
~Edgar

Update:

The 552.22 (R550) drivers don’t have the issue, but the latest LTSB drivers (538.62, R535, as of writing) still show the CPU usage ramp-up.

The issue can be reproduced simply by calling cuvidParseVideoData on an H.264 RTP video stream (i.e. the CPU usage ramp-up occurs even without any further processing on the GPU via cuvidDecodePicture).
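A minimal repro along those lines, as a sketch: drive cuvidParseVideoData on its own and never issue cuvidDecodePicture. `GetNextAccessUnit` is a hypothetical stand-in for the RTP depacketizer producing Annex-B H.264 data; surface counts and the rest are illustrative.

```cpp
#include <cuda.h>
#include <nvcuvid.h>
#include <cstddef>

// Hypothetical bitstream source: yields depacketized Annex-B access units.
bool GetNextAccessUnit(const unsigned char** data, size_t* size);

static int CUDAAPI OnSequence(void*, CUVIDEOFORMAT*)       { return 1; }
static int CUDAAPI OnDecode(void*, CUVIDPICPARAMS*)        { return 1; } // report success, but never call cuvidDecodePicture
static int CUDAAPI OnDisplay(void*, CUVIDPARSERDISPINFO*)  { return 1; }

void ParseOnly()
{
    CUvideoparser parser = nullptr;
    CUVIDPARSERPARAMS pp = {};
    pp.CodecType              = cudaVideoCodec_H264;
    pp.ulMaxNumDecodeSurfaces = 8;
    pp.pfnSequenceCallback    = OnSequence;
    pp.pfnDecodePicture       = OnDecode;
    pp.pfnDisplayPicture      = OnDisplay;
    cuvidCreateVideoParser(&parser, &pp);

    const unsigned char* data;
    size_t size;
    while (GetNextAccessUnit(&data, &size)) {
        CUVIDSOURCEDATAPACKET pkt = {};
        pkt.payload      = data;
        pkt.payload_size = (unsigned long)size;
        cuvidParseVideoData(parser, &pkt); // the CPU time of this call ramps up
    }
    cuvidDestroyVideoParser(parser);
}
```

Timing each cuvidParseVideoData call in this loop is enough to show the ramp-up on the affected drivers.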

Sample code and data have been provided to NVIDIA via Incident 240517-000386 / Bug 4652908.