8 channel 4K HEVC decoder

Hello,

Any suggestions on a suitable NVIDIA PCIe card capable of decoding 8 channels of 4K HEVC at 60 fps?

NVIDIA cards are primarily used for compute-intensive tasks. However, I am looking for a card that only needs to perform HEVC decoding.

Is it possible to use the CUDA cores for HEVC / H.264 decoding in addition to the NVDEC core?

Thanks,
Subbarao

Hi,
Just to confirm: are you looking for a PC with an NVIDIA graphics card to decode 8 streams of 4Kp60, or an embedded system like Jetson Xavier?

Hello,

I am looking for an NVIDIA PCIe card capable of decoding 8 channel 4Kp60 and not a PC or embedded system.

Thanks,
Subbarao

Hi,
Please check
https://developer.nvidia.com/nvidia-video-codec-sdk#NVDECFeatures

… and also check NVDEC_Application_Note.pdf!

The answer also depends on the input bitrate and on the output transfers (PCI/DMA, you need up to 16GB/s!). You can estimate the minimum required card as follows:

  1. Linearly translate “8x 4k HEVC 4:2:0 60 FPS” to FPS/1080p: 8 (streams) * 4 (4k/1080p) * 60 (FPS) = 1920 FPS/1080p.
  2. Try one PCIe card, e.g. “Quadro RTX 4000”: TU104 with a maximum clock of 1545 MHz (Wikipedia). Linearly scale down the estimated Turing NVDEC performance (NVDEC_Application_Note.pdf): HEVC = 1261*(1545/1755) = 1110 FPS/1080p. There are two decoders in TU104 (see matrix), so 2*1110 = 2220 FPS/1080p, which should be sufficient (2220 > 1920).
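
The two steps above can be written as a small calculation. This is only a sketch of the same linear estimate: the function names are made up for illustration, 4K is assumed to mean 3840x2160, and the figures 1261, 1755, 1545 and the two-decoder count are taken from the post as given, not verified independently. Real throughput may not scale linearly with clock or resolution.

```python
REF_PIXELS = 1920 * 1080  # one 1080p frame, used as the common unit


def required_fps_1080p(streams, width, height, fps):
    """Aggregate decode load expressed in 1080p-equivalent frames per second."""
    return streams * fps * (width * height) / REF_PIXELS


def estimated_capacity_fps_1080p(ref_fps, ref_clock_mhz, card_clock_mhz, decoders):
    """Scale a reference NVDEC figure linearly by clock, then by decoder count."""
    return ref_fps * (card_clock_mhz / ref_clock_mhz) * decoders


# Assumption: "4K" means 3840x2160, i.e. exactly 4x the pixels of 1080p.
required = required_fps_1080p(streams=8, width=3840, height=2160, fps=60)

# Figures quoted in the post: 1261 FPS/1080p at 1755 MHz, card at 1545 MHz, 2 decoders.
capacity = estimated_capacity_fps_1080p(1261, 1755, 1545, 2)

print(f"required: {required:.0f} FPS/1080p")
print(f"capacity: {capacity:.0f} FPS/1080p")
print("sufficient" if capacity > required else "not sufficient")
```

The margin is only about 15%, so the estimate leaves little headroom if the real decoder performs below the linear model.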

DaneLLL, mcerveny,

Thanks for the suggestions. I went through the docs, especially NVDEC_Application_Note.pdf. I compiled ffmpeg and tried to decode a 5MP (5Mbps, 8-bit YUV 4:2:0, H.264) video file as described on the following NVIDIA webpage.

https://developer.nvidia.com/ffmpeg

Unfortunately, I did not get more than 24fps. The command I used was as follows:

ffmpeg -vsync 0 -c:v h264_cuvid -i 4mp.mp4 -f rawvideo output.yuv

The ffmpeg debug messages are pasted below for reference. Any suggestions as to what could be wrong here?

Successfully opened the file.
Parsing a group of options: output url output.yuv.
Successfully parsed a group of options.
Opening an output file: output.yuv.
[file @ 0x3590440] Setting default whitelist 'file,crypto'
Successfully opened the file.
[h264_cuvid @ 0x3584080] Format nv12 chosen by get_format().
[h264_cuvid @ 0x3584080] Loaded lib: libnvcuvid.so.1
[h264_cuvid @ 0x3584080] Loaded sym: cuvidGetDecoderCaps
[h264_cuvid @ 0x3584080] Loaded sym: cuvidCreateDecoder
[h264_cuvid @ 0x3584080] Loaded sym: cuvidDestroyDecoder
[h264_cuvid @ 0x3584080] Loaded sym: cuvidDecodePicture
[h264_cuvid @ 0x3584080] Loaded sym: cuvidGetDecodeStatus
[h264_cuvid @ 0x3584080] Loaded sym: cuvidReconfigureDecoder
[h264_cuvid @ 0x3584080] Loaded sym: cuvidMapVideoFrame64
[h264_cuvid @ 0x3584080] Loaded sym: cuvidUnmapVideoFrame64
[h264_cuvid @ 0x3584080] Loaded sym: cuvidCtxLockCreate
[h264_cuvid @ 0x3584080] Loaded sym: cuvidCtxLockDestroy
[h264_cuvid @ 0x3584080] Loaded sym: cuvidCtxLock
[h264_cuvid @ 0x3584080] Loaded sym: cuvidCtxUnlock
[h264_cuvid @ 0x3584080] Loaded sym: cuvidCreateVideoSource
[h264_cuvid @ 0x3584080] Loaded sym: cuvidCreateVideoSourceW
[h264_cuvid @ 0x3584080] Loaded sym: cuvidDestroyVideoSource
[h264_cuvid @ 0x3584080] Loaded sym: cuvidSetVideoSourceState
[h264_cuvid @ 0x3584080] Loaded sym: cuvidGetVideoSourceState
[h264_cuvid @ 0x3584080] Loaded sym: cuvidGetSourceVideoFormat
[h264_cuvid @ 0x3584080] Loaded sym: cuvidGetSourceAudioFormat
[h264_cuvid @ 0x3584080] Loaded sym: cuvidCreateVideoParser
[h264_cuvid @ 0x3584080] Loaded sym: cuvidParseVideoData
[h264_cuvid @ 0x3584080] Loaded sym: cuvidDestroyVideoParser
[AVHWDeviceContext @ 0x357ba40] Loaded lib: libcuda.so.1
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuInit
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuDeviceGetCount
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuDeviceGet
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuDeviceGetAttribute
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuDeviceGetName
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuDeviceComputeCapability
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuCtxCreate_v2
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuCtxSetLimit
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuCtxPushCurrent_v2
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuCtxPopCurrent_v2
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuCtxDestroy_v2
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuMemAlloc_v2
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuMemAllocPitch_v2
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuMemsetD8Async
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuMemFree_v2
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuMemcpy2D_v2
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuMemcpy2DAsync_v2
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuGetErrorName
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuGetErrorString
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuStreamCreate
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuStreamQuery
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuStreamSynchronize
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuStreamDestroy_v2
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuStreamAddCallback
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuEventCreate
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuEventDestroy_v2
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuEventSynchronize
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuEventQuery
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuEventRecord
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuLaunchKernel
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuModuleLoadData
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuModuleUnload
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuModuleGetFunction
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuTexObjectCreate
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuTexObjectDestroy
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuGLGetDevices_v2
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuGraphicsGLRegisterImage
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuGraphicsUnregisterResource
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuGraphicsMapResources
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuGraphicsUnmapResources
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuGraphicsSubResourceGetMappedArray
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuDeviceGetUuid
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuImportExternalMemory
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuDestroyExternalMemory
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuExternalMemoryGetMappedBuffer
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuExternalMemoryGetMappedMipmappedArray
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuMipmappedArrayGetLevel
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuMipmappedArrayDestroy
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuImportExternalSemaphore
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuDestroyExternalSemaphore
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuSignalExternalSemaphoresAsync
[AVHWDeviceContext @ 0x357ba40] Loaded sym: cuWaitExternalSemaphoresAsync
[h264_cuvid @ 0x3584080] CUVID capabilities for h264_cuvid:
[h264_cuvid @ 0x3584080] 8 bit: supported: 1, min_width: 48, max_width: 4096, min_height: 16, max_height: 4096
[h264_cuvid @ 0x3584080] 10 bit: supported: 0, min_width: 0, max_width: 0, min_height: 0, max_height: 0
[h264_cuvid @ 0x3584080] 12 bit: supported: 0, min_width: 0, max_width: 0, min_height: 0, max_height: 0
Stream mapping:
  Stream #0:0 -> #0:0 (h264 (h264_cuvid) -> rawvideo (native))
Press [q] to stop, [?] for help
cur_dts is invalid st:0 (0) [init:0 i_done:0 finish:0] (this is harmless if it occurs once at the start per stream)
    Last message repeated 1 times
[h264_cuvid @ 0x3584080] Format nv12 chosen by get_format().
[h264_cuvid @ 0x3584080] Formats: Original: nv12 | HW: nv12 | SW: nv12
cur_dts is invalid st:0 (0) [init:0 i_done:0 finish:0] (this is harmless if it occurs once at the start per stream)
    Last message repeated 2 times
detected 4 logical cores
[graph 0 input from stream 0:0 @ 0x3de1c80] Setting 'video_size' to value '2592x1944'
[graph 0 input from stream 0:0 @ 0x3de1c80] Setting 'pix_fmt' to value '23'
[graph 0 input from stream 0:0 @ 0x3de1c80] Setting 'time_base' to value '1/1200000'
[graph 0 input from stream 0:0 @ 0x3de1c80] Setting 'pixel_aspect' to value '1/1'
[graph 0 input from stream 0:0 @ 0x3de1c80] Setting 'sws_param' to value 'flags=2'
[graph 0 input from stream 0:0 @ 0x3de1c80] Setting 'frame_rate' to value '25/1'
[graph 0 input from stream 0:0 @ 0x3de1c80] w:2592 h:1944 pixfmt:nv12 tb:1/1200000 fr:25/1 sar:1/1 sws_param:flags=2
[AVFilterGraph @ 0x4108080] query_formats: 3 queried, 2 merged, 0 already done, 0 delayed
Output #0, rawvideo, to 'output.yuv':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
    encoder         : Lavf58.27.103
    Stream #0:0(und), 0, 1/25: Video: rawvideo, 1 reference frame (NV12 / 0x3231564E), nv12(left), 2592x1944 [SAR 1:1 DAR 4:3], 0/1, q=2-31, 1511654 kb/s, 25 fps, 25 tbn, 25 tbc (default)
    Metadata:
      handler_name    : VideoHandler
      encoder         : Lavc58.52.102 rawvideo
frame=  793 fps= 20 q=-0.0 Lsize= 5853232kB time=00:00:31.72 bitrate=1511654.4kbits/s speed= 0.8x    
video:5853232kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.000000%
Input file #0 (../5mp.mp4):
  Input stream #0:0 (video): 797 packets read (20573074 bytes); 794 frames decoded; 
  Total: 797 packets (20573074 bytes) demuxed
Output file #0 (output.yuv):
  Output stream #0:0 (video): 793 frames encoded; 793 packets muxed (5993709696 bytes); 
  Total: 793 packets (5993709696 bytes) muxed
794 frames successfully decoded, 0 decoding errors
[AVIOContext @ 0x35904c0] Statistics: 0 seeks, 22865 writeouts
[AVIOContext @ 0x35848c0] Statistics: 20626756 bytes read, 2 seeks

We were able to achieve the decoding FPS given in the application note. The major bottleneck is transferring the decoded raw video to host CPU memory. I have posted our latest findings in the following thread.

https://devtalk.nvidia.com/default/topic/987460/gpu-accelerated-libraries/nvdec-cuda-nvenc-speed-comparison/post/5343483/?offset=10#5344656
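
To illustrate how much raw data the earlier ffmpeg run had to move, the numbers in the log can be reproduced from the frame geometry. This is a rough sketch: the helper name is hypothetical, and NV12 is assumed to take 1.5 bytes per pixel (8-bit 4:2:0). The log alone does not show whether the device-to-host copy or the file write was the limiting step.

```python
def nv12_frame_bytes(width, height):
    """Size of one 8-bit NV12 frame: full-size luma plus half-size interleaved chroma."""
    return width * height * 3 // 2


# Values read from the ffmpeg log above.
WIDTH, HEIGHT = 2592, 1944
FRAMES_WRITTEN = 793
OBSERVED_FPS = 20

frame_bytes = nv12_frame_bytes(WIDTH, HEIGHT)
total_bytes = frame_bytes * FRAMES_WRITTEN
output_rate_mb_s = frame_bytes * OBSERVED_FPS / 1e6  # decimal megabytes per second

print(f"bytes per frame: {frame_bytes}")
print(f"total output:    {total_bytes} bytes")
print(f"output rate:     {output_rate_mb_s:.1f} MB/s at {OBSERVED_FPS} fps")
```

The computed total matches the 5993709696 bytes the log reports as muxed, so every decoded frame was copied to the host and written out uncompressed, at roughly 150 MB/s.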

Check memory throughput with “CUDA-Z” or the CUDA sample “bandwidthTest” (https://docs.nvidia.com/cuda/cuda-samples/index.html#bandwidth-test). Try page-locked (pinned) memory. Compare against results published on the Internet; they depend on the card and the PCIe generation, but all are under 13 GiB/s.
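
For comparison with a measured bandwidth figure, the device-to-host rate needed for the original 8x 4Kp60 target can be estimated as below. This is a sketch under stated assumptions: 4K is taken as 3840x2160, every decoded frame is copied to host memory, output is NV12 (1 byte per sample) for 8-bit or a 16-bit container format (2 bytes per sample) for 10-bit content, and the function name is hypothetical. The 13 GiB/s ceiling is the figure quoted above, not a measurement.

```python
GIB = 1024 ** 3


def host_transfer_rate(streams, width, height, fps, bytes_per_sample):
    """Bytes per second of 4:2:0 raw video (1.5 samples per pixel) for all streams."""
    samples_per_frame = width * height * 3 // 2
    return streams * fps * samples_per_frame * bytes_per_sample


rate_8bit = host_transfer_rate(8, 3840, 2160, 60, bytes_per_sample=1)
rate_10bit = host_transfer_rate(8, 3840, 2160, 60, bytes_per_sample=2)

QUOTED_CEILING_GIB_S = 13  # upper bound mentioned in the post

for label, rate in (("8-bit", rate_8bit), ("10-bit", rate_10bit)):
    gib_s = rate / GIB
    print(f"{label}: {gib_s:.2f} GiB/s "
          f"({gib_s / QUOTED_CEILING_GIB_S:.0%} of the quoted ceiling)")
```

Under these assumptions the 8-bit case needs a bit under half of the quoted ceiling, while the 10-bit case comes close to it, so keeping frames on the GPU where possible is worth considering.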