CUDA H264 decoding


I’m trying to decode several full HD (1920x1080) H.264 videos at the same time using the NVIDIA CUDA video decoder API.
I’m also monitoring GPU status with GPU-Z; my card is a GTX 570.
I’m seeing a serious bottleneck in GPU-Z’s Video Processor indicator (I think it corresponds to the card’s Video Processor Engine, VPE).
GPU load reads only 3% and memory usage is also low, yet the VPE indicator reads 35% with just one video, so decoding three or more streams will be a problem because it would go past 100%.
I also tried a commercial CUDA H.264 decoder called CoreAVC and the result is the same: the Video Engine indicator is very high.
My question, essentially: why am I hitting a bottleneck in the VPE? With GPU load so low and video engine load so high, is this an API bug? What can I do to improve performance?
Thanks in advance.

Very generally speaking, H264 decoding consists of two phases - bitstream parsing and image reconstruction.

The second phase is highly parallelizable and suitable for GPU acceleration. The first phase, however, is not: it is essentially serial, and GPUs are bad at serial tasks. If you’re dealing with high-quality, high-bitrate videos (at HD resolution, anything above, say, 10 Mbps should be considered high bitrate), expect the first phase to take the bulk of the time.
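To illustrate why parsing is serial, here is a minimal Python sketch of decoding unsigned Exp-Golomb codes, the variable-length code H.264 uses for many header fields. The bit string and the function names (`read_ue`, `parse_all`) are made up for illustration and this is not how a real decoder is structured. The point is only that you cannot know where codeword N+1 starts until codeword N has been decoded. The entropy coders used for the bulk of the data (CAVLC, CABAC) have the same kind of dependency.

```python
def read_ue(bits, pos):
    """Decode one unsigned Exp-Golomb codeword from a string of '0'/'1'.

    Returns (value, position of the next codeword).
    """
    leading_zeros = 0
    while bits[pos + leading_zeros] == "0":
        leading_zeros += 1
    start = pos + leading_zeros        # index of the '1' that ends the prefix
    end = start + leading_zeros + 1    # the suffix has as many bits as the prefix
    value = int(bits[start:end], 2) - 1
    return value, end


def parse_all(bits):
    """Decode every codeword in order; each step needs the previous one's end."""
    values = []
    pos = 0
    while pos < len(bits):
        value, pos = read_ue(bits, pos)
        values.append(value)
    return values


# Codewords '1', '010', '011', '00100' concatenated into one bit string.
print(parse_all("101001100100"))
```

There is no way to hand the second half of that bit string to another thread, because nothing marks where a codeword begins.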

GPUs such as the GTX 570 contain a dedicated section of the chip (known as PureVideo) that takes care of the first phase. As far as I know, it does not scale the same way as the rest of the video card. It’s been a while since I did any tests in this area, but I recall observing that PureVideo was barely fast enough to manage a single high-bitrate 1080p60 stream.

If you enable hardware acceleration (DXVA2) in CoreAVC, it will go through the same code pathway.

There’s really not much you can do: you only have a limited number of dedicated units in your PC which are fast enough to do bitstream parsing. You could try to decode some videos in hardware and some in software. But there’s a hard limit on what you can do in either case, and it’s low.
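As a rough sketch of the hardware/software split, the Python below assigns streams to the video engine until an assumed budget is used up and sends the rest to a software decoder. It assumes the per-stream load (such as the 35% GPU-Z reading from the question) adds up linearly across streams, which I have not verified. `split_streams` and its parameters are hypothetical names, not part of any NVIDIA API.

```python
def split_streams(loads_percent, hw_budget_percent=100.0):
    """Greedily assign streams to hardware decoding while the budget lasts.

    loads_percent: estimated video-engine load per stream, in percent
                   (assumed to add linearly, which is an approximation).
    Returns (hardware_stream_indices, software_stream_indices).
    """
    hardware, software = [], []
    used = 0.0
    for index, load in enumerate(loads_percent):
        if used + load <= hw_budget_percent:
            hardware.append(index)
            used += load
        else:
            software.append(index)   # falls back to CPU decoding
    return hardware, software


# Four streams at 35% each: only two fit on the video engine.
hw, sw = split_streams([35.0, 35.0, 35.0, 35.0])
print("hardware:", hw, "software:", sw)
```

Whether the CPU can keep up with the streams in the software list is a separate question, and it has its own limit.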

If you have any control over the source videos, prefer streams with multiple slices per frame; those are easier on the decoder because slices can be parsed in parallel. IIRC, Blu-ray mandates at least 4, if not 6, slices per frame. Still, beyond a certain number of streams, you’ll have no choice but to drop frames.
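If you want to check how many slices per frame your streams actually have, something like the following Python sketch can count them in a raw Annex B byte stream (the kind with 00 00 01 start codes). It makes simplifying assumptions: a slice whose `first_mb_in_slice` is 0 is treated as the start of a new picture, so streams using arbitrary slice order, redundant slices, or data partitioning will be miscounted, and field-coded interlaced video will report each field separately. The function names are my own.

```python
def iter_nal_units(data):
    """Yield NAL units from an Annex B byte stream, split on 00 00 01."""
    starts = []
    i = data.find(b"\x00\x00\x01")
    while i != -1:
        starts.append(i + 3)
        i = data.find(b"\x00\x00\x01", i + 3)
    for n, start in enumerate(starts):
        end = starts[n + 1] - 3 if n + 1 < len(starts) else len(data)
        # Drop zero bytes that belong to the next (4-byte) start code.
        nal = data[start:end].rstrip(b"\x00")
        if nal:
            yield nal


def slices_per_picture(data):
    """Return a list with the number of coded slices found in each picture."""
    counts = []
    for nal in iter_nal_units(data):
        nal_type = nal[0] & 0x1F
        if nal_type not in (1, 5) or len(nal) < 2:
            continue  # not a coded slice (1 = non-IDR, 5 = IDR)
        # first_mb_in_slice is Exp-Golomb coded; value 0 is the single bit '1'.
        starts_new_picture = bool(nal[1] & 0x80)
        if starts_new_picture or not counts:
            counts.append(1)
        else:
            counts[-1] += 1
    return counts


# Tiny synthetic stream: SPS, then a 3-slice IDR picture, then a 2-slice picture.
SC3, SC4 = b"\x00\x00\x01", b"\x00\x00\x00\x01"
demo = (SC4 + b"\x67\x42" + SC3 + b"\x65\x80" + SC3 + b"\x65\x40" +
        SC3 + b"\x65\x20" + SC4 + b"\x41\x80" + SC3 + b"\x41\x40")
print(slices_per_picture(demo))
```

For a real file you would read the elementary stream with `open(path, "rb").read()`; video inside MP4 or MKV has to be extracted to Annex B first.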