Getting the most H.264 decode performance

I’m trying to decode as many H.264 streams as possible on an NVIDIA card. It’s easy enough to use the NVDEC hardware from a GStreamer pipeline with the nvv4l2decoder element.

The avdec_h264 element is a software decoder that runs on the CPU, or presumably a CPU decoder chip if present.
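
For reference, here is a rough sketch of how I build a multi-stream test pipeline so I can swap the decoder element and compare stream counts. The helper name `build_decode_pipeline` is just something I made up for illustration. It assumes the inputs are MP4 files with an H.264 video track (hence qtdemux and h264parse) and that the paths contain no spaces or special characters. It only builds the pipeline description string; it does not run GStreamer itself.

```python
def build_decode_pipeline(paths, decoder="nvv4l2decoder"):
    """Build a gst-launch-1.0 style description with one decode branch per file.

    Hypothetical helper for illustration only. Assumes MP4 input containing
    H.264 video and paths without spaces or special characters.
    """
    branches = []
    for path in paths:
        # Each branch is independent: read, demux, parse, decode, discard frames.
        # sync=false lets fakesink drop frames as fast as they are decoded
        # instead of pacing them against the pipeline clock.
        branches.append(
            f"filesrc location={path} ! qtdemux ! h264parse ! "
            f"{decoder} ! fakesink sync=false"
        )
    # Several unconnected branches may share one pipeline description.
    return " ".join(branches)


if __name__ == "__main__":
    files = ["a.mp4", "b.mp4"]
    print(build_decode_pipeline(files))                        # NVDEC branches
    print(build_decode_pipeline(files, decoder="avdec_h264"))  # CPU branches
```

The printed string can be pasted after `gst-launch-1.0` on the command line, or passed to `Gst.parse_launch` from the Python bindings if those are installed. Increasing the number of input files until frames start dropping is a crude way to find the stream limit for each decoder.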

What I’m looking for is a CUDA-based software decoder that will fully saturate the GPU’s cores. Is it possible to do that on CUDA cores? I’m not concerned with how efficient it is compared to the dedicated NVDEC hardware, but I am curious whether I can increase my total stream count with this strategy.