PureVideo on Fermi chip

Hi,

As a master’s student in electrical engineering, I’ve recently been studying the Fermi architecture a lot. Today I found out that NVIDIA’s PureVideo HD (e.g. H.264 decoding) is also handled by the same chip.

My question is whether anyone here knows if this is accomplished by a separate part of the silicon (i.e. dedicated logic) or whether it is also handled by the CUDA cores. My best guess would be dedicated logic, since even the low-end Fermi cards can do PureVideo HD. On the other hand, it might be more cost-effective not to reserve silicon for this specific application.

Any thoughts on this are more than welcome, even if you can’t give a definite answer to the question.

PureVideo is separate logic present in all CUDA-compatible chips. There are a few generations with differing capabilities: VP2, VP3, and the latest, VP4 (in Fermi and other 40 nm cards). The logic runs on its own clock (~450 MHz) and excels at sequential bitstream processing such as H.264 CAVLC and CABAC decoding. In theory this could be implemented on the CUDA cores, but because the bitstream has so many sequential data dependencies, it would be nearly impossible to achieve any parallelism. The earlier approach, with VP1, was to perform bitstream decoding on the CPU and then the inverse DCT and motion compensation on the GPU. CUDA cards now do everything in dedicated logic, with much improved power efficiency.
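To make the sequential dependency concrete, here is a minimal sketch in Python. It decodes unsigned Exp-Golomb codes, the simple variable-length code H.264 uses for many header syntax elements. It is not CAVLC or CABAC, which are more involved, but it shows the same problem: you only learn where codeword n+1 starts after you have decoded codeword n. The function name and the use of a string of '0'/'1' characters as the bitstream are my own assumptions for illustration, and the input is assumed to be well formed.

```python
def decode_ue_stream(bits: str) -> list[int]:
    """Decode a well-formed string of '0'/'1' characters holding
    back-to-back unsigned Exp-Golomb codewords."""
    values = []
    pos = 0
    while pos < len(bits):
        # Count leading zeros; this tells us how long the codeword is.
        zeros = 0
        while bits[pos] == "0":
            zeros += 1
            pos += 1
        pos += 1  # skip the '1' that terminates the zero prefix
        suffix = bits[pos:pos + zeros]
        pos += zeros  # only now do we know where the next codeword begins
        values.append((1 << zeros) - 1 + (int(suffix, 2) if zeros else 0))
    return values


# Codewords "1", "010", "011", "00100" packed back to back:
print(decode_ue_stream("101001100100"))
```

Every iteration needs the final value of pos from the previous one, so there is no obvious way to hand codeword k to thread k. CABAC is worse still, because the decoder state also depends on every previously decoded bin.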

Thank you, Oxydius, for the quick and clear reply.

I agree that entropy decoding does not have much intrinsic parallelism. The only kind available is probably task-level parallelism across different slices.

The rest of the H.264 decoding pipeline, however, seems to me like a good fit for the Fermi architecture. Of course dedicated logic is much more power-efficient than general-purpose hardware, but it also increases the die size. Do you have any idea how much chip area we are talking about for this VP4 processor?

Unfortunately, multiple H.264 slices cannot be relied upon for parallelism. Most high-quality H.264 streams are encoded with a single slice per picture for optimal compression.

I agree that, after entropy decoding, you could write CUDA kernels for most of the H.264 pipeline. Partial DXVA acceleration did that. VP4 is not its own chip; as far as I know it’s just a small low-power logic block bundled with NVIDIA GPUs, either designed or licensed by NVIDIA for integration in their ASICs. It’s pretty much a co-processor with access to global CUDA memory.
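As a rough illustration of why the stages after entropy decoding parallelise well, here is a Python sketch of the H.264 4x4 inverse integer transform as I understand it from the spec (butterfly on rows, then columns, then a rounded shift by 6). Each block depends only on its own 16 coefficients, so in a CUDA version each block could go to its own thread or thread group. The function names are made up for this example, and the input is assumed to be already dequantised (scaled) coefficients; dequantisation, prediction and deblocking are left out.

```python
def inverse_1d(w0, w1, w2, w3):
    """One 4-point butterfly of the H.264 inverse core transform."""
    e0 = w0 + w2
    e1 = w0 - w2
    e2 = (w1 >> 1) - w3
    e3 = w1 + (w3 >> 1)
    return [e0 + e3, e1 + e2, e1 - e2, e0 - e3]


def inverse_transform_4x4(block):
    """Turn a 4x4 block of scaled coefficients into residual samples."""
    rows = [inverse_1d(*row) for row in block]       # horizontal pass
    cols = [inverse_1d(*col) for col in zip(*rows)]  # vertical pass
    # cols[j][i] holds row i, column j; transpose back and round.
    return [[(cols[j][i] + 32) >> 6 for j in range(4)] for i in range(4)]


def decode_residuals(blocks):
    """No block reads another block's data, so this loop is the part
    that would map onto independent GPU threads."""
    return [inverse_transform_4x4(b) for b in blocks]


# A DC-only block: a single coefficient spreads into a flat residual.
dc_only = [[64, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
print(inverse_transform_4x4(dc_only))
```

Compare this with the bitstream parsing: here the loop over blocks has no carried state at all, which is exactly the shape of work the CUDA cores are good at.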

Based on the GPU load while decoding, some part of the work is done on the dedicated chip (serial stuff such as entropy decoding) and some part is done on the cores. There is definitely a dedicated chip for the work, though.