H.264 Deblocking Filter using GPU

Hi everyone,

I want to implement the deblocking filter for H.264/AVC on the GPU. As you know, the H.264/AVC deblocking filter has heavy data dependencies. Do you have any good ideas for accelerating this filter?


How large are the regions with different data dependency? Can you make each CUDA thread block cover one such region?

If this is not possible, how many different states of the “data” (probably corresponding to filter strength) does the deblocking filter support?
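On the number of states: the standard's deblocking filter distinguishes five boundary-strength (bS) values, 0 through 4, per 4x4-block edge. A minimal sketch of the decision, simplified from the spec's rules (the input flags here are hypothetical arguments, not a real API):

```c
#include <assert.h>

/* Simplified boundary-strength (bS) decision for the edge between two
 * neighboring 4x4 blocks p and q (sketch of the H.264 rules; the flag
 * inputs are hypothetical and would come from the decoder state). */
int boundary_strength(int p_intra, int q_intra, int mb_edge,
                      int p_coeffs, int q_coeffs, int mv_differs)
{
    if ((p_intra || q_intra) && mb_edge) return 4; /* strongest filtering */
    if (p_intra || q_intra)              return 3; /* intra, internal edge */
    if (p_coeffs || q_coeffs)            return 2; /* nonzero residual */
    if (mv_differs)                      return 1; /* motion mismatch */
    return 0;                                      /* edge not filtered */
}
```

bS = 0 means the edge is skipped entirely, so in practice many edges need no work at all.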


Yes, almost all of the data depends on previously filtered data. I mean the right/below block depends on the left/top one, so the effect propagates.
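To make that propagation concrete, a serial deblocking pass must visit macroblocks in raster order, because each MB needs its left and top neighbors to be already filtered. A minimal simulation (`deblock_mb` is a hypothetical stand-in for the per-MB filter; here it only checks the dependency):

```c
#include <assert.h>
#include <string.h>

#define MB_W 8
#define MB_H 4
static int done[MB_H][MB_W];

/* Hypothetical per-MB filter: it only verifies that the left and top
 * neighbors were filtered first, which is the dependency the
 * standard's deblocking filter imposes. */
static void deblock_mb(int x, int y)
{
    if (x > 0) assert(done[y][x - 1]); /* left neighbor already filtered */
    if (y > 0) assert(done[y - 1][x]); /* top neighbor already filtered  */
    done[y][x] = 1;
}

/* Serial reference order: a raster scan satisfies both dependencies. */
void deblock_frame(void)
{
    memset(done, 0, sizeof done);
    for (int y = 0; y < MB_H; ++y)
        for (int x = 0; x < MB_W; ++x)
            deblock_mb(x, y);
}
```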

Can we use the video processor (VP) through CUDA? It is the processor dedicated to video processing.

From my understanding, the deblocking filter requires only "pre-filtered" luma/chroma samples. Thus the parallelism of the deblocking filter is quite high: just use one thread block per MB and the whole grid for one frame/field.

H.264 was designed to be hardware-friendly, so there is no serious bottleneck in implementing it on a many-core system such as CUDA.

The H.264 standard defines the deblocking filter, so the algorithm cannot be changed if you want to maintain fidelity of the video.
And the deblocking filter requires the filtered neighboring samples: using unfiltered neighboring samples will break the standard (and possibly cause artifacts).

I guess the initial argument by egg is correct: we need access to the VP2 engine, because the NVIDIA-provided decoder (in the SDK) certainly manages deblocking pretty well using the VP2 (1080p AVC full decoding at 60 fps on a GTX 260+). However, one would need to access this VP engine to build an encoder.
Any idea how to access this VP2 engine?

You could use the NVCUVID API to access the decode functionality. For the original question, it should be possible to parallelize the deblocking along diagonals, and/or to use global atomics to resolve the spatial dependencies.
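A sketch of the diagonal idea: macroblocks on the same anti-diagonal (x + y = d) have no left/top dependency on one another, so each diagonal can be processed as one parallel batch (e.g., one kernel launch or one synchronized step in CUDA). A frame of W x H macroblocks then needs W + H - 1 sequential steps instead of W * H. Plain C simulation, with a hypothetical `process_mb` placeholder:

```c
#include <assert.h>

#define MB_W 8
#define MB_H 4

/* Wavefront schedule over an MB_W x MB_H macroblock grid. All MBs
 * with x + y == d depend only on MBs on earlier diagonals, so each
 * inner loop body could run concurrently on the GPU. */
int wavefront_steps(void)
{
    int steps = 0;
    for (int d = 0; d <= MB_W + MB_H - 2; ++d) {  /* sequential steps */
        for (int y = 0; y < MB_H; ++y) {          /* parallel batch   */
            int x = d - y;
            if (x < 0 || x >= MB_W)
                continue;
            /* process_mb(x, y) would run here, concurrently */
        }
        ++steps;
    }
    return steps;  /* MB_W + MB_H - 1 */
}
```

The drawback, as noted below in the thread, is that the batches near the frame corners contain only a few MBs, so GPU utilization is uneven across steps.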

Thanks for the reply Sulik.

As I understand it, the NVCUVID API gives me just the decode entry point (the decode-frame API). However, I am not sure how to access only the deblocking filter through the NVCUVID API, as an encoder would require.

About the diagonal, better known as "wavefront", parallelism (something I tried before): the block-level synchronization is very critical. I am not sure how we can control the blocks' scheduling over the available multiprocessors; if all the multiprocessors are occupied by blocks that are not supposed to be processed yet, there could be a deadlock.
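One common workaround for that scheduling hazard (a sketch, not something from this thread): do not tie the work item to the block index at all. Instead, precompute a wavefront (anti-diagonal) order and have each running block claim the next ticket with an atomic counter (`atomicAdd` on a global in CUDA). Since the ticket order is a topological order of the dependency graph, a block holding ticket t only ever waits on tickets smaller than t, and the chain bottoms out at ticket 0, so no deadlock occurs regardless of which blocks the hardware schedules first. Plain C version of the order construction (names are hypothetical):

```c
#include <assert.h>

#define MB_W 8
#define MB_H 4
#define N_MB (MB_W * MB_H)

typedef struct { int x, y; } mb_t;

/* Fill order[t] with the MB processed by ticket t, in anti-diagonal
 * order. Every MB's left/top neighbors lie on earlier diagonals, so
 * this is a valid topological order of the deblocking dependencies. */
int build_wavefront_order(mb_t order[N_MB])
{
    int t = 0;
    for (int d = 0; d <= MB_W + MB_H - 2; ++d)
        for (int y = 0; y < MB_H; ++y) {
            int x = d - y;
            if (x >= 0 && x < MB_W) {
                order[t].x = x;
                order[t].y = y;
                ++t;
            }
        }
    return t;  /* number of tickets handed out */
}
```

On the GPU, each persistent block would loop: grab a ticket, wait (via per-MB done flags and atomics) for the ticket's left/top neighbors to finish, filter the MB, and mark it done.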

Do let me know your inputs.