Optimizing the JM 18.0 H.264 reference software with a CUDA implementation

Hi,

Is there a link or paper with guidelines for optimizing the inter prediction and motion estimation algorithms of the H.264 video codec using CUDA? I found one implementation among the CUDA samples on NVIDIA’s website, but it is distributed only as an .exe, so the underlying details are hidden. I would like to know more about how it works.

Thanks!