NPP-based filters performance issue

I am trying to create a filter library that uses NPP for filtering. The library is built from individual filters that can be connected in series to process an image. Right now I have connected two filters: one resizes the image using NPP, and the other blends it onto another image using NPP. Finally, I transfer the frame from the GPU to the CPU for further processing.

This chain takes 145 ms to process a 1920x1080 frame, which is too much. I also noticed that if I remove the blend filter, the time drops to 12 ms, and if I run the blend filter on its own, it takes only 2 ms. But when I connect all these filters in a chain, it shoots up to 145 ms. I am puzzled.

I would recommend filing a bug report with a self-contained repro program.
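
While putting the repro together, it may be worth checking how the per-filter times are measured. CUDA kernel launches are asynchronous with respect to the host, so a host-side timer around a single filter can end up measuring only the launch, with the real cost showing up in a later blocking call such as the device-to-host copy. Whether that is what happens here is only a guess. Below is a minimal sketch of a per-stage timer that synchronizes before starting and before stopping the clock. `Stage` and `time_stages` are hypothetical names for illustration, not part of NPP or of the library in the question.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// One named stage of the filter chain (hypothetical type, not part of NPP).
struct Stage {
    std::string name;
    std::function<void()> run;
};

// Returns the wall-clock time of each stage in milliseconds.
// `sync` is called before the clock starts and again before it stops, so work
// queued asynchronously by a stage is charged to that stage rather than to
// whichever later call happens to block.
inline std::vector<std::pair<std::string, double>>
time_stages(const std::vector<Stage>& stages, const std::function<void()>& sync) {
    assert(sync && "pass a no-op lambda if no synchronization is needed");
    using clock = std::chrono::steady_clock;
    std::vector<std::pair<std::string, double>> result;
    for (const Stage& stage : stages) {
        sync();                          // drain work left over from earlier stages
        const auto t0 = clock::now();
        stage.run();
        sync();                          // wait until this stage has really finished
        const auto t1 = clock::now();
        result.emplace_back(
            stage.name,
            std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return result;
}
```

In a CUDA build the `sync` argument could be `[] { cudaDeviceSynchronize(); }`, and the stages would wrap the resize, the blend, and the GPU-to-CPU copy. If the per-stage numbers measured this way add up to roughly the 145 ms total, the slow stage is identified; if they do not, that discrepancy is useful information to include in the bug report.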