Converting NV12 to RGBA: which approach is fastest?

Hello,

I am decoding H.264-compressed 4K video with NVDec and would like to convert the resulting NV12 frames to RGBA.
I have tried several approaches and ran into the following problems:

  1. Performance is poor when I use the cudaNV12ToRGBA function from jetson-utils.
  2. Performance is good when I use the nppiNV12ToRGB_8u_P2C3R function from the NPP library, but its output is 3-channel RGB and I also need an alpha channel.
  3. I therefore added alpha to the RGB output in a separate CUDA step, but with that extra step the performance is poor again.
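
To make the target output concrete, here is a small CPU reference sketch of the conversion I am after: NV12 in, packed RGBA out, with alpha fixed at 255 and written in the same pass as the color. It is only an illustration, not the code I actually run. The function name nv12_to_rgba_ref is a placeholder, and the BT.601 limited-range coefficients and tightly packed buffers are assumptions that may not match what NPP or jetson-utils use.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Clamp an intermediate value to the 0..255 range of an 8-bit channel.
static inline std::uint8_t clamp_u8(int v) {
    return static_cast<std::uint8_t>(std::min(255, std::max(0, v)));
}

// Hypothetical reference helper (not part of NPP or jetson-utils).
// y_plane:  width * height bytes of luma.
// uv_plane: width * (height / 2) bytes, interleaved U,V per 2x2 pixel block.
// Returns width * height * 4 bytes of packed R,G,B,A with A = 255.
// Assumptions: even width and height, no row padding, BT.601 limited range.
std::vector<std::uint8_t> nv12_to_rgba_ref(const std::uint8_t* y_plane,
                                           const std::uint8_t* uv_plane,
                                           int width, int height) {
    std::vector<std::uint8_t> rgba(static_cast<std::size_t>(width) * height * 4);
    for (int row = 0; row < height; ++row) {
        for (int col = 0; col < width; ++col) {
            // Each 2x2 block of pixels shares one U,V pair.
            const std::uint8_t* uv = uv_plane + (row / 2) * width + (col / 2) * 2;
            const int c = y_plane[row * width + col] - 16;
            const int d = uv[0] - 128;  // U
            const int e = uv[1] - 128;  // V

            // 8.8 fixed-point BT.601 coefficients (assumed, see above).
            const int r = (298 * c + 409 * e + 128) / 256;
            const int g = (298 * c - 100 * d - 208 * e + 128) / 256;
            const int b = (298 * c + 516 * d + 128) / 256;

            std::uint8_t* out = rgba.data() + (static_cast<std::size_t>(row) * width + col) * 4;
            out[0] = clamp_u8(r);
            out[1] = clamp_u8(g);
            out[2] = clamp_u8(b);
            out[3] = 255;  // alpha written in the same pass, no second step
        }
    }
    return rgba;
}
```

What I am looking for is a GPU path that produces this same layout in one pass, instead of the RGB conversion followed by a separate alpha step as in item 3.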

Could you please suggest other approaches that would give better performance?