How do I get the best performance out of Xavier?

I moved my program from a Tesla P4 to Xavier without any modification, but it runs slower on Xavier
than on the Tesla P4. I am unfamiliar with Xavier. How can I make my program faster?
I preprocess my Mat on the CPU (resize and so on) before inference, so I need to download it from the GPU.
If I preprocessed directly on the GPU, it would be faster.
But is there anything more significant that I have not used yet?

Hi quanminzhu,

Please refer to "CUDA for Tegra" in the CUDA Toolkit Documentation and see if it helps in your case.
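
Before changing the pipeline, it may also help to measure how much time each stage takes, so you can see whether the CPU preprocessing and the GPU download really dominate. Below is a minimal sketch in Python. The `StageTimer` class and the three stage functions are hypothetical stand-ins, not part of any NVIDIA API; replace them with your own resize, copy, and inference calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage (hypothetical helper)."""

    def __init__(self):
        self.totals = defaultdict(float)  # total seconds per stage
        self.counts = defaultdict(int)    # number of measurements per stage

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Note: GPU work is launched asynchronously, so a real GPU stage
            # should synchronize with the device before this point is reached.
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def report(self):
        """Return the mean time per call, in milliseconds, for each stage."""
        return {
            name: self.totals[name] / self.counts[name] * 1000.0
            for name in self.totals
        }


# Hypothetical stand-ins for the real pipeline stages.
def download_from_gpu(frame):
    return list(frame)


def preprocess_on_cpu(frame):
    return [value / 255.0 for value in frame]


def run_inference(tensor):
    return sum(tensor)


if __name__ == "__main__":
    timer = StageTimer()
    frame = list(range(256)) * 100  # dummy data in place of an image

    for _ in range(20):
        with timer.stage("download"):
            host_frame = download_from_gpu(frame)
        with timer.stage("preprocess"):
            tensor = preprocess_on_cpu(host_frame)
        with timer.stage("inference"):
            run_inference(tensor)

    for name, mean_ms in timer.report().items():
        print(f"{name}: {mean_ms:.3f} ms per frame")
```

If the download and preprocess stages turn out to be a large share of the per-frame time, moving the preprocessing to the GPU is likely worth the effort; if inference dominates, the gain from that change will be small.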