Why is nppiCopyConstBorder_32f_C1R so slow?

I use nppiCopyConstBorder_32f_C1R to zero-pad a 21x21 PSF to 1280x1024. It unexpectedly takes 18 ms, which seems far too long. Why? Any help would be appreciated. Setup: GTX 1660, CUDA 11.0, Windows 10.

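It turned out to be cudaDeviceReset()'s fault. cudaDeviceReset() destroys the CUDA context, so the next CUDA or NPP call has to pay for re-initialization, and that is presumably what ended up in the 18 ms measurement.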

