CUDA memory copy throughput on Jetson devices

Jetson devices use unified memory. Is there any reason why a host-to-device memory copy (even with pinned memory) is slower than a device-to-device copy?

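Have you maximized the device performance before benchmarking (for example, by switching to the highest power mode with nvpmodel and locking the clocks with sudo jetson_clocks)?
If so, could you share the timings you measured with us first?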
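
For a comparable measurement, here is a minimal sketch of how the two copies could be timed with CUDA events. The 64 MiB buffer size, the iteration count, and the helper names `time_copy_ms` and `to_gb_per_s` are assumptions chosen for illustration, not taken from the original post.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            std::fprintf(stderr, "%s failed: %s\n", #call,         \
                         cudaGetErrorString(err_));                \
            std::exit(EXIT_FAILURE);                               \
        }                                                          \
    } while (0)

// Hypothetical helper: bytes copied in `ms` milliseconds, as GB/s (1 GB = 1e9 bytes).
static double to_gb_per_s(size_t bytes, float ms) {
    return (static_cast<double>(bytes) / 1e9) / (ms / 1e3);
}

// Hypothetical helper: average time in ms of one cudaMemcpy, measured with CUDA events.
static float time_copy_ms(void* dst, const void* src, size_t bytes,
                          cudaMemcpyKind kind, int iters) {
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    CHECK(cudaMemcpy(dst, src, bytes, kind));  // warm-up copy, not timed
    CHECK(cudaEventRecord(start));
    for (int i = 0; i < iters; ++i) {
        CHECK(cudaMemcpy(dst, src, bytes, kind));
    }
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return ms / iters;
}

int main() {
    const size_t bytes = 64u << 20;  // 64 MiB per copy (assumed size)
    const int iters = 20;            // assumed iteration count

    void* h_pinned = nullptr;
    void* d_a = nullptr;
    void* d_b = nullptr;
    CHECK(cudaMallocHost(&h_pinned, bytes));  // pinned host buffer
    CHECK(cudaMalloc(&d_a, bytes));
    CHECK(cudaMalloc(&d_b, bytes));
    std::memset(h_pinned, 1, bytes);

    float h2d_ms = time_copy_ms(d_a, h_pinned, bytes, cudaMemcpyHostToDevice, iters);
    float d2d_ms = time_copy_ms(d_b, d_a, bytes, cudaMemcpyDeviceToDevice, iters);

    std::printf("H2D (pinned): %.3f ms, %.2f GB/s\n", h2d_ms, to_gb_per_s(bytes, h2d_ms));
    std::printf("D2D:          %.3f ms, %.2f GB/s\n", d2d_ms, to_gb_per_s(bytes, d2d_ms));

    CHECK(cudaFree(d_a));
    CHECK(cudaFree(d_b));
    CHECK(cudaFreeHost(h_pinned));
    return 0;
}
```

Building it with something like `nvcc -O2 memcpy_bench.cu -o memcpy_bench` (the file name is arbitrary) and posting both lines of output, along with the power mode in use, would make the numbers easier to compare. The reported figure counts the bytes copied once, so it is not directly comparable with tools that count reads plus writes for device-to-device copies.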