Why does a program take longer when it uses Unified Memory than when it does not?

Code: https://github.com/zhuzhuoyue/cuda_benchmarks

Using Unified Memory (simpleManaged.cu):
./simpleManaged 400000000

host: MallocManaged: 1.082769
host: init arrays: 3.432402
device: uvm+compute+synchronize: 0.013866
host: access all arrays: 6.175977
host: access all arrays a second time: 0.570206
host: free: 0.382470
total: 11.658073

Without Unified Memory (simpleMemcpy.cu):
./simpleMemcpy 400000000

host: MallocHost: 1.311571
host: init arrays: 3.348044
device: malloc+copy+compute: 1.734390
host: access all arrays: 2.175081
host: access all arrays a second time: 0.552091
host: free: 0.416059
total: 9.537628


First, please remember to maximize the device performance with the commands below:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

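Unified Memory doesn't require an explicit memory copy, but it does incur some overhead for buffer synchronization. In your numbers, the device phase drops from 1.73 s to 0.01 s with Unified Memory, while the first host access after the kernel rises from 2.18 s to 6.18 s, which appears to be where that synchronization cost shows up.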
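To see where the time goes, it can help to time the device phase and the two host passes separately on a managed buffer. The following is only a minimal sketch under stated assumptions, not the code from the repository: the kernel `addOne`, the helper `timeManagedPasses`, and the array size are hypothetical names and values chosen for illustration.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: add 1.0f to every element.
__global__ void addOne(float *data, size_t n) {
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Wall-clock time in seconds.
static double now() {
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double>(clock::now().time_since_epoch()).count();
}

// Read every element once on the host and return the sum.
static double hostSum(const float *data, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) sum += data[i];
    return sum;
}

// Times the kernel and two host passes over one managed buffer.
// Returns the host-side sum, or -1.0 on a CUDA error or mismatch.
double timeManagedPasses(size_t n) {
    float *data = nullptr;
    if (cudaMallocManaged(&data, n * sizeof(float)) != cudaSuccess) return -1.0;
    for (size_t i = 0; i < n; ++i) data[i] = 0.0f;  // host initialization

    double t0 = now();
    unsigned int blocks = static_cast<unsigned int>((n + 255) / 256);
    addOne<<<blocks, 256>>>(data, n);
    // Wait for the kernel before the host touches the buffer again.
    cudaError_t err = cudaDeviceSynchronize();
    double t1 = now();
    if (err != cudaSuccess) { cudaFree(data); return -1.0; }

    double first = hostSum(data, n);   // first host access after the kernel
    double t2 = now();
    double second = hostSum(data, n);  // second host access
    double t3 = now();

    printf("device: compute+synchronize: %f\n", t1 - t0);
    printf("host: first access: %f\n", t2 - t1);
    printf("host: second access: %f\n", t3 - t2);

    cudaFree(data);
    return (first == second) ? first : -1.0;
}

int main() {
    const size_t n = 1 << 24;  // illustrative size, much smaller than 400000000
    return timeManagedPasses(n) == static_cast<double>(n) ? 0 : 1;
}
```

Assuming the file is saved as managedSketch.cu (a hypothetical name), it can be built with `nvcc -o managedSketch managedSketch.cu`. Comparing the first and second host passes gives a rough idea of how much of the first pass is synchronization overhead rather than plain memory reads.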
Here is our documentation on Jetson memory for your reference: