Zero Copy vs Pinned Memory Performance: need some explanation

I have a lens correction kernel using OpenCV (OpenCV4Tegra) running on an NVIDIA TX1.

I have tested the kernel with two memory models:

1 - Pinned memory allocated using GpuMat: upload the data to it, process it, then download the result.

2 - Zero-copy mapped memory: no upload, no download, just processing.

Since the TX1 has an integrated GPU that shares the same memory space as the host, I shouldn’t have to “upload” to device memory before processing. If I understand correctly, there is no separate device memory per se.

I ran my tests, and approach 1 is twice as fast as approach 2, even with the upload and download included.

So what exactly happens when we “upload” to a GpuMat? Why is this faster?

Similarly, why is processing on zero-copy data slower?

What does “read once, write once” mean? Is it with respect to the whole matrix, or is it about indexing, e.g. read index 0 only once and never go back to index 0 again?
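
To make the per-index reading of “read once” concrete, here is a minimal CPU-only C++ sketch. It is not my actual kernel: the helper names `readsPointwise` and `readsRemapLinear` are hypothetical, and I am assuming the correction behaves roughly like a remap with linear interpolation along a single 1-D row. It only counts how often each input element would be read.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: read counts for a pointwise pass, out[i] = f(in[i]).
// Every input element is touched exactly once.
std::vector<int> readsPointwise(std::size_t width) {
    std::vector<int> reads(width, 0);
    for (std::size_t i = 0; i < width; ++i) {
        ++reads[i];
    }
    return reads;
}

// Hypothetical helper: read counts for a remap-style pass where
// out[i] = lerp(in[j], in[j + 1]) and the source coordinate is i * scale.
// This is an assumed stand-in for lens correction, not the real mapping.
std::vector<int> readsRemapLinear(std::size_t width, double scale) {
    assert(width >= 2 && scale > 0.0 && scale <= 1.0);
    std::vector<int> reads(width, 0);
    for (std::size_t i = 0; i < width; ++i) {
        const double src = static_cast<double>(i) * scale;
        std::size_t j = static_cast<std::size_t>(src);  // left neighbour
        if (j > width - 2) {
            j = width - 2;  // clamp so that j + 1 stays inside the row
        }
        ++reads[j];      // left sample
        ++reads[j + 1];  // right sample
    }
    return reads;
}
```

With `width = 4` and `scale = 0.5`, the pointwise pass reads every element once, while the remap-style pass reads element 1 four times and never reads element 3. If “read once” is meant per index, a kernel like mine would presumably not qualify.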

I have already gone through the documentation, but I haven’t been able to figure out why I see a performance loss instead of a gain.