Performance difference between cudaHostAlloc and malloc

I have an image processing application. Currently all computation runs on the CPU, but in the future I plan to port some of the algorithms to CUDA and run them on the GPU. For that purpose I allocate all image buffers with
cudaHostAlloc( …, cudaHostAllocMapped ) instead of a regular malloc, so that I can use zero copy later.
I was surprised to find that the image processing algorithms perform better with malloc-allocated buffers.
Why does this happen? How can I allocate buffers for future use on the device without degrading performance?
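For reference, a minimal sketch of the two allocation paths described above (the buffer size and pointer names are placeholders, not from the original question):

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1920 * 1080 * 4;  // hypothetical image size

    // Regular pageable allocation: fast for CPU-only processing.
    unsigned char* cpuBuf = static_cast<unsigned char*>(malloc(bytes));

    // Pinned, mapped allocation intended for future zero-copy GPU access.
    // On some platforms this memory is uncached (or write-combined) on the
    // CPU side, which can slow down CPU processing of the same buffer.
    unsigned char* zeroCopyBuf = nullptr;
    cudaHostAlloc(&zeroCopyBuf, bytes, cudaHostAllocMapped);

    // Device-side pointer aliasing the same memory, needed to pass the
    // buffer to a kernel without an explicit copy.
    unsigned char* devPtr = nullptr;
    cudaHostGetDevicePointer(&devPtr, zeroCopyBuf, 0);

    cudaFreeHost(zeroCopyBuf);
    free(cpuBuf);
    return 0;
}
```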

You may want to read this thread: https://devtalk.nvidia.com/default/topic/1021702/jetson-tx1/performance-of-zero-copy-on-jetson-tx1/.

Unified Memory (cudaMallocManaged) will probably give better performance here: zero-copy mapped memory is typically not cached on the CPU side, which is likely why your CPU-side processing slowed down, whereas managed memory stays in normal cached system memory while the CPU is using it.
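A minimal sketch of that alternative, assuming a device that supports managed memory (the size and names are illustrative):

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1920 * 1080 * 4;  // hypothetical image size

    // Managed memory is accessible from both CPU and GPU through the same
    // pointer; the driver migrates pages on demand, so CPU-side access
    // remains cached and fast while you still process on the host.
    unsigned char* buf = nullptr;
    cudaMallocManaged(&buf, bytes);

    // ... process on the CPU now; later, launch kernels on the same pointer ...

    cudaDeviceSynchronize();  // ensure any GPU work on buf is done
    cudaFree(buf);
    return 0;
}
```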