I ran experiments on the TK1 and TX2 to compare the performance of this technique, the zero-copy technique, and explicit memory management using cudaMalloc and cudaMemcpy. The results show that unified memory and zero copy are always faster.
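For reference, here is a minimal sketch of the three approaches I am comparing. It is not my actual benchmark: the `scale` kernel, the buffer size, and the `CHECK` macro are made up for illustration, and timing code is left out.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Hypothetical kernel used only for illustration: doubles each element.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;                 // assumed buffer size
    const size_t bytes = n * sizeof(float);
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Allow mapped pinned memory; must be set before the context is created.
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    // 1) Explicit management: separate host and device buffers plus copies.
    float *h = (float *)malloc(bytes);
    float *d = nullptr;
    for (int i = 0; i < n; ++i) h[i] = 1.0f;
    CHECK(cudaMalloc((void **)&d, bytes));
    CHECK(cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice));
    scale<<<blocks, threads>>>(d, n);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost));
    printf("explicit:  %.1f\n", h[0]);

    // 2) Zero copy: pinned host memory mapped into the GPU address space.
    float *z = nullptr, *dz = nullptr;
    CHECK(cudaHostAlloc((void **)&z, bytes, cudaHostAllocMapped));
    for (int i = 0; i < n; ++i) z[i] = 1.0f;
    CHECK(cudaHostGetDevicePointer((void **)&dz, z, 0));
    scale<<<blocks, threads>>>(dz, n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());        // wait before the CPU reads z
    printf("zero copy: %.1f\n", z[0]);

    // 3) Unified memory: one managed pointer used by both CPU and GPU.
    float *u = nullptr;
    CHECK(cudaMallocManaged((void **)&u, bytes));
    for (int i = 0; i < n; ++i) u[i] = 1.0f;
    scale<<<blocks, threads>>>(u, n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());        // required before CPU access
    printf("unified:   %.1f\n", u[0]);

    free(h);
    CHECK(cudaFree(d));
    CHECK(cudaFreeHost(z));
    CHECK(cudaFree(u));
    return 0;
}
```

This should build with something like `nvcc -o memtest memtest.cu` (the file name is arbitrary). Only the explicit path performs any cudaMemcpy calls. The other two leave data movement to the driver and hardware.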
So I am curious whether unified memory is better in all cases, even on Kepler and Maxwell, which support only a limited form of unified memory, and whether zero copy is always faster when the CPU and GPU share the same physical memory. If the answer is yes, does that mean we will no longer need explicit memory management in the future? By the way, I have noticed that some NVIDIA-provided libraries (such as VisionWorks) still seem to use explicit memory management.