I have conducted experiments on the TK1 and TX2 to compare the performance of the unified memory technique, the zero-copy technique, and explicit memory management using cudaMalloc and cudaMemcpy. In my results, unified memory and zero copy are always faster.
So I am curious: is unified memory better in all cases, even on Kepler and Maxwell, which only support a limited form of unified memory? And is zero copy always faster when CPU memory and GPU memory are physically integrated? If the answer is yes, does that mean we will no longer need explicit memory management in the future? Incidentally, I have noticed that some NVIDIA-provided libraries (such as VisionWorks) still seem to use explicit memory management.
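To make the comparison concrete, here is a minimal sketch (my own illustration, not code from any of the libraries mentioned) of the three techniques being compared; the kernel and sizes are arbitrary, and error checking is omitted for brevity:

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // 1) Explicit management: separate host and device buffers,
    //    with cudaMemcpy transfers in both directions.
    float *h = (float *)malloc(bytes);
    float *d;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, n);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    free(h);

    // 2) Zero copy: pinned host memory mapped into the GPU address
    //    space; on integrated-memory Tegra boards no copy occurs.
    float *hz, *dz;
    cudaHostAlloc(&hz, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer(&dz, hz, 0);
    scale<<<(n + 255) / 256, 256>>>(dz, n);
    cudaDeviceSynchronize();
    cudaFreeHost(hz);

    // 3) Unified memory: one pointer valid on both host and device;
    //    the driver migrates (or, pre-Pascal, copies) data as needed.
    float *u;
    cudaMallocManaged(&u, bytes);
    scale<<<(n + 255) / 256, 256>>>(u, n);
    cudaDeviceSynchronize();
    cudaFree(u);

    return 0;
}
```

The key difference on Tegra is that in (2) and (3) the "device" accesses go to the same physical DRAM as the host buffer, so the explicit copies in (1) are pure overhead there; on a discrete GPU the trade-off depends on the access pattern.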
Thanks for your reply. What about unified memory? Is it always faster than explicit memory copies, both on Tegra boards and on systems with separate CPU and GPU memory? Thanks.
One way of thinking about this is that these are convenience features, much like caches in processors, compilers that generate machine code, etc.
Convenience features can be a boon to programmer productivity and typically provide good, or at least acceptable, performance for 95% of use cases. But for a small percentage of cases, they might do pretty bad things from a performance perspective (e.g. cache thrashing). In those situations, a skilled human can do a better job, in particular if that skilled human also has a better understanding of the details of a particular use case that cannot be adequately conveyed to some automated process.
Note that as automated mechanisms mature, the number of humans with sufficient skill to beat the machine tends to get smaller and smaller, but in most engineering domains it hasn’t become zero yet (playing chess being a notable exception).