Difference between cudaMallocManaged and zero copy memory function


I am a CUDA beginner, learning through the Toolkit documentation. However, I am also trying to leverage the code repo https://github.com/dusty-nv/jetson-inference for my work. While I catch up with the documentation, I wanted to quickly understand the difference between a function used in this repo, cudaAllocMapped(), and cudaMallocManaged().

The function is given here:

/**
 * Allocate ZeroCopy mapped memory, shared between CUDA and CPU.
 * @ingroup util
 */
inline bool cudaAllocMapped( void** cpuPtr, void** gpuPtr, size_t size )
{
	if( !cpuPtr || !gpuPtr || size == 0 )
		return false;

	if( CUDA_FAILED(cudaHostAlloc(cpuPtr, size, cudaHostAllocMapped)) )
		return false;

	if( CUDA_FAILED(cudaHostGetDevicePointer(gpuPtr, *cpuPtr, 0)) )
		return false;

	memset(*cpuPtr, 0, size);
	printf("[cuda]  cudaAllocMapped %zu bytes, CPU %p GPU %p\n", size, *cpuPtr, *gpuPtr);
	return true;
}

From the documentation, I find:

So can someone please give me a scenario/example when one would use zero-copy memory vs. unified memory or vice versa?


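Zero-copy: the data is allocated in pinned (page-locked) host memory, and the GPU reads or writes it over PCI-E on every access; nothing is copied into GPU RAM. On devices where the CPU and GPU share the same physical memory, such as the Jetson boards that repo targets, there is no PCI-E hop, which makes zero-copy a natural fit there.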
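To make the zero-copy case concrete, here is a minimal sketch that does the same two calls as cudaAllocMapped() in the question and then lets a kernel work on the host buffer directly. The names scaleKernel and zeroCopyDemo are made up for illustration, and error handling is kept to the bare minimum.

```cuda
#include <cuda_runtime.h>

// Each thread scales one element in place. With zero-copy memory every load
// and store in this kernel goes to host RAM, not to device memory.
__global__ void scaleKernel(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// Hypothetical helper: fills a mapped host buffer with 1.0f, doubles it on
// the GPU, and returns true if the CPU then sees 2.0f in every element.
bool zeroCopyDemo(int n)
{
    float* cpuPtr = nullptr;
    float* gpuPtr = nullptr;

    // Pinned host allocation that is also mapped into the GPU address space.
    if (cudaHostAlloc((void**)&cpuPtr, n * sizeof(float), cudaHostAllocMapped) != cudaSuccess)
        return false;

    // Device-side alias of the same host memory.
    if (cudaHostGetDevicePointer((void**)&gpuPtr, cpuPtr, 0) != cudaSuccess) {
        cudaFreeHost(cpuPtr);
        return false;
    }

    for (int i = 0; i < n; ++i)
        cpuPtr[i] = 1.0f;

    // No cudaMemcpy anywhere: the kernel is handed the mapped pointer.
    scaleKernel<<<(n + 255) / 256, 256>>>(gpuPtr, n, 2.0f);

    // Kernel launches are asynchronous, so wait before reading on the CPU.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    for (int i = 0; ok && i < n; ++i)
        ok = (cpuPtr[i] == 2.0f);

    cudaFreeHost(cpuPtr);
    return ok;
}
```

Calling zeroCopyDemo(1 << 16) from main() should return true on a device that supports mapped host memory. This pattern tends to pay off when each element is touched once; a kernel that re-reads the same data many times would repeat the trip to host memory every time.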

UM: the data is moved between CPU and GPU RAM on demand. It is similar to copying manually before and after a kernel call, but CUDA manages it automatically. You allocate a single universal pointer and can access it from both sides. So it is never faster than manual memory management and may sometimes be slower (when the automatic heuristics guess badly), but it simplifies the program. Essentially, you can remove all explicit memory move operations and all separate device allocations, and allocate everything as UM arrays used by both CPU and GPU code.
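For comparison, a minimal sketch of the unified memory version: one cudaMallocManaged() allocation, one pointer, no explicit copies. The names addOneKernel and unifiedMemoryDemo are made up for illustration.

```cuda
#include <cuda_runtime.h>

// Each thread increments one element of the managed array.
__global__ void addOneKernel(int* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1;
}

// Hypothetical helper: writes 0..n-1 on the CPU, increments on the GPU, and
// returns true if the CPU then reads back 1..n through the same pointer.
bool unifiedMemoryDemo(int n)
{
    int* data = nullptr;

    // A single allocation that is valid on both the CPU and the GPU.
    if (cudaMallocManaged((void**)&data, n * sizeof(int)) != cudaSuccess)
        return false;

    // The CPU initializes the array through the managed pointer.
    for (int i = 0; i < n; ++i)
        data[i] = i;

    // The kernel receives the very same pointer; CUDA migrates the data.
    addOneKernel<<<(n + 255) / 256, 256>>>(data, n);

    // Synchronize before the CPU touches managed memory again.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    for (int i = 0; ok && i < n; ++i)
        ok = (data[i] == i + 1);

    cudaFree(data);
    return ok;
}
```

The cudaDeviceSynchronize() call is not optional here: on pre-Pascal GPUs the CPU must not access managed memory while a kernel that may use it is still running.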

The heuristic, AFAIR, is the following: when the CPU accesses data that is not currently resident on the CPU side, it is moved from the GPU on demand with page (4 KB) granularity. When data that a kernel could potentially access (i.e. data available to the stream the kernel belongs to) is not resident on the GPU side, the entire array is copied from the CPU to the GPU before the kernel starts.

Pascal and later GPUs can use on-demand copying with page granularity in both directions, but at the moment it is implemented only by the Linux (TCC?) driver.
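Rather than guessing from the GPU generation and the driver, you can ask the runtime. As far as I know, the cudaDevAttrConcurrentManagedAccess device attribute is the one that tracks this capability, so treat the sketch below as an assumption to verify against the documentation. The helper name supportsOnDemandPaging is made up.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: returns 1 if the device reports that it can access
// managed memory concurrently with the CPU (assumed here to indicate
// page-granular on-demand migration), 0 if not, and -1 if the query fails.
int supportsOnDemandPaging(int device)
{
    int value = 0;
    if (cudaDeviceGetAttribute(&value, cudaDevAttrConcurrentManagedAccess, device) != cudaSuccess)
        return -1;
    return value;
}
```

If supportsOnDemandPaging(0) returns 0, the whole-array copy before kernel start described above is what you should expect.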

PS: more info at http://www.drdobbs.com/parallel/unified-memory-in-cuda-6-a-brief-overvie/240169095