How does CUDA unified memory handle data movement?
Let’s say I have a program that looks like this:

cudaMallocManaged(&y, ...);          // y is accessible from both the CPU and the GPU
host_kernel(y);                      // the CPU reads/writes y
device_kernel<<<grid, block>>>(y);   // then the GPU reads/writes the same y
cudaaDeviceSynchronize();
host_kernel(y);
device_kernel<<<grid, block>>>(y);
cudaDeviceSynchronize();
host_kernel(y);
device_kernel<<<grid, block>>>(y);
Will I take a performance hit due to data movement between device and host kernels?
If this depends on the GPU, at what generation does it start becoming efficient? Would Compute Capability 6.0+ (Pascal) suffice? https://developer.nvidia.com/blog/unified-memory-cuda-beginners/#what_happens_on_pascal_when_i_call_cudamallocmanaged
Do I need to explicitly prefetch the data to reduce the data-movement overhead?
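To make the prefetch question concrete, this is roughly the pattern I have in mind from the blog post above (just a sketch assuming Pascal or newer, a single device, and the default stream; N, grid, and block are placeholders):

int device = 0;
cudaGetDevice(&device);

cudaMallocManaged(&y, N * sizeof(float));
host_kernel(y);                                               // CPU initializes y

cudaMemPrefetchAsync(y, N * sizeof(float), device);           // migrate y to the GPU before the launch
device_kernel<<<grid, block>>>(y);

cudaMemPrefetchAsync(y, N * sizeof(float), cudaCpuDeviceId);  // migrate it back before the CPU touches it
cudaDeviceSynchronize();
host_kernel(y);

My understanding is that the prefetches migrate the pages in bulk instead of triggering on-demand page faults, but I’m not sure whether both directions are actually necessary.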
For better performance, should I come up with an algorithm that allocates only as much memory as fits on the GPU at once and then processes the data in batches? (Related: “CUDA - Unified memory (Pascal at least)” on Stack Overflow.)
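For the batching idea, here is a self-contained sketch of what I mean. Instead of reallocating per batch, it prefetches one chunk at a time so only that chunk has to be resident on the GPU (the sizes, kernel body, and launch configuration are all made up):

#include <cuda_runtime.h>

__global__ void device_kernel(float* y, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) y[i] *= 2.0f;                                      // placeholder work
}

int main() {
    const size_t N = 1 << 28;                                     // total elements (assumed)
    const size_t CHUNK = 1 << 24;                                 // per-batch elements, sized to fit in GPU memory (assumed)
    int device = 0;
    cudaGetDevice(&device);

    float* y = nullptr;
    cudaMallocManaged(&y, N * sizeof(float));                     // error checking omitted for brevity

    for (size_t off = 0; off < N; off += CHUNK) {
        size_t n = (N - off < CHUNK) ? (N - off) : CHUNK;
        cudaMemPrefetchAsync(y + off, n * sizeof(float), device); // stage this batch on the GPU
        device_kernel<<<(unsigned)((n + 255) / 256), 256>>>(y + off, n);
        cudaMemPrefetchAsync(y + off, n * sizeof(float), cudaCpuDeviceId); // return it to the CPU
    }
    cudaDeviceSynchronize();

    cudaFree(y);
    return 0;
}

Everything is on the default stream here, so each chunk’s prefetch, kernel, and write-back serialize; with one stream per chunk the transfers could overlap the previous chunk’s compute, but that seemed beyond the scope of the question.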