Does CUDA unified memory solve data movement issues on newer GPUs?

How does CUDA unified memory handle data movement?

Let’s say I have a program that looks like this -

float *y;
cudaMallocManaged(&y, ...);

host_kernel(y);                        // CPU touches y
device_kernel<<<grid, block>>>(y);     // GPU touches y
cudaDeviceSynchronize();               // required before the CPU touches y again
host_kernel(y);
device_kernel<<<grid, block>>>(y);
cudaDeviceSynchronize();
host_kernel(y);
device_kernel<<<grid, block>>>(y);
cudaDeviceSynchronize();

Will I take a performance hit due to data movement between device and host kernels?
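To make the question concrete, here is how I was planning to measure the cost of each switch (a minimal sketch with cudaEvent timers; my understanding is that on-demand page migration shows up as extra kernel time, so timing the kernel should expose it):

cudaEvent_t t0, t1;
cudaEventCreate(&t0);
cudaEventCreate(&t1);
cudaEventRecord(t0);
device_kernel<<<grid, block>>>(y);     // page faults/migration are included in this interval
cudaEventRecord(t1);
cudaEventSynchronize(t1);
float ms = 0.0f;
cudaEventElapsedTime(&ms, t0, t1);
printf("device_kernel: %.3f ms\n", ms);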

If it depends on the GPU, at what generation does this start becoming efficient? Would Compute Capability 6+ (Pascal) suffice? https://developer.nvidia.com/blog/unified-memory-cuda-beginners/#what_happens_on_pascal_when_i_call_cudamallocmanaged
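For what it's worth, this is how I'm checking whether a given GPU supports on-demand page migration at all (assuming concurrentManagedAccess is the right attribute to query):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int dev = 0, concurrent = 0;
    cudaGetDevice(&dev);
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, dev);
    // 1: Pascal or later with a supporting OS/driver, pages migrate on demand
    // 0: pre-Pascal behavior, bulk migration of the allocation on each switch
    printf("concurrentManagedAccess = %d\n", concurrent);
    return 0;
}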

Do I need to explicitly prefetch the data from host code (e.g., with cudaMemPrefetchAsync) to reduce data movement overheads? https://developer.nvidia.com/blog/unified-memory-cuda-beginners/#what_happens_on_pascal_when_i_call_cudamallocmanaged
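In case it helps, this is the prefetch pattern I had in mind (a sketch only; it assumes y, N, grid, block, a stream, and dev from cudaGetDevice already exist):

cudaMemPrefetchAsync(y, N * sizeof(float), dev, stream);              // pages -> GPU ahead of the kernel
device_kernel<<<grid, block, 0, stream>>>(y);
cudaMemPrefetchAsync(y, N * sizeof(float), cudaCpuDeviceId, stream);  // pages -> CPU ahead of host code
cudaStreamSynchronize(stream);                                        // safe for the host to touch y now
host_kernel(y);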

For better performance, should I come up with a scheme that only allocates as much data as fits in GPU memory and then processes the next batch (roughly as sketched below)? See the Stack Overflow question "CUDA - Unified memory (Pascal at least)".
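And this is roughly the batching scheme I mean, sketched under the assumption that device_kernel can take a (pointer, count) pair and that half of free device memory is a safe batch size:

size_t free_b = 0, total_b = 0;
cudaMemGetInfo(&free_b, &total_b);
size_t chunk = (free_b / 2) / sizeof(float);   // leave headroom: half of free memory per batch
for (size_t off = 0; off < N; off += chunk) {
    size_t n = (N - off < chunk) ? (N - off) : chunk;
    cudaMemPrefetchAsync(y + off, n * sizeof(float), dev, stream);             // stage this batch on the GPU
    device_kernel<<<grid, block, 0, stream>>>(y + off, n);                     // hypothetical (ptr, count) signature
    cudaMemPrefetchAsync(y + off, n * sizeof(float), cudaCpuDeviceId, stream); // evict before the next batch
}
cudaStreamSynchronize(stream);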
