Does unified memory serve any purpose on a system with multiple GPUs that lacks peer-to-peer (P2P) support?
The CUDA C++ Programming Guide states the following under 19.3.2.3. Multi-GPU:
On Linux the managed memory is allocated in GPU memory as long as all GPUs that are actively being used by a program have the peer-to-peer support. If at any time the application starts using a GPU that doesn’t have peer-to-peer support with any of the other GPUs that have managed allocations on them, then the driver will migrate all managed allocations to system memory. In this case, all GPUs experience PCIe bandwidth restrictions.
My interpretation is that, on a system without P2P support, all allocations made with cudaMallocManaged will reside in host memory as soon as more than one GPU is in use. All memory accesses will then go over PCIe and the device memory will remain completely unused. This will cripple performance, even for kernels running on just one of the GPUs.
Most consumer cards (e.g., RTX 3060) do not support P2P. Thus, it would be a mistake to use unified memory in systems with multiple consumer GPUs. In this case, one should fall back to manual device allocations with cudaMalloc.
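(For reference, here is a minimal sketch of how one might check pairwise P2P support on a given system; the program is purely illustrative:)

```
#include <cstdio>
#include <cuda_runtime.h>

// Report whether each pair of GPUs in the system can access
// each other's memory via peer-to-peer.
int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            printf("GPU %d -> GPU %d : P2P %s\n", i, j,
                   canAccess ? "supported" : "not supported");
        }
    }
    return 0;
}
```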
Is the above interpretation and conclusion correct?
Yes, I would say generally that is correct. Perhaps a few things to point out:
some consumer cards do support P2P, namely those that accept an NVLink bridge. Some members of the RTX 20 series and RTX 30 series support an NVLink bridge, and therefore do support P2P with the bridged device when a bridge is installed.
some behavior may be modifiable with cudaMemAdvise hints. I have not tried it, but cudaMemAdviseReadMostly may allow data to migrate to a device, as long as the data is being used in a read-only fashion (see the sketch after this list).
UM might still be interesting when non-performance-critical code or data structures are being traversed, for example a doubly-linked list whose traversal guides behavior but is not dominant in performance-critical routines.
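To illustrate the cudaMemAdvise point above, here is an untested minimal sketch; the buffer size, device choice, and kernel usage are placeholders:

```
#include <cuda_runtime.h>

// Untested sketch: advise the driver that a managed buffer is read-mostly,
// which may allow a read-only copy to reside on a device even without P2P.
int main() {
    const size_t bytes = 1 << 20;   // placeholder size
    float *data = nullptr;
    cudaMallocManaged(&data, bytes);

    int dev = 0;
    cudaGetDevice(&dev);

    // Hint: this buffer will mostly be read, not written.
    cudaMemAdvise(data, bytes, cudaMemAdviseReadMostly, dev);

    // Optionally prefetch a read-only copy to the device.
    cudaMemPrefetchAsync(data, bytes, dev);

    // ... launch kernels that only read 'data' ...

    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}
```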
Could these limitations be overcome by running a separate process for each GPU?
If at any time the application starts using a GPU that doesn’t have peer-to-peer support with any of the other GPUs that have managed allocations on them, then the driver will migrate all managed allocations to system memory.
My interpretation of the above is that, as long as a given process does not use more than one GPU, unified memory can reside in device memory. Thus, one approach is to run a separate process for each GPU and communicate between the processes on the host. A limitation is that the GPUs cannot communicate with each other directly.
If it were me, I would simply try it, and observe what happens. If I wanted to be certain that the CUDA runtime only had 1 GPU “in view”, I would try to use the CUDA_VISIBLE_DEVICES variable to select that.
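As a minimal sketch of what I mean, assuming each process is launched with something like CUDA_VISIBLE_DEVICES=0 ./app (and CUDA_VISIBLE_DEVICES=1 ./app for the second process; the program name is a placeholder):

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // With CUDA_VISIBLE_DEVICES=<id> set in the environment, the runtime
    // enumerates only that GPU, so the count should be 1 and device 0
    // maps to the selected physical GPU.
    int n = 0;
    cudaGetDeviceCount(&n);
    printf("visible devices: %d\n", n);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("device 0: %s\n", prop.name);

    // Managed allocations made by this process now have only one GPU
    // in view, so they should be able to reside in that GPU's memory.
    float *data = nullptr;
    cudaMallocManaged(&data, 1 << 20);
    // ... use 'data' in kernels on device 0 ...
    cudaFree(data);
    return 0;
}
```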