Does unified memory serve any purpose on a system with multiple GPUs that lacks peer-to-peer (P2P) support?
The CUDA C++ Programming Guide states the following under 19.3.2.3. Multi-GPU:
On Linux the managed memory is allocated in GPU memory as long as all GPUs that are actively being used by a program have the peer-to-peer support. If at any time the application starts using a GPU that doesn’t have peer-to-peer support with any of the other GPUs that have managed allocations on them, then the driver will migrate all managed allocations to system memory. In this case, all GPUs experience PCIe bandwidth restrictions.
My interpretation is that, on a system without P2P support, all allocations made with cudaMallocManaged will reside in host memory as soon as more than one GPU is in use. All memory accesses will then go over PCIe and the device memory will remain completely unused. This will cripple performance, even for kernels running on just one of the GPUs.
Most consumer cards (e.g., RTX 3060) do not support P2P. Thus, it would be a mistake to use unified memory in systems with multiple consumer GPUs. In this case, one should fall back to manual device allocations with cudaMalloc.
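(For reference, here is a minimal sketch of how one might check pairwise P2P support on a given system; the program is purely illustrative:)

```
#include <cstdio>
#include <cuda_runtime.h>

// Report whether each pair of GPUs in the system can access
// each other's memory via peer-to-peer.
int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            printf("GPU %d -> GPU %d : P2P %s\n", i, j,
                   canAccess ? "supported" : "not supported");
        }
    }
    return 0;
}
```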
Is the above interpretation and conclusion correct?
Yes, I would say generally that is correct. Perhaps a few things to point out:
some consumer cards do support P2P, namely those that accept an NVLink bridge. Some members of the RTX 20 series and RTX 30 series support an NVLink bridge, and therefore do support P2P with the bridged device when a bridge is installed.
some behavior may be modifiable with cudaMemAdvise hints. I have not tried it, but cudaMemAdviseReadMostly may allow data to migrate to a device, as long as the data is being used in a read-only fashion (see the sketch after this list).
UM might still be interesting when non-performance-critical code or data structures are being traversed, for example a doubly-linked list whose traversal guides behavior but is not dominant in performance-critical routines.
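To illustrate the cudaMemAdvise point above, here is an untested minimal sketch; the buffer size, device choice, and kernel usage are placeholders:

```
#include <cuda_runtime.h>

// Untested sketch: advise the driver that a managed buffer is read-mostly,
// which may allow a read-only copy to reside on a device even without P2P.
int main() {
    const size_t bytes = 1 << 20;   // placeholder size
    float *data = nullptr;
    cudaMallocManaged(&data, bytes);

    int dev = 0;
    cudaGetDevice(&dev);

    // Hint: this buffer will mostly be read, not written.
    cudaMemAdvise(data, bytes, cudaMemAdviseReadMostly, dev);

    // Optionally prefetch a read-only copy to the device.
    cudaMemPrefetchAsync(data, bytes, dev);

    // ... launch kernels that only read 'data' ...

    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}
```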
Could these limitations be overcome by running a separate process for each GPU?
If at any time the application starts using a GPU that doesn’t have peer-to-peer support with any of the other GPUs that have managed allocations on them, then the driver will migrate all managed allocations to system memory.
My interpretation of the above is that, as long as a given process does not use more than one GPU, unified memory can reside in device memory. Thus, one approach is to run a separate process for each GPU and communicate between the processes on the host. A limitation is that the GPUs cannot communicate with each other directly.
If it were me, I would simply try it, and observe what happens. If I wanted to be certain that the CUDA runtime only had 1 GPU “in view”, I would try to use the CUDA_VISIBLE_DEVICES variable to select that.
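As a minimal sketch of what I mean, assuming each process is launched with something like CUDA_VISIBLE_DEVICES=0 ./app (and CUDA_VISIBLE_DEVICES=1 ./app for the second process; the program name is a placeholder):

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // With CUDA_VISIBLE_DEVICES=<id> set in the environment, the runtime
    // enumerates only that GPU, so the count should be 1 and device 0
    // maps to the selected physical GPU.
    int n = 0;
    cudaGetDeviceCount(&n);
    printf("visible devices: %d\n", n);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("device 0: %s\n", prop.name);

    // Managed allocations made by this process now have only one GPU
    // in view, so they should be able to reside in that GPU's memory.
    float *data = nullptr;
    cudaMallocManaged(&data, 1 << 20);
    // ... use 'data' in kernels on device 0 ...
    cudaFree(data);
    return 0;
}
```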