How does CUDA behave differently in a multi-GPU system compared to a single-GPU system?

Hello, I am new to CUDA and GPU architecture. I recently surveyed several studies on multi-GPU architectures, including GPUpd [MICRO’17], MGPUSim [ISCA’19], Griffin [HPCA’20], GPS [MICRO’21], IDYLL [MICRO’23], and GRIT [HPCA’24]. I noticed that some terms related to inter-GPU memory communication are frequently mentioned, such as on-touch page migration and RDMA-based direct cache access.

However, I find it difficult to connect these terms to their real implementations. For instance, which techniques do NVIDIA and CUDA actually use in a multi-GPU environment? Are these techniques used only for inter-GPU communication? Since academic research often relies on simulation, I am not sure whether these mechanisms are actually used by CUDA in NVIDIA's multi-GPU systems.

With these thoughts in mind, here is my main question: how does CUDA behave differently in a multi-GPU system compared to a single-GPU system?

I have searched online and in forums, but the information I found is quite scattered. The NVIDIA UVM driver, CUDA-MPI, NCCL, the page migration engine, many CUDA P2P samples, and some posts may be relevant to this question. However, there seems to be no systematic, unified description of how those concepts (page migration, direct cache access, etc.) show up in CUDA for multi-GPU scenarios. Any suggestions would be greatly appreciated!

These terms relate to inter-GPU memory communication and describe optimizations for data transfer and memory sharing between GPUs. Here is an explanation of the two terms you mentioned:

  1. On-touch Page Migration:

    • In multi-GPU systems, memory management is crucial, especially when different GPUs need to access each other’s memory. Typically, each GPU has its own isolated memory, but in some cases, data needs to be shared between GPUs.
    • “On-touch page migration” refers to the automatic migration of data between GPUs when one GPU accesses memory that currently resides on another. The faulting access triggers the system to migrate the containing page to the requesting GPU’s memory, and the access completes once the page has arrived, with no explicit copy operations in application code. In CUDA, the closest visible counterpart is Unified Memory and the on-demand migration performed by the UVM driver (see the first sketch below).
  2. RDMA-based Direct Cache Access:

    • RDMA (Remote Direct Memory Access) is a technology that allows direct access to remote memory without involving the CPU, which is typically used for high-performance data transfer.
    • In multi-GPU systems, the same idea applies to memory access between GPUs. “RDMA-based direct cache access” means that one GPU can read or write another GPU’s cache or memory directly, without staging the data through an intermediate buffer or main memory. This reduces latency and improves effective bandwidth, which is especially useful in large-scale distributed workloads such as multi-GPU deep learning training. The closest CUDA-level mechanism is peer-to-peer (GPUDirect P2P) access over NVLink or PCIe (see the second sketch below).

Both of these techniques aim to improve the efficiency of inter-GPU memory communication by reducing latency and bandwidth bottlenecks. They are particularly useful in high-performance computing workloads such as machine learning and large-scale parallel computing.
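On the CUDA side, the closest publicly documented counterpart of “on-touch” migration is Unified Memory: allocations made with cudaMallocManaged are migrated at page granularity by the UVM driver (using the Page Migration Engine on Pascal and newer GPUs) when a GPU or the CPU first touches them. Below is a minimal sketch, assuming two GPUs that report concurrentManagedAccess; the device IDs, buffer size, and omitted error checks are purely illustrative:

```cpp
// Minimal sketch: on-demand (first-touch) page migration with CUDA Unified Memory.
// Assumes two GPUs that support concurrentManagedAccess; device IDs 0 and 1 and the
// buffer size are illustrative, and error checking is omitted for brevity.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;            // first access faults the page in and migrates it
}

int main() {
    const int n = 1 << 20;
    int *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(int));   // one pointer, valid on every GPU and the CPU

    cudaSetDevice(0);
    touch<<<(n + 255) / 256, 256>>>(data, n);    // pages migrate to GPU 0 on first touch
    cudaDeviceSynchronize();

    cudaSetDevice(1);
    touch<<<(n + 255) / 256, 256>>>(data, n);    // the same pages now migrate (or map) to GPU 1
    cudaDeviceSynchronize();

    // Hints to the UVM driver, instead of relying purely on demand faults:
    cudaMemAdvise(data, n * sizeof(int), cudaMemAdviseSetPreferredLocation, 1);
    cudaMemPrefetchAsync(data, n * sizeof(int), 1);   // explicit bulk migration to GPU 1
    cudaDeviceSynchronize();

    printf("data[0] = %d\n", data[0]);           // CPU touch migrates the page back to the host
    cudaFree(data);
    return 0;
}
```

Whether a page is actually migrated or left mapped remotely is decided by driver heuristics (and, on newer architectures, hardware access counters), which is exactly the layer the academic papers you listed tend to model in simulation.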
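For direct remote access without migration, the closest CUDA-level mechanism is peer-to-peer (GPUDirect P2P) access: once peer access is enabled, a kernel running on one GPU can dereference a pointer to memory that physically resides on another GPU, with the loads and stores traveling over NVLink or PCIe. How such remote accesses interact with the caches is not something CUDA documents or exposes, so “RDMA-based direct cache access” in the papers is best read as a research-level refinement of this mechanism rather than a named CUDA feature. A hedged sketch, again with illustrative device IDs and minimal error checking:

```cpp
// Sketch of CUDA peer-to-peer (P2P) access: a kernel on GPU 0 reads memory resident on GPU 1.
// Device IDs and sizes are illustrative; real code should check every return value.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void readRemote(const float *remote, float *local, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) local[i] = remote[i];   // loads go directly over NVLink/PCIe to GPU 1's memory
}

int main() {
    const int n = 1 << 20;
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);   // can device 0 access device 1's memory?
    if (!canAccess) { printf("P2P not supported between GPU 0 and GPU 1\n"); return 0; }

    float *onGpu1 = nullptr, *onGpu0 = nullptr;
    cudaSetDevice(1);
    cudaMalloc(&onGpu1, n * sizeof(float));      // buffer physically on GPU 1

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);            // map GPU 1's memory into GPU 0's address space
    cudaMalloc(&onGpu0, n * sizeof(float));

    readRemote<<<(n + 255) / 256, 256>>>(onGpu1, onGpu0, n);   // no cudaMemcpy involved
    cudaDeviceSynchronize();

    // Explicit bulk copies can also take the peer path:
    cudaMemcpyPeer(onGpu0, 0, onGpu1, 1, n * sizeof(float));

    cudaFree(onGpu0);
    cudaSetDevice(1);
    cudaFree(onGpu1);
    return 0;
}
```

For GPUs in different nodes, the analogous path is GPUDirect RDMA through the NIC, which is typically driven by NCCL or a CUDA-aware MPI rather than by application code directly.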

Hi! Thanks for the explanation of these terms. I actually want to know more about their implementations in CUDA or the GPU driver for multi-GPU scenarios. Although there are many guides and open-source codebases, there doesn’t seem to be a clear document summarizing how memory communication differs between multi-GPU and single-GPU systems in CUDA’s actual implementation. In contrast, I can find plenty of well-organized docs about the hardware side (e.g., NVLink).

Or maybe I am just overthinking this, and the difference between the two settings is actually not that significant? :-)

On-demand page migration is covered in unit 6 of this online training series.

Thanks. I will check these materials.