How does NVSHMEM achieve GPU-initiated RDMA?

I have three speculations:

  1. The GPU kernel talks to the device driver directly to initiate RDMA.
  2. The GPU kernel talks to a CPU thread through host shared memory. The host thread in turn uses UCX or Verbs directly to initiate RDMA on behalf of the kernel.
  3. There exists a GPU device-side RDMA library for initiating transfers, and NVSHMEM relies on it? That RDMA library talks to the device driver directly?

I wonder which one is true. Is this information even publicly available?

There is an open source Linux Nvidia driver.

Thank you for responding. Do you by any chance have a link to that? I can’t seem to find it, maybe because I am new to the forum. Also, even if I can see the code of the device driver, it doesn’t tell me how NVSHMEM is interacting with it, right? Does NVSHMEM interact with the device driver directly?

Here is a link to the CUDA drivers:

And a link to an early version of NVSHMEM:

NVSHMEM uses existing facilities and provides an interface layer.
I do not know specifics about the architecture.

Thank you for the links.

I think I figured out the answer to my question. According to this paper

NVSHMEM, at least back in 2020, had GPU threads talk to CPU progress threads, which process the GPU-initiated SHMEM requests by invoking verbs. The SHMEM runtime is largely managed by the CPU, and it acts as an intermediary. I suppose the GPU threads and host threads use CUDA shared memory to talk to each other.

Very good.

But the host thread cannot access GPU shared memory directly.

The paper mentions a shared segment of pinned (host) memory.