How does NVSHMEM achieve GPU-initiated RDMA?

I have three speculations:

  1. The GPU kernel talks to the device driver directly to initiate RDMA.
  2. The GPU kernel talks to a CPU thread through host shared memory. The host thread in turn uses UCX or Verbs directly to initiate RDMA on behalf of the kernel.
  3. There exists a GPU device-side RDMA library for initiating transfers, and NVSHMEM relies on it? That RDMA library talks to the device driver directly?

I wonder which one is true. Is this information even publicly available?

There is an open source Linux Nvidia driver.

Thank you for responding. Do you by any chance have a link to that? I can’t seem to find it, maybe because I am new to the forum. Also, even if I can see the code of the device driver, it doesn’t tell me how NVSHMEM is interacting with it, right? Does NVSHMEM interact with the device driver directly?

Here is a link to the CUDA drivers:

And a link to an early version of NVSHMEM:

NVSHMEM uses existing facilities and provides an interface layer.
I do not know specifics about the architecture.

Thank you for the links.

I think I figured out the answer to my question. According to this paper

NVSHMEM, at least back in 2020, had GPU threads talk to CPU progress threads, which process the GPU-initiated SHMEM requests by invoking verbs. The SHMEM runtime is largely managed by the CPU, and it acts as an intermediary. I suppose the GPU threads and host threads use CUDA shared memory to talk to each other.

Very good.

But the host thread cannot access GPU shared memory directly.

The paper mentions a shared segment of pinned (host) memory.