I've been having trouble porting a GPUDirect-based solution from an off-the-shelf system to a custom-built server.
GPUDirect worked on the old system but does not on the new one.
In this application, a peripheral PCIe device has a DMA engine that writes two streams directly to two different physical memory blocks.
The primary mode of operation has the device writing to a block of pinned, registered GPU memory as well as to a block of pinned CPU memory.
All addresses are 64-bit.
Simply put, the device receives the physical address and size of each memory block, plus control commands.
Since the device only sees a physical address, it is agnostic to the destination; all combinations of GPU and CPU memory are possible and have been proven to work on the original system.
GPUDirect is used to register and pin the memory block and to obtain the physical address of its first page (the pages have been shown to be contiguous). This is done with a primitive implementation of the GPUDirect RDMA documentation guidelines (including bookkeeping and cleanup) in a custom kernel driver that operates on cudaMalloc'd memory.
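For context, the pinning path in the driver is conceptually along the lines of the sketch below. This is heavily simplified and the function/variable names are illustrative; the ioctl plumbing, locking, error handling and the bookkeeping/cleanup are omitted. It is based on the GPUDirect RDMA kernel API (nv-p2p.h), not a copy of the actual code:

```c
/* Simplified sketch of the pin-and-translate path, based on the GPUDirect
 * RDMA kernel API (nv-p2p.h). Names are illustrative; ioctl plumbing,
 * locking and cleanup are omitted. */
#include <linux/types.h>
#include <nv-p2p.h>

#define GPU_PAGE_SIZE   (64UL * 1024)   /* GPUDirect RDMA pins 64 KiB pages */
#define GPU_PAGE_MASK   (~(GPU_PAGE_SIZE - 1))

static struct nvidia_p2p_page_table *page_table;

static void free_callback(void *data)
{
    /* Invoked by the NVIDIA driver if the mapping is revoked under us. */
    nvidia_p2p_free_page_table(page_table);
    page_table = NULL;
}

/* gpu_va: the cudaMalloc'd address handed down from user space.
 * Returns the physical address of the first pinned page, which is what
 * the DMA engine gets programmed with (pages are assumed contiguous). */
static int pin_gpu_block(u64 gpu_va, u64 len, u64 *phys_out)
{
    u64 aligned = gpu_va & GPU_PAGE_MASK;
    u64 span    = (gpu_va + len) - aligned;
    int ret;

    ret = nvidia_p2p_get_pages(0, 0, aligned, span,
                               &page_table, free_callback, NULL);
    if (ret)
        return ret;

    *phys_out = page_table->pages[0]->physical_address;
    return 0;
}
```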
The topology of the original system was a single GPU (Quadro RTX 5000) and a single PCIe card under a common switch connected to the CPU's root complex.
The new system, besides some other server hardware changes (including a server BIOS/UEFI with lots of things to configure), has two newer GPUs (RTX A4500) and is supposed to have two PCIe cards.
The PCIe network is two-tiered, with three switches: each GPU/card pair is connected to its own switch, and a third switch connects those two switches to the CPU root complex.
Despite this topology change, all the endpoints are supposedly in the same memory space.
Now to the actual problem:
In the new system, DMA only works when host memory is targeted. When GPU memory is targeted, no data arrives, except maybe some random display glitches.
Since the two DMA streams are dependent, if we target one at the host and one at a GPU, the host still receives data; this means data is being written somewhere in the address space, just not to the desired memory on the GPU.
We check this by "dirtying" the memory before the DMA and inspecting it afterwards, as in the sketch below.
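For the GPU side, the check is roughly the following host-side sketch (illustrative only; the actual pinning and DMA kick-off go through our driver and are only indicated by a comment):

```c
/* Illustrative host-side check: fill the GPU block with a known pattern,
 * run the DMA, then read it back and see whether anything changed. */
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t len = 1 << 20;              /* 1 MiB test block */
    unsigned char *d_buf = NULL;
    unsigned char *h_buf = malloc(len);

    cudaMalloc((void **)&d_buf, len);
    cudaMemset(d_buf, 0xAB, len);            /* "dirty" the block */

    /* ... pin d_buf through the custom driver, program the device with
     *     the returned physical address, start the DMA, wait for done ... */

    cudaMemcpy(h_buf, d_buf, len, cudaMemcpyDeviceToHost);

    size_t untouched = 0;
    for (size_t i = 0; i < len; i++)
        untouched += (h_buf[i] == 0xAB);
    printf("%zu of %zu bytes still hold the dirty pattern\n", untouched, len);

    free(h_buf);
    cudaFree(d_buf);
    return 0;
}
```

On the new system the GPU block comes back entirely untouched.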
Before fully porting the entire OS, we first tried RHEL 9.4 with one of the recent nvidia-open drivers and CUDA 12.5; the application as well as the drivers were ported and built for the new OS.
Since we couldn't make it work, we cloned the original OS in order to remove as many changed variables as possible.
We ported the operating system from the original system as-is (Ubuntu 20.04 with the CUDA 11.6 SDK and TBD NVIDIA driver).
But ALAS, same phenomenon: host memory is written, GPU memory is not. When directing both streams to the host, they arrive as expected.
Back on the modern OS, we ran p2pBandwidthLatencyTest and other tests from the recent CUDA samples, and they all ran as expected.
nvidia-smi topo -p and nvidia-smi topo -mp both show that there is an open PCIe and GPUDirect path between the GPUs.
We've built and run gdrcopy (we plan to use it as a basis for a more robust driver and API), and its GPUDirect and host tests all look OK.
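The direction we are heading with gdrcopy is roughly the user-space sanity check below (a sketch against gdrcopy's gdrapi.h; error handling omitted, gdrdrv module assumed loaded, and cudaMalloc assumed to return a 64 KiB-aligned pointer since gdrcopy expects GPU-page alignment):

```c
/* Sketch of a gdrcopy-based sanity check: pin a cudaMalloc'd buffer,
 * map its BAR1 window into user space, write through the mapping and
 * read it back with cudaMemcpy to confirm GPU memory is really hit. */
#include <gdrapi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const size_t len = 64 * 1024;            /* one GPU page */
    void *d_buf = NULL;
    cudaMalloc(&d_buf, len);

    gdr_t g = gdr_open();
    gdr_mh_t mh;
    gdr_pin_buffer(g, (unsigned long)d_buf, len, 0, 0, &mh);

    void *bar_map = NULL;
    gdr_map(g, mh, &bar_map, len);

    /* Write a pattern through the BAR1 mapping. */
    unsigned char pattern[64 * 1024];
    memset(pattern, 0x5A, sizeof(pattern));
    gdr_copy_to_mapping(mh, bar_map, pattern, len);

    /* Read back through the normal CUDA path and compare. */
    unsigned char check[8];
    cudaMemcpy(check, d_buf, sizeof(check), cudaMemcpyDeviceToHost);
    printf("first byte as seen by CUDA: 0x%02x\n", check[0]);

    gdr_unmap(g, mh, bar_map, len);
    gdr_unpin_buffer(g, mh);
    gdr_close(g);
    cudaFree(d_buf);
    return 0;
}
```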
My assumption is that either we've missed something in the BIOS or the switch configuration, or that NVIDIA's software and drivers somehow use black magic to traverse this more complex topology in a way that my original, primitive driver and software aren't built for.
We are currently working on reimplementing our software stack with a better-validated mechanism for interacting with GPUDirect.
Are there any other tests we can perform, or paths we can take, to verify that the hardware is configured as it should be? Or any suggestions for how to debug, or at least approach, this problem?
Any insights would help.