I've been having trouble porting a GPUDirect-based solution from an off-the-shelf system to a custom-built server.
GPUDirect worked on the old system but does not on the new one.
In this application, a peripheral PCIe device has a DMA engine that writes two streams directly to two different physical memory blocks.
The primary mode of operation has the device writing to a block of pinned, registered GPU memory as well as to a block of pinned CPU memory.
All addresses are 64-bit.
Simply put, the device receives the physical address and size of each memory block, plus control commands.
Since the device only sees a physical address, it is agnostic to the destination; all combinations of GPU and CPU memory are possible and have been proven to work on the original system.
GPUDirect is used to register and pin the memory block and to obtain the physical address of its first page (the pages have been shown to be contiguous). This is done with a primitive implementation of the GPUDirect RDMA documentation guidelines (including bookkeeping and cleanup) in a custom kernel driver that operates on cudaMalloc'd memory.
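For context, the pinning path in the driver is conceptually along the lines of the sketch below. This is heavily simplified and the function/variable names are illustrative; the ioctl plumbing, locking, error handling and the bookkeeping/cleanup are omitted. It is based on the GPUDirect RDMA kernel API (nv-p2p.h), not a copy of the actual code:

```c
/* Simplified sketch of the pin-and-translate path, based on the GPUDirect
 * RDMA kernel API (nv-p2p.h). Names are illustrative; ioctl plumbing,
 * locking and cleanup are omitted. */
#include <linux/types.h>
#include <nv-p2p.h>

#define GPU_PAGE_SIZE   (64UL * 1024)   /* GPUDirect RDMA pins 64 KiB pages */
#define GPU_PAGE_MASK   (~(GPU_PAGE_SIZE - 1))

static struct nvidia_p2p_page_table *page_table;

static void free_callback(void *data)
{
    /* Invoked by the NVIDIA driver if the mapping is revoked under us. */
    nvidia_p2p_free_page_table(page_table);
    page_table = NULL;
}

/* gpu_va: the cudaMalloc'd address handed down from user space.
 * Returns the physical address of the first pinned page, which is what
 * the DMA engine gets programmed with (pages are assumed contiguous). */
static int pin_gpu_block(u64 gpu_va, u64 len, u64 *phys_out)
{
    u64 aligned = gpu_va & GPU_PAGE_MASK;
    u64 span    = (gpu_va + len) - aligned;
    int ret;

    ret = nvidia_p2p_get_pages(0, 0, aligned, span,
                               &page_table, free_callback, NULL);
    if (ret)
        return ret;

    *phys_out = page_table->pages[0]->physical_address;
    return 0;
}
```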
The topology of the original system was a single GPU (Quadro RTX 5000) and a single PCIe card under a common switch connected to the CPU's root complex.
The new system, besides some other server hardware changes (including a server BIOS/UEFI with lots of things to configure), has two newer GPUs (RTX A4500) and is supposed to have two PCIe cards.
The PCIe network is two-tiered, with three switches: each GPU/card pair is connected to its own switch, and a third switch connects those two switches to the CPU root complex.
Despite this topology change, all the endpoints are supposedly in the same memory space.
Now to the actual problem:
In the new system, DMA only works when host memory is targeted. When GPU memory is targeted, no data arrives, except maybe some random display glitches.
Since the two DMA streams are dependent, if we target one at the host and one at a GPU, the host still receives data; this means data is being written somewhere in the address space, just not to the desired memory on the GPU.
We check this by "dirtying" the memory before the DMA and inspecting it afterwards, as in the sketch below.
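For the GPU side, the check is roughly the following host-side sketch (illustrative only; the actual pinning and DMA kick-off go through our driver and are only indicated by a comment):

```c
/* Illustrative host-side check: fill the GPU block with a known pattern,
 * run the DMA, then read it back and see whether anything changed. */
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t len = 1 << 20;              /* 1 MiB test block */
    unsigned char *d_buf = NULL;
    unsigned char *h_buf = malloc(len);

    cudaMalloc((void **)&d_buf, len);
    cudaMemset(d_buf, 0xAB, len);            /* "dirty" the block */

    /* ... pin d_buf through the custom driver, program the device with
     *     the returned physical address, start the DMA, wait for done ... */

    cudaMemcpy(h_buf, d_buf, len, cudaMemcpyDeviceToHost);

    size_t untouched = 0;
    for (size_t i = 0; i < len; i++)
        untouched += (h_buf[i] == 0xAB);
    printf("%zu of %zu bytes still hold the dirty pattern\n", untouched, len);

    free(h_buf);
    cudaFree(d_buf);
    return 0;
}
```

On the new system the GPU block comes back entirely untouched.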
Before fully porting the entire OS, we first tried RHEL 9.4 with one of the recent nvidia-open drivers and CUDA 12.5; the application as well as the drivers were ported and built for the new OS.
Since we couldn't make it work, we cloned the original OS in order to remove as many changed variables as possible.
We ported the operating system from the original system as-is (Ubuntu 20.04 with the CUDA 11.6 SDK and TBD NVIDIA driver).
But ALAS, same phenomenon: host memory is written, GPU memory is not. When directing both streams to the host, they arrive as expected.
Back on the modern OS, we ran p2pBandwidthLatencyTest and other tests from the recent CUDA samples, and they all ran as expected.
nvidia-smi topo -p and nvidia-smi topo -mp both show that there is an open PCIe and GPUDirect path between the GPUs.
We've built and run gdrcopy (we plan to use it as a basis for a more robust driver and API), and its GPUDirect and host tests all look OK.
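The direction we are heading with gdrcopy is roughly the user-space sanity check below (a sketch against gdrcopy's gdrapi.h; error handling omitted, gdrdrv module assumed loaded, and cudaMalloc assumed to return a 64 KiB-aligned pointer since gdrcopy expects GPU-page alignment):

```c
/* Sketch of a gdrcopy-based sanity check: pin a cudaMalloc'd buffer,
 * map its BAR1 window into user space, write through the mapping and
 * read it back with cudaMemcpy to confirm GPU memory is really hit. */
#include <gdrapi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const size_t len = 64 * 1024;            /* one GPU page */
    void *d_buf = NULL;
    cudaMalloc(&d_buf, len);

    gdr_t g = gdr_open();
    gdr_mh_t mh;
    gdr_pin_buffer(g, (unsigned long)d_buf, len, 0, 0, &mh);

    void *bar_map = NULL;
    gdr_map(g, mh, &bar_map, len);

    /* Write a pattern through the BAR1 mapping. */
    unsigned char pattern[64 * 1024];
    memset(pattern, 0x5A, sizeof(pattern));
    gdr_copy_to_mapping(mh, bar_map, pattern, len);

    /* Read back through the normal CUDA path and compare. */
    unsigned char check[8];
    cudaMemcpy(check, d_buf, sizeof(check), cudaMemcpyDeviceToHost);
    printf("first byte as seen by CUDA: 0x%02x\n", check[0]);

    gdr_unmap(g, mh, bar_map, len);
    gdr_unpin_buffer(g, mh);
    gdr_close(g);
    cudaFree(d_buf);
    return 0;
}
```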
My assumption is that either we've missed something in the BIOS or the switch configuration, or that NVIDIA's software and drivers somehow use black magic to traverse this more complex topology in a way that my original, primitive driver and software aren't built for.
We are currently working on reimplementing our software stack with a better-validated mechanism for interacting with GPUDirect.
Are there any other tests we can perform, or paths we can take, to verify that the hardware is configured as it should be? Or any suggestions for how to debug, or at least approach, this problem?
Any insights would help.