GPUDirect RDMA -- perftest not actually writing to GPU

Hello, I am trying to get GPUDirect RDMA working on our machine:

Specifications:
OS: Ubuntu 24.04, kernel 6.8.0-88-generic
GPU: NVIDIA A100-40GB, driver 570.86.10, CUDA 12.8
NIC: ConnectX-6 Dx, 100GbE

nvidia-smi topo -m confirms a PHB connection between NIC and GPU.

I followed the DOCA documentation to install doca-all, then the DOCA-GPUNetIO page to finish the GPUDirect RDMA configuration. The only step I did not apply was disabling the IOMMU in the BIOS, since I do not have immediate access to it.
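Since I cannot reach the BIOS, what I can at least do from the OS is check the IOMMU state and, if needed, put it in passthrough mode via the kernel command line (a sketch; the GRUB path is the Ubuntu default, and iommu=pt is the usual BIOS-less alternative):

```shell
# Check whether an IOMMU is active and how it was configured at boot
cat /proc/cmdline
sudo dmesg | grep -i -E 'DMAR|IOMMU' | head

# Possible workaround without BIOS access: set the IOMMU to passthrough by
# editing GRUB_CMDLINE_LINux_DEFAULT in /etc/default/grub, e.g.
#   GRUB_CMDLINE_LINUX_DEFAULT="... iommu=pt"
# then run update-grub and reboot.
```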

The problem: when I run ib_send_bw -d mlx5_1 -x 3 --run_infinitely --use_cuda 0 --use_cuda_dmabuf (on the receiver side only), I get "DMA-BUF is not supported on this GPU".

My stack is very recent and I just reinstalled doca-all 3.2.0 today, so dma-buf support should be in place.
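For reference, these are the DMA-BUF prerequisite checks I know of (a sketch; in particular, as far as I understand, DMA-BUF requires the open NVIDIA kernel modules, and the proprietary flavor is not enough):

```shell
# Kernel side: DMA-BUF must be built in
grep -E 'CONFIG_DMA_SHARED_BUFFER|CONFIG_DMABUF' /boot/config-$(uname -r)

# Driver side: the open NVIDIA kernel modules report a Dual MIT/GPL
# license, the proprietary ones report "NVIDIA"
modinfo nvidia | grep -i license

# CUDA side: the driver API can report DMA-BUF support per device via
# cuDeviceGetAttribute(..., CU_DEVICE_ATTRIBUTE_DMA_BUF_SUPPORTED, dev)
```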

I also tried the legacy nvidia-peermem module, loading it manually.
With that, ib_send_bw works without --use_cuda_dmabuf, but if I inspect the PCIe traffic entering the GPU, it is essentially zero. Likewise, a simple libibverbs program issuing SEND (and WRITE) operations did not actually modify the MR registered on GPU memory, while the same operations against a CPU-side MR work as expected.
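For completeness, this is how I loaded and verified the legacy module (a sketch; the sysfs path is what I would expect from ib_core's peer-memory registry, and the client name may differ between driver versions):

```shell
# Load the legacy peer-memory module and confirm it is registered
sudo modprobe nvidia-peermem
lsmod | grep -i peermem

# ib_core should now list an NVIDIA peer-memory client here
ls /sys/kernel/mm/memory_peers/ 2>/dev/null
```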

I have two questions:

  1. Why no dma-buf support? Everything should be supported in my case: I checked grep DMABUF /boot/config-$(uname -r) and my kernel has it. What other checks can I make to get GDR working with DOCA/perftest?
  2. Even with the legacy nvidia-peermem module, why does the operation complete gracefully while nothing actually gets modified?

Any help is appreciated, thanks a lot!

Following up from last week:

  • I installed the nvidia-open drivers and got dma-buf support; still, the situation did not change much.
  • I downgraded to DOCA 3.1.0 to align with my driver and CUDA versions; doca-samples 3.1.0 now builds.

I am trying to get something simple like doca-samples/doca_gpunetio_dma_memcpy to work, but apparently DMA is not supported for my configuration:

./build/doca_gpunetio_dma_memcpy -g 0000:cb:00.0 -n 0000:ca:00.1
[11:47:30:562939][623788032][DOCA][INF][doca_log.cpp:628] DOCA version 3.1.0105
[11:47:30:563002][623788032][DOCA][INF][gpunetio_dma_memcpy_main.c:164][main] Starting the sample
[11:47:30:872455][623788032][DOCA][INF][gpunetio_dma_memcpy_sample.c:191][init_sample_mem_objs] The CPU source buffer value to be copied to GPU memory: This is a sample piece of text from CPU
[11:47:31:067080][623788032][DOCA][INF][gpunetio_dma_memcpy_sample.c:138][init_sample_mem_objs] The GPU source buffer value to be copied to CPU memory: This is a sample piece of text from GPU
[11:47:31:067114][623788032][DOCA][WRN][doca_mmap.cpp:2015] Mmap 0x60f933cc6e40: Memory range isn’t aligned to 64B - addr=0x60f933cc6a30. For best performance using CPU memory, align address to 64B (cache-line size). For best performance using GPU memory, align address to 64KB (page size)
[11:47:31:069567][623788032][DOCA][ERR][doca_dma.cpp:2122] Failed to create DMA with exception:
[11:47:31:069613][623788032][DOCA][ERR][doca_dma.cpp:2122] DOCA exception [DOCA_ERROR_NOT_SUPPORTED] with message DMA is not supported for the device
[11:47:31:069632][623788032][DOCA][ERR][gpunetio_dma_memcpy_sample.c:452][init_dma_ctx] Failed to initialize dma ctx: Unable to create DMA engine: Operation not supported
[11:47:31:069639][623788032][DOCA][ERR][gpunetio_dma_memcpy_sample.c:657][gpunetio_dma_memcpy] Function init_dma_ctx returned Operation not supported
[11:47:31:069647][623788032][DOCA][INF][gpunetio_dma_memcpy_sample.c:344][gpu_dma_cleanup] Cleanup DMA ctx with GPU data path
[11:47:31:070742][623788032][DOCA][INF][gpunetio_dma_memcpy_sample.c:384][gpu_dma_cleanup] Cleanup DMA ctx with CPU data path
[11:47:31:072543][623788032][DOCA][ERR][gpunetio_dma_memcpy_main.c:187][main] gpunetio_dma_memcpy() encountered an error: Operation not supported
[11:47:31:072554][623788032][DOCA][INF][gpunetio_dma_memcpy_main.c:199][main] Sample finished with errors

  • Now ib_send_bw works with --use_cuda_dmabuf too, but I wanted something where I could actually verify that the GPU buffer gets modified.

    Another error, this time from the gpunetio_simple_receive example:

EAL: No free 2048 kB hugepages reported on node 0
EAL: No free 2048 kB hugepages reported on node 1
EAL: FATAL: Cannot get hugepage information.
EAL: Cannot get hugepage information.
[11:52:14:917766][2719842304][DOCA][ERR][hws_layer.c:188] failed registering dpdk layer - failed to implicitly initiate dpdk. rc=-1
[11:52:14:917773][2719842304][DOCA][ERR][dpdk_engine.c:182] failed to initialize dpdk engine - dpdk layer register failed ret=-1
[11:52:14:917776][2719842304][DOCA][ERR][doca_flow.c:584] failed initializing dpdk engine layer with rc=-1
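The EAL failure at least looks orthogonal: DPDK needs pre-reserved hugepages, which I have not set up yet. I plan to try something along these lines (a sketch, run as root; two NUMA nodes per the EAL output above, and 1024 pages per node is just an example size):

```shell
# Reserve 1024 x 2 MB hugepages on each NUMA node
echo 1024 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
echo 1024 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages

# Make sure a hugetlbfs mount exists for the EAL to use
mkdir -p /dev/hugepages
mountpoint -q /dev/hugepages || mount -t hugetlbfs nodev /dev/hugepages

# Verify
grep -i huge /proc/meminfo
```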

Any advice on where I should take it from here?
Moving to the latest driver + CUDA + DOCA would be an option, but this configuration should already be supported.