GPUDirect RDMA on L40S with ConnectX-4 and Linux kernel 6.8

I can’t get this combination to work. The way I’m testing is with

ib_write_bw --use_cuda 0

It fails with

“Couldn’t allocate MR with error=14” (EFAULT)

when calling ibv_reg_mr to register the GPU memory.

It seems the L40S doesn’t support registering by DMA_BUF handle, because cuDeviceGetAttribute(CU_DEVICE_ATTRIBUTE_DMA_BUF_SUPPORTED) returns false. That leaves the earlier method, which needs the nvidia-peermem driver.
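The fallback described above can be written down as a small decision function. This is only a sketch of the reasoning in this post, not the logic perftest or the drivers actually implement. The enum and function names are hypothetical, and the two inputs stand in for the result of the cuDeviceGetAttribute query and for whether nvidia-peermem is loaded and working.

```c
#include <assert.h>

/* Hypothetical names, for illustration only. */
enum reg_path {
    REG_PATH_DMABUF,   /* register through a DMA_BUF handle (ibv_reg_dmabuf_mr) */
    REG_PATH_PEERMEM,  /* plain ibv_reg_mr, relying on nvidia-peermem */
    REG_PATH_NONE      /* neither path available; registration is expected to fail */
};

/* dmabuf_supported: value reported for CU_DEVICE_ATTRIBUTE_DMA_BUF_SUPPORTED.
 * peermem_loaded:   nonzero if nvidia-peermem is loaded and functional. */
static enum reg_path choose_reg_path(int dmabuf_supported, int peermem_loaded)
{
    if (dmabuf_supported)
        return REG_PATH_DMABUF;
    if (peermem_loaded)
        return REG_PATH_PEERMEM;
    return REG_PATH_NONE;
}

/* Human-readable label for logging. */
static const char *reg_path_name(enum reg_path p)
{
    switch (p) {
    case REG_PATH_DMABUF:  return "dmabuf";
    case REG_PATH_PEERMEM: return "peermem";
    case REG_PATH_NONE:    return "none";
    }
    assert(0 && "unknown reg_path value");
    return "unknown";
}
```

With the values reported in this post (DMA_BUF unsupported, peermem not functional), choose_reg_path(0, 0) yields REG_PATH_NONE, which is consistent with ibv_reg_mr failing.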

But the nvidia-peermem kernel module is broken on Linux >= 6.8, because the kernel removed support for the ib_register_peer_memory_client interface that the module uses and replaced it with registration by DMA_BUF handle. According to “NVIDIA GPUDirect over Infiniband Migration Paths” (Kernel, Ubuntu Community Hub), if the L40S doesn’t support DMA_BUF, it sounds like I need to either:

  • downgrade Linux. I haven’t tried this, and it is undesirable.
  • install the proprietary Mellanox drivers and userspace (MLNX_OFED_LINUX). I tried this, but got the same error.
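Before trying either option, it may help to confirm whether the nvidia-peermem module is loaded at all. The sketch below scans a modules listing in the format of /proc/modules, where each line begins with the module name (written with underscores, so nvidia_peermem) followed by a space. The helper name module_loaded is hypothetical, and a loaded module does not by itself prove that peer memory registration works.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if `module` is listed in the file at `modules_path`,
 * 0 if it is not, and -1 if the file cannot be opened. */
static int module_loaded(const char *modules_path, const char *module)
{
    assert(modules_path != NULL && module != NULL);

    FILE *f = fopen(modules_path, "r");
    if (!f)
        return -1;

    char chunk[512];
    size_t n = strlen(module);
    int at_line_start = 1;  /* only compare at the true start of a line */
    int found = 0;

    while (fgets(chunk, sizeof chunk, f)) {
        /* The name must match exactly and be followed by a space. */
        if (at_line_start && strncmp(chunk, module, n) == 0 && chunk[n] == ' ') {
            found = 1;
            break;
        }
        /* If no newline was read, the next chunk continues this line. */
        at_line_start = (strchr(chunk, '\n') != NULL);
    }

    fclose(f);
    return found;
}
```

On a live system the call would be module_loaded("/proc/modules", "nvidia_peermem").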

How can I get it to work? If that’s too hard, what is the correct code path for debugging? ibv_reg_mr → ? Mellanox extension …

Or should it be using ibv_reg_dmabuf_mr? I don’t understand what the difference is between registering by DMA_BUF handle and the previous method. From reading nvidia-peermem.c, it seems the main thing it does is translate virtual addresses into a list of physical pages for scatter-gather DMA (nvidia_p2p_get_pages). One would expect registering by DMA_BUF handle to do the same fundamental thing, so how can it not be supported? Or is it less a completely new hardware capability and more of an optimization, like bindless textures?

Hello,

It’s very hard to answer your questions, because we don’t know what test conditions you are running under.

My suggestions:

1. Did you follow the RDMA perftest repository below? Which GPUDirect RDMA perftest tool are you using? I use this tool myself. If you visit the site, it lists the steps and prerequisites. I’d strongly suggest that you download the perftest tool below, compile it, and test whether it works.

GitHub - linux-rdma/perftest: Infiniband Verbs Performance Tests

2. Please open a technical case if the above does not work. You will need to talk with an NVIDIA Technical Support Engineer.

/HyungKwang