I can’t get this combination to work. The way I’m testing is with
ib_write_bw --use_cuda 0
It fails with
“Couldn’t allocate MR with error=14” (EFAULT)
when calling ibv_reg_mr to register the GPU memory.
It seems the L40S doesn’t support registering by DMA_BUF handle: cuDeviceGetAttribute(CU_DEVICE_ATTRIBUTE_DMA_BUF_SUPPORTED) returns false, so the test falls back to the earlier method, which needs the nvidia-peermem driver.
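For context, this is essentially the check I mean, as a minimal standalone sketch against the CUDA driver API (device index 0 is my assumption):

```c
/* Minimal sketch: ask whether the GPU can export its allocations as a
 * DMA_BUF file descriptor. Build with: gcc check_dmabuf.c -lcuda */
#include <cuda.h>
#include <stdio.h>

int main(void)
{
    CUdevice dev;
    int supported = 0;

    if (cuInit(0) != CUDA_SUCCESS ||
        cuDeviceGet(&dev, 0) != CUDA_SUCCESS) {   /* device 0 assumed */
        fprintf(stderr, "CUDA init failed\n");
        return 1;
    }

    /* 1 if the device supports sharing memory via the dma_buf mechanism
     * (i.e. cuMemGetHandleForAddressRange can hand out a dma-buf fd),
     * 0 otherwise. On my L40S this comes back 0. */
    cuDeviceGetAttribute(&supported,
                         CU_DEVICE_ATTRIBUTE_DMA_BUF_SUPPORTED, dev);

    printf("DMA_BUF supported: %d\n", supported);
    return 0;
}
```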
But the nvidia-peermem kernel module is broken on Linux >= 6.8, because the kernel removed the ib_register_peer_memory_client interface that nvidia-peermem relies on and replaced it with registration by DMA_BUF handle. According to NVIDIA GPUDirect over Infiniband Migration Paths - Kernel - Ubuntu Community Hub, if the L40S doesn’t support DMA_BUF, it sounds like I either need to:
- downgrade Linux. I haven’t tried this, but it’s undesirable.
- install the proprietary Mellanox drivers + userspace (MLNX_OFED_LINUX). I tried this, but got the same error.
How can I get it to work? If that’s too hard, what is the correct code path for debugging? ibv_reg_mr → ? Mellanox extension …
Or should it be using ibv_reg_dmabuf_mr? I don’t understand what the difference is between registering by DMA_BUF handle and the previous method. From reading nvidia-peermem.c, the main thing it does is translate virtual addresses into a list of physical pages for scatter-gather DMA (nvidia_p2p_get_pages). One would expect registering by DMA_BUF handle to do the same fundamental thing, so how can it not be supported? Or is it not really a new hardware capability, and more of an optimization, like bindless textures?
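To make the distinction I’m asking about concrete, here is my understanding of the two userspace registration paths as a rough sketch (error handling omitted; pd is assumed to come from the usual ibv_alloc_pd setup, and buf from cuMemAlloc):

```c
/* Sketch of the two GPU-memory registration paths as I understand them. */
#include <cuda.h>
#include <infiniband/verbs.h>
#include <stdint.h>

#define ACCESS (IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | \
                IBV_ACCESS_REMOTE_READ)

/* Legacy path: hand ibv_reg_mr() a raw GPU device pointer. Verbs only
 * knows how to pin CPU virtual addresses, so this works only if a
 * peer-memory client (nvidia-peermem) is loaded in the kernel to turn
 * the GPU VA range into DMA addresses for the HCA via
 * nvidia_p2p_get_pages(). Without it the registration fails -- which
 * looks like the EFAULT I'm seeing. */
struct ibv_mr *reg_legacy(struct ibv_pd *pd, CUdeviceptr buf, size_t size)
{
    return ibv_reg_mr(pd, (void *)buf, size, ACCESS);
}

/* DMA_BUF path: ask the CUDA driver to export the allocation as a
 * dma-buf file descriptor, then register that fd. The translation to
 * DMA addresses goes through the kernel's generic dma-buf framework
 * instead of the out-of-tree peer-memory hook. */
struct ibv_mr *reg_dmabuf(struct ibv_pd *pd, CUdeviceptr buf, size_t size)
{
    int fd = -1;

    /* buf and size are assumed to be aligned to the host page size. */
    cuMemGetHandleForAddressRange(&fd, buf, size,
                                  CU_MEM_RANGE_HANDLE_TYPE_DMA_BUF_FD, 0);

    /* offset 0 within the dma-buf; iova = buf so the HCA uses the same
     * addresses the GPU code does. */
    return ibv_reg_dmabuf_mr(pd, 0, size, (uint64_t)buf, fd, ACCESS);
}
```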