ibv_reg_mr returns a "File exists" error when using nv_peer_mem

Hi, everyone

I would like to test GPUDirect RDMA, so I am using a ConnectX-3 and an NVIDIA K80 for the experiment. The environment is listed below:

kernel-4.8.7

cuda-drivers: 384.66

cuda-toolkit: 8.0.61 (bundled driver 375.26)

nv_peer_mem: 1.0.5

I use the perftest tool to run the experiment.

server1: ./ib_write_bw -a -F -n10000 --use_cuda

server2: ./ib_write_bw -a -F -n10000 server1

but server1 outputs the following error:

Couldn't allocate MR
failed to create mr
Failed to create MR

In the end, I printed the error code and errno: the error code is 14 and the errno string is "Bad address" (errno 14 is EFAULT).
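
For reference, a minimal standalone sketch of the registration call and errno check (plain libibverbs on an ordinary host buffer; this is not the perftest code, just an illustration of where the error message comes from):

/*
 * Minimal sketch: register a plain host buffer and print errno if
 * ibv_reg_mr fails. All calls are standard libibverbs; errno 14 is
 * EFAULT ("Bad address"), errno 17 is EEXIST ("File exists").
 * Assumed build line: gcc reg_host.c -o reg_host -libverbs
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) { fprintf(stderr, "ibv_open_device failed\n"); return 1; }
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    if (!pd) { fprintf(stderr, "ibv_alloc_pd failed\n"); return 1; }

    size_t len = 64 * 1024;
    void *buf = malloc(len);

    /* Register the buffer; on failure the kernel's reason is in errno. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr)
        fprintf(stderr, "ibv_reg_mr failed: errno=%d (%s)\n",
                errno, strerror(errno));
    else
        ibv_dereg_mr(mr);

    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}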

Can anyone help me and tell me whether there is something wrong with my setup? Thank you very much.

Hi Haizhu,

Thank you for contacting the Mellanox Community.

For your test, please install the latest Mellanox OFED version and redo the test with ib_send_bw WITHOUT CUDA to check whether RDMA itself is working properly, including the option to specify the device you want to use.

Example without CUDA

Server:

ib_send_bw -d mlx5_0 -i 1 -a -F --report_gbits

Client:

ib_send_bw -d mlx5_0 -i 1 -a -F --report_gbits <server_hostname>

Example with CUDA

Server:

ib_send_bw -d mlx5_0 -i 1 -a -F --report_gbits --use_cuda

Client:

ib_send_bw -d mlx5_0 -i 1 -a -F --report_gbits --use_cuda <server_hostname>

Also, we recommend following the benchmark test from the GPUDirect User Manual (http://www.mellanox.com/related-docs/prod_software/Mellanox_GPUDirect_User_Manual_v1.5.pdf), Section 3.

For further support, we recommend opening a support case with Mellanox Support.

Thanks.

Cheers,

~Martijn

Hi Martijn,

Thank you for your reply about the issue.

I didn't describe the problem clearly; the hardware and software environment is listed below:

  1. Hardware:

ConnectX-3 (Mellanox Technologies MT27500 Family [ConnectX-3])

Nvidia K80

  2. Software:

ubuntu-16.04, kernel 4.8.7

nvidia-driver: nvidia-diag-driver-local-repo-ubuntu1604-384.66_1.0-1_amd64.deb (download: NVIDIA Tesla Driver 384.66 for Ubuntu 16.04, Linux 64-bit)

cuda-toolkit: cuda_8.0.61_375.26_linux.run (downloaded from the NVIDIA Developer CUDA Toolkit page)

MLNX_OFED: MLNX_OFED_SRC-debian-4.1-1.0.2.0.tgz (http://www.mellanox.com/downloads/ofed/MLNX_OFED-4.1-1.0.2.0/MLNX_OFED_SRC-debian-4.1-1.0.2.0.tgz)

nv_peer_mem: 1.0.5

I have two servers, one of which has a K80 GPU. I want to use perftest to test RDMA and GPUDirect. Following https://devblogs.nvidia.com/parallelforall/benchmarking-gpudirect-rdma-on-modern-server-platforms , I installed nv_peer_mem on the server with the K80 GPU.

When I don't use --use_cuda, ib_write_bw works well, but when I use --use_cuda it fails. I printed the error message: ib_write_bw calls ibv_reg_mr, which returns the error "File exists" (EEXIST). If I do not insmod nv_peer_mem, ibv_reg_mr instead fails with "Bad address".
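
For context, my understanding is that the --use_cuda path essentially allocates the buffer with cudaMalloc and passes that device pointer to ibv_reg_mr, which can only be pinned for RDMA when nv_peer_mem is loaded; that would explain the "Bad address" I get without the module. Below is a simplified sketch of that path (not perftest's real code, just an illustration):

/*
 * Simplified sketch of the --use_cuda path: allocate device memory with
 * cudaMalloc and register that pointer with ibv_reg_mr. Pinning GPU
 * pages for RDMA requires the nv_peer_mem kernel module; without it the
 * call is expected to fail with EFAULT ("Bad address").
 * Assumed build line: nvcc reg_gpu.cu -o reg_gpu -libverbs
 */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) { fprintf(stderr, "ibv_open_device failed\n"); return 1; }
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    if (!pd) { fprintf(stderr, "ibv_alloc_pd failed\n"); return 1; }

    size_t len = 64 * 1024;
    void *gpu_buf = NULL;
    if (cudaMalloc(&gpu_buf, len) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    /* Register the GPU buffer; this is the call that needs nv_peer_mem. */
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr)
        fprintf(stderr, "ibv_reg_mr(GPU buffer) failed: errno=%d (%s)\n",
                errno, strerror(errno));
    else {
        printf("GPU buffer registered, lkey=0x%x\n", mr->lkey);
        ibv_dereg_mr(mr);
    }

    cudaFree(gpu_buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}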

For background: I had run the same experiment successfully before, using kernel 4.4.0 and MLNX_OFED 4.0-2.0.0.1, without NVMe over Fabrics installed. Then my workmate installed kernel 4.8.7 and NVMe over Fabrics. Since then, ib_write_bw with --use_cuda has never run correctly.

Is there anything wrong with my experiment or the environment? And another question: can a single ConnectX-3 support NVMe over Fabrics and GPUDirect RDMA at the same time?

Thanks very much again, and I look forward to your reply.

Yours

Haizhu Shao