ibv_reg_mr returns a "File exists" error when using nv_peer_mem

Hi, everyone

I would like to test GPUDirect RDMA, so I am using a ConnectX-3 and an NVIDIA K80 for the experiment. The environment is listed below:

kernel-4.8.7

cuda-drivers: 384.66

cuda-toolkit: 8.0.61 (bundled driver 375.26)

nv_peer_mem: 1.0.5

I use the perftest tool to run the experiment.

server1: ./ib_write_bw -a -F -n10000 --use_cuda

server2: ./ib_write_bw -a -F -n10000 server1

but server1 outputs the following error:

Couldn't allocate MR
failed to create mr
Failed to create MR

In the end, I printed the error code and errno: the error code is 14 and the errno string is "Bad address" (errno 14 is EFAULT).
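
For reference, a minimal standalone sketch of the registration call and errno check (plain libibverbs on an ordinary host buffer; this is not the perftest code, just an illustration of where the error message comes from):

/*
 * Minimal sketch: register a plain host buffer and print errno if
 * ibv_reg_mr fails. All calls are standard libibverbs; errno 14 is
 * EFAULT ("Bad address"), errno 17 is EEXIST ("File exists").
 * Assumed build line: gcc reg_host.c -o reg_host -libverbs
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) { fprintf(stderr, "ibv_open_device failed\n"); return 1; }
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    if (!pd) { fprintf(stderr, "ibv_alloc_pd failed\n"); return 1; }

    size_t len = 64 * 1024;
    void *buf = malloc(len);

    /* Register the buffer; on failure the kernel's reason is in errno. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr)
        fprintf(stderr, "ibv_reg_mr failed: errno=%d (%s)\n",
                errno, strerror(errno));
    else
        ibv_dereg_mr(mr);

    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}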

Can anyone help me and tell me whether there is something wrong with my setup? Thank you very much.

Hi Haizhu,

Thank you for contacting the Mellanox Community.

For your test, please install the latest Mellanox OFED version and redo the test with ib_send_bw WITHOUT CUDA to check whether RDMA itself is working properly, including the option to specify the device you want to use.

Example without CUDA

Server:

ib_send_bw -d mlx5_0 -i 1 -a -F --report_gbits

Client:

ib_send_bw -d mlx5_0 -i 1 -a -F --report_gbits <server_hostname>

Example with CUDA

Server:

ib_send_bw -d mlx5_0 -i 1 -a -F --report_gbits --use_cuda

Client:

ib_send_bw -d mlx5_0 -i 1 -a -F --report_gbits --use_cuda <server_hostname>

Also, we recommend following the benchmark test from the GPUDirect User Manual (http://www.mellanox.com/related-docs/prod_software/Mellanox_GPUDirect_User_Manual_v1.5.pdf), Section 3.

For further support, we recommend opening a support case with Mellanox Support.

Thanks.

Cheers,

~Martijn

Hi Martijn,

Thank you for your reply about the issue.

I didn't describe the problem clearly; the hardware and software environment is listed below:

  1. Hardware:

ConnectX-3 (Mellanox Technologies MT27500 Family [ConnectX-3])

Nvidia K80

  2. Software:

ubuntu-16.04, kernel 4.8.7

nvidia-driver: nvidia-diag-driver-local-repo-ubuntu1604-384.66_1.0-1_amd64.deb (download: NVIDIA Tesla Driver 384.66 for Ubuntu 16.04, Linux 64-bit)

cuda-toolkit: cuda_8.0.61_375.26_linux.run (downloaded from the NVIDIA Developer CUDA Toolkit page)

MLNX_OFED: MLNX_OFED_SRC-debian-4.1-1.0.2.0.tgz (http://www.mellanox.com/downloads/ofed/MLNX_OFED-4.1-1.0.2.0/MLNX_OFED_SRC-debian-4.1-1.0.2.0.tgz)

nv_peer_mem: 1.0.5

I have two servers, one of which has a K80 GPU. I want to use perftest to test RDMA and GPUDirect. Following https://devblogs.nvidia.com/parallelforall/benchmarking-gpudirect-rdma-on-modern-server-platforms , I installed nv_peer_mem on the server with the K80 GPU.

When I don't use --use_cuda, ib_write_bw works well, but when I use --use_cuda it fails. I printed the error message: ib_write_bw calls ibv_reg_mr, which returns the error "File exists" (EEXIST). If I do not insmod nv_peer_mem, ibv_reg_mr instead fails with "Bad address".
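
For context, my understanding is that the --use_cuda path essentially allocates the buffer with cudaMalloc and passes that device pointer to ibv_reg_mr, which can only be pinned for RDMA when nv_peer_mem is loaded; that would explain the "Bad address" I get without the module. Below is a simplified sketch of that path (not perftest's real code, just an illustration):

/*
 * Simplified sketch of the --use_cuda path: allocate device memory with
 * cudaMalloc and register that pointer with ibv_reg_mr. Pinning GPU
 * pages for RDMA requires the nv_peer_mem kernel module; without it the
 * call is expected to fail with EFAULT ("Bad address").
 * Assumed build line: nvcc reg_gpu.cu -o reg_gpu -libverbs
 */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) { fprintf(stderr, "ibv_open_device failed\n"); return 1; }
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    if (!pd) { fprintf(stderr, "ibv_alloc_pd failed\n"); return 1; }

    size_t len = 64 * 1024;
    void *gpu_buf = NULL;
    if (cudaMalloc(&gpu_buf, len) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    /* Register the GPU buffer; this is the call that needs nv_peer_mem. */
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr)
        fprintf(stderr, "ibv_reg_mr(GPU buffer) failed: errno=%d (%s)\n",
                errno, strerror(errno));
    else {
        printf("GPU buffer registered, lkey=0x%x\n", mr->lkey);
        ibv_dereg_mr(mr);
    }

    cudaFree(gpu_buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}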

For background: I had run the same experiment successfully before, using kernel 4.4.0 and MLNX_OFED 4.0-2.0.0.1, without NVMe over Fabrics installed. Then my workmate installed kernel 4.8.7 and NVMe over Fabrics. Since then, ib_write_bw with --use_cuda has never run correctly.

Is there anything wrong with my experiment or the environment? And another question: can a single ConnectX-3 support NVMe over Fabrics and GPUDirect RDMA at the same time?

Thanks very much again, and I look forward to your reply.

Yours

Haizhu Shao