Installation HPC-X question - hello_c is getting an error

hpcx_v2.4.0-gcc-MLNX_OFED_LINUX-4.6.1.0.1.1-redhat7.6-x86_64

Following the README steps, the 5th step:

“mpirun --allow-run-as-root -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_c”

returns an error with the MPI_Init call (once per device, single ConnectX-5 VPI card):

ib_md.c:1773 UCX ERROR ibv_query_device(mlx5_0) returned 38: Protocol not supported

ib_md.c:1773 UCX ERROR ibv_query_device(mlx5_1) returned 38: Protocol not supported
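For anyone comparing a longer log, here is a minimal sketch that reduces error lines of the form quoted above to a device name and return code. The function name `failed_devices` is hypothetical, and the pattern assumes the message format is exactly as shown in this post.

```shell
#!/usr/bin/env bash
# Sketch: print "<device> <return code>" for each UCX ibv_query_device failure.
# failed_devices is a hypothetical helper name; it reads log text on stdin.
failed_devices() {
  sed -n 's/.*UCX ERROR ibv_query_device(\([^)]*\)) returned \([0-9][0-9]*\):.*/\1 \2/p'
}

# Demo on the two lines quoted in this thread.
failed_devices <<'EOF'
ib_md.c:1773 UCX ERROR ibv_query_device(mlx5_0) returned 38: Protocol not supported
ib_md.c:1773 UCX ERROR ibv_query_device(mlx5_1) returned 38: Protocol not supported
EOF
```

In practice the mpirun output could be piped through it, for example by appending `2>&1 | failed_devices | sort -u` to the command from the README step.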

mlx5_0 is setup as EN, mlx5_1 is IPoIB.

I’m guessing that means that UCX or a dependency isn’t installed correctly?

What am I missing? The steps in the README.txt are pretty straightforward.

I am using CUDA 10.1, not 9.2, which is what ucx_info -v appears to expect. Could that be the cause?
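A rough sketch of the version comparison being described here. The helper names `cuda_major` and `check_cuda` are hypothetical, the version strings are typed in by hand (for example as read off `nvcc --version`), and the expected major version of 9 is an assumption taken from what is reported in this thread, not from any official compatibility table.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the major part of a "10.1"-style version string.
cuda_major() {
  printf '%s\n' "${1%%.*}"
}

# Hypothetical helper: compare an installed CUDA version against an expected
# major version. Returns 0 on match, 1 on mismatch.
check_cuda() {
  installed="$1"
  expected_major="$2"
  if [ "$(cuda_major "$installed")" = "$expected_major" ]; then
    echo "ok: CUDA $installed matches expected major version $expected_major"
  else
    echo "mismatch: CUDA $installed installed, major version $expected_major expected"
    return 1
  fi
}

# The situation in this post: CUDA 10.1 installed, a 9.x build expected (assumption).
check_cuda 10.1 9 || true
```

A mismatch here only flags a candidate cause; it does not by itself show that the ibv_query_device errors come from the CUDA version.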

Thanks

Hi Andrew,

What is the P/N of the card? Is this a dual-port card? If so, per the VPI port configuration rules, if Port 1 is configured as Ethernet, Port 2 must also be Ethernet. If Port 1 is IB, Port 2 can be IB, Ethernet, or auto-sensing.
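The port rule above can be written down as a small check. This is only a sketch of the rule as stated in this reply: the function name `vpi_ports_ok` and the labels `eth`, `ib`, and `auto` are made up for illustration, and any combination the reply does not mention is reported as unknown instead of being guessed.

```shell
#!/usr/bin/env bash
# Hypothetical helper encoding the rule as quoted in this reply:
#   Port 1 Ethernet -> Port 2 must be Ethernet
#   Port 1 IB       -> Port 2 may be IB, Ethernet, or auto-sensing
# Returns 0 = allowed, 1 = not allowed, 2 = not covered by the quoted rule.
vpi_ports_ok() {
  case "$1:$2" in
    eth:eth)              return 0 ;;
    eth:*)                return 1 ;;
    ib:ib|ib:eth|ib:auto) return 0 ;;
    *)                    return 2 ;;
  esac
}

# The setup from the original post: Port 1 as EN, Port 2 as IB (IPoIB).
if vpi_ports_ok eth ib; then
  echo "allowed"
else
  echo "not allowed by the quoted rule"
fi
```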

Thanks,

Namrata.

Hi Andrew,

In addition, support for CUDA 10 will be available in the upcoming HPC-X release, version 2.5.

The target release date for this version is September 2019.

Thanks,

Namrata.

I can set it up to use just one port and I see the same error, just on that one port. I’ve tried it as both IB and EN.

MCX556A-ECAT Rev A8

ConnectX-5 VPI cards

For clarification, see:

2.3: https://docs.mellanox.com/pages/viewpage.action?pageId=12006291

2.4: https://docs.mellanox.com/pages/viewpage.action?pageId=12006257

Under “HPC-X Environments”:

CUDA 9 (2.4 only indicates “9”, 2.3 says 9.1)

NCCL (2.4 complains if it’s not there but doesn’t list it, 2.3 says it’s required)

GDRCopy (2.4 doesn’t list it, 2.3 says it’s required).

With CUDA 9.2, NVIDIA driver 396.37, and NCCL 2.4.8 installed, plus possibly the additions of GDRCopy 2.0-2, cppunit 1.12.1, and subunit 1.2, the simple hello samples all run without errors or warnings.

Thanks again for your help!

Thanks, Andrew, for the update. I’m glad your issue is resolved. I will close this case.

Thanks,

Namrata.