HCOLL missing device on machine with ConnectX-6 Lx (RoCE)

Hello. When I run the mpirun shipped with HPC-X v2.18, the following messages appear:

[LOG_CAT_ML] You must specify a valid HCA device by setting:

-x HCOLL_MAIN_IB=<dev_name:port> or -x UCX_NET_DEVICES=<dev_name:port>.

If no device was specified for HCOLL (or the calling library), automatic device detection will be run.

In case of unfounded HCA device please contact your system administrator.

As I understand it, HCOLL is used with InfiniBand devices, but what about RoCE devices like the ConnectX-6 Lx? Is this expected behavior, or does it indicate a driver issue on my machine?

Thanks!

Hello @captainspork,

Thank you for posting your query on our community. HCOLL is used to offload MPI collective operations to the HCA. It uses RDMA at the verbs level, so it is relevant for both traffic types, InfiniBand and RoCE.

The message you observed is not a real error; you just need to specify the correct device, as described on the HCOLL page of NVIDIA Docs.
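To find the value to use for <dev_name:port>, a minimal sketch along these lines may help. It assumes the standard Linux sysfs layout under /sys/class/infiniband, where each port exposes link_layer and state files, and it only builds the HCOLL_MAIN_IB argument quoted in the message. The helper names and the device name mlx5_0 are hypothetical examples, not part of HPC-X.

```python
from pathlib import Path


def _read(path):
    """Return the stripped contents of a sysfs file, or "" if it is absent."""
    return path.read_text().strip() if path.is_file() else ""


def list_rdma_ports(sysfs_root="/sys/class/infiniband"):
    """Return (device, port, link_layer, state) tuples found under sysfs_root.

    Assumes the usual layout <root>/<device>/ports/<port>/{link_layer,state}.
    On a RoCE adapter link_layer is expected to read "Ethernet".
    """
    root = Path(sysfs_root)
    found = []
    if not root.is_dir():
        return found
    for dev in sorted(root.iterdir()):
        ports_dir = dev / "ports"
        if not ports_dir.is_dir():
            continue
        for port in sorted(ports_dir.iterdir()):
            found.append(
                (dev.name, port.name, _read(port / "link_layer"), _read(port / "state"))
            )
    return found


def hcoll_mpirun_args(device, port):
    """Build the '-x HCOLL_MAIN_IB=<dev_name:port>' pair from the log message."""
    return ["-x", f"HCOLL_MAIN_IB={device}:{port}"]


if __name__ == "__main__":
    for dev, port, layer, state in list_rdma_ports():
        args = " ".join(hcoll_mpirun_args(dev, port))
        print(f"{dev}:{port}  link_layer={layer}  state={state}  ->  mpirun {args} ...")
```

If the script prints, for example, a port on mlx5_0 with link_layer=Ethernet and an active state, that device and port would be the natural candidate to pass to mpirun.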

Hope this answers your question.

Thanks,
Bhargavi

Hi Bhargavi,

Thanks for this reply. I noticed that on some of our InfiniBand systems the HCOLL device autodetection seems to work and we don't get this message, whereas on the RoCE systems it appears necessary to specify the device manually. I just wanted to make sure that the failed autodetection isn't indicative of a driver issue.

Best regards