I'm trying to diagnose a problem with my MPI setup. Whenever I run an Open MPI job or try to use ucx_perftest, I get an error from ibv_query_device(mlx5_0). I tried to find line 1773 in ib_md.c, but that file only has 1600 or so lines. What else can I do?

[ggeorge@node036 ~]$ ucx_perftest localhost -t put_lat -x rc_mlx5
[1584639837.711307] [node036:92378:0] perftest.c:1358 UCX WARN CPU affinity is not set (bound to 48 cpus). Performance may be impacted.
+--------------+-----------------------------+---------------------+-----------------------+
|              |       latency (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall |  average |  overall |   average |   overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
+------------------------------------------------------------------------------------------+
| API:          transport layer                                                            |
| Test:         put latency                                                                |
| Data layout:  short                                                                      |
| Message size: 8                                                                          |
+------------------------------------------------------------------------------------------+
[1584639837.742549] [node036:92374:0] ib_md.c:1773 UCX ERROR ibv_query_device(mlx5_0) returned 38: Protocol not supported
[1584639837.743237] [node036:92378:0] ib_md.c:1773 UCX ERROR ibv_query_device(mlx5_0) returned 38: Protocol not supported
[1584639837.743463] [node036:92374:0] perftest.c:1439 UCX ERROR Failed to run test: Input/output error
[1584639837.744143] [node036:92378:0] perftest.c:1439 UCX ERROR Failed to run test: Input/output error
[1]+ Exit 255 ucx_perftest -c 0


You might start with some basic checks:

  • ofed_info -s, to be sure you are using Mellanox OFED

  • ibv_devinfo -v, to verify that the device is present

  • run_perftest_loopback, to run two verbs processes on the same node

  • perftest (ib_read_bw, ib_write_bw, ib_send_bw with different QP types), to test RDMA connectivity between the nodes

  • Check the latest HPC-X Toolkit version, v2.6.0, which includes the latest UCX library and tools.
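
The single-node checks above could be strung together roughly like this. It is only a sketch: run_check is a hypothetical helper, not part of any package, and the core numbers 0 and 1 and the choice of ib_write_bw for the loopback test are assumptions to adjust for your node. Tools that are not installed are skipped rather than treated as errors.

```shell
#!/bin/sh
# Sketch of the single-node checks from the list above.

# run_check (hypothetical helper): print a header, run the command if its
# binary is in PATH, and report a non-zero exit status without aborting.
run_check() {
    tool=$1
    echo "== $* =="
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "SKIP: $tool not found in PATH"
        return 0
    fi
    "$@" || echo "FAIL: '$*' exited with status $?"
}

# Is Mellanox OFED installed, and which version?
run_check ofed_info -s

# Is mlx5_0 visible to the verbs layer?
run_check ibv_devinfo -v

# Loopback verbs test on one node: server pinned to core 0, client to core 1
# (assumed core numbers; pick two cores that exist on your machine).
run_check run_perftest_loopback 0 1 ib_write_bw -d mlx5_0
```

If ibv_devinfo already fails here, the problem is below UCX and Open MPI. The node-to-node perftest runs need a second host, so they are left out of this sketch.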

Update, Mar 26

If I’m not mistaken, you are using hpcx-v2.4. In that version, ib_md.c has 1800+ lines. The line where it fails calls ibv_query_device, which returns a failure. I would expect that either ibv_devinfo fails too or there is an incompatibility between components in your system. Try reinstalling Mellanox OFED and use the latest HPC-X toolkit that corresponds to your OS and Mellanox OFED version.
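
To see which versions are actually in play before reinstalling, something like the following may help. It is a rough sketch: first_line is a hypothetical helper, and the library name pattern passed to grep is an assumption about how the UCX and verbs libraries are named on your system. Knowing the UCX version also tells you which source tree to open when looking up ib_md.c:1773.

```shell
#!/bin/sh
# Sketch: collect the versions of the components that have to match.

# first_line (hypothetical helper): print "label: <first output line>", or
# "label: not installed" when the tool is missing from PATH.
first_line() {
    label=$1
    shift
    if command -v "$1" >/dev/null 2>&1; then
        printf '%s: %s\n' "$label" "$("$@" 2>&1 | head -n 1)"
    else
        printf '%s: not installed\n' "$label"
    fi
}

first_line "kernel"   uname -r
first_line "OFED"     ofed_info -s
first_line "UCX"      ucx_info -v
first_line "Open MPI" ompi_info --version

# Show which UCX and verbs libraries ucx_perftest resolves to at run time.
# A UCX build from HPC-X picking up an unrelated libibverbs would be one
# example of the kind of mismatch described above.
if command -v ucx_perftest >/dev/null 2>&1; then
    ldd "$(command -v ucx_perftest)" | grep -E 'libuc[a-z]\.so|libibverbs|libmlx5' \
        || echo "no UCX or verbs libraries found in ldd output"
fi
```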