Configure ConnectX-7 400GE not to change GID index

We have a GPU cluster with ConnectX-7 400G adapters. After training an LLM for some time (about 12 hours), the ConnectX-7 GID index changed from 3 to 4 or 8 (per the show_gids command), which caused training to stop. We have to restart the node to recover GID index 3.

The server OS is Ubuntu 22.04. The CX-7 adapter driver is MLNX_OFED_LINUX-24.01-

Any suggestions for configuring the CX-7 so the GID index does not change?

RoCE and GID index 3

In the most common configuration, each interface has only a single IPv4 address defined.
In that case, the RoCEv2 GID that points to the IPv4 address sits at GID index 3.
As a result, many clusters hardcode GID index 3 in their environment,
and many guidance documents tell you to specify GID index 3 on the command line.
But this cannot be relied on, for two main reasons:

  1. Some deployments have IPv6, and some have several IP addresses per interface.
    In those cases, GID index 3 might not be the right one.

  2. The GID table can change on the fly, and the indexes change with it.
    The sections below explain this situation in more depth.
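Both reasons point to the same fix: resolve the index from the address at runtime instead of assuming it. A minimal sketch of that lookup, using a made-up in-memory table (the addresses and index values are purely illustrative):

```python
# Minimal sketch: resolve a GID index from the (address, RoCE version) pair
# instead of hardcoding index 3. The table below is illustrative only.

def find_gid_index(gid_table, addr, roce_version):
    """Return the GID index for addr/roce_version, or None if absent."""
    for index, (entry_addr, entry_version) in gid_table.items():
        if entry_addr == addr and entry_version == roce_version:
            return index
    return None

# A table where an extra IPv6 address shifts the IPv4 RoCEv2 entry
# away from the commonly assumed index 3.
table = {
    0: ("fe80::1", "RoCEv1"),      # link-local
    1: ("fe80::1", "RoCEv2"),
    2: ("2001:db8::7", "RoCEv1"),  # global IPv6
    3: ("2001:db8::7", "RoCEv2"),
    4: ("10.0.0.1", "RoCEv1"),     # IPv4
    5: ("10.0.0.1", "RoCEv2"),
}

print(find_gid_index(table, "10.0.0.1", "RoCEv2"))  # 5, not the assumed 3
```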

GID index/table changes on the fly - how can it happen?

There are two main types of GID table changes:

  1. A hole in the GID index values
    On a specific device, the GID index for the same IP address changes due to an event such as a link toggle.
    The result is a hole in the GID table: the entry is no longer at index 3.

from: mlx5_0, IP-X RoCEv2 => GID index=3
to: mlx5_0, IP-X RoCEv2 => GID index=4

GID table before the change for mlx5_0:

GID index | IP addr    | RoCE version
0         | Link Local | RoCEv1
1         | Link Local | RoCEv2
2         | IP-X       | RoCEv1
3         | IP-X       | RoCEv2

Some event happens; GID table after the change for mlx5_0:

GID index | IP addr    | RoCE version
0         | Link Local | RoCEv1
1         | Link Local | RoCEv2
2         | IP-X       | RoCEv1
4         | IP-X       | RoCEv2

This can only happen if something is holding a lock on a GID index related to the interface being toggled.
A short list of possibilities, though others exist as well:

  • An RDMA application is running RDMA traffic
  • A cat of a /sys/… path related to the GID table
  • The exact moment an RDMA_CM message is received

If a link toggle happens in parallel with one of these events, the hole can appear.

This is a race condition, so any change in timing affects the chance of hitting the issue.
It cannot happen if, after a reboot, nothing is running, no lock is held, and the link is simply toggled.
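To see why a hardcoded index breaks while an address-keyed lookup survives, here is a small sketch of the before/after tables above (`index_of` and the entry values are illustrative, not a real API):

```python
# Sketch: after a link toggle, IP-X's RoCEv2 entry moved from index 3
# to index 4, leaving a hole at index 3. Values are illustrative only.

def index_of(gid_table, addr, roce_version):
    """Find the current index for an (address, RoCE version) pair."""
    for index, entry in gid_table.items():
        if entry == (addr, roce_version):
            return index
    return None

before = {0: ("LL", "RoCEv1"), 1: ("LL", "RoCEv2"),
          2: ("IP-X", "RoCEv1"), 3: ("IP-X", "RoCEv2")}
after = {0: ("LL", "RoCEv1"), 1: ("LL", "RoCEv2"),
         2: ("IP-X", "RoCEv1"), 4: ("IP-X", "RoCEv2")}  # hole at 3

# A hardcoded index 3 now points at nothing...
print(3 in after)                         # False
# ...while lookup by address still finds the entry.
print(index_of(after, "IP-X", "RoCEv2"))  # 4
```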

  2. No hole, just a change in the GID index mapping
    On a specific device, a given IP is usually mapped to a specific GID index, but sometimes
    all IPs on that device are rearranged in a different order in the GID table and thus get different GID indexes.
    For example: mlx5_0

GID index | IP addr    | RoCE version
0         | Link Local | RoCEv1
1         | Link Local | RoCEv2
2         | IP-X       | RoCEv1
3         | IP-X       | RoCEv2
4         | IP-Y       | RoCEv1
5         | IP-Y       | RoCEv2

Some event happens, and the new table looks like this:

GID index | IP addr    | RoCE version
2         | Link Local | RoCEv1
5         | Link Local | RoCEv2
0         | IP-X       | RoCEv1
1         | IP-X       | RoCEv2
3         | IP-Y       | RoCEv1
4         | IP-Y       | RoCEv2

There is no guarantee of the order of IP addresses at device bring-up.
GID indexes are assigned in the order of IP up events.
So if a link is toggled, the table is rescanned, or even the host is rebooted, we cannot rely on the order.
A driver restart might not always help here.
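Because of this, a robust tool rescans the kernel's GID table at connection-setup time rather than caching an index. The sketch below reads the sysfs layout Linux exposes for RoCE GIDs (/sys/class/infiniband/&lt;dev&gt;/ports/&lt;port&gt;/gids plus gid_attrs/types); the root parameter only exists so the function can be pointed at a test directory tree, and the helper itself is an illustration, not part of any official tool:

```python
import os

ZERO_GID = "0000:0000:0000:0000:0000:0000:0000:0000"

def read_gid_table(device, port=1, root="/sys/class/infiniband"):
    """Scan the kernel GID table for one device port.

    Returns {index: (gid, type_string)} for populated entries only.
    Holes (zero GIDs or unreadable attribute files) are skipped, so
    callers must never assume the indexes are contiguous or stable.
    """
    base = os.path.join(root, device, "ports", str(port))
    gids_dir = os.path.join(base, "gids")
    table = {}
    for name in sorted(os.listdir(gids_dir), key=int):
        try:
            with open(os.path.join(gids_dir, name)) as f:
                gid = f.read().strip()
            with open(os.path.join(base, "gid_attrs", "types", name)) as f:
                gid_type = f.read().strip()
        except OSError:
            continue  # unset entries can be unreadable: a hole
        if gid == ZERO_GID:
            continue  # all-zero GID: another form of hole
        table[int(name)] = (gid, gid_type)
    return table
```

Rescanning on every connection setup, instead of once at startup, is what protects against the on-the-fly changes described above.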

What is the solution for an unreliable GID table?

Many applications are moving to dynamic GID index identification.
Perftest (ib_write_bw, and so on) and UCX can already identify the GID index automatically.
They still allow specifying a GID index for special use cases, such as several IP addresses per interface,
but if there is no special use case, it is better not to specify this parameter at all.

NCCL has a similar capability, starting with NCCL v2.21.
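With NCCL v2.21 or later, the practical change is simply to stop exporting NCCL_IB_GID_INDEX and let NCCL discover the index itself. A small sketch of scrubbing a job environment (the nccl_env helper is hypothetical, not an NCCL API):

```python
import os

def nccl_env(base_env=None):
    """Build a job environment that lets NCCL auto-detect the GID index.

    Dropping NCCL_IB_GID_INDEX avoids baking in an index (such as 3)
    that the kernel may reassign. NCCL_IB_ROCE_VERSION_NUM already
    defaults to 2, so it is not set here either.
    """
    env = dict(base_env if base_env is not None else os.environ)
    env.pop("NCCL_IB_GID_INDEX", None)  # remove any hardcoded index
    return env

env = nccl_env({"NCCL_IB_GID_INDEX": "3", "NCCL_DEBUG": "WARN"})
print("NCCL_IB_GID_INDEX" in env)  # False
```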

Follow-up questions:
I looked into the NCCL 2.21 documentation; it seems that using NCCL_IB_ROCE_VERSION_NUM=2 instead of NCCL_IB_GID_INDEX=3 would help in this scenario.
NCCL_IB_ROCE_VERSION_NUM=2 is not needed, as it is already the default.
With this setting, even if the GID index changes during training, would NCCL handle it so that training on the pytorch/nccl/rdma stack continues without stopping? I am not sure I understand what NCCL suggests here.
That's incorrect. We detect the GID automatically at startup, but if the GID changes, communication will still fail. The difference from previous versions, though, is that the next job will start and work fine even if the GID has changed.
