Configure CX-7 400GE not to change gid index

44264334 · June 27, 2024, 7:17am

We have a GPU cluster with ConnectX-7 400G adapters. After training an LLM for some time (12 hours), the ConnectX-7 GID index changed from 3 to 4/8 (as shown by the show_gids command), which caused the training to stop. We have to restart the node to restore the GID index to 3.

The server OS is Ubuntu 22.04. The CX-7 adapter driver is MLNX_OFED_LINUX-24.01-0.3.3.1-ubuntu22.04-x86_64.tgz.

Any suggestions on how to configure the CX-7 so that the GID index does not change?

michaelbe · June 27, 2024, 9:18am

RoCE and GID index 3

In the most common setup, each interface has only a single IPv4 address defined.
In that case, the RoCEv2 GID that carries the IPv4 address sits at GID index 3.
As a result, many clusters hardcode GID index 3 in their environment,
and many guidance documents say to specify GID index 3 on the command line.
But this cannot be relied on, for two main reasons:

1. Some deployments use IPv6, and some have several IP addresses on a single interface.
   In those cases GID index 3 might not be the right one.
2. The GID index table can change on the fly, and the indexes change with it.
   The sections below explain this situation in more detail.
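Instead of hardcoding index 3, the index can be looked up when a job starts. The following is a minimal sketch, not an official tool. It assumes the usual Linux sysfs layout (/sys/class/infiniband/<device>/ports/<port>/gids and gid_attrs/types) and the type strings "IB/RoCE v1" and "RoCE v2". The function names read_gid_table and find_rocev2_ipv4_index are hypothetical helpers made up for this example.

```python
from pathlib import Path

# An IPv4 address stored in a GID appears as an IPv4-mapped IPv6 address.
IPV4_MAPPED_PREFIX = "0000:0000:0000:0000:0000:ffff:"
EMPTY_GID = "0000:0000:0000:0000:0000:0000:0000:0000"


def read_gid_table(device, port=1, sysfs_root="/sys/class/infiniband"):
    """Return [(index, gid, roce_type)] for the populated GID entries.

    Assumption: the sysfs layout is <root>/<device>/ports/<port>/gids/<n>
    and <root>/<device>/ports/<port>/gid_attrs/types/<n>.
    """
    port_dir = Path(sysfs_root) / device / "ports" / str(port)
    gid_files = sorted((port_dir / "gids").iterdir(), key=lambda p: int(p.name))
    entries = []
    for gid_file in gid_files:
        try:
            gid = gid_file.read_text().strip()
            type_file = port_dir / "gid_attrs" / "types" / gid_file.name
            roce_type = type_file.read_text().strip()
        except OSError:
            continue  # unpopulated slots may fail to read; skip them
        if gid == EMPTY_GID:
            continue
        entries.append((int(gid_file.name), gid, roce_type))
    return entries


def find_rocev2_ipv4_index(entries):
    """Return the first index whose entry is RoCE v2 with an IPv4 address."""
    for index, gid, roce_type in entries:
        if roce_type == "RoCE v2" and gid.startswith(IPV4_MAPPED_PREFIX):
            return index
    return None


if __name__ == "__main__":
    device = "mlx5_0"  # device name taken from the examples in this thread
    if (Path("/sys/class/infiniband") / device).exists():
        print(find_rocev2_ipv4_index(read_gid_table(device)))
```

With several IP addresses on one interface, the first match may not be the address you want, so a real deployment would also need to filter on the address itself.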

GID index/table change on the fly - how can it happen?

There are 2 main types of GID table changes:

Hole in GID index values
On a specific device, the GID index for the same IP address changes because of an event such as a link toggle.
This leaves a hole in the GID index table: index 3 is no longer populated.

Example:
from: mlx5_0, IP-X RoCEv2 => GID index=3
to: mlx5_0, IP-X RoCEv2 => GID index=4

GID table before the change for mlx5_0:

GID index   IP addr      RoCE version
0           Link Local   RoCEv1
1           Link Local   RoCEv2
2           IP-X         RoCEv1
3           IP-X         RoCEv2

Some event happens. GID table after the change for mlx5_0:

GID index   IP addr      RoCE version
0           Link Local   RoCEv1
1           Link Local   RoCEv2
2           IP-X         RoCEv1
4           IP-X         RoCEv2

This can happen only if something holds a lock on a GID index that belongs to the interface being toggled.
A short list of possibilities, although there may be others:

An RDMA application is running RDMA traffic
cat /sysfs/…. on a file that relates to the GID table
An RDMA_CM message is received at that specific moment
A link toggle happens in parallel with any of the above events

This is a race condition, so any change in timing affects the chance of hitting the issue.
It cannot happen if, after a reboot, nothing is running, no lock is held, and a link is simply toggled.
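To notice this kind of change, two snapshots of the GID table can be compared. The sketch below is only an illustration: it models each table as a dictionary keyed by (IP addr, RoCE version), using the placeholder values from the example above, and moved_entries is a hypothetical helper name.

```python
def moved_entries(before, after):
    """Return {(address, roce_version): (old_index, new_index)} for entries
    whose GID index differs between the two snapshots.

    new_index is None when the entry is absent from the later snapshot.
    """
    moved = {}
    for key, old_index in before.items():
        new_index = after.get(key)
        if new_index != old_index:
            moved[key] = (old_index, new_index)
    return moved


# Snapshots taken from the mlx5_0 example tables above.
before = {
    ("Link Local", "RoCEv1"): 0,
    ("Link Local", "RoCEv2"): 1,
    ("IP-X", "RoCEv1"): 2,
    ("IP-X", "RoCEv2"): 3,
}
after = {
    ("Link Local", "RoCEv1"): 0,
    ("Link Local", "RoCEv2"): 1,
    ("IP-X", "RoCEv1"): 2,
    ("IP-X", "RoCEv2"): 4,
}

print(moved_entries(before, after))  # {('IP-X', 'RoCEv2'): (3, 4)}
```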

No hole, just GID index mapping changes
On a specific device, a given IP is usually mapped to a given GID index, but sometimes
all IPs on that device are rearranged in a different order in the GID index table and therefore get different GID indexes.
For example, on mlx5_0:

GID index   IP addr      RoCE version
0           Link Local   RoCEv1
1           Link Local   RoCEv2
2           IP-X         RoCEv1
3           IP-X         RoCEv2
4           IP-Y         RoCEv1
5           IP-Y         RoCEv2

Some event happens, and the new table looks like this:

GID index   IP addr      RoCE version
2           Link Local   RoCEv1
5           Link Local   RoCEv2
0           IP-X         RoCEv1
1           IP-X         RoCEv2
3           IP-Y         RoCEv1
4           IP-Y         RoCEv2

There is no guarantee about the order in which IP addresses come up on a device.
GID indexes are assigned in the order of IP up events.
So if a link is toggled, the table is rescanned, or even the host is rebooted, we cannot rely on the order.
A driver restart might not always help here.
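The two kinds of change can be told apart from a single snapshot: a hole leaves an unused index below the highest one in use, while a remapping leaves the indexes contiguous but pointing at different addresses. The sketch below uses the placeholder tables from the examples above; missing_indexes and index_for are hypothetical helper names, and the dictionary model is an assumption made for illustration.

```python
def missing_indexes(table):
    """Return the unused indexes below the highest index in use."""
    used = set(table.values())
    if not used:
        return []
    return sorted(set(range(max(used) + 1)) - used)


def index_for(table, address, roce_version):
    """Look an index up by what it points to, not by a fixed number."""
    return table.get((address, roce_version))


# Table after the "hole" event from the first example.
hole_table = {
    ("Link Local", "RoCEv1"): 0,
    ("Link Local", "RoCEv2"): 1,
    ("IP-X", "RoCEv1"): 2,
    ("IP-X", "RoCEv2"): 4,
}

# Table after the "remapping" event from the second example.
remapped_table = {
    ("Link Local", "RoCEv1"): 2,
    ("Link Local", "RoCEv2"): 5,
    ("IP-X", "RoCEv1"): 0,
    ("IP-X", "RoCEv2"): 1,
    ("IP-Y", "RoCEv1"): 3,
    ("IP-Y", "RoCEv2"): 4,
}

print(missing_indexes(hole_table))                    # [3]
print(missing_indexes(remapped_table))                # []
print(index_for(remapped_table, "IP-X", "RoCEv2"))    # 1
```

In both cases a lookup by address and RoCE version still finds the right entry, which is why resolving the index at startup is more robust than a fixed value.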

What is the solution for an unreliable GID table?

Many applications are moving to dynamic GID index identification.
Perftest (ib_write_bw, …) and UCX can already identify the GID index automatically.
They still allow the GID index to be specified for special use cases, such as several IP addresses per interface,
but without such a use case it is better not to specify this parameter at all.

NCCL has a similar capability, starting with NCCL v2.21.

michaelbe · June 30, 2024, 1:20pm

Follow-up questions:

Question:
Looking at the NCCL 2.21 documentation, it seems that using NCCL_IB_ROCE_VERSION_NUM=2 in place of NCCL_IB_GID_INDEX=3 would help in this scenario.
Answer:
NCCL_IB_ROCE_VERSION_NUM=2 is not needed, as it is already the default.

Question:
With this setting, even if the GID index changes during training, NCCL would handle it, and training on the PyTorch/NCCL/RDMA stack would continue without stopping. I am not sure I understand what NCCL suggests, would you suggest
Answer:
That is incorrect. We detect the GID automatically at startup, but if the GID changes, communication would still fail. The difference from previous versions, though, is that the next job would start and work fine even if the GID has changed.
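In practice this means launching jobs without a pinned NCCL_IB_GID_INDEX, so that NCCL v2.21 or later can detect the GID at startup. A minimal sketch of building such an environment for a job launcher follows. The helper name env_without_pinned_gid is hypothetical, and the launcher command in the trailing comment is only a placeholder.

```python
import os


def env_without_pinned_gid(base_env=None):
    """Return a copy of the environment with NCCL_IB_GID_INDEX removed.

    Other variables, including NCCL_IB_ROCE_VERSION_NUM if it is set,
    are left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.pop("NCCL_IB_GID_INDEX", None)
    return env


example = env_without_pinned_gid({"NCCL_IB_GID_INDEX": "3", "NCCL_DEBUG": "INFO"})
print(example)  # {'NCCL_DEBUG': 'INFO'}

# Hypothetical usage with a placeholder training command:
# import subprocess
# subprocess.run(["python", "train.py"], env=env_without_pinned_gid(), check=True)
```

This does not keep a running job alive if the GID changes mid-run; as stated above, communication still fails and only the next job picks up the new index.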

