RoCE and GID index 3
In the most common setup, each interface has only a single IPv4 address defined.
In that case, the RoCEv2 GID that carries the IPv4 address is found at GID index 3.
As a result, many clusters hardcode GID index 3 in their environment,
and many guidance documents say to specify GID index 3 on the command line.
But this cannot be relied on for two main reasons:
- Some deployments use IPv6, and some have several IP addresses on a single interface.
In that case, GID index 3 might not be the right one.
- The GID table can change on the fly, and the indexes change with it.
The next section explains this situation in more detail.
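To make the first reason concrete, here is a minimal sketch of a simplified model of GID table population. It assumes each address takes two consecutive slots (RoCEv1, then RoCEv2) in the order the addresses come up, which matches the example tables in this article but is not driver code. The names build_gid_table and index_of, and the address labels, are hypothetical.

```python
from typing import List, Optional, Tuple

GidEntry = Tuple[int, str, str]  # (GID index, address label, RoCE version)


def build_gid_table(addresses_in_up_order: List[str]) -> List[GidEntry]:
    """Toy model (assumption): every address gets a RoCEv1 slot and then a
    RoCEv2 slot, in the order the addresses came up."""
    table: List[GidEntry] = []
    for addr in addresses_in_up_order:
        for version in ("RoCEv1", "RoCEv2"):
            table.append((len(table), addr, version))
    return table


def index_of(table: List[GidEntry], addr: str, version: str) -> Optional[int]:
    """Look an entry up by content instead of assuming a fixed index."""
    for index, entry_addr, entry_version in table:
        if entry_addr == addr and entry_version == version:
            return index
    return None


# Single IPv4 address per interface: the IPv4 RoCEv2 GID lands on index 3.
single = build_gid_table(["link-local", "ipv4-x"])
print(index_of(single, "ipv4-x", "RoCEv2"))  # 3

# An extra (hypothetical) global IPv6 address comes up first: index 3 is
# now the IPv6 RoCEv2 GID, and the IPv4 one has moved to index 5.
dual = build_gid_table(["link-local", "ipv6-global", "ipv4-x"])
print(index_of(dual, "ipv4-x", "RoCEv2"))  # 5
```

In this model a hardcoded index 3 keeps working only while the interface carries exactly one address besides the link-local one.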
GID index/table change on the fly - how can it happen?
There are 2 main types of GID table changes:
- Hole in GID index values
On a specific device, the GID index of the same IP address changes because of an event such as a link toggle.
The old index is left unused, so there is a hole in the GID table and the address is no longer at index 3.
Example:
from: mlx5_0, IP-X RoCEv2 => GID index=3
to: mlx5_0, IP-X RoCEv2 => GID index=4
GID table before the change for mlx5_0:
GID index   IP addr      RoCE version
0           Link Local   RoCEv1
1           Link Local   RoCEv2
2           IP-X         RoCEv1
3           IP-X         RoCEv2
Some event happens. GID table after the change for mlx5_0:
GID index   IP addr      RoCE version
0           Link Local   RoCEv1
1           Link Local   RoCEv2
2           IP-X         RoCEv1
4           IP-X         RoCEv2
This can happen only if something holds a lock on a GID index that belongs to the interface being toggled.
A few examples of what can hold such a lock (others are possible):
- An RDMA application is running RDMA traffic
- cat /sysfs/…. that is related to the GID table
- An RDMA_CM message is received at that specific moment
The hole appears when a link toggle happens in parallel with one of these events.
This is a race condition, so any change in timing affects the chance of hitting the issue.
It cannot happen if, after a reboot, nothing is running, no lock is held, and the link is simply toggled.
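The race described above can be illustrated with a toy model. This is only a sketch under stated assumptions: a slot that is removed while something holds it cannot be reused, so the address lands on the next free slot when it is added back. The class ToyGidTable and the helpers toggle_link and make_table are hypothetical and do not reproduce the kernel's actual data structures.

```python
from typing import List, Optional, Set, Tuple

GidKey = Tuple[str, str]  # (address label, RoCE version)


class ToyGidTable:
    """Toy fixed-size GID table in which held slots cannot be reused."""

    def __init__(self, size: int = 8) -> None:
        self.slots: List[Optional[GidKey]] = [None] * size
        self.held: Set[int] = set()     # indexes some user holds a lock on
        self.pending: Set[int] = set()  # removed while held, not reusable

    def add(self, key: GidKey) -> int:
        # Take the first slot that is empty and not blocked by a holder.
        for index, slot in enumerate(self.slots):
            if slot is None and index not in self.pending:
                self.slots[index] = key
                return index
        raise RuntimeError("toy GID table is full")

    def remove(self, key: GidKey) -> None:
        index = self.slots.index(key)
        self.slots[index] = None
        if index in self.held:
            self.pending.add(index)

    def index_of(self, key: GidKey) -> Optional[int]:
        return self.slots.index(key) if key in self.slots else None


def toggle_link(table: ToyGidTable, addr: str) -> None:
    """Assumption: a link toggle removes the address's GIDs and re-adds them."""
    versions = ("RoCEv1", "RoCEv2")
    for version in versions:
        table.remove((addr, version))
    for version in versions:
        table.add((addr, version))


def make_table() -> ToyGidTable:
    table = ToyGidTable()
    for addr in ("Link Local", "IP-X"):
        for version in ("RoCEv1", "RoCEv2"):
            table.add((addr, version))
    return table


# Nothing holds a lock: the toggle puts IP-X RoCEv2 back on index 3.
quiet = make_table()
toggle_link(quiet, "IP-X")
print(quiet.index_of(("IP-X", "RoCEv2")))  # 3

# Index 3 is held during the toggle: IP-X RoCEv2 moves to index 4.
busy = make_table()
busy.held.add(3)
toggle_link(busy, "IP-X")
print(busy.index_of(("IP-X", "RoCEv2")))  # 4
```

The second run ends with the same layout as the "after" table for mlx5_0 shown earlier: IP-X RoCEv1 stays on index 2, and index 3 is left empty.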
- No hole, just GID index mapping changes
On a specific device, a given IP is usually mapped to the same GID index value, but sometimes
all IPs on that device are rearranged in the GID table and therefore get different GID indexes.
For example: mlx5_0
GID index   IP addr      RoCE version
0           Link Local   RoCEv1
1           Link Local   RoCEv2
2           IP-X         RoCEv1
3           IP-X         RoCEv2
4           IP-Y         RoCEv1
5           IP-Y         RoCEv2
Some event happens, and the new table looks like this:
GID index   IP addr      RoCE version
2           Link Local   RoCEv1
5           Link Local   RoCEv2
0           IP-X         RoCEv1
1           IP-X         RoCEv2
3           IP-Y         RoCEv1
4           IP-Y         RoCEv2
There is no guaranteed order in which IP addresses come up on a device.
GID indexes are assigned in the order of IP up events.
So if the link is toggled, the table is rescanned, or even the host is rebooted, the order cannot be relied on.
A driver restart might not always help here.
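As a small illustration, the two mlx5_0 tables above can be compared entry by entry. The sketch below copies both tables as literal data and reports which entries changed index. The helper name moved_entries is hypothetical.

```python
from typing import Dict, Tuple

GidKey = Tuple[str, str]      # (address label, RoCE version)
GidTable = Dict[int, GidKey]  # GID index -> entry

# The two tables from the example above, copied as data.
before: GidTable = {
    0: ("Link Local", "RoCEv1"),
    1: ("Link Local", "RoCEv2"),
    2: ("IP-X", "RoCEv1"),
    3: ("IP-X", "RoCEv2"),
    4: ("IP-Y", "RoCEv1"),
    5: ("IP-Y", "RoCEv2"),
}
after: GidTable = {
    2: ("Link Local", "RoCEv1"),
    5: ("Link Local", "RoCEv2"),
    0: ("IP-X", "RoCEv1"),
    1: ("IP-X", "RoCEv2"),
    3: ("IP-Y", "RoCEv1"),
    4: ("IP-Y", "RoCEv2"),
}


def moved_entries(old: GidTable, new: GidTable) -> Dict[GidKey, Tuple[int, int]]:
    """Return {entry: (old_index, new_index)} for entries whose index changed."""
    old_by_key = {key: index for index, key in old.items()}
    new_by_key = {key: index for index, key in new.items()}
    return {
        key: (old_by_key[key], new_by_key[key])
        for key in old_by_key
        if key in new_by_key and old_by_key[key] != new_by_key[key]
    }


moved = moved_entries(before, after)
print(moved[("IP-X", "RoCEv2")])  # (3, 1)
print(len(moved))                 # 6
```

Every entry changed index in this example, and index 3 now points at IP-Y RoCEv1, so a hardcoded index 3 would select a different address and a different RoCE version.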
What is the solution for an unreliable GID table?
Many applications are moving to dynamic GID index identification.
Perftest (ib_write_bw, …) and UCX can already identify the GID index automatically.
They still allow specifying the GID index for special use cases, such as several IP addresses per interface,
but outside such cases it is better not to specify this parameter at all.
NCCL has a similar capability starting with NCCL v2.21.
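For scripts that still need an explicit index, the lookup can be done by content at run time instead of hardcoding a number. The following is a minimal sketch, not how Perftest, UCX, or NCCL implement it. It assumes the Linux sysfs layout /sys/class/infiniband/<device>/ports/<port>/gids/<index> and gid_attrs/types/<index>, with the type reported as the string "RoCE v2". The function names read_gid_table and find_rocev2_index are hypothetical. Note that reading sysfs files related to the GID table is itself listed above as something that can hold a lock, so treat this as a diagnostic aid.

```python
import ipaddress
from pathlib import Path
from typing import List, Optional, Tuple

SYSFS_IB_ROOT = Path("/sys/class/infiniband")  # assumed sysfs location


def read_gid_table(device: str, port: int = 1,
                   root: Path = SYSFS_IB_ROOT) -> List[Tuple[int, str, str]]:
    """Return (index, gid, type) for every populated GID entry of a port."""
    port_dir = root / device / "ports" / str(port)
    gids_dir = port_dir / "gids"
    if not gids_dir.is_dir():
        return []
    entries = []
    for gid_file in sorted(gids_dir.iterdir(), key=lambda p: int(p.name)):
        try:
            gid = gid_file.read_text().strip()
            if set(gid) <= set("0:"):
                continue  # empty or all-zero GID: unpopulated slot
            type_file = port_dir / "gid_attrs" / "types" / gid_file.name
            gid_type = type_file.read_text().strip()
        except OSError:
            continue  # entry not readable (for example, slot not in use)
        entries.append((int(gid_file.name), gid, gid_type))
    return entries


def find_rocev2_index(entries: List[Tuple[int, str, str]],
                      ip: str) -> Optional[int]:
    """Find the index of the RoCE v2 GID that carries the given IP address."""
    target = ipaddress.ip_address(ip)
    for index, gid, gid_type in entries:
        if gid_type != "RoCE v2":
            continue
        gid_addr = ipaddress.IPv6Address(gid)
        mapped = gid_addr.ipv4_mapped  # IPv4 GIDs appear as ::ffff:a.b.c.d
        candidate = mapped if mapped is not None else gid_addr
        if candidate == target:
            return index
    return None
```

For example, find_rocev2_index(read_gid_table("mlx5_0"), "192.168.1.1") returns the current index for that address (the address is a placeholder), or None if no RoCE v2 GID carries it. The result is valid only at the moment of the scan, for the reasons given in the previous section.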