P4d.24xlarge instances not reporting Fabric status

Good day.

I am trying to get NVIDIA/CUDA running on p4d.24xlarge AWS instances with CentOS 7.9.

Installed the local repo RPM cuda-repo-rhel7-12-3-local-12.3.0_545.23.06-1.x86_64.rpm.

Installed cuda-toolkit-12-3 and nvidia-driver-latest-dkms.
Later, while checking and running into problems with gdrcopy, I noticed it was expected to use the open-source driver flavor, so I switched to it following the steps in cuda-installation-guide-linux, section switching-between-driver-module-flavors.
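
For reference, the install itself was just the standard local-repo flow, roughly as below (file and package names as above; the yum steps are the usual ones from the install guide), followed by the flavor switch described in that section:

rpm -i cuda-repo-rhel7-12-3-local-12.3.0_545.23.06-1.x86_64.rpm
yum clean all
yum -y install cuda-toolkit-12-3 nvidia-driver-latest-dkms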

Additionally installed cuda-drivers-fabricmanager and nvidia-persistenced-latest-dkms, and enabled both the nvidia-persistenced and fabricmanager services at boot.
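
Concretely, that was (unit names as shipped by the packages, as far as I can tell):

systemctl enable --now nvidia-persistenced
systemctl enable --now nvidia-fabricmanager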

As per the AWS documentation, gdrcopy was compiled and installed from v2.3.tar.gz.
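
As a quick sanity check on the gdrcopy side, the module, device node, and the bundled bandwidth test can be exercised like this (sketch; gdrcopy_copybw is the test binary name in the 2.x tree, from memory):

lsmod | grep gdrdrv
ls -l /dev/gdrdrv
gdrcopy_copybw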

And lastly, the AWS EFA drivers were installed using the efa_installer.sh script from aws-efa-installer-1.30.0.tar.gz.

I believe EFA is operational, as fi_pingpong works.
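
The fi_pingpong check was the usual two-node run (sketch; <server-private-ip> is a placeholder):

fi_pingpong -p efa                       # on the server instance
fi_pingpong -p efa <server-private-ip>   # on the client instance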

The system is indeed loading the fabricmanager service and shows these modules:

lsmod | egrep "nvidia|efa"

efa_nv_peermem 13472 0
nvidia_drm 72787 0
nvidia_modeset 1482612 1 nvidia_drm
drm_kms_helper 186531 1 nvidia_drm
nvidia_uvm 1296925 0
nvidia 7611026 161 efa_nv_peermem,gdrdrv,nvidia_modeset,nvidia_uvm
efa 88964 0
ib_uverbs 102208 2 efa,rdma_ucm
drm 456166 4 drm_kms_helper,nvidia,nvidia_drm
ib_core 255353 10 efa,rdma_cm,ib_cm,iw_cm,rpcrdma,ib_iser,ib_srpt,ib_uverbs,rdma_ucm,ib_isert
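
To double-check which driver flavor is actually loaded after the switch, this can be inspected as well (if I remember correctly, the open kernel modules report a Dual MIT/GPL license, while the proprietary ones report NVIDIA):

cat /proc/driver/nvidia/version
modinfo -F license nvidia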

The NVIDIA devices present on the host are:

# ls -l /dev/nvi*
crw-rw-rw- 1 root root 195,   0 Feb 23 17:15 /dev/nvidia0
crw-rw-rw- 1 root root 195,   1 Feb 23 17:15 /dev/nvidia1
crw-rw-rw- 1 root root 195,   2 Feb 23 17:15 /dev/nvidia2
crw-rw-rw- 1 root root 195,   3 Feb 23 17:15 /dev/nvidia3
crw-rw-rw- 1 root root 195,   4 Feb 23 17:15 /dev/nvidia4
crw-rw-rw- 1 root root 195,   5 Feb 23 17:15 /dev/nvidia5
crw-rw-rw- 1 root root 195,   6 Feb 23 17:15 /dev/nvidia6
crw-rw-rw- 1 root root 195,   7 Feb 23 17:15 /dev/nvidia7
crw-rw-rw- 1 root root 195, 255 Feb 23 17:15 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 254 Feb 23 17:15 /dev/nvidia-modeset
crw-rw-rw- 1 root root 242,   0 Feb 23 17:15 /dev/nvidia-nvlink
crw-rw-rw- 1 root root 241,   0 Feb 23 17:15 /dev/nvidia-nvswitch0
crw-rw-rw- 1 root root 241,   1 Feb 23 17:15 /dev/nvidia-nvswitch1
crw-rw-rw- 1 root root 241,   2 Feb 23 17:15 /dev/nvidia-nvswitch2
crw-rw-rw- 1 root root 241,   3 Feb 23 17:15 /dev/nvidia-nvswitch3
crw-rw-rw- 1 root root 241,   4 Feb 23 17:15 /dev/nvidia-nvswitch4
crw-rw-rw- 1 root root 241,   5 Feb 23 17:15 /dev/nvidia-nvswitch5
crw-rw-rw- 1 root root 241, 255 Feb 23 17:15 /dev/nvidia-nvswitchctl
crw-rw-rw- 1 root root 240,   0 Feb 23 17:15 /dev/nvidia-uvm
crw-rw-rw- 1 root root 240,   1 Feb 23 17:15 /dev/nvidia-uvm-tools
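
For completeness, the GPUs can also be listed straight from the driver (output omitted here):

nvidia-smi -L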

However, when running nvidia-smi -q -i 0 | grep -i -A 2 Fabric, both State and Status report N/A, while the Fabric Manager documentation states it should report Success:

nvidia-smi -q -i 0 | grep -i -A 2 Fabric

Fabric
    State                             : N/A
    Status                            : N/A
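
On the Fabric Manager side, the service and its log can be inspected like this (sketch; /var/log/fabricmanager.log is, as far as I know, the default LOG_FILE_NAME in fabricmanager.cfg):

systemctl status nvidia-fabricmanager
journalctl -u nvidia-fabricmanager -b
tail /var/log/fabricmanager.log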

fi_info reports the EFA devices:

fi_info -p efa -t FI_EP_RDM

provider: efa
    fabric: efa
    domain: rdmap16s27-rdm
    version: 119.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: efa
    domain: rdmap32s27-rdm
    version: 119.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: efa
    domain: rdmap144s27-rdm
    version: 119.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: efa
    domain: rdmap160s27-rdm
    version: 119.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA

Interestingly enough, examples show the fabric value including some sort of IPv6 / MAC-derived address, like:

fabric: EFA-fe80::94:3dff:fe89:1b70

while here it just prints efa.
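
In case the short fabric name matters, the full provider attributes can be dumped with the verbose flag (sketch):

fi_info -p efa -t FI_EP_RDM -v
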
Running nvswitch-audit prints the following:

# nvswitch-audit -f

GPU Reachability Matrix
GPU Physical Id   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
		  1 X 12 12 12 12 12 12 12 -1 -1 -1 -1 -1 -1 -1 -1
		  2 12 X 12 12 12 12 12 12 -1 -1 -1 -1 -1 -1 -1 -1
		  3 12 12 X 12 12 12 12 12 -1 -1 -1 -1 -1 -1 -1 -1
		  4 12 12 12 X 12 12 12 12 -1 -1 -1 -1 -1 -1 -1 -1
		  5 12 12 12 12 X 12 12 12 -1 -1 -1 -1 -1 -1 -1 -1
		  6 12 12 12 12 12 X 12 12 -1 -1 -1 -1 -1 -1 -1 -1
		  7 12 12 12 12 12 12 X 12 -1 -1 -1 -1 -1 -1 -1 -1
		  8 12 12 12 12 12 12 12 X -1 -1 -1 -1 -1 -1 -1 -1
		  9  0  0  0  0  0  0  0  0 X  0  0  0  0  0  0  0
		 10  0  0  0  0  0  0  0  0  0 X  0  0  0  0  0  0
		 11  0  0  0  0  0  0  0  0  0  0 X  0  0  0  0  0
		 12  0  0  0  0  0  0  0  0  0  0  0 X  0  0  0  0
		 13  0  0  0  0  0  0  0  0  0  0  0  0 X  0  0  0
		 14  0  0  0  0  0  0  0  0  0  0  0  0  0 X  0  0
		 15  0  0  0  0  0  0  0  0  0  0  0  0  0  0 X  0
		 16  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 X

Note: Number of NVLinks displayed corresponds to the maximum number of GPU NVLinks
      that NVSwitches are programmed to handle. Number of GPU NVLinks might be different
      than displayed in the above matrix

But I am not sure that is enough to confirm the NVLinks are operating properly, or whether there is anything else to test.
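
Beyond nvswitch-audit, the only other checks I am aware of are the per-GPU NVLink status and the topology matrix (sketch; on this platform every GPU pair should show 12 links, matching the matrix above):

nvidia-smi nvlink --status -i 0
nvidia-smi topo -m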

Is there anything I am missing?