P4d.24xlarge instances not reporting Fabric status

Good day.

I am trying to get NVIDIA/CUDA running on p4d.24xlarge AWS instances with CentOS 7.9.

Installed the local repo RPM cuda-repo-rhel7-12-3-local-12.3.0_545.23.06-1.x86_64.rpm.

Installed cuda-toolkit-12-3 and nvidia-driver-latest-dkms.
Later, while checking and running into problems with gdrcopy, I noticed it was expected to use the open-source driver flavor, so I switched to it following the steps in cuda-installation-guide-linux, section switching-between-driver-module-flavors.
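
For reference, the install itself was just the standard local-repo flow, roughly as below (file and package names as above; the yum steps are the usual ones from the install guide), followed by the flavor switch described in that section:

rpm -i cuda-repo-rhel7-12-3-local-12.3.0_545.23.06-1.x86_64.rpm
yum clean all
yum -y install cuda-toolkit-12-3 nvidia-driver-latest-dkms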

Additionally installed cuda-drivers-fabricmanager and nvidia-persistenced-latest-dkms, and enabled both the nvidia-persistenced and fabricmanager services at boot.
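
Concretely, that was (unit names as shipped by the packages, as far as I can tell):

systemctl enable --now nvidia-persistenced
systemctl enable --now nvidia-fabricmanager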

As per the AWS documentation, gdrcopy was compiled and installed from v2.3.tar.gz.
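
As a quick sanity check on the gdrcopy side, the module, device node, and the bundled bandwidth test can be exercised like this (sketch; gdrcopy_copybw is the test binary name in the 2.x tree, from memory):

lsmod | grep gdrdrv
ls -l /dev/gdrdrv
gdrcopy_copybw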

And lastly, the AWS EFA drivers were installed using the efa_installer.sh script from aws-efa-installer-1.30.0.tar.gz.

I believe EFA is operational, as fi_pingpong works.
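
The fi_pingpong check was the usual two-node run (sketch; <server-private-ip> is a placeholder):

fi_pingpong -p efa                       # on the server instance
fi_pingpong -p efa <server-private-ip>   # on the client instance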

The system is indeed loading the fabricmanager service and shows these modules:

lsmod | egrep "nvidia|efa"

efa_nv_peermem 13472 0
nvidia_drm 72787 0
nvidia_modeset 1482612 1 nvidia_drm
drm_kms_helper 186531 1 nvidia_drm
nvidia_uvm 1296925 0
nvidia 7611026 161 efa_nv_peermem,gdrdrv,nvidia_modeset,nvidia_uvm
efa 88964 0
ib_uverbs 102208 2 efa,rdma_ucm
drm 456166 4 drm_kms_helper,nvidia,nvidia_drm
ib_core 255353 10 efa,rdma_cm,ib_cm,iw_cm,rpcrdma,ib_iser,ib_srpt,ib_uverbs,rdma_ucm,ib_isert
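
To double-check which driver flavor is actually loaded after the switch, this can be inspected as well (if I remember correctly, the open kernel modules report a Dual MIT/GPL license, while the proprietary ones report NVIDIA):

cat /proc/driver/nvidia/version
modinfo -F license nvidia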

The NVIDIA devices present on the host are:

# ls -l /dev/nvi*
crw-rw-rw- 1 root root 195,   0 Feb 23 17:15 /dev/nvidia0
crw-rw-rw- 1 root root 195,   1 Feb 23 17:15 /dev/nvidia1
crw-rw-rw- 1 root root 195,   2 Feb 23 17:15 /dev/nvidia2
crw-rw-rw- 1 root root 195,   3 Feb 23 17:15 /dev/nvidia3
crw-rw-rw- 1 root root 195,   4 Feb 23 17:15 /dev/nvidia4
crw-rw-rw- 1 root root 195,   5 Feb 23 17:15 /dev/nvidia5
crw-rw-rw- 1 root root 195,   6 Feb 23 17:15 /dev/nvidia6
crw-rw-rw- 1 root root 195,   7 Feb 23 17:15 /dev/nvidia7
crw-rw-rw- 1 root root 195, 255 Feb 23 17:15 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 254 Feb 23 17:15 /dev/nvidia-modeset
crw-rw-rw- 1 root root 242,   0 Feb 23 17:15 /dev/nvidia-nvlink
crw-rw-rw- 1 root root 241,   0 Feb 23 17:15 /dev/nvidia-nvswitch0
crw-rw-rw- 1 root root 241,   1 Feb 23 17:15 /dev/nvidia-nvswitch1
crw-rw-rw- 1 root root 241,   2 Feb 23 17:15 /dev/nvidia-nvswitch2
crw-rw-rw- 1 root root 241,   3 Feb 23 17:15 /dev/nvidia-nvswitch3
crw-rw-rw- 1 root root 241,   4 Feb 23 17:15 /dev/nvidia-nvswitch4
crw-rw-rw- 1 root root 241,   5 Feb 23 17:15 /dev/nvidia-nvswitch5
crw-rw-rw- 1 root root 241, 255 Feb 23 17:15 /dev/nvidia-nvswitchctl
crw-rw-rw- 1 root root 240,   0 Feb 23 17:15 /dev/nvidia-uvm
crw-rw-rw- 1 root root 240,   1 Feb 23 17:15 /dev/nvidia-uvm-tools
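
For completeness, the GPUs can also be listed straight from the driver (output omitted here):

nvidia-smi -L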

However, when running nvidia-smi -q -i 0 | grep -i -A 2 Fabric, both State and Status report N/A, while the Fabric Manager documentation states it should report Success:

nvidia-smi -q -i 0 | grep -i -A 2 Fabric

Fabric
    State                             : N/A
    Status                            : N/A
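
On the Fabric Manager side, the service and its log can be inspected like this (sketch; /var/log/fabricmanager.log is, as far as I know, the default LOG_FILE_NAME in fabricmanager.cfg):

systemctl status nvidia-fabricmanager
journalctl -u nvidia-fabricmanager -b
tail /var/log/fabricmanager.log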

fi_info reports the EFA devices:

fi_info -p efa -t FI_EP_RDM

provider: efa
    fabric: efa
    domain: rdmap16s27-rdm
    version: 119.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: efa
    domain: rdmap32s27-rdm
    version: 119.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: efa
    domain: rdmap144s27-rdm
    version: 119.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: efa
    domain: rdmap160s27-rdm
    version: 119.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA

Interestingly enough, examples show the fabric value including some sort of IPv6 / MAC-derived address, like:

fabric: EFA-fe80::94:3dff:fe89:1b70

while here it just prints efa.
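
In case the short fabric name matters, the full provider attributes can be dumped with the verbose flag (sketch):

fi_info -p efa -t FI_EP_RDM -v
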
Running nvswitch-audit prints the following:

# nvswitch-audit -f

GPU Reachability Matrix
GPU Physical Id   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
		  1 X 12 12 12 12 12 12 12 -1 -1 -1 -1 -1 -1 -1 -1
		  2 12 X 12 12 12 12 12 12 -1 -1 -1 -1 -1 -1 -1 -1
		  3 12 12 X 12 12 12 12 12 -1 -1 -1 -1 -1 -1 -1 -1
		  4 12 12 12 X 12 12 12 12 -1 -1 -1 -1 -1 -1 -1 -1
		  5 12 12 12 12 X 12 12 12 -1 -1 -1 -1 -1 -1 -1 -1
		  6 12 12 12 12 12 X 12 12 -1 -1 -1 -1 -1 -1 -1 -1
		  7 12 12 12 12 12 12 X 12 -1 -1 -1 -1 -1 -1 -1 -1
		  8 12 12 12 12 12 12 12 X -1 -1 -1 -1 -1 -1 -1 -1
		  9  0  0  0  0  0  0  0  0 X  0  0  0  0  0  0  0
		 10  0  0  0  0  0  0  0  0  0 X  0  0  0  0  0  0
		 11  0  0  0  0  0  0  0  0  0  0 X  0  0  0  0  0
		 12  0  0  0  0  0  0  0  0  0  0  0 X  0  0  0  0
		 13  0  0  0  0  0  0  0  0  0  0  0  0 X  0  0  0
		 14  0  0  0  0  0  0  0  0  0  0  0  0  0 X  0  0
		 15  0  0  0  0  0  0  0  0  0  0  0  0  0  0 X  0
		 16  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 X

Note: Number of NVLinks displayed corresponds to the maximum number of GPU NVLinks
      that NVSwitches are programmed to handle. Number of GPU NVLinks might be different
      than displayed in the above matrix

But I am not sure that is enough to confirm the NVLinks are operating properly, or whether there is anything else to test.
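
Beyond nvswitch-audit, the only other checks I am aware of are the per-GPU NVLink status and the topology matrix (sketch; on this platform every GPU pair should show 12 links, matching the matrix above):

nvidia-smi nvlink --status -i 0
nvidia-smi topo -m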

Is there anything I am missing?