System validation script (cuBB_system_checks.py) doesn't show NIC and PTP details

Hi,

I ran into an issue while setting up Aerial CUDA-Accelerated RAN on a Supermicro MGX with GH200 and BF3: the system validation script doesn't show NIC and PTP details. Also, the GPUDirect topology shows NODE, not "SYS". Could you please help me resolve this?

First, I ran the cuBB Docker container with "sudo docker run --restart unless-stopped -dP --gpus all --network host --shm-size=4096m --privileged -it --device=/dev/gdrdrv:/dev/gdrdrv -v /lib/modules:/lib/modules -v /dev/hugepages:/dev/hugepages -v ~/share:/opt/cuBB/share --userns=host --ipc=host -v /var/log/aerial:/var/log/aerial --name cuBB nvcr.io/qhrjhjrvlsbu/aerial-cuda-accelerated-ran:24-3-cubb"
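Since the NIC details in the report depend on the container actually seeing the RDMA devices, one quick sanity check from inside the container is whether any mlx5 devices are visible at all. This is only a sketch: /sys/class/infiniband is the standard sysfs path for RDMA devices, but whether it is populated depends on the run flags and host driver state.

```shell
#!/bin/sh
# Hedged diagnostic: list RDMA devices visible inside the container.
# /sys/class/infiniband is populated only when the mlx5 devices are
# exposed to the container (e.g. via --privileged and a working host
# driver stack); an empty or missing directory points at exposure,
# not at the validation script itself.
if [ -d /sys/class/infiniband ] && [ -n "$(ls -A /sys/class/infiniband 2>/dev/null)" ]; then
    echo "RDMA devices visible:"
    ls /sys/class/infiniband
else
    echo "no RDMA devices visible in this environment"
fi
```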

Then I followed the Aerial System Scripts page of the Aerial CUDA-Accelerated RAN documentation.

The "sudo -E python3 ./cuBB_system_checks.py" command shows the following output. It seems the mlx tools are not present in the container; for instance, there is no mlxfwmanager.

$ sudo -E python3 ./cuBB_system_checks.py
-----General--------------------------------------
Hostname : fullpt-2
IP address : 10.0.125.31
Linux distro : “Ubuntu 22.04.5 LTS”
Linux kernel version : 6.5.0-1019-nvidia
-----Kernel Command Line--------------------------
Audit subsystem : audit=0
Clock source : N/A
HugePage count : hugepages=48
HugePage size : hugepagesz=512M
CPU idle time management : idle=poll
Max Intel C-state : N/A
Intel IOMMU : N/A
IOMMU : N/A
Isolated CPUs : N/A
Corrected errors : N/A
Adaptive-tick CPUs : nohz_full=4-64
Soft-lockup detector disable : nosoftlockup
Max processor C-state : processor.max_cstate=0
RCU callback polling : rcu_nocb_poll
No-RCU-callback CPUs : rcu_nocbs=4-64
TSC stability checks : tsc=reliable
-----CPU------------------------------------------
CPU cores : 72
Thread(s) per CPU core : 1
CPU MHz: : N/A
CPU sockets : 1
-----Environment variables------------------------
CUDA_DEVICE_MAX_CONNECTIONS : 8
cuBB_SDK : /opt/nvidia/cuBB
-----Memory---------------------------------------
HugePage count : 48
Free HugePages : 48
HugePage size : 524288 kB
Shared memory size : 240G
-----Nvidia GPUs----------------------------------
GPU driver version : 560.35.03
CUDA version : 12.6
GPU0
GPU product name : NVIDIA GH200 480GB
GPU persistence mode : Enabled
Current GPU temperature : 31 C
GPU clock frequency : 1980 MHz
Max GPU clock frequency : 1980 MHz
GPU PCIe bus id : 00000009:01:00.0
-----GPUDirect topology---------------------------
GPU0 NIC0 NIC1 NIC2 NIC3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NODE NODE NODE NODE 0-71 0 1
NIC0 NODE X PIX NODE NODE
NIC1 NODE PIX X NODE NODE
NIC2 NODE NODE NODE X PIX
NIC3 NODE NODE NODE PIX X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3

-----Mellanox NICs--------------------------------
-----Mellanox NIC Interfaces----------------------
-----Linux PTP------------------------------------

-----Software Packages----------------------------
cmake /usr/local/bin : 3.25.1
docker : N/A
gcc /usr/local/gnu/bin : 12.3.0
git-lfs /usr/bin : 3.0.2
MOFED : N/A
meson /usr/bin : 0.61.2
ninja /usr/bin : 1.10.2
ptp4l : N/A
-----Loaded Kernel Modules------------------------
GDRCopy : gdrdrv
GPUDirect RDMA : N/A
Nvidia : nvidia
-----Non-persistent settings----------------------
VM swappiness : vm.swappiness = 60
VM zone reclaim mode : vm.zone_reclaim_mode = 0
-----Docker images--------------------------------
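The empty Mellanox NIC and Linux PTP sections above (and "MOFED : N/A", "ptp4l : N/A") are consistent with the underlying tools simply being absent from the image. A quick way to confirm which ones are missing is a sketch like the following; the tool names are the usual MFT / rdma-core / linuxptp binaries, not necessarily the exact set the script probes:

```shell
#!/bin/sh
# Hedged check: report which NIC/PTP tools are installed in this
# environment. mlxfwmanager/mlxconfig come from MFT, ibv_devinfo from
# rdma-core, ptp4l from linuxptp; "missing" entries would explain the
# corresponding empty sections in the validation report.
for tool in mlxfwmanager mlxconfig ibv_devinfo ptp4l; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: found ($(command -v "$tool"))"
    else
        echo "$tool: missing"
    fi
done
```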

Regards,
Seung

Hi @seung.jang1 ,

Please see Issue in installation of Dell R760 & A100X - #7; you can find the answer there.

Thank you.
