Issue in installation of Dell R760 & A100X

Hello. I am facing some issues while running Aerial CUDA-Accelerated RAN Rel.24-1.

My hardware setup consists of only one Dell R760 + A100X.

  • First, is it possible to implement Aerial cuBB using only one Dell R760 + A100X? Or is it necessary to have separate DU and RU components? I plan to purchase another Dell R760 for RU in the future, but I would like to run the simulation with my current setup.
  • When installing ptp4l and phc2sys with the current hardware setup, since the Dell R760, acting as the DU, is running independently, I set slaveOnly=0. However, I am not sure if this is the correct approach.
  • After proceeding with the setup as described, the output of the cuBB_system_checks script inside the container does not show values for Mellanox NICs, Mellanox NIC Interfaces, and Linux PTP. Additionally, Docker is not running inside the container. Is this an installation issue, or is it due to the hardware configuration?
  • If I do not resolve the cuBB_system_checks script output issue, can I still proceed with cuBB and run a simulation to some extent?

Hi @jkk83 ,

  1. Our recommended setup is to run the DU and RU on separate servers. You should also be able to run the DU and the RU on the same server with a loop-back, by installing two NIC cards and configuring separate resources for the DU and the RU. However, this requires some effort, and we have not verified this configuration ourselves.

Please note that you can still run cuPHY-only test cases, which do not require testMAC and RU_emulator.

  2. Please review the installation steps here. The setting should be slaveOnly = 1 for the DU side.

  3. The issue with the system_checks script should not be a blocking issue; it is there to help you debug configuration issues. If you follow the installation steps from the link I shared above, your configuration should be ok.
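For item 2, after editing the PTP config on the DU, a quick check like the following can confirm the values before restarting ptp4l. This is only a sketch: the default /etc/ptp.conf path and the expected values are taken from the config posted later in this thread; adjust them for your own setup.

```shell
# Sketch: verify the DU-side linuxptp settings discussed above.
# Config path and expected values are assumptions based on this thread.
check_ptp_conf() {
  local conf="${1:-/etc/ptp.conf}"
  grep -qE '^slaveOnly[[:space:]]+1[[:space:]]*$' "$conf" \
    || { echo "slaveOnly should be 1 on the DU side"; return 1; }
  grep -qE '^network_transport[[:space:]]+L2[[:space:]]*$' "$conf" \
    || { echo "network_transport should be L2"; return 1; }
  echo "ptp config looks ok"
}
```

After a change, restart the ptp4l and phc2sys services (e.g. via systemctl) so the new settings take effect.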

What do you see as the output of the following commands?

sudo lshw -c network -businfo
sudo nvidia-smi topo --matrix
sudo ibdev2netdev -v

Thank you.

@bkecicioglu Thank you for your reply.

I’ll give you some outputs, including the command you gave me.
Except for sudo -E python3 ./cuBB_system_checks.py, these commands were entered from outside of cuBB.

cat /etc/ptp.conf

[global]
dataset_comparison              G.8275.x
G.8275.defaultDS.localPriority  128
maxStepsRemoved                 255
logAnnounceInterval             -3
logSyncInterval                 -4
logMinDelayReqInterval          -4
G.8275.portDS.localPriority     128
network_transport               L2
domainNumber                    24
tx_timestamp_timeout            30
slaveOnly 1

clock_servo pi
step_threshold 1.0
egressLatency 28
pi_proportional_const 4.65
pi_integral_const 0.1

[aerial00]
announceReceiptTimeout 3
delay_mechanism E2E
network_transport L2


sudo lshw -c network -businfo

Bus info          Device       Class          Description
=========================================================
pci@0000:01:00.0  eno8303      network        NetXtreme BCM5720 Gigabit Ethernet PCIe
pci@0000:01:00.1  eno8403      network        NetXtreme BCM5720 Gigabit Ethernet PCIe
pci@0000:0f:00.0  aerial00     network        MT42822 BlueField-2 integrated ConnectX-6 Dx network controller
pci@0000:0f:00.1  aerial01     network        MT42822 BlueField-2 integrated ConnectX-6 Dx network controller
pci@0000:22:00.0  eno12399np0  network        BCM57504 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet
pci@0000:22:00.1  eno12409np1  network        BCM57504 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet
pci@0000:22:00.2  eno12419np2  network        BCM57504 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet
pci@0000:22:00.3  eno12429np3  network        BCM57504 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet


sudo nvidia-smi topo --matrix

        GPU0    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PXB     PXB     0,2,4,6,8,10    0               N/A
NIC0    PXB      X      PIX
NIC1    PXB     PIX      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1


sudo ibdev2netdev -v

# I've seen the following note.
# Aerial has been using Mellanox inbox driver instead of MOFED since the 23-4 release. MOFED must be removed if it is installed on the system.
sudo: ibdev2netdev: command not found
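Since ibdev2netdev ships with MOFED and Aerial has used the inbox driver since 23-4, its absence here is expected. The same device-to-interface mapping can be read from sysfs; this is a sketch that assumes the standard sysfs layout for mlx5 devices.

```shell
# List mlx5 InfiniBand devices and their network interfaces via sysfs,
# as an inbox-driver substitute for ibdev2netdev.
list_mlx_netdevs() {
  local ib net
  for ib in /sys/class/infiniband/mlx5_*; do
    [ -d "$ib" ] || continue            # skip if no mlx5 devices present
    net=$(ls "$ib/device/net" 2>/dev/null)
    echo "$(basename "$ib") ==> ${net:-<no netdev>}"
  done
}
list_mlx_netdevs
```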


sudo -E python3 ./cuBB_system_checks.py ← in cuBB


-----Mellanox NICs--------------------------------
-----Mellanox NIC Interfaces----------------------
-----Linux PTP------------------------------------


-----Software Packages----------------------------
cmake       /usr/local/bin         : 3.25.1
docker                             : N/A
gcc         /usr/bin               : 11.4.0
git-lfs     /usr/bin               : 3.0.2
MOFED                              : N/A
meson       /usr/bin               : 0.61.2
ninja       /usr/bin               : 1.10.2
ptp4l                              : N/A
-----Loaded Kernel Modules------------------------
GDRCopy                            : gdrdrv
GPUDirect RDMA                     : N/A
Nvidia                             : nvidia
-----Non-persistent settings----------------------
VM swappiness                      : vm.swappiness = 60
VM zone reclaim mode               : vm.zone_reclaim_mode = 0
-----Docker images--------------------------------
aerial@NEWHOSTNAME:/opt/nvidia/cuBB/cuPHY/util/cuBB_system_checks$ sudo lshw -c network -businfo
sudo: lshw: command not found
aerial@NEWHOSTNAME:/opt/nvidia/cuBB/cuPHY/util/cuBB_system_checks$ sudo nvidia-smi topo --matrix
        GPU0    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PXB     PXB     0,2,4,6,8,10    0               N/A
NIC0    PXB      X      PIX
NIC1    PXB     PIX      X 

It’s also frustrating that the result of sudo -E python3 ./cuBB_system_checks.py is different from the guide.

You are correct that the system checks should print more information, e.g. the PTP service.

Can you please go over the steps here to make sure nothing is missed in your installation?

Thank you.

Hi @jkk83,

Can you please run sudo -E python3 ./cuBB_system_checks.py on the host (outside of the container)?

Thank you.

@bkecicioglu Thank you for your reply!!

My hardware setup uses the R760 alone without the GH200, so I followed the steps outlined in “24-1”.

I have completed all the steps in “Installing Tools on Dell R760” as well as “Installing and Upgrading Aerial - Installing the New Aerial cuBB Container.”

If the “host” environment refers to the state before 'sudo docker exec -it cuBB /bin/bash' has been executed, then I cannot run the script there, since it is only provided inside the container.

Hi @jkk83 ,

Some of the information that cuBB_system_checks.py tries to gather is available only on the host. For example, LinuxPTP is not running in the Aerial container; it runs on the host. The version of the script in this release tries to gather such information from inside the container, which is not possible.
This is why you need to copy the script from the container to the host and run it there to get that information.
This is not written in the documentation, which caused the confusion. Sorry about that.
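The copy-and-run step looks roughly like this. It is a sketch: the container name cuBB and the in-container path are taken from the output pasted earlier in this thread; adjust them if yours differ.

```shell
# Copy cuBB_system_checks.py out of the running cuBB container and run it
# on the host, where LinuxPTP and Docker information is visible.
copy_and_run_checks() {
  local script=/opt/nvidia/cuBB/cuPHY/util/cuBB_system_checks/cuBB_system_checks.py
  sudo docker cp "cuBB:${script}" /tmp/ || return 1
  ( cd /tmp && sudo -E python3 ./cuBB_system_checks.py )
}
```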

In the next release, Rel-25-1, the script has been updated so that this information is available from inside the container.

Thank you.


@nhashimoto, Thank you for kindly explaining how to resolve additional issues that do not affect functionality.

Based on your guidance, I attempted to check elements like LinuxPTP by copying cuBB_system_checks.py from inside the container to the host using the sudo docker cp command.

At this point, I recalled the original issue I requested help with. Could you assist me with that again?

The result of cuBB_system_checks.py differs from the guide (Rel.24-1) because Docker is not detected in the ‘Software Package’ section. Is this because I accessed the container using sudo docker exec -it cuBB, or is there another issue that needs to be resolved?

Hi @jkk83 ,
When you use the docker exec command, whatever you specify after docker exec -it cuBB is executed inside the cuBB container. So it is expected to see differences between running the script in the container and on the host: the results reflect whether each software package is installed in the container or on the host. For example, cmake has to be installed in the Aerial container, while the output on the host depends on whether cmake is installed on the host.
The cuBB_system_checks.py script checks whether the software packages and configurations are set up correctly. You can skip this step if there is no problem from the software installation and configuration perspective; if you see issues when running the cuBB example codes, the script will help you identify the cause.
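To illustrate the host-versus-container difference (this is not the actual script's code, just the kind of check it performs): a PATH lookup like the one below reports N/A whenever a binary is absent from the current environment, which is why "docker" shows N/A inside the container while the same check succeeds on the host.

```shell
# Report a package as N/A when its binary is not on PATH in the
# current environment -- e.g. "docker" inside the Aerial container.
report_pkg() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1 : $(command -v "$1")"
  else
    echo "$1 : N/A"
  fi
}
report_pkg docker
report_pkg cmake
```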

Thank you.


Thank you @nhashimoto!

But, I also have a question regarding the “Update A100X BFB Image and NIC Firmware” section.

The A100X is designed to function as a GPU + DPU + NIC, and I understand that the DPU offloads data packet processing from the CPU, reducing CPU workload.

However, based on my understanding of Rel.24-1, the BF2-as-CX process configures the DPU to stop processing data packets and instead operate like a NIC.

  • Does the BF2-as-CX process disable the DPU’s data packet processing to leverage the superior CPU performance of the R760?
  • Performance-wise, is it more efficient to send AI computation results from the A100X GPU to the R760 CPU and then reintroduce them via NIC, rather than performing AI computation, data packet processing, and NIC operations within the A100X itself?

Hi @jkk83 ,

The current Aerial implementation doesn’t support offloading Aerial tasks to the CPU on the A100X DPU. This is why the A100X DPU is configured in BF2-as-CX mode.

Thank you.


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.