We are trying to run the Aerial cuBB end-to-end test case on our system, but we have run into some problems and hope to get help.
Hardware:
x86 server (Intel Xeon W-3175X CPU) + A100 GPU + BF3 NIC (900-9D3B6-00SV-A): runs testMAC + cuphycontroller
x86 server (Intel Xeon W-3175X CPU) + BF3 NIC (900-9D3B6-00SV-A): runs RU-emulator
Software:
Aerial CUDA-Accelerated RAN Release 24-3
We run the cuBB end-to-end test bench F08 as follows; the case is one 4T4R cell, DL 4 streams / UL 2 streams, 100 MHz:
/opt/nvidia/cuBB/build/cuPHY-CP/cuphycontroller/examples$ sudo -E ./cuphycontroller_scf F08_CG1
/opt/nvidia/cuBB/build/cuPHY-CP/testMAC/testMAC$ sudo ./test_mac F08 1C
/opt/nvidia/cuBB/build/cuPHY-CP/ru-emulator/ru_emulator$ sudo ./ru_emulator F08 1Ce
If DL works OK but UL is abnormal with the errors shown in your screenshot, please make sure the following are configured and working properly:
PTP sync between DU and RU-emulator.
Check the status of ptp4l.service and phc2sys.service on both the DU and the RU-emulator. Please check the outputs of the following commands:
$sudo systemctl status ptp4l.service
$sudo systemctl status phc2sys.service
$timedatectl
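Beyond the service status, it helps to look at the actual offset values ptp4l reports. The sketch below is a hypothetical helper, not part of the Aerial tooling: it parses the typical linuxptp "master offset" log line and flags large excursions. The here-doc with sample lines stands in for something like `journalctl -u ptp4l -n 100 --no-pager`, and the 100 ns threshold is an illustrative assumption, not an official limit.

```shell
# Sample ptp4l log lines (typical linuxptp format); replace the here-doc
# with: journalctl -u ptp4l -n 100 --no-pager
ptp_log=$(cat <<'EOF'
ptp4l[1234.567]: master offset        -12 s2 freq   +3432 path delay      1234
ptp4l[1235.567]: master offset         45 s2 freq   +3440 path delay      1236
EOF
)

# Extract the offset (field 4, in ns) and flag values beyond an
# illustrative 100 ns threshold.
echo "$ptp_log" | awk '/master offset/ {
    off = $4; if (off < 0) off = -off;
    status = (off <= 100) ? "OK" : "OUT OF RANGE";
    printf "offset %s ns: %s\n", $4, status
}'
```

If the offsets drift or jump by microseconds, the UL timing windows on the RU side will be missed even though both services report "active".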
PCIe addresses of the FH ports and MAC address configurations on the DU and RU-emulator.
Make sure the nic address in cuphycontroller_F08*.yaml is the PCIe address of the DU FH port connecting to the RU-emulator (below is an example):
nics:
nic: '0000:b3:00.0'
Make sure the nic address in ru-emulator/config/config.yaml is the PCIe address of the port connecting to the DU FH port, and that peers/peerethaddr is the MAC address of the DU FH port (below is an example):
nics:
nic_interface: '0000:ca:00.0'
peers:
peerethaddr: 58:a2:e1:84:7c:68
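To cross-check the values that go into those two YAML files, a small sketch like the one below lists every physical network interface together with its PCIe address and MAC (read from sysfs, which is standard on Linux). The interface names and addresses on your system will of course differ from the examples above.

```shell
# List each physical NIC with its PCIe address and MAC address, to
# cross-check the nic/nic_interface and peerethaddr values in the
# cuphycontroller and ru-emulator YAML files.
for dev in /sys/class/net/*; do
    iface=$(basename "$dev")
    [ -e "$dev/device" ] || continue   # skip virtual interfaces (lo, docker0, ...)
    pcie=$(basename "$(readlink -f "$dev/device")")
    mac=$(cat "$dev/address")
    printf '%-12s pcie=%s mac=%s\n' "$iface" "$pcie" "$mac"
done
```

Run it on the DU to get the FH port's MAC for peerethaddr, and on the RU-emulator host to get the local port's PCIe address for nic_interface.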
Make sure the CPU cores used by cuphycontroller and testMAC are on the same NUMA node.
If the server is an x86 server with two NUMA nodes, such as a Dell R750/R760, we recommend using cuphycontroller_F08_R750.yaml.
If the server is an x86 server with a single NUMA node, we recommend using cuphycontroller_F08.yaml.
Among the configuration files cuphycontroller_F08_R750.yaml, cuphycontroller_F08.yaml, and cuphycontroller_F08_CG1.yaml, the default CPU cores differ. It is OK to use cuphycontroller_F08_CG1.yaml for an x86 server setup; just make sure the CPU cores are allocated from available cores on the same NUMA node.
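To see which cores belong to which NUMA node before editing the YAML, a one-liner over `lscpu`'s parseable output works on any Linux box (shown here as a sketch; `numactl --hardware` gives similar information if numactl is installed):

```shell
# Group logical CPUs by NUMA node using lscpu's machine-readable output,
# so the cores in cuphycontroller_F08*.yaml can be checked against a
# single node.
lscpu -p=CPU,NODE | grep -v '^#' | awk -F, '
    { cores[$2] = cores[$2] " " $1 }
    END { for (n in cores) printf "NUMA node %s: cores%s\n", n, cores[n] }'
```

All cores referenced by cuphycontroller and testMAC should come from a single node in this listing.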
The errors suggest that PTP is likely out of sync. Please capture the info from the following commands:
Get the system info:
$ cd cuPHY/util/cuBB_system_checks/
$ python3 cuBB_system_checks.py
Get the ptp4l and phc2sys status:
sudo systemctl restart ptp4l.service
sudo systemctl status ptp4l.service
sudo systemctl restart phc2sys.service
sudo systemctl status phc2sys.service
Hi,
As you suggested, we rebuilt the Docker container and reconfigured. It seems better now; there is no "tick_received too late ERROR" any more.
But when we run the case from "Aerial CUDA-Accelerated RAN Release 24-3", section 1.5.4.5.2 "Running testMAC + cuPHYController_SCF + RU Emulator",
we get payload validation ERRORs on the RU-Emulator side. Logs are as follows:
We checked ptp4l and phc2sys; they seem OK. Logs are attached (DUptp4l.service.log, DUphc2sys.service.log, RUptp4l.service.log, RUphc2sys.service.log).
We ran the system check script. We are not sure whether it is OK or not. Please help to check it. (log: system_checks.txt)
We checked the phy_controller's log for SFN0.3. No error or warning is found; it seems OK. Please help to check it. (log: phySFN0.3.log.txt)
We traced the GPU with the command: sudo -E nsys profile --gpu-metric-device=all --trace=cuda -o report ./cuphycontroller_scf F08_CG1, and analysed the report using Nsight Systems.
Based on the system check info, I assume the CPU cores are hyperthreaded on your system. In your current configuration there are CPU affinity conflicts.
Please try changing the configuration as shown below to see if it helps.
In cuphycontroller_F08_CG1.yaml:
workers_ul: [5,6]
workers_dl: [11,12,14]
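To verify that the workers_ul/workers_dl cores do not collide on hyperthread siblings of the same physical core, the kernel's sysfs topology files can be listed directly. A minimal sketch (standard Linux sysfs paths, no Aerial-specific tooling):

```shell
# Print the hyperthread sibling list for each logical CPU. Two logical
# CPUs sharing a thread_siblings_list share one physical core, so only
# one of them should appear across workers_ul/workers_dl.
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
    printf '%s siblings: %s\n' "$(basename "$cpu")" \
        "$(cat "$cpu/topology/thread_siblings_list")"
done
```

For example, if cpu5 and cpu11 report the same sibling list, putting 5 in workers_ul and 11 in workers_dl would make the UL and DL workers contend for the same physical core.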