Aerial cuBB End-to-End Test Issue

We are trying to run the Aerial cuBB End-to-End test case on our system but have run into some problems, and we hope to get some help.

Hardware:
x86 server (Intel Xeon W-3175X CPU) + A100 GPU + BF3 NIC (900-9D3B6-00SV-A): runs testMAC + cuphycontroller
x86 server (Intel Xeon W-3175X CPU) + BF3 NIC (900-9D3B6-00SV-A): runs RU-Emulator
Software:
Aerial CUDA-Accelerated RAN Release 24-3

We ran the cuBB End-to-End test bench F08 as follows. The case is one 4T4R cell, 4 DL streams / 2 UL streams, 100 MHz:
/opt/nvidia/cuBB/build/cuPHY-CP/cuphycontroller/examples$ sudo -E ./cuphycontroller_scf F08_CG1
/opt/nvidia/cuBB/build/cuPHY-CP/testMAC/testMAC$ sudo ./test_mac F08 1C
/opt/nvidia/cuBB/build/cuPHY-CP/ru-emulator/ru_emulator$ sudo ./ru_emulator F08 1C

The testMAC output is:
11:30:10.132465 WRN 7536 0 [MAC.FAPI] Cell 0 | DL 1586.28 Mbps 1600 Slots | UL 0.00 Mbps 124 Slots | Prmb 0 | HARQ 0 | SR 0 | CSI1 0 | CSI2 0 | SRS 0 | ERR 378 | INV 248 | Slots 2000
11:30:10.384025 WRN 7537 0 [MAC.SCF] Slot missed: SFN curr: 130.11 last: 125.8
11:30:11.183964 WRN 7536 0 [MAC.FAPI] Cell 0 | DL 1586.28 Mbps 1600 Slots | UL 0.00 Mbps 122 Slots | Prmb 0 | HARQ 0 | SR 0 | CSI1 0 | CSI2 0 | SRS 0 | ERR 376 | INV 244 | Slots 4000
11:30:11.384026 WRN 7537 0 [MAC.SCF] Slot missed: SFN curr: 230.11 last: 225.7

DL looks OK, but UL is abnormal (0.00 Mbps). We found the following entries in the cuphycontroller log file and marked the errors in red:


So, our questions are:

  1. What does “order kernel timeout error” mean? What are the possible causes of this error? Is it related to the RU-Emulator?
  2. Can you check the attached logs and suggest how to solve the problem?

Thanks for your help!
testmac.log (9.2 KB)
phy.log (3.8 MB)
l2_adapter_config_F08_CG1.yaml.txt (4.9 KB)
test_mac_config.yaml.txt (9.4 KB)
cuphycontroller_F08_CG1.yaml.txt (34.9 KB)

@fltang
Welcome to Aerial Forum!

When DL works but UL is abnormal, with errors like those in your screenshot, please make sure the following items are configured and working properly:

  1. PTP sync between the DU and the RU-emulator.
    Check the status of ptp4l.service and phc2sys.service on both the DU and the RU-emulator, and review the output of the following commands:
    $ sudo systemctl status ptp4l.service
    $ sudo systemctl status phc2sys.service
    $ timedatectl

  2. PCIe addresses of the FH ports and MAC address configuration on the DU and the RU-emulator.

  • Make sure the nic address in cuphycontroller_F08*.yaml is the PCIe address of the DU FH port connected to the RU-emulator (example below):
    nics:
      - nic: '0000:b3:00.0'

  • Make sure the nic address in ru-emulator/config/config.yaml is the PCIe address of the port connected to the DU FH port, and that peers/peerethaddr is the MAC address of the DU FH port (example below):
    nics:
      - nic_interface: '0000:ca:00.0'
    peers:
      - peerethaddr: 58:a2:e1:84:7c:68
  3. Make sure the CPU cores used by cuphycontroller and testMAC are on the same NUMA node.
  • If the server is an x86 server with two NUMA nodes, such as a Dell R750/R760, cuphycontroller_F08_R750.yaml is recommended.
  • If the server is an x86 server with a single NUMA node, cuphycontroller_F08.yaml is recommended.

The default CPU core assignments differ among cuphycontroller_F08_R750.yaml, cuphycontroller_F08.yaml and cuphycontroller_F08_CG1.yaml. It is OK to use cuphycontroller_F08_CG1.yaml on an x86 server; just make sure every core it assigns is an available core on the same NUMA node.
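
The NUMA check in item 3 can be done by hand with lscpu, but a small script makes it repeatable. The following is a minimal sketch, not part of Aerial: it assumes a Linux host that exposes the standard sysfs files /sys/devices/system/node/nodeN/cpulist, and the core IDs in the usage lines are placeholders to be replaced with the cores from your own YAML files.

```python
import glob
import os


def parse_cpulist(text):
    """Expand a Linux cpulist string such as '0-3,8,10-11' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def read_numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Return {node_id: set_of_cpus} read from sysfs (empty if not available)."""
    nodes = {}
    for path in glob.glob(os.path.join(sysfs_root, "node[0-9]*", "cpulist")):
        node_dir = os.path.basename(os.path.dirname(path))  # e.g. 'node0'
        with open(path) as f:
            nodes[int(node_dir[4:])] = parse_cpulist(f.read())
    return nodes


def nodes_used(cores, nodes):
    """Return the set of NUMA node ids that contain at least one of the cores."""
    wanted = set(cores)
    return {node_id for node_id, cpus in nodes.items() if cpus & wanted}


if __name__ == "__main__":
    # Placeholder core IDs: substitute the cores assigned in your configs.
    example_cores = [2, 3, 4]
    print("NUMA nodes used:", sorted(nodes_used(example_cores, read_numa_nodes())))
```

If the script reports more than one node for the cores assigned to cuphycontroller and testMAC, the assignment spans NUMA nodes and should be adjusted.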

Hi,
Thanks for the reply. We followed it and got PTP working, but there are still many errors.

We then ran a simpler case from the Aerial CUDA-Accelerated RAN Release 24-3 documentation, section 1.5.4.5.1 “Running testMAC + SCF L2 Adapter Standalone”.

The L2 Adapter prints a lot of errors like the following:

08:40:10.751795 ERR timer_thread 0 [AERIAL_L2ADAPTER_EVENT] [SCF.PHY] Send LATE SLOT error indication SFN=66 slot=7
08:40:10.751797 ERR timer_thread 0 [AERIAL_L2ADAPTER_EVENT] [SCF.PHY] Send Err.ind for SFN 66.7 cell_id=0 msg_id=0x82 err_code=0x34
08:40:10.751812 WRN timer_thread 0 [L2A.MODULE] tick_received: tick_err 49311464
08:40:10.751812 ERR timer_thread 0 [AERIAL_L2ADAPTER_EVENT] [SCF.PHY] Send LATE SLOT error indication SFN=66 slot=8
08:40:10.751814 ERR timer_thread 0 [AERIAL_L2ADAPTER_EVENT] [SCF.PHY] Send Err.ind for SFN 66.8 cell_id=0 msg_id=0x82 err_code=0x34
08:40:10.751825 WRN timer_thread 0 [L2A.MODULE] tick_received: tick_err 48825031

We think the cause is that tick_received arrives too late, but we don’t know how to fix it.

Can you give us some advice? Logs are attached.

Thanks!

Hi @fltang

The errors suggest that PTP is likely still not in sync. Please capture the output of the following commands:

  1. Get the system info
    $ cd cuPHY/util/cuBB_system_checks/
    $ python3 cuBB_system_checks.py

  2. Get the ptp4l and phc2sys status
    sudo systemctl restart ptp4l.service
    sudo systemctl status ptp4l.service
    sudo systemctl restart phc2sys.service
    sudo systemctl status phc2sys.service
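
To judge the status output at a glance, the offsets that ptp4l and phc2sys log can be extracted and compared against a limit. The following is a minimal sketch, not an Aerial tool: it assumes the usual linuxptp log formats ("rms N max N ..." summary lines and "... offset N sM ..." lines), and the 100 ns default limit is an arbitrary placeholder rather than an NVIDIA requirement.

```python
import re

# Summary line, e.g. "ptp4l[...]: rms 3 max 7 freq ..."
RMS_RE = re.compile(r"\brms\s+(\d+)\s+max\s+(\d+)")
# Per-sample line, e.g. "ptp4l[...]: master offset -12 s2 freq ..."
OFFSET_RE = re.compile(r"\boffset\s+(-?\d+)\s+s(\d)")


def worst_offset_ns(log_lines):
    """Return the largest absolute offset (ns) found in the lines, or None."""
    worst = None
    for line in log_lines:
        match = RMS_RE.search(line)
        if match:
            value = int(match.group(2))  # 'max' field of a summary line
        else:
            match = OFFSET_RE.search(line)
            if not match:
                continue  # not an offset report
            value = abs(int(match.group(1)))
        worst = value if worst is None else max(worst, value)
    return worst


def looks_synced(log_lines, limit_ns=100):
    """True if at least one offset was reported and none exceeds limit_ns.

    limit_ns is a placeholder threshold chosen for illustration only.
    """
    worst = worst_offset_ns(log_lines)
    return worst is not None and worst <= limit_ns
```

The lines can come from, for example, `journalctl -u ptp4l.service -n 50 --no-pager` on both the DU and the RU-emulator host. A result of None from worst_offset_ns means no offset lines were found, which by itself suggests the service is not reporting sync.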

Hi,
Following your suggestion, we rebuilt the Docker container and reconfigured the system. It seems better now: the “tick_received too late” errors no longer appear.

However, when we run the case from the Aerial CUDA-Accelerated RAN Release 24-3 documentation, section 1.5.4.5.2 “Running testMAC + cuPHYController_SCF + RU Emulator”, the RU-Emulator reports payload validation errors. The logs are as follows:

06:43:51.482639 WRN 2156 0 [RU] PDSCH Complete Cell 0 3GPP slot 5 F0 S2 S1 Payload Validation ERROR
06:43:51.483138 WRN 2158 0 [RU] PDSCH Complete Cell 0 3GPP slot 6 F0 S3 S0 Payload Validation ERROR
06:43:51.483166 WRN 2155 0 [RU] PDCCH_DL Complete Cell 0 3GPP slot 7 F0 S3 S1 validation error
06:43:51.483168 WRN 2155 0 [RU] PDCCH_UL Complete Cell 0 3GPP slot 7 F0 S3 S1 validation error

We tried to find the cause:

  1. Checked ptp4l and phc2sys; they seem OK. Logs are attached (DUptp4l.service.log, DUphc2sys.service.log, RUptp4l.service.log, RUphc2sys.service.log).
  2. Ran the system check script. We are not sure whether the result is OK; please help check it (log: system_checks.txt).
  3. Checked the PHY controller log for SFN 0.3. We found no errors or warnings, so it seems OK; please help check it (log: phySFN0.3.log.txt).
  4. Traced the GPU with the following command and analysed the result in Nsight Systems:
    sudo -E nsys profile --gpu-metric-device=all --trace=cuda -o report ./cuphycontroller_scf F08_CG1

Two things look strange to us:

  1. No PDSCH-related kernels appear in the trace. Does this mean PDSCH is not running?
  2. The PUSCH channel estimation kernel does not appear in the trace, although other kernels do.

Can you give us some suggestions? Thanks very much!

phySFN0.3.log.txt (10.5 KB)
DUptp4l.service.log (1.3 KB)
DUphc2sys.service.log (2.0 KB)
system_checks.txt (3.8 KB)
RUptp4l.service.log (2.2 KB)
RUphc2sys.service.log (932 Bytes)

Hi @fltang

Based on the system check info, I assume hyperthreading is enabled on your system. Your current configuration has CPU affinity conflicts.
Please try changing the configuration as shown below to see if it helps.

  1. In cuphycontroller_F08_CG1.yaml:
    workers_ul: [5,6]
    workers_dl: [11,12,14]

  2. In l2_adapter_config_F08_CG1.yaml:
    timer_thread_config:
      name: timer_thread
      cpu_affinity: 7
      sched_priority: 99
    message_thread_config:
      name: msg_processing
      #core assignment
      cpu_affinity: 7
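
With hyperthreading enabled, two logical CPUs can share one physical core, so different core numbers in the YAML files may still collide. The following is a minimal sketch of how to check for that, not part of Aerial: it assumes a Linux host that exposes /sys/devices/system/cpu/cpuN/topology/thread_siblings_list, and the group names passed in are whatever labels you choose for your own settings.

```python
def parse_cpulist(text):
    """Expand a Linux cpulist string such as '5,33' or '0-1' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def read_siblings(cpu, sysfs_root="/sys/devices/system/cpu"):
    """Return the logical CPUs that share a physical core with `cpu` (Linux sysfs)."""
    path = f"{sysfs_root}/cpu{cpu}/topology/thread_siblings_list"
    with open(path) as f:
        return parse_cpulist(f.read())


def find_conflicts(assignments, siblings_of=read_siblings):
    """Find physical cores claimed by more than one group.

    assignments: {group_name: [logical_cpu, ...]}
    siblings_of: callable mapping a logical CPU to its sibling set.
    Returns {(sibling, cpus, ...): [group_name, ...]} for shared cores only.
    """
    users = {}
    for name, cpus in assignments.items():
        for cpu in cpus:
            core = tuple(sorted(siblings_of(cpu)))  # identifies the physical core
            users.setdefault(core, set()).add(name)
    return {core: sorted(names) for core, names in users.items() if len(names) > 1}
```

For example, `find_conflicts({"workers_ul": [5, 6], "workers_dl": [11, 12, 14], "l2a_threads": [7]})` checks the values suggested above against the cores your other processes use once you add those as extra groups. Threads that are deliberately pinned to the same core, such as the two L2 adapter threads on core 7, should go in one group so they are not reported.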

Thanks