Hi Aerial Team,
We are currently conducting performance tests using the Aerial SDK 25-3 (cubb) on a Supermicro GH200 server. We have successfully validated operation with 20 cells, but we are facing critical Uplink (UL) failures when attempting to scale beyond 20 cells (specifically testing 21 cells now).
phy.log (86.0 MB)
ru_config.txt (18.1 KB)
cuphycontroller_F08_CG1_config.txt (64.2 KB)
Here is our setup and the issue details:
Hardware Setup:
- L1/L2 Server: Supermicro GH200 (Grace Hopper)
- RU Emulator: Dell R750 with ru_emulator running
- Traffic Pattern: 60c (average) pattern
Resource Utilization (Baseline: 20 Cells - Working Fine):
- GPU Utilization: ~70%
- GPU Memory: ~110,000 MiB used / 146,831 MiB total
- There seems to be sufficient headroom in both compute and memory resources.
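As a rough sanity check on the headroom claim, the 20-cell figures above can be extrapolated to 21 cells. This is only a sketch: it assumes GPU memory and utilization scale linearly with cell count, which may not hold (fixed allocations, per-slot deadlines), and the function names are our own.

```python
# Figures reported for the working 20-cell baseline (approximate values).
BASELINE_CELLS = 20
GPU_MEM_USED_MIB = 110_000
GPU_MEM_TOTAL_MIB = 146_831
GPU_UTIL_PCT = 70.0


def projected(value_at_baseline: float, cells: int) -> float:
    """Scale a baseline figure linearly with cell count (an assumption)."""
    return value_at_baseline / BASELINE_CELLS * cells


def max_cells_by_memory() -> int:
    """Largest cell count that fits in GPU memory under the linear model."""
    per_cell_mib = GPU_MEM_USED_MIB / BASELINE_CELLS
    return int(GPU_MEM_TOTAL_MIB // per_cell_mib)


if __name__ == "__main__":
    for cells in (20, 21):
        mem = projected(GPU_MEM_USED_MIB, cells)
        util = projected(GPU_UTIL_PCT, cells)
        share = mem / GPU_MEM_TOTAL_MIB
        print(f"{cells} cells: ~{mem:,.0f} MiB ({share:.1%} of total), "
              f"~{util:.1f}% GPU utilization")
    print("memory-bound cell limit under this model:", max_cells_by_memory())
```

Under this model, one extra cell stays well inside both budgets, which is why we do not suspect a plain capacity limit. Average utilization does not show whether individual slots miss their deadlines, though.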
Steps Taken to Enable 21+ Cells: We have applied the following configurations to rule out resource limits and soft caps:
- Recompiled SDK for 64 Cells:
  - Build command: ${cuBB_SDK}/testBenches/phase4_test_scripts/build_aerial_sdk.sh --preset perf -- -DENABLE_64C=ON
- L1 Configuration (cuphycontroller_config.yaml):
  - cell_group_num: Increased to 40
  - total_num_srs_chest_buffers: Increased to 12288
  - max_harq_pools: Increased to 1024
  - Assigned unique eAxC IDs (60-series) to the 21st cell (Cell 20) to prevent packet classification conflicts.
- L2 Adapter Configuration:
  - Increased mempool_size (cpu_data pool_len: 2048/4096) and ring_len (32768).
  - Separated timer_thread CPU affinity to avoid contention with L1 worker threads.
- System:
  - Cleaned /dev/shm before every run.
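To double-check the eAxC assignment, an overlap check along these lines could be run over the per-cell IDs. This is a minimal sketch. The input is a hand-built mapping of cell index to eAxC IDs, not the real YAML layout, and the helper name find_eaxc_conflicts is hypothetical. It also assumes IDs must be unique across all cells, which may only be required among cells sharing the same fronthaul port and VLAN.

```python
from __future__ import annotations

from collections import defaultdict


def find_eaxc_conflicts(cell_eaxc: dict[int, set[int]]) -> dict[int, list[int]]:
    """Return {eAxC ID: [cells using it]} for IDs claimed by more than one cell."""
    owners: dict[int, list[int]] = defaultdict(list)
    for cell, ids in sorted(cell_eaxc.items()):
        for eaxc in sorted(ids):
            owners[eaxc].append(cell)
    return {eaxc: cells for eaxc, cells in owners.items() if len(cells) > 1}


if __name__ == "__main__":
    # Illustrative values only, NOT taken from the attached configuration.
    example = {
        19: {0, 1, 2, 3},
        20: {60, 61, 62, 63},
    }
    conflicts = find_eaxc_conflicts(example)
    print(conflicts or "no overlapping eAxC IDs")
```

An empty result only rules out duplicate IDs. It says nothing about whether the RU emulator side is configured with matching values.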
The Issue: Despite these changes, when running 21 cells, Cell 20 (the 21st cell) consistently shows 0.00 Mbps UL throughput with 100% CRC errors. Shortly after start, we observe Order kernel timeout errors specifically on Cell 20, followed by cascading C-plane errors on all cells.
Key Log Snippets (phy.log):
- Throughput Status (Cell 20 failing UL):
  ...
  06:19:06.320011 CON timer_thread 0 [SCF.PHY] Cell 19 | DL 558.90 Mbps 1600 Slots | UL 74.74 Mbps 392 Slots CRC 16 ( 232) | Tick 30000
  06:19:06.320011 CON timer_thread 0 [SCF.PHY] Cell 20 | DL 558.90 Mbps 1600 Slots | UL 0.00 Mbps 392 Slots CRC 392 ( 5884) | Tick 30000
- The Root Cause Error (Order Kernel Timeout):
  04:21:16.224674 ERR UlPhyDriver07 0 [AERIAL_CUPHY_API_EVENT] [DRV.FUNC_UL] SFN 26.4 Slot Map 104 Order kernel timeout error (exit condition 4) for cell index 20 Dyn index 20! Attempting PUSCH pipeline termination
  04:21:16.225054 ERR UlPhyDriver06 0 [AERIAL_CUPHY_API_EVENT] [DRV.FUNC_UL] SFN 26.4 Slot Map 104 PUSCH Pre Early Harq Wait kernel timeout!
  04:21:16.227842 ERR UlPhyDriver06 0 [AERIAL_CUPHY_API_EVENT] [DRV.FUNC_UL] SFN 26.5 Slot Map 105 cell index 20 Dyn index 20 setting as unhealthy!
- Cascading C-Plane Errors:
  04:21:16.228379 ERR UlPhyDriver07 0 [AERIAL_CUPHYDRV_API_EVENT] [DRV.FUNC_UL] UL C-plane send error for cell index 0,error type 2 Map 106 Abort UL Tasks!
  ...
  04:21:16.228385 ERR UlPhyDriver07 0 [AERIAL_CUPHYDRV_API_EVENT] [DRV.FUNC_UL] UL C-plane send error for cell index 20,error type 2 Map 106 Abort UL Tasks!
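For anyone who wants to scan the full phy.log rather than these excerpts, the per-cell status lines can be parsed with a short script. This is a sketch based only on the line format shown above. The regex, the CellStatus type and the failing_cells helper are our own. We read the number after "CRC" as the error count for the interval and do not interpret the value in parentheses.

```python
import re
from typing import Iterable, NamedTuple, Optional

# Matches the "[SCF.PHY] Cell N | DL ... | UL ... CRC x ( y)" status lines.
STATUS_RE = re.compile(
    r"Cell\s+(?P<cell>\d+)\s*\|\s*"
    r"DL\s+(?P<dl_mbps>[\d.]+)\s+Mbps\s+(?P<dl_slots>\d+)\s+Slots\s*\|\s*"
    r"UL\s+(?P<ul_mbps>[\d.]+)\s+Mbps\s+(?P<ul_slots>\d+)\s+Slots\s+"
    r"CRC\s+(?P<crc>\d+)\s*\(\s*(?P<crc_paren>\d+)\)"
)


class CellStatus(NamedTuple):
    cell: int
    dl_mbps: float
    ul_mbps: float
    ul_slots: int
    crc_errors: int


def parse_status(line: str) -> Optional[CellStatus]:
    """Parse one status line, or return None if the line is something else."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    return CellStatus(int(m["cell"]), float(m["dl_mbps"]),
                      float(m["ul_mbps"]), int(m["ul_slots"]), int(m["crc"]))


def failing_cells(lines: Iterable[str]) -> list:
    """Cells whose CRC error count reaches the UL slot count in any status line."""
    bad = set()
    for line in lines:
        status = parse_status(line)
        if status and status.ul_slots > 0 and status.crc_errors >= status.ul_slots:
            bad.add(status.cell)
    return sorted(bad)


# The two status lines quoted in this post.
SAMPLE = [
    "06:19:06.320011 CON timer_thread 0 [SCF.PHY] Cell 19 | DL 558.90 Mbps 1600 Slots | UL 74.74 Mbps 392 Slots CRC 16 ( 232) | Tick 30000",
    "06:19:06.320011 CON timer_thread 0 [SCF.PHY] Cell 20 | DL 558.90 Mbps 1600 Slots | UL 0.00 Mbps 392 Slots CRC 392 ( 5884) | Tick 30000",
]

if __name__ == "__main__":
    print(failing_cells(SAMPLE))  # prints [20]
```

To run it against the whole log, pass an open file handle instead of SAMPLE, for example `failing_cells(open("phy.log"))`.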
Questions:
- Since we compiled with -DENABLE_64C=ON, we assume the hard limit of 20 cells is removed. Is there any other hidden parameter or macro (e.g., related to the Order kernel or Green Contexts) that needs to be tuned for >20 cells on GH200?
- The error "Order kernel timeout error (exit condition 4)" suggests the GPU couldn’t process the UL task in time. Given the low GPU utilization (70%), could this be a configuration issue with mps_sm_ul_order or thread priority?
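In case it helps with triage, this is roughly how we separate the first failure from the cascade: take the earliest ERR line per cell index and see which cell errors first. It is a sketch built on the log format quoted above. The helper names are our own, and timestamps are compared as strings, so it assumes a single run that does not cross midnight.

```python
import re
from typing import Dict, Iterable, Optional, Tuple

ERR_RE = re.compile(r"^(?P<ts>\d{2}:\d{2}:\d{2}\.\d{6})\s+ERR\s+(?P<thread>\S+)\s")
CELL_RE = re.compile(r"cell index\s+(?P<cell>\d+)")


def first_errors(lines: Iterable[str]) -> Dict[int, Tuple[str, str]]:
    """Map each cell index to (timestamp, line) of its earliest ERR line."""
    first: Dict[int, Tuple[str, str]] = {}
    for raw in lines:
        line = raw.strip()
        err = ERR_RE.match(line)
        cell_match = CELL_RE.search(line)
        if not err or not cell_match:
            continue  # not an ERR line, or no cell index mentioned
        cell = int(cell_match["cell"])
        ts = err["ts"]
        # Fixed-width HH:MM:SS.ffffff strings sort chronologically.
        if cell not in first or ts < first[cell][0]:
            first[cell] = (ts, line)
    return first


def trigger_cell(lines: Iterable[str]) -> Optional[int]:
    """Cell index with the earliest ERR line, or None if there are none."""
    first = first_errors(lines)
    if not first:
        return None
    return min(first, key=lambda cell: first[cell][0])


# ERR lines quoted in this post (the elided lines are left out).
SAMPLE = [
    "04:21:16.224674 ERR UlPhyDriver07 0 [AERIAL_CUPHY_API_EVENT] [DRV.FUNC_UL] SFN 26.4 Slot Map 104 Order kernel timeout error (exit condition 4) for cell index 20 Dyn index 20! Attempting PUSCH pipeline termination",
    "04:21:16.225054 ERR UlPhyDriver06 0 [AERIAL_CUPHY_API_EVENT] [DRV.FUNC_UL] SFN 26.4 Slot Map 104 PUSCH Pre Early Harq Wait kernel timeout!",
    "04:21:16.227842 ERR UlPhyDriver06 0 [AERIAL_CUPHY_API_EVENT] [DRV.FUNC_UL] SFN 26.5 Slot Map 105 cell index 20 Dyn index 20 setting as unhealthy!",
    "04:21:16.228379 ERR UlPhyDriver07 0 [AERIAL_CUPHYDRV_API_EVENT] [DRV.FUNC_UL] UL C-plane send error for cell index 0,error type 2 Map 106 Abort UL Tasks!",
    "04:21:16.228385 ERR UlPhyDriver07 0 [AERIAL_CUPHYDRV_API_EVENT] [DRV.FUNC_UL] UL C-plane send error for cell index 20,error type 2 Map 106 Abort UL Tasks!",
]

if __name__ == "__main__":
    print("first cell to report an error:", trigger_cell(SAMPLE))
```

On the excerpts above this points at cell index 20, which matches our reading that the C-plane errors on the other cells follow the Order kernel timeout rather than cause it.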
Any insights or suggestions on what to check next would be greatly appreciated.
Best regards,