[GH200] Uplink Traffic Failure on 21+ Cells: "Order kernel timeout" & CRC Errors despite "-DENABLE_64C=ON"

Hi Aerial Team,

We are currently conducting performance tests using the Aerial SDK 25-3 (cuBB) on a Supermicro GH200 server. We have successfully validated operation with 20 cells, but we are facing critical Uplink (UL) failures when attempting to scale beyond 20 cells (currently testing 21 cells).

phy.log (86.0 MB)

ru_config.txt (18.1 KB)

cuphycontroller_F08_CG1_config.txt (64.2 KB)

Here is our setup and the issue details:

Hardware Setup:

  • L1/L2 Server: Supermicro GH200 (Grace Hopper)

  • RU Emulator: Dell R750 with ru_emulator running

  • Traffic Pattern: 60c (average)

Resource Utilization (Baseline: 20 Cells - Working Fine):

  • GPU Utilization: ~70%

  • GPU Memory: ~110,000 MiB used / 146,831 MiB total

  • There seems to be sufficient headroom in both compute and memory resources.

Steps Taken to Enable 21+ Cells: We have applied the following configurations to rule out resource limits and soft caps:

  1. Recompiled SDK for 64 Cells:

    • Build command: ${cuBB_SDK}/testBenches/phase4_test_scripts/build_aerial_sdk.sh --preset perf -- -DENABLE_64C=ON
  2. L1 Configuration (cuphycontroller_config.yaml):

    • cell_group_num: Increased to 40

    • total_num_srs_chest_buffers: Increased to 12288

    • max_harq_pools: Increased to 1024

    • Assigned unique eAxC IDs (60-series) to the 21st cell (Cell 20) to prevent packet classification conflicts.

  3. L2 Adapter Configuration:

    • Increased mempool_size (cpu_data pool_len: 2048 → 4096) and ring_len (to 32768).

    • Separated timer_thread CPU affinity to avoid contention with L1 worker threads.

  4. System:

    • Cleaned /dev/shm before every run.
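For reference, the L1 changes above would look roughly like the following in cuphycontroller_config.yaml. This is a sketch only: the key names are the ones listed in this post, but the exact section nesting may differ between SDK versions, so please treat it as illustrative rather than a verified fragment of our file.

```yaml
# Sketch of the cuphycontroller_config.yaml deltas described above
# (key nesting is assumed; verify against your SDK's reference config).
cuphydriver_config:
  cell_group_num: 40                  # raised from the 20-cell baseline
  total_num_srs_chest_buffers: 12288  # more SRS channel-estimate buffers
  max_harq_pools: 1024                # more HARQ buffer pools
```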

The Issue: Despite these changes, when running 21 cells, Cell 20 (the 21st cell) consistently shows 0.00 Mbps UL throughput with 100% CRC errors. Shortly after start, we observe Order kernel timeout errors specifically on Cell 20, followed by cascading C-plane errors on all cells.

Key Log Snippets (phy.log):

  1. Throughput Status (Cell 20 failing UL):


    ...
    06:19:06.320011 CON timer_thread 0 [SCF.PHY] Cell 19 | DL  558.90 Mbps 1600 Slots | UL   74.74 Mbps  392 Slots CRC  16 (   232) | Tick 30000
    06:19:06.320011 CON timer_thread 0 [SCF.PHY] Cell 20 | DL  558.90 Mbps 1600 Slots | UL    0.00 Mbps  392 Slots CRC 392 (  5884) | Tick 30000
    
    
  2. The Root Cause Error (Order Kernel Timeout):


    04:21:16.224674 ERR UlPhyDriver07 0 [AERIAL_CUPHY_API_EVENT] [DRV.FUNC_UL] SFN 26.4 Slot Map 104 Order kernel timeout error (exit condition 4) for cell index 20 Dyn index 20! Attempting PUSCH pipeline termination
    04:21:16.225054 ERR UlPhyDriver06 0 [AERIAL_CUPHY_API_EVENT] [DRV.FUNC_UL] SFN 26.4 Slot Map 104 PUSCH Pre Early Harq Wait kernel timeout!
    04:21:16.227842 ERR UlPhyDriver06 0 [AERIAL_CUPHY_API_EVENT] [DRV.FUNC_UL] SFN 26.5 Slot Map 105 cell index 20 Dyn index 20 setting as unhealthy!
    
    
  3. Cascading C-Plane Errors:


    04:21:16.228379 ERR UlPhyDriver07 0 [AERIAL_CUPHYDRV_API_EVENT] [DRV.FUNC_UL] UL C-plane send error for cell index 0,error type 2 Map 106 Abort UL Tasks!
    ...
    04:21:16.228385 ERR UlPhyDriver07 0 [AERIAL_CUPHYDRV_API_EVENT] [DRV.FUNC_UL] UL C-plane send error for cell index 20,error type 2 Map 106 Abort UL Tasks!
    
    

Questions:

  1. Since we compiled with -DENABLE_64C=ON, we assume the hard limit of 20 cells is removed. Is there any other hidden parameter or macro (e.g., related to Order kernel or Green Contexts) that needs to be tuned for >20 cells on GH200?

  2. The error Order kernel timeout error (exit condition 4) suggests the GPU couldn’t process the UL task in time. Given the moderate GPU utilization (~70%), could this be a configuration issue with mps_sm_ul_order or thread priority?

Any insights or suggestions on what to check next would be greatly appreciated.

Best regards,

Hi @hojoon.won ,

Aerial SDK 25-3 supports up to 20 peak cells with the channel configuration as shared in our release notes. Please see here.

The purpose of the 64C compilation flag is to allow experimenting with higher cell counts at lower traffic per cell.

Thank you.

Hi,

Thank you for the clarification. I understand that the official support is up to 20 peak cells and that the 64C flag is intended for experimental purposes with lower traffic per cell.

However, I noticed that the SDK includes dedicated configuration files for 40 cells (e.g., cuphycontroller_40C.yaml, ruemulator_config_40C.yaml) along with the 64C compilation flag. This led me to believe that testing beyond 20 cells is feasible under certain conditions.

I am currently attempting to run this experiment using the 60c (average usage) pattern, not the peak load. Could you please share any reference configurations, specific parameter tuning advice, or known limitations for running 20+ cells with this “average” traffic pattern?

Any guidance on enabling this experimental setup would be very helpful.

Best regards,

Hi @hojoon.won ,

The average cell capacity is also 20 cells (here).

Thank you.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.