Hello,
I am trying to run the cuBB E2E test in the 25-1-cuBB environment and have followed the steps to set up the related SDK and firmware on my GH200 server. However, when I run cuphycontroller_scf, it exits during initialization with a DOCA_ERROR_NOT_SUPPORTED error, as shown in the log below.
Based on the version information below, I do not think the DOCA version is the issue.
$ ofed_info -s
OFED-internal-24.04-0.6.6:
My BF3 information is as follows:
$ sudo mlxfwmanager -d 0002:01:00.0
Querying Mellanox devices firmware ...
Device #1:
----------
Device Type: BlueField3
Part Number: 900-9D3B6-00SV-A_Ax
Description: NVIDIA BlueField-3 B3220 P-Series FHHL DPU; 200GbE (default mode) / NDR200 IB; Dual-port QSFP112; PCIe Gen5.0 x16 with x16 PCIe extension option; 16 Arm cores; 32GB on-board DDR; integrated BMC; Crypto Disabled
PSID: MT_0000000965
PCI Device Name: 0002:01:00.0
Base MAC: 9c63c0b82600
Versions:            Current        Available
FW                   32.41.1000     N/A
PXE                  3.7.0400       N/A
UEFI                 14.34.0012     N/A
UEFI Virtio blk      22.4.0013      N/A
UEFI Virtio net      21.4.0013      N/A
Status: No matching image found
The cuphycontroller error log is as follows:
aerial@airan:/opt/nvidia/cuBB/build/cuPHY-CP/cuphycontroller/examples$ sudo -E ./cuphycontroller_scf F08_CG1
Started cuphycontroller on CPU core 70
Aerial metrics backend address: 127.0.0.1:8081
07:24:32.803751 CON phy_init 0 [CTL.SCF] Config file: /opt/nvidia/cuBB/cuPHY-CP/cuphycontroller/config/cuphycontroller_F08_CG1.yaml
07:24:32.804426 CON phy_init 0 [CTL.SCF] low_priority_core=23
07:24:32.804440 CON phy_init 0 [APP.CONFIG] Current TAI offset: 0s
07:24:32.804474 CON phy_init 0 [NVLOG.CPP] Using /opt/nvidia/cuBB/cuPHY/nvlog/config/nvlog_config.yaml for nvlog configuration
07:24:32.804709 CON phy_init 0 [NVLOG.CPP] Output log file path /tmp/phy.log
YAML invalid key: enable_l1_param_sanity_check Using default value of 0 to YAML_PARAM_ENABLE_L1_PARAM_SANITY_CHECK
YAML invalid key: bfw_beta_prescaler Using default value of 1 for YAML_PARAM_BFW_BETA_PRESCALER 2048
YAML invalid key: pusch_nMaxLdpcHetConfigs Using default value of 32 to PUSCH-N-MAX-LDPC-HET-CONFIGS
07:24:32.816210 CON phy_init 0 [CTL.YAML] cuphycontroller config. yaml does not have gpu_init_comms_via_cpu key; defaulting to 0.
07:24:32.816211 CON phy_init 0 [CTL.YAML] cuphycontroller config. yaml does not have cpu_init_comms key; defaulting to 0.
07:24:32.816367 CON phy_init 0 [CTL.YAML] cell_id 1 nic_index :0
07:24:32.816470 CON phy_init 0 [CTL.YAML] Num Slots: 8
07:24:32.816470 CON phy_init 0 [CTL.YAML] Enable UL cuPHY Graphs: 1
07:24:32.816470 CON phy_init 0 [CTL.YAML] Enable DL cuPHY Graphs: 1
07:24:32.816471 CON phy_init 0 [CTL.YAML] Accurate TX scheduling clock resolution (ns): 500
07:24:32.816471 CON phy_init 0 [CTL.YAML] DPDK core: 23
07:24:32.816471 CON phy_init 0 [CTL.YAML] Prometheus core: -1
07:24:32.816471 CON phy_init 0 [CTL.YAML] UL cores:
07:24:32.816472 CON phy_init 0 [CTL.YAML] - 5
07:24:32.816472 CON phy_init 0 [CTL.YAML] - 6
07:24:32.816472 CON phy_init 0 [CTL.YAML] DL cores:
07:24:32.816472 CON phy_init 0 [CTL.YAML] - 11
07:24:32.816472 CON phy_init 0 [CTL.YAML] - 12
07:24:32.816472 CON phy_init 0 [CTL.YAML] - 13
07:24:32.816472 CON phy_init 0 [CTL.YAML] Debug worker: -1
07:24:32.816473 CON phy_init 0 [CTL.YAML] Data Lake core: -1
07:24:32.816473 CON phy_init 0 [CTL.YAML] SRS starting Section ID: 3072
07:24:32.816473 CON phy_init 0 [CTL.YAML] PRACH starting Section ID: 2048
07:24:32.816473 CON phy_init 0 [CTL.YAML] USE GREEN CONTEXTS: 0
07:24:32.816473 CON phy_init 0 [CTL.YAML] USE BATCHED MEMCPY: 1
07:24:32.816473 CON phy_init 0 [CTL.YAML] MPS SM PUSCH: 100
07:24:32.816473 CON phy_init 0 [CTL.YAML] MPS SM PUCCH: 2
07:24:32.816473 CON phy_init 0 [CTL.YAML] MPS SM PRACH: 2
07:24:32.816474 CON phy_init 0 [CTL.YAML] MPS SM UL ORDER: 20
07:24:32.816474 CON phy_init 0 [CTL.YAML] MPS SM PDSCH: 102
07:24:32.816474 CON phy_init 0 [CTL.YAML] MPS SM PDCCH: 10
07:24:32.816474 CON phy_init 0 [CTL.YAML] MPS SM PBCH: 2
07:24:32.816474 CON phy_init 0 [CTL.YAML] MPS SM GPU_COMMS: 16
07:24:32.816474 CON phy_init 0 [CTL.YAML] PDSCH fallback: 0
07:24:32.816474 CON phy_init 0 [CTL.YAML] Massive MIMO enable: 0
07:24:32.816474 CON phy_init 0 [CTL.YAML] mMIMO_enable feature 0
07:24:32.816474 CON phy_init 0 [CTL.YAML] Enable SRS : 0
07:24:32.816475 CON phy_init 0 [CTL.YAML] ul_order_timeout_gpu_log_enable: 0
07:24:32.816475 CON phy_init 0 [CTL.YAML] ue_mode: 0
07:24:32.816476 CON phy_init 0 [CTL.YAML] Aggr Obj Non-availability threshold: 5
07:24:32.816476 CON phy_init 0 [CTL.YAML] sendCPlane_timing_error_th_ns: 0
07:24:32.816476 CON phy_init 0 [CTL.YAML] pusch_aggr_per_ctx: 3
07:24:32.816476 CON phy_init 0 [CTL.YAML] prach_aggr_per_ctx: 2
07:24:32.816477 CON phy_init 0 [CTL.YAML] pucch_aggr_per_ctx: 4
07:24:32.816477 CON phy_init 0 [CTL.YAML] srs_aggr_per_ctx: 3
07:24:32.816477 CON phy_init 0 [CTL.YAML] max_harq_pools: 384
07:24:32.816477 CON phy_init 0 [CTL.YAML] ul_input_buffer_per_cell: 10
07:24:32.816477 CON phy_init 0 [CTL.YAML] ul_input_buffer_per_cell_srs: 6
07:24:32.816477 CON phy_init 0 [CTL.YAML] max_ru_unhealthy_ul_slots: 0
07:24:32.816477 CON phy_init 0 [CTL.YAML] srs_chest_algo_type: 0
07:24:32.816477 CON phy_init 0 [CTL.YAML] srs_chest_tol2_normalization_algo_type: 1
07:24:32.816477 CON phy_init 0 [CTL.YAML] srs_chest_tol2_constant_scaler: 32768
07:24:32.816478 CON phy_init 0 [CTL.YAML] bfw_power_normalization_alg_selector: 0
07:24:32.816478 CON phy_init 0 [CTL.YAML] bfw_beta_prescaler: 2048
07:24:32.816478 CON phy_init 0 [CTL.YAML] total_num_srs_chest_buffers: 6144
07:24:32.816478 CON phy_init 0 [CTL.YAML] ul_pcap_capture_enable: 1
07:24:32.816478 CON phy_init 0 [CTL.YAML] ul_pcap_capture_thread_cpu_affinity: 19
07:24:32.816478 CON phy_init 0 [CTL.YAML] ul_pcap_capture_thread_sched_priority: 95
07:24:32.816478 CON phy_init 0 [CTL.YAML] ul_order_timeout_gpu_log_enable: 0
07:24:32.816479 CON phy_init 0 [CTL.YAML] pusch_workCancelMode: 2
07:24:32.816479 CON phy_init 0 [CTL.YAML] GPU-initiated comms DL: 1
07:24:32.816479 CON phy_init 0 [CTL.YAML] GPU-initiated comms (via CPU): 0
07:24:32.816479 CON phy_init 0 [CTL.YAML] CPU-initiated comms : 0
07:24:32.816479 CON phy_init 0 [CTL.YAML] Cell group: 1
07:24:32.816479 CON phy_init 0 [CTL.YAML] Cell group num: 1
07:24:32.816480 CON phy_init 0 [CTL.YAML] puxchPolarDcdrListSz: 8
07:24:32.816480 CON phy_init 0 [CTL.YAML] split_ul_cuda_streams: 0
07:24:32.816480 CON phy_init 0 [CTL.YAML] serialize_pucch_pusch: 0
07:24:32.816480 CON phy_init 0 [CTL.YAML] Number of Cell Configs: 1
07:24:32.816480 CON phy_init 0 [CTL.YAML] L2Adapter config file: /opt/nvidia/cuBB/cuPHY-CP/cuphycontroller/config/l2_adapter_config_F08_CG1.yaml
07:24:32.816480 CON phy_init 0 [CTL.YAML] Cell name: O-RU 0
07:24:32.816481 CON phy_init 0 [CTL.YAML] MU: 1
07:24:32.816481 CON phy_init 0 [CTL.YAML] ID: 1
07:24:32.816481 CON phy_init 0 [CTL.YAML] Number of MPlane Configs: 1
07:24:32.816481 CON phy_init 0 [CTL.YAML] Mplane ID: 1
07:24:32.816481 CON phy_init 0 [CTL.YAML] VLAN ID: 2
07:24:32.816481 CON phy_init 0 [CTL.YAML] Source Eth Address: 00:00:00:00:00:00
07:24:32.816481 CON phy_init 0 [CTL.YAML] Destination Eth Address: 20:04:9b:9e:27:a3
07:24:32.816481 CON phy_init 0 [CTL.YAML] NIC port: 0002:01:00.0
07:24:32.816482 CON phy_init 0 [CTL.YAML] RU Type: 3
07:24:32.816482 CON phy_init 0 [CTL.YAML] U-plane TXQs: 1
07:24:32.816482 CON phy_init 0 [CTL.YAML] DL compression method: 1
07:24:32.816482 CON phy_init 0 [CTL.YAML] DL iq bit width: 9
07:24:32.816482 CON phy_init 0 [CTL.YAML] UL compression method: 1
07:24:32.816482 CON phy_init 0 [CTL.YAML] UL iq bit width: 9
07:24:32.816482 CON phy_init 0 [CTL.YAML]
07:24:32.816482 CON phy_init 0 [CTL.YAML] Flow list SSB/PBCH:
07:24:32.816482 CON phy_init 0 [CTL.YAML] 8
07:24:32.816482 CON phy_init 0 [CTL.YAML] 0
07:24:32.816483 CON phy_init 0 [CTL.YAML] 1
07:24:32.816483 CON phy_init 0 [CTL.YAML] 2
07:24:32.816483 CON phy_init 0 [CTL.YAML] Flow list PDCCH:
07:24:32.816483 CON phy_init 0 [CTL.YAML] 8
07:24:32.816483 CON phy_init 0 [CTL.YAML] 0
07:24:32.816483 CON phy_init 0 [CTL.YAML] 1
07:24:32.816483 CON phy_init 0 [CTL.YAML] 2
07:24:32.816483 CON phy_init 0 [CTL.YAML] Flow list PDSCH:
07:24:32.816483 CON phy_init 0 [CTL.YAML] 8
07:24:32.816483 CON phy_init 0 [CTL.YAML] 0
07:24:32.816483 CON phy_init 0 [CTL.YAML] 1
07:24:32.816483 CON phy_init 0 [CTL.YAML] 2
07:24:32.816483 CON phy_init 0 [CTL.YAML] Flow list CSIRS:
07:24:32.816483 CON phy_init 0 [CTL.YAML] 8
07:24:32.816483 CON phy_init 0 [CTL.YAML] 0
07:24:32.816483 CON phy_init 0 [CTL.YAML] 1
07:24:32.816484 CON phy_init 0 [CTL.YAML] 2
07:24:32.816484 CON phy_init 0 [CTL.YAML] Flow list PUSCH:
07:24:32.816484 CON phy_init 0 [CTL.YAML] 8
07:24:32.816484 CON phy_init 0 [CTL.YAML] 0
07:24:32.816484 CON phy_init 0 [CTL.YAML] 1
07:24:32.816484 CON phy_init 0 [CTL.YAML] 2
07:24:32.816484 CON phy_init 0 [CTL.YAML] Flow list PUCCH:
07:24:32.816484 CON phy_init 0 [CTL.YAML] 8
07:24:32.816484 CON phy_init 0 [CTL.YAML] 0
07:24:32.816484 CON phy_init 0 [CTL.YAML] 1
07:24:32.816484 CON phy_init 0 [CTL.YAML] 2
07:24:32.816484 CON phy_init 0 [CTL.YAML] Flow list SRS:
07:24:32.816484 CON phy_init 0 [CTL.YAML] 8
07:24:32.816485 CON phy_init 0 [CTL.YAML] 0
07:24:32.816485 CON phy_init 0 [CTL.YAML] 1
07:24:32.816485 CON phy_init 0 [CTL.YAML] 2
07:24:32.816485 CON phy_init 0 [CTL.YAML] Flow list PRACH:
07:24:32.816485 CON phy_init 0 [CTL.YAML] 15
07:24:32.816485 CON phy_init 0 [CTL.YAML] 7
07:24:32.816485 CON phy_init 0 [CTL.YAML] 0
07:24:32.816485 CON phy_init 0 [CTL.YAML] 1
07:24:32.816485 CON phy_init 0 [CTL.YAML] PUSCH TV: /opt/nvidia/cuBB/testVectors/cuPhyChEstCoeffs.h5
07:24:32.816485 CON phy_init 0 [CTL.YAML] SRS TV: /opt/nvidia/cuBB/testVectors/cuPhyChEstCoeffs.h5
07:24:32.816485 CON phy_init 0 [CTL.YAML] Section_3 time offset: 58369
07:24:32.816485 CON phy_init 0 [CTL.YAML] nMaxRxAnt: 4
07:24:32.816485 CON phy_init 0 [CTL.YAML] PUSCH PRBs Stride: 273
07:24:32.816486 CON phy_init 0 [CTL.YAML] PRACH PRBs Stride: 12
07:24:32.816486 CON phy_init 0 [CTL.YAML] SRS PRBs Stride: 273
07:24:32.816486 CON phy_init 0 [CTL.YAML] PUSCH nMaxPrb: 273
07:24:32.816486 CON phy_init 0 [CTL.YAML] PUSCH nMaxRx: 4
07:24:32.816486 CON phy_init 0 [CTL.YAML] UL Gain Calibration: 48.68
07:24:32.816486 CON phy_init 0 [CTL.YAML] Lower guard bw: 845
07:24:32.985697 CON phy_init 0 [CTL.SCF] Network interface for PCIe address 0002:01:00.0 : enP2s2f0np0
07:24:32.985749 CON phy_init 0 [APP.UTILS] PHC clock: 1756193070.517354396
07:24:32.985762 CON phy_init 0 [APP.UTILS] CLOCK_TAI: 1756193072.985756401
07:24:32.985762 CON phy_init 0 [APP.UTILS] CLOCK_REALTIME: 1756193072.985756369
07:24:32.985762 CON phy_init 0 [APP.UTILS] TAI/REALTIME offset: 0 seconds
07:24:32.988110 CON phy_init 0 [DRV.CTX] CUDA_DEVICE_MAX_CONNECTIONS 8
07:24:32.988114 CON phy_init 0 [DRV.CTX] use_green_contexts 0
EAL: Detected CPU lcores: 72
EAL: Detected NUMA nodes: 1
EAL: Detected shared linkage of DPDK
EAL: Multi-process socket /var/run/dpdk/cuphycontroller/mp_socket
EAL: Selected IOVA mode 'VA'
EAL: VFIO support initialized
EAL: Probe PCI driver: gpu_cuda (10de:2342) device: 0009:01:00.0 (socket 0)
[07:24:34:282035][385344][DOCA][ERR][doca_eth_txq.c:3528][doca_eth_txq_cap_get_wait_on_time_offload_supported] DEVINFO 0x3bb18d65160: Failed to get wait_on_time_offload_supported: querying capabilities failed. err=DOCA_ERROR_NOT_SUPPORTED
EAL: Probe PCI driver: mlx5_pci (15b3:a2dc) device: 0002:01:00.0 (socket 0)
07:24:34.282045 ERR phy_init 0 [AERIAL_DPDK_API_EVENT] [FH.NIC] doca_eth_txq_get_wait_on_time_offload_supported returned Operation not supported
[07:24:34:679971][385344][DOCA][ERR][doca_eth_txq.c:3528][doca_eth_txq_cap_get_wait_on_time_offload_supported] DEVINFO 0x3bb18d65160: Failed to get wait_on_time_offload_supported: querying capabilities failed. err=DOCA_ERROR_NOT_SUPPORTED
[07:24:34:679984][385344][DOCA][ERR][doca_eth_txq.c:2169][eth_txq_start_gpu_ctx] ETH_TXQ 0x3bb19d89f80: Failed to start txq: failed to get wait on time offload capability. err=DOCA_ERROR_NOT_SUPPORTED
[07:24:34:679988][385344][DOCA][ERR][doca_pe.cpp:1016][start_context] Progress engine 0x3bb18d684c0: Failed to start context=0x3bb19d89f80. err=DOCA_ERROR_NOT_SUPPORTED
07:24:34.702634 FATAL exit: Thread [phy_drv_init] on core 23 file /opt/nvidia/cuBB/cuPHY-CP/cuphycontroller/examples/cuphycontroller_scf.cpp line 342: additional info: NULL
07:24:34.679990 ERR phy_init 0 [AERIAL_INVALID_PARAM_EVENT] [FH.DOCA] Failed doca_ctx_start: Operation not supported
07:24:34.691628 ERR phy_init 0 [AERIAL_ORAN_FH_EVENT] [FH.LIB] Exception! Failed to setup DOCA GPU TxQ #60 on NIC 0002:01:00.0 because doca_create_tx_queue was a failure
07:24:34.691636 ERR phy_init 0 [AERIAL_ORAN_FH_EVENT] [DRV.FH] Failed to add NIC 0002:01:00.0
07:24:34.702596 ERR phy_init 0 [AERIAL_CUPHYDRV_API_EVENT] [DRV.EXCP] /opt/nvidia/cuBB/cuPHY-CP/cuphydriver/src/common/cuphydriver_api.cpp l1_init line 132 exception: NIC registration error
07:24:34.702611 ERR phy_init 0 [AERIAL_CUPHYDRV_API_EVENT] [CTL.DRV] Error l1_init
07:24:34.702613 ERR phy_init 0 [AERIAL_CUPHYDRV_API_EVENT] [CTL.SCF] pc_init_phydriver error -1
07:24:34.702648 FAT phy_init 0 [AERIAL_SYSTEM_API_EVENT] [NVLOG.EXIT_HANDLER] FATAL exit: Thread [phy_drv_init] on core 23 file /opt/nvidia/cuBB/cuPHY-CP/cuphycontroller/examples/cuphycontroller_scf.cpp line 342: additional info: NULL
Stack trace (most recent call last):
#6 Object "/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1", at 0xffffffffffffffff, in
#5 Object "/opt/nvidia/cuBB/build/cuPHY-CP/cuphycontroller/examples/cuphycontroller_scf", at 0x41e1ef, in _start
#4 Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0xff8d0a2874cb, in __libc_start_main
#3 Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0xff8d0a2873fb, in
#2 Object "/opt/nvidia/cuBB/build/cuPHY-CP/cuphycontroller/examples/cuphycontroller_scf", at 0x411e6b, in main
#1 Object "/opt/nvidia/cuBB/build/cuPHY/nvlog/libnvlog.so", at 0xff8d0a6fcf2b, in exit_handler::test_trigger_exit(char const*, int, char const*)
07:24:34.802719 CON phy_init 0 [DRV.API] Triggering L1 exit handler
#0 Source "/opt/nvidia/cuBB/cuPHY-CP/cuphydriver/src/common/cuphydriver_api.cpp", line 2856, in l1_exit_handler
2853: //PhyDriver initialization failure
2854: if(l1_getPhydriverHandle() == nullptr)
2855: {
>2856: AERIAL_PRINT_BACKTRACE(32ULL);
2857: std::exit(EXIT_FAILURE); //Exit immediately
2858: }
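Since the console output interleaves DOCA, DPDK and Aerial messages, the failure chain is easier to follow with only the DOCA errors pulled out. The following is a minimal sketch for that, not an NVIDIA tool: the function name first_doca_errors is hypothetical, and the regular expression assumes the "[DOCA][ERR][file:line][function]" layout seen in the lines above, which may differ in other DOCA versions.

```python
import re

# Assumed layout, inferred only from the log lines in this post:
# [time][pid][DOCA][ERR][source_file:line][function] message
DOCA_ERR = re.compile(
    r"\[DOCA\]\[ERR\]\[(?P<file>[^\]:]+):(?P<line>\d+)\]"
    r"\[(?P<func>[^\]]+)\]\s*(?P<msg>.*)"
)


def first_doca_errors(lines):
    """Return (file, line, function, message) once per distinct DOCA call site,
    in order of first appearance."""
    seen = set()
    result = []
    for text in lines:
        match = DOCA_ERR.search(text)
        if match is None:
            continue  # not a DOCA error line (EAL, Aerial, etc.)
        site = (match["file"], int(match["line"]))
        if site in seen:
            continue  # same call site already reported
        seen.add(site)
        result.append((site[0], site[1], match["func"], match["msg"]))
    return result


# Sample lines copied from the log in this post.
SAMPLE = [
    "EAL: VFIO support initialized",
    "[07:24:34:282035][385344][DOCA][ERR][doca_eth_txq.c:3528][doca_eth_txq_cap_get_wait_on_time_offload_supported] DEVINFO 0x3bb18d65160: Failed to get wait_on_time_offload_supported: querying capabilities failed. err=DOCA_ERROR_NOT_SUPPORTED",
    "[07:24:34:679971][385344][DOCA][ERR][doca_eth_txq.c:3528][doca_eth_txq_cap_get_wait_on_time_offload_supported] DEVINFO 0x3bb18d65160: Failed to get wait_on_time_offload_supported: querying capabilities failed. err=DOCA_ERROR_NOT_SUPPORTED",
    "[07:24:34:679984][385344][DOCA][ERR][doca_eth_txq.c:2169][eth_txq_start_gpu_ctx] ETH_TXQ 0x3bb19d89f80: Failed to start txq: failed to get wait on time offload capability. err=DOCA_ERROR_NOT_SUPPORTED",
    "[07:24:34:679988][385344][DOCA][ERR][doca_pe.cpp:1016][start_context] Progress engine 0x3bb18d684c0: Failed to start context=0x3bb19d89f80. err=DOCA_ERROR_NOT_SUPPORTED",
]

for src, line_no, func, msg in first_doca_errors(SAMPLE):
    print(f"{src}:{line_no} {func}: {msg}")
```

Run on the sample lines, this leaves three call sites in order: the wait_on_time_offload_supported capability query, then eth_txq_start_gpu_ctx, then start_context. That ordering suggests the capability query is the first thing that fails and the later errors follow from it.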
Could you please advise on how to solve this problem?