cuBB failed to start

Here is the failure message:

aerial@mit-b32-gnb3:~/openairinterface5g/ci-scripts/yaml_files/sa_gh_gnb$ docker compose -f docker-compose-gnb.yaml up
WARN[0000] Found orphan containers ([oai-upf oai-smf oai-amf oai-ausf oai-udm oai-udr oai-ext-dn oai-nrf mysql asterisk-ims]) for this project. If you removed or renamed this service in your compose file, you can run this command with the --remove-orphans flag to clean it up.
[+] Running 2/0
 ✔ Container nv-cubb           Created                                                                            0.0s
 ✔ Container c_oai-gnb-aerial  Created                                                                            0.0s
Attaching to c_oai-gnb-aerial, nv-cubb
nv-cubb           |
nv-cubb           | ==========
nv-cubb           | == CUDA ==
nv-cubb           | ==========
nv-cubb           |
nv-cubb           | CUDA Version 12.6.2
nv-cubb           |
nv-cubb           | Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
nv-cubb           |
nv-cubb           | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
nv-cubb           | By pulling and using the container, you accept the terms and conditions of this license:
nv-cubb           | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
nv-cubb           |
nv-cubb           | A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
nv-cubb           |
nv-cubb           | Cannot find MPS control daemon process
nv-cubb           | Supermicro-G1SMH-G
nv-cubb           | Started cuphycontroller on CPU core 2
nv-cubb           | AERIAL_LOG_PATH set to /var/log/aerial
nv-cubb           | Log file set to /var/log/aerial/phy.log
nv-cubb           | Aerial metrics backend address: 127.0.0.1:8081
nv-cubb           | 21:05:20.624345 WRN phy_init 0 [CTL.SCF] Config file: /opt/nvidia/cuBB/cuPHY-CP/cuphycontroller/config/cuphycontroller_P5G_FXN_GH.yaml
nv-cubb           | 21:05:20.624730 WRN phy_init 0 [CTL.SCF] low_priority_core=10
nv-cubb           | 21:05:20.624744 WRN phy_init 0 [APP.CONFIG] Current TAI offset: 0s
nv-cubb           | 21:05:20.625037 WRN phy_init 0 [NVLOG.CPP] Using /opt/nvidia/cuBB/cuPHY/nvlog/config/nvlog_config.yaml for nvlog configuration
nv-cubb           | 21:05:20.625051 WRN phy_init 0 [NVLOG.CPP] Output log file path /var/log/aerial/phy.log
nv-cubb           | YAML invalid key: enable_l1_param_sanity_check Using default value of 0 to YAML_PARAM_ENABLE_L1_PARAM_SANITY_CHECK
nv-cubb           | YAML invalid key: pmu_metrics Using default value of 0 to YAML_PARAM_PMU_METRICS
nv-cubb           | YAML invalid key: ul_order_max_rx_pkts Using default value of 512 to UL_ORDER_MAX_RX_PKTS
nv-cubb           | YAML invalid key: ul_order_rx_pkts_timeout_ns Using default value of 100us to YAML_PARAM_UL_ORDER_RX_PKTS_TIMEOUT_NS
nv-cubb           | 21:05:20.649981 FATAL exit: Thread [phy_init] on core 10 file /opt/nvidia/cuBB/cuPHY/src/cuphy/cuphy_pti.cpp line 46: additional info: CUDA Runtime Error: {}:{}:{}
nv-cubb           | 21:05:20.636389 WRN phy_init 0 [CTL.YAML] cuphycontroller config. yaml does not have gpu_init_comms_via_cpu key; defaulting to 0.
nv-cubb           | 21:05:20.636390 WRN phy_init 0 [CTL.YAML] cuphycontroller config. yaml does not have cpu_init_comms key; defaulting to 0.
nv-cubb           | 21:05:20.636496 WRN phy_init 0 [CTL.YAML] cuphycontroller config. yaml does not have pusch_workCancelMode key (experimental feature); defaulting to 0.
nv-cubb           | 21:05:20.636549 WRN phy_init 0 [CTL.YAML] cell_id 1 nic_index :0
nv-cubb           | 21:05:20.636645 WRN phy_init 0 [CTL.YAML] Num Slots: 8
nv-cubb           | 21:05:20.636646 WRN phy_init 0 [CTL.YAML] Enable UL cuPHY Graphs: 1
nv-cubb           | 21:05:20.636646 WRN phy_init 0 [CTL.YAML] Enable DL cuPHY Graphs: 1
nv-cubb           | 21:05:20.636646 WRN phy_init 0 [CTL.YAML] Accurate TX scheduling clock resolution (ns): 500
nv-cubb           | 21:05:20.636647 WRN phy_init 0 [CTL.YAML] DPDK core: 10
nv-cubb           | 21:05:20.636647 WRN phy_init 0 [CTL.YAML] Prometheus core: -1
nv-cubb           | 21:05:20.636647 WRN phy_init 0 [CTL.YAML] UL cores:
nv-cubb           | 21:05:20.636647 WRN phy_init 0 [CTL.YAML]   - 4
nv-cubb           | 21:05:20.636647 WRN phy_init 0 [CTL.YAML]   - 5
nv-cubb           | 21:05:20.636647 WRN phy_init 0 [CTL.YAML] DL cores:
nv-cubb           | 21:05:20.636647 WRN phy_init 0 [CTL.YAML]   - 6
nv-cubb           | 21:05:20.636647 WRN phy_init 0 [CTL.YAML]   - 7
nv-cubb           | 21:05:20.636647 WRN phy_init 0 [CTL.YAML]   - 8
nv-cubb           | 21:05:20.636648 WRN phy_init 0 [CTL.YAML] Debug worker: -1
nv-cubb           | 21:05:20.636648 WRN phy_init 0 [CTL.YAML] Data Lake core: -1
nv-cubb           | 21:05:20.636648 WRN phy_init 0 [CTL.YAML] SRS starting Section ID: 3072
nv-cubb           | 21:05:20.636648 WRN phy_init 0 [CTL.YAML] PRACH starting Section ID: 2048
nv-cubb           | 21:05:20.636648 WRN phy_init 0 [CTL.YAML] USE GREEN CONTEXTS: 0
nv-cubb           | 21:05:20.636648 WRN phy_init 0 [CTL.YAML] MPS SM PUSCH: 82
nv-cubb           | 21:05:20.636648 WRN phy_init 0 [CTL.YAML] MPS SM PUCCH: 20
nv-cubb           | 21:05:20.636648 WRN phy_init 0 [CTL.YAML] MPS SM PRACH: 2
nv-cubb           | 21:05:20.636648 WRN phy_init 0 [CTL.YAML] MPS SM UL ORDER: 20
nv-cubb           | 21:05:20.636648 WRN phy_init 0 [CTL.YAML] MPS SM PDSCH: 102
nv-cubb           | 21:05:20.636648 WRN phy_init 0 [CTL.YAML] MPS SM PDCCH: 10
nv-cubb           | 21:05:20.636649 WRN phy_init 0 [CTL.YAML] MPS SM PBCH: 2
nv-cubb           | 21:05:20.636649 WRN phy_init 0 [CTL.YAML] MPS SM GPU_COMMS: 16
nv-cubb           | 21:05:20.636649 WRN phy_init 0 [CTL.YAML] PDSCH fallback: 0
nv-cubb           | 21:05:20.636649 WRN phy_init 0 [CTL.YAML] Massive MIMO enable: 0
nv-cubb           | 21:05:20.636649 WRN phy_init 0 [CTL.YAML] Enable SRS : 1
nv-cubb           | 21:05:20.636649 WRN phy_init 0 [CTL.YAML] ul_order_timeout_gpu_log_enable: 0
nv-cubb           | 21:05:20.636649 WRN phy_init 0 [CTL.YAML] ue_mode: 0
nv-cubb           | 21:05:20.636649 WRN phy_init 0 [CTL.YAML] Aggr Obj Non-availability threshold: 5
nv-cubb           | 21:05:20.636650 WRN phy_init 0 [CTL.YAML] sendCPlane_timing_error_th_ns: 0
nv-cubb           | 21:05:20.636650 WRN phy_init 0 [CTL.YAML] pusch_aggr_per_ctx: 3
nv-cubb           | 21:05:20.636650 WRN phy_init 0 [CTL.YAML] prach_aggr_per_ctx: 2
nv-cubb           | 21:05:20.636650 WRN phy_init 0 [CTL.YAML] pucch_aggr_per_ctx: 4
nv-cubb           | 21:05:20.636650 WRN phy_init 0 [CTL.YAML] srs_aggr_per_ctx: 3
nv-cubb           | 21:05:20.636650 WRN phy_init 0 [CTL.YAML] max_harq_pools: 384
nv-cubb           | 21:05:20.636650 WRN phy_init 0 [CTL.YAML] ul_input_buffer_per_cell: 10
nv-cubb           | 21:05:20.636650 WRN phy_init 0 [CTL.YAML] ul_input_buffer_per_cell_srs: 6
nv-cubb           | 21:05:20.636651 WRN phy_init 0 [CTL.YAML] max_ru_unhealthy_ul_slots: 0
nv-cubb           | 21:05:20.636651 WRN phy_init 0 [CTL.YAML] srs_chest_algo_type: 0
nv-cubb           | 21:05:20.636651 WRN phy_init 0 [CTL.YAML] ul_order_timeout_gpu_log_enable: 0
nv-cubb           | 21:05:20.636651 WRN phy_init 0 [CTL.YAML] pusch_workCancelMode: 0
nv-cubb           | 21:05:20.636651 WRN phy_init 0 [CTL.YAML] GPU-initiated comms DL: 1
nv-cubb           | 21:05:20.636651 WRN phy_init 0 [CTL.YAML] GPU-initiated comms (via CPU): 0
nv-cubb           | 21:05:20.636651 WRN phy_init 0 [CTL.YAML] CPU-initiated comms : 0
nv-cubb           | 21:05:20.636651 WRN phy_init 0 [CTL.YAML] Cell group: 1
nv-cubb           | 21:05:20.636651 WRN phy_init 0 [CTL.YAML] Cell group num: 1
nv-cubb           | 21:05:20.636651 WRN phy_init 0 [CTL.YAML] puxchPolarDcdrListSz: 8
nv-cubb           | 21:05:20.636651 WRN phy_init 0 [CTL.YAML] split_ul_cuda_streams: 0
nv-cubb           | 21:05:20.636651 WRN phy_init 0 [CTL.YAML] serialize_pucch_pusch: 0
nv-cubb           | 21:05:20.636651 WRN phy_init 0 [CTL.YAML] Number of Cell Configs: 1
nv-cubb           | 21:05:20.636651 WRN phy_init 0 [CTL.YAML] L2Adapter config file: /opt/nvidia/cuBB/cuPHY-CP/cuphycontroller/config/l2_adapter_config_P5G_GH.yaml
nv-cubb           | 21:05:20.636651 WRN phy_init 0 [CTL.YAML] Cell name: O-RU 0
nv-cubb           | 21:05:20.636652 WRN phy_init 0 [CTL.YAML]   MU: 1
nv-cubb           | 21:05:20.636652 WRN phy_init 0 [CTL.YAML]   ID: 1
nv-cubb           | 21:05:20.636652 WRN phy_init 0 [CTL.YAML] Number of MPlane Configs: 1
nv-cubb           | 21:05:20.636652 WRN phy_init 0 [CTL.YAML]   Mplane ID: 1
nv-cubb           | 21:05:20.636652 WRN phy_init 0 [CTL.YAML]   VLAN ID: 2
nv-cubb           | 21:05:20.636652 WRN phy_init 0 [CTL.YAML]   Source Eth Address: 00:00:00:00:00:00
nv-cubb           | 21:05:20.636652 WRN phy_init 0 [CTL.YAML]   Destination Eth Address: 6c:ad:ad:00:0c:40
nv-cubb           | 21:05:20.636652 WRN phy_init 0 [CTL.YAML]   NIC port: 0000:01:00.0
nv-cubb           | 21:05:20.636653 WRN phy_init 0 [CTL.YAML]   RU Type: 1
nv-cubb           | 21:05:20.636653 WRN phy_init 0 [CTL.YAML]   U-plane TXQs: 1
nv-cubb           | 21:05:20.636653 WRN phy_init 0 [CTL.YAML]   DL compression method: 1
nv-cubb           | 21:05:20.636653 WRN phy_init 0 [CTL.YAML]   DL iq bit width: 9
nv-cubb           | 21:05:20.636653 WRN phy_init 0 [CTL.YAML]   UL compression method: 1
nv-cubb           | 21:05:20.636653 WRN phy_init 0 [CTL.YAML]   UL iq bit width: 9
nv-cubb           | 21:05:20.636653 WRN phy_init 0 [CTL.YAML]
nv-cubb           | 21:05:20.636653 WRN phy_init 0 [CTL.YAML]   Flow list SSB/PBCH:
nv-cubb           | 21:05:20.636653 WRN phy_init 0 [CTL.YAML]           0
nv-cubb           | 21:05:20.636654 WRN phy_init 0 [CTL.YAML]           1
nv-cubb           | 21:05:20.636654 WRN phy_init 0 [CTL.YAML]           2
nv-cubb           | 21:05:20.636654 WRN phy_init 0 [CTL.YAML]           3
nv-cubb           | 21:05:20.636654 WRN phy_init 0 [CTL.YAML]   Flow list PDCCH:
nv-cubb           | 21:05:20.636654 WRN phy_init 0 [CTL.YAML]           0
nv-cubb           | 21:05:20.636654 WRN phy_init 0 [CTL.YAML]           1
nv-cubb           | 21:05:20.636654 WRN phy_init 0 [CTL.YAML]           2
nv-cubb           | 21:05:20.636654 WRN phy_init 0 [CTL.YAML]           3
nv-cubb           | 21:05:20.636654 WRN phy_init 0 [CTL.YAML]   Flow list PDSCH:
nv-cubb           | 21:05:20.636654 WRN phy_init 0 [CTL.YAML]           0
nv-cubb           | 21:05:20.636655 WRN phy_init 0 [CTL.YAML]           1
nv-cubb           | 21:05:20.636655 WRN phy_init 0 [CTL.YAML]           2
nv-cubb           | 21:05:20.636655 WRN phy_init 0 [CTL.YAML]           3
nv-cubb           | 21:05:20.636655 WRN phy_init 0 [CTL.YAML]   Flow list CSIRS:
nv-cubb           | 21:05:20.636655 WRN phy_init 0 [CTL.YAML]           0
nv-cubb           | 21:05:20.636655 WRN phy_init 0 [CTL.YAML]           1
nv-cubb           | 21:05:20.636655 WRN phy_init 0 [CTL.YAML]           2
nv-cubb           | 21:05:20.636655 WRN phy_init 0 [CTL.YAML]           3
nv-cubb           | 21:05:20.636655 WRN phy_init 0 [CTL.YAML]   Flow list PUSCH:
nv-cubb           | 21:05:20.636655 WRN phy_init 0 [CTL.YAML]           0
nv-cubb           | 21:05:20.636655 WRN phy_init 0 [CTL.YAML]           1
nv-cubb           | 21:05:20.636655 WRN phy_init 0 [CTL.YAML]           2
nv-cubb           | 21:05:20.636655 WRN phy_init 0 [CTL.YAML]           3
nv-cubb           | 21:05:20.636655 WRN phy_init 0 [CTL.YAML]   Flow list PUCCH:
nv-cubb           | 21:05:20.636655 WRN phy_init 0 [CTL.YAML]           0
nv-cubb           | 21:05:20.636655 WRN phy_init 0 [CTL.YAML]           1
nv-cubb           | 21:05:20.636655 WRN phy_init 0 [CTL.YAML]           2
nv-cubb           | 21:05:20.636655 WRN phy_init 0 [CTL.YAML]           3
nv-cubb           | 21:05:20.636655 WRN phy_init 0 [CTL.YAML]   Flow list SRS:
nv-cubb           | 21:05:20.636655 WRN phy_init 0 [CTL.YAML]           8
nv-cubb           | 21:05:20.636656 WRN phy_init 0 [CTL.YAML]           9
nv-cubb           | 21:05:20.636656 WRN phy_init 0 [CTL.YAML]           10
nv-cubb           | 21:05:20.636656 WRN phy_init 0 [CTL.YAML]           11
nv-cubb           | 21:05:20.636656 WRN phy_init 0 [CTL.YAML]   Flow list PRACH:
nv-cubb           | 21:05:20.636656 WRN phy_init 0 [CTL.YAML]           4
nv-cubb           | 21:05:20.636656 WRN phy_init 0 [CTL.YAML]           5
nv-cubb           | 21:05:20.636656 WRN phy_init 0 [CTL.YAML]           6
nv-cubb           | 21:05:20.636656 WRN phy_init 0 [CTL.YAML]           7
nv-cubb           | 21:05:20.636656 WRN phy_init 0 [CTL.YAML]   PUSCH TV: /opt/nvidia/cuBB/testVectors/cuPhyChEstCoeffs.h5
nv-cubb           | 21:05:20.636656 WRN phy_init 0 [CTL.YAML]   SRS TV: /opt/nvidia/cuBB/testVectors/cuPhyChEstCoeffs.h5
nv-cubb           | 21:05:20.636656 WRN phy_init 0 [CTL.YAML]   Section_3 time offset: 58369
nv-cubb           | 21:05:20.636656 WRN phy_init 0 [CTL.YAML]   nMaxRxAnt: 4
nv-cubb           | 21:05:20.636656 WRN phy_init 0 [CTL.YAML]   PUSCH PRBs Stride: 273
nv-cubb           | 21:05:20.636656 WRN phy_init 0 [CTL.YAML]   PRACH PRBs Stride: 12
nv-cubb           | 21:05:20.636656 WRN phy_init 0 [CTL.YAML]   SRS PRBs Stride: 273
nv-cubb           | 21:05:20.636656 WRN phy_init 0 [CTL.YAML]   PUSCH nMaxPrb: 273
nv-cubb           | 21:05:20.636657 WRN phy_init 0 [CTL.YAML]   PUSCH nMaxRx: 4
nv-cubb           | 21:05:20.636657 WRN phy_init 0 [CTL.YAML]   UL Gain Calibration: 78.68
nv-cubb           | 21:05:20.636657 WRN phy_init 0 [CTL.YAML]   Lower guard bw: 845
nv-cubb           | 21:05:20.649966 ERR phy_init 0 [AERIAL_INTERNAL_EVENT] [CUPHY.PTI] CUDA Runtime Error: /opt/nvidia/cuBB/cuPHY/src/cuphy/cuphy_pti.cpp:46:MPS client failed to connect to the MPS control daemon or the MPS server
nv-cubb           | 21:05:20.649993 ERR phy_init 0 [AERIAL_SYSTEM_API_EVENT] [NVLOG.EXIT_HANDLER] FATAL exit: Thread [phy_init] on core 10 file /opt/nvidia/cuBB/cuPHY/src/cuphy/cuphy_pti.cpp line 46: additional info: CUDA Runtime Error: {}:{}:{}
nv-cubb           | Stack trace (most recent call last):
nv-cubb           | #7    Object "/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1", at 0xffffffffffffffff, in
nv-cubb           | #6    Object "/opt/nvidia/cuBB/build/cuPHY-CP/cuphycontroller/examples/cuphycontroller_scf", at 0x41276f, in _start
nv-cubb           | #5    Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0xe972995674cb, in __libc_start_main
nv-cubb           | #4    Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0xe972995673fb, in
nv-cubb           | #3    Object "/opt/nvidia/cuBB/build/cuPHY-CP/cuphycontroller/examples/cuphycontroller_scf", at 0x40f873, in main
nv-cubb           | #2    Object "/opt/nvidia/cuBB/build/cuPHY/src/cuphy/libcuphy.so", at 0xe972aa9812ab, in cuphy_pti_init
nv-cubb           | #1    Object "/opt/nvidia/cuBB/build/cuPHY/nvlog/libnvlog.so", at 0xe972999ccabb, in exit_handler::test_trigger_exit(char const*, int, char const*)
nv-cubb           | #0    Source "/opt/nvidia/cuBB/cuPHY-CP/cuphydriver/src/common/cuphydriver_api.cpp", line 2773, in l1_exit_handler
nv-cubb           |        2770:     //PhyDriver initialization failure
nv-cubb           |        2771:     if(l1_getPhydriverHandle() == nullptr)
nv-cubb           |        2772:     {
nv-cubb           |       >2773:         AERIAL_PRINT_BACKTRACE(32ULL);
nv-cubb           |        2774:         exit(EXIT_FAILURE); //Exit immediately
nv-cubb           |        2775:     }
nv-cubb           | 21:05:20.750061 WRN phy_init 0 [DRV.API] Trigging L1 exit handler
nv-cubb           | [C]: Usage: ./build/cuPHY-CP/gt_common_libs/nvIPC/tests/pcap/pcap_collect <name> <destination path>
nv-cubb           |
nv-cubb           | [C]: Current run: ./build/cuPHY-CP/gt_common_libs/nvIPC/tests/pcap/pcap_collect name=nvipc dest_path=/var/log/aerial
nv-cubb           |
nv-cubb           | [I]: shmlogger_collect: save /var/log/aerial/nvipc_pcap and /dev/shm/nvipc_pcap logs to /var/log/aerial/nvipc_pcap
nv-cubb           | [E][AERIAL_SYSTEM_API_EVENT]: ipc_shm_open: shm_open nvipc_pcap failed error -1
nv-cubb           | [E][AERIAL_NVIPC_API_EVENT]: nv_ipc_shm_open: primary=0 name=nvipc_pcap size=8388680 Failed
nv-cubb           | [E][AERIAL_SYSTEM_API_EVENT]: ipc_shm_close: close shm_fd failed
nv-cubb           | [E][AERIAL_NVIPC_API_EVENT]: shmlogger_open: nv_ipc_shm_open failed
nv-cubb           | [I]: shmlogger_collect: no /dev/shm/nvipc_pcap, logger may have been closed normally
Gracefully stopping... (press Ctrl+C again to force)
dependency failed to start: container nv-cubb exited (0)

Please provide any feedback.

Hi @subhams,

Please initiate the MPS service before starting cuphycontroller. You can find the instructions here.

This should only be run on the cuphycontroller terminal and not for test_mac.

Thanks.

The error repeats after following these steps:

# Export variables
export CUDA_DEVICE_MAX_CONNECTIONS=8
export CUDA_MPS_PIPE_DIRECTORY=/var
export CUDA_MPS_LOG_DIRECTORY=/var

# Stop existing MPS
sudo -E echo quit | sudo -E nvidia-cuda-mps-control

# Start MPS
sudo -E nvidia-cuda-mps-control -d
sudo -E echo start_server -uid 0 | sudo -E nvidia-cuda-mps-control

Here is the error log:

aerial@mit-b32-gnb3:~/openairinterface5g/ci-scripts/yaml_files/sa_gh_gnb$ docker compose -f docker-compose-gnb.yaml up
WARN[0000] Found orphan containers ([oai-upf oai-smf oai-amf oai-ausf oai-udm oai-udr oai-ext-dn oai-nrf mysql asterisk-ims]) for this project. If you removed or renamed this service in your compose file, you can run this command with the --remove-orphans flag to clean it up.
[+] Running 1/0
 ✔ Container nv-cubb  Created                                                                                                                           0.0s
Attaching to c_oai-gnb-aerial, nv-cubb
nv-cubb           |
nv-cubb           | ==========
nv-cubb           | == CUDA ==
nv-cubb           | ==========
nv-cubb           |
nv-cubb           | CUDA Version 12.6.2
nv-cubb           |
nv-cubb           | Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
nv-cubb           |
nv-cubb           | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
nv-cubb           | By pulling and using the container, you accept the terms and conditions of this license:
nv-cubb           | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
nv-cubb           |
nv-cubb           | A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
nv-cubb           |
nv-cubb           | Cannot find MPS control daemon process
nv-cubb           | Supermicro-G1SMH-G
nv-cubb           | Started cuphycontroller on CPU core 69
nv-cubb           | AERIAL_LOG_PATH set to /var/log/aerial
nv-cubb           | Log file set to /var/log/aerial/phy.log
nv-cubb           | Aerial metrics backend address: 127.0.0.1:8081
nv-cubb           | 23:12:56.432324 WRN phy_init 0 [CTL.SCF] Config file: /opt/nvidia/cuBB/cuPHY-CP/cuphycontroller/config/cuphycontroller_P5G_FXN_GH.yaml
nv-cubb           | 23:12:56.432725 WRN phy_init 0 [CTL.SCF] low_priority_core=10
nv-cubb           | 23:12:56.432739 WRN phy_init 0 [APP.CONFIG] Current TAI offset: 0s
nv-cubb           | 23:12:56.432954 WRN phy_init 0 [NVLOG.CPP] Using /opt/nvidia/cuBB/cuPHY/nvlog/config/nvlog_config.yaml for nvlog configuration
nv-cubb           | 23:12:56.432967 WRN phy_init 0 [NVLOG.CPP] Output log file path /var/log/aerial/phy.log
nv-cubb           | YAML invalid key: enable_l1_param_sanity_check Using default value of 0 to YAML_PARAM_ENABLE_L1_PARAM_SANITY_CHECK
nv-cubb           | YAML invalid key: pmu_metrics Using default value of 0 to YAML_PARAM_PMU_METRICS
nv-cubb           | YAML invalid key: ul_order_max_rx_pkts Using default value of 512 to UL_ORDER_MAX_RX_PKTS
nv-cubb           | YAML invalid key: ul_order_rx_pkts_timeout_ns Using default value of 100us to YAML_PARAM_UL_ORDER_RX_PKTS_TIMEOUT_NS
nv-cubb           | 23:12:56.457426 FATAL exit: Thread [phy_init] on core 10 file /opt/nvidia/cuBB/cuPHY/src/cuphy/cuphy_pti.cpp line 46: additional info: CUDA Runtime Error: {}:{}:{}
nv-cubb           | 23:12:56.444264 WRN phy_init 0 [CTL.YAML] cuphycontroller config. yaml does not have gpu_init_comms_via_cpu key; defaulting to 0.
nv-cubb           | 23:12:56.444265 WRN phy_init 0 [CTL.YAML] cuphycontroller config. yaml does not have cpu_init_comms key; defaulting to 0.
nv-cubb           | 23:12:56.444368 WRN phy_init 0 [CTL.YAML] cuphycontroller config. yaml does not have pusch_workCancelMode key (experimental feature); defaulting to 0.
nv-cubb           | 23:12:56.444415 WRN phy_init 0 [CTL.YAML] cell_id 1 nic_index :0
nv-cubb           | 23:12:56.444507 WRN phy_init 0 [CTL.YAML] Num Slots: 8
nv-cubb           | 23:12:56.444507 WRN phy_init 0 [CTL.YAML] Enable UL cuPHY Graphs: 1
nv-cubb           | 23:12:56.444507 WRN phy_init 0 [CTL.YAML] Enable DL cuPHY Graphs: 1
nv-cubb           | 23:12:56.444507 WRN phy_init 0 [CTL.YAML] Accurate TX scheduling clock resolution (ns): 500
nv-cubb           | 23:12:56.444508 WRN phy_init 0 [CTL.YAML] DPDK core: 10
nv-cubb           | 23:12:56.444508 WRN phy_init 0 [CTL.YAML] Prometheus core: -1
nv-cubb           | 23:12:56.444508 WRN phy_init 0 [CTL.YAML] UL cores:
nv-cubb           | 23:12:56.444508 WRN phy_init 0 [CTL.YAML]   - 4
nv-cubb           | 23:12:56.444508 WRN phy_init 0 [CTL.YAML]   - 5
nv-cubb           | 23:12:56.444508 WRN phy_init 0 [CTL.YAML] DL cores:
nv-cubb           | 23:12:56.444509 WRN phy_init 0 [CTL.YAML]   - 6
nv-cubb           | 23:12:56.444509 WRN phy_init 0 [CTL.YAML]   - 7
nv-cubb           | 23:12:56.444509 WRN phy_init 0 [CTL.YAML]   - 8
nv-cubb           | 23:12:56.444509 WRN phy_init 0 [CTL.YAML] Debug worker: -1
nv-cubb           | 23:12:56.444509 WRN phy_init 0 [CTL.YAML] Data Lake core: -1
nv-cubb           | 23:12:56.444509 WRN phy_init 0 [CTL.YAML] SRS starting Section ID: 3072
nv-cubb           | 23:12:56.444510 WRN phy_init 0 [CTL.YAML] PRACH starting Section ID: 2048
nv-cubb           | 23:12:56.444510 WRN phy_init 0 [CTL.YAML] USE GREEN CONTEXTS: 0
nv-cubb           | 23:12:56.444510 WRN phy_init 0 [CTL.YAML] MPS SM PUSCH: 82
nv-cubb           | 23:12:56.444510 WRN phy_init 0 [CTL.YAML] MPS SM PUCCH: 20
nv-cubb           | 23:12:56.444510 WRN phy_init 0 [CTL.YAML] MPS SM PRACH: 2
nv-cubb           | 23:12:56.444510 WRN phy_init 0 [CTL.YAML] MPS SM UL ORDER: 20
nv-cubb           | 23:12:56.444510 WRN phy_init 0 [CTL.YAML] MPS SM PDSCH: 102
nv-cubb           | 23:12:56.444510 WRN phy_init 0 [CTL.YAML] MPS SM PDCCH: 10
nv-cubb           | 23:12:56.444510 WRN phy_init 0 [CTL.YAML] MPS SM PBCH: 2
nv-cubb           | 23:12:56.444510 WRN phy_init 0 [CTL.YAML] MPS SM GPU_COMMS: 16
nv-cubb           | 23:12:56.444510 WRN phy_init 0 [CTL.YAML] PDSCH fallback: 0
nv-cubb           | 23:12:56.444511 WRN phy_init 0 [CTL.YAML] Massive MIMO enable: 0
nv-cubb           | 23:12:56.444511 WRN phy_init 0 [CTL.YAML] Enable SRS : 1
nv-cubb           | 23:12:56.444511 WRN phy_init 0 [CTL.YAML] ul_order_timeout_gpu_log_enable: 0
nv-cubb           | 23:12:56.444511 WRN phy_init 0 [CTL.YAML] ue_mode: 0
nv-cubb           | 23:12:56.444511 WRN phy_init 0 [CTL.YAML] Aggr Obj Non-availability threshold: 5
nv-cubb           | 23:12:56.444512 WRN phy_init 0 [CTL.YAML] sendCPlane_timing_error_th_ns: 0
nv-cubb           | 23:12:56.444512 WRN phy_init 0 [CTL.YAML] pusch_aggr_per_ctx: 3
nv-cubb           | 23:12:56.444512 WRN phy_init 0 [CTL.YAML] prach_aggr_per_ctx: 2
nv-cubb           | 23:12:56.444512 WRN phy_init 0 [CTL.YAML] pucch_aggr_per_ctx: 4
nv-cubb           | 23:12:56.444512 WRN phy_init 0 [CTL.YAML] srs_aggr_per_ctx: 3
nv-cubb           | 23:12:56.444512 WRN phy_init 0 [CTL.YAML] max_harq_pools: 384
nv-cubb           | 23:12:56.444512 WRN phy_init 0 [CTL.YAML] ul_input_buffer_per_cell: 10
nv-cubb           | 23:12:56.444512 WRN phy_init 0 [CTL.YAML] ul_input_buffer_per_cell_srs: 6
nv-cubb           | 23:12:56.444513 WRN phy_init 0 [CTL.YAML] max_ru_unhealthy_ul_slots: 0
nv-cubb           | 23:12:56.444513 WRN phy_init 0 [CTL.YAML] srs_chest_algo_type: 0
nv-cubb           | 23:12:56.444513 WRN phy_init 0 [CTL.YAML] ul_order_timeout_gpu_log_enable: 0
nv-cubb           | 23:12:56.444513 WRN phy_init 0 [CTL.YAML] pusch_workCancelMode: 0
nv-cubb           | 23:12:56.444513 WRN phy_init 0 [CTL.YAML] GPU-initiated comms DL: 1
nv-cubb           | 23:12:56.444513 WRN phy_init 0 [CTL.YAML] GPU-initiated comms (via CPU): 0
nv-cubb           | 23:12:56.444513 WRN phy_init 0 [CTL.YAML] CPU-initiated comms : 0
nv-cubb           | 23:12:56.444513 WRN phy_init 0 [CTL.YAML] Cell group: 1
nv-cubb           | 23:12:56.444513 WRN phy_init 0 [CTL.YAML] Cell group num: 1
nv-cubb           | 23:12:56.444513 WRN phy_init 0 [CTL.YAML] puxchPolarDcdrListSz: 8
nv-cubb           | 23:12:56.444514 WRN phy_init 0 [CTL.YAML] split_ul_cuda_streams: 0
nv-cubb           | 23:12:56.444514 WRN phy_init 0 [CTL.YAML] serialize_pucch_pusch: 0
nv-cubb           | 23:12:56.444514 WRN phy_init 0 [CTL.YAML] Number of Cell Configs: 1
nv-cubb           | 23:12:56.444514 WRN phy_init 0 [CTL.YAML] L2Adapter config file: /opt/nvidia/cuBB/cuPHY-CP/cuphycontroller/config/l2_adapter_config_P5G_GH.yaml
nv-cubb           | 23:12:56.444514 WRN phy_init 0 [CTL.YAML] Cell name: O-RU 0
nv-cubb           | 23:12:56.444514 WRN phy_init 0 [CTL.YAML]   MU: 1
nv-cubb           | 23:12:56.444514 WRN phy_init 0 [CTL.YAML]   ID: 1
nv-cubb           | 23:12:56.444514 WRN phy_init 0 [CTL.YAML] Number of MPlane Configs: 1
nv-cubb           | 23:12:56.444515 WRN phy_init 0 [CTL.YAML]   Mplane ID: 1
nv-cubb           | 23:12:56.444515 WRN phy_init 0 [CTL.YAML]   VLAN ID: 2
nv-cubb           | 23:12:56.444515 WRN phy_init 0 [CTL.YAML]   Source Eth Address: 00:00:00:00:00:00
nv-cubb           | 23:12:56.444515 WRN phy_init 0 [CTL.YAML]   Destination Eth Address: 6c:ad:ad:00:0c:40
nv-cubb           | 23:12:56.444515 WRN phy_init 0 [CTL.YAML]   NIC port: 0000:01:00.0
nv-cubb           | 23:12:56.444515 WRN phy_init 0 [CTL.YAML]   RU Type: 1
nv-cubb           | 23:12:56.444516 WRN phy_init 0 [CTL.YAML]   U-plane TXQs: 1
nv-cubb           | 23:12:56.444516 WRN phy_init 0 [CTL.YAML]   DL compression method: 1
nv-cubb           | 23:12:56.444516 WRN phy_init 0 [CTL.YAML]   DL iq bit width: 9
nv-cubb           | 23:12:56.444516 WRN phy_init 0 [CTL.YAML]   UL compression method: 1
nv-cubb           | 23:12:56.444516 WRN phy_init 0 [CTL.YAML]   UL iq bit width: 9
nv-cubb           | 23:12:56.444516 WRN phy_init 0 [CTL.YAML]
nv-cubb           | 23:12:56.444516 WRN phy_init 0 [CTL.YAML]   Flow list SSB/PBCH:
nv-cubb           | 23:12:56.444517 WRN phy_init 0 [CTL.YAML]           0
nv-cubb           | 23:12:56.444517 WRN phy_init 0 [CTL.YAML]           1
nv-cubb           | 23:12:56.444517 WRN phy_init 0 [CTL.YAML]           2
nv-cubb           | 23:12:56.444517 WRN phy_init 0 [CTL.YAML]           3
nv-cubb           | 23:12:56.444517 WRN phy_init 0 [CTL.YAML]   Flow list PDCCH:
nv-cubb           | 23:12:56.444517 WRN phy_init 0 [CTL.YAML]           0
nv-cubb           | 23:12:56.444517 WRN phy_init 0 [CTL.YAML]           1
nv-cubb           | 23:12:56.444517 WRN phy_init 0 [CTL.YAML]           2
nv-cubb           | 23:12:56.444517 WRN phy_init 0 [CTL.YAML]           3
nv-cubb           | 23:12:56.444517 WRN phy_init 0 [CTL.YAML]   Flow list PDSCH:
nv-cubb           | 23:12:56.444518 WRN phy_init 0 [CTL.YAML]           0
nv-cubb           | 23:12:56.444518 WRN phy_init 0 [CTL.YAML]           1
nv-cubb           | 23:12:56.444518 WRN phy_init 0 [CTL.YAML]           2
nv-cubb           | 23:12:56.444518 WRN phy_init 0 [CTL.YAML]           3
nv-cubb           | 23:12:56.444518 WRN phy_init 0 [CTL.YAML]   Flow list CSIRS:
nv-cubb           | 23:12:56.444518 WRN phy_init 0 [CTL.YAML]           0
nv-cubb           | 23:12:56.444518 WRN phy_init 0 [CTL.YAML]           1
nv-cubb           | 23:12:56.444518 WRN phy_init 0 [CTL.YAML]           2
nv-cubb           | 23:12:56.444518 WRN phy_init 0 [CTL.YAML]           3
nv-cubb           | 23:12:56.444518 WRN phy_init 0 [CTL.YAML]   Flow list PUSCH:
nv-cubb           | 23:12:56.444519 WRN phy_init 0 [CTL.YAML]           0
nv-cubb           | 23:12:56.444519 WRN phy_init 0 [CTL.YAML]           1
nv-cubb           | 23:12:56.444519 WRN phy_init 0 [CTL.YAML]           2
nv-cubb           | 23:12:56.444519 WRN phy_init 0 [CTL.YAML]           3
nv-cubb           | 23:12:56.444519 WRN phy_init 0 [CTL.YAML]   Flow list PUCCH:
nv-cubb           | 23:12:56.444519 WRN phy_init 0 [CTL.YAML]           0
nv-cubb           | 23:12:56.444519 WRN phy_init 0 [CTL.YAML]           1
nv-cubb           | 23:12:56.444519 WRN phy_init 0 [CTL.YAML]           2
nv-cubb           | 23:12:56.444519 WRN phy_init 0 [CTL.YAML]           3
nv-cubb           | 23:12:56.444519 WRN phy_init 0 [CTL.YAML]   Flow list SRS:
nv-cubb           | 23:12:56.444519 WRN phy_init 0 [CTL.YAML]           8
nv-cubb           | 23:12:56.444519 WRN phy_init 0 [CTL.YAML]           9
nv-cubb           | 23:12:56.444519 WRN phy_init 0 [CTL.YAML]           10
nv-cubb           | 23:12:56.444520 WRN phy_init 0 [CTL.YAML]           11
nv-cubb           | 23:12:56.444520 WRN phy_init 0 [CTL.YAML]   Flow list PRACH:
nv-cubb           | 23:12:56.444520 WRN phy_init 0 [CTL.YAML]           4
nv-cubb           | 23:12:56.444520 WRN phy_init 0 [CTL.YAML]           5
nv-cubb           | 23:12:56.444520 WRN phy_init 0 [CTL.YAML]           6
nv-cubb           | 23:12:56.444520 WRN phy_init 0 [CTL.YAML]           7
nv-cubb           | 23:12:56.444520 WRN phy_init 0 [CTL.YAML]   PUSCH TV: /opt/nvidia/cuBB/testVectors/cuPhyChEstCoeffs.h5
nv-cubb           | 23:12:56.444520 WRN phy_init 0 [CTL.YAML]   SRS TV: /opt/nvidia/cuBB/testVectors/cuPhyChEstCoeffs.h5
nv-cubb           | 23:12:56.444520 WRN phy_init 0 [CTL.YAML]   Section_3 time offset: 58369
nv-cubb           | 23:12:56.444520 WRN phy_init 0 [CTL.YAML]   nMaxRxAnt: 4
nv-cubb           | 23:12:56.444521 WRN phy_init 0 [CTL.YAML]   PUSCH PRBs Stride: 273
nv-cubb           | 23:12:56.444521 WRN phy_init 0 [CTL.YAML]   PRACH PRBs Stride: 12
nv-cubb           | 23:12:56.444521 WRN phy_init 0 [CTL.YAML]   SRS PRBs Stride: 273
nv-cubb           | 23:12:56.444521 WRN phy_init 0 [CTL.YAML]   PUSCH nMaxPrb: 273
nv-cubb           | 23:12:56.444521 WRN phy_init 0 [CTL.YAML]   PUSCH nMaxRx: 4
nv-cubb           | 23:12:56.444521 WRN phy_init 0 [CTL.YAML]   UL Gain Calibration: 78.68
nv-cubb           | 23:12:56.444521 WRN phy_init 0 [CTL.YAML]   Lower guard bw: 845
nv-cubb           | 23:12:56.457410 ERR phy_init 0 [AERIAL_INTERNAL_EVENT] [CUPHY.PTI] CUDA Runtime Error: /opt/nvidia/cuBB/cuPHY/src/cuphy/cuphy_pti.cpp:46:MPS client failed to connect to the MPS control daemon or the MPS server
nv-cubb           | 23:12:56.457437 ERR phy_init 0 [AERIAL_SYSTEM_API_EVENT] [NVLOG.EXIT_HANDLER] FATAL exit: Thread [phy_init] on core 10 file /opt/nvidia/cuBB/cuPHY/src/cuphy/cuphy_pti.cpp line 46: additional info: CUDA Runtime Error: {}:{}:{}
nv-cubb           | Stack trace (most recent call last):
nv-cubb           | #7    Object "/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1", at 0xffffffffffffffff, in
nv-cubb           | #6    Object "/opt/nvidia/cuBB/build/cuPHY-CP/cuphycontroller/examples/cuphycontroller_scf", at 0x41276f, in _start
nv-cubb           | #5    Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0xeab9490474cb, in __libc_start_main
nv-cubb           | #4    Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0xeab9490473fb, in
nv-cubb           | #3    Object "/opt/nvidia/cuBB/build/cuPHY-CP/cuphycontroller/examples/cuphycontroller_scf", at 0x40f873, in main
nv-cubb           | #2    Object "/opt/nvidia/cuBB/build/cuPHY/src/cuphy/libcuphy.so", at 0xeab95a4612ab, in cuphy_pti_init
nv-cubb           | #1    Object "/opt/nvidia/cuBB/build/cuPHY/nvlog/libnvlog.so", at 0xeab9494acabb, in exit_handler::test_trigger_exit(char const*, int, char const*)
nv-cubb           | #0    Source "/opt/nvidia/cuBB/cuPHY-CP/cuphydriver/src/common/cuphydriver_api.cpp", line 2773, in l1_exit_handler
nv-cubb           |        2770:     //PhyDriver initialization failure
nv-cubb           |        2771:     if(l1_getPhydriverHandle() == nullptr)
nv-cubb           |        2772:     {
nv-cubb           |       >2773:         AERIAL_PRINT_BACKTRACE(32ULL);
nv-cubb           |        2774:         exit(EXIT_FAILURE); //Exit immediately
nv-cubb           |        2775:     }
nv-cubb           | 23:12:56.557507 WRN phy_init 0 [DRV.API] Trigging L1 exit handler
nv-cubb           | [C]: Usage: ./build/cuPHY-CP/gt_common_libs/nvIPC/tests/pcap/pcap_collect <name> <destination path>
nv-cubb           |
nv-cubb           | [C]: Current run: ./build/cuPHY-CP/gt_common_libs/nvIPC/tests/pcap/pcap_collect name=nvipc dest_path=/var/log/aerial
nv-cubb           |
nv-cubb           | [I]: shmlogger_collect: save /var/log/aerial/nvipc_pcap and /dev/shm/nvipc_pcap logs to /var/log/aerial/nvipc_pcap
nv-cubb           | [E][AERIAL_SYSTEM_API_EVENT]: ipc_shm_open: shm_open nvipc_pcap failed error -1
nv-cubb           | [E][AERIAL_NVIPC_API_EVENT]: nv_ipc_shm_open: primary=0 name=nvipc_pcap size=8388680 Failed
nv-cubb           | [E][AERIAL_SYSTEM_API_EVENT]: ipc_shm_close: close shm_fd failed
nv-cubb           | [E][AERIAL_NVIPC_API_EVENT]: shmlogger_open: nv_ipc_shm_open failed
nv-cubb           | [I]: shmlogger_collect: no /dev/shm/nvipc_pcap, logger may have been closed normally
nv-cubb exited with code 0
Gracefully stopping... (press Ctrl+C again to force)
dependency failed to start: container nv-cubb exited (0)

There is still an issue initiating the MPS service.

After starting the MPS service, can you check whether it is running?

ps -ef | grep nvidia-cuda-mps-control

Can you also check whether an MPS server is running?

echo get_server_list | nvidia-cuda-mps-control

When I ran the commands above, I got the following output:

aerial@mit-b32-gnb3:~$ ps -ef | grep nvidia-cuda-mps-control
root      170029       1  0 Apr04 ?        00:00:00 nvidia-cuda-mps-control -d
aerial   2334725 2332746  0 14:02 pts/2    00:00:00 grep --color=auto nvidia-cuda-mps-control
aerial@mit-b32-gnb3:~$ echo get_server_list | nvidia-cuda-mps-control
Cannot find MPS control daemon process
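Note that the daemon was started as root with CUDA_MPS_PIPE_DIRECTORY=/var, while the get_server_list query above was run from a plain shell; without that variable set, nvidia-cuda-mps-control looks in the default pipe directory (/tmp/nvidia-mps) and cannot find the daemon. A hedged re-check, assuming the pipe directory from the earlier steps is still in use:

export CUDA_MPS_PIPE_DIRECTORY=/var
echo get_server_list | sudo -E nvidia-cuda-mps-control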

@subhams Can you share the script that you use to configure and start the cuBB container?

Here is the script:

aerial@mit-b32-gnb3:~/openairinterface5g/ci-scripts/yaml_files/sa_gh_gnb$ cat docker-compose-gnb.yaml
services:
  nv-cubb:
    container_name: nv-cubb
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities:
                - gpu
    network_mode: host
    shm_size: 4096m
    privileged: true
    stdin_open: true
    tty: true
    volumes:
      - /lib/modules:/lib/modules
      - /dev/hugepages:/dev/hugepages
      - /usr/src:/usr/src
      - ./aerial_l1_entrypoint.sh:/opt/nvidia/cuBB/aerial_l1_entrypoint.sh
      - /var/log/aerial:/var/log/aerial
      - ../../../cmake_targets/share:/opt/cuBB/share
    userns_mode: host
    ipc: "shareable"
    image: cubb-build:24-3
    environment:
      - cuBB_SDK=/opt/nvidia/cuBB
    command: bash -c "sudo rm -rf /tmp/phy.log && sudo chmod +x /opt/nvidia/cuBB/aerial_l1_entrypoint.sh && /opt/nvidia/cuBB/aerial_l1_entrypoint.sh"
    healthcheck:
      test: ["CMD-SHELL",'grep -q "L1 is ready!" /tmp/phy.log && echo 0 || echo 1']
      interval: 20s
      timeout: 5s
      retries: 5
  c_oai-gnb-aerial:
    image: oai-gnb-aerial:latest
    depends_on:
      nv-cubb:
        condition: service_healthy
    privileged: true
    ipc: "container:nv-cubb"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities:
                - gpu
    network_mode: host
    shm_size: 4096m
    stdin_open: true
    tty: true
    volumes:
      - /lib/modules:/lib/modules
      - /dev/hugepages:/dev/hugepages
      - /usr/src:/usr/src
      - ~/share:/opt/nvidia/cuBB/share
      - /var/log/aerial:/var/log/aerial
      # Use this for CBRS radios
      #- ../../../targets/PROJECTS/GENERIC-NR-5GC/CONF/gnb-vnf.sa.cbrs.aerial.conf:/opt/oai-gnb/etc/gnb.conf
      - ../../../targets/PROJECTS/GENERIC-NR-5GC/CONF/gnb-vnf.sa.band78.273prb.aerial.conf:/opt/oai-gnb/etc/gnb.conf
    container_name: c_oai-gnb-aerial
    command: bash -c "chrt -r 1 taskset -c 11-16 chrt -f 95 /opt/oai-gnb/bin/nr-softmodem -O /opt/oai-gnb/etc/gnb.conf | tee /var/log/aerial/oai.log"
      #cpuset: 11-18
    healthcheck:
      test: /bin/bash -c "ps aux | grep -v grep | grep -c softmodem"
      interval: 10s
      timeout: 5s
      retries: 5
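
One thing worth checking (an assumption, not a confirmed root cause): the nv-cubb service sets no MPS environment variables and does not mount the host's MPS pipe directory, so a daemon started on the host with CUDA_MPS_PIPE_DIRECTORY=/var would be invisible inside the container. A minimal sketch of the additions, assuming the host daemon keeps /var as its pipe and log directory:

services:
  nv-cubb:
    environment:
      - cuBB_SDK=/opt/nvidia/cuBB
      - CUDA_MPS_PIPE_DIRECTORY=/var
      - CUDA_MPS_LOG_DIRECTORY=/var
    volumes:
      # existing mounts unchanged; additionally share the host pipe directory
      - /var:/var

Mounting all of /var is heavy-handed; restarting the host daemon with a dedicated directory (for example /var/mps) and mounting only that directory would be cleaner.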

@subhams, did you make any changes to the aerial_l1_entrypoint.sh script here?

Which version of ARC are you using?

Thank you.

We updated the entrypoint script with the new interface ID and MAC address of the RU.

We are using 24-3.

Thank you.

@subhams, can you capture the outputs of the following commands?
nvidia-smi
lsmod | grep -i nvidia
ps -ef | grep -i mps

Here are the outputs:

aerial@mit-b32-gnb3:~$ nvidia-smi
Mon Apr  7 18:23:43 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GH200 480GB             On  |   00000009:01:00.0 Off |                    0 |
| N/A   33C    P0            131W /  900W |     113MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    170458      C   nvidia-cuda-mps-server                        104MiB |
+-----------------------------------------------------------------------------------------+
aerial@mit-b32-gnb3:~$ lsmod | grep -i nvidia
nvidia_uvm           4784128  2
nvidia_drm            262144  0
nvidia_modeset       1835008  1 nvidia_drm
nvidia               9043968  74 nvidia_uvm,gdrdrv,nvidia_modeset
video                 262144  1 nvidia_modeset
ecc                   196608  1 nvidia
drm_kms_helper        327680  4 ast,nvidia_drm
drm                   983040  6 drm_kms_helper,ast,drm_shmem_helper,nvidia,nvidia_drm
aerial@mit-b32-gnb3:~$ ps -ef | grep -i mps
root      170029       1  0 Apr04 ?        00:00:00 nvidia-cuda-mps-control -d
root      170458  170029  0 Apr04 ?        00:00:27 nvidia-cuda-mps-server
aerial   1509018 1509000  0 18:25 pts/2    00:00:00 grep --color=auto -i mps
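
Both the MPS control daemon and an MPS server are running on the host as root. One hedged check, assuming the daemon was started with CUDA_MPS_PIPE_DIRECTORY=/var as in the steps above: the control FIFO should then exist at /var/control (with the default configuration it would be under /tmp/nvidia-mps):

ls -l /var/control /tmp/nvidia-mps 2>/dev/null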

@subhams, would you please upgrade the GPU driver to 560.35.03 and recheck? The nvidia-smi output above shows driver 555.42.02 (CUDA 12.5), while the cuBB container is built against CUDA 12.6.2.

1) Unload the current driver modules:

$ for m in $(lsmod | awk '/^[^[:space:]]*(nvidia|nv_|gdrdrv)/ {print $1}'); do echo Unload $m...; sudo rmmod $m; done

Remove the driver if it was previously installed by the runfile installer:

$ sudo /usr/bin/nvidia-uninstall
2) Purge the CUDA and driver packages:
sudo apt-get --purge remove "*cublas*" "*cufft*" "*curand*" "*cusolver*" "*cusparse*" "*npp*" "*nvjpeg*" "cuda*" "nsight*" "*nvidia*"
sudo apt-get autoremove
3) Power-cycle the server.
4) Follow the steps here to install GPU driver 560.35.03:
https://docs.nvidia.com/aerial/cuda-accelerated-ran/aerial_cubb/cubb_install/installing_tools_gh.html#install-cuda-driver
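
After reinstalling, a quick sanity check that the new driver is active (a standard nvidia-smi query; expect 560.35.03):

nvidia-smi --query-gpu=driver_version --format=csv,noheader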



This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.