Request for Guidance on Configuring MIG for ARC-OTA: Separating RAN and AI Workloads

Hello everyone,

I am currently researching the development and deployment of AI and RAN (Radio Access Network) applications, with the goal of separating AI applications from RAN workloads to optimize performance.

Since AI applications often consume a significant portion of host-to-device (H2D) bandwidth, we plan to leverage NVIDIA’s MIG (Multi-Instance GPU) technology to allocate the RAN and AI workloads to different GPU instances, enabling resource isolation and efficient utilization.
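For reference, the partitioning step itself looks roughly like the sketch below. The profile IDs vary by GPU and driver version, so the IDs here (9 and 5, typical for 80 GB A100-class parts) are only an assumption on my side; always confirm them in the `-lgip` output first:

```shell
# Enable MIG mode on GPU 0 (requires the GPU to be idle / reset)
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles this GPU/driver supports
sudo nvidia-smi mig -lgip

# Create a 3g.40gb and a 4g.40gb GPU instance along with their default
# compute instances (-C). Profile IDs 9 and 5 are an example only --
# check them against the -lgip output on your system.
sudo nvidia-smi mig -cgi 9,5 -C

# Verify the resulting MIG devices and note their UUIDs
nvidia-smi -L
```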

However, during practical implementation, we encountered some issues:

After configuring according to the official NVIDIA MIG guide, we attempted to launch Docker containers related to ARC-OTA but found that the containers failed to start.

The ARC-OTA configuration file is attached.

The logs do not provide clear error messages, but it appears to be related to MIG instance binding issues.

Here are some questions I would like to ask:

  • Are there best practices for MIG configuration in ARC-OTA scenarios?
  • Are there specific guidelines for allocating RAN and AI workloads to MIG instances?
  • Are there successful cases or debugging tips for similar scenarios that you can share?

System Environment Information:

GPU Configuration:

  • GPU Model: NVIDIA A100X
  • Driver Version: 555.42.02
  • CUDA Version: 12.5
  • MIG Configuration:
    • GPU 0: MIG mode enabled, with two MIG instances
      • MIG 1: 38MiB / 40192MiB
      • MIG 2: 50MiB / 40192MiB
    • GPU 1: MIG mode not enabled

CPU Configuration:

  • CPU Model: Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz
  • Core Count: 48 (24 cores per socket, 2 sockets total)
  • NUMA Nodes: 2 (NUMA Node 0: even cores, NUMA Node 1: odd cores)

Other Configuration Information:

  • Docker Runtime: NVIDIA Container Runtime
  • ARC-OTA Version: 1.5
  • AI Application Framework and Version: vLLM llava-v1.6-mistral-7b-hf

Thank you for your assistance! Any suggestions or resources would greatly benefit my current research and development work.

Thanks again, and I look forward to your response!

sit.zip (11.1 KB)

Additional Information:
arc-gnb@arc-gnb:~/arc-ota/sit$ docker compose up
[+] Running 2/0
✔ Container nv-cubb Created 0.0s
✔ Container oai-gnb-aerial Created 0.0s
Attaching to nv-cubb, oai-gnb-aerial
nv-cubb |
nv-cubb | ==========
nv-cubb | == CUDA ==
nv-cubb | ==========
nv-cubb |
nv-cubb | CUDA Version 12.2.2
nv-cubb |
nv-cubb | Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
nv-cubb |
nv-cubb | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
nv-cubb | By pulling and using the container, you accept the terms and conditions of this license:
nv-cubb | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
nv-cubb |
nv-cubb | A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
nv-cubb |
nv-cubb | Cannot find MPS control daemon process
nv-cubb | Started cuphycontroller on CPU core 4
nv-cubb | AERIAL_LOG_PATH set to /var/log/aerial
nv-cubb | Log file set to /var/log/aerial/phy.log
nv-cubb | Aerial metrics backend address: 127.0.0.1:8081
nv-cubb | 17:40:52.591366 WRN phy_init 0 [CTL.SCF] Config file: /opt/nvidia/cuBB/cuPHY-CP/cuphycontroller/config/cuphycontroller_F08_R750_1C_L4NIC_OAI.yaml
nv-cubb | 17:40:52.591749 WRN phy_init 0 [CTL.SCF] low_priority_core=14
nv-cubb | 17:40:52.597261 WRN phy_init 0 [NVLOG.CPP] Using /opt/nvidia/cuBB/cuPHY/nvlog/config/nvlog_config.yaml for nvlog configuration
nv-cubb | 17:40:52.609081 WRN phy_init 0 [CTL.YAML] cell_id 0 nic_index :0
nv-cubb | 17:40:52.609172 WRN phy_init 0 [CTL.YAML] pusch_nMaxRx not set in config file, using default value of 0
nv-cubb | 17:40:52.609194 WRN phy_init 0 [CTL.YAML] Num Slots: 8
nv-cubb | 17:40:52.609194 WRN phy_init 0 [CTL.YAML] Enable UL cuPHY Graphs: 1
nv-cubb | 17:40:52.609194 WRN phy_init 0 [CTL.YAML] Enable DL cuPHY Graphs: 1
nv-cubb | 17:40:52.609194 WRN phy_init 0 [CTL.YAML] Accurate TX scheduling clock resolution (ns): 500
nv-cubb | 17:40:52.609194 WRN phy_init 0 [CTL.YAML] DPDK core: 14
nv-cubb | 17:40:52.609194 WRN phy_init 0 [CTL.YAML] Prometheus core: -1
nv-cubb | 17:40:52.609195 WRN phy_init 0 [CTL.YAML] UL cores:
nv-cubb | 17:40:52.609195 WRN phy_init 0 [CTL.YAML] - 4
nv-cubb | 17:40:52.609195 WRN phy_init 0 [CTL.YAML] - 6
nv-cubb | 17:40:52.609195 WRN phy_init 0 [CTL.YAML] - 8
nv-cubb | 17:40:52.609195 WRN phy_init 0 [CTL.YAML] - 10
nv-cubb | 17:40:52.609195 WRN phy_init 0 [CTL.YAML] DL cores:
nv-cubb | 17:40:52.609195 WRN phy_init 0 [CTL.YAML] - 16
nv-cubb | 17:40:52.609195 WRN phy_init 0 [CTL.YAML] - 18
nv-cubb | 17:40:52.609195 WRN phy_init 0 [CTL.YAML] - 20
nv-cubb | 17:40:52.609195 WRN phy_init 0 [CTL.YAML] - 22
nv-cubb | 17:40:52.609195 WRN phy_init 0 [CTL.YAML] - 24
nv-cubb | 17:40:52.609195 WRN phy_init 0 [CTL.YAML] Debug worker: -1
nv-cubb | 17:40:52.609195 WRN phy_init 0 [CTL.YAML] Data Lake core: -1
nv-cubb | 17:40:52.609196 WRN phy_init 0 [CTL.YAML] SRS starting Section ID: 3072
nv-cubb | 17:40:52.609196 WRN phy_init 0 [CTL.YAML] PRACH starting Section ID: 2048
nv-cubb | 17:40:52.609196 WRN phy_init 0 [CTL.YAML] MPS SM PUSCH: 66
nv-cubb | 17:40:52.609196 WRN phy_init 0 [CTL.YAML] MPS SM PUCCH: 16
nv-cubb | 17:40:52.609196 WRN phy_init 0 [CTL.YAML] MPS SM PRACH: 2
nv-cubb | 17:40:52.609196 WRN phy_init 0 [CTL.YAML] MPS SM UL ORDER: 16
nv-cubb | 17:40:52.609196 WRN phy_init 0 [CTL.YAML] MPS SM PDSCH: 102
nv-cubb | 17:40:52.609196 WRN phy_init 0 [CTL.YAML] MPS SM PDCCH: 10
nv-cubb | 17:40:52.609196 WRN phy_init 0 [CTL.YAML] MPS SM PBCH: 2
nv-cubb | 17:40:52.609196 WRN phy_init 0 [CTL.YAML] MPS SM GPU_COMMS: 16
nv-cubb | 17:40:52.609196 WRN phy_init 0 [CTL.YAML] PDSCH fallback: 0
nv-cubb | 17:40:52.609196 WRN phy_init 0 [CTL.YAML] Massive MIMO enable: 0
nv-cubb | 17:40:52.609196 WRN phy_init 0 [CTL.YAML] Enable SRS : 0
nv-cubb | 17:40:52.609196 WRN phy_init 0 [CTL.YAML] ul_order_timeout_gpu_log_enable: 1
nv-cubb | 17:40:52.609196 WRN phy_init 0 [CTL.YAML] ue_mode: 0
nv-cubb | 17:40:52.609196 WRN phy_init 0 [CTL.YAML] Aggr Obj Non-availability threshold: 5
nv-cubb | 17:40:52.609197 WRN phy_init 0 [CTL.YAML] sendCPlane_timing_error_th_ns: 50000
nv-cubb | 17:40:52.609197 WRN phy_init 0 [CTL.YAML] ul_order_timeout_gpu_log_enable: 1
nv-cubb | 17:40:52.609198 WRN phy_init 0 [CTL.YAML] GPU-initiated comms DL: 1
nv-cubb | 17:40:52.609198 WRN phy_init 0 [CTL.YAML] Cell group: 1
nv-cubb | 17:40:52.609198 WRN phy_init 0 [CTL.YAML] Cell group num: 1
nv-cubb | 17:40:52.609198 WRN phy_init 0 [CTL.YAML] puxchPolarDcdrListSz: 1
nv-cubb | 17:40:52.609198 WRN phy_init 0 [CTL.YAML] split_ul_cuda_streams: 1
nv-cubb | 17:40:52.609198 WRN phy_init 0 [CTL.YAML] serialize_pucch_pusch: 0
nv-cubb | 17:40:52.609198 WRN phy_init 0 [CTL.YAML] Number of Cell Configs: 1
nv-cubb | 17:40:52.609198 WRN phy_init 0 [CTL.YAML] L2Adapter config file: /opt/nvidia/cuBB/cuPHY-CP/cuphycontroller/config/l2_adapter_config_F08_R750_OAI.yaml
nv-cubb | 17:40:52.609198 WRN phy_init 0 [CTL.YAML] Cell name: O-RU 0
nv-cubb | 17:40:52.609199 WRN phy_init 0 [CTL.YAML] MU: 1
nv-cubb | 17:40:52.609199 WRN phy_init 0 [CTL.YAML] ID: 0
nv-cubb | 17:40:52.609199 WRN phy_init 0 [CTL.YAML] Number of MPlane Configs: 1
nv-cubb | 17:40:52.609199 WRN phy_init 0 [CTL.YAML] Mplane ID: 0
nv-cubb | 17:40:52.609199 WRN phy_init 0 [CTL.YAML] VLAN ID: 6
nv-cubb | 17:40:52.609199 WRN phy_init 0 [CTL.YAML] Source Eth Address: 00:00:00:00:00:00
nv-cubb | 17:40:52.609199 WRN phy_init 0 [CTL.YAML] Destination Eth Address: 00:0b:0c:0c:0d:0a
nv-cubb | 17:40:52.609199 WRN phy_init 0 [CTL.YAML] NIC port: 0000:19:00.0
nv-cubb | 17:40:52.609199 WRN phy_init 0 [CTL.YAML] RU Type: 1
nv-cubb | 17:40:52.609199 WRN phy_init 0 [CTL.YAML] U-plane TXQs: 1
nv-cubb | 17:40:52.609199 WRN phy_init 0 [CTL.YAML] DL compression method: 1
nv-cubb | 17:40:52.609199 WRN phy_init 0 [CTL.YAML] DL iq bit width: 9
nv-cubb | 17:40:52.609199 WRN phy_init 0 [CTL.YAML] UL compression method: 1
nv-cubb | 17:40:52.609199 WRN phy_init 0 [CTL.YAML] UL iq bit width: 9
nv-cubb | 17:40:52.609200 WRN phy_init 0 [CTL.YAML]
nv-cubb | 17:40:52.609200 WRN phy_init 0 [CTL.YAML] Flow list SSB/PBCH:
nv-cubb | 17:40:52.609200 WRN phy_init 0 [CTL.YAML] 0
nv-cubb | 17:40:52.609200 WRN phy_init 0 [CTL.YAML] 1
nv-cubb | 17:40:52.609200 WRN phy_init 0 [CTL.YAML] 2
nv-cubb | 17:40:52.609200 WRN phy_init 0 [CTL.YAML] 3
nv-cubb | 17:40:52.609200 WRN phy_init 0 [CTL.YAML] Flow list PDCCH:
nv-cubb | 17:40:52.609200 WRN phy_init 0 [CTL.YAML] 0
nv-cubb | 17:40:52.609200 WRN phy_init 0 [CTL.YAML] 1
nv-cubb | 17:40:52.609200 WRN phy_init 0 [CTL.YAML] 2
nv-cubb | 17:40:52.609200 WRN phy_init 0 [CTL.YAML] 3
nv-cubb | 17:40:52.609200 WRN phy_init 0 [CTL.YAML] Flow list PDSCH:
nv-cubb | 17:40:52.609200 WRN phy_init 0 [CTL.YAML] 0
nv-cubb | 17:40:52.609200 WRN phy_init 0 [CTL.YAML] 1
nv-cubb | 17:40:52.609200 WRN phy_init 0 [CTL.YAML] 2
nv-cubb | 17:40:52.609200 WRN phy_init 0 [CTL.YAML] 3
nv-cubb | 17:40:52.609200 WRN phy_init 0 [CTL.YAML] Flow list CSIRS:
nv-cubb | 17:40:52.609201 WRN phy_init 0 [CTL.YAML] 0
nv-cubb | 17:40:52.609201 WRN phy_init 0 [CTL.YAML] 1
nv-cubb | 17:40:52.609201 WRN phy_init 0 [CTL.YAML] 2
nv-cubb | 17:40:52.609201 WRN phy_init 0 [CTL.YAML] 3
nv-cubb | 17:40:52.609201 WRN phy_init 0 [CTL.YAML] Flow list PUSCH:
nv-cubb | 17:40:52.609201 WRN phy_init 0 [CTL.YAML] 0
nv-cubb | 17:40:52.609201 WRN phy_init 0 [CTL.YAML] 1
nv-cubb | 17:40:52.609201 WRN phy_init 0 [CTL.YAML] 2
nv-cubb | 17:40:52.609201 WRN phy_init 0 [CTL.YAML] 3
nv-cubb | 17:40:52.609201 WRN phy_init 0 [CTL.YAML] Flow list PUCCH:
nv-cubb | 17:40:52.609201 WRN phy_init 0 [CTL.YAML] 0
nv-cubb | 17:40:52.609201 WRN phy_init 0 [CTL.YAML] 1
nv-cubb | 17:40:52.609201 WRN phy_init 0 [CTL.YAML] 2
nv-cubb | 17:40:52.609201 WRN phy_init 0 [CTL.YAML] 3
nv-cubb | 17:40:52.609201 WRN phy_init 0 [CTL.YAML] Flow list SRS:
nv-cubb | 17:40:52.609201 WRN phy_init 0 [CTL.YAML] 0
nv-cubb | 17:40:52.609202 WRN phy_init 0 [CTL.YAML] 1
nv-cubb | 17:40:52.609202 WRN phy_init 0 [CTL.YAML] 2
nv-cubb | 17:40:52.609202 WRN phy_init 0 [CTL.YAML] 3
nv-cubb | 17:40:52.609202 WRN phy_init 0 [CTL.YAML] Flow list PRACH:
nv-cubb | 17:40:52.609202 WRN phy_init 0 [CTL.YAML] 4
nv-cubb | 17:40:52.609202 WRN phy_init 0 [CTL.YAML] 5
nv-cubb | 17:40:52.609202 WRN phy_init 0 [CTL.YAML] 6
nv-cubb | 17:40:52.609202 WRN phy_init 0 [CTL.YAML] 7
nv-cubb | 17:40:52.609202 WRN phy_init 0 [CTL.YAML] PUSCH TV: /opt/nvidia/cuBB/testVectors/cuPhyChEstCoeffs.h5
nv-cubb | 17:40:52.609202 WRN phy_init 0 [CTL.YAML] SRS TV: /opt/nvidia/cuBB/testVectors/cuPhyChEstCoeffs.h5
nv-cubb | 17:40:52.609202 WRN phy_init 0 [CTL.YAML] Section_3 time offset: 58369
nv-cubb | 17:40:52.609202 WRN phy_init 0 [CTL.YAML] nMaxRxAnt: 4
nv-cubb | 17:40:52.609202 WRN phy_init 0 [CTL.YAML] PUSCH PRBs Stride: 273
nv-cubb | 17:40:52.609202 WRN phy_init 0 [CTL.YAML] PRACH PRBs Stride: 12
nv-cubb | 17:40:52.609202 WRN phy_init 0 [CTL.YAML] SRS PRBs Stride: 12
nv-cubb | 17:40:52.609202 WRN phy_init 0 [CTL.YAML] PUSCH nMaxPrb: 273
nv-cubb | 17:40:52.609202 WRN phy_init 0 [CTL.YAML] PUSCH nMaxRx: 0
nv-cubb | 17:40:52.609202 WRN phy_init 0 [CTL.YAML] UL Gain Calibration: 77.68
nv-cubb | 17:40:52.609203 WRN phy_init 0 [CTL.YAML] Lower guard bw: 845
nv-cubb | 17:40:52.609203 WRN phy_init 0 [CTL.SCF] Cuda Set Device: 0
nv-cubb | 17:40:52.924959 WRN phy_init 0 [CTL.SCF] Cuphy PTI Init: 0000:19:00.0
nv-cubb | 17:40:52.994513 WRN phy_init 0 [CTL.SCF] Init PHYDriver: 140252042031104
nv-cubb | 17:40:52.994517 WRN phy_init 0 [CTL.SCF] pusch max harq tx: 4
nv-cubb | terminate called after throwing an instance of 'pd_exc_h'
nv-cubb | what(): Invalid pointer: StaticConversion can't return nullptr
nv-cubb | /opt/nvidia/cuBB/aerial_l1_entrypoint.sh: line 36: 50 Aborted sudo -E "$cuBB_Path"/build/cuPHY-CP/cuphycontroller/examples/cuphycontroller_scf "$argument"
nv-cubb | [C]: Usage: ./build/cuPHY-CP/gt_common_libs/nvIPC/tests/pcap/pcap_collect
nv-cubb |
nv-cubb | [C]: Current run: ./build/cuPHY-CP/gt_common_libs/nvIPC/tests/pcap/pcap_collect name=nvipc dest_path=.
nv-cubb |
nv-cubb | [E][AERIAL_SYSTEM_API_EVENT]: ipc_shm_open: shm_open nvipc_cfg_app_config failed error -1
nv-cubb | [E][AERIAL_NVIPC_API_EVENT]: nv_ipc_shm_open: primary=0 name=nvipc_cfg_app_config size=40 Failed
nv-cubb | [E][AERIAL_SYSTEM_API_EVENT]: ipc_shm_close: close shm_fd failed
nv-cubb | [E][AERIAL_NVIPC_API_EVENT]: nv_ipc_app_config_shmpool_open: primary=0, prefix=nvipc
nv-cubb | [E][AERIAL_NVIPC_API_EVENT]: nv_ipc_app_config_get: failed: configs=(nil) item=0
nv-cubb | [E][AERIAL_NVIPC_API_EVENT]: nv_ipc_app_config_get: failed: configs=(nil) item=1
nv-cubb | [E][AERIAL_NVIPC_API_EVENT]: nv_ipc_app_config_get: failed: configs=(nil) item=8
nv-cubb | [E][AERIAL_NVIPC_API_EVENT]: nv_ipc_app_config_get: failed: configs=(nil) item=9
nv-cubb | [I]: load_debug_config: fapi_type=0 fapi_tb_loc=0 pcap_max_msg_size=0 pcap_max_data_size=0
nv-cubb | [I]: shmlogger_collect: save /var/log/aerial/nvipc_pcap and /dev/shm/nvipc_pcap logs to ./nvipc_pcap
nv-cubb | [E][AERIAL_SYSTEM_API_EVENT]: ipc_shm_open: shm_open nvipc_pcap failed error -1
nv-cubb | [E][AERIAL_NVIPC_API_EVENT]: nv_ipc_shm_open: primary=0 name=nvipc_pcap size=70368752566304 Failed
nv-cubb | [E][AERIAL_SYSTEM_API_EVENT]: ipc_shm_close: close shm_fd failed
nv-cubb | [E][AERIAL_NVIPC_API_EVENT]: shmlogger_open: nv_ipc_shm_open failed
nv-cubb | [I]: shmlogger_collect: no /dev/shm/nvipc_pcap, logger may have been closed normally
nv-cubb exited with code 0
Gracefully stopping… (press Ctrl+C again to force)
dependency failed to start: container nv-cubb exited (0)

Hi and welcome!
In the Docker Compose file you have to specify the MIG instance you would like to use.

$ nvidia-smi -L
GPU 0: NVIDIA GH200 480GB (UUID: GPU-b920718b-56bf-fa42-6a19-bc1b92658556)
MIG 3g.48gb Device 0: (UUID: MIG-1dba3a4b-52e5-5d46-b9f0-84a1e21d0faf)
MIG 4g.48gb Device 1: (UUID: MIG-8cd00ab5-9b6c-52d5-b3e9-d357b82c450c)

Then, in the cubb section of the Docker Compose file, add:

environment:
  - NVIDIA_VISIBLE_DEVICES=MIG-1dba3a4b-52e5-5d46-b9f0-84a1e21d0faf
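For context, the surrounding compose service might look roughly like this; the image name and the extra fields are placeholders from my side, not taken from the ARC-OTA release:

```yaml
services:
  cubb:
    image: <your-cubb-image>      # placeholder, use your ARC-OTA image
    runtime: nvidia
    environment:
      # Pin the container to one MIG instance by UUID (from nvidia-smi -L)
      - NVIDIA_VISIBLE_DEVICES=MIG-1dba3a4b-52e5-5d46-b9f0-84a1e21d0faf
      - NVIDIA_DRIVER_CAPABILITIES=all
```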

Also, depending on your MIG setup, you have to adjust the number of SMs (the mps_sm_* values) in the cuphycontrol.yaml file.

mps_sm_pusch: 82
mps_sm_pucch: 20
mps_sm_prach: 2
mps_sm_ul_order: 20
mps_sm_pdsch: 102
mps_sm_pdcch: 10
mps_sm_pbch: 2
mps_sm_gpu_comms: 16
mps_sm_srs: 16

Thank you for your response and helpful suggestions!

However, I am currently using the NVIDIA A100X GPU, configured in MIG mode with two instances:

  • MIG 1: 38MiB / 40192MiB
  • MIG 2: 50MiB / 40192MiB

I would like to know if there are any recommended Docker Compose configuration parameters and SM settings in the cuphycontrol.yaml file specifically for the A100X. Particularly, I’m looking for advice on efficiently allocating RAN and AI workloads. Are there any specific guidelines or reference configurations for this scenario?

Thank you again for your help, and I look forward to your guidance!

For the case where you partitioned the A100X GPU, assume the partition is 3g.40gb and 4g.40gb. The corresponding SM counts of the MIG devices are SM=42 and SM=56 for 3g.40gb and 4g.40gb, respectively.


Example MIG device IDs:
nvidia-smi -L
GPU 0: NVIDIA A100X (UUID: GPU-aad78ddf-0187-6c32-a882-b3b50bd46976)
MIG 4g.40gb Device 0: (UUID: MIG-4292c581-87da-5569-8483-1b43dff9795b)
MIG 3g.40gb Device 1: (UUID: MIG-ac313f7c-2ea5-5f74-8f07-93ab05ea6bf0)

  1. In the Docker Compose file, replace the NVIDIA_VISIBLE_DEVICES environment variable with the UUID of the MIG device you choose to use. For example, choosing the 3g.40gb instance for RAN:
    environment:
  • NVIDIA_VISIBLE_DEVICES=MIG-ac313f7c-2ea5-5f74-8f07-93ab05ea6bf0
  2. In cuphycontroller_P5G_FXN.yaml, scale down the mps_sm_* values based on the available SMs (e.g. SM=42 for 3g.40gb). A reference configuration is shown below:
    mps_sm_pusch: 30
    mps_sm_pucch: 16
    mps_sm_prach: 2
    mps_sm_ul_order: 10
    mps_sm_pdsch: 40
    mps_sm_pdcch: 10
    mps_sm_pbch: 2
    mps_sm_gpu_comms: 10
    mps_sm_srs: 10
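As a sanity check on numbers like these, here is a small, purely illustrative Python sketch (my own, not from Aerial) that scales a full-GPU mps_sm_* budget down in proportion to a MIG slice, assuming 14 SMs per A100 compute slice (so 3g → 42 SMs, 4g → 56 SMs) and clamping each value to the slice size. The full-GPU values are taken from the MPS SM lines in the log earlier in this thread; the scaled output is only a starting point, and hand-tuned values like the reference configuration above will generally differ:

```python
# Illustrative scaling of Aerial mps_sm_* values to a MIG slice.
# Assumption: each A100 compute slice exposes 14 SMs, so a 3g slice
# has 42 SMs and a 4g slice has 56. A full A100 has 108 SMs.
SMS_PER_SLICE = 14
FULL_GPU_SMS = 108

# Full-GPU budget as printed in the cuphycontroller log above
full_cfg = {
    "mps_sm_pusch": 66, "mps_sm_pucch": 16, "mps_sm_prach": 2,
    "mps_sm_ul_order": 16, "mps_sm_pdsch": 102, "mps_sm_pdcch": 10,
    "mps_sm_pbch": 2, "mps_sm_gpu_comms": 16,
}

def scale_for_slices(cfg: dict, n_slices: int) -> dict:
    """Scale each mps_sm_* value proportionally, keep a floor of 2 SMs,
    and clamp so no single value exceeds the SMs of the slice."""
    slice_sms = n_slices * SMS_PER_SLICE
    out = {}
    for key, sms in cfg.items():
        scaled = round(sms * slice_sms / FULL_GPU_SMS)
        out[key] = max(2, min(scaled, slice_sms))
    return out

cfg_3g = scale_for_slices(full_cfg, 3)  # 42-SM slice (3g.40gb)
print(cfg_3g)
```

Note that each individual mps_sm value must stay within the SMs actually available on the chosen MIG instance, which the clamp above enforces.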
