Error getting vGPU config: error getting all vGPU devices: unable to read MDEV devices directory: open /sys/bus/mdev/devices: no such file or directory

Hello -

I’m trying to enable vGPUs on OpenShift 4.11 following these docs: NVIDIA GPU Operator with OpenShift Virtualization — gpu-operator 23.6.1 documentation

oc get pods -n nvidia-gpu-operator 
NAME                               READY   STATUS     RESTARTS   AGE
gpu-operator-fbb6ffcc8-gzddt       1/1     Running    0          4h56m
nvidia-vgpu-device-manager-2b5r5   1/1     Running    0          13m
nvidia-vgpu-device-manager-f4rnr   0/1     Init:0/1   0          12m
nvidia-vgpu-device-manager-knx9v   0/1     Init:0/1   0          12m

Here is the current error, which probably means it can’t find this device in the config file: https://github.com/NVIDIA/vgpu-device-manager/blob/main/examples/config-example.yaml#L1848

oc logs -f nvidia-vgpu-device-manager-2b5r5 
Defaulted container "nvidia-vgpu-device-manager" out of: nvidia-vgpu-device-manager, vgpu-manager-validation (init)
W0926 20:19:37.905184       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2023-09-26T20:19:37Z" level=info msg="Updating to vGPU config: RTXA5000-1Q"
time="2023-09-26T20:19:37Z" level=info msg="Asserting that the requested configuration is present in the configuration file"
time="2023-09-26T20:19:37Z" level=info msg="Selected vGPU device configuration is valid"
time="2023-09-26T20:19:37Z" level=info msg="Checking if the selected vGPU device configuration is currently applied or not"
time="2023-09-26T20:19:37Z" level=fatal msg="Assertion failure: selected configuration not currently applied"
time="2023-09-26T20:19:37Z" level=info msg="Getting current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label"
time="2023-09-26T20:19:37Z" level=info msg="Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin=paused-for-vgpu-change'"
time="2023-09-26T20:19:37Z" level=info msg="Getting current value of 'nvidia.com/gpu.deploy.sandbox-validator' node label"
time="2023-09-26T20:19:37Z" level=info msg="Current value of 'nvidia.com/gpu.deploy.sandbox-validator=paused-for-vgpu-change'"
time="2023-09-26T20:19:37Z" level=info msg="Getting current value of 'nvidia.com/vgpu.config.state' node label"
time="2023-09-26T20:19:37Z" level=info msg="Current value of 'nvidia.com/vgpu.config.state=failed'"
time="2023-09-26T20:19:37Z" level=info msg="Changing the 'nvidia.com/vgpu.config.state' node label to 'pending'"
time="2023-09-26T20:19:37Z" level=info msg="Shutting down all GPU operands in Kubernetes by disabling their component-specific nodeSelector labels"
time="2023-09-26T20:19:38Z" level=info msg="Waiting for sandbox-device-plugin to shutdown"
time="2023-09-26T20:19:38Z" level=info msg="Waiting for sandbox-validator to shutdown"
time="2023-09-26T20:19:38Z" level=info msg="Applying the selected vGPU device configuration to the node"
time="2023-09-26T20:19:38Z" level=debug msg="Parsing config file..."
time="2023-09-26T20:19:38Z" level=debug msg="Selecting specific vGPU config..."
time="2023-09-26T20:19:38Z" level=debug msg="Checking current vGPU device configuration..."
time="2023-09-26T20:19:38Z" level=debug msg="Walking VGPUConfig for (devices=all)"
time="2023-09-26T20:19:38Z" level=debug msg="  GPU 0: 0x223110DE"
time="2023-09-26T20:19:38Z" level=info msg="Applying vGPU device configuration..."
time="2023-09-26T20:19:38Z" level=debug msg="Walking VGPUConfig for (devices=all)"
time="2023-09-26T20:19:38Z" level=debug msg="  GPU 0: 0x223110DE"
time="2023-09-26T20:19:38Z" level=fatal msg="error getting vGPU config: error getting all vGPU devices: unable to read MDEV devices directory: open /sys/bus/mdev/devices: no such file or directory"
time="2023-09-26T20:19:38Z" level=info msg="Changing the 'nvidia.com/vgpu.config.state' node label to 'failed'"
time="2023-09-26T20:19:38Z" level=error msg="ERROR: unable to apply config 'RTXA5000-1Q': exit status 1"
time="2023-09-26T20:19:38Z" level=info msg="Waiting for change to 'nvidia.com/vgpu.config' label"

I have two different GPU models, RTX A5000 and A100. The deployment gets stuck on the RTX A5000 node and doesn’t continue to configure the rest of the nodes.

On the nodes containing A100 GPUs I have set the label nvidia.com/vgpu.config=A10-3Q, and on the RTX A5000 node I have set nvidia.com/vgpu.config=RTXA5000-1Q.
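For reference, I applied them roughly like this (gpu1 is the RTX A5000 node, gpu3/gpu4 are the A100 nodes in my cluster):

oc label node gpu1 --overwrite nvidia.com/vgpu.config=RTXA5000-1Q
oc label node gpu3 gpu4 --overwrite nvidia.com/vgpu.config=A10-3Q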

My ClusterPolicy config:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  creationTimestamp: '2023-09-26T19:59:16Z'
  generation: 1
  name: gpu-cluster-policy
  resourceVersion: '387476060'
  uid: 34ea9517-1d7a-44b1-b7af-b84db66615a7
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: ''
    enabled: true
    serviceMonitor:
      enabled: true
  cdi:
    default: false
    enabled: false
  driver:
    certConfig:
      name: ''
    enabled: true
    kernelModuleConfig:
      name: ''
    licensingConfig:
      configMapName: licensing-config
      nlsEnabled: true
    repoConfig:
      configMapName: ''
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    virtualTopology:
      config: ''
  devicePlugin:
    config:
      default: ''
      name: ''
    enabled: true
  kataManager:
    config:
      artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
  mig:
    strategy: single
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'false'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: vm-vgpu
    enabled: true
  gds:
    enabled: false
  vgpuManager:
    driverManager:
      image: vgpu-manager
      repository: default-route-openshift-image-registry.apps.ocp4.poc.site/pp-testing
      version: 535.104.06-rhcos4.11
    enabled: true
  vfioManager:
    enabled: true
  toolkit:
    enabled: true
    installDir: /usr/local/nvidia
status:
  namespace: nvidia-gpu-operator
  state: notReady
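After applying the ClusterPolicy I also list what the operator actually scheduled, to see which operands exist per node (generic commands, nothing vGPU-specific):

oc get ds -n nvidia-gpu-operator
oc get pods -n nvidia-gpu-operator -o wide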

I can see the GPUs on the physical nodes:

oc debug node/gpu1  -- chroot /host lspci -nnk -d 10de:  
Starting pod/gpu1ocp4pocsite-debug ...
To use host binaries, run `chroot /host`
0000:31:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102GL [RTX A5000] [10de:2231] (rev a1)
	Subsystem: NVIDIA Corporation Device [10de:147e]
	Kernel driver in use: nvidia
	Kernel modules: nouveau

oc debug node/gpu3 -- chroot /host lspci -nnk -d 10de:  
Starting pod/gpu3ocp4pocsite-debug ...
To use host binaries, run `chroot /host`
1b:00.0 3D controller [0302]: NVIDIA Corporation GA100 [A100 PCIe 80GB] [10de:20b5] (rev a1)
	Subsystem: NVIDIA Corporation Device [10de:1533]
	Kernel driver in use: nvidia
	Kernel modules: nouveau
1c:00.0 3D controller [0302]: NVIDIA Corporation GA100 [A100 PCIe 80GB] [10de:20b5] (rev a1)
	Subsystem: NVIDIA Corporation Device [10de:1533]
	Kernel driver in use: nvidia
	Kernel modules: nouveau
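Since the fatal error is about the mdev layer rather than the PCI devices themselves, a check like this would show whether mediated device types were ever registered for the cards (a sketch; note that on SR-IOV based GPUs such as the A100 the types may appear under the virtual functions rather than the physical function):

oc debug node/gpu1 -- chroot /host ls /sys/bus/pci/devices/0000:31:00.0/mdev_supported_types
oc debug node/gpu3 -- chroot /host ls /sys/bus/pci/devices/0000:1b:00.0/mdev_supported_types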

Any help?

My device manager pod is in the same state as described in vGPU device manager pod stuck in init container phase · Issue #554 · NVIDIA/gpu-operator · GitHub.

I removed the nvidia.com/vgpu.config label and let the default configuration take over.
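Concretely, removing the label is just the usual trailing-dash syntax (node name is a placeholder):

oc label node <node-name> nvidia.com/vgpu.config-    # the trailing "-" removes the label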

I also labeled my RTX A5000 node for passthrough using oc label node gpu1 --overwrite nvidia.com/gpu.workload.config=vm-passthrough, and that GPU gets successfully recognized:

oc logs -f nvidia-vfio-manager-dhpxj
Defaulted container "nvidia-vfio-manager" out of: nvidia-vfio-manager, k8s-driver-manager (init)
binding device 0000:31:00.0
binding device 0000:31:00.1
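A quick way to confirm the binding on that node (I would expect lspci to now report vfio-pci as the kernel driver in use):

oc debug node/gpu1 -- chroot /host lspci -nnk -s 31:00.0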

But the remaining two nodes that are intended for vm-vgpu (vGPU) usage are stuck.

oc logs -f nvidia-vgpu-device-manager-hmqjt -c vgpu-manager-validation 
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
^C
oc logs -f nvidia-vgpu-device-manager-q8khn  -c vgpu-manager-validation 
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
^C
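Since the init container is only waiting for the vGPU Manager, I also check whether the vGPU Manager daemonset pod is actually running on those nodes and what it logs (a sketch; the pod name is a placeholder and the exact daemonset name can vary with the operator version):

oc get ds -n nvidia-gpu-operator | grep -i vgpu
oc get pods -n nvidia-gpu-operator -o wide | grep -i vgpu-manager
oc logs -n nvidia-gpu-operator <nvidia-vgpu-manager-pod> --all-containers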

If I disable the sandboxWorkloads.enabled option, the GPUs get recognized, but the vGPU configuration fails with nvidia.com/vgpu.config.state=failed.

oc describe node gpu4 | grep -i nvidia.com 
                    nvidia.com/cuda.driver.major=535
                    nvidia.com/cuda.driver.minor=104
                    nvidia.com/cuda.driver.rev=05
                    nvidia.com/cuda.runtime.major=12
                    nvidia.com/cuda.runtime.minor=2
                    nvidia.com/gfd.timestamp=1695828096
                    nvidia.com/gpu-driver-upgrade-state=upgrade-done
                    nvidia.com/gpu.compute.major=8
                    nvidia.com/gpu.compute.minor=0
                    nvidia.com/gpu.count=2
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=true
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.mig-manager=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.nvsm=
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.family=ampere
                    nvidia.com/gpu.machine=G292-280-IAY1-000
                    nvidia.com/gpu.memory=81920
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=NVIDIA-A100-80GB-PCIe
                    nvidia.com/gpu.replicas=1
                    nvidia.com/gpu.workload.config=vm-vgpu
                    nvidia.com/mig.capable=true
                    nvidia.com/mig.config=all-disabled
                    nvidia.com/mig.config.state=success
                    nvidia.com/mig.strategy=single
                    nvidia.com/vgpu.config.state=failed
                    nvidia.com/gpu-driver-upgrade-enabled: true
  nvidia.com/A100:                 0
  nvidia.com/gpu:                  2
  nvidia.com/A100:                 0
  nvidia.com/gpu:                  2
  nvidia.com/A100                 1             1
  nvidia.com/gpu                  0             0

The error message sounds like the Mediated Devices feature is not enabled.
Probably a silly question, but have you enabled the IOMMU driver on the host?

This is documented here: NVIDIA GPU Operator with OpenShift Virtualization — gpu-operator 23.6.1 documentation.
Note that the kernel argument for AMD is “amd_iommu=on”.
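One way to double-check this from the cluster, for reference (generic host commands via oc debug):

oc debug node/gpu1 -- chroot /host cat /proc/cmdline          # should contain intel_iommu=on (or amd_iommu=on)
oc debug node/gpu1 -- chroot /host dmesg | grep -i -e DMAR -e IOMMU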

Yes, we have intel_iommu=on enabled.

Hello to both of you @ppetkov12 and @fdupont and welcome to the NVIDIA developer forums.

I am not the expert on vGPU and I think you will have better success getting some suggestions if you post in our dedicated vGPU forums.

If you don’t mind, I can move this whole topic over there for you?

I am just not sure if it is a better fit for General discussion or rather the driver category?
Any preference?

Thanks!

Sounds good to me. Thank you.

Sounds good.

I had the same issue. I had the following labels on my two workers with A40 cards:

nvidia.com/gpu.deploy.sandbox-device-plugin=paused-for-vgpu-change
nvidia.com/gpu.deploy.sandbox-validator=paused-for-vgpu-change

I manually changed both labels to true and the vgpu-device-manager came up.
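Roughly what I ran, for anyone hitting the same thing (node name is a placeholder):

oc label node <worker-with-A40> --overwrite \
  nvidia.com/gpu.deploy.sandbox-device-plugin=true \
  nvidia.com/gpu.deploy.sandbox-validator=true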