Hello -
I’m trying to enable vGPUs on OpenShift 4.11, following these docs: NVIDIA GPU Operator with OpenShift Virtualization — gpu-operator 23.6.1 documentation.
oc get pods -n nvidia-gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-operator-fbb6ffcc8-gzddt 1/1 Running 0 4h56m
nvidia-vgpu-device-manager-2b5r5 1/1 Running 0 13m
nvidia-vgpu-device-manager-f4rnr 0/1 Init:0/1 0 12m
nvidia-vgpu-device-manager-knx9v 0/1 Init:0/1 0 12m
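The two pods stuck in Init:0/1 are waiting on the vgpu-manager-validation init container; if it helps, I can pull more detail on them with something like:
oc describe pod nvidia-vgpu-device-manager-f4rnr -n nvidia-gpu-operator
oc logs nvidia-vgpu-device-manager-f4rnr -n nvidia-gpu-operator -c vgpu-manager-validation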
Here is the current error, which probably means it can’t find this device in the config file: https://github.com/NVIDIA/vgpu-device-manager/blob/main/examples/config-example.yaml#L1848
oc logs -f nvidia-vgpu-device-manager-2b5r5
Defaulted container "nvidia-vgpu-device-manager" out of: nvidia-vgpu-device-manager, vgpu-manager-validation (init)
W0926 20:19:37.905184 1 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time="2023-09-26T20:19:37Z" level=info msg="Updating to vGPU config: RTXA5000-1Q"
time="2023-09-26T20:19:37Z" level=info msg="Asserting that the requested configuration is present in the configuration file"
time="2023-09-26T20:19:37Z" level=info msg="Selected vGPU device configuration is valid"
time="2023-09-26T20:19:37Z" level=info msg="Checking if the selected vGPU device configuration is currently applied or not"
time="2023-09-26T20:19:37Z" level=fatal msg="Assertion failure: selected configuration not currently applied"
time="2023-09-26T20:19:37Z" level=info msg="Getting current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label"
time="2023-09-26T20:19:37Z" level=info msg="Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin=paused-for-vgpu-change'"
time="2023-09-26T20:19:37Z" level=info msg="Getting current value of 'nvidia.com/gpu.deploy.sandbox-validator' node label"
time="2023-09-26T20:19:37Z" level=info msg="Current value of 'nvidia.com/gpu.deploy.sandbox-validator=paused-for-vgpu-change'"
time="2023-09-26T20:19:37Z" level=info msg="Getting current value of 'nvidia.com/vgpu.config.state' node label"
time="2023-09-26T20:19:37Z" level=info msg="Current value of 'nvidia.com/vgpu.config.state=failed'"
time="2023-09-26T20:19:37Z" level=info msg="Changing the 'nvidia.com/vgpu.config.state' node label to 'pending'"
time="2023-09-26T20:19:37Z" level=info msg="Shutting down all GPU operands in Kubernetes by disabling their component-specific nodeSelector labels"
time="2023-09-26T20:19:38Z" level=info msg="Waiting for sandbox-device-plugin to shutdown"
time="2023-09-26T20:19:38Z" level=info msg="Waiting for sandbox-validator to shutdown"
time="2023-09-26T20:19:38Z" level=info msg="Applying the selected vGPU device configuration to the node"
time="2023-09-26T20:19:38Z" level=debug msg="Parsing config file..."
time="2023-09-26T20:19:38Z" level=debug msg="Selecting specific vGPU config..."
time="2023-09-26T20:19:38Z" level=debug msg="Checking current vGPU device configuration..."
time="2023-09-26T20:19:38Z" level=debug msg="Walking VGPUConfig for (devices=all)"
time="2023-09-26T20:19:38Z" level=debug msg=" GPU 0: 0x223110DE"
time="2023-09-26T20:19:38Z" level=info msg="Applying vGPU device configuration..."
time="2023-09-26T20:19:38Z" level=debug msg="Walking VGPUConfig for (devices=all)"
time="2023-09-26T20:19:38Z" level=debug msg=" GPU 0: 0x223110DE"
time="2023-09-26T20:19:38Z" level=fatal msg="error getting vGPU config: error getting all vGPU devices: unable to read MDEV devices directory: open /sys/bus/mdev/devices: no such file or directory"
time="2023-09-26T20:19:38Z" level=info msg="Changing the 'nvidia.com/vgpu.config.state' node label to 'failed'"
time="2023-09-26T20:19:38Z" level=error msg="ERROR: unable to apply config 'RTXA5000-1Q': exit status 1"
time="2023-09-26T20:19:38Z" level=info msg="Waiting for change to 'nvidia.com/vgpu.config' label"
I have two different GPU models, RTX A5000 and A100. The deployment gets stuck on the RTX A5000 node and doesn’t continue to configure the rest of the nodes.
On the nodes containing A100s I have set the label nvidia.com/vgpu.config=A10-3Q, and on the RTX A5000 node I have set nvidia.com/vgpu.config=RTXA5000-1Q.
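For reference, the labels were applied roughly like this (node names are placeholders):
oc label node <a100-node> nvidia.com/vgpu.config=A10-3Q --overwrite
oc label node <rtx-a5000-node> nvidia.com/vgpu.config=RTXA5000-1Q --overwrite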
ClusterPolicy config:
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: ''
    enabled: true
    serviceMonitor:
      enabled: true
  cdi:
    default: false
    enabled: false
  driver:
    certConfig:
      name: ''
    enabled: true
    kernelModuleConfig:
      name: ''
    licensingConfig:
      configMapName: licensing-config
      nlsEnabled: true
    repoConfig:
      configMapName: ''
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    virtualTopology:
      config: ''
  devicePlugin:
    config:
      default: ''
      name: ''
    enabled: true
  kataManager:
    config:
      artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
  mig:
    strategy: single
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'false'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: vm-vgpu
    enabled: true
  gds:
    enabled: false
  vgpuManager:
    driverManager:
      image: vgpu-manager
      repository: default-route-openshift-image-registry.apps.ocp4.poc.site/pp-testing
      version: 535.104.06-rhcos4.11
    enabled: true
  vfioManager:
    enabled: true
  toolkit:
    enabled: true
    installDir: /usr/local/nvidia
status:
  namespace: nvidia-gpu-operator
  state: notReady
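In case it’s useful, the per-node state the vGPU device manager sets should show up in the node labels; something like this should list them (using gpu1 and gpu3 as examples):
oc describe node gpu1 | grep vgpu
oc describe node gpu3 | grep vgpu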
I can see the GPUs on the physical nodes
oc debug node/gpu1 -- chroot /host lspci -nnk -d 10de:
Starting pod/gpu1ocp4pocsite-debug ...
To use host binaries, run `chroot /host`
0000:31:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102GL [RTX A5000] [10de:2231] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:147e]
Kernel driver in use: nvidia
Kernel modules: nouveau
oc debug node/gpu3 -- chroot /host lspci -nnk -d 10de:
Starting pod/gpu3ocp4pocsite-debug ...
To use host binaries, run `chroot /host`
1b:00.0 3D controller [0302]: NVIDIA Corporation GA100 [A100 PCIe 80GB] [10de:20b5] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:1533]
Kernel driver in use: nvidia
Kernel modules: nouveau
1c:00.0 3D controller [0302]: NVIDIA Corporation GA100 [A100 PCIe 80GB] [10de:20b5] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:1533]
Kernel driver in use: nvidia
Kernel modules: nouveau
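I assume it’s also worth confirming which operand pods actually landed on each GPU node; this should show the placement:
oc get pods -n nvidia-gpu-operator -o wide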
Any help would be appreciated.