GPU Operator Validator Pods Are Failing

amlan.hackassin · September 1, 2022, 3:47pm

Hi Team,

Please find the details of my environment below:

OS: CentOS 8
GPU: Tesla T4
Driver Version: 515.65

Unfortunately, some of the GPU Operator pods have been crashing in my case. Those are mainly the operator and device plugin validators.

On running a describe on one of the validators, I found that it has something to do with the Nvidia runtime config not being discovered by the operator. Is there any way to work around this problem?

smerla · September 1, 2022, 5:09pm

Can you confirm you are using toolkit version v1.10.0-ubi8? Also, can you paste the complete output of "kubectl describe pod -l app=nvidia-device-plugin-daemonset -n gpu-operator" and "kubectl logs --all-containers -l app=nvidia-operator-validator -n gpu-operator"?

amlan.hackassin · September 1, 2022, 6:56pm

No, the toolkit version I used was 1.7.1-centos8. Please find the attached outputs for your reference.
nv_device_plugin_ds.txt (12.7 KB)
nv_op_val.txt (4.0 KB)

smerla · September 1, 2022, 7:09pm

Can you update the toolkit to the above-mentioned version and also share the output for the validator pods that are failing: "kubectl logs --all-containers -n gpu-operator" and "kubectl describe pod -n gpu-operator". The only difference between cuda-validation and plugin-validation is that plugin-validation ensures GPU resources are advertised by the plugin to the kubelet and runs "nvidia-smi" with an explicit resource request. The logs might indicate why that is failing.
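
As a rough way to check the "GPU resources are advertised" part by hand, something like the sketch below could help. It assumes the device plugin advertises GPUs under the standard resource name nvidia.com/gpu and that kubectl is already configured for the cluster; the helper names gpu_counts and fetch_nodes are made up for illustration and are not part of the GPU Operator.

```python
import json
import subprocess

# Assumption: the NVIDIA device plugin advertises GPUs under this resource name.
GPU_RESOURCE = "nvidia.com/gpu"


def gpu_counts(node_list: dict) -> dict:
    """Map each node name to its allocatable GPU count (0 if none advertised).

    `node_list` is the parsed JSON from `kubectl get nodes -o json`.
    """
    counts = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Kubernetes reports resource quantities as strings, e.g. "1".
        counts[name] = int(allocatable.get(GPU_RESOURCE, "0"))
    return counts


def fetch_nodes() -> dict:
    """Hypothetical helper: read the node list through kubectl."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling gpu_counts(fetch_nodes()) returns a dict of node name to GPU count. A count of 0 on the GPU node would suggest the plugin has not registered the GPU with the kubelet, which is the condition plugin-validation waits on.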

amlan.hackassin · September 2, 2022, 8:10am

I updated the toolkit to the aforementioned version. Please find the attached logs and output.

desc_po.txt (136.0 KB)
validator_all_logs.txt (4.6 KB)
