Question about running "kubectl delete crd clusterpolicies.nvidia.com"

While installing Tao-Toolkit-API, I ran the command kubectl delete crd clusterpolicies.nvidia.com during the stage TASK [Waiting for the Cluster to become available].

However, I found that most of the nvidia-gpu-operator resources were missing after I ran this command.

  1. Before running kubectl delete crd clusterpolicies.nvidia.com

  2. After running kubectl delete crd clusterpolicies.nvidia.com

I also found errors about ClusterPolicy in the logs below:

1.6793996848505962e+09  INFO    controller-runtime.metrics      Metrics server is starting to listen    {"addr": ":8080"}
1.6793996848525527e+09  INFO    setup   starting manager
1.6793996848537874e+09  INFO    Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
1.6793996848538554e+09  INFO    Starting server {"kind": "health probe", "addr": "[::]:8081"}
I0321 11:54:44.853941       1 leaderelection.go:248] attempting to acquire leader lease nvidia-gpu-operator/53822513.nvidia.com...
I0321 11:55:02.134623       1 leaderelection.go:258] successfully acquired lease nvidia-gpu-operator/53822513.nvidia.com
1.679399702134646e+09   DEBUG   events  Normal  {"object": {"kind":"ConfigMap","namespace":"nvidia-gpu-operator","name":"53822513.nvidia.com","uid":"03372ca9-1fd1-44bc-99ea-8a98e1cf415c","apiVersion":"v1","resourceVersion":"1922"}, "reason": "LeaderElection", "message": "gpu-operator-7bfc5f55-wcmrf_8eec5cee-5770-491d-bfbc-29640144bd7e became leader"}
1.679399702134731e+09   DEBUG   events  Normal  {"object": {"kind":"Lease","namespace":"nvidia-gpu-operator","name":"53822513.nvidia.com","uid":"095e5442-8470-445e-8c7f-b750964ac866","apiVersion":"coordination.k8s.io/v1","resourceVersion":"1923"}, "reason": "LeaderElection", "message": "gpu-operator-7bfc5f55-wcmrf_8eec5cee-5770-491d-bfbc-29640144bd7e became leader"}
1.679399702134777e+09   INFO    controller.clusterpolicy-controller     Starting EventSource    {"source": "kind source: *v1.ClusterPolicy"}
1.6793997021348252e+09  INFO    controller.clusterpolicy-controller     Starting EventSource    {"source": "kind source: *v1.Node"}
1.6793997021348305e+09  INFO    controller.clusterpolicy-controller     Starting EventSource    {"source": "kind source: *v1.DaemonSet"}
1.6793997021348343e+09  INFO    controller.clusterpolicy-controller     Starting Controller
1.679399702235648e+09   INFO    controllers.ClusterPolicy       Reconciliate ClusterPolicies after node label update        {"nb": 1}
1.679399702235739e+09   INFO    controller.clusterpolicy-controller     Starting workers        {"worker count": 1}
1.6793997022375412e+09  INFO    controllers.ClusterPolicy       Operator metrics initialized.
1.6793997022376037e+09  INFO    controllers.ClusterPolicy       Getting assets from:    {"path:": "/opt/gpu-operator/pre-requisites"}
1.6793997022377877e+09  INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "RuntimeClass", "in path:": "/opt/gpu-operator/pre-requisites"}
1.6793997022379386e+09  INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "PodSecurityPolicy", "in path:": "/opt/gpu-operator/pre-requisites"}
1.6793997022382555e+09  INFO    controllers.ClusterPolicy       Getting assets from:    {"path:": "/opt/gpu-operator/state-operator-metrics"}
1.6793997022384405e+09  INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "Service", "in path:": "/opt/gpu-operator/state-operator-metrics"}
1.6793997022386987e+09  INFO    controllers.ClusterPolicy       Getting assets from:    {"path:": "/opt/gpu-operator/state-driver"}
1.679399702238845e+09   INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "ServiceAccount", "in path:": "/opt/gpu-operator/state-driver"}
1.6793997022389421e+09  INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "Role", "in path:": "/opt/gpu-operator/state-driver"}
1.6793997022391186e+09  INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "ClusterRole", "in path:": "/opt/gpu-operator/state-driver"}
1.6793997022392702e+09  INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "RoleBinding", "in path:": "/opt/gpu-operator/state-driver"}
1.6793997022394092e+09  INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "ClusterRoleBinding", "in path:": "/opt/gpu-operator/state-driver"}
1.6793997022395024e+09  INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "DaemonSet", "in path:": "/opt/gpu-operator/state-driver"}
1.6793997022415798e+09  INFO    controllers.ClusterPolicy       Getting assets from:    {"path:": "/opt/gpu-operator/state-container-toolkit"}
1.6793997022417176e+09  INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "ServiceAccount", "in path:": "/opt/gpu-operator/state-container-toolkit"}
1.6793997022417707e+09  INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "Role", "in path:": "/opt/gpu-operator/state-container-toolkit"}
1.6793997022418787e+09  INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "RoleBinding", "in path:": "/opt/gpu-operator/state-container-toolkit"}
1.6793997022419577e+09  INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "DaemonSet", "in path:": "/opt/gpu-operator/state-container-toolkit"}
1.6793997022423468e+09  INFO    controllers.ClusterPolicy       Getting assets from:    {"path:": "/opt/gpu-operator/state-operator-validation"}
1.679399702242489e+09   INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "ServiceAccount", "in path:": "/opt/gpu-operator/state-operator-validation"}
1.6793997022425394e+09  INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "Role", "in path:": "/opt/gpu-operator/state-operator-validation"}
1.6793997022426913e+09  INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "ClusterRole", "in path:": "/opt/gpu-operator/state-operator-validation"}
1.679399702242786e+09   INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "RoleBinding", "in path:": "/opt/gpu-operator/state-operator-validation"}
1.6793997022428567e+09  INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "ClusterRoleBinding", "in path:": "/opt/gpu-operator/state-operator-validation"}
1.6793997022429276e+09  INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "DaemonSet", "in path:": "/opt/gpu-operator/state-operator-validation"}
1.6793997022446988e+09  INFO    controllers.ClusterPolicy       Getting assets from:    {"path:": "/opt/gpu-operator/state-device-plugin"}
1.6793997022448952e+09  INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "ServiceAccount", "in path:": "/opt/gpu-operator/state-device-plugin"}
1.6793997022449656e+09  INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "Role", "in path:": "/opt/gpu-operator/state-device-plugin"}
1.6793997022450728e+09  INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "RoleBinding", "in path:": "/opt/gpu-operator/state-device-plugin"}
1.679399702245151e+09   INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "DaemonSet", "in path:": "/opt/gpu-operator/state-device-plugin"}
1.679399702245509e+09   INFO    controllers.ClusterPolicy       Getting assets from:    {"path:": "/opt/gpu-operator/state-dcgm-exporter"}
1.6793997022456825e+09  INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "ServiceAccount", "in path:": "/opt/gpu-operator/state-dcgm-exporter"}
1.6793997022457643e+09  INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "Role", "in path:": "/opt/gpu-operator/state-dcgm-exporter"}
1.6793997022458937e+09  INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "RoleBinding", "in path:": "/opt/gpu-operator/state-dcgm-exporter"}
1.6793997022459788e+09  INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "Service", "in path:": "/opt/gpu-operator/state-dcgm-exporter"}
1.6793997022460837e+09  INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "DaemonSet", "in path:": "/opt/gpu-operator/state-dcgm-exporter"}
1.6793997022463932e+09  INFO    controllers.ClusterPolicy       Getting assets from:    {"path:": "/opt/gpu-operator/gpu-feature-discovery"}
1.679399702246506e+09   INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "ServiceAccount", "in path:": "/opt/gpu-operator/gpu-feature-discovery"}
1.6793997022465627e+09  INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "Role", "in path:": "/opt/gpu-operator/gpu-feature-discovery"}
1.6793997022466557e+09  INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "RoleBinding", "in path:": "/opt/gpu-operator/gpu-feature-discovery"}
1.679399702246727e+09   INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "DaemonSet", "in path:": "/opt/gpu-operator/gpu-feature-discovery"}
1.679399702247027e+09   INFO    controllers.ClusterPolicy       Getting assets from:    {"path:": "/opt/gpu-operator/state-mig-manager"}
1.6793997022472117e+09  INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "ServiceAccount", "in path:": "/opt/gpu-operator/state-mig-manager"}
1.6793997022472625e+09  INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "Role", "in path:": "/opt/gpu-operator/state-mig-manager"}
1.679399702247414e+09   INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "ClusterRole", "in path:": "/opt/gpu-operator/state-mig-manager"}
1.6793997022474875e+09  INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "RoleBinding", "in path:": "/opt/gpu-operator/state-mig-manager"}
1.6793997022475688e+09  INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "ClusterRoleBinding", "in path:": "/opt/gpu-operator/state-mig-manager"}
1.6793997022476468e+09  INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "ConfigMap", "in path:": "/opt/gpu-operator/state-mig-manager"}
1.6793997022478027e+09  INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "ConfigMap", "in path:": "/opt/gpu-operator/state-mig-manager"}
1.6793997022478821e+09  INFO    controllers.ClusterPolicy       DEBUG: Looking for      {"Kind": "DaemonSet", "in path:": "/opt/gpu-operator/state-mig-manager"}
1.6793997022484095e+09  INFO    controllers.ClusterPolicy       Checking GPU state labels on the node   {"NodeName": "admin-ops01"}
1.6793997022484212e+09  INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.container-toolkit", " value=": "true"}
1.6793997022484248e+09  INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.device-plugin", " value=": "true"}
1.6793997022484279e+09  INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.dcgm", " value=": "true"}
1.6793997022484303e+09  INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.dcgm-exporter", " value=": "true"}
1.679399702248433e+09   INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.node-status-exporter", " value=": "true"}
1.6793997022484362e+09  INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.operator-validator", " value=": "true"}
1.6793997022484434e+09  INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.driver", " value=": "true"}
1.6793997022484462e+09  INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.gpu-feature-discovery", " value=": "true"}
1.6793997022484522e+09  INFO    controllers.ClusterPolicy       Number of nodes with GPU label  {"NodeCount": 1}
1.6793997022484884e+09  INFO    controllers.ClusterPolicy       Using container runtime: containerd
1.6793997022496712e+09  INFO    KubeAPIWarningLogger    node.k8s.io/v1beta1 RuntimeClass is deprecated in v1.22+, unavailable in v1.25+
1.6793997023495245e+09  INFO    controllers.ClusterPolicy       Found Resource, updating...     {"RuntimeClass": "nvidia"}
1.6793997023519964e+09  INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed      {"state:": "pre-requisites", "status": "ready"}
1.6793997024530985e+09  INFO    controllers.ClusterPolicy       Found Resource, updating...     {"Service": "gpu-operator", "Namespace": "nvidia-gpu-operator"}
1.6793997024557748e+09  INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed      {"state:": "state-operator-metrics", "status": "ready"}
1.6793997024580107e+09  INFO    controllers.ClusterPolicy       Found Resource, skipping update {"ServiceAccount": "nvidia-driver", "Namespace": "nvidia-gpu-operator"}
1.6793997024605198e+09  INFO    controllers.ClusterPolicy       Found Resource, updating...     {"Role": "nvidia-driver", "Namespace": "nvidia-gpu-operator"}
1.67939970246576e+09    INFO    controllers.ClusterPolicy       Found Resource, updating...     {"ClusterRole": "nvidia-driver", "Namespace": "nvidia-gpu-operator"}
1.6793997024696126e+09  INFO    controllers.ClusterPolicy       Found Resource, updating...     {"RoleBinding": "nvidia-driver", "Namespace": "nvidia-gpu-operator"}
1.6793997024730349e+09  INFO    controllers.ClusterPolicy       Found Resource, updating...     {"ClusterRoleBinding": "nvidia-driver", "Namespace": "nvidia-gpu-operator"}
1.6793997025756822e+09  INFO    controllers.ClusterPolicy       5.4.0-77-generic        {"Request.Namespace": "default", "Request.Name": "Node"}
1.6793997025763001e+09  INFO    controllers.ClusterPolicy       DaemonSet identical, skipping update    {"DaemonSet": "nvidia-driver-daemonset", "Namespace": "nvidia-gpu-operator", "name": "nvidia-driver-daemonset"}
1.679399702576313e+09   INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"LabelSelector": "app=nvidia-driver-daemonset"}
1.6793997025763402e+09  INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"NumberOfDaemonSets": 1}
1.6793997025763452e+09  INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"NumberUnavailable": 1}
1.6793997025763497e+09  INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed      {"state:": "state-driver", "status": "notReady"}
1.679399702579127e+09   INFO    controllers.ClusterPolicy       Found Resource, skipping update {"ServiceAccount": "nvidia-container-toolkit", "Namespace": "nvidia-gpu-operator"}
1.6793997025814555e+09  INFO    controllers.ClusterPolicy       Found Resource, updating...     {"Role": "nvidia-container-toolkit", "Namespace": "nvidia-gpu-operator"}
1.6793997025849078e+09  INFO    controllers.ClusterPolicy       Found Resource, updating...     {"RoleBinding": "nvidia-container-toolkit", "Namespace": "nvidia-gpu-operator"}
1.6793997025871568e+09  INFO    controllers.ClusterPolicy       DaemonSet identical, skipping update    {"DaemonSet": "nvidia-container-toolkit-daemonset", "Namespace": "nvidia-gpu-operator", "name": "nvidia-container-toolkit-daemonset"}
1.6793997025871735e+09  INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"LabelSelector": "app=nvidia-container-toolkit-daemonset"}
1.6793997025872092e+09  INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"NumberOfDaemonSets": 1}
1.6793997025872154e+09  INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"NumberUnavailable": 1}
1.679399702587219e+09   INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed      {"state:": "state-container-toolkit", "status": "notReady"}
1.679399702589382e+09   INFO    controllers.ClusterPolicy       Found Resource, skipping update {"ServiceAccount": "nvidia-operator-validator", "Namespace": "nvidia-gpu-operator"}
1.6793997025912428e+09  INFO    controllers.ClusterPolicy       Found Resource, updating...     {"Role": "nvidia-operator-validator", "Namespace": "nvidia-gpu-operator"}
1.6793997025946085e+09  INFO    controllers.ClusterPolicy       Found Resource, updating...     {"ClusterRole": "nvidia-operator-validator", "Namespace": "nvidia-gpu-operator"}
1.6793997026150193e+09  INFO    controllers.ClusterPolicy       Found Resource, updating...     {"RoleBinding": "nvidia-operator-validator", "Namespace": "nvidia-gpu-operator"}
1.6793997026281643e+09  INFO    controllers.ClusterPolicy       Found Resource, updating...     {"ClusterRoleBinding": "nvidia-operator-validator", "Namespace": "nvidia-gpu-operator"}
1.6793997026304219e+09  INFO    controllers.ClusterPolicy       DaemonSet identical, skipping update    {"DaemonSet": "nvidia-operator-validator", "Namespace": "nvidia-gpu-operator", "name": "nvidia-operator-validator"}
1.6793997026304383e+09  INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"LabelSelector": "app=nvidia-operator-validator"}
1.6793997026304753e+09  INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"NumberOfDaemonSets": 1}
1.6793997026304796e+09  INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"NumberUnavailable": 1}
1.6793997026304848e+09  INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed      {"state:": "state-operator-validation", "status": "notReady"}
1.6793997026374981e+09  INFO    controllers.ClusterPolicy       Found Resource, skipping update {"ServiceAccount": "nvidia-device-plugin", "Namespace": "nvidia-gpu-operator"}
1.6793997026450212e+09  INFO    controllers.ClusterPolicy       Found Resource, updating...     {"Role": "nvidia-device-plugin", "Namespace": "nvidia-gpu-operator"}
1.6793997026487129e+09  INFO    controllers.ClusterPolicy       Found Resource, updating...     {"RoleBinding": "nvidia-device-plugin", "Namespace": "nvidia-gpu-operator"}
1.6793997026506011e+09  INFO    controllers.ClusterPolicy       DaemonSet identical, skipping update    {"DaemonSet": "nvidia-device-plugin-daemonset", "Namespace": "nvidia-gpu-operator", "name": "nvidia-device-plugin-daemonset"}
1.6793997026506176e+09  INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"LabelSelector": "app=nvidia-device-plugin-daemonset"}
1.6793997026506524e+09  INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"NumberOfDaemonSets": 1}
1.6793997026506586e+09  INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"NumberUnavailable": 1}
1.6793997026506624e+09  INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed      {"state:": "state-device-plugin", "status": "notReady"}
1.6793997026528385e+09  INFO    controllers.ClusterPolicy       Found Resource, skipping update {"ServiceAccount": "nvidia-dcgm-exporter", "Namespace": "nvidia-gpu-operator"}
1.6793997026547954e+09  INFO    controllers.ClusterPolicy       Found Resource, updating...     {"Role": "nvidia-dcgm-exporter", "Namespace": "nvidia-gpu-operator"}
1.6793997026581383e+09  INFO    controllers.ClusterPolicy       Found Resource, updating...     {"RoleBinding": "nvidia-dcgm-exporter", "Namespace": "nvidia-gpu-operator"}
1.679399702659721e+09   INFO    controllers.ClusterPolicy       Found Resource, updating...     {"Service": "nvidia-dcgm-exporter", "Namespace": "nvidia-gpu-operator"}
1.679399702661916e+09   INFO    controllers.ClusterPolicy       DaemonSet identical, skipping update    {"DaemonSet": "nvidia-dcgm-exporter", "Namespace": "nvidia-gpu-operator", "name": "nvidia-dcgm-exporter"}
1.679399702661933e+09   INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"LabelSelector": "app=nvidia-dcgm-exporter"}
1.6793997026619601e+09  INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"NumberOfDaemonSets": 1}
1.6793997026619678e+09  INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"NumberUnavailable": 1}
1.6793997026619718e+09  INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed      {"state:": "state-dcgm-exporter", "status": "notReady"}
1.6793997026638792e+09  INFO    controllers.ClusterPolicy       Found Resource, skipping update {"ServiceAccount": "nvidia-gpu-feature-discovery", "Namespace": "nvidia-gpu-operator"}
1.6793997026656861e+09  INFO    controllers.ClusterPolicy       Found Resource, updating...     {"Role": "nvidia-gpu-feature-discovery", "Namespace": "nvidia-gpu-operator"}
1.6793997026689951e+09  INFO    controllers.ClusterPolicy       Found Resource, updating...     {"RoleBinding": "nvidia-gpu-feature-discovery", "Namespace": "nvidia-gpu-operator"}
1.6793997026706855e+09  INFO    controllers.ClusterPolicy       DaemonSet identical, skipping update    {"DaemonSet": "gpu-feature-discovery", "Namespace": "nvidia-gpu-operator", "name": "gpu-feature-discovery"}
1.6793997026707032e+09  INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"LabelSelector": "app=gpu-feature-discovery"}
1.6793997026707256e+09  INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"NumberOfDaemonSets": 1}
1.6793997026707299e+09  INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"NumberUnavailable": 1}
1.6793997026707332e+09  INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed      {"state:": "gpu-feature-discovery", "status": "notReady"}
1.6793997026725569e+09  INFO    controllers.ClusterPolicy       Found Resource, skipping update {"ServiceAccount": "nvidia-mig-manager", "Namespace": "nvidia-gpu-operator"}
1.6793997026743934e+09  INFO    controllers.ClusterPolicy       Found Resource, updating...     {"Role": "nvidia-mig-manager", "Namespace": "nvidia-gpu-operator"}
1.6793997026778245e+09  INFO    controllers.ClusterPolicy       Found Resource, updating...     {"ClusterRole": "nvidia-mig-manager", "Namespace": "nvidia-gpu-operator"}
1.6793997026810489e+09  INFO    controllers.ClusterPolicy       Found Resource, updating...     {"RoleBinding": "nvidia-mig-manager", "Namespace": "nvidia-gpu-operator"}
1.6793997026845417e+09  INFO    controllers.ClusterPolicy       Found Resource, updating...     {"ClusterRoleBinding": "nvidia-mig-manager", "Namespace": "nvidia-gpu-operator"}
1.6793997026880655e+09  INFO    controllers.ClusterPolicy       Found Resource, updating...     {"ConfigMap": "default-mig-parted-config", "Namespace": "nvidia-gpu-operator"}
1.6793997026913178e+09  INFO    controllers.ClusterPolicy       Found Resource, updating...     {"ConfigMap": "default-gpu-clients", "Namespace": "nvidia-gpu-operator"}
1.6793997026933584e+09  INFO    controllers.ClusterPolicy       DaemonSet identical, skipping update    {"DaemonSet": "nvidia-mig-manager", "Namespace": "nvidia-gpu-operator", "name": "nvidia-mig-manager"}
1.6793997026933737e+09  INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"LabelSelector": "app=nvidia-mig-manager"}
1.6793997026934001e+09  INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"NumberOfDaemonSets": 1}
1.6793997026934044e+09  INFO    controllers.ClusterPolicy       DEBUG: DaemonSet        {"NumberUnavailable": 0}
1.679399702693413e+09   INFO    controllers.ClusterPolicy       INFO: ClusterPolicy step completed      {"state:": "state-mig-manager", "status": "ready"}
1.6793997026934242e+09  INFO    controllers.ClusterPolicy       ClusterPolicy isn't ready       {"states not ready": ["state-driver", "state-container-toolkit", "state-operator-validation", "state-device-plugin", "state-dcgm-exporter", "gpu-feature-discovery"]}
E0321 11:55:06.433653       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: Failed to watch *v1.ClusterPolicy: the server could not find the requested resource (get clusterpolicies.nvidia.com)
W0321 11:55:07.274819       1 reflector.go:324] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: failed to list *v1.ClusterPolicy: the server could not find the requested resource (get clusterpolicies.nvidia.com)
E0321 11:55:07.274843       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: Failed to watch *v1.ClusterPolicy: failed to list *v1.ClusterPolicy: the server could not find the requested resource (get clusterpolicies.nvidia.com)
W0321 11:55:10.026237       1 reflector.go:324] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: failed to list *v1.ClusterPolicy: the server could not find the requested resource (get clusterpolicies.nvidia.com)
E0321 11:55:10.026261       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: Failed to watch *v1.ClusterPolicy: failed to list *v1.ClusterPolicy: the server could not find the requested resource (get clusterpolicies.nvidia.com)
W0321 11:55:15.394434       1 reflector.go:324] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: failed to list *v1.ClusterPolicy: the server could not find the requested resource (get clusterpolicies.nvidia.com)
E0321 11:55:15.394467       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: Failed to watch *v1.ClusterPolicy: failed to list *v1.ClusterPolicy: the server could not find the requested resource (get clusterpolicies.nvidia.com)
W0321 11:55:24.354989       1 reflector.go:324] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: failed to list *v1.ClusterPolicy: the server could not find the requested resource (get clusterpolicies.nvidia.com)
E0321 11:55:24.355010       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: Failed to watch *v1.ClusterPolicy: failed to list *v1.ClusterPolicy: the server could not find the requested resource (get clusterpolicies.nvidia.com)
W0321 11:55:43.379049       1 reflector.go:324] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: failed to list *v1.ClusterPolicy: the server could not find the requested resource (get clusterpolicies.nvidia.com)
E0321 11:55:43.379074       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: Failed to watch *v1.ClusterPolicy: failed to list *v1.ClusterPolicy: the server could not find the requested resource (get clusterpolicies.nvidia.com)
W0321 11:56:10.261695       1 reflector.go:324] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: failed to list *v1.ClusterPolicy: the server could not find the requested resource (get clusterpolicies.nvidia.com)
E0321 11:56:10.261721       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: Failed to watch *v1.ClusterPolicy: failed to list *v1.ClusterPolicy: the server could not find the requested resource (get clusterpolicies.nvidia.com)
W0321 11:56:44.050761       1 reflector.go:324] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: failed to list *v1.ClusterPolicy: the server could not find the requested resource (get clusterpolicies.nvidia.com)
E0321 11:56:44.050784       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: Failed to watch *v1.ClusterPolicy: failed to list *v1.ClusterPolicy: the server could not find the requested resource (get clusterpolicies.nvidia.com)
W0321 11:57:39.880717       1 reflector.go:324] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: failed to list *v1.ClusterPolicy: the server could not find the requested resource (get clusterpolicies.nvidia.com)
E0321 11:57:39.880741       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: Failed to watch *v1.ClusterPolicy: failed to list *v1.ClusterPolicy: the server could not find the requested resource (get clusterpolicies.nvidia.com)
W0321 11:58:34.520072       1 reflector.go:324] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: failed to list *v1.ClusterPolicy: the server could not find the requested resource (get clusterpolicies.nvidia.com)
E0321 11:58:34.520094       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: Failed to watch *v1.ClusterPolicy: failed to list *v1.ClusterPolicy: the server could not find the requested resource (get clusterpolicies.nvidia.com)
W0321 11:59:05.309809       1 reflector.go:324] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: failed to list *v1.ClusterPolicy: the server could not find the requested resource (get clusterpolicies.nvidia.com)
E0321 11:59:05.309831       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: Failed to watch *v1.ClusterPolicy: failed to list *v1.ClusterPolicy: the server could not find the requested resource (get clusterpolicies.nvidia.com)

How can I resolve this problem?

So there are no failed pods, right?

You can ignore it if there are no failed pods now.
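
One way to check this is sketched below. It assumes kubectl is already configured for the cluster and uses the nvidia-gpu-operator namespace that appears in the log above; the function names are only illustrative. It reports whether the ClusterPolicy CRD is still registered and lists any pod whose status is neither Running nor Completed.

```shell
#!/usr/bin/env bash
# Namespace taken from the operator log in this thread.
NS="nvidia-gpu-operator"

# Print "<pod> <status>" for every pod that is neither Running nor Completed.
# Column 3 of `kubectl get pods --no-headers` is the STATUS column.
list_unhealthy_pods() {
  kubectl get pods -n "$NS" --no-headers 2>/dev/null \
    | awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Report whether the ClusterPolicy CRD is still registered with the API server.
crd_status() {
  if kubectl get crd clusterpolicies.nvidia.com >/dev/null 2>&1; then
    echo "present"
  else
    echo "missing"
  fi
}

echo "ClusterPolicy CRD: $(crd_status)"
list_unhealthy_pods
```

If the CRD is reported as missing, that would match the repeated "could not find the requested resource" messages in the log.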

However, when I ran kubectl describe node, the output showed that my node has only CPU allocated and no GPU.

To train with AutoML, my API node needs to have a GPU allocated.
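
For reference, a quicker way to see this than reading the full kubectl describe node output might look like the sketch below. It assumes kubectl is configured for the cluster and that the GPU is advertised under the usual nvidia.com/gpu resource name; the function name is made up for illustration.

```shell
#!/usr/bin/env bash
# Print each node with its allocatable nvidia.com/gpu count.
# kubectl prints "<none>" in the column when the node advertises no such resource.
node_gpu_report() {
  kubectl get nodes --no-headers \
    -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' 2>/dev/null \
    | awk '{
        state = ($2 == "<none>" || $2 == "0") ? "no GPU allocatable" : $2 " GPU(s) allocatable"
        print $1 ": " state
      }'
}

node_gpu_report
```

A node that shows "no GPU allocatable" here is consistent with the device plugin from nvidia-gpu-operator not running on it.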

Could I restore the missing nvidia-gpu-operator resources via helm install?

Please uninstall the driver.

sudo apt purge nvidia-driver-525
sudo apt autoremove
sudo apt autoclean

Then run the following:
$ bash setup.sh check-inventory.yml
$ bash setup.sh install

Then share the logs.
