I got the error message: clusterpolicies.nvidia.com not found
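This refers to the GPU Operator's ClusterPolicy custom resource definition. Whether it is actually present can be checked directly, for example:
$ kubectl get crd clusterpolicies.nvidia.com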
Did you set up TAO API successfully?
Refer to AutoML - NVIDIA Docs
and blog https://developer.nvidia.com/blog/training-like-an-ai-pro-using-tao-automl/
Not needed. Can you share the log below as well?
$ kubectl describe pod -n nvidia-gpu-operator gpu-operator-7bfc5f55-8577v
Name: gpu-operator-7bfc5f55-8577v
Namespace: nvidia-gpu-operator
Priority: 2000001000
Priority Class Name: system-node-critical
Node: admin-ops01/192.168.101.8
Start Time: Mon, 20 Mar 2023 08:54:16 +0000
Labels: app=gpu-operator
app.kubernetes.io/component=gpu-operator
pod-template-hash=7bfc5f55
Annotations: cni.projectcalico.org/containerID: 20a1fb8cccdaaefeada46ef94eeb1902c00f063dd06f17c8db2e9ba49b6a98cb
cni.projectcalico.org/podIP: 192.168.33.118/32
cni.projectcalico.org/podIPs: 192.168.33.118/32
openshift.io/scc: restricted-readonly
Status: Running
IP: 192.168.33.118
IPs:
IP: 192.168.33.118
Controlled By: ReplicaSet/gpu-operator-7bfc5f55
Containers:
gpu-operator:
Container ID: containerd://3f2ec1c212505150c32e325401d9441ae44b291bdc8e378ded60da1c9a01b5ca
Image: nvcr.io/nvidia/gpu-operator:v1.10.1
Image ID: nvcr.io/nvidia/gpu-operator@sha256:c7f9074c1a7f58947c807f23f2eece3a8b04e11175127919156f8e864821d45a
Port: 8080/TCP
Host Port: 0/TCP
Command:
gpu-operator
Args:
--leader-elect
State: Running
Started: Tue, 21 Mar 2023 08:56:58 +0000
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Tue, 21 Mar 2023 08:49:28 +0000
Finished: Tue, 21 Mar 2023 08:51:46 +0000
Ready: True
Restart Count: 199
Limits:
cpu: 500m
memory: 350Mi
Requests:
cpu: 200m
memory: 100Mi
Liveness: http-get http://:8081/healthz delay=15s timeout=1s period=20s #success=1 #failure=3
Readiness: http-get http://:8081/readyz delay=5s timeout=1s period=10s #success=1 #failure=3
Environment:
WATCH_NAMESPACE:
OPERATOR_NAMESPACE: nvidia-gpu-operator (v1:metadata.namespace)
Mounts:
/host-etc/os-release from host-os-release (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7r8cx (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
host-os-release:
Type: HostPath (bare host directory volume)
Path: /etc/os-release
HostPathType:
kube-api-access-7r8cx:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 85s (x4695 over 23h) kubelet Back-off restarting failed container
Could I run
bash setup.sh uninstall
and then run the methods you mentioned before?
Yes, you can, to double-check.
Then see whether there is still a failed pod when you run "kubectl get pods -A".
Just set up TAO-API again. There is no need to run AutoML and its notebook.
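In other words, roughly this sequence; the script arguments are the same ones that appear later in this thread, so adjust the inventory file name to yours:
$ bash setup.sh uninstall
$ bash setup.sh check-inventory.yml
$ bash setup.sh install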
Is there any method that does not require reinstalling?
I think you have already fixed the failed pod issue.
Can you share “kubectl get pods -A” ?
No, I haven't fixed it yet. The failed pod still shows the error message:
if kind is a CRD, it should be installed before calling Start {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
problem running manager {"error": "failed to wait for clusterpolicy-controller caches to sync: timed out waiting for cache to be synced"}
main.main
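For reference, these lines come from the gpu-operator pod's log; they can be pulled with something like the following, using the pod name from the describe output above:
$ kubectl logs -n nvidia-gpu-operator gpu-operator-7bfc5f55-8577v --previous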
Also, the output of the command kubectl get pods -A still looks like the one in this post.
OK. And do you have another kind of machine on hand?
I find that https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_setup.html#bare-metal-setup mentions
- 1 NVIDIA Discrete GPU: Volta, Turing, Ampere, Hopper architecture
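A quick way to check which architecture a GPU belongs to is to query it with nvidia-smi; note that the compute_cap field needs a reasonably recent driver, so treat this as a rough sketch:
$ nvidia-smi --query-gpu=name,compute_cap --format=csv
Pascal cards such as the Tesla P100 report compute capability 6.0, while Volta is 7.0, Turing 7.5, and Ampere 8.x.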
Also, could you upload the log from when you set up TAO-API?
$ bash setup.sh check-inventory.yml
$ bash setup.sh install
You can upload it via the upload button.
The machine I am using now has 4 NVIDIA Tesla P100 SXM2 16GB GPUs.
What should I do to save the log when I set up TAO-API?
You can copy the log from the terminal and then upload it as a txt file.
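Alternatively, the whole installation output can be captured into a file directly; the file names here are only examples:
$ bash setup.sh check-inventory.yml 2>&1 | tee check-inventory.log
$ bash setup.sh install 2>&1 | tee tao-api-install.log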
Hi,
Please uninstall the driver.
sudo apt purge nvidia-driver-525
sudo apt autoremove
sudo apt autoclean
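Optionally, before reinstalling, you can confirm that the driver packages and the kernel module are really gone; this is only a sanity check, not part of the official steps:
$ dpkg -l | grep -i nvidia-driver
$ lsmod | grep nvidia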
Then, run below
$ bash setup.sh check-inventory.yml
$ bash setup.sh install
And share the logs.
At the stage of TASK [Waiting for the Cluster to become available], I found that I could not rmmod the nvidia module.
When I ran the command: sudo rmmod nvidia, I got the error message
rmmod: ERROR: Module nvidia is in use
What can I do to deal with this problem?
The picture below shows the error log at the stage of TASK [Waiting for the Cluster to become available].
The picture below shows the last components related to the driver.
It gets stuck here, right?
Exactly, I still get stuck here.
Please try opening a new terminal and running the command below.
$ kubectl delete crd clusterpolicies.nvidia.com
If I did that, it would terminate and remove the GPU driver and the GPU-related network plugin.
Is there any method to deal with this?
There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.
This command does not affect the driver.
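If you want to double-check afterwards, you can list the remaining nvidia.com CRDs and confirm the driver is still loaded, for example:
$ kubectl get crd | grep nvidia.com
$ nvidia-smi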