Yes. The dataset was created and uploaded successfully.
Also, I found that one of the pods is in CrashLoopBackOff status. The picture shows the log contents.
Please share the full log with us.
$ kubectl logs -n nvidia-gpu-operator gpu-operator-7bfc5f55-8577v
1.679380604772754e+09 INFO controller-runtime.metrics Metrics server is starting to listen {"addr": ":8080"}
1.6793806047730486e+09 INFO setup starting manager
1.67938060477328e+09 INFO Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
1.6793806047733126e+09 INFO Starting server {"kind": "health probe", "addr": "[::]:8081"}
I0321 06:36:44.773357 1 leaderelection.go:248] attempting to acquire leader lease nvidia-gpu-operator/53822513.nvidia.com...
I0321 06:36:59.846065 1 leaderelection.go:258] successfully acquired lease nvidia-gpu-operator/53822513.nvidia.com
1.6793806198461287e+09 DEBUG events Normal {"object": {"kind":"ConfigMap","namespace":"nvidia-gpu-operator","name":"53822513.nvidia.com","uid":"361ff16d-b58e-4204-b47a-86e4fff31f1c","apiVersion":"v1","resourceVersion":"168810"}, "reason": "LeaderElection", "message": "gpu-operator-7bfc5f55-8577v_8de61e82-40e9-4562-a2e1-68c51ff1fe84 became leader"}
1.6793806198463733e+09 INFO controller.clusterpolicy-controller Starting EventSource {"source": "kind source: *v1.ClusterPolicy"}
1.6793806198463368e+09 DEBUG events Normal {"object": {"kind":"Lease","namespace":"nvidia-gpu-operator","name":"53822513.nvidia.com","uid":"ff021974-19a5-44ec-9dbb-b1da19e1202b","apiVersion":"coordination.k8s.io/v1","resourceVersion":"168811"}, "reason": "LeaderElection", "message": "gpu-operator-7bfc5f55-8577v_8de61e82-40e9-4562-a2e1-68c51ff1fe84 became leader"}
1.6793806198463988e+09 INFO controller.clusterpolicy-controller Starting EventSource {"source": "kind source: *v1.Node"}
1.6793806198464212e+09 INFO controller.clusterpolicy-controller Starting EventSource {"source": "kind source: *v1.DaemonSet"}
1.6793806198466964e+09 INFO controller.clusterpolicy-controller Starting Controller
1.6793806200498693e+09 ERROR controller-runtime.source if kind is a CRD, it should be installed before calling Start {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.poll
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:580
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
I0321 06:37:01.197691 1 request.go:665] Waited for 1.045058702s due to client-side throttling, not priority and fairness, request: GET:https://10.96.0.1:443/apis/autoscaling/v2?timeout=32s
1.6793806217501187e+09 ERROR controllers.ClusterPolicy Unable to list ClusterPolicies {"error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/handler.(*enqueueRequestsFromMapFunc).mapAndEnqueue
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/handler/enqueue_mapped.go:80
sigs.k8s.io/controller-runtime/pkg/handler.(*enqueueRequestsFromMapFunc).Create
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/handler/enqueue_mapped.go:57
sigs.k8s.io/controller-runtime/pkg/source/internal.EventHandler.OnAdd
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/internal/eventsource.go:63
k8s.io/client-go/tools/cache.(*processorListener).run.func1
/workspace/vendor/k8s.io/client-go/tools/cache/shared_informer.go:787
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*processorListener).run
/workspace/vendor/k8s.io/client-go/tools/cache/shared_informer.go:781
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:73
1.6793806302524693e+09 ERROR controller-runtime.source if kind is a CRD, it should be installed before calling Start {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
1.679380640252055e+09 ERROR controller-runtime.source if kind is a CRD, it should be installed before calling Start {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
1.6793806502524986e+09 ERROR controller-runtime.source if kind is a CRD, it should be installed before calling Start {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
1.6793806602530806e+09 ERROR controller-runtime.source if kind is a CRD, it should be installed before calling Start {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
1.679380670252145e+09 ERROR controller-runtime.source if kind is a CRD, it should be installed before calling Start {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
1.6793806802531853e+09 ERROR controller-runtime.source if kind is a CRD, it should be installed before calling Start {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
1.6793806902525237e+09 ERROR controller-runtime.source if kind is a CRD, it should be installed before calling Start {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
1.6793807002526197e+09 ERROR controller-runtime.source if kind is a CRD, it should be installed before calling Start {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
1.6793807102526753e+09 ERROR controller-runtime.source if kind is a CRD, it should be installed before calling Start {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
1.67938072025275e+09 ERROR controller-runtime.source if kind is a CRD, it should be installed before calling Start {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
1.6793807302527254e+09 ERROR controller-runtime.source if kind is a CRD, it should be installed before calling Start {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
1.6793807398470001e+09 ERROR controller.clusterpolicy-controller Could not wait for Cache to sync {"error": "failed to wait for clusterpolicy-controller caches to sync: timed out waiting for cache to be synced"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:208
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:218
1.6793807398471012e+09 INFO Stopping and waiting for non leader election runnables
1.679380739847109e+09 INFO Stopping and waiting for leader election runnables
1.679380739847114e+09 INFO Stopping and waiting for caches
1.6793807398471637e+09 INFO Stopping and waiting for webhooks
1.6793807398471868e+09 INFO Wait completed, proceeding to shutdown the manager
1.6793807398472097e+09 ERROR setup problem running manager {"error": "failed to wait for clusterpolicy-controller caches to sync: timed out waiting for cache to be synced"}
main.main
/workspace/main.go:118
runtime.main
/usr/local/go/src/runtime/proc.go:255
Please open a new terminal and run the command below.
$ kubectl delete crd clusterpolicies.nvidia.com
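Once that delete returns, it may help to confirm the stale CRD is really gone and give the operator pod a fresh start. A minimal check, assuming the same pod name as above (adjust it to whatever your cluster shows):
# should report "NotFound" after the delete
$ kubectl get crd clusterpolicies.nvidia.com
# optionally remove the crashing pod so the ReplicaSet recreates it right away
$ kubectl delete pod -n nvidia-gpu-operator gpu-operator-7bfc5f55-8577v
# watch until the new pod reaches Running without further restarts
$ kubectl get pods -n nvidia-gpu-operator -w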
A similar error log was also reported in another topic: TAO Toolkit 4.0 setup issue - #18 by mykim4
Did you set up TAO API successfully?
Refer to AutoML - NVIDIA Docs
and blog https://developer.nvidia.com/blog/training-like-an-ai-pro-using-tao-automl/
Not needed. Can you share the log below as well?
$ kubectl describe pod -n nvidia-gpu-operator gpu-operator-7bfc5f55-8577v
Name: gpu-operator-7bfc5f55-8577v
Namespace: nvidia-gpu-operator
Priority: 2000001000
Priority Class Name: system-node-critical
Node: admin-ops01/192.168.101.8
Start Time: Mon, 20 Mar 2023 08:54:16 +0000
Labels: app=gpu-operator
app.kubernetes.io/component=gpu-operator
pod-template-hash=7bfc5f55
Annotations: cni.projectcalico.org/containerID: 20a1fb8cccdaaefeada46ef94eeb1902c00f063dd06f17c8db2e9ba49b6a98cb
cni.projectcalico.org/podIP: 192.168.33.118/32
cni.projectcalico.org/podIPs: 192.168.33.118/32
openshift.io/scc: restricted-readonly
Status: Running
IP: 192.168.33.118
IPs:
IP: 192.168.33.118
Controlled By: ReplicaSet/gpu-operator-7bfc5f55
Containers:
gpu-operator:
Container ID: containerd://3f2ec1c212505150c32e325401d9441ae44b291bdc8e378ded60da1c9a01b5ca
Image: nvcr.io/nvidia/gpu-operator:v1.10.1
Image ID: nvcr.io/nvidia/gpu-operator@sha256:c7f9074c1a7f58947c807f23f2eece3a8b04e11175127919156f8e864821d45a
Port: 8080/TCP
Host Port: 0/TCP
Command:
gpu-operator
Args:
--leader-elect
State: Running
Started: Tue, 21 Mar 2023 08:56:58 +0000
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Tue, 21 Mar 2023 08:49:28 +0000
Finished: Tue, 21 Mar 2023 08:51:46 +0000
Ready: True
Restart Count: 199
Limits:
cpu: 500m
memory: 350Mi
Requests:
cpu: 200m
memory: 100Mi
Liveness: http-get http://:8081/healthz delay=15s timeout=1s period=20s #success=1 #failure=3
Readiness: http-get http://:8081/readyz delay=5s timeout=1s period=10s #success=1 #failure=3
Environment:
WATCH_NAMESPACE:
OPERATOR_NAMESPACE: nvidia-gpu-operator (v1:metadata.namespace)
Mounts:
/host-etc/os-release from host-os-release (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7r8cx (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
host-os-release:
Type: HostPath (bare host directory volume)
Path: /etc/os-release
HostPathType:
kube-api-access-7r8cx:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 85s (x4695 over 23h) kubelet Back-off restarting failed container
Could I run
bash setup.sh uninstall
and then try the methods you mentioned before?
Yes, you can, to double-check.
Then check whether there is still a failed pod when you run “kubectl get pods -A”.
Just set up TAO-API again. There is no need to run AutoML and its notebook.
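If you do uninstall first, a possible sequence (just a sketch, assuming the same setup.sh and check-inventory.yml used elsewhere in this thread) is:
$ bash setup.sh uninstall
$ bash setup.sh check-inventory.yml
$ bash setup.sh install
# afterwards every pod should be Running or Completed
$ kubectl get pods -A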
Is there any method that does not require reinstalling?
I think you have already fixed the failed pod issue.
Can you share “kubectl get pods -A” ?
No, I haven’t fixed it yet. The failed pod still shows the error messages:
if kind is a CRD, it should be installed before calling Start {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
problem running manager {"error": "failed to wait for clusterpolicy-controller caches to sync: timed out waiting for cache to be synced"}
main.main
and the output of the command kubectl get pods -A still looks like this post
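Since the error is “no matches for kind ClusterPolicy”, it may also be worth checking whether the ClusterPolicy CRD is actually registered on the cluster before reinstalling. A quick diagnostic (just a suggestion, not an official step):
# list any NVIDIA-related CRDs currently installed
$ kubectl get crd | grep -i nvidia
# check whether the nvidia.com API group the operator expects is being served
$ kubectl api-resources --api-group=nvidia.com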
OK. And do you have another kind of machine on hand?
I find that https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_setup.html#bare-metal-setup mentions
Also, could you upload the log from when you set up TAO-API?
$ bash setup.sh check-inventory.yml
$ bash setup.sh install
You can upload it via the upload button.
The machine I am using now has 4 NVIDIA Tesla P100 SXM2 16GB GPUs.
How can I save the log when I set up TAO-API?
You can copy the log from the terminal and then upload it as a txt file.
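If copying from the terminal is inconvenient, one alternative (assuming a bash shell; the file name below is just an example) is to capture the output while the setup runs:
# save stdout and stderr to a file while still printing them to the terminal
$ bash setup.sh install 2>&1 | tee tao_api_install.log
The resulting tao_api_install.log can then be attached here as a txt file.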
Hi,
Please uninstall the driver.
sudo apt purge nvidia-driver-525
sudo apt autoremove
sudo apt autoclean
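Optionally, before reinstalling, you can confirm the driver is fully removed (a quick sanity check, not a required step):
# should print nothing once the driver packages are purged
$ dpkg -l | grep -i nvidia-driver
# expected to fail once the driver is removed
$ nvidia-smi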
Then, run the commands below.
$ bash setup.sh check-inventory.yml
$ bash setup.sh install
And share the logs.