AutoML training speed and GPU problem

Yes. The dataset was created and uploaded successfully.

Also, I found that one of the pods is in CrashLoopBackOff status. The picture shows the log contents.

Please share the full log with us.
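Since the pod is crash-looping, the log from the previous (crashed) container instance can also be useful (optional; this uses kubectl's --previous flag with the pod name shown below):
$ kubectl logs -n nvidia-gpu-operator gpu-operator-7bfc5f55-8577v --previous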

$ kubectl logs -n nvidia-gpu-operator gpu-operator-7bfc5f55-8577v

1.679380604772754e+09   INFO    controller-runtime.metrics      Metrics server is starting to listen    {"addr": ":8080"}
1.6793806047730486e+09  INFO    setup   starting manager
1.67938060477328e+09    INFO    Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
1.6793806047733126e+09  INFO    Starting server {"kind": "health probe", "addr": "[::]:8081"}
I0321 06:36:44.773357       1 leaderelection.go:248] attempting to acquire leader lease nvidia-gpu-operator/53822513.nvidia.com...
I0321 06:36:59.846065       1 leaderelection.go:258] successfully acquired lease nvidia-gpu-operator/53822513.nvidia.com
1.6793806198461287e+09  DEBUG   events  Normal  {"object": {"kind":"ConfigMap","namespace":"nvidia-gpu-operator","name":"53822513.nvidia.com","uid":"361ff16d-b58e-4204-b47a-86e4fff31f1c","apiVersion":"v1","resourceVersion":"168810"}, "reason": "LeaderElection", "message": "gpu-operator-7bfc5f55-8577v_8de61e82-40e9-4562-a2e1-68c51ff1fe84 became leader"}
1.6793806198463733e+09  INFO    controller.clusterpolicy-controller     Starting EventSource    {"source": "kind source: *v1.ClusterPolicy"}
1.6793806198463368e+09  DEBUG   events  Normal  {"object": {"kind":"Lease","namespace":"nvidia-gpu-operator","name":"53822513.nvidia.com","uid":"ff021974-19a5-44ec-9dbb-b1da19e1202b","apiVersion":"coordination.k8s.io/v1","resourceVersion":"168811"}, "reason": "LeaderElection", "message": "gpu-operator-7bfc5f55-8577v_8de61e82-40e9-4562-a2e1-68c51ff1fe84 became leader"}
1.6793806198463988e+09  INFO    controller.clusterpolicy-controller     Starting EventSource    {"source": "kind source: *v1.Node"}
1.6793806198464212e+09  INFO    controller.clusterpolicy-controller     Starting EventSource    {"source": "kind source: *v1.DaemonSet"}
1.6793806198466964e+09  INFO    controller.clusterpolicy-controller     Starting Controller
1.6793806200498693e+09  ERROR   controller-runtime.source       if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.poll
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:580
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
I0321 06:37:01.197691       1 request.go:665] Waited for 1.045058702s due to client-side throttling, not priority and fairness, request: GET:https://10.96.0.1:443/apis/autoscaling/v2?timeout=32s
1.6793806217501187e+09  ERROR   controllers.ClusterPolicy       Unable to list ClusterPolicies  {"error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/handler.(*enqueueRequestsFromMapFunc).mapAndEnqueue
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/handler/enqueue_mapped.go:80
sigs.k8s.io/controller-runtime/pkg/handler.(*enqueueRequestsFromMapFunc).Create
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/handler/enqueue_mapped.go:57
sigs.k8s.io/controller-runtime/pkg/source/internal.EventHandler.OnAdd
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/internal/eventsource.go:63
k8s.io/client-go/tools/cache.(*processorListener).run.func1
        /workspace/vendor/k8s.io/client-go/tools/cache/shared_informer.go:787
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*processorListener).run
        /workspace/vendor/k8s.io/client-go/tools/cache/shared_informer.go:781
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:73
1.6793806302524693e+09  ERROR   controller-runtime.source       if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
1.679380640252055e+09   ERROR   controller-runtime.source       if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
1.6793806502524986e+09  ERROR   controller-runtime.source       if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
1.6793806602530806e+09  ERROR   controller-runtime.source       if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
1.679380670252145e+09   ERROR   controller-runtime.source       if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
1.6793806802531853e+09  ERROR   controller-runtime.source       if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
1.6793806902525237e+09  ERROR   controller-runtime.source       if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
1.6793807002526197e+09  ERROR   controller-runtime.source       if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
1.6793807102526753e+09  ERROR   controller-runtime.source       if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
1.67938072025275e+09    ERROR   controller-runtime.source       if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
1.6793807302527254e+09  ERROR   controller-runtime.source       if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
1.6793807398470001e+09  ERROR   controller.clusterpolicy-controller     Could not wait for Cache to sync        {"error": "failed to wait for clusterpolicy-controller caches to sync: timed out waiting for cache to be synced"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:208
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:218
1.6793807398471012e+09  INFO    Stopping and waiting for non leader election runnables
1.679380739847109e+09   INFO    Stopping and waiting for leader election runnables
1.679380739847114e+09   INFO    Stopping and waiting for caches
1.6793807398471637e+09  INFO    Stopping and waiting for webhooks
1.6793807398471868e+09  INFO    Wait completed, proceeding to shutdown the manager
1.6793807398472097e+09  ERROR   setup   problem running manager {"error": "failed to wait for clusterpolicy-controller caches to sync: timed out waiting for cache to be synced"}
main.main
        /workspace/main.go:118
runtime.main
        /usr/local/go/src/runtime/proc.go:255

Can you open a new terminal and run the command below?
$ kubectl delete crd clusterpolicies.nvidia.com
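Before deleting, it may help to confirm whether the CRD is actually registered in the cluster (a quick check, assuming your kubectl context points at the same cluster):
$ kubectl get crd | grep -i nvidia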

But I cannot find the crd clusterpolicies.nvidia.com via kubectl get crd.

Please try opening a new terminal and running the command below.
$ kubectl delete crd clusterpolicies.nvidia.com

A similar error log was also found in another topic: TAO Toolkit 4.0 setup issue - #18 by mykim4

I got the error message: clusterpolicies.nvidia.com not found

Did you set up the TAO API successfully?
Refer to AutoML - NVIDIA Docs
and the blog post https://developer.nvidia.com/blog/training-like-an-ai-pro-using-tao-automl/

I set up TAO API via that method and it succeeded. Do I need to reinstall?

Not needed. Can you also share the output of the command below?
$ kubectl describe pod -n nvidia-gpu-operator gpu-operator-7bfc5f55-8577v

Name:                 gpu-operator-7bfc5f55-8577v
Namespace:            nvidia-gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 admin-ops01/192.168.101.8
Start Time:           Mon, 20 Mar 2023 08:54:16 +0000
Labels:               app=gpu-operator
                      app.kubernetes.io/component=gpu-operator
                      pod-template-hash=7bfc5f55
Annotations:          cni.projectcalico.org/containerID: 20a1fb8cccdaaefeada46ef94eeb1902c00f063dd06f17c8db2e9ba49b6a98cb
                      cni.projectcalico.org/podIP: 192.168.33.118/32
                      cni.projectcalico.org/podIPs: 192.168.33.118/32
                      openshift.io/scc: restricted-readonly
Status:               Running
IP:                   192.168.33.118
IPs:
  IP:           192.168.33.118
Controlled By:  ReplicaSet/gpu-operator-7bfc5f55
Containers:
  gpu-operator:
    Container ID:  containerd://3f2ec1c212505150c32e325401d9441ae44b291bdc8e378ded60da1c9a01b5ca
    Image:         nvcr.io/nvidia/gpu-operator:v1.10.1
    Image ID:      nvcr.io/nvidia/gpu-operator@sha256:c7f9074c1a7f58947c807f23f2eece3a8b04e11175127919156f8e864821d45a
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      gpu-operator
    Args:
      --leader-elect
    State:          Running
      Started:      Tue, 21 Mar 2023 08:56:58 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 21 Mar 2023 08:49:28 +0000
      Finished:     Tue, 21 Mar 2023 08:51:46 +0000
    Ready:          True
    Restart Count:  199
    Limits:
      cpu:     500m
      memory:  350Mi
    Requests:
      cpu:      200m
      memory:   100Mi
    Liveness:   http-get http://:8081/healthz delay=15s timeout=1s period=20s #success=1 #failure=3
    Readiness:  http-get http://:8081/readyz delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment:
      WATCH_NAMESPACE:     
      OPERATOR_NAMESPACE:  nvidia-gpu-operator (v1:metadata.namespace)
    Mounts:
      /host-etc/os-release from host-os-release (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7r8cx (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  host-os-release:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/os-release
    HostPathType:  
  kube-api-access-7r8cx:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/master:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                   From     Message
  ----     ------   ----                  ----     -------
  Warning  BackOff  85s (x4695 over 23h)  kubelet  Back-off restarting failed container

Could I run
$ bash setup.sh uninstall
and then follow the methods you mentioned before?

Yes, you can, as a double check.
Then run "kubectl get pods -A" to see whether there is still a failed pod.
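For example, to list only the pods that are not healthy (an optional filter, assuming a standard shell):
$ kubectl get pods -A | grep -Ev 'Running|Completed'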

Just set up TAO-API again. There is no need to run AutoML and its notebook.

Is there any method that does not require reinstalling?

I think you have already fixed the failed pod issue.
Can you share the output of "kubectl get pods -A"?

No, I haven't fixed it yet. The failed pod still shows the error messages:

if kind is a CRD, it should be installed before calling Start {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}

problem running manager {"error": "failed to wait for clusterpolicy-controller caches to sync: timed out waiting for cache to be synced"}
main.main

and the output of "kubectl get pods -A" still looks like this post

OK. And do you have another kind of machine on hand?
I see that https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_setup.html#bare-metal-setup mentions:

  • 1 NVIDIA Discrete GPU: Volta, Turing, Ampere, Hopper architecture
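If you want to confirm which GPU model the machine has (optional, and only works while the NVIDIA driver is installed):
$ nvidia-smi --query-gpu=name --format=csv,noheader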

Also, could you upload the log from when you set up TAO-API?
$ bash setup.sh check-inventory.yml
$ bash setup.sh install

You can upload it via the upload button.

The machine I am using now has 4 NVIDIA Tesla P100 SXM2 16GB GPUs.

How can I store the log when I set up TAO-API?

You can copy the log from the terminal and then upload it as a txt file.
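Alternatively, you can save the output directly to a file while the setup runs (a minimal example, assuming a bash shell; the log file names are arbitrary):
$ bash setup.sh check-inventory.yml 2>&1 | tee check-inventory.log
$ bash setup.sh install 2>&1 | tee tao-api-install.log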

Hi,
Please uninstall the driver.

sudo apt purge nvidia-driver-525
sudo apt autoremove
sudo apt autoclean
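After the purge, you can verify that no driver packages remain before rerunning the setup (an optional check; package names can vary depending on how the driver was installed):
$ dpkg -l | grep -i nvidia-driver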

Then, run the commands below:
$ bash setup.sh check-inventory.yml
$ bash setup.sh install

And share the logs.