AutoML training speed and GPU problem

I got the error message: clusterpolicies.nvidia.com not found

Did you set up TAO API successfully?
Refer to AutoML - NVIDIA Docs
and the blog post https://developer.nvidia.com/blog/training-like-an-ai-pro-using-tao-automl/

I set up the TAO API with that method and it completed successfully. Do I need to reinstall?

Not needed. Can you share the log below as well?
$ kubectl describe pod -n nvidia-gpu-operator gpu-operator-7bfc5f55-8577v

Name:                 gpu-operator-7bfc5f55-8577v
Namespace:            nvidia-gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 admin-ops01/192.168.101.8
Start Time:           Mon, 20 Mar 2023 08:54:16 +0000
Labels:               app=gpu-operator
                      app.kubernetes.io/component=gpu-operator
                      pod-template-hash=7bfc5f55
Annotations:          cni.projectcalico.org/containerID: 20a1fb8cccdaaefeada46ef94eeb1902c00f063dd06f17c8db2e9ba49b6a98cb
                      cni.projectcalico.org/podIP: 192.168.33.118/32
                      cni.projectcalico.org/podIPs: 192.168.33.118/32
                      openshift.io/scc: restricted-readonly
Status:               Running
IP:                   192.168.33.118
IPs:
  IP:           192.168.33.118
Controlled By:  ReplicaSet/gpu-operator-7bfc5f55
Containers:
  gpu-operator:
    Container ID:  containerd://3f2ec1c212505150c32e325401d9441ae44b291bdc8e378ded60da1c9a01b5ca
    Image:         nvcr.io/nvidia/gpu-operator:v1.10.1
    Image ID:      nvcr.io/nvidia/gpu-operator@sha256:c7f9074c1a7f58947c807f23f2eece3a8b04e11175127919156f8e864821d45a
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      gpu-operator
    Args:
      --leader-elect
    State:          Running
      Started:      Tue, 21 Mar 2023 08:56:58 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 21 Mar 2023 08:49:28 +0000
      Finished:     Tue, 21 Mar 2023 08:51:46 +0000
    Ready:          True
    Restart Count:  199
    Limits:
      cpu:     500m
      memory:  350Mi
    Requests:
      cpu:      200m
      memory:   100Mi
    Liveness:   http-get http://:8081/healthz delay=15s timeout=1s period=20s #success=1 #failure=3
    Readiness:  http-get http://:8081/readyz delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment:
      WATCH_NAMESPACE:     
      OPERATOR_NAMESPACE:  nvidia-gpu-operator (v1:metadata.namespace)
    Mounts:
      /host-etc/os-release from host-os-release (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7r8cx (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  host-os-release:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/os-release
    HostPathType:  
  kube-api-access-7r8cx:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/master:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                   From     Message
  ----     ------   ----                  ----     -------
  Warning  BackOff  85s (x4695 over 23h)  kubelet  Back-off restarting failed container

Could I run
bash setup.sh uninstall
and then rerun the steps you mentioned before?

Yes, you can, as a double check.
Then run “kubectl get pods -A” to see whether there is still a failed pod.
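For example (just a quick filter, assuming a standard shell), something like
$ kubectl get pods -A | grep -vE 'Running|Completed'
should leave only the pods whose STATUS is not Running or Completed.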

Just set up the TAO API again. There is no need to run AutoML and its notebook.

Is there any method that does not require reinstalling?

I think you have already fixed the failed-pod issue.
Can you share the output of “kubectl get pods -A”?

No, I haven’t fixed it yet. The failed pod still shows these error messages:

if kind is a CRD, it should be installed before calling Start {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}

problem running manager {"error": "failed to wait for clusterpolicy-controller caches to sync: timed out waiting for cache to be synced"}
main.main

and the output of the command kubectl get pods -A still looks the same as in this post.

OK. Do you have another kind of machine on hand?
I see that https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_setup.html#bare-metal-setup mentions:

  • 1 NVIDIA Discrete GPU: Volta, Turing, Ampere, Hopper architecture

Also, could you upload the log from when you set up the TAO API?
$ bash setup.sh check-inventory.yml
$ bash setup.sh install

You can upload it via the upload button.

The machine I am using now has 4 NVIDIA Tesla P100 SXM2 16GB GPUs.

How should I store the log when I set up the TAO API?

You can copy the log from the terminal and then upload it as a txt file.
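If it is easier, you can also capture the output directly while the setup runs, for example (the log file names here are just examples):
$ bash setup.sh check-inventory.yml 2>&1 | tee check-inventory.log
$ bash setup.sh install 2>&1 | tee install.log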

Hi,
Please uninstall the driver.

sudo apt purge nvidia-driver-525
sudo apt autoremove
sudo apt autoclean
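To double-check that the driver is really gone before reinstalling (just a sanity check, not part of the setup script), you could run:
$ dpkg -l | grep -i nvidia-driver
$ lsmod | grep nvidia
Both should return nothing once the purge has taken effect; a reboot may be needed if the module is still loaded.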

Then, run the following:
$ bash setup.sh check-inventory.yml
$ bash setup.sh install

And share the logs.

At the TASK [Waiting for the Cluster to become available] stage, I found that I could not rmmod the nvidia module.

When I ran the command sudo rmmod nvidia, I got the error message:

rmmod: ERROR: Module nvidia is in use

How can I deal with this problem?
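(For reference, one common way to see what is still holding the module, assuming standard tools are available, is:
$ lsmod | grep nvidia
which shows the "Used by" column, and
$ sudo lsof /dev/nvidia*
which lists the processes keeping the device files open.)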

The picture below shows the error log at the TASK [Waiting for the Cluster to become available] stage.

The picture below shows the last driver-related components.

It gets stuck here, right?

Exactly, I still get stuck here.

Please try opening a new terminal and running the command below.
$ kubectl delete crd clusterpolicies.nvidia.com
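If you want to check first whether that CRD is actually present, something like
$ kubectl get crd | grep nvidia.com
should show it. After deleting it, you can watch whether the gpu-operator pod recovers with
$ kubectl get pods -n nvidia-gpu-operator -w
(the namespace here is taken from the describe output above).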

If I do that, the GPU driver and the GPU-related network plugin are terminated and disappear.

Is there any method to deal with it?

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

This command does not affect the driver.