AutoML training speed and GPU problem

Did you set up TAO API successfully?
Refer to AutoML - NVIDIA Docs
and blog

I set up TAO API via that method and it succeeded. Do I need to reinstall?

Not needed. Could you share the log below as well?
$ kubectl describe pod -n nvidia-gpu-operator gpu-operator-7bfc5f55-8577v

Name:                 gpu-operator-7bfc5f55-8577v
Namespace:            nvidia-gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 admin-ops01/
Start Time:           Mon, 20 Mar 2023 08:54:16 +0000
Labels:               app=gpu-operator
Annotations: 20a1fb8cccdaaefeada46ef94eeb1902c00f063dd06f17c8db2e9ba49b6a98cb
Status:               Running
Controlled By:  ReplicaSet/gpu-operator-7bfc5f55
    Container ID:  containerd://3f2ec1c212505150c32e325401d9441ae44b291bdc8e378ded60da1c9a01b5ca
    Image ID:
    Port:          8080/TCP
    Host Port:     0/TCP
    State:          Running
      Started:      Tue, 21 Mar 2023 08:56:58 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 21 Mar 2023 08:49:28 +0000
      Finished:     Tue, 21 Mar 2023 08:51:46 +0000
    Ready:          True
    Restart Count:  199
      cpu:     500m
      memory:  350Mi
      cpu:      200m
      memory:   100Mi
    Liveness:   http-get http://:8081/healthz delay=15s timeout=1s period=20s #success=1 #failure=3
    Readiness:  http-get http://:8081/readyz delay=5s timeout=1s period=10s #success=1 #failure=3
      OPERATOR_NAMESPACE:  nvidia-gpu-operator (v1:metadata.namespace)
      /host-etc/os-release from host-os-release (ro)
      /var/run/secrets/ from kube-api-access-7r8cx (ro)
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
    Type:          HostPath (bare host directory volume)
    Path:          /etc/os-release
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
                    op=Exists for 300s
                    op=Exists for 300s
  Type     Reason   Age                   From     Message
  ----     ------   ----                  ----     -------
  Warning  BackOff  85s (x4695 over 23h)  kubelet  Back-off restarting failed container
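Given the Restart Count of 199 and the BackOff events, one way to see why the container keeps crashing is to pull the logs of the previous, terminated container instance (a sketch; the pod name is taken from the describe output above, and the final grep is optional):

```shell
# Logs of the last terminated instance of the crash-looping operator pod;
# --previous selects the container that exited, not the current one
kubectl logs -n nvidia-gpu-operator gpu-operator-7bfc5f55-8577v --previous \
  | grep -i "clusterpolicy"
```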

Could I run
bash setup uninstall
and then run the methods you mentioned before?

Yes, you can, as a double check.
Then run “kubectl get pods -A” to see whether there is still a failed pod.
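To spot a failed pod quickly, a simple filter over that same command can help (a sketch; the grep pattern assumes healthy pods report Running or Completed):

```shell
# Show only pods whose STATUS is not Running/Completed
kubectl get pods -A | grep -vE 'Running|Completed'
```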

Just set up TAO API again. There is no need to run AutoML and its notebook.

Is there any method that does not require reinstalling?

I think you have already fixed the failed-pod issue.
Can you share the output of “kubectl get pods -A”?

No, I haven’t fixed it yet. The failed pod still shows these error messages:

if kind is a CRD, it should be installed before calling Start {"kind": "", "error": "no matches for kind \"ClusterPolicy\" in version \"\""}

problem running manager {"error": "failed to wait for clusterpolicy-controller caches to sync: timed out waiting for cache to be synced"}
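That error means the operator cannot find the ClusterPolicy custom resource definition in the cluster. One quick way to confirm whether the CRD is registered (a sketch; clusterpolicies.nvidia.com is the CRD name the GPU Operator normally installs):

```shell
# If this prints nothing, the ClusterPolicy CRD is missing,
# which would match the "no matches for kind" error above
kubectl get crd | grep -i clusterpolicies
```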

and the output of the command “kubectl get pods -A” still looks like this post

OK. And do you have another kind of machine on hand?
I find that the documentation mentions:

  • 1 NVIDIA Discrete GPU: Volta, Turing, Ampere, Hopper architecture

Also, could you upload the log from when you set up TAO API?
$ bash check-inventory.yml
$ bash install

You can upload it via the upload button.

The machine I am using now has 4 NVIDIA Tesla P100 SXM2 16GB GPUs.

How should I save the log when I set up TAO API?

You can copy the log from the terminal and then upload it as a txt file.
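Alternatively, the log can be captured directly while the commands run, e.g. with tee (a sketch; the log file names are arbitrary choices of mine):

```shell
# Capture stdout and stderr to a file while still printing to the terminal
bash check-inventory.yml 2>&1 | tee check-inventory.log
bash install 2>&1 | tee install.log
```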

Please uninstall the driver.

sudo apt purge nvidia-driver-525
sudo apt autoremove
sudo apt autoclean

Then, run below
$ bash check-inventory.yml
$ bash install

And share the logs.
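Before rerunning the install, one way to confirm the old driver packages are really gone (an assumption on my side, not a step from the official setup):

```shell
# List any NVIDIA packages still installed ("ii" = installed);
# empty output means the purge succeeded
dpkg -l | grep -i '^ii.*nvidia' || echo "no nvidia packages installed"
```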

In the stage of TASK [Waiting for the Cluster to become available], I found that I could not rmmod the nvidia module.

When I ran the command sudo rmmod nvidia, I got this error message:

rmmod: ERROR: Module nvidia is in use

How can I deal with this problem?

The picture below shows the error log from the stage of TASK [Waiting for the Cluster to become available].

The picture below shows the last components related to the driver.
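For the rmmod error specifically, the usual cause is that dependent modules or user-space processes still hold the driver. A hedged sketch of the standard Linux steps to check and unload them (general practice, not advice given in this thread):

```shell
# See which modules still use nvidia (the "Used by" column)
lsmod | grep '^nvidia'
# Unload the dependent modules first, then the core module
sudo rmmod nvidia_drm nvidia_modeset nvidia_uvm nvidia
# If it is still "in use", find the processes holding the device nodes
sudo lsof /dev/nvidia* 2>/dev/null
```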

It gets stuck here, right?

Exactly, I am still stuck here.

Please try opening a new terminal and running the command below.
$ kubectl delete crd

If I did that, it would terminate the GPU driver and the GPU-related network plugin, and they would disappear.

Is there any method to deal with this?

There has been no update from you for a while, so we are assuming this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks

This command does not affect the driver.
