Please provide the following information when requesting support.
• Hardware (Tesla P100-SXM2-16GB)
• Network Type (Classification)
The AutoML documentation states that the estimated time for training a multitask_classification model with Bayesian optimization is 250 minutes.
However, when I run tao-getting-started_v4.0.0/notebooks/tao_api_starter_kit/api/classification.ipynb to train a multitask_classification model, the training job has stayed in the running state for more than 7 hours, as shown in the picture below, and the GPU usage is zero.
1.679380604772754e+09 INFO controller-runtime.metrics Metrics server is starting to listen {"addr": ":8080"}
1.6793806047730486e+09 INFO setup starting manager
1.67938060477328e+09 INFO Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
1.6793806047733126e+09 INFO Starting server {"kind": "health probe", "addr": "[::]:8081"}
I0321 06:36:44.773357 1 leaderelection.go:248] attempting to acquire leader lease nvidia-gpu-operator/53822513.nvidia.com...
I0321 06:36:59.846065 1 leaderelection.go:258] successfully acquired lease nvidia-gpu-operator/53822513.nvidia.com
1.6793806198461287e+09 DEBUG events Normal {"object": {"kind":"ConfigMap","namespace":"nvidia-gpu-operator","name":"53822513.nvidia.com","uid":"361ff16d-b58e-4204-b47a-86e4fff31f1c","apiVersion":"v1","resourceVersion":"168810"}, "reason": "LeaderElection", "message": "gpu-operator-7bfc5f55-8577v_8de61e82-40e9-4562-a2e1-68c51ff1fe84 became leader"}
1.6793806198463733e+09 INFO controller.clusterpolicy-controller Starting EventSource {"source": "kind source: *v1.ClusterPolicy"}
1.6793806198463368e+09 DEBUG events Normal {"object": {"kind":"Lease","namespace":"nvidia-gpu-operator","name":"53822513.nvidia.com","uid":"ff021974-19a5-44ec-9dbb-b1da19e1202b","apiVersion":"coordination.k8s.io/v1","resourceVersion":"168811"}, "reason": "LeaderElection", "message": "gpu-operator-7bfc5f55-8577v_8de61e82-40e9-4562-a2e1-68c51ff1fe84 became leader"}
1.6793806198463988e+09 INFO controller.clusterpolicy-controller Starting EventSource {"source": "kind source: *v1.Node"}
1.6793806198464212e+09 INFO controller.clusterpolicy-controller Starting EventSource {"source": "kind source: *v1.DaemonSet"}
1.6793806198466964e+09 INFO controller.clusterpolicy-controller Starting Controller
1.6793806200498693e+09 ERROR controller-runtime.source if kind is a CRD, it should be installed before calling Start {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.poll
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:580
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
I0321 06:37:01.197691 1 request.go:665] Waited for 1.045058702s due to client-side throttling, not priority and fairness, request: GET:https://10.96.0.1:443/apis/autoscaling/v2?timeout=32s
1.6793806217501187e+09 ERROR controllers.ClusterPolicy Unable to list ClusterPolicies {"error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/handler.(*enqueueRequestsFromMapFunc).mapAndEnqueue
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/handler/enqueue_mapped.go:80
sigs.k8s.io/controller-runtime/pkg/handler.(*enqueueRequestsFromMapFunc).Create
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/handler/enqueue_mapped.go:57
sigs.k8s.io/controller-runtime/pkg/source/internal.EventHandler.OnAdd
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/internal/eventsource.go:63
k8s.io/client-go/tools/cache.(*processorListener).run.func1
/workspace/vendor/k8s.io/client-go/tools/cache/shared_informer.go:787
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*processorListener).run
/workspace/vendor/k8s.io/client-go/tools/cache/shared_informer.go:781
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:73
1.6793806302524693e+09 ERROR controller-runtime.source if kind is a CRD, it should be installed before calling Start {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
1.679380640252055e+09 ERROR controller-runtime.source if kind is a CRD, it should be installed before calling Start {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
1.6793806502524986e+09 ERROR controller-runtime.source if kind is a CRD, it should be installed before calling Start {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
1.6793806602530806e+09 ERROR controller-runtime.source if kind is a CRD, it should be installed before calling Start {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
1.679380670252145e+09 ERROR controller-runtime.source if kind is a CRD, it should be installed before calling Start {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
1.6793806802531853e+09 ERROR controller-runtime.source if kind is a CRD, it should be installed before calling Start {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
1.6793806902525237e+09 ERROR controller-runtime.source if kind is a CRD, it should be installed before calling Start {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
1.6793807002526197e+09 ERROR controller-runtime.source if kind is a CRD, it should be installed before calling Start {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
1.6793807102526753e+09 ERROR controller-runtime.source if kind is a CRD, it should be installed before calling Start {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
1.67938072025275e+09 ERROR controller-runtime.source if kind is a CRD, it should be installed before calling Start {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
1.6793807302527254e+09 ERROR controller-runtime.source if kind is a CRD, it should be installed before calling Start {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.WaitForWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:660
k8s.io/apimachinery/pkg/util/wait.poll
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:594
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
1.6793807398470001e+09 ERROR controller.clusterpolicy-controller Could not wait for Cache to sync {"error": "failed to wait for clusterpolicy-controller caches to sync: timed out waiting for cache to be synced"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:208
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:218
1.6793807398471012e+09 INFO Stopping and waiting for non leader election runnables
1.679380739847109e+09 INFO Stopping and waiting for leader election runnables
1.679380739847114e+09 INFO Stopping and waiting for caches
1.6793807398471637e+09 INFO Stopping and waiting for webhooks
1.6793807398471868e+09 INFO Wait completed, proceeding to shutdown the manager
1.6793807398472097e+09 ERROR setup problem running manager {"error": "failed to wait for clusterpolicy-controller caches to sync: timed out waiting for cache to be synced"}
main.main
/workspace/main.go:118
runtime.main
/usr/local/go/src/runtime/proc.go:255
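The leading numbers on the operator's log lines are Unix epoch seconds in scientific notation, which makes the timeline hard to follow. Below is a minimal sketch for converting them to UTC and counting the ERROR messages, assuming the log has been saved as plain text lines. The function name `summarize_operator_log` and the regular expression are illustrative choices, not part of TAO or the GPU operator.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Matches lines such as:
#   1.6793806047730486e+09 INFO setup starting manager
# Stack-trace frames and klog lines (I0321 ...) do not match and are skipped.
LINE_RE = re.compile(r"^(\d+(?:\.\d+)?e\+\d+)\s+(INFO|DEBUG|ERROR)\s+(.*)$")


def summarize_operator_log(lines):
    """Return (first_time, last_time, Counter of ERROR messages) for a log excerpt."""
    times = []
    errors = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        epoch, level, rest = match.groups()
        times.append(datetime.fromtimestamp(float(epoch), tz=timezone.utc))
        if level == "ERROR":
            # Drop the trailing JSON payload so identical errors group together.
            errors[rest.split(" {", 1)[0].strip()] += 1
    if not times:
        return None, None, errors
    return min(times), max(times), errors


if __name__ == "__main__":
    # Three lines copied from the log above.
    sample = [
        r'1.679380604772754e+09 INFO controller-runtime.metrics Metrics server is starting to listen {"addr": ":8080"}',
        r'1.6793806200498693e+09 ERROR controller-runtime.source if kind is a CRD, it should be installed before calling Start {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}',
        r'1.6793807398472097e+09 ERROR setup problem running manager {"error": "failed to wait for clusterpolicy-controller caches to sync: timed out waiting for cache to be synced"}',
    ]
    first, last, errors = summarize_operator_log(sample)
    print("first:", first.isoformat())
    print("last: ", last.isoformat())
    for message, count in errors.most_common():
        print(count, message)
```

Applied to the log above, the first and last timestamps convert to 06:36:44 and 06:38:59 UTC on 2023-03-21, so the operator shuts down about two minutes after acquiring the leader lease, once the wait for the ClusterPolicy cache times out.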
No, I haven't fixed it yet. The failed pod still shows these error messages:
if kind is a CRD, it should be installed before calling Start {"kind": "ClusterPolicy.nvidia.com", "error": "no matches for kind \"ClusterPolicy\" in version \"nvidia.com/v1\""}
problem running manager {"error": "failed to wait for clusterpolicy-controller caches to sync: timed out waiting for cache to be synced"}
main.main
and the output of the command kubectl get pods -A is still the same as in this post.