TAO5 - Detectnet_v2 - MultiGPU TAO-API Dead at train start

In the latest 5.0 notebooks, the default values in gpu-operator-values.yml are:

enable_mig: no
mig_profile: all-disabled
mig_strategy: single
nvidia_driver_version: "525.85.12"
install_driver: true

You can also find these values in the repo.

Also, could you please try another node_port?
On my side,

$ kubectl get endpoints
NAME                                            ENDPOINTS                            AGE
cluster.local-nfs-subdir-external-provisioner   <none>                               4d21h
ingress-nginx-controller                        192.168.34.82:443,192.168.34.82:80   4d21h
kubernetes                                      10.34.4.209:6443                     4d21h
tao-toolkit-api-jupyterlab-service              192.168.34.85:8888                   4d21h
tao-toolkit-api-service                         192.168.34.87:8000                   4d21h
$ kubectl get services
NAME                                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
ingress-nginx-controller             NodePort    10.100.40.251   <none>        80:32080/TCP,443:32443/TCP   4d21h
kubernetes                           ClusterIP   10.96.0.1       <none>        443/TCP                      4d21h
tao-toolkit-api-jupyterlab-service   NodePort    10.98.32.241    <none>        8888:31952/TCP               4d21h
tao-toolkit-api-service              NodePort    10.96.101.73    <none>        8000:31951/TCP               4d21h

$ hostname -i
127.0.1.1
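
For example, a minimal sketch to pick up the NodePort of tao-toolkit-api-service and test it directly (service name as shown above; replace the NGC key placeholder with your own):

$ NODE_PORT=$(kubectl get service tao-toolkit-api-service -o jsonpath='{.spec.ports[0].nodePort}')
$ NODE_IP=$(hostname -i)
# a JSON response (rather than a connection error) means the API is reachable on this NodePort
$ curl http://${NODE_IP}:${NODE_PORT}/api/v1/login/<your_ngc_key>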

And then, I set it as below.

Same issue with this driver version.

setup.sh uninstall

Deleted all the drivers and rebooted.

setup.sh install with version 525.85.12, installed as a pod.
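
To verify that the driver really is running as a pod and reports the expected version, something like this can be checked (assuming the nvidia-gpu-operator namespace created by the setup script; adjust if yours differs):

$ kubectl get pods -n nvidia-gpu-operator | grep nvidia-driver-daemonset
# query the driver version from inside the driver daemonset pod
$ kubectl exec -n nvidia-gpu-operator <nvidia-driver-daemonset-pod> -- nvidia-smi --query-gpu=driver_version --format=csv,noheader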

Dataset convert works:

detectnet_v2 dataset_convert --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/f7f5ff2e-9438-4be3-84b6-ed5587c8e11a/06d77ab5-019f-4182-ac8d-e9dfebdb13e5/ --output_filename=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/f7f5ff2e-9438-4be3-84b6-ed5587c8e11a/tfrecords/tfrecords --verbose --dataset_export_spec=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/f7f5ff2e-9438-4be3-84b6-ed5587c8e11a/specs/06d77ab5-019f-4182-ac8d-e9dfebdb13e5.protobuf  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/f7f5ff2e-9438-4be3-84b6-ed5587c8e11a/logs/06d77ab5-019f-4182-ac8d-e9dfebdb13e5.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/f7f5ff2e-9438-4be3-84b6-ed5587c8e11a/logs/06d77ab5-019f-4182-ac8d-e9dfebdb13e5.txt; find /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/f7f5ff2e-9438-4be3-84b6-ed5587c8e11a/06d77ab5-019f-4182-ac8d-e9dfebdb13e5/ -type d | xargs chmod 777; find /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/f7f5ff2e-9438-4be3-84b6-ed5587c8e11a/06d77ab5-019f-4182-ac8d-e9dfebdb13e5/ -type f | xargs chmod 666 /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/f7f5ff2e-9438-4be3-84b6-ed5587c8e11a/06d77ab5-019f-4182-ac8d-e9dfebdb13e5/status.json
nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
Job created 06d77ab5-019f-4182-ac8d-e9dfebdb13e5
Post running
Toolkit status for 06d77ab5-019f-4182-ac8d-e9dfebdb13e5 is SUCCESS
Job Done: 06d77ab5-019f-4182-ac8d-e9dfebdb13e5 Final status: Done
detectnet_v2 dataset_convert --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/353fdaec-c473-48db-992d-2524cd82494c/6b3e21ed-6beb-48a1-985d-8a38890aa63d/ --output_filename=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/353fdaec-c473-48db-992d-2524cd82494c/tfrecords/tfrecords --verbose --dataset_export_spec=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/353fdaec-c473-48db-992d-2524cd82494c/specs/6b3e21ed-6beb-48a1-985d-8a38890aa63d.protobuf  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/353fdaec-c473-48db-992d-2524cd82494c/logs/6b3e21ed-6beb-48a1-985d-8a38890aa63d.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/353fdaec-c473-48db-992d-2524cd82494c/logs/6b3e21ed-6beb-48a1-985d-8a38890aa63d.txt; find /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/353fdaec-c473-48db-992d-2524cd82494c/6b3e21ed-6beb-48a1-985d-8a38890aa63d/ -type d | xargs chmod 777; find /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/353fdaec-c473-48db-992d-2524cd82494c/6b3e21ed-6beb-48a1-985d-8a38890aa63d/ -type f | xargs chmod 666 /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/datasets/353fdaec-c473-48db-992d-2524cd82494c/6b3e21ed-6beb-48a1-985d-8a38890aa63d/status.json
nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
Job created 6b3e21ed-6beb-48a1-985d-8a38890aa63d
Post running
Toolkit status for 6b3e21ed-6beb-48a1-985d-8a38890aa63d is SUCCESS
Job Done: 6b3e21ed-6beb-48a1-985d-8a38890aa63d Final status: Done

But the train does not launch a new train pod.

The request is, however, well received by the API:
172.16.1.2 - - [01/Aug/2023:07:36:45 +0000] "POST /api/v1/user/aca5e8b5-9d4c-52e0-a612-563bd387f382/model/a47ffabc-e471-46c6-a361-1dd34be9b001/job HTTP/1.1" 201 117 "-" "python-requests/2.28.2"

And after that, the API pod goes to Unhealthy status.

Could you please elaborate more on the above steps?
After “dataset_convert” finishes successfully, you run the “Run train” cell for AutoML training, right? Do you mean you cannot find a new pod when running “kubectl get pods”? And then you find the unhealthy info in “kubectl logs -f tao-toolkit-api-app-pod-xxxxx-xxxx”?
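
For reference, a minimal way to check both (the train pod is normally named after the job ID, and the API pod can be addressed through the label selector used by its deployment):

$ kubectl get pods                      # look for a pod named <job_id>-xxxxx
$ kubectl logs -f $(kubectl get pods -l name=tao-toolkit-api-app-pod -o name)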

It has never worked using the ingress-nginx port, not even with TAO 4.

$ curl http://127.0.1.1:32080/api/v1
<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>nginx</center>
</body>
</html>

You can run
$ curl http://127.0.1.1:32080/api/v1/login/your_ngc_key

The same as in the first post. NOW: avoiding the AutoML process, and avoiding multi-GPU.
1xGPU - Normal train - NO use_amp.

Log in from the pod → reload the datasets (previously loaded in other projects) → create the automatic specs and personalize them → run the dataset-convert process → watch how the new POD is created, read the log and verify that all the images are recognized and the TFRecords are generated → create the new “model_id” → create the automatic specs and personalize them → review that the generated file makes sense → launch the train JOB → the POD is NOT created with the train process → the API POD goes to Unhealthy and becomes unreachable.

Oh no, now I need to revive the TAO pod.

$ curl http://127.0.1.1:32080/api/v1/login/xxxxxxx
<html>
<head><title>503 Service Temporarily Unavailable</title></head>
<body>
<center><h1>503 Service Temporarily Unavailable</h1></center>
<hr><center>nginx</center>
</body>
</html>

I don’t know how to revive the pod without uninstalling everything… I tried to drain the node, but it stays unhealthy.

How about the result of
$ kubectl describe service tao-toolkit-api-service
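
Also, instead of uninstalling everything, restarting just the API deployment (and making the node schedulable again after the drain) is usually enough. A sketch, with the deployment name inferred from the pod/ReplicaSet names in this thread:

$ kubectl uncordon <node_name>
$ kubectl rollout restart deployment/tao-toolkit-api-app-pod
$ kubectl rollout status deployment/tao-toolkit-api-app-pod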

$ kubectl describe service tao-toolkit-api-service
Name:                     tao-toolkit-api-service
Namespace:                default
Labels:                   app.kubernetes.io/managed-by=Helm
Annotations:              meta.helm.sh/release-name: tao-toolkit-api
                          meta.helm.sh/release-namespace: default
Selector:                 name=tao-toolkit-api-app-pod
Type:                     NodePort
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       10.104.52.184
IPs:                      10.104.52.184
Port:                     api  8000/TCP
TargetPort:               8000/TCP
NodePort:                 api  31951/TCP
Endpoints:                
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>

And the latest describe of the API pod:

$ kubectl describe pod tao-toolkit-api-app-pod-55c5d88d86-2xbm7
Name:         tao-toolkit-api-app-pod-55c5d88d86-2xbm7
Namespace:    default
Priority:     0
Node:         azken/10.1.1.10
Start Time:   Tue, 01 Aug 2023 10:17:49 +0200
Labels:       name=tao-toolkit-api-app-pod
              pod-template-hash=55c5d88d86
Annotations:  cni.projectcalico.org/containerID: 4aac52a87cbc260611c0b3d1146e2d8f4ef30ee6c2ec948d09581207265c6fde
              cni.projectcalico.org/podIP: 192.168.35.124/32
              cni.projectcalico.org/podIPs: 192.168.35.124/32
Status:       Running
IP:           192.168.35.124
IPs:
  IP:           192.168.35.124
Controlled By:  ReplicaSet/tao-toolkit-api-app-pod-55c5d88d86
Containers:
  tao-toolkit-api-app:
    Container ID:   containerd://5a8f77c6358698d420a6db88ae57ff8d448d0f58c908acce43f6038cbf14cacc
    Image:          nvcr.io/nvidia/tao/tao-toolkit:5.0.0-api
    Image ID:       nvcr.io/nvidia/tao/tao-toolkit@sha256:45e93283d23a911477cc433ec43e233af1631e85ec0ba839e63780c30dd2d70b
    Port:           8000/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Tue, 01 Aug 2023 10:38:54 +0200
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Tue, 01 Aug 2023 10:19:05 +0200
      Finished:     Tue, 01 Aug 2023 10:36:06 +0200
    Ready:          False
    Restart Count:  3
    Liveness:       http-get http://:8000/api/v1/health/liveness delay=3s timeout=3s period=10s #success=1 #failure=3
    Readiness:      http-get http://:8000/api/v1/health/readiness delay=3s timeout=3s period=10s #success=1 #failure=3
    Environment:
      NAMESPACE:            default
      CLAIMNAME:            tao-toolkit-api-pvc
      IMAGEPULLSECRET:      imagepullsecret
      AUTH_CLIENT_ID:       bnSePYullXlG-504nOZeNAXemGF6DhoCdYR8ysm088w
      NUM_GPU_PER_NODE:     1
      BACKEND:              local-k8s
      IMAGE_TF1:            nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
      IMAGE_PYT:            nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt
      IMAGE_TF2:            nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf2.11.0
      IMAGE_TAO_DEPLOY:     nvcr.io/nvidia/tao/tao-toolkit:5.0.0-deploy
      IMAGE_DEFAULT:        nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
      IMAGE_API:            nvcr.io/nvidia/tao/tao-toolkit:5.0.0-api
      IMAGE_DATA_SERVICES:  nvcr.io/nvidia/tao/tao-toolkit:5.0.0-data-services
      PYTHONIOENCODING:     utf-8
      LC_ALL:               C.UTF-8
    Mounts:
      /shared from shared-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nw5ml (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  shared-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  tao-toolkit-api-pvc
    ReadOnly:   false
  kube-api-access-nw5ml:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  58m (x2 over 60m)     default-scheduler  0/1 nodes are available: 1 node(s) were unschedulable.
  Normal   Scheduled         58m                   default-scheduler  Successfully assigned default/tao-toolkit-api-app-pod-55c5d88d86-2xbm7 to azken
  Normal   Pulled            57m                   kubelet            Successfully pulled image "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-api" in 8.632331063s
  Normal   Started           57m (x2 over 57m)     kubelet            Started container tao-toolkit-api-app
  Normal   Created           57m (x2 over 57m)     kubelet            Created container tao-toolkit-api-app
  Normal   Pulled            57m                   kubelet            Successfully pulled image "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-api" in 3.755364478s
  Warning  Unhealthy         57m (x6 over 57m)     kubelet            Liveness probe failed: Get "http://192.168.35.118:8000/api/v1/health/liveness": dial tcp 192.168.35.118:8000: connect: connection refused
  Warning  Unhealthy         57m (x8 over 57m)     kubelet            Readiness probe failed: Get "http://192.168.35.118:8000/api/v1/health/readiness": dial tcp 192.168.35.118:8000: connect: connection refused
  Normal   Killing           57m (x2 over 57m)     kubelet            Container tao-toolkit-api-app failed liveness probe, will be restarted
  Normal   Pulling           56m (x3 over 58m)     kubelet            Pulling image "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-api"
  Warning  Unhealthy         53m (x23 over 56m)    kubelet            Readiness probe failed: HTTP probe failed with statuscode: 400
  Normal   SandboxChanged    37m (x2 over 37m)     kubelet            Pod sandbox changed, it will be killed and re-created.
  Normal   Pulling           37m                   kubelet            Pulling image "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-api"
  Normal   Pulled            37m                   kubelet            Successfully pulled image "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-api" in 3.505803367s
  Normal   Created           37m                   kubelet            Created container tao-toolkit-api-app
  Normal   Started           37m                   kubelet            Started container tao-toolkit-api-app
  Warning  Unhealthy         36m                   kubelet            Readiness probe failed: Get "http://192.168.35.124:8000/api/v1/health/readiness": dial tcp 192.168.35.124:8000: connect: connection refused
  Warning  Unhealthy         36m                   kubelet            Liveness probe failed: Get "http://192.168.35.124:8000/api/v1/health/liveness": dial tcp 192.168.35.124:8000: connect: connection refused
  Warning  Unhealthy         115s (x229 over 35m)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 400

How about
$ kubectl get endpoints

After reinstalling, I can log in using the nginx port.

But I can only: log in → try to get the dataset default specs → it dies.

After the last crash:

$ kubectl get endpoints
NAME                                            ENDPOINTS                            AGE
cluster.local-nfs-subdir-external-provisioner   <none>                               161m
ingress-nginx-controller                        192.168.35.78:443,192.168.35.78:80   161m
kubernetes                                      10.1.1.10:6443                       168m
tao-toolkit-api-jupyterlab-service              192.168.35.74:8888                   9m49s
tao-toolkit-api-service                                                              9m49s

It is not expected to get an empty endpoint for tao-toolkit-api-service.
The same unexpected result appears under Endpoints: when running $ kubectl describe service tao-toolkit-api-service.
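
The Endpoints list stays empty when no pod matching the service selector is Ready, so this usually just reflects the failing readiness probe. A quick check (selector taken from the describe output above):

$ kubectl get pods -l name=tao-toolkit-api-app-pod -o wide
# the endpoint is only populated once READY shows 1/1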

Yep… but…

This is after a TAO-API helm uninstall and install.

$ kubectl get endpoints
NAME                                            ENDPOINTS                            AGE
cluster.local-nfs-subdir-external-provisioner   <none>                               170m
ingress-nginx-controller                        192.168.35.78:443,192.168.35.78:80   170m
kubernetes                                      10.1.1.10:6443                       176m
tao-toolkit-api-jupyterlab-service              192.168.35.83:8888                   38s
tao-toolkit-api-service                         192.168.35.75:8000                   38s

How can I debug that?

Is there any way to enter the pod and watch what’s happening?
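
You can exec into the running API pod and also check the previous container’s log after a restart, for example (pod name from “kubectl get pods”):

$ kubectl exec -it tao-toolkit-api-app-pod-xxxxx-xxxx -- /bin/bash   # use /bin/sh if bash is not available in the image
$ kubectl logs --previous tao-toolkit-api-app-pod-xxxxx-xxxx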

Also, I suggest narrowing down which cell results in this behavior.
You can try to run another notebook to check whether the TAO-API works.
For example, notebooks/tao_api_starter_kit/client/data_services.ipynb

I will try to execute the entire notebook…

This notebook works well. It creates all the important pods when necessary.

I will repeat the detectnet process…

One question.
I’m seeing that in some notebooks, when you log in, the API URL points to the Kubernetes {namespace}, and in others it does not.

I’m testing and both forms are interpreted by the cluster, but which one is correct?
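
One way to see which form a notebook actually sends is to watch the API pod’s access log while logging in (same access-log format as the POST line quoted earlier):

$ kubectl logs -f $(kubectl get pods -l name=tao-toolkit-api-app-pod -o name) | grep api/v1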

WTF!!!

With the API notebook, the train process starts correctly.
So some bug is hidden in the TAO-CLIENT.

The API POST to start the job froze for about a minute, but then it started. So the api-client may have some timeout that causes the pod to die.

Note: I’m using the {namespace} in the TAO API URL.

detectnet_v2 train --experiment_spec_file=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/15ed0b94-1ff1-45f8-8e11-35c0929cc914/specs/695e9ea9-fa1c-4349-977a-e185fcd6f4a3.protobuf --results_dir=/shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/15ed0b94-1ff1-45f8-8e11-35c0929cc914/695e9ea9-fa1c-4349-977a-e185fcd6f4a3/ --verbose --key=tlt_encode --gpus=1  > /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/15ed0b94-1ff1-45f8-8e11-35c0929cc914/logs/695e9ea9-fa1c-4349-977a-e185fcd6f4a3.txt 2>&1 >> /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/15ed0b94-1ff1-45f8-8e11-35c0929cc914/logs/695e9ea9-fa1c-4349-977a-e185fcd6f4a3.txt; find /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/15ed0b94-1ff1-45f8-8e11-35c0929cc914/695e9ea9-fa1c-4349-977a-e185fcd6f4a3/ -type d | xargs chmod 777; find /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/15ed0b94-1ff1-45f8-8e11-35c0929cc914/695e9ea9-fa1c-4349-977a-e185fcd6f4a3/ -type f | xargs chmod 666 /shared/users/aca5e8b5-9d4c-52e0-a612-563bd387f382/models/15ed0b94-1ff1-45f8-8e11-35c0929cc914/695e9ea9-fa1c-4349-977a-e185fcd6f4a3/status.json
nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
Job created 695e9ea9-fa1c-4349-977a-e185fcd6f4a3
$ kubectl get pods -A
NAMESPACE             NAME                                                              READY   STATUS      RESTARTS       AGE
default               695e9ea9-fa1c-4349-977a-e185fcd6f4a3-86f5x                        1/1     Running     0              56s

LOG:

INFO:tensorflow:Graph was finalized.
2023-08-02 07:06:47,473 [TAO Toolkit] [INFO] tensorflow 240: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2023-08-02 07:06:49,919 [TAO Toolkit] [INFO] tensorflow 500: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-08-02 07:06:50,496 [TAO Toolkit] [INFO] tensorflow 502: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2023-08-02 07:06:58,853 [TAO Toolkit] [INFO] tensorflow 81: Saving checkpoints for step-0.
INFO:tensorflow:epoch = 0.0, learning_rate = 4.9999994e-06, loss = 0.107709154, step = 0
2023-08-02 07:07:54,302 [TAO Toolkit] [INFO] tensorflow 262: epoch = 0.0, learning_rate = 4.9999994e-06, loss = 0.107709154, step = 0
2023-08-02 07:07:54,311 [TAO Toolkit] [INFO] root 2102: None
2023-08-02 07:07:54,319 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.tfhooks.task_progress_monitor_hook 149: Epoch 0/100: loss: 0.10771 learning rate: 4.9999994e-06 Time taken: 0:00:00 ETA: 0:00:00
2023-08-02 07:07:54,319 [TAO Toolkit] [INFO] nvidia_tao_tf1.core.hooks.sample_counter_hook 76: Train Samples / sec: 1.003
INFO:tensorflow:epoch = 0.002008032128514056, learning_rate = 5.0046265e-06, loss = 0.107642695, step = 2 (8.946 sec)
2023-08-02 07:08:03,247 [TAO Toolkit] [INFO] tensorflow 260: epoch = 0.002008032128514056, learning_rate = 5.0046265e-06, loss = 0.107642695, step = 2 (8.946 sec)
INFO:tensorflow:epoch = 0.023092369477911646, learning_rate = 5.053455e-06, loss = 0.104957655, step = 23 (5.456 sec)
2023-08-02 07:08:08,703 [TAO Toolkit] [INFO] tensorflow 260: epoch = 0.023092369477911646, learning_rate = 5.053455e-06, loss = 0.104957655, step = 23 (5.456 sec)
2023-08-02 07:08:08,962 [TAO Toolkit] [INFO] nvidia_tao_tf1.core.hooks.sample_counter_hook 76: Train Samples / sec: 15.551
INFO:tensorflow:epoch = 0.04417670682730923, learning_rate = 5.1027596e-06, loss = 0.1001761, step = 44 (5.476 sec)
2023-08-02 07:08:14,178 [TAO Toolkit] [INFO] tensorflow 260: epoch = 0.04417670682730923, learning_rate = 5.1027596e-06, loss = 0.1001761, step = 44 (5.476 sec)
2023-08-02 07:08:15,494 [TAO Toolkit] [INFO] nvidia_tao_tf1.core.hooks.sample_counter_hook 76: Train Samples / sec: 91.857
INFO:tensorflow:epoch = 0.06526104417670682, learning_rate = 5.1525503e-06, loss = 0.0976929, step = 65 (5.520 sec)
2023-08-02 07:08:19,699 [TAO Toolkit] [INFO] tensorflow 260: epoch = 0.06526104417670682, learning_rate = 5.1525503e-06, loss = 0.0976929, step = 65 (5.520 sec)
2023-08-02 07:08:22,081 [TAO Toolkit] [INFO] nvidia_tao_tf1.core.hooks.sample_counter_hook 76: Train Samples / sec: 91.099

OK, glad to know it can run. I think you are running with notebooks/tao_api_starter_kit/api/object_detection.ipynb, right?

Did you install the latest tao-client? We will monitor feedback from other users, since I cannot reproduce this.