The training process of TAO Toolkit API UNet always shows Inf status


Hello. When I trained a UNet model following the TAO Toolkit API segmentation notebook, I found that the number of epochs always shows as Inf during the training phase.

As shown in the picture below, the KPI value and the number of epochs did not change even though the current experiment number went from 0 to 20.

What happened during the training phase? What should I do to deal with this problem?

This is the dataset structure
image

This is the content of the image directory
image

This is the content of the image/train directory (there is no .ipynb_checkpoints directory in the image/val and image/test directories)

This is the content of the masks/train directory (there is no .ipynb_checkpoints directory in masks/val)

This is the training phase

Hi @swka1043338,
could you please give me more information? You can use the commands below:

helm ls
kubectl get pod
kubectl describe pod tao-toolkit-api-app-pod-<str>

Could you please also upload all the log files in the folders of this training job? You can find these folders in /mnt/nfs_share/archived-default-tao-toolkit-api-pvc-pvc-<pvc_id>/users/<user_id>/models/<model_id>/
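
For reference, a small hypothetical helper (not part of TAO) can skim the tail of each experiment's log.txt under that folder to spot failing runs. The directory layout follows the paths quoted in this thread; fill in the <...> placeholders with your own values:

import glob
import os

# Hypothetical helper, not part of TAO: print the last lines of every
# experiment's log.txt under the training job folder so failed runs are
# easy to spot. The layout follows the paths quoted in this thread.
job_dir = "/mnt/nfs_share/archived-default-tao-toolkit-api-pvc-pvc-<pvc_id>/users/<user_id>/models/<model_id>/<train_job_id>"

for log_path in sorted(glob.glob(os.path.join(job_dir, "experiment_*", "log.txt"))):
    with open(log_path) as f:
        tail = f.readlines()[-3:]  # last three lines of the log
    print("==", log_path, "==")
    print("".join(tail), end="")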

These are the pods created when I trained the UNet model.

This is the kubectl describe output for the pod named 60aba7a6-0602-4940-adac-23fa8c94e943-jlzh6

Name:         60aba7a6-0602-4940-adac-23fa8c94e943-jlzh6
Namespace:    default
Priority:     0
Node:         admin-ops01/192.168.101.8
Start Time:   Fri, 07 Apr 2023 05:46:03 +0000
Labels:       controller-uid=938d2911-d746-4aad-b187-9d97d5e62306
              job-name=60aba7a6-0602-4940-adac-23fa8c94e943
              purpose=tao-toolkit-job
Annotations:  cni.projectcalico.org/containerID: ece027c53a71d7bb1a61c22ce4daa58c3f1e6a9e60bb8c74be7efa9f79f1daf7
              cni.projectcalico.org/podIP: 192.168.33.92/32
              cni.projectcalico.org/podIPs: 192.168.33.92/32
Status:       Running
IP:           192.168.33.92
IPs:
  IP:           192.168.33.92
Controlled By:  Job/60aba7a6-0602-4940-adac-23fa8c94e943
Containers:
  container:
    Container ID:  containerd://f4703a1e965699f74f1c8a3741d77f24c57f121cf6954ef0cc86079ff24e216f
    Image:         nvcr.io/nvidia/tao/tao-toolkit:4.0.0-api
    Image ID:      nvcr.io/nvidia/tao/tao-toolkit@sha256:db5890fbe2c720ac679a5c1167c495159e4e2bd05cf022f5fbe58bd5c8ad0d8a
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
    Args:
      umask 0 && umask 0 && unzip -q /opt/ngccli/ngccli_linux.zip -d /opt/ngccli/ && /opt/ngccli/ngc-cli/ngc --version && /venv/bin/python3 automl_start.py /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/e437b5bd-ff44-45d5-b373-4cfbc92c0dfb/60aba7a6-0602-4940-adac-23fa8c94e943/ 60aba7a6-0602-4940-adac-23fa8c94e943 unet e437b5bd-ff44-45d5-b373-4cfbc92c0dfb False Bayesian 20 True 27 3 kpi 0 "[]" "[]"
    State:          Running
      Started:      Fri, 07 Apr 2023 05:46:04 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      nvidia.com/gpu:  0
    Requests:
      nvidia.com/gpu:  0
    Environment:
      NUM_GPUS:                0
      TELEMETRY_OPT_OUT:       no
      WANDB_API_KEY:           
      CLEARML_WEB_HOST:        https://app.clear.ml
      CLEARML_API_HOST:        https://api.clear.ml
      CLEARML_FILES_HOST:      https://files.clear.ml
      CLEARML_API_ACCESS_KEY:  
      CLEARML_API_SECRET_KEY:  
    Mounts:
      /dev/shm from dshm (rw)
      /shared from shared-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xjblt (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  shared-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  tao-toolkit-api-pvc
    ReadOnly:   false
  dshm:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  kube-api-access-xjblt:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  20m   default-scheduler  Successfully assigned default/60aba7a6-0602-4940-adac-23fa8c94e943-jlzh6 to admin-ops01
  Normal  Pulled     20m   kubelet            Container image "nvcr.io/nvidia/tao/tao-toolkit:4.0.0-api" already present on machine
  Normal  Created    20m   kubelet            Created container container
  Normal  Started    20m   kubelet            Started container container

This is the kubectl describe output for one of the created pods.

Name:         193da145-19a7-4d22-8a07-0306367da5b5-7mtw4
Namespace:    default
Priority:     0
Node:         admin-ops01/192.168.101.8
Start Time:   Fri, 07 Apr 2023 05:59:02 +0000
Labels:       controller-uid=7fb7356f-4579-4cc4-82c1-66ed39d8e7ea
              job-name=193da145-19a7-4d22-8a07-0306367da5b5
              purpose=tao-toolkit-job
Annotations:  cni.projectcalico.org/containerID: c80d79b37262df1c08108d7cc4f45dc21caca2cd233fb9d40ba829ee0b507b15
              cni.projectcalico.org/podIP: 
              cni.projectcalico.org/podIPs: 
Status:       Failed
IP:           192.168.33.94
IPs:
  IP:           192.168.33.94
Controlled By:  Job/193da145-19a7-4d22-8a07-0306367da5b5
Containers:
  container:
    Container ID:  containerd://e2ab4abc694f32170d636bf66f7faef74954be493245706fb25b6ae18d0553e6
    Image:         nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5
    Image ID:      nvcr.io/nvidia/tao/tao-toolkit@sha256:6282b5b09220942e321a452109ad40cde47e5e490480c405c92b930fff2b0574
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
    Args:
      umask 0 && unet train --gpus $NUM_GPUS  -m /shared/users/00000000-0000-0000-0000-000000000000/models/a45f880a-9e8d-48cf-bf26-b883e1f56205/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5 -e /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/e437b5bd-ff44-45d5-b373-4cfbc92c0dfb/60aba7a6-0602-4940-adac-23fa8c94e943/recommendation_4.kitti -r /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/e437b5bd-ff44-45d5-b373-4cfbc92c0dfb/60aba7a6-0602-4940-adac-23fa8c94e943/experiment_4 -k tlt_encode > /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/e437b5bd-ff44-45d5-b373-4cfbc92c0dfb/60aba7a6-0602-4940-adac-23fa8c94e943/experiment_4/log.txt 2>&1 >> /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/e437b5bd-ff44-45d5-b373-4cfbc92c0dfb/60aba7a6-0602-4940-adac-23fa8c94e943/experiment_4/log.txt
    State:          Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Fri, 07 Apr 2023 05:59:03 +0000
      Finished:     Fri, 07 Apr 2023 05:59:12 +0000
    Ready:          False
    Restart Count:  0
    Limits:
      nvidia.com/gpu:  4
    Requests:
      nvidia.com/gpu:  4
    Environment:
      NUM_GPUS:                4
      TELEMETRY_OPT_OUT:       no
      WANDB_API_KEY:           
      CLEARML_WEB_HOST:        https://app.clear.ml
      CLEARML_API_HOST:        https://api.clear.ml
      CLEARML_FILES_HOST:      https://files.clear.ml
      CLEARML_API_ACCESS_KEY:  
      CLEARML_API_SECRET_KEY:  
    Mounts:
      /dev/shm from dshm (rw)
      /shared from shared-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7f7n4 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  shared-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  tao-toolkit-api-pvc
    ReadOnly:   false
  dshm:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  kube-api-access-7f7n4:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  64s   default-scheduler  Successfully assigned default/193da145-19a7-4d22-8a07-0306367da5b5-7mtw4 to admin-ops01
  Normal  Pulled     64s   kubelet            Container image "nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5" already present on machine
  Normal  Created    64s   kubelet            Created container container
  Normal  Started    63s   kubelet            Started container container

This is the kubectl describe output for tao-toolkit-api-app-pod-d6d46986d-xjcr2

Name:         tao-toolkit-api-app-pod-d6d46986d-xjcr2
Namespace:    default
Priority:     0
Node:         admin-ops01/192.168.101.8
Start Time:   Mon, 03 Apr 2023 08:00:29 +0000
Labels:       name=tao-toolkit-api-app-pod
              pod-template-hash=d6d46986d
Annotations:  cni.projectcalico.org/containerID: 5857cf861b9a9513ec0568aed80b1043f8d7f22f34da3441c827a5a2219440f1
              cni.projectcalico.org/podIP: 192.168.33.108/32
              cni.projectcalico.org/podIPs: 192.168.33.108/32
Status:       Running
IP:           192.168.33.108
IPs:
  IP:           192.168.33.108
Controlled By:  ReplicaSet/tao-toolkit-api-app-pod-d6d46986d
Containers:
  tao-toolkit-api-app:
    Container ID:   containerd://f2be65dd470d602aa9b1c0f9bca1f5d9d25025598d2e0cf211a7711ce22dde25
    Image:          nvcr.io/nvidia/tao/tao-toolkit:4.0.0-api
    Image ID:       nvcr.io/nvidia/tao/tao-toolkit@sha256:db5890fbe2c720ac679a5c1167c495159e4e2bd05cf022f5fbe58bd5c8ad0d8a
    Port:           8000/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Mon, 03 Apr 2023 08:00:33 +0000
    Ready:          True
    Restart Count:  0
    Liveness:       http-get http://:8000/api/v1/health/liveness delay=3s timeout=3s period=10s #success=1 #failure=3
    Readiness:      http-get http://:8000/api/v1/health/readiness delay=3s timeout=3s period=10s #success=1 #failure=3
    Environment:
      NAMESPACE:        default
      CLAIMNAME:        tao-toolkit-api-pvc
      IMAGEPULLSECRET:  imagepullsecret
      AUTH_CLIENT_ID:   bnSePYullXlG-504nOZeNAXemGF6DhoCdYR8ysm088w
      NUM_GPUS:         4
      BACKEND:          local-k8s
      IMAGE_TF:         nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5
      IMAGE_PYT:        nvcr.io/nvidia/tao/tao-toolkit:4.0.0-pyt
      IMAGE_DNV2:       nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5
      IMAGE_DEFAULT:    nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5
      IMAGE_API:        nvcr.io/nvidia/tao/tao-toolkit:4.0.0-api
    Mounts:
      /shared from shared-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4zc5t (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  shared-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  tao-toolkit-api-pvc
    ReadOnly:   false
  kube-api-access-4zc5t:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>

This is the content of the log file in /mnt/nfs_share/archived-default-tao-toolkit-api-pvc-pvc-<pvc_id>/users/<user_id>/models/<model_id>/<train_job_id>/experiment_19

2023-04-07 06:04:20.062506: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
EPRECATION_WARNINGS=1 to re-enable them.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Using TensorFlow backend.
2023-04-07 06:04:27,602 [INFO] root: Starting UNet Training job
Loading experiment spec at /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/e437b5bd-ff44-45d5-b373-4cfbc92c0dfb/60aba7a6-0602-4940-adac-23fa8c94e943/recommendation_19.kitti.
2023-04-07 06:04:27,604 [INFO] __main__: Loading experiment spec at /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/e437b5bd-ff44-45d5-b373-4cfbc92c0dfb/60aba7a6-0602-4940-adac-23fa8c94e943/recommendation_19.kitti.
2023-04-07 06:04:27,604 [INFO] iva.unet.spec_handler.spec_loader: Merging specification from /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/e437b5bd-ff44-45d5-b373-4cfbc92c0dfb/60aba7a6-0602-4940-adac-23fa8c94e943/recommendation_19.kitti
2023-04-07 06:04:27,607 [INFO] root: Initializing the pre-trained weights from /shared/users/00000000-0000-0000-0000-000000000000/models/a45f880a-9e8d-48cf-bf26-b883e1f56205/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

2023-04-07 06:04:27,619 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:190: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

2023-04-07 06:04:27,619 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:190: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

2023-04-07 06:04:27,619 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

2023-04-07 06:04:27,627 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:95: The name tf.reset_default_graph is deprecated. Please use tf.compat.v1.reset_default_graph instead.

2023-04-07 06:04:27,650 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:95: The name tf.reset_default_graph is deprecated. Please use tf.compat.v1.reset_default_graph instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:98: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

2023-04-07 06:04:27,650 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:98: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:102: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

2023-04-07 06:04:27,652 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:102: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

2023-04-07 06:04:27,653 [INFO] iva.unet.model.utilities: Label Id 0: Train Id 0
2023-04-07 06:04:27,653 [INFO] iva.unet.model.utilities: Label Id 1: Train Id 1

Phase train: Total 20 files.
2023-04-07 06:04:27,666 [INFO] iva.unet.model.utilities: The total number of training samples 20 and the batch size per                 GPU 3
2023-04-07 06:04:27,666 [INFO] iva.unet.model.utilities: Cannot iterate over exactly 20 samples with a batch size of 3; each epoch will therefore take one extra step.
2023-04-07 06:04:27,667 [INFO] iva.unet.model.utilities: Cannot iterate over exactly 1 steps per epoch with 3 processors; each processor will therefore take one extra step per epoch.
2023-04-07 06:04:27,667 [INFO] iva.unet.model.utilities: Steps per epoch taken: 2
2023-04-07 06:04:27,667 [INFO] root: Number of save_summary_steps should be less than number of steps per epoch.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Using TensorFlow backend.
2023-04-07 06:04:27,689 [INFO] iva.common.logging.logging: Log file already exists at /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/e437b5bd-ff44-45d5-b373-4cfbc92c0dfb/60aba7a6-0602-4940-adac-23fa8c94e943/experiment_19/status.json
2023-04-07 06:04:27,689 [INFO] root: Starting UNet Training job
2023-04-07 06:04:27,689 [INFO] __main__: Loading experiment spec at /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/e437b5bd-ff44-45d5-b373-4cfbc92c0dfb/60aba7a6-0602-4940-adac-23fa8c94e943/recommendation_19.kitti.
2023-04-07 06:04:27,690 [INFO] iva.unet.spec_handler.spec_loader: Merging specification from /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/e437b5bd-ff44-45d5-b373-4cfbc92c0dfb/60aba7a6-0602-4940-adac-23fa8c94e943/recommendation_19.kitti
2023-04-07 06:04:27,691 [INFO] root: Initializing the pre-trained weights from /shared/users/00000000-0000-0000-0000-000000000000/models/a45f880a-9e8d-48cf-bf26-b883e1f56205/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

2023-04-07 06:04:27,702 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:190: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

2023-04-07 06:04:27,703 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:190: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

2023-04-07 06:04:27,703 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

2023-04-07 06:04:27,710 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Using TensorFlow backend.
2023-04-07 06:04:27,715 [INFO] iva.common.logging.logging: Log file already exists at /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/e437b5bd-ff44-45d5-b373-4cfbc92c0dfb/60aba7a6-0602-4940-adac-23fa8c94e943/experiment_19/status.json
2023-04-07 06:04:27,715 [INFO] root: Starting UNet Training job
2023-04-07 06:04:27,715 [INFO] __main__: Loading experiment spec at /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/e437b5bd-ff44-45d5-b373-4cfbc92c0dfb/60aba7a6-0602-4940-adac-23fa8c94e943/recommendation_19.kitti.
2023-04-07 06:04:27,716 [INFO] iva.unet.spec_handler.spec_loader: Merging specification from /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/e437b5bd-ff44-45d5-b373-4cfbc92c0dfb/60aba7a6-0602-4940-adac-23fa8c94e943/recommendation_19.kitti
2023-04-07 06:04:27,718 [INFO] root: Initializing the pre-trained weights from /shared/users/00000000-0000-0000-0000-000000000000/models/a45f880a-9e8d-48cf-bf26-b883e1f56205/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Using TensorFlow backend.
2023-04-07 06:04:27,725 [INFO] iva.common.logging.logging: Log file already exists at /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/e437b5bd-ff44-45d5-b373-4cfbc92c0dfb/60aba7a6-0602-4940-adac-23fa8c94e943/experiment_19/status.json
2023-04-07 06:04:27,725 [INFO] root: Starting UNet Training job
2023-04-07 06:04:27,725 [INFO] __main__: Loading experiment spec at /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/e437b5bd-ff44-45d5-b373-4cfbc92c0dfb/60aba7a6-0602-4940-adac-23fa8c94e943/recommendation_19.kitti.
2023-04-07 06:04:27,726 [INFO] iva.unet.spec_handler.spec_loader: Merging specification from /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/e437b5bd-ff44-45d5-b373-4cfbc92c0dfb/60aba7a6-0602-4940-adac-23fa8c94e943/recommendation_19.kitti
2023-04-07 06:04:27,727 [INFO] root: Initializing the pre-trained weights from /shared/users/00000000-0000-0000-0000-000000000000/models/a45f880a-9e8d-48cf-bf26-b883e1f56205/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

2023-04-07 06:04:27,730 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:190: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

2023-04-07 06:04:27,730 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:190: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

2023-04-07 06:04:27,730 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:95: The name tf.reset_default_graph is deprecated. Please use tf.compat.v1.reset_default_graph instead.

2023-04-07 06:04:27,732 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:95: The name tf.reset_default_graph is deprecated. Please use tf.compat.v1.reset_default_graph instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:98: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

2023-04-07 06:04:27,732 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:98: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:102: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

2023-04-07 06:04:27,734 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:102: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

2023-04-07 06:04:27,735 [INFO] iva.unet.model.utilities: Label Id 0: Train Id 0
2023-04-07 06:04:27,735 [INFO] iva.unet.model.utilities: Label Id 1: Train Id 1
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

2023-04-07 06:04:27,739 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:190: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

2023-04-07 06:04:27,739 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:190: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

2023-04-07 06:04:27,739 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

2023-04-07 06:04:27,740 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

2023-04-07 06:04:27,748 [INFO] iva.unet.model.utilities: The total number of training samples 20 and the batch size per                 GPU 3
2023-04-07 06:04:27,748 [INFO] iva.unet.model.utilities: Cannot iterate over exactly 20 samples with a batch size of 3; each epoch will therefore take one extra step.

Phase train: Total 20 files.
2023-04-07 06:04:27,748 [INFO] iva.unet.model.utilities: Cannot iterate over exactly 1 steps per epoch with 3 processors; each processor will therefore take one extra step per epoch.
2023-04-07 06:04:27,748 [INFO] iva.unet.model.utilities: Steps per epoch taken: 2
2023-04-07 06:04:27,749 [INFO] root: Number of save_summary_steps should be less than number of steps per epoch.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

2023-04-07 06:04:27,751 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

Traceback (most recent call last):
  File "</usr/local/lib/python3.6/dist-packages/iva/unet/scripts/train.py>", line 3, in <module>
  File "<frozen iva.unet.scripts.train>", line 533, in <module>
  File "<frozen iva.unet.scripts.train>", line 529, in main
  File "<frozen iva.unet.scripts.train>", line 516, in main
  File "<frozen iva.unet.scripts.train>", line 382, in run_experiment
  File "<frozen iva.unet.scripts.train>", line 266, in train_unet
  File "<frozen iva.unet.model.utilities>", line 574, in update_train_params
AssertionError: Number of save_summary_steps should be less than number of steps per epoch.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:95: The name tf.reset_default_graph is deprecated. Please use tf.compat.v1.reset_default_graph instead.

2023-04-07 06:04:27,767 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:95: The name tf.reset_default_graph is deprecated. Please use tf.compat.v1.reset_default_graph instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:98: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

2023-04-07 06:04:27,768 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:98: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:102: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

2023-04-07 06:04:27,769 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:102: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

2023-04-07 06:04:27,771 [INFO] iva.unet.model.utilities: Label Id 0: Train Id 0
2023-04-07 06:04:27,771 [INFO] iva.unet.model.utilities: Label Id 1: Train Id 1
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:95: The name tf.reset_default_graph is deprecated. Please use tf.compat.v1.reset_default_graph instead.

2023-04-07 06:04:27,773 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:95: The name tf.reset_default_graph is deprecated. Please use tf.compat.v1.reset_default_graph instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:98: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

2023-04-07 06:04:27,773 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:98: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:102: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

2023-04-07 06:04:27,775 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:102: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

2023-04-07 06:04:27,776 [INFO] iva.unet.model.utilities: Label Id 0: Train Id 0
2023-04-07 06:04:27,776 [INFO] iva.unet.model.utilities: Label Id 1: Train Id 1

Phase train: Total 20 files.
2023-04-07 06:04:27,783 [INFO] iva.unet.model.utilities: The total number of training samples 20 and the batch size per                 GPU 3
2023-04-07 06:04:27,783 [INFO] iva.unet.model.utilities: Cannot iterate over exactly 20 samples with a batch size of 3; each epoch will therefore take one extra step.
2023-04-07 06:04:27,783 [INFO] iva.unet.model.utilities: Cannot iterate over exactly 1 steps per epoch with 3 processors; each processor will therefore take one extra step per epoch.
2023-04-07 06:04:27,783 [INFO] iva.unet.model.utilities: Steps per epoch taken: 2
2023-04-07 06:04:27,783 [INFO] root: Number of save_summary_steps should be less than number of steps per epoch.

Phase train: Total 20 files.
2023-04-07 06:04:27,788 [INFO] iva.unet.model.utilities: The total number of training samples 20 and the batch size per                 GPU 3
2023-04-07 06:04:27,788 [INFO] iva.unet.model.utilities: Cannot iterate over exactly 20 samples with a batch size of 3; each epoch will therefore take one extra step.
2023-04-07 06:04:27,788 [INFO] iva.unet.model.utilities: Cannot iterate over exactly 1 steps per epoch with 3 processors; each processor will therefore take one extra step per epoch.
2023-04-07 06:04:27,788 [INFO] iva.unet.model.utilities: Steps per epoch taken: 2
2023-04-07 06:04:27,788 [INFO] root: Number of save_summary_steps should be less than number of steps per epoch.
Traceback (most recent call last):
  File "</usr/local/lib/python3.6/dist-packages/iva/unet/scripts/train.py>", line 3, in <module>
  File "<frozen iva.unet.scripts.train>", line 533, in <module>
  File "<frozen iva.unet.scripts.train>", line 529, in main
  File "<frozen iva.unet.scripts.train>", line 516, in main
  File "<frozen iva.unet.scripts.train>", line 382, in run_experiment
  File "<frozen iva.unet.scripts.train>", line 266, in train_unet
  File "<frozen iva.unet.model.utilities>", line 574, in update_train_params
AssertionError: Number of save_summary_steps should be less than number of steps per epoch.
Traceback (most recent call last):
  File "</usr/local/lib/python3.6/dist-packages/iva/unet/scripts/train.py>", line 3, in <module>
  File "<frozen iva.unet.scripts.train>", line 533, in <module>
  File "<frozen iva.unet.scripts.train>", line 529, in main
  File "<frozen iva.unet.scripts.train>", line 516, in main
  File "<frozen iva.unet.scripts.train>", line 382, in run_experiment
  File "<frozen iva.unet.scripts.train>", line 266, in train_unet
  File "<frozen iva.unet.model.utilities>", line 574, in update_train_params
AssertionError: Number of save_summary_steps should be less than number of steps per epoch.
Traceback (most recent call last):
  File "</usr/local/lib/python3.6/dist-packages/iva/unet/scripts/train.py>", line 3, in <module>
  File "<frozen iva.unet.scripts.train>", line 533, in <module>
  File "<frozen iva.unet.scripts.train>", line 529, in main
  File "<frozen iva.unet.scripts.train>", line 516, in main
  File "<frozen iva.unet.scripts.train>", line 382, in run_experiment
  File "<frozen iva.unet.scripts.train>", line 266, in train_unet
  File "<frozen iva.unet.model.utilities>", line 574, in update_train_params
AssertionError: Number of save_summary_steps should be less than number of steps per epoch.
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[4275,1],0]
  Exit code:    1
--------------------------------------------------------------------------
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL

Did you make any changes in the segmentation.ipynb notebook?
Could you please upload the whole <model_id> folder? I'd like to look for more information in it.

This is my model_id folder.
model_id.tar (750 KB)

This is the segmentation notebook I ran.
segmentation.ipynb (57.2 KB)

Excuse me, @Bin_Zhao_NV. Is there anything that should be adjusted in the notebook I uploaded?

Could you please also share the logs of the workflow pod from when the error happened, using this command: kubectl logs -f tao-toolkit-api-workflow-pod-<str>?

Excuse me, @Bin_Zhao_NV. Here are the logs of my workflow pod.

AutoML pipeline
unet train --gpus $NUM_GPUS  -m /shared/users/00000000-0000-0000-0000-000000000000/models/a45f880a-9e8d-48cf-bf26-b883e1f56205/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5 -e /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/recommendation_0.kitti -r /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_0 -k tlt_encode > /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_0/log.txt 2>&1 >> /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_0/log.txt
AutoML pipeline done
AutoML pipeline
unet train --gpus $NUM_GPUS  -m /shared/users/00000000-0000-0000-0000-000000000000/models/a45f880a-9e8d-48cf-bf26-b883e1f56205/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5 -e /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/recommendation_1.kitti -r /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_1 -k tlt_encode > /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_1/log.txt 2>&1 >> /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_1/log.txt
AutoML pipeline done
AutoML pipeline
unet train --gpus $NUM_GPUS  -m /shared/users/00000000-0000-0000-0000-000000000000/models/a45f880a-9e8d-48cf-bf26-b883e1f56205/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5 -e /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/recommendation_2.kitti -r /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_2 -k tlt_encode > /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_2/log.txt 2>&1 >> /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_2/log.txt
AutoML pipeline done
AutoML pipeline
unet train --gpus $NUM_GPUS  -m /shared/users/00000000-0000-0000-0000-000000000000/models/a45f880a-9e8d-48cf-bf26-b883e1f56205/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5 -e /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/recommendation_3.kitti -r /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_3 -k tlt_encode > /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_3/log.txt 2>&1 >> /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_3/log.txt
AutoML pipeline done
AutoML pipeline
unet train --gpus $NUM_GPUS  -m /shared/users/00000000-0000-0000-0000-000000000000/models/a45f880a-9e8d-48cf-bf26-b883e1f56205/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5 -e /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/recommendation_4.kitti -r /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_4 -k tlt_encode > /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_4/log.txt 2>&1 >> /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_4/log.txt
AutoML pipeline done
AutoML pipeline
unet train --gpus $NUM_GPUS  -m /shared/users/00000000-0000-0000-0000-000000000000/models/a45f880a-9e8d-48cf-bf26-b883e1f56205/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5 -e /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/recommendation_5.kitti -r /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_5 -k tlt_encode > /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_5/log.txt 2>&1 >> /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_5/log.txt
AutoML pipeline done
AutoML pipeline
unet train --gpus $NUM_GPUS  -m /shared/users/00000000-0000-0000-0000-000000000000/models/a45f880a-9e8d-48cf-bf26-b883e1f56205/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5 -e /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/recommendation_6.kitti -r /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_6 -k tlt_encode > /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_6/log.txt 2>&1 >> /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_6/log.txt
AutoML pipeline done
AutoML pipeline
unet train --gpus $NUM_GPUS  -m /shared/users/00000000-0000-0000-0000-000000000000/models/a45f880a-9e8d-48cf-bf26-b883e1f56205/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5 -e /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/recommendation_7.kitti -r /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_7 -k tlt_encode > /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_7/log.txt 2>&1 >> /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_7/log.txt
AutoML pipeline done
AutoML pipeline
unet train --gpus $NUM_GPUS  -m /shared/users/00000000-0000-0000-0000-000000000000/models/a45f880a-9e8d-48cf-bf26-b883e1f56205/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5 -e /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/recommendation_8.kitti -r /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_8 -k tlt_encode > /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_8/log.txt 2>&1 >> /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_8/log.txt
AutoML pipeline done
AutoML pipeline
unet train --gpus $NUM_GPUS  -m /shared/users/00000000-0000-0000-0000-000000000000/models/a45f880a-9e8d-48cf-bf26-b883e1f56205/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5 -e /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/recommendation_9.kitti -r /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_9 -k tlt_encode > /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_9/log.txt 2>&1 >> /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_9/log.txt
AutoML pipeline done
AutoML pipeline
unet train --gpus $NUM_GPUS  -m /shared/users/00000000-0000-0000-0000-000000000000/models/a45f880a-9e8d-48cf-bf26-b883e1f56205/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5 -e /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/recommendation_10.kitti -r /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_10 -k tlt_encode > /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_10/log.txt 2>&1 >> /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_10/log.txt
AutoML pipeline done
AutoML pipeline
unet train --gpus $NUM_GPUS  -m /shared/users/00000000-0000-0000-0000-000000000000/models/a45f880a-9e8d-48cf-bf26-b883e1f56205/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5 -e /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/recommendation_11.kitti -r /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_11 -k tlt_encode > /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_11/log.txt 2>&1 >> /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_11/log.txt
AutoML pipeline done
AutoML pipeline
unet train --gpus $NUM_GPUS  -m /shared/users/00000000-0000-0000-0000-000000000000/models/a45f880a-9e8d-48cf-bf26-b883e1f56205/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5 -e /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/recommendation_12.kitti -r /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_12 -k tlt_encode > /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_12/log.txt 2>&1 >> /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_12/log.txt
AutoML pipeline done
AutoML pipeline
unet train --gpus $NUM_GPUS  -m /shared/users/00000000-0000-0000-0000-000000000000/models/a45f880a-9e8d-48cf-bf26-b883e1f56205/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5 -e /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/recommendation_13.kitti -r /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_13 -k tlt_encode > /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_13/log.txt 2>&1 >> /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_13/log.txt
AutoML pipeline done
AutoML pipeline
unet train --gpus $NUM_GPUS  -m /shared/users/00000000-0000-0000-0000-000000000000/models/a45f880a-9e8d-48cf-bf26-b883e1f56205/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5 -e /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/recommendation_14.kitti -r /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_14 -k tlt_encode > /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_14/log.txt 2>&1 >> /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_14/log.txt
AutoML pipeline done
AutoML pipeline
unet train --gpus $NUM_GPUS  -m /shared/users/00000000-0000-0000-0000-000000000000/models/a45f880a-9e8d-48cf-bf26-b883e1f56205/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5 -e /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/recommendation_15.kitti -r /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_15 -k tlt_encode > /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_15/log.txt 2>&1 >> /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_15/log.txt
AutoML pipeline done
AutoML pipeline
unet train --gpus $NUM_GPUS  -m /shared/users/00000000-0000-0000-0000-000000000000/models/a45f880a-9e8d-48cf-bf26-b883e1f56205/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5 -e /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/recommendation_16.kitti -r /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_16 -k tlt_encode > /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_16/log.txt 2>&1 >> /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_16/log.txt
AutoML pipeline done
AutoML pipeline
unet train --gpus $NUM_GPUS  -m /shared/users/00000000-0000-0000-0000-000000000000/models/a45f880a-9e8d-48cf-bf26-b883e1f56205/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5 -e /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/recommendation_17.kitti -r /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_17 -k tlt_encode > /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_17/log.txt 2>&1 >> /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_17/log.txt
AutoML pipeline done
AutoML pipeline
unet train --gpus $NUM_GPUS  -m /shared/users/00000000-0000-0000-0000-000000000000/models/a45f880a-9e8d-48cf-bf26-b883e1f56205/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5 -e /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/recommendation_18.kitti -r /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_18 -k tlt_encode > /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_18/log.txt 2>&1 >> /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_18/log.txt
AutoML pipeline done
AutoML pipeline
unet train --gpus $NUM_GPUS  -m /shared/users/00000000-0000-0000-0000-000000000000/models/a45f880a-9e8d-48cf-bf26-b883e1f56205/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5 -e /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/recommendation_19.kitti -r /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_19 -k tlt_encode > /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_19/log.txt 2>&1 >> /shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/369e1880-f592-43d5-a4cf-22936b552dcd/b313980d-4cdd-44a3-be8c-1b0c71cb2702/experiment_19/log.txt
AutoML pipeline done

Excuse me, @Bin_Zhao_NV. I found that every experiment fails with the error message: AssertionError: Number of save_summary_steps should be less than number of steps per epoch. That presumably explains why the epoch count and KPI never update.

However, I haven't set the number of save_summary_steps or the number of steps per epoch.

Why did I get this error message when I didn't set those parameters?

These two parameters come from the source code; you don't need to set them.
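
For reference, a rough sketch of the arithmetic behind that assertion, using the numbers from the log above (the default value of save_summary_steps shown here is an assumption for illustration; the actual check lives in iva.unet.model.utilities.update_train_params):

import math

# Numbers taken from the training log above; the default value of
# save_summary_steps is an assumption for illustration only.
num_samples = 20         # "Phase train: Total 20 files."
batch_size_per_gpu = 3   # batch size per GPU reported in the log
num_gpus = 4             # NUM_GPUS of the training pod

# 20 samples spread over 4 GPUs at batch size 3 give each GPU
# ceil(20 / (3 * 4)) = 2 steps, matching "Steps per epoch taken: 2".
steps_per_epoch = math.ceil(num_samples / (batch_size_per_gpu * num_gpus))

save_summary_steps = 5   # assumed default; anything >= 2 trips the check
assert save_summary_steps < steps_per_epoch, \
    "Number of save_summary_steps should be less than number of steps per epoch."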
Could you please give me some more information? When the training job starts, you can see that there are two pods: one has the same name as your training job ID, and the other is where the training actually runs, as below:
image
Please exec into these two pods, run the command echo $NUM_GPUS in each, and share the results:

These are the pods created when I trained UNet with AutoML.

The $NUM_GPUS of the pod named after the training job ID is 0.

The $NUM_GPUS of the pod where the training runs is 4.

Did you install tao-toolkit-api with the Helm chart as described in Deployment - NVIDIA Docs?
Have you changed numGpus in tao-toolkit-api/values.yaml?

Do you mean changing numGpus from 4 to 1?

I just want to confirm that you changed numGpus when you installed tao-toolkit-api. It will work if you change numGpus back to 1.

For now, if you don't want to change numGpus from 4 to 1, I'll give you a workaround to train UNet with multiple GPUs later.

@swka1043338
You don't need to change numGpus from 4 to 1.
Please add specs["training_config"]["visualizer"]["save_summary_steps"] = 1 in the notebook cell as below and try again.
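
A minimal sketch of that change, assuming specs is the training-spec dictionary the notebook builds before creating the training job:

# In the cell that prepares the training specs, before the job is created.
# Steps per epoch is only 2 with this small dataset on 4 GPUs, so the
# summary interval must be 1 to satisfy the check in update_train_params.
specs["training_config"]["visualizer"]["save_summary_steps"] = 1

With save_summary_steps = 1, the check save_summary_steps < steps_per_epoch (1 < 2) passes even with all 4 GPUs in use.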

I could run with multiple GPUs, but I got a new error message: tensorflow.python.framework.errors_impl.InvalidArgumentError: device CUDA:0 not supported by XLA service

INFO:tensorflow:Done calling model_fn.
2023-04-18 07:01:10,067 [INFO] tensorflow: Done calling model_fn.
INFO:tensorflow:Graph was finalized.
2023-04-18 07:01:10,419 [INFO] tensorflow: Graph was finalized.
2023-04-18 07:01:10,420 [INFO] root: device CUDA:0 not supported by XLA service
        while setting up XLA_GPU_JIT device number 0
50da6c6f-0e2e-41cf-bd6a-260f5e8bd32a-dhrg8:66:246 [0] NCCL INFO comm 0x7fa00c469820 rank 0 nranks 4 cudaDev 0 busId 60 - Destroy COMPLETE
INFO:tensorflow:Done calling model_fn.
2023-04-18 07:01:10,496 [INFO] tensorflow: Done calling model_fn.
Traceback (most recent call last):
  File "</usr/local/lib/python3.6/dist-packages/iva/unet/scripts/train.py>", line 3, in <module>
  File "<frozen iva.unet.scripts.train>", line 533, in <module>
  File "<frozen iva.unet.scripts.train>", line 529, in main
  File "<frozen iva.unet.scripts.train>", line 516, in main
  File "<frozen iva.unet.scripts.train>", line 387, in run_experiment
  File "<frozen iva.unet.scripts.evaluate>", line 323, in evaluate_unet
  File "<frozen iva.unet.scripts.evaluate>", line 228, in run_evaluate_tlt
  File "<frozen iva.unet.scripts.evaluate>", line 138, in print_compute_metrics
  File "<frozen iva.unet.scripts.evaluate>", line 81, in compute_metrics_masks
  File "/usr/local/lib/python3.6/dist-packages/tqdm/_tqdm.py", line 955, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 638, in predict
    hooks=all_hooks) as mon_sess:
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1014, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 725, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1207, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1212, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 878, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 647, in create_session
    init_fn=self._scaffold.init_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/session_manager.py", line 290, in prepare_session
    config=config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/session_manager.py", line 194, in _restore_checkpoint
    sess = session.Session(self._target, graph=self._graph, config=config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1585, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 699, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InvalidArgumentError: device CUDA:0 not supported by XLA service
        while setting up XLA_GPU_JIT device number 0
INFO:tensorflow:Done calling model_fn.
2023-04-18 07:01:10,580 [INFO] tensorflow: Done calling model_fn.
INFO:tensorflow:Graph was finalized.
2023-04-18 07:01:10,831 [INFO] tensorflow: Graph was finalized.
2023-04-18 07:01:10,832 [INFO] root: device CUDA:0 not supported by XLA service
        while setting up XLA_GPU_JIT device number 0
INFO:tensorflow:Graph was finalized.
2023-04-18 07:01:10,926 [INFO] tensorflow: Graph was finalized.
2023-04-18 07:01:10,926 [INFO] root: device CUDA:0 not supported by XLA service
        while setting up XLA_GPU_JIT device number 0
Traceback (most recent call last):
  File "</usr/local/lib/python3.6/dist-packages/iva/unet/scripts/train.py>", line 3, in <module>
  File "<frozen iva.unet.scripts.train>", line 533, in <module>
  File "<frozen iva.unet.scripts.train>", line 529, in main
  File "<frozen iva.unet.scripts.train>", line 516, in main
  File "<frozen iva.unet.scripts.train>", line 387, in run_experiment
  File "<frozen iva.unet.scripts.evaluate>", line 323, in evaluate_unet
  File "<frozen iva.unet.scripts.evaluate>", line 228, in run_evaluate_tlt
  File "<frozen iva.unet.scripts.evaluate>", line 138, in print_compute_metrics
  File "<frozen iva.unet.scripts.evaluate>", line 81, in compute_metrics_masks
  File "/usr/local/lib/python3.6/dist-packages/tqdm/_tqdm.py", line 955, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 638, in predict
    hooks=all_hooks) as mon_sess:
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1014, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 725, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1207, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1212, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 878, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 647, in create_session
    init_fn=self._scaffold.init_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/session_manager.py", line 290, in prepare_session
    config=config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/session_manager.py", line 194, in _restore_checkpoint
    sess = session.Session(self._target, graph=self._graph, config=config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1585, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 699, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InvalidArgumentError: device CUDA:0 not supported by XLA service
        while setting up XLA_GPU_JIT device number 0
Traceback (most recent call last):
  File "</usr/local/lib/python3.6/dist-packages/iva/unet/scripts/train.py>", line 3, in <module>
  File "<frozen iva.unet.scripts.train>", line 533, in <module>
  File "<frozen iva.unet.scripts.train>", line 529, in main
  File "<frozen iva.unet.scripts.train>", line 516, in main
  File "<frozen iva.unet.scripts.train>", line 387, in run_experiment
  File "<frozen iva.unet.scripts.evaluate>", line 323, in evaluate_unet
  File "<frozen iva.unet.scripts.evaluate>", line 228, in run_evaluate_tlt
  File "<frozen iva.unet.scripts.evaluate>", line 138, in print_compute_metrics
  File "<frozen iva.unet.scripts.evaluate>", line 81, in compute_metrics_masks
  File "/usr/local/lib/python3.6/dist-packages/tqdm/_tqdm.py", line 955, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 638, in predict
    hooks=all_hooks) as mon_sess:
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1014, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 725, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1207, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1212, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 878, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 647, in create_session
    init_fn=self._scaffold.init_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/session_manager.py", line 290, in prepare_session
    config=config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/session_manager.py", line 194, in _restore_checkpoint
    sess = session.Session(self._target, graph=self._graph, config=config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1585, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 699, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InvalidArgumentError: device CUDA:0 not supported by XLA service
        while setting up XLA_GPU_JIT device number 0
model.ckpt-100.meta
INFO:tensorflow:Using config: {'_model_dir': '/shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/58d3d7e4-9886-4995-8d29-1fb280a59108/1a8c03ef-afcd-4eea-b0b9-6361551255bf/experiment_0/weights', '_tf_random_seed': None, '_save_summary_steps': 1, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': gpu_options {
}
allow_soft_placement: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9ff81ceef0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
2023-04-18 07:01:11,345 [INFO] tensorflow: Using config: {'_model_dir': '/shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/58d3d7e4-9886-4995-8d29-1fb280a59108/1a8c03ef-afcd-4eea-b0b9-6361551255bf/experiment_0/weights', '_tf_random_seed': None, '_save_summary_steps': 1, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': gpu_options {
}
allow_soft_placement: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9ff81ceef0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
2023-04-18 07:01:11,347 [INFO] iva.unet.scripts.evaluate: Starting Evaluation.
0it [00:00, ?it/s]WARNING:tensorflow:Entity <bound method Dataset.read_image_and_label_tensors of <iva.unet.utils.data_loader.Dataset object at 0x7f9ffb4a00f0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.read_image_and_label_tensors of <iva.unet.utils.data_loader.Dataset object at 0x7f9ffb4a00f0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-04-18 07:01:11,367 [WARNING] tensorflow: Entity <bound method Dataset.read_image_and_label_tensors of <iva.unet.utils.data_loader.Dataset object at 0x7f9ffb4a00f0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.read_image_and_label_tensors of <iva.unet.utils.data_loader.Dataset object at 0x7f9ffb4a00f0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f9ea9e429d8> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f9ea9e429d8>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-04-18 07:01:11,382 [WARNING] tensorflow: Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f9ea9e429d8> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f9ea9e429d8>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method Dataset.rgb_to_bgr_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f9ffb4a00f0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.rgb_to_bgr_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f9ffb4a00f0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-04-18 07:01:11,392 [WARNING] tensorflow: Entity <bound method Dataset.rgb_to_bgr_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f9ffb4a00f0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.rgb_to_bgr_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f9ffb4a00f0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method Dataset.cast_img_lbl_dtype_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f9ffb4a00f0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.cast_img_lbl_dtype_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f9ffb4a00f0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-04-18 07:01:11,401 [WARNING] tensorflow: Entity <bound method Dataset.cast_img_lbl_dtype_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f9ffb4a00f0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.cast_img_lbl_dtype_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f9ffb4a00f0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method Dataset.resize_image_and_label_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f9ffb4a00f0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.resize_image_and_label_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f9ffb4a00f0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-04-18 07:01:11,410 [WARNING] tensorflow: Entity <bound method Dataset.resize_image_and_label_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f9ffb4a00f0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.resize_image_and_label_tf of <iva.unet.utils.data_loader.Dataset object at 0x7f9ffb4a00f0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f9ea82158c8> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f9ea82158c8>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-04-18 07:01:11,557 [WARNING] tensorflow: Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f9ea82158c8> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f9ea82158c8>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f9ea8215b70> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f9ea8215b70>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-04-18 07:01:11,566 [WARNING] tensorflow: Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f9ea8215b70> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7f9ea8215b70>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method Dataset.transpose_to_nchw of <iva.unet.utils.data_loader.Dataset object at 0x7f9ffb4a00f0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.transpose_to_nchw of <iva.unet.utils.data_loader.Dataset object at 0x7f9ffb4a00f0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-04-18 07:01:11,575 [WARNING] tensorflow: Entity <bound method Dataset.transpose_to_nchw of <iva.unet.utils.data_loader.Dataset object at 0x7f9ffb4a00f0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.transpose_to_nchw of <iva.unet.utils.data_loader.Dataset object at 0x7f9ffb4a00f0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fa12e5777b8> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fa12e5777b8>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-04-18 07:01:11,586 [WARNING] tensorflow: Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fa12e5777b8> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fa12e5777b8>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fa12e577a60> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fa12e577a60>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-04-18 07:01:11,603 [WARNING] tensorflow: Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fa12e577a60> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fa12e577a60>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
INFO:tensorflow:Calling model_fn.
2023-04-18 07:01:11,614 [INFO] tensorflow: Calling model_fn.
2023-04-18 07:01:11,614 [INFO] iva.unet.utils.model_fn: {'exec_mode': 'train', 'model_dir': '/shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/models/58d3d7e4-9886-4995-8d29-1fb280a59108/1a8c03ef-afcd-4eea-b0b9-6361551255bf/experiment_0/weights', 'resize_padding': True, 'resize_method': 'BILINEAR', 'log_dir': None, 'batch_size': 3, 'learning_rate': 0.00040598164196126163, 'activation': 'softmax', 'crossvalidation_idx': None, 'max_steps': None, 'regularizer_type': 1, 'weight_decay': 0.0029441313818097115, 'log_summary_steps': 10, 'warmup_steps': 0, 'augment': False, 'use_amp': False, 'filter_data': True, 'use_trt': False, 'use_xla': False, 'loss': 'cross_entropy', 'epochs': 50, 'pretrained_weights_file': None, 'lr_scheduler': None, 'unet_model': <iva.unet.model.resnet_unet.ResnetUnet object at 0x7f9ea83c1358>, 'key': 'tlt_encode', 'experiment_spec': random_seed: 42
dataset_config {
  dataset: "custom"
  input_image_type: "grayscale"
  train_images_path: "/shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/datasets/ec350625-500b-43bd-879f-9fb592013485/images/train"
  train_masks_path: "/shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/datasets/ec350625-500b-43bd-879f-9fb592013485/masks/train"
  val_images_path: "/shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/datasets/0914ff02-fd5a-4cad-88ef-faed08165ce4/images/val"
  val_masks_path: "/shared/users/80ab3db1-baf9-5608-8a94-f5b86a8cbd59/datasets/0914ff02-fd5a-4cad-88ef-faed08165ce4/masks/val"
  data_class_config {
    target_classes {
      name: "foreground"
      mapping_class: "foreground"
    }
    target_classes {
      name: "background"
      label_id: 1
      mapping_class: "background"
    }
  }
  augmentation_config {
    spatial_augmentation {
      hflip_probability: 0.5
      vflip_probability: 0.5
      crop_and_resize_prob: 0.5
    }
    brightness_augmentation {
      delta: 0.20000000298023224
    }
  }
  resize_padding: true
  resize_method: "BILINEAR"
  filter_data: true
}
model_config {
  num_layers: 18
  use_batch_norm: true
  training_precision {
    backend_floatx: FLOAT32
  }
  arch: "resnet"
  all_projections: true
  model_input_height: 352
  model_input_width: 128
  model_input_channels: 1
}
training_config {
  batch_size: 3
  regularizer {
    type: L1
    weight: 0.0029441313818097115
  }
  optimizer {
    adam {
      epsilon: 9.99999993922529e-09
      beta1: 0.8526546955108643
      beta2: 0.9990000128746033
    }
  }
  checkpoint_interval: 1
  log_summary_steps: 10
  learning_rate: 0.00040598164196126163
  loss: "cross_entropy"
  epochs: 50
  visualizer {
    save_summary_steps: 1
  }
  data_options: true
}
, 'seed': 42, 'benchmark': False, 'temp_dir': '/tmp/tmp9y5k5vtf', 'num_classes': 2, 'num_conf_mat_classes': 2, 'start_step': 0, 'checkpoint_interval': 1, 'model_json': None, 'custom_objs': {}, 'load_graph': False, 'remove_head': False, 'buffer_size': None, 'data_options': True, 'weights_monitor': False, 'visualize': False, 'save_summary_steps': 1, 'infrequent_save_summary_steps': None, 'enable_qat': False, 'phase': 'val', 'model_size': 183.5315408706665}
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[54947,1],1]
  Exit code:    1
--------------------------------------------------------------------------
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL

How about the output of nvidia-smi?

I haven’t installed nvidia-smi.

OK, there might be a GPU error similar to this one: InvalidArgumentError: device CUDA:0 not supported by XLA service while setting up XLA_GPU_JIT device number 0 · Issue #423 · IndicoDataSolutions/finetune (github.com).
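Before that, it may also help to confirm which devices TensorFlow can actually see inside the training container. A minimal check (a sketch assuming the TF 1.x API shipped in the TAO container; run it inside the container, not on the host):

# Enumerate the local devices visible to TensorFlow 1.x; a GPU that fails
# XLA setup is typically missing here or raises during enumeration.
from tensorflow.python.client import device_lib

for device in device_lib.list_local_devices():
    print(device.name, device.device_type)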

Could you please run some experiments with the TAO launcher notebook in getting_started_v4.0.1/notebooks/tao_launcher_starter_kit/unet/tao_isbi/unet_isbi.ipynb, selecting a different GPU each time to see which GPU is failing? For example:

tao unet train --gpus=1 --gpu_index=0 \
              -e $SPECS_DIR/unet_train_resnet_unet_isbi.txt \
              -r $USER_EXPERIMENT_DIR/isbi_experiment_unpruned \
              -m $USER_EXPERIMENT_DIR/pretrained_resnet18/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5 \
              -n model_isbi \
              -k $KEY 

tao unet train --gpus=1 --gpu_index=1 \
              -e $SPECS_DIR/unet_train_resnet_unet_isbi.txt \
              -r $USER_EXPERIMENT_DIR/isbi_experiment_unpruned \
              -m $USER_EXPERIMENT_DIR/pretrained_resnet18/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5 \
              -n model_isbi \
              -k $KEY 

tao unet train --gpus=1 --gpu_index=2 \
              -e $SPECS_DIR/unet_train_resnet_unet_isbi.txt \
              -r $USER_EXPERIMENT_DIR/isbi_experiment_unpruned \
              -m $USER_EXPERIMENT_DIR/pretrained_resnet18/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5 \
              -n model_isbi \
              -k $KEY 

tao unet train --gpus=1 --gpu_index=3 \
              -e $SPECS_DIR/unet_train_resnet_unet_isbi.txt \
              -r $USER_EXPERIMENT_DIR/isbi_experiment_unpruned \
              -m $USER_EXPERIMENT_DIR/pretrained_resnet18/pretrained_semantic_segmentation_vresnet18/resnet_18.hdf5 \
              -n model_isbi \
              -k $KEY