Excuse me, @Bin_Zhao_NV @Morganh.
I have switched GPUs from Tesla P100 to Tesla V100 and tried training the TAO Toolkit UNet model with 4 GPUs again, in both v4.0.0 and v4.0.1.
However, I still get the error message: device CUDA:0 not supported by XLA service while setting up XLA_GPU_JIT device number 0.
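To help narrow this down, here is a minimal check I can run inside the training container (just a sketch, assuming the TensorFlow 1.15 build used by the TAO TF1 containers) to see which devices TensorFlow/XLA actually registers:

```python
# Minimal device-visibility check -- a sketch, assuming the TensorFlow 1.15
# build that ships in the TAO TF1 training containers.
from tensorflow.python.client import device_lib

# Prints every device TensorFlow has registered (CPU, GPU, XLA_CPU, XLA_GPU).
# If the V100s show up as GPU but no matching XLA_GPU entries appear,
# that would be consistent with the XLA service error above.
for dev in device_lib.list_local_devices():
    print(dev.device_type, dev.name)
```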
This is the output of nvidia-smi captured while the UNet training was in progress.
Is this a bug in TAO Toolkit v4.0.0 and v4.0.1? When I trained UNet with v3.22.05, no such errors occurred, as shown in the log below.
INFO:tensorflow:Done calling model_fn.
2023-06-08 10:39:29,646 [INFO] tensorflow: Done calling model_fn.
INFO:tensorflow:Done calling model_fn.
2023-06-08 10:39:29,652 [INFO] tensorflow: Done calling model_fn.
INFO:tensorflow:Done calling model_fn.
2023-06-08 10:39:29,682 [INFO] tensorflow: Done calling model_fn.
INFO:tensorflow:Done calling model_fn.
2023-06-08 10:39:29,749 [INFO] tensorflow: Done calling model_fn.
INFO:tensorflow:Done calling model_fn.
2023-06-08 10:39:30,165 [INFO] tensorflow: Done calling model_fn.
INFO:tensorflow:Graph was finalized.
2023-06-08 10:39:30,315 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Graph was finalized.
2023-06-08 10:39:30,319 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Graph was finalized.
2023-06-08 10:39:30,354 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Graph was finalized.
2023-06-08 10:39:30,431 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Graph was finalized.
2023-06-08 10:39:31,914 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2023-06-08 10:39:31,944 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Running local_init_op.
2023-06-08 10:39:31,963 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Running local_init_op.
2023-06-08 10:39:31,963 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-06-08 10:39:32,053 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-06-08 10:39:32,073 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-06-08 10:39:32,073 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Running local_init_op.
2023-06-08 10:39:32,105 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-06-08 10:39:32,206 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Running local_init_op.
2023-06-08 10:39:33,644 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-06-08 10:39:33,759 [INFO] tensorflow: Done running local_init_op.
[GPU] Restoring pretrained weights from: /tmp/tmpb0kfjiee/model.ckpt
2023-06-08 10:39:34,497 [INFO] iva.unet.hooks.pretrained_restore_hook: Pretrained weights loaded with success...
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:111: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.
2023-06-08 10:39:35,492 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:111: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:111: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.
2023-06-08 10:39:35,495 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:111: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:111: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.
2023-06-08 10:39:35,496 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:111: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:111: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.
2023-06-08 10:39:35,498 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:111: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.
INFO:tensorflow:Saving checkpoints for step-0.
2023-06-08 10:39:38,987 [INFO] tensorflow: Saving checkpoints for step-0.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:111: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.
2023-06-08 10:39:48,316 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:111: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.
f883eb5b84f2:166:895 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
f883eb5b84f2:166:895 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
f883eb5b84f2:166:895 [0] NCCL INFO P2P plugin IBext
f883eb5b84f2:166:895 [0] NCCL INFO NET/IB : No device found.
f883eb5b84f2:166:895 [0] NCCL INFO NET/IB : No device found.
f883eb5b84f2:166:895 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0>
f883eb5b84f2:166:895 [0] NCCL INFO Using network Socket
NCCL version 2.11.4+cuda11.6
f883eb5b84f2:176:889 [4] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
f883eb5b84f2:176:889 [4] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
f883eb5b84f2:176:889 [4] NCCL INFO P2P plugin IBext
f883eb5b84f2:176:889 [4] NCCL INFO NET/IB : No device found.
f883eb5b84f2:176:889 [4] NCCL INFO NET/IB : No device found.
f883eb5b84f2:176:889 [4] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0>
f883eb5b84f2:176:889 [4] NCCL INFO Using network Socket
f883eb5b84f2:169:898 [2] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
f883eb5b84f2:169:898 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
f883eb5b84f2:169:898 [2] NCCL INFO P2P plugin IBext
f883eb5b84f2:169:898 [2] NCCL INFO NET/IB : No device found.
f883eb5b84f2:169:898 [2] NCCL INFO NET/IB : No device found.
f883eb5b84f2:169:898 [2] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0>
f883eb5b84f2:169:898 [2] NCCL INFO Using network Socket
f883eb5b84f2:167:886 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
f883eb5b84f2:167:886 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
f883eb5b84f2:167:886 [1] NCCL INFO P2P plugin IBext
f883eb5b84f2:167:886 [1] NCCL INFO NET/IB : No device found.
f883eb5b84f2:167:886 [1] NCCL INFO NET/IB : No device found.
f883eb5b84f2:167:886 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0>
f883eb5b84f2:167:886 [1] NCCL INFO Using network Socket
f883eb5b84f2:173:890 [3] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
f883eb5b84f2:173:890 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
f883eb5b84f2:173:890 [3] NCCL INFO P2P plugin IBext
f883eb5b84f2:173:890 [3] NCCL INFO NET/IB : No device found.
f883eb5b84f2:173:890 [3] NCCL INFO NET/IB : No device found.
f883eb5b84f2:173:890 [3] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0>
f883eb5b84f2:173:890 [3] NCCL INFO Using network Socket
f883eb5b84f2:166:895 [0] NCCL INFO Channel 00/02 : 0 3 2 4 1
f883eb5b84f2:166:895 [0] NCCL INFO Channel 01/02 : 0 3 2 4 1
f883eb5b84f2:166:895 [0] NCCL INFO Trees [0] 3/-1/-1->0->-1 [1] 3/-1/-1->0->-1
f883eb5b84f2:167:886 [1] NCCL INFO Trees [0] 2/-1/-1->1->3 [1] 2/-1/-1->1->3
f883eb5b84f2:169:898 [2] NCCL INFO Trees [0] 4/-1/-1->2->1 [1] 4/-1/-1->2->1
f883eb5b84f2:173:890 [3] NCCL INFO Trees [0] 1/-1/-1->3->0 [1] 1/-1/-1->3->0
f883eb5b84f2:176:889 [4] NCCL INFO Trees [0] -1/-1/-1->4->2 [1] -1/-1/-1->4->2
f883eb5b84f2:166:895 [0] NCCL INFO Channel 00 : 0[100] -> 3[1c0] via P2P/IPC
f883eb5b84f2:169:898 [2] NCCL INFO Channel 00 : 2[1b0] -> 4[20d0] via P2P/IPC
f883eb5b84f2:166:895 [0] NCCL INFO Channel 01 : 0[100] -> 3[1c0] via P2P/IPC
f883eb5b84f2:169:898 [2] NCCL INFO Channel 01 : 2[1b0] -> 4[20d0] via P2P/IPC
f883eb5b84f2:176:889 [4] NCCL INFO Channel 00 : 4[20d0] -> 1[110] via P2P/indirect/2[1b0]
f883eb5b84f2:176:889 [4] NCCL INFO Channel 01 : 4[20d0] -> 1[110] via P2P/indirect/2[1b0]
f883eb5b84f2:173:890 [3] NCCL INFO Channel 00 : 3[1c0] -> 2[1b0] via P2P/IPC
f883eb5b84f2:173:890 [3] NCCL INFO Channel 01 : 3[1c0] -> 2[1b0] via P2P/IPC
f883eb5b84f2:176:889 [4] NCCL INFO Connected all rings
f883eb5b84f2:167:886 [1] NCCL INFO Channel 00 : 1[110] -> 0[100] via P2P/IPC
f883eb5b84f2:176:889 [4] NCCL INFO Channel 00 : 4[20d0] -> 2[1b0] via P2P/IPC
f883eb5b84f2:167:886 [1] NCCL INFO Channel 01 : 1[110] -> 0[100] via P2P/IPC
f883eb5b84f2:176:889 [4] NCCL INFO Channel 01 : 4[20d0] -> 2[1b0] via P2P/IPC
f883eb5b84f2:173:890 [3] NCCL INFO Connected all rings
f883eb5b84f2:169:898 [2] NCCL INFO Connected all rings
f883eb5b84f2:167:886 [1] NCCL INFO Connected all rings
f883eb5b84f2:166:895 [0] NCCL INFO Connected all rings
f883eb5b84f2:167:886 [1] NCCL INFO Channel 00 : 1[110] -> 2[1b0] via P2P/IPC
f883eb5b84f2:167:886 [1] NCCL INFO Channel 01 : 1[110] -> 2[1b0] via P2P/IPC
f883eb5b84f2:173:890 [3] NCCL INFO Channel 00 : 3[1c0] -> 0[100] via P2P/IPC
f883eb5b84f2:167:886 [1] NCCL INFO Channel 00 : 1[110] -> 3[1c0] via P2P/IPC
f883eb5b84f2:173:890 [3] NCCL INFO Channel 01 : 3[1c0] -> 0[100] via P2P/IPC
f883eb5b84f2:167:886 [1] NCCL INFO Channel 01 : 1[110] -> 3[1c0] via P2P/IPC
f883eb5b84f2:176:889 [4] NCCL INFO Connected all trees
f883eb5b84f2:176:889 [4] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
f883eb5b84f2:176:889 [4] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
f883eb5b84f2:169:898 [2] NCCL INFO Channel 00 : 2[1b0] -> 1[110] via P2P/IPC
f883eb5b84f2:176:889 [4] NCCL INFO Channel 00 : 4[20d0] -> 0[100] via P2P/indirect/2[1b0]
f883eb5b84f2:169:898 [2] NCCL INFO Channel 01 : 2[1b0] -> 1[110] via P2P/IPC
f883eb5b84f2:176:889 [4] NCCL INFO Channel 01 : 4[20d0] -> 0[100] via P2P/indirect/2[1b0]
f883eb5b84f2:166:895 [0] NCCL INFO Connected all trees
f883eb5b84f2:166:895 [0] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
f883eb5b84f2:166:895 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
f883eb5b84f2:173:890 [3] NCCL INFO Channel 00 : 3[1c0] -> 1[110] via P2P/IPC
f883eb5b84f2:166:895 [0] NCCL INFO Channel 00 : 0[100] -> 4[20d0] via P2P/indirect/2[1b0]
f883eb5b84f2:173:890 [3] NCCL INFO Channel 01 : 3[1c0] -> 1[110] via P2P/IPC
f883eb5b84f2:173:890 [3] NCCL INFO Connected all trees
f883eb5b84f2:173:890 [3] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
f883eb5b84f2:173:890 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
f883eb5b84f2:166:895 [0] NCCL INFO Channel 01 : 0[100] -> 4[20d0] via P2P/indirect/2[1b0]
f883eb5b84f2:173:890 [3] NCCL INFO Channel 00 : 3[1c0] -> 4[20d0] via P2P/indirect/2[1b0]
f883eb5b84f2:173:890 [3] NCCL INFO Channel 01 : 3[1c0] -> 4[20d0] via P2P/indirect/2[1b0]
f883eb5b84f2:167:886 [1] NCCL INFO Connected all trees
f883eb5b84f2:167:886 [1] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
f883eb5b84f2:167:886 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
f883eb5b84f2:169:898 [2] NCCL INFO Connected all trees
f883eb5b84f2:169:898 [2] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
f883eb5b84f2:169:898 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
f883eb5b84f2:167:886 [1] NCCL INFO Channel 00 : 1[110] -> 4[20d0] via P2P/indirect/2[1b0]
f883eb5b84f2:167:886 [1] NCCL INFO Channel 01 : 1[110] -> 4[20d0] via P2P/indirect/2[1b0]
f883eb5b84f2:176:889 [4] NCCL INFO Channel 00 : 4[20d0] -> 3[1c0] via P2P/indirect/2[1b0]
f883eb5b84f2:176:889 [4] NCCL INFO Channel 01 : 4[20d0] -> 3[1c0] via P2P/indirect/2[1b0]
f883eb5b84f2:173:890 [3] NCCL INFO comm 0x7fd5087f9550 rank 3 nranks 5 cudaDev 3 busId 1c0 - Init COMPLETE
f883eb5b84f2:176:889 [4] NCCL INFO comm 0x7f9cf87f9820 rank 4 nranks 5 cudaDev 4 busId 20d0 - Init COMPLETE
f883eb5b84f2:169:898 [2] NCCL INFO comm 0x7fb5ec7fa6d0 rank 2 nranks 5 cudaDev 2 busId 1b0 - Init COMPLETE
f883eb5b84f2:167:886 [1] NCCL INFO comm 0x7f4bdc7f9890 rank 1 nranks 5 cudaDev 1 busId 110 - Init COMPLETE
f883eb5b84f2:166:895 [0] NCCL INFO comm 0x7f6b60811fd0 rank 0 nranks 5 cudaDev 0 busId 100 - Init COMPLETE
f883eb5b84f2:166:895 [0] NCCL INFO Launch mode Parallel
2023-06-08 10:39:56,769 [INFO] root: None
Epoch: 0/11:, Cur-Step: 0, loss(cross_entropy): 0.75569, Running average loss:0.75569, Time taken: 0:00:00 ETA: 0:00:00
2023-06-08 10:39:56,828 [INFO] __main__: Epoch: 0/11:, Cur-Step: 0, loss(cross_entropy): 0.75569, Running average loss:0.75569, Time taken: 0:00:00 ETA: 0:00:00
INFO:tensorflow:Saving checkpoints for step-2.
2023-06-08 10:40:00,244 [INFO] tensorflow: Saving checkpoints for step-2.
INFO:tensorflow:Saving checkpoints for step-4.
2023-06-08 10:40:10,701 [INFO] tensorflow: Saving checkpoints for step-4.
INFO:tensorflow:Saving checkpoints for step-6.
2023-06-08 10:40:19,732 [INFO] tensorflow: Saving checkpoints for step-6.
INFO:tensorflow:Saving checkpoints for step-8.
2023-06-08 10:40:28,992 [INFO] tensorflow: Saving checkpoints for step-8.
INFO:tensorflow:Saving checkpoints for step-10.
2023-06-08 10:40:38,437 [INFO] tensorflow: Saving checkpoints for step-10.
2023-06-08 10:40:47,885 [INFO] root: None
Epoch: 5/11:, Cur-Step: 10, loss(cross_entropy): 0.72443, Running average loss:0.72443, Time taken: 0:00:09.482444 ETA: 0:00:56.894661
2023-06-08 10:40:47,985 [INFO] __main__: Epoch: 5/11:, Cur-Step: 10, loss(cross_entropy): 0.72443, Running average loss:0.72443, Time taken: 0:00:09.482444 ETA: 0:00:56.894661
INFO:tensorflow:Saving checkpoints for step-12.
2023-06-08 10:40:48,291 [INFO] tensorflow: Saving checkpoints for step-12.
INFO:tensorflow:Saving checkpoints for step-14.
2023-06-08 10:40:57,538 [INFO] tensorflow: Saving checkpoints for step-14.
INFO:tensorflow:Saving checkpoints for step-16.
2023-06-08 10:41:06,739 [INFO] tensorflow: Saving checkpoints for step-16.
INFO:tensorflow:Saving checkpoints for step-18.
2023-06-08 10:41:16,086 [INFO] tensorflow: Saving checkpoints for step-18.
INFO:tensorflow:Saving checkpoints for step-20.
2023-06-08 10:41:25,417 [INFO] tensorflow: Saving checkpoints for step-20.
2023-06-08 10:41:34,961 [INFO] root: None
Epoch: 10/11:, Cur-Step: 20, loss(cross_entropy): 0.62239, Running average loss:0.62239, Time taken: 0:00:09.437342 ETA: 0:00:09.437342
2023-06-08 10:41:35,023 [INFO] __main__: Epoch: 10/11:, Cur-Step: 20, loss(cross_entropy): 0.62239, Running average loss:0.62239, Time taken: 0:00:09.437342 ETA: 0:00:09.437342
INFO:tensorflow:Saving checkpoints for step-22.
2023-06-08 10:41:35,358 [INFO] tensorflow: Saving checkpoints for step-22.
INFO:tensorflow:Loss for final step: 0.6164588.
2023-06-08 10:41:35,453 [INFO] tensorflow: Loss for final step: 0.6164588.
INFO:tensorflow:Loss for final step: 0.6013098.
2023-06-08 10:41:35,461 [INFO] tensorflow: Loss for final step: 0.6013098.
INFO:tensorflow:Loss for final step: 0.62208736.
2023-06-08 10:41:35,461 [INFO] tensorflow: Loss for final step: 0.62208736.
INFO:tensorflow:Loss for final step: 0.6182792.
2023-06-08 10:41:35,471 [INFO] tensorflow: Loss for final step: 0.6182792.
2023-06-08 10:41:35,476 [INFO] __main__: Saving the final step model to /workspace/tao-experiments/isbi_experiment_unpruned/weights/model_isbi.tlt
2023-06-08 10:41:35,477 [INFO] __main__: Saving the final step model to /workspace/tao-experiments/isbi_experiment_unpruned/weights/model_isbi.tlt
2023-06-08 10:41:35,477 [INFO] __main__: Saving the final step model to /workspace/tao-experiments/isbi_experiment_unpruned/weights/model_isbi.tlt
2023-06-08 10:41:35,517 [INFO] __main__: Saving the final step model to /workspace/tao-experiments/isbi_experiment_unpruned/weights/model_isbi.tlt
Throughput Avg: 67.075 img/s
Latency Avg: 392.697 ms
Latency 90%: 627.808 ms
Latency 95%: 672.829 ms
Latency 99%: 760.871 ms
DLL 2023-06-08 10:41:49.240021 - () throughput_train:67.0745170186196 latency_train:392.69723211015975 elapsed_time:142.369777
INFO:tensorflow:Loss for final step: 0.6112231.
2023-06-08 10:41:49,324 [INFO] tensorflow: Loss for final step: 0.6112231.
Saving the final step model to /workspace/tao-experiments/isbi_experiment_unpruned/weights/model_isbi.tlt
2023-06-08 10:41:49,780 [INFO] __main__: Saving the final step model to /workspace/tao-experiments/isbi_experiment_unpruned/weights/model_isbi.tlt
2023-06-08 10:42:03,500 [INFO] root: Experiment complete.
2023-06-08 10:42:50,187 [INFO] root: Experiment complete.
2023-06-08 10:42:55,107 [INFO] root: Experiment complete.
2023-06-08 10:42:55,107 [INFO] root: Experiment complete.
2023-06-08 10:42:55,110 [INFO] root: Experiment complete.