Cannot train TAO Toolkit UNet model in versions v4.0.0 and v4.0.1

Excuse me @Bin_Zhao_NV @Morganh

I've switched GPUs from Tesla P100 to Tesla V100 and tried again to train the TAO Toolkit UNet model with 4 GPUs in versions v4.0.0 and v4.0.1.

However, I still got the error message: device CUDA:0 not supported by XLA service while setting up XLA_GPU_JIT device number 0.

This is the nvidia-smi output captured while the UNet training was running.

Is this a bug in TAO Toolkit v4.0.0 and v4.0.1? When I trained UNet in v3.22.05, no errors occurred, as shown in the log below.

INFO:tensorflow:Done calling model_fn.
2023-06-08 10:39:29,646 [INFO] tensorflow: Done calling model_fn.
INFO:tensorflow:Done calling model_fn.
2023-06-08 10:39:29,652 [INFO] tensorflow: Done calling model_fn.
INFO:tensorflow:Done calling model_fn.
2023-06-08 10:39:29,682 [INFO] tensorflow: Done calling model_fn.
INFO:tensorflow:Done calling model_fn.
2023-06-08 10:39:29,749 [INFO] tensorflow: Done calling model_fn.
INFO:tensorflow:Done calling model_fn.
2023-06-08 10:39:30,165 [INFO] tensorflow: Done calling model_fn.
INFO:tensorflow:Graph was finalized.
2023-06-08 10:39:30,315 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Graph was finalized.
2023-06-08 10:39:30,319 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Graph was finalized.
2023-06-08 10:39:30,354 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Graph was finalized.
2023-06-08 10:39:30,431 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Graph was finalized.
2023-06-08 10:39:31,914 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2023-06-08 10:39:31,944 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Running local_init_op.
2023-06-08 10:39:31,963 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Running local_init_op.
2023-06-08 10:39:31,963 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-06-08 10:39:32,053 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-06-08 10:39:32,073 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-06-08 10:39:32,073 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Running local_init_op.
2023-06-08 10:39:32,105 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-06-08 10:39:32,206 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Running local_init_op.
2023-06-08 10:39:33,644 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-06-08 10:39:33,759 [INFO] tensorflow: Done running local_init_op.
[GPU] Restoring pretrained weights from: /tmp/tmpb0kfjiee/model.ckpt
2023-06-08 10:39:34,497 [INFO] iva.unet.hooks.pretrained_restore_hook: Pretrained weights loaded with success...

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:111: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

2023-06-08 10:39:35,492 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:111: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:111: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

2023-06-08 10:39:35,495 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:111: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:111: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

2023-06-08 10:39:35,496 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:111: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:111: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

2023-06-08 10:39:35,498 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:111: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

INFO:tensorflow:Saving checkpoints for step-0.
2023-06-08 10:39:38,987 [INFO] tensorflow: Saving checkpoints for step-0.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:111: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

2023-06-08 10:39:48,316 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:111: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

f883eb5b84f2:166:895 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
f883eb5b84f2:166:895 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
f883eb5b84f2:166:895 [0] NCCL INFO P2P plugin IBext
f883eb5b84f2:166:895 [0] NCCL INFO NET/IB : No device found.
f883eb5b84f2:166:895 [0] NCCL INFO NET/IB : No device found.
f883eb5b84f2:166:895 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0>
f883eb5b84f2:166:895 [0] NCCL INFO Using network Socket
NCCL version 2.11.4+cuda11.6
f883eb5b84f2:176:889 [4] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
f883eb5b84f2:176:889 [4] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
f883eb5b84f2:176:889 [4] NCCL INFO P2P plugin IBext
f883eb5b84f2:176:889 [4] NCCL INFO NET/IB : No device found.
f883eb5b84f2:176:889 [4] NCCL INFO NET/IB : No device found.
f883eb5b84f2:176:889 [4] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0>
f883eb5b84f2:176:889 [4] NCCL INFO Using network Socket
f883eb5b84f2:169:898 [2] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
f883eb5b84f2:169:898 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
f883eb5b84f2:169:898 [2] NCCL INFO P2P plugin IBext
f883eb5b84f2:169:898 [2] NCCL INFO NET/IB : No device found.
f883eb5b84f2:169:898 [2] NCCL INFO NET/IB : No device found.
f883eb5b84f2:169:898 [2] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0>
f883eb5b84f2:169:898 [2] NCCL INFO Using network Socket
f883eb5b84f2:167:886 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
f883eb5b84f2:167:886 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
f883eb5b84f2:167:886 [1] NCCL INFO P2P plugin IBext
f883eb5b84f2:167:886 [1] NCCL INFO NET/IB : No device found.
f883eb5b84f2:167:886 [1] NCCL INFO NET/IB : No device found.
f883eb5b84f2:167:886 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0>
f883eb5b84f2:167:886 [1] NCCL INFO Using network Socket
f883eb5b84f2:173:890 [3] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
f883eb5b84f2:173:890 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
f883eb5b84f2:173:890 [3] NCCL INFO P2P plugin IBext
f883eb5b84f2:173:890 [3] NCCL INFO NET/IB : No device found.
f883eb5b84f2:173:890 [3] NCCL INFO NET/IB : No device found.
f883eb5b84f2:173:890 [3] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0>
f883eb5b84f2:173:890 [3] NCCL INFO Using network Socket
f883eb5b84f2:166:895 [0] NCCL INFO Channel 00/02 :    0   3   2   4   1
f883eb5b84f2:166:895 [0] NCCL INFO Channel 01/02 :    0   3   2   4   1
f883eb5b84f2:166:895 [0] NCCL INFO Trees [0] 3/-1/-1->0->-1 [1] 3/-1/-1->0->-1
f883eb5b84f2:167:886 [1] NCCL INFO Trees [0] 2/-1/-1->1->3 [1] 2/-1/-1->1->3
f883eb5b84f2:169:898 [2] NCCL INFO Trees [0] 4/-1/-1->2->1 [1] 4/-1/-1->2->1
f883eb5b84f2:173:890 [3] NCCL INFO Trees [0] 1/-1/-1->3->0 [1] 1/-1/-1->3->0
f883eb5b84f2:176:889 [4] NCCL INFO Trees [0] -1/-1/-1->4->2 [1] -1/-1/-1->4->2
f883eb5b84f2:166:895 [0] NCCL INFO Channel 00 : 0[100] -> 3[1c0] via P2P/IPC
f883eb5b84f2:169:898 [2] NCCL INFO Channel 00 : 2[1b0] -> 4[20d0] via P2P/IPC
f883eb5b84f2:166:895 [0] NCCL INFO Channel 01 : 0[100] -> 3[1c0] via P2P/IPC
f883eb5b84f2:169:898 [2] NCCL INFO Channel 01 : 2[1b0] -> 4[20d0] via P2P/IPC
f883eb5b84f2:176:889 [4] NCCL INFO Channel 00 : 4[20d0] -> 1[110] via P2P/indirect/2[1b0]
f883eb5b84f2:176:889 [4] NCCL INFO Channel 01 : 4[20d0] -> 1[110] via P2P/indirect/2[1b0]
f883eb5b84f2:173:890 [3] NCCL INFO Channel 00 : 3[1c0] -> 2[1b0] via P2P/IPC
f883eb5b84f2:173:890 [3] NCCL INFO Channel 01 : 3[1c0] -> 2[1b0] via P2P/IPC
f883eb5b84f2:176:889 [4] NCCL INFO Connected all rings
f883eb5b84f2:167:886 [1] NCCL INFO Channel 00 : 1[110] -> 0[100] via P2P/IPC
f883eb5b84f2:176:889 [4] NCCL INFO Channel 00 : 4[20d0] -> 2[1b0] via P2P/IPC
f883eb5b84f2:167:886 [1] NCCL INFO Channel 01 : 1[110] -> 0[100] via P2P/IPC
f883eb5b84f2:176:889 [4] NCCL INFO Channel 01 : 4[20d0] -> 2[1b0] via P2P/IPC
f883eb5b84f2:173:890 [3] NCCL INFO Connected all rings
f883eb5b84f2:169:898 [2] NCCL INFO Connected all rings
f883eb5b84f2:167:886 [1] NCCL INFO Connected all rings
f883eb5b84f2:166:895 [0] NCCL INFO Connected all rings
f883eb5b84f2:167:886 [1] NCCL INFO Channel 00 : 1[110] -> 2[1b0] via P2P/IPC
f883eb5b84f2:167:886 [1] NCCL INFO Channel 01 : 1[110] -> 2[1b0] via P2P/IPC
f883eb5b84f2:173:890 [3] NCCL INFO Channel 00 : 3[1c0] -> 0[100] via P2P/IPC
f883eb5b84f2:167:886 [1] NCCL INFO Channel 00 : 1[110] -> 3[1c0] via P2P/IPC
f883eb5b84f2:173:890 [3] NCCL INFO Channel 01 : 3[1c0] -> 0[100] via P2P/IPC
f883eb5b84f2:167:886 [1] NCCL INFO Channel 01 : 1[110] -> 3[1c0] via P2P/IPC
f883eb5b84f2:176:889 [4] NCCL INFO Connected all trees
f883eb5b84f2:176:889 [4] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
f883eb5b84f2:176:889 [4] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
f883eb5b84f2:169:898 [2] NCCL INFO Channel 00 : 2[1b0] -> 1[110] via P2P/IPC
f883eb5b84f2:176:889 [4] NCCL INFO Channel 00 : 4[20d0] -> 0[100] via P2P/indirect/2[1b0]
f883eb5b84f2:169:898 [2] NCCL INFO Channel 01 : 2[1b0] -> 1[110] via P2P/IPC
f883eb5b84f2:176:889 [4] NCCL INFO Channel 01 : 4[20d0] -> 0[100] via P2P/indirect/2[1b0]
f883eb5b84f2:166:895 [0] NCCL INFO Connected all trees
f883eb5b84f2:166:895 [0] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
f883eb5b84f2:166:895 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
f883eb5b84f2:173:890 [3] NCCL INFO Channel 00 : 3[1c0] -> 1[110] via P2P/IPC
f883eb5b84f2:166:895 [0] NCCL INFO Channel 00 : 0[100] -> 4[20d0] via P2P/indirect/2[1b0]
f883eb5b84f2:173:890 [3] NCCL INFO Channel 01 : 3[1c0] -> 1[110] via P2P/IPC
f883eb5b84f2:173:890 [3] NCCL INFO Connected all trees
f883eb5b84f2:173:890 [3] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
f883eb5b84f2:173:890 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
f883eb5b84f2:166:895 [0] NCCL INFO Channel 01 : 0[100] -> 4[20d0] via P2P/indirect/2[1b0]
f883eb5b84f2:173:890 [3] NCCL INFO Channel 00 : 3[1c0] -> 4[20d0] via P2P/indirect/2[1b0]
f883eb5b84f2:173:890 [3] NCCL INFO Channel 01 : 3[1c0] -> 4[20d0] via P2P/indirect/2[1b0]
f883eb5b84f2:167:886 [1] NCCL INFO Connected all trees
f883eb5b84f2:167:886 [1] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
f883eb5b84f2:167:886 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
f883eb5b84f2:169:898 [2] NCCL INFO Connected all trees
f883eb5b84f2:169:898 [2] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
f883eb5b84f2:169:898 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
f883eb5b84f2:167:886 [1] NCCL INFO Channel 00 : 1[110] -> 4[20d0] via P2P/indirect/2[1b0]
f883eb5b84f2:167:886 [1] NCCL INFO Channel 01 : 1[110] -> 4[20d0] via P2P/indirect/2[1b0]
f883eb5b84f2:176:889 [4] NCCL INFO Channel 00 : 4[20d0] -> 3[1c0] via P2P/indirect/2[1b0]
f883eb5b84f2:176:889 [4] NCCL INFO Channel 01 : 4[20d0] -> 3[1c0] via P2P/indirect/2[1b0]
f883eb5b84f2:173:890 [3] NCCL INFO comm 0x7fd5087f9550 rank 3 nranks 5 cudaDev 3 busId 1c0 - Init COMPLETE
f883eb5b84f2:176:889 [4] NCCL INFO comm 0x7f9cf87f9820 rank 4 nranks 5 cudaDev 4 busId 20d0 - Init COMPLETE
f883eb5b84f2:169:898 [2] NCCL INFO comm 0x7fb5ec7fa6d0 rank 2 nranks 5 cudaDev 2 busId 1b0 - Init COMPLETE
f883eb5b84f2:167:886 [1] NCCL INFO comm 0x7f4bdc7f9890 rank 1 nranks 5 cudaDev 1 busId 110 - Init COMPLETE
f883eb5b84f2:166:895 [0] NCCL INFO comm 0x7f6b60811fd0 rank 0 nranks 5 cudaDev 0 busId 100 - Init COMPLETE
f883eb5b84f2:166:895 [0] NCCL INFO Launch mode Parallel
2023-06-08 10:39:56,769 [INFO] root: None
Epoch: 0/11:, Cur-Step: 0, loss(cross_entropy): 0.75569, Running average loss:0.75569, Time taken: 0:00:00 ETA: 0:00:00
2023-06-08 10:39:56,828 [INFO] __main__: Epoch: 0/11:, Cur-Step: 0, loss(cross_entropy): 0.75569, Running average loss:0.75569, Time taken: 0:00:00 ETA: 0:00:00
INFO:tensorflow:Saving checkpoints for step-2.
2023-06-08 10:40:00,244 [INFO] tensorflow: Saving checkpoints for step-2.
INFO:tensorflow:Saving checkpoints for step-4.
2023-06-08 10:40:10,701 [INFO] tensorflow: Saving checkpoints for step-4.
INFO:tensorflow:Saving checkpoints for step-6.
2023-06-08 10:40:19,732 [INFO] tensorflow: Saving checkpoints for step-6.
INFO:tensorflow:Saving checkpoints for step-8.
2023-06-08 10:40:28,992 [INFO] tensorflow: Saving checkpoints for step-8.
INFO:tensorflow:Saving checkpoints for step-10.
2023-06-08 10:40:38,437 [INFO] tensorflow: Saving checkpoints for step-10.
2023-06-08 10:40:47,885 [INFO] root: None
Epoch: 5/11:, Cur-Step: 10, loss(cross_entropy): 0.72443, Running average loss:0.72443, Time taken: 0:00:09.482444 ETA: 0:00:56.894661
2023-06-08 10:40:47,985 [INFO] __main__: Epoch: 5/11:, Cur-Step: 10, loss(cross_entropy): 0.72443, Running average loss:0.72443, Time taken: 0:00:09.482444 ETA: 0:00:56.894661
INFO:tensorflow:Saving checkpoints for step-12.
2023-06-08 10:40:48,291 [INFO] tensorflow: Saving checkpoints for step-12.
INFO:tensorflow:Saving checkpoints for step-14.
2023-06-08 10:40:57,538 [INFO] tensorflow: Saving checkpoints for step-14.
INFO:tensorflow:Saving checkpoints for step-16.
2023-06-08 10:41:06,739 [INFO] tensorflow: Saving checkpoints for step-16.
INFO:tensorflow:Saving checkpoints for step-18.
2023-06-08 10:41:16,086 [INFO] tensorflow: Saving checkpoints for step-18.
INFO:tensorflow:Saving checkpoints for step-20.
2023-06-08 10:41:25,417 [INFO] tensorflow: Saving checkpoints for step-20.
2023-06-08 10:41:34,961 [INFO] root: None
Epoch: 10/11:, Cur-Step: 20, loss(cross_entropy): 0.62239, Running average loss:0.62239, Time taken: 0:00:09.437342 ETA: 0:00:09.437342
2023-06-08 10:41:35,023 [INFO] __main__: Epoch: 10/11:, Cur-Step: 20, loss(cross_entropy): 0.62239, Running average loss:0.62239, Time taken: 0:00:09.437342 ETA: 0:00:09.437342
INFO:tensorflow:Saving checkpoints for step-22.
2023-06-08 10:41:35,358 [INFO] tensorflow: Saving checkpoints for step-22.
INFO:tensorflow:Loss for final step: 0.6164588.
2023-06-08 10:41:35,453 [INFO] tensorflow: Loss for final step: 0.6164588.
INFO:tensorflow:Loss for final step: 0.6013098.
2023-06-08 10:41:35,461 [INFO] tensorflow: Loss for final step: 0.6013098.
INFO:tensorflow:Loss for final step: 0.62208736.
2023-06-08 10:41:35,461 [INFO] tensorflow: Loss for final step: 0.62208736.
INFO:tensorflow:Loss for final step: 0.6182792.
2023-06-08 10:41:35,471 [INFO] tensorflow: Loss for final step: 0.6182792.
2023-06-08 10:41:35,476 [INFO] __main__: Saving the final step model to /workspace/tao-experiments/isbi_experiment_unpruned/weights/model_isbi.tlt
2023-06-08 10:41:35,477 [INFO] __main__: Saving the final step model to /workspace/tao-experiments/isbi_experiment_unpruned/weights/model_isbi.tlt
2023-06-08 10:41:35,477 [INFO] __main__: Saving the final step model to /workspace/tao-experiments/isbi_experiment_unpruned/weights/model_isbi.tlt
2023-06-08 10:41:35,517 [INFO] __main__: Saving the final step model to /workspace/tao-experiments/isbi_experiment_unpruned/weights/model_isbi.tlt
Throughput Avg: 67.075 img/s
Latency Avg: 392.697 ms
Latency 90%: 627.808 ms
Latency 95%: 672.829 ms
Latency 99%: 760.871 ms
DLL 2023-06-08 10:41:49.240021 - () throughput_train:67.0745170186196  latency_train:392.69723211015975 elapsed_time:142.369777
INFO:tensorflow:Loss for final step: 0.6112231.
2023-06-08 10:41:49,324 [INFO] tensorflow: Loss for final step: 0.6112231.
Saving the final step model to /workspace/tao-experiments/isbi_experiment_unpruned/weights/model_isbi.tlt
2023-06-08 10:41:49,780 [INFO] __main__: Saving the final step model to /workspace/tao-experiments/isbi_experiment_unpruned/weights/model_isbi.tlt
2023-06-08 10:42:03,500 [INFO] root: Experiment complete.
2023-06-08 10:42:50,187 [INFO] root: Experiment complete.
2023-06-08 10:42:55,107 [INFO] root: Experiment complete.
2023-06-08 10:42:55,107 [INFO] root: Experiment complete.
2023-06-08 10:42:55,110 [INFO] root: Experiment complete.

Could you please update the NVIDIA driver to 525?
Uninstall:
sudo apt purge nvidia-driver-515
sudo apt autoremove
sudo apt autoclean

Install:
sudo apt install nvidia-driver-525
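
After reinstalling and rebooting, you can confirm the new driver is active, for example with nvidia-smi's query mode (the plain nvidia-smi header also shows the driver version):

nvidia-smi --query-gpu=driver_version --format=csv,noheader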

Excuse me @Morganh

Is the NCCL version the reason, as that post stated? I ask because I did not get this error message in TAO Toolkit v3.22.05.

The log from training the TAO Toolkit UNet in v3.22.05 is the same as the one I posted above.

In the 4.0.1 docker, could you add the line below into the training_config and then retry?
use_xla: false
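
For reference, a minimal sketch of where the flag would sit in the spec file (all other fields in your existing training_config stay unchanged; the surrounding comment is a placeholder):

training_config {
  # ... your existing training fields ...
  use_xla: false
}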

Also, please run the commands below to check whether it works:

docker run --runtime=nvidia -it --rm tensorflow/tensorflow:latest-gpu
python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"
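
Note: if tensorflow/tensorflow:latest-gpu resolves to a TF 2.x image, tf.enable_eager_execution() and tf.random_normal no longer exist under those names (eager execution is the default there), so an equivalent check would be something like:

python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"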

Do you mean adding use_xla: false to the training_config section in the file named unet_train_resnet_unet_isbi.txt?

Yes.
Also, another experiment to try is:

docker run --runtime=nvidia -it --rm tensorflow/tensorflow:latest-gpu
python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"

After adding use_xla: false to the training_config, I still got the error message device CUDA:0 not supported by XLA service while setting up XLA_GPU_JIT device number 0:

INFO:tensorflow:Done calling model_fn.
2023-06-14 08:21:24,413 [INFO] tensorflow: Done calling model_fn.
INFO:tensorflow:Done calling model_fn.
2023-06-14 08:21:24,539 [INFO] tensorflow: Done calling model_fn.
INFO:tensorflow:Graph was finalized.
2023-06-14 08:21:24,547 [INFO] tensorflow: Graph was finalized.
2023-06-14 08:21:24,548 [INFO] root: device CUDA:0 not supported by XLA service
        while setting up XLA_GPU_JIT device number 0
INFO:tensorflow:Done calling model_fn.
2023-06-14 08:21:24,561 [INFO] tensorflow: Done calling model_fn.
INFO:tensorflow:Graph was finalized.
2023-06-14 08:21:24,675 [INFO] tensorflow: Graph was finalized.
2023-06-14 08:21:24,676 [INFO] root: device CUDA:0 not supported by XLA service
        while setting up XLA_GPU_JIT device number 0
INFO:tensorflow:Graph was finalized.
2023-06-14 08:21:24,703 [INFO] tensorflow: Graph was finalized.
2023-06-14 08:21:24,704 [INFO] root: device CUDA:0 not supported by XLA service
        while setting up XLA_GPU_JIT device number 0
521fd6662d1c:139:341 [0] NCCL INFO comm 0x7feee0410b00 rank 0 nranks 4 cudaDev 0 busId 60 - Destroy COMPLETE
Traceback (most recent call last):
  File "</usr/local/lib/python3.6/dist-packages/iva/unet/scripts/train.py>", line 3, in <module>
  File "<frozen iva.unet.scripts.train>", line 579, in <module>
  File "<frozen iva.unet.scripts.train>", line 571, in main
  File "<frozen iva.unet.scripts.train>", line 558, in main
  File "<frozen iva.unet.scripts.train>", line 425, in run_experiment
  File "<frozen iva.unet.scripts.evaluate>", line 323, in evaluate_unet
  File "<frozen iva.unet.scripts.evaluate>", line 228, in run_evaluate_tlt
  File "<frozen iva.unet.scripts.evaluate>", line 138, in print_compute_metrics
  File "<frozen iva.unet.scripts.evaluate>", line 81, in compute_metrics_masks
  File "/usr/local/lib/python3.6/dist-packages/tqdm/_tqdm.py", line 955, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 638, in predict
    hooks=all_hooks) as mon_sess:
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1014, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 725, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1207, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1212, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 878, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 647, in create_session
    init_fn=self._scaffold.init_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/session_manager.py", line 290, in prepare_session
    config=config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/session_manager.py", line 194, in _restore_checkpoint
    sess = session.Session(self._target, graph=self._graph, config=config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1585, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 699, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InvalidArgumentError: device CUDA:0 not supported by XLA service
        while setting up XLA_GPU_JIT device number 0
[The identical traceback was printed twice more, once by each of the other two worker processes; omitted here.]
model.ckpt-22.meta
INFO:tensorflow:Using config: {'_model_dir': '/workspace/tao-experiments/isbi_experiment_unpruned/weights', '_tf_random_seed': None, '_save_summary_steps': 1, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': gpu_options {
}
allow_soft_placement: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7feff80bed68>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
2023-06-14 08:21:25,639 [INFO] tensorflow: Using config: {'_model_dir': '/workspace/tao-experiments/isbi_experiment_unpruned/weights', '_tf_random_seed': None, '_save_summary_steps': 1, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': gpu_options {
}
allow_soft_placement: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7feff80bed68>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
2023-06-14 08:21:25,640 [INFO] iva.unet.scripts.evaluate: Starting Evaluation.
0it [00:00, ?it/s]WARNING:tensorflow:Entity <bound method Dataset.read_image_and_label_tensors of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.read_image_and_label_tensors of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-06-14 08:21:25,655 [WARNING] tensorflow: Entity <bound method Dataset.read_image_and_label_tensors of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.read_image_and_label_tensors of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef5c05a8c8> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef5c05a8c8>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-06-14 08:21:25,670 [WARNING] tensorflow: Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef5c05a8c8> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef5c05a8c8>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method Dataset.rgb_to_bgr_tf of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.rgb_to_bgr_tf of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-06-14 08:21:25,679 [WARNING] tensorflow: Entity <bound method Dataset.rgb_to_bgr_tf of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.rgb_to_bgr_tf of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method Dataset.cast_img_lbl_dtype_tf of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.cast_img_lbl_dtype_tf of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-06-14 08:21:25,688 [WARNING] tensorflow: Entity <bound method Dataset.cast_img_lbl_dtype_tf of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.cast_img_lbl_dtype_tf of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method Dataset.resize_image_and_label_tf of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.resize_image_and_label_tf of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-06-14 08:21:25,696 [WARNING] tensorflow: Entity <bound method Dataset.resize_image_and_label_tf of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.resize_image_and_label_tf of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef59aff7b8> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef59aff7b8>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-06-14 08:21:25,712 [WARNING] tensorflow: Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef59aff7b8> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef59aff7b8>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef59affa60> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef59affa60>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-06-14 08:21:25,720 [WARNING] tensorflow: Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef59affa60> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef59affa60>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method Dataset.transpose_to_nchw of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.transpose_to_nchw of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-06-14 08:21:25,728 [WARNING] tensorflow: Entity <bound method Dataset.transpose_to_nchw of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.transpose_to_nchw of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef59affbf8> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef59affbf8>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-06-14 08:21:25,738 [WARNING] tensorflow: Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef59affbf8> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef59affbf8>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef59b3c598> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef59b3c598>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-06-14 08:21:25,755 [WARNING] tensorflow: Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef59b3c598> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef59b3c598>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
INFO:tensorflow:Calling model_fn.
2023-06-14 08:21:25,765 [INFO] tensorflow: Calling model_fn.
2023-06-14 08:21:25,765 [INFO] iva.unet.utils.model_fn: {'exec_mode': 'train', 'model_dir': '/workspace/tao-experiments/isbi_experiment_unpruned/weights', 'resize_padding': False, 'resize_method': 'BILINEAR', 'log_dir': None, 'batch_size': 3, 'learning_rate': 9.999999747378752e-05, 'activation': 'softmax', 'crossvalidation_idx': None, 'max_steps': None, 'regularizer_type': 2, 'weight_decay': 1.9999999494757503e-05, 'log_summary_steps': 10, 'warmup_steps': 0, 'augment': False, 'use_amp': False, 'filter_data': False, 'use_trt': False, 'use_xla': False, 'loss': 'cross_entropy', 'epochs': 11, 'pretrained_weights_file': None, 'lr_scheduler': None, 'unet_model': <iva.unet.model.resnet_unet.ResnetUnet object at 0x7fef59af3160>, 'key': 'nvidia_tlt', 'experiment_spec': random_seed: 42
dataset_config {
  dataset: "custom"
  input_image_type: "grayscale"
  train_images_path: "/workspace/tao-experiments/data/images/train"
  train_masks_path: "/workspace/tao-experiments/data/masks/train"
  val_images_path: "/workspace/tao-experiments/data/images/val"
  val_masks_path: "/workspace/tao-experiments/data/masks/val"
  test_images_path: "/workspace/tao-experiments/data/images/test"
  data_class_config {
    target_classes {
      name: "foreground"
      mapping_class: "foreground"
    }
    target_classes {
      name: "background"
      label_id: 1
      mapping_class: "background"
    }
  }
  augmentation_config {
    spatial_augmentation {
      hflip_probability: 0.5
      vflip_probability: 0.5
      crop_and_resize_prob: 0.5
    }
    brightness_augmentation {
      delta: 0.20000000298023224
    }
  }
}
model_config {
  num_layers: 18
  training_precision {
    backend_floatx: FLOAT32
  }
  arch: "resnet"
  all_projections: true
  model_input_height: 320
  model_input_width: 320
  model_input_channels: 1
}
training_config {
  batch_size: 3
  regularizer {
    type: L2
    weight: 1.9999999494757503e-05
  }
  optimizer {
    adam {
      epsilon: 9.99999993922529e-09
      beta1: 0.8999999761581421
      beta2: 0.9990000128746033
    }
  }
  checkpoint_interval: 1
  log_summary_steps: 10
  learning_rate: 9.999999747378752e-05
  loss: "cross_entropy"
  epochs: 11
  visualizer {
    save_summary_steps: 1
  }
}
, 'seed': 42, 'benchmark': False, 'temp_dir': '/tmp/tmp_k6l73zd', 'num_classes': 2, 'num_conf_mat_classes': 2, 'start_step': 0, 'checkpoint_interval': 1, 'model_json': None, 'custom_objs': {}, 'load_graph': False, 'remove_head': False, 'buffer_size': None, 'data_options': False, 'weights_monitor': False, 'visualize': False, 'save_summary_steps': 1, 'infrequent_save_summary_steps': None, 'enable_qat': False, 'phase': 'val', 'model_size': 179.40708923339844}
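Incidentally, the AutoGraph warnings near the top of that log are generally benign: the listed entities simply run un-converted. If you want to silence them, the warning text itself names the workaround. Below is a minimal sketch only — rgb_to_bgr_tf here is a hypothetical stand-in for the TAO data-loader method, and on some TF 1.x builds the decorator lives at tf.autograph.experimental.do_not_convert rather than tf.autograph.do_not_convert:

import tensorflow as tf

# Mark a graph-compatible function so AutoGraph skips converting it
# instead of warning that it cannot locate the source code.
@tf.autograph.experimental.do_not_convert
def rgb_to_bgr_tf(image):
    # Hypothetical stand-in for Dataset.rgb_to_bgr_tf: reverse the
    # channel axis to swap RGB <-> BGR.
    return tf.reverse(image, axis=[-1])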

And this is the result after I ran the command you provided to me.

Status: Downloaded newer image for tensorflow/tensorflow:latest-gpu
2023-06-14 08:31:15.734838: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
AttributeError: module 'tensorflow' has no attribute 'enable_eager_execution'
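For what it's worth, that AttributeError is expected on tensorflow/tensorflow:latest-gpu: tf.enable_eager_execution() exists only in TF 1.x, and TF 2.x removed it because eager execution is on by default. A version-agnostic sketch of the same smoke test:

import tensorflow as tf

# TF 1.x needs eager execution enabled explicitly; TF 2.x runs
# eagerly by default and no longer exposes this function.
if hasattr(tf, "enable_eager_execution"):
    tf.enable_eager_execution()

print("TF version:", tf.__version__)
print("GPU available:", tf.test.is_gpu_available())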

May I know the CPU info of your system?

This is the CPU info of my device.

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          16
On-line CPU(s) list:             0-15
Thread(s) per core:              1
Core(s) per socket:              16
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
Stepping:                        4
CPU MHz:                         2294.749
BogoMIPS:                        4589.49
L1d cache:                       512 KiB
L1i cache:                       512 KiB
L2 cache:                        64 MiB
NUMA node0 CPU(s):               0-15
Vulnerability Itlb multihit:     KVM: Vulnerable
Vulnerability L1tf:              Mitigation; PTE Inversion
Vulnerability Mds:               Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, STIBP disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat

Can you try an older version of TF instead of tensorflow/tensorflow:latest-gpu?

Could you provide me with an example of an older version of TF?

You can refer to the tensorflow/tensorflow tags on Docker Hub.

For example,
$ docker pull tensorflow/tensorflow:2.10.0-gpu
$ docker pull tensorflow/tensorflow:1.13.0rc2-gpu-py3
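As a sketch of how you might check device registration inside those containers (assuming the NVIDIA Container Toolkit is installed so --gpus all works; device_lib lists every device TensorFlow registers, including the XLA ones where the build registers them):

$ docker run --rm --gpus all tensorflow/tensorflow:1.13.0rc2-gpu-py3 \
    python -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"
$ docker run --rm --gpus all tensorflow/tensorflow:2.10.0-gpu \
    python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"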

Excuse me @Morganh

Which version of TensorFlow do you recommend testing?

To narrow down, you can try the above two examples to check whether you still meet the error:
device CUDA:0 not supported by XLA service while setting up XLA_GPU_JIT device number 0.
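One way to trigger that device-setup path directly in the TF 1.x container (a sketch under the same --gpus assumption): constructing a session forces TensorFlow to enumerate and register its devices, including XLA_GPU_JIT, so an environment-level problem should reproduce there without TAO in the loop.

$ docker run --rm --gpus all tensorflow/tensorflow:1.13.0rc2-gpu-py3 \
    python -c "import tensorflow as tf; tf.Session().close()"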

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.