Cannot train TAO Toolkit UNet model in versions v4.0.0 and v4.0.1

Excuse me @Bin_Zhao_NV @Morganh

I've switched GPUs from Tesla P100 to Tesla V100 and tried again to train the TAO Toolkit UNet model with 4 GPUs in versions v4.0.0 and v4.0.1.

However, I still got the error message: device CUDA:0 not supported by XLA service while setting up XLA_GPU_JIT device number 0.

This is the nvidia-smi output captured while the UNet training was running.

Is this a bug in TAO Toolkit v4.0.0 and v4.0.1? When I trained UNet in v3.22.05, no errors occurred, as shown in the log below.

INFO:tensorflow:Done calling model_fn.
2023-06-08 10:39:29,646 [INFO] tensorflow: Done calling model_fn.
INFO:tensorflow:Done calling model_fn.
2023-06-08 10:39:29,652 [INFO] tensorflow: Done calling model_fn.
INFO:tensorflow:Done calling model_fn.
2023-06-08 10:39:29,682 [INFO] tensorflow: Done calling model_fn.
INFO:tensorflow:Done calling model_fn.
2023-06-08 10:39:29,749 [INFO] tensorflow: Done calling model_fn.
INFO:tensorflow:Done calling model_fn.
2023-06-08 10:39:30,165 [INFO] tensorflow: Done calling model_fn.
INFO:tensorflow:Graph was finalized.
2023-06-08 10:39:30,315 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Graph was finalized.
2023-06-08 10:39:30,319 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Graph was finalized.
2023-06-08 10:39:30,354 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Graph was finalized.
2023-06-08 10:39:30,431 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Graph was finalized.
2023-06-08 10:39:31,914 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2023-06-08 10:39:31,944 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Running local_init_op.
2023-06-08 10:39:31,963 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Running local_init_op.
2023-06-08 10:39:31,963 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-06-08 10:39:32,053 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-06-08 10:39:32,073 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-06-08 10:39:32,073 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Running local_init_op.
2023-06-08 10:39:32,105 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-06-08 10:39:32,206 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Running local_init_op.
2023-06-08 10:39:33,644 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-06-08 10:39:33,759 [INFO] tensorflow: Done running local_init_op.
[GPU] Restoring pretrained weights from: /tmp/tmpb0kfjiee/model.ckpt
2023-06-08 10:39:34,497 [INFO] iva.unet.hooks.pretrained_restore_hook: Pretrained weights loaded with success...

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:111: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

2023-06-08 10:39:35,492 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:111: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:111: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

2023-06-08 10:39:35,495 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:111: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:111: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

2023-06-08 10:39:35,496 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:111: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:111: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

2023-06-08 10:39:35,498 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:111: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

INFO:tensorflow:Saving checkpoints for step-0.
2023-06-08 10:39:38,987 [INFO] tensorflow: Saving checkpoints for step-0.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:111: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

2023-06-08 10:39:48,316 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:111: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

f883eb5b84f2:166:895 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
f883eb5b84f2:166:895 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
f883eb5b84f2:166:895 [0] NCCL INFO P2P plugin IBext
f883eb5b84f2:166:895 [0] NCCL INFO NET/IB : No device found.
f883eb5b84f2:166:895 [0] NCCL INFO NET/IB : No device found.
f883eb5b84f2:166:895 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0>
f883eb5b84f2:166:895 [0] NCCL INFO Using network Socket
NCCL version 2.11.4+cuda11.6
f883eb5b84f2:176:889 [4] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
f883eb5b84f2:176:889 [4] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
f883eb5b84f2:176:889 [4] NCCL INFO P2P plugin IBext
f883eb5b84f2:176:889 [4] NCCL INFO NET/IB : No device found.
f883eb5b84f2:176:889 [4] NCCL INFO NET/IB : No device found.
f883eb5b84f2:176:889 [4] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0>
f883eb5b84f2:176:889 [4] NCCL INFO Using network Socket
f883eb5b84f2:169:898 [2] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
f883eb5b84f2:169:898 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
f883eb5b84f2:169:898 [2] NCCL INFO P2P plugin IBext
f883eb5b84f2:169:898 [2] NCCL INFO NET/IB : No device found.
f883eb5b84f2:169:898 [2] NCCL INFO NET/IB : No device found.
f883eb5b84f2:169:898 [2] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0>
f883eb5b84f2:169:898 [2] NCCL INFO Using network Socket
f883eb5b84f2:167:886 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
f883eb5b84f2:167:886 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
f883eb5b84f2:167:886 [1] NCCL INFO P2P plugin IBext
f883eb5b84f2:167:886 [1] NCCL INFO NET/IB : No device found.
f883eb5b84f2:167:886 [1] NCCL INFO NET/IB : No device found.
f883eb5b84f2:167:886 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0>
f883eb5b84f2:167:886 [1] NCCL INFO Using network Socket
f883eb5b84f2:173:890 [3] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
f883eb5b84f2:173:890 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
f883eb5b84f2:173:890 [3] NCCL INFO P2P plugin IBext
f883eb5b84f2:173:890 [3] NCCL INFO NET/IB : No device found.
f883eb5b84f2:173:890 [3] NCCL INFO NET/IB : No device found.
f883eb5b84f2:173:890 [3] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0>
f883eb5b84f2:173:890 [3] NCCL INFO Using network Socket
f883eb5b84f2:166:895 [0] NCCL INFO Channel 00/02 :    0   3   2   4   1
f883eb5b84f2:166:895 [0] NCCL INFO Channel 01/02 :    0   3   2   4   1
f883eb5b84f2:166:895 [0] NCCL INFO Trees [0] 3/-1/-1->0->-1 [1] 3/-1/-1->0->-1
f883eb5b84f2:167:886 [1] NCCL INFO Trees [0] 2/-1/-1->1->3 [1] 2/-1/-1->1->3
f883eb5b84f2:169:898 [2] NCCL INFO Trees [0] 4/-1/-1->2->1 [1] 4/-1/-1->2->1
f883eb5b84f2:173:890 [3] NCCL INFO Trees [0] 1/-1/-1->3->0 [1] 1/-1/-1->3->0
f883eb5b84f2:176:889 [4] NCCL INFO Trees [0] -1/-1/-1->4->2 [1] -1/-1/-1->4->2
f883eb5b84f2:166:895 [0] NCCL INFO Channel 00 : 0[100] -> 3[1c0] via P2P/IPC
f883eb5b84f2:169:898 [2] NCCL INFO Channel 00 : 2[1b0] -> 4[20d0] via P2P/IPC
f883eb5b84f2:166:895 [0] NCCL INFO Channel 01 : 0[100] -> 3[1c0] via P2P/IPC
f883eb5b84f2:169:898 [2] NCCL INFO Channel 01 : 2[1b0] -> 4[20d0] via P2P/IPC
f883eb5b84f2:176:889 [4] NCCL INFO Channel 00 : 4[20d0] -> 1[110] via P2P/indirect/2[1b0]
f883eb5b84f2:176:889 [4] NCCL INFO Channel 01 : 4[20d0] -> 1[110] via P2P/indirect/2[1b0]
f883eb5b84f2:173:890 [3] NCCL INFO Channel 00 : 3[1c0] -> 2[1b0] via P2P/IPC
f883eb5b84f2:173:890 [3] NCCL INFO Channel 01 : 3[1c0] -> 2[1b0] via P2P/IPC
f883eb5b84f2:176:889 [4] NCCL INFO Connected all rings
f883eb5b84f2:167:886 [1] NCCL INFO Channel 00 : 1[110] -> 0[100] via P2P/IPC
f883eb5b84f2:176:889 [4] NCCL INFO Channel 00 : 4[20d0] -> 2[1b0] via P2P/IPC
f883eb5b84f2:167:886 [1] NCCL INFO Channel 01 : 1[110] -> 0[100] via P2P/IPC
f883eb5b84f2:176:889 [4] NCCL INFO Channel 01 : 4[20d0] -> 2[1b0] via P2P/IPC
f883eb5b84f2:173:890 [3] NCCL INFO Connected all rings
f883eb5b84f2:169:898 [2] NCCL INFO Connected all rings
f883eb5b84f2:167:886 [1] NCCL INFO Connected all rings
f883eb5b84f2:166:895 [0] NCCL INFO Connected all rings
f883eb5b84f2:167:886 [1] NCCL INFO Channel 00 : 1[110] -> 2[1b0] via P2P/IPC
f883eb5b84f2:167:886 [1] NCCL INFO Channel 01 : 1[110] -> 2[1b0] via P2P/IPC
f883eb5b84f2:173:890 [3] NCCL INFO Channel 00 : 3[1c0] -> 0[100] via P2P/IPC
f883eb5b84f2:167:886 [1] NCCL INFO Channel 00 : 1[110] -> 3[1c0] via P2P/IPC
f883eb5b84f2:173:890 [3] NCCL INFO Channel 01 : 3[1c0] -> 0[100] via P2P/IPC
f883eb5b84f2:167:886 [1] NCCL INFO Channel 01 : 1[110] -> 3[1c0] via P2P/IPC
f883eb5b84f2:176:889 [4] NCCL INFO Connected all trees
f883eb5b84f2:176:889 [4] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
f883eb5b84f2:176:889 [4] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
f883eb5b84f2:169:898 [2] NCCL INFO Channel 00 : 2[1b0] -> 1[110] via P2P/IPC
f883eb5b84f2:176:889 [4] NCCL INFO Channel 00 : 4[20d0] -> 0[100] via P2P/indirect/2[1b0]
f883eb5b84f2:169:898 [2] NCCL INFO Channel 01 : 2[1b0] -> 1[110] via P2P/IPC
f883eb5b84f2:176:889 [4] NCCL INFO Channel 01 : 4[20d0] -> 0[100] via P2P/indirect/2[1b0]
f883eb5b84f2:166:895 [0] NCCL INFO Connected all trees
f883eb5b84f2:166:895 [0] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
f883eb5b84f2:166:895 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
f883eb5b84f2:173:890 [3] NCCL INFO Channel 00 : 3[1c0] -> 1[110] via P2P/IPC
f883eb5b84f2:166:895 [0] NCCL INFO Channel 00 : 0[100] -> 4[20d0] via P2P/indirect/2[1b0]
f883eb5b84f2:173:890 [3] NCCL INFO Channel 01 : 3[1c0] -> 1[110] via P2P/IPC
f883eb5b84f2:173:890 [3] NCCL INFO Connected all trees
f883eb5b84f2:173:890 [3] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
f883eb5b84f2:173:890 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
f883eb5b84f2:166:895 [0] NCCL INFO Channel 01 : 0[100] -> 4[20d0] via P2P/indirect/2[1b0]
f883eb5b84f2:173:890 [3] NCCL INFO Channel 00 : 3[1c0] -> 4[20d0] via P2P/indirect/2[1b0]
f883eb5b84f2:173:890 [3] NCCL INFO Channel 01 : 3[1c0] -> 4[20d0] via P2P/indirect/2[1b0]
f883eb5b84f2:167:886 [1] NCCL INFO Connected all trees
f883eb5b84f2:167:886 [1] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
f883eb5b84f2:167:886 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
f883eb5b84f2:169:898 [2] NCCL INFO Connected all trees
f883eb5b84f2:169:898 [2] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/512
f883eb5b84f2:169:898 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
f883eb5b84f2:167:886 [1] NCCL INFO Channel 00 : 1[110] -> 4[20d0] via P2P/indirect/2[1b0]
f883eb5b84f2:167:886 [1] NCCL INFO Channel 01 : 1[110] -> 4[20d0] via P2P/indirect/2[1b0]
f883eb5b84f2:176:889 [4] NCCL INFO Channel 00 : 4[20d0] -> 3[1c0] via P2P/indirect/2[1b0]
f883eb5b84f2:176:889 [4] NCCL INFO Channel 01 : 4[20d0] -> 3[1c0] via P2P/indirect/2[1b0]
f883eb5b84f2:173:890 [3] NCCL INFO comm 0x7fd5087f9550 rank 3 nranks 5 cudaDev 3 busId 1c0 - Init COMPLETE
f883eb5b84f2:176:889 [4] NCCL INFO comm 0x7f9cf87f9820 rank 4 nranks 5 cudaDev 4 busId 20d0 - Init COMPLETE
f883eb5b84f2:169:898 [2] NCCL INFO comm 0x7fb5ec7fa6d0 rank 2 nranks 5 cudaDev 2 busId 1b0 - Init COMPLETE
f883eb5b84f2:167:886 [1] NCCL INFO comm 0x7f4bdc7f9890 rank 1 nranks 5 cudaDev 1 busId 110 - Init COMPLETE
f883eb5b84f2:166:895 [0] NCCL INFO comm 0x7f6b60811fd0 rank 0 nranks 5 cudaDev 0 busId 100 - Init COMPLETE
f883eb5b84f2:166:895 [0] NCCL INFO Launch mode Parallel
2023-06-08 10:39:56,769 [INFO] root: None
Epoch: 0/11:, Cur-Step: 0, loss(cross_entropy): 0.75569, Running average loss:0.75569, Time taken: 0:00:00 ETA: 0:00:00
2023-06-08 10:39:56,828 [INFO] __main__: Epoch: 0/11:, Cur-Step: 0, loss(cross_entropy): 0.75569, Running average loss:0.75569, Time taken: 0:00:00 ETA: 0:00:00
INFO:tensorflow:Saving checkpoints for step-2.
2023-06-08 10:40:00,244 [INFO] tensorflow: Saving checkpoints for step-2.
INFO:tensorflow:Saving checkpoints for step-4.
2023-06-08 10:40:10,701 [INFO] tensorflow: Saving checkpoints for step-4.
INFO:tensorflow:Saving checkpoints for step-6.
2023-06-08 10:40:19,732 [INFO] tensorflow: Saving checkpoints for step-6.
INFO:tensorflow:Saving checkpoints for step-8.
2023-06-08 10:40:28,992 [INFO] tensorflow: Saving checkpoints for step-8.
INFO:tensorflow:Saving checkpoints for step-10.
2023-06-08 10:40:38,437 [INFO] tensorflow: Saving checkpoints for step-10.
2023-06-08 10:40:47,885 [INFO] root: None
Epoch: 5/11:, Cur-Step: 10, loss(cross_entropy): 0.72443, Running average loss:0.72443, Time taken: 0:00:09.482444 ETA: 0:00:56.894661
2023-06-08 10:40:47,985 [INFO] __main__: Epoch: 5/11:, Cur-Step: 10, loss(cross_entropy): 0.72443, Running average loss:0.72443, Time taken: 0:00:09.482444 ETA: 0:00:56.894661
INFO:tensorflow:Saving checkpoints for step-12.
2023-06-08 10:40:48,291 [INFO] tensorflow: Saving checkpoints for step-12.
INFO:tensorflow:Saving checkpoints for step-14.
2023-06-08 10:40:57,538 [INFO] tensorflow: Saving checkpoints for step-14.
INFO:tensorflow:Saving checkpoints for step-16.
2023-06-08 10:41:06,739 [INFO] tensorflow: Saving checkpoints for step-16.
INFO:tensorflow:Saving checkpoints for step-18.
2023-06-08 10:41:16,086 [INFO] tensorflow: Saving checkpoints for step-18.
INFO:tensorflow:Saving checkpoints for step-20.
2023-06-08 10:41:25,417 [INFO] tensorflow: Saving checkpoints for step-20.
2023-06-08 10:41:34,961 [INFO] root: None
Epoch: 10/11:, Cur-Step: 20, loss(cross_entropy): 0.62239, Running average loss:0.62239, Time taken: 0:00:09.437342 ETA: 0:00:09.437342
2023-06-08 10:41:35,023 [INFO] __main__: Epoch: 10/11:, Cur-Step: 20, loss(cross_entropy): 0.62239, Running average loss:0.62239, Time taken: 0:00:09.437342 ETA: 0:00:09.437342
INFO:tensorflow:Saving checkpoints for step-22.
2023-06-08 10:41:35,358 [INFO] tensorflow: Saving checkpoints for step-22.
INFO:tensorflow:Loss for final step: 0.6164588.
2023-06-08 10:41:35,453 [INFO] tensorflow: Loss for final step: 0.6164588.
INFO:tensorflow:Loss for final step: 0.6013098.
2023-06-08 10:41:35,461 [INFO] tensorflow: Loss for final step: 0.6013098.
INFO:tensorflow:Loss for final step: 0.62208736.
2023-06-08 10:41:35,461 [INFO] tensorflow: Loss for final step: 0.62208736.
INFO:tensorflow:Loss for final step: 0.6182792.
2023-06-08 10:41:35,471 [INFO] tensorflow: Loss for final step: 0.6182792.
2023-06-08 10:41:35,476 [INFO] __main__: Saving the final step model to /workspace/tao-experiments/isbi_experiment_unpruned/weights/model_isbi.tlt
2023-06-08 10:41:35,477 [INFO] __main__: Saving the final step model to /workspace/tao-experiments/isbi_experiment_unpruned/weights/model_isbi.tlt
2023-06-08 10:41:35,477 [INFO] __main__: Saving the final step model to /workspace/tao-experiments/isbi_experiment_unpruned/weights/model_isbi.tlt
2023-06-08 10:41:35,517 [INFO] __main__: Saving the final step model to /workspace/tao-experiments/isbi_experiment_unpruned/weights/model_isbi.tlt
Throughput Avg: 67.075 img/s
Latency Avg: 392.697 ms
Latency 90%: 627.808 ms
Latency 95%: 672.829 ms
Latency 99%: 760.871 ms
DLL 2023-06-08 10:41:49.240021 - () throughput_train:67.0745170186196  latency_train:392.69723211015975 elapsed_time:142.369777
INFO:tensorflow:Loss for final step: 0.6112231.
2023-06-08 10:41:49,324 [INFO] tensorflow: Loss for final step: 0.6112231.
Saving the final step model to /workspace/tao-experiments/isbi_experiment_unpruned/weights/model_isbi.tlt
2023-06-08 10:41:49,780 [INFO] __main__: Saving the final step model to /workspace/tao-experiments/isbi_experiment_unpruned/weights/model_isbi.tlt
2023-06-08 10:42:03,500 [INFO] root: Experiment complete.
2023-06-08 10:42:50,187 [INFO] root: Experiment complete.
2023-06-08 10:42:55,107 [INFO] root: Experiment complete.
2023-06-08 10:42:55,107 [INFO] root: Experiment complete.
2023-06-08 10:42:55,110 [INFO] root: Experiment complete.

Could you please update the NVIDIA driver to 525?
Uninstall:
sudo apt purge nvidia-driver-515
sudo apt autoremove
sudo apt autoclean

Install:
sudo apt install nvidia-driver-525
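
After reinstalling and rebooting, you can confirm the new driver is active, for example with nvidia-smi's query mode (the plain nvidia-smi header also shows the driver version):

nvidia-smi --query-gpu=driver_version --format=csv,noheader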

Excuse me @Morganh

Is the NCCL version the reason, as that post stated? I ask because I did not get this error message in TAO Toolkit v3.22.05.

The log from training the TAO Toolkit UNet in v3.22.05 is the same as the one I posted above.

In the 4.0.1 docker, could you add the line below into the training_config and then retry?
use_xla: false
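
For reference, a minimal sketch of where the flag would sit in the spec file (all other fields in your existing training_config stay unchanged; the surrounding comment is a placeholder):

training_config {
  # ... your existing training fields ...
  use_xla: false
}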

Also, please run the commands below to check whether it works:

docker run --runtime=nvidia -it --rm tensorflow/tensorflow:latest-gpu
python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"
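
Note: if tensorflow/tensorflow:latest-gpu resolves to a TF 2.x image, tf.enable_eager_execution() and tf.random_normal no longer exist under those names (eager execution is the default there), so an equivalent check would be something like:

python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"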

Do you mean adding use_xla: false to the training_config section in the file named unet_train_resnet_unet_isbi.txt?

Yes.
Also, another experiment to try is:

docker run --runtime=nvidia -it --rm tensorflow/tensorflow:latest-gpu
python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"

After adding use_xla: false to the training_config, I still got the error message device CUDA:0 not supported by XLA service while setting up XLA_GPU_JIT device number 0:

INFO:tensorflow:Done calling model_fn.
2023-06-14 08:21:24,413 [INFO] tensorflow: Done calling model_fn.
INFO:tensorflow:Done calling model_fn.
2023-06-14 08:21:24,539 [INFO] tensorflow: Done calling model_fn.
INFO:tensorflow:Graph was finalized.
2023-06-14 08:21:24,547 [INFO] tensorflow: Graph was finalized.
2023-06-14 08:21:24,548 [INFO] root: device CUDA:0 not supported by XLA service
        while setting up XLA_GPU_JIT device number 0
INFO:tensorflow:Done calling model_fn.
2023-06-14 08:21:24,561 [INFO] tensorflow: Done calling model_fn.
INFO:tensorflow:Graph was finalized.
2023-06-14 08:21:24,675 [INFO] tensorflow: Graph was finalized.
2023-06-14 08:21:24,676 [INFO] root: device CUDA:0 not supported by XLA service
        while setting up XLA_GPU_JIT device number 0
INFO:tensorflow:Graph was finalized.
2023-06-14 08:21:24,703 [INFO] tensorflow: Graph was finalized.
2023-06-14 08:21:24,704 [INFO] root: device CUDA:0 not supported by XLA service
        while setting up XLA_GPU_JIT device number 0
521fd6662d1c:139:341 [0] NCCL INFO comm 0x7feee0410b00 rank 0 nranks 4 cudaDev 0 busId 60 - Destroy COMPLETE
Traceback (most recent call last):
  File "</usr/local/lib/python3.6/dist-packages/iva/unet/scripts/train.py>", line 3, in <module>
  File "<frozen iva.unet.scripts.train>", line 579, in <module>
  File "<frozen iva.unet.scripts.train>", line 571, in main
  File "<frozen iva.unet.scripts.train>", line 558, in main
  File "<frozen iva.unet.scripts.train>", line 425, in run_experiment
  File "<frozen iva.unet.scripts.evaluate>", line 323, in evaluate_unet
  File "<frozen iva.unet.scripts.evaluate>", line 228, in run_evaluate_tlt
  File "<frozen iva.unet.scripts.evaluate>", line 138, in print_compute_metrics
  File "<frozen iva.unet.scripts.evaluate>", line 81, in compute_metrics_masks
  File "/usr/local/lib/python3.6/dist-packages/tqdm/_tqdm.py", line 955, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 638, in predict
    hooks=all_hooks) as mon_sess:
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1014, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 725, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1207, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1212, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 878, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 647, in create_session
    init_fn=self._scaffold.init_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/session_manager.py", line 290, in prepare_session
    config=config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/session_manager.py", line 194, in _restore_checkpoint
    sess = session.Session(self._target, graph=self._graph, config=config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1585, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 699, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InvalidArgumentError: device CUDA:0 not supported by XLA service
        while setting up XLA_GPU_JIT device number 0
[The identical traceback was printed twice more, once by each of the other two worker processes; omitted here.]
model.ckpt-22.meta
INFO:tensorflow:Using config: {'_model_dir': '/workspace/tao-experiments/isbi_experiment_unpruned/weights', '_tf_random_seed': None, '_save_summary_steps': 1, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': gpu_options {
}
allow_soft_placement: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7feff80bed68>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
2023-06-14 08:21:25,639 [INFO] tensorflow: Using config: {'_model_dir': '/workspace/tao-experiments/isbi_experiment_unpruned/weights', '_tf_random_seed': None, '_save_summary_steps': 1, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': gpu_options {
}
allow_soft_placement: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7feff80bed68>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
2023-06-14 08:21:25,640 [INFO] iva.unet.scripts.evaluate: Starting Evaluation.
0it [00:00, ?it/s]WARNING:tensorflow:Entity <bound method Dataset.read_image_and_label_tensors of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.read_image_and_label_tensors of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-06-14 08:21:25,655 [WARNING] tensorflow: Entity <bound method Dataset.read_image_and_label_tensors of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.read_image_and_label_tensors of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef5c05a8c8> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef5c05a8c8>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-06-14 08:21:25,670 [WARNING] tensorflow: Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef5c05a8c8> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef5c05a8c8>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method Dataset.rgb_to_bgr_tf of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.rgb_to_bgr_tf of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-06-14 08:21:25,679 [WARNING] tensorflow: Entity <bound method Dataset.rgb_to_bgr_tf of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.rgb_to_bgr_tf of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method Dataset.cast_img_lbl_dtype_tf of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.cast_img_lbl_dtype_tf of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-06-14 08:21:25,688 [WARNING] tensorflow: Entity <bound method Dataset.cast_img_lbl_dtype_tf of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.cast_img_lbl_dtype_tf of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method Dataset.resize_image_and_label_tf of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.resize_image_and_label_tf of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-06-14 08:21:25,696 [WARNING] tensorflow: Entity <bound method Dataset.resize_image_and_label_tf of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.resize_image_and_label_tf of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef59aff7b8> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef59aff7b8>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-06-14 08:21:25,712 [WARNING] tensorflow: Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef59aff7b8> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef59aff7b8>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef59affa60> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef59affa60>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-06-14 08:21:25,720 [WARNING] tensorflow: Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef59affa60> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef59affa60>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <bound method Dataset.transpose_to_nchw of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.transpose_to_nchw of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-06-14 08:21:25,728 [WARNING] tensorflow: Entity <bound method Dataset.transpose_to_nchw of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset.transpose_to_nchw of <iva.unet.utils.data_loader.Dataset object at 0x7feff80330f0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef59affbf8> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef59affbf8>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-06-14 08:21:25,738 [WARNING] tensorflow: Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef59affbf8> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef59affbf8>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef59b3c598> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef59b3c598>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-06-14 08:21:25,755 [WARNING] tensorflow: Entity <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef59b3c598> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <function Dataset.input_fn_aigs_tf.<locals>.<lambda> at 0x7fef59b3c598>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
INFO:tensorflow:Calling model_fn.
2023-06-14 08:21:25,765 [INFO] tensorflow: Calling model_fn.
2023-06-14 08:21:25,765 [INFO] iva.unet.utils.model_fn: {'exec_mode': 'train', 'model_dir': '/workspace/tao-experiments/isbi_experiment_unpruned/weights', 'resize_padding': False, 'resize_method': 'BILINEAR', 'log_dir': None, 'batch_size': 3, 'learning_rate': 9.999999747378752e-05, 'activation': 'softmax', 'crossvalidation_idx': None, 'max_steps': None, 'regularizer_type': 2, 'weight_decay': 1.9999999494757503e-05, 'log_summary_steps': 10, 'warmup_steps': 0, 'augment': False, 'use_amp': False, 'filter_data': False, 'use_trt': False, 'use_xla': False, 'loss': 'cross_entropy', 'epochs': 11, 'pretrained_weights_file': None, 'lr_scheduler': None, 'unet_model': <iva.unet.model.resnet_unet.ResnetUnet object at 0x7fef59af3160>, 'key': 'nvidia_tlt', 'experiment_spec': random_seed: 42
dataset_config {
  dataset: "custom"
  input_image_type: "grayscale"
  train_images_path: "/workspace/tao-experiments/data/images/train"
  train_masks_path: "/workspace/tao-experiments/data/masks/train"
  val_images_path: "/workspace/tao-experiments/data/images/val"
  val_masks_path: "/workspace/tao-experiments/data/masks/val"
  test_images_path: "/workspace/tao-experiments/data/images/test"
  data_class_config {
    target_classes {
      name: "foreground"
      mapping_class: "foreground"
    }
    target_classes {
      name: "background"
      label_id: 1
      mapping_class: "background"
    }
  }
  augmentation_config {
    spatial_augmentation {
      hflip_probability: 0.5
      vflip_probability: 0.5
      crop_and_resize_prob: 0.5
    }
    brightness_augmentation {
      delta: 0.20000000298023224
    }
  }
}
model_config {
  num_layers: 18
  training_precision {
    backend_floatx: FLOAT32
  }
  arch: "resnet"
  all_projections: true
  model_input_height: 320
  model_input_width: 320
  model_input_channels: 1
}
training_config {
  batch_size: 3
  regularizer {
    type: L2
    weight: 1.9999999494757503e-05
  }
  optimizer {
    adam {
      epsilon: 9.99999993922529e-09
      beta1: 0.8999999761581421
      beta2: 0.9990000128746033
    }
  }
  checkpoint_interval: 1
  log_summary_steps: 10
  learning_rate: 9.999999747378752e-05
  loss: "cross_entropy"
  epochs: 11
  visualizer {
    save_summary_steps: 1
  }
}
, 'seed': 42, 'benchmark': False, 'temp_dir': '/tmp/tmp_k6l73zd', 'num_classes': 2, 'num_conf_mat_classes': 2, 'start_step': 0, 'checkpoint_interval': 1, 'model_json': None, 'custom_objs': {}, 'load_graph': False, 'remove_head': False, 'buffer_size': None, 'data_options': False, 'weights_monitor': False, 'visualize': False, 'save_summary_steps': 1, 'infrequent_save_summary_steps': None, 'enable_qat': False, 'phase': 'val', 'model_size': 179.40708923339844}
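Incidentally, the AutoGraph warnings near the top of that log are generally benign: the listed entities simply run un-converted. If you want to silence them, the warning text itself names the workaround. Below is a minimal sketch only — rgb_to_bgr_tf here is a hypothetical stand-in for the TAO data-loader method, and on some TF 1.x builds the decorator lives at tf.autograph.experimental.do_not_convert rather than tf.autograph.do_not_convert:

import tensorflow as tf

# Mark a graph-compatible function so AutoGraph skips converting it
# instead of warning that it cannot locate the source code.
@tf.autograph.experimental.do_not_convert
def rgb_to_bgr_tf(image):
    # Hypothetical stand-in for Dataset.rgb_to_bgr_tf: reverse the
    # channel axis to swap RGB <-> BGR.
    return tf.reverse(image, axis=[-1])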

And this is the result after I ran the command you provided to me.

Status: Downloaded newer image for tensorflow/tensorflow:latest-gpu
2023-06-14 08:31:15.734838: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
AttributeError: module 'tensorflow' has no attribute 'enable_eager_execution'
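For what it's worth, that AttributeError is expected on tensorflow/tensorflow:latest-gpu: tf.enable_eager_execution() exists only in TF 1.x, and TF 2.x removed it because eager execution is on by default. A version-agnostic sketch of the same smoke test:

import tensorflow as tf

# TF 1.x needs eager execution enabled explicitly; TF 2.x runs
# eagerly by default and no longer exposes this function.
if hasattr(tf, "enable_eager_execution"):
    tf.enable_eager_execution()

print("TF version:", tf.__version__)
print("GPU available:", tf.test.is_gpu_available())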

May I know the CPU info of your system?

This is the CPU info of my device.

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          16
On-line CPU(s) list:             0-15
Thread(s) per core:              1
Core(s) per socket:              16
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
Stepping:                        4
CPU MHz:                         2294.749
BogoMIPS:                        4589.49
L1d cache:                       512 KiB
L1i cache:                       512 KiB
L2 cache:                        64 MiB
NUMA node0 CPU(s):               0-15
Vulnerability Itlb multihit:     KVM: Vulnerable
Vulnerability L1tf:              Mitigation; PTE Inversion
Vulnerability Mds:               Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, STIBP disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat

Can you try an older version of TF instead of tensorflow/tensorflow:latest-gpu?

Could you provide me with an example of an older version of TF?

You can refer to the tensorflow/tensorflow tags on Docker Hub.

For example,
$ docker pull tensorflow/tensorflow:2.10.0-gpu
$ docker pull tensorflow/tensorflow:1.13.0rc2-gpu-py3
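As a sketch of how you might check device registration inside those containers (assuming the NVIDIA Container Toolkit is installed so --gpus all works; device_lib lists every device TensorFlow registers, including the XLA ones where the build registers them):

$ docker run --rm --gpus all tensorflow/tensorflow:1.13.0rc2-gpu-py3 \
    python -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"
$ docker run --rm --gpus all tensorflow/tensorflow:2.10.0-gpu \
    python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"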

Excuse me @Morganh

Which version of TensorFlow do you recommend testing?

To narrow down, you can try the above two examples to check whether you still meet the error:
device CUDA:0 not supported by XLA service while setting up XLA_GPU_JIT device number 0.
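One way to trigger that device-setup path directly in the TF 1.x container (a sketch under the same --gpus assumption): constructing a session forces TensorFlow to enumerate and register its devices, including XLA_GPU_JIT, so an environment-level problem should reproduce there without TAO in the loop.

$ docker run --rm --gpus all tensorflow/tensorflow:1.13.0rc2-gpu-py3 \
    python -c "import tensorflow as tf; tf.Session().close()"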

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.