Please provide the following information when requesting support.
• Hardware (T4/V100/Xavier/Nano/etc) - NVIDIA GeForce RTX 3050 Laptop GPU
nvidia-smi
Mon Feb 14 13:16:01 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 On | N/A |
| N/A 41C P8 6W / N/A | 773MiB / 3910MiB | 20% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1036 G /usr/lib/xorg/Xorg 70MiB |
| 0 N/A N/A 2737 G /usr/lib/xorg/Xorg 469MiB |
| 0 N/A N/A 2913 G /usr/bin/gnome-shell 137MiB |
| 0 N/A N/A 125059 G ...AAAAAAAAA= --shared-files 70MiB |
| 0 N/A N/A 125135 G ...AAAAAAAAA= --shared-files 14MiB |
+-----------------------------------------------------------------------------+
• Network Type - DetectNet_v2
• TLT Version - v3.21.11-tf1.15.5-py3
• Training spec file - PeoplenetTrainingConfig.txt (3.0 KB)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)
My original images have dimensions 4000 x 3000. I resized them to 1280 x 960 to make them smaller and a multiple of 16, using the following code:
from PIL import Image
import pandas as pd
import os

inputImageDir = "images"
inputLabelDir = "heridal_kitti_labels/label_2"
outputImageDir = "resized/image_2"
outputLabelDir = "resized/label_2"

# The same scale factor applies to both axes: 4000/1280 == 3000/960 == 3.125
scale = 4000 / 1280

for image in os.listdir(inputImageDir):
    # Resize the image and save it to the output directory
    im = Image.open(os.path.join(inputImageDir, image))
    resized = im.resize((1280, 960))
    resized.save(os.path.join(outputImageDir, image))

    # Rescale the KITTI bbox columns (4-7: xmin, ymin, xmax, ymax)
    labelfile = os.path.join(inputLabelDir, image.replace(".JPG", ".txt"))
    df = pd.read_csv(labelfile, sep=" ", header=None)
    for col in (4, 5, 6, 7):
        df[col] = df[col] / scale
    outputlabelfile = os.path.join(outputLabelDir, image.replace(".JPG", ".txt"))
    df.to_csv(outputlabelfile, sep=" ", header=False, index=False)
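Since 4000/1280 and 3000/960 both equal 3.125, dividing all four bbox columns by the same factor is safe here. A minimal sanity check of that assumption (the sample box values below are hypothetical, assuming KITTI columns 4-7 are xmin, ymin, xmax, ymax):

```python
# Confirm x and y scale factors match, so one divisor covers all four
# KITTI bbox columns (xmin, ymin, xmax, ymax).
x_scale = 4000 / 1280  # original width / resized width
y_scale = 3000 / 960   # original height / resized height
assert x_scale == y_scale == 3.125

# Scale a sample box from 4000x3000 down to 1280x960 coordinates.
box = [400.0, 300.0, 800.0, 600.0]  # hypothetical [xmin, ymin, xmax, ymax]
scaled = [v / x_scale for v in box]
print(scaled)  # [128.0, 96.0, 256.0, 192.0]
```

If the resize did not preserve the aspect ratio, the x and y columns would need separate divisors instead.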
The relevant portion of the log is as follows:
...
INFO:tensorflow:Graph was finalized.
2022-02-14 07:14:15,159 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmp1vyo2zqc/model.ckpt-0
2022-02-14 07:14:15,474 [INFO] tensorflow: Restoring parameters from /tmp/tmp1vyo2zqc/model.ckpt-0
INFO:tensorflow:Running local_init_op.
2022-02-14 07:14:16,954 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2022-02-14 07:14:17,428 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2022-02-14 07:14:22,383 [INFO] tensorflow: Saving checkpoints for step-0.
INFO:tensorflow:epoch = 0.0, learning_rate = 4.9999994e-06, loss = 0.0456286, step = 0
2022-02-14 07:17:39,780 [INFO] tensorflow: epoch = 0.0, learning_rate = 4.9999994e-06, loss = 0.0456286, step = 0
2022-02-14 07:17:39,789 [INFO] iva.detectnet_v2.tfhooks.task_progress_monitor_hook: Epoch 0/120: loss: 0.04563 learning rate: 0.00000 Time taken: 0:00:00 ETA: 0:00:00
2022-02-14 07:17:39,789 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 0.006
2022-02-14 07:17:44,658 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 0.141
INFO:tensorflow:epoch = 0.030660377358490566, learning_rate = 5.0591784e-06, loss = 0.042033017, step = 26 (5.140 sec)
2022-02-14 07:17:44,919 [INFO] tensorflow: epoch = 0.030660377358490566, learning_rate = 5.0591784e-06, loss = 0.042033017, step = 26 (5.140 sec)
2022-02-14 07:17:47,717 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.174
INFO:tensorflow:epoch = 0.08136792452830188, learning_rate = 5.1585935e-06, loss = 0.037755508, step = 69 (5.240 sec)
2022-02-14 07:17:50,159 [INFO] tensorflow: epoch = 0.08136792452830188, learning_rate = 5.1585935e-06, loss = 0.037755508, step = 69 (5.240 sec)
2022-02-14 07:17:50,788 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.141
INFO:tensorflow:global_step/sec: 6.86711
2022-02-14 07:17:52,020 [INFO] tensorflow: global_step/sec: 6.86711
2022-02-14 07:17:53,886 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.069
INFO:tensorflow:epoch = 0.13089622641509432, learning_rate = 5.257579e-06, loss = 0.033465218, step = 111 (5.186 sec)
2022-02-14 07:17:55,344 [INFO] tensorflow: epoch = 0.13089622641509432, learning_rate = 5.257579e-06, loss = 0.033465218, step = 111 (5.186 sec)
2022-02-14 07:17:56,931 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.212
2022-02-14 07:17:59,973 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.218
INFO:tensorflow:epoch = 0.1804245283018868, learning_rate = 5.358469e-06, loss = 0.03268508, step = 153 (5.123 sec)
2022-02-14 07:18:00,468 [INFO] tensorflow: epoch = 0.1804245283018868, learning_rate = 5.358469e-06, loss = 0.03268508, step = 153 (5.123 sec)
INFO:tensorflow:global_step/sec: 8.16827
2022-02-14 07:18:02,303 [INFO] tensorflow: global_step/sec: 8.16827
2022-02-14 07:18:03,041 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.149
INFO:tensorflow:epoch = 0.22995283018867924, learning_rate = 5.4612906e-06, loss = 0.029245147, step = 195 (5.133 sec)
2022-02-14 07:18:05,601 [INFO] tensorflow: epoch = 0.22995283018867924, learning_rate = 5.4612906e-06, loss = 0.029245147, step = 195 (5.133 sec)
2022-02-14 07:18:06,094 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.190
2022-02-14 07:18:09,139 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.210
INFO:tensorflow:epoch = 0.2794811320754717, learning_rate = 5.5660894e-06, loss = 0.027445426, step = 237 (5.127 sec)
2022-02-14 07:18:10,728 [INFO] tensorflow: epoch = 0.2794811320754717, learning_rate = 5.5660894e-06, loss = 0.027445426, step = 237 (5.127 sec)
2022-02-14 07:18:12,199 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.171
INFO:tensorflow:global_step/sec: 8.17859
2022-02-14 07:18:12,573 [INFO] tensorflow: global_step/sec: 8.17859
2022-02-14 07:18:15,292 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.084
INFO:tensorflow:epoch = 0.3290094339622641, learning_rate = 5.6728945e-06, loss = 0.02999771, step = 279 (5.250 sec)
2022-02-14 07:18:15,978 [INFO] tensorflow: epoch = 0.3290094339622641, learning_rate = 5.6728945e-06, loss = 0.02999771, step = 279 (5.250 sec)
2022-02-14 07:18:18,590 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 7.580
INFO:tensorflow:epoch = 0.3773584905660377, learning_rate = 5.779136e-06, loss = 0.024050923, step = 320 (5.184 sec)
2022-02-14 07:18:21,162 [INFO] tensorflow: epoch = 0.3773584905660377, learning_rate = 5.779136e-06, loss = 0.024050923, step = 320 (5.184 sec)
2022-02-14 07:18:21,660 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.145
INFO:tensorflow:global_step/sec: 7.95422
2022-02-14 07:18:23,134 [INFO] tensorflow: global_step/sec: 7.95422
2022-02-14 07:18:24,730 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.143
INFO:tensorflow:epoch = 0.4268867924528302, learning_rate = 5.890034e-06, loss = 0.022420418, step = 362 (5.169 sec)
2022-02-14 07:18:26,331 [INFO] tensorflow: epoch = 0.4268867924528302, learning_rate = 5.890034e-06, loss = 0.022420418, step = 362 (5.169 sec)
2022-02-14 07:18:27,805 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.132
2022-02-14 07:18:30,920 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.028
INFO:tensorflow:epoch = 0.47641509433962265, learning_rate = 6.003055e-06, loss = 0.021905642, step = 404 (5.278 sec)
2022-02-14 07:18:31,609 [INFO] tensorflow: epoch = 0.47641509433962265, learning_rate = 6.003055e-06, loss = 0.021905642, step = 404 (5.278 sec)
INFO:tensorflow:global_step/sec: 8.04348
2022-02-14 07:18:33,577 [INFO] tensorflow: global_step/sec: 8.04348
2022-02-14 07:18:34,076 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 7.920
INFO:tensorflow:epoch = 0.5259433962264151, learning_rate = 6.118251e-06, loss = 0.020372454, step = 446 (5.176 sec)
2022-02-14 07:18:36,785 [INFO] tensorflow: epoch = 0.5259433962264151, learning_rate = 6.118251e-06, loss = 0.020372454, step = 446 (5.176 sec)
2022-02-14 07:18:37,164 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.098
2022-02-14 07:18:40,237 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.134
INFO:tensorflow:epoch = 0.5754716981132075, learning_rate = 6.235651e-06, loss = 0.019447327, step = 488 (5.180 sec)
2022-02-14 07:18:41,965 [INFO] tensorflow: epoch = 0.5754716981132075, learning_rate = 6.235651e-06, loss = 0.019447327, step = 488 (5.180 sec)
2022-02-14 07:18:43,326 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.096
INFO:tensorflow:global_step/sec: 8.09803
2022-02-14 07:18:43,950 [INFO] tensorflow: global_step/sec: 8.09803
2022-02-14 07:18:46,412 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.100
INFO:tensorflow:epoch = 0.625, learning_rate = 6.35531e-06, loss = 0.0180731, step = 530 (5.192 sec)
2022-02-14 07:18:47,157 [INFO] tensorflow: epoch = 0.625, learning_rate = 6.35531e-06, loss = 0.0180731, step = 530 (5.192 sec)
2022-02-14 07:18:49,498 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.102
INFO:tensorflow:epoch = 0.6745283018867925, learning_rate = 6.477259e-06, loss = 0.026863359, step = 572 (5.173 sec)
2022-02-14 07:18:52,330 [INFO] tensorflow: epoch = 0.6745283018867925, learning_rate = 6.477259e-06, loss = 0.026863359, step = 572 (5.173 sec)
2022-02-14 07:18:52,585 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.100
INFO:tensorflow:global_step/sec: 8.10352
2022-02-14 07:18:54,316 [INFO] tensorflow: global_step/sec: 8.10352
2022-02-14 07:18:55,680 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.078
INFO:tensorflow:epoch = 0.7240566037735848, learning_rate = 6.6015537e-06, loss = 0.017696362, step = 614 (5.199 sec)
2022-02-14 07:18:57,529 [INFO] tensorflow: epoch = 0.7240566037735848, learning_rate = 6.6015537e-06, loss = 0.017696362, step = 614 (5.199 sec)
2022-02-14 07:18:58,769 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.093
2022-02-14 07:19:01,851 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.112
INFO:tensorflow:epoch = 0.7735849056603773, learning_rate = 6.728228e-06, loss = 0.016738735, step = 656 (5.192 sec)
2022-02-14 07:19:02,722 [INFO] tensorflow: epoch = 0.7735849056603773, learning_rate = 6.728228e-06, loss = 0.016738735, step = 656 (5.192 sec)
INFO:tensorflow:global_step/sec: 8.09019
2022-02-14 07:19:04,699 [INFO] tensorflow: global_step/sec: 8.09019
2022-02-14 07:19:04,953 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.060
INFO:tensorflow:epoch = 0.8231132075471698, learning_rate = 6.8573395e-06, loss = 0.016266808, step = 698 (5.190 sec)
2022-02-14 07:19:07,912 [INFO] tensorflow: epoch = 0.8231132075471698, learning_rate = 6.8573395e-06, loss = 0.016266808, step = 698 (5.190 sec)
2022-02-14 07:19:08,044 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.088
2022-02-14 07:19:11,125 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.115
INFO:tensorflow:epoch = 0.8726415094339622, learning_rate = 6.988921e-06, loss = 0.01601935, step = 740 (5.189 sec)
2022-02-14 07:19:13,101 [INFO] tensorflow: epoch = 0.8726415094339622, learning_rate = 6.988921e-06, loss = 0.01601935, step = 740 (5.189 sec)
2022-02-14 07:19:14,215 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.092
INFO:tensorflow:global_step/sec: 8.0862
2022-02-14 07:19:15,087 [INFO] tensorflow: global_step/sec: 8.0862
2022-02-14 07:19:17,310 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.078
INFO:tensorflow:epoch = 0.9221698113207547, learning_rate = 7.123035e-06, loss = 0.016791396, step = 782 (5.205 sec)
2022-02-14 07:19:18,306 [INFO] tensorflow: epoch = 0.9221698113207547, learning_rate = 7.123035e-06, loss = 0.016791396, step = 782 (5.205 sec)
2022-02-14 07:19:20,411 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.063
INFO:tensorflow:epoch = 0.9716981132075472, learning_rate = 7.2597154e-06, loss = 0.0155552225, step = 824 (5.186 sec)
2022-02-14 07:19:23,492 [INFO] tensorflow: epoch = 0.9716981132075472, learning_rate = 7.2597154e-06, loss = 0.0155552225, step = 824 (5.186 sec)
2022-02-14 07:19:23,492 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.114
INFO:tensorflow:global_step/sec: 8.08974
2022-02-14 07:19:25,470 [INFO] tensorflow: global_step/sec: 8.08974
ec1fb179e025:37:55 [0] enqueue.cc:74 NCCL WARN Cuda failure 'out of memory'
ec1fb179e025:37:55 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
ec1fb179e025:37:55 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
ec1fb179e025:37:55 [0] NCCL INFO NET/IB : No device found.
ec1fb179e025:37:55 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.3<0>
ec1fb179e025:37:55 [0] NCCL INFO Using network Socket
NCCL version 2.9.9+cuda11.3
ec1fb179e025:37:55 [0] init.cc:891 NCCL WARN Cuda failure 'out of memory'
ec1fb179e025:37:55 [0] NCCL INFO init.cc:916 -> 1
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: ncclCommInitRank failed: unhandled cuda error
[[{{node HorovodAllreduce_cost_sums_person_cov_0}}]]
[[Assign_1/_7677]]
(1) Unknown: ncclCommInitRank failed: unhandled cuda error
[[{{node HorovodAllreduce_cost_sums_person_cov_0}}]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 849, in <module>
File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 838, in <module>
File "<decorator-gen-2>", line 2, in main
File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py", line 46, in wrapped_fn
File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 827, in main
File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 708, in run_experiment
File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 644, in train_gridbox
File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 155, in run_training_loop
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
raise six.reraise(*original_exc_info)
File "/usr/local/lib/python3.6/dist-packages/six.py", line 696, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1426, in run
run_metadata=run_metadata))
File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/cost_function/cost_auto_weight_hook.py", line 206, in after_run
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: ncclCommInitRank failed: unhandled cuda error
[[node HorovodAllreduce_cost_sums_person_cov_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[Assign_1/_7677]]
(1) Unknown: ncclCommInitRank failed: unhandled cuda error
[[node HorovodAllreduce_cost_sums_person_cov_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.
Original stack trace for 'HorovodAllreduce_cost_sums_person_cov_0':
File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 838, in <module>
File "<decorator-gen-2>", line 2, in main
File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py", line 46, in wrapped_fn
File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 827, in main
File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 708, in run_experiment
File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 589, in train_gridbox
File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/cost_function/cost_auto_weight_hook.py", line 40, in build_cost_auto_weight_hook
File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/cost_function/cost_auto_weight_hook.py", line 78, in __init__
File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/cost_function/cost_auto_weight_hook.py", line 143, in _init_objective_weights
File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/distribution/distribution.py", line 328, in allreduce
File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 80, in allreduce
summed_tensor_compressed = _allreduce(tensor_compressed)
File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_ops.py", line 86, in _allreduce
return MPI_LIB.horovod_allreduce(tensor, name=name)
File "<string>", line 80, in horovod_allreduce
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()
2022-02-14 12:49:29,048 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
I have already reduced batch_size to its minimum value of 1 to use less memory. How else can I optimize memory usage to fix this problem?