Out-of-Memory Error While Training PeopleNet Model

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc) - NVIDIA GeForce RTX 3050 Laptop GPU

nvidia-smi
Mon Feb 14 13:16:01 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
| N/A   41C    P8     6W /  N/A |    773MiB /  3910MiB |     20%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1036      G   /usr/lib/xorg/Xorg                 70MiB |
|    0   N/A  N/A      2737      G   /usr/lib/xorg/Xorg                469MiB |
|    0   N/A  N/A      2913      G   /usr/bin/gnome-shell              137MiB |
|    0   N/A  N/A    125059      G   ...AAAAAAAAA= --shared-files       70MiB |
|    0   N/A  N/A    125135      G   ...AAAAAAAAA= --shared-files       14MiB |
+-----------------------------------------------------------------------------+

• Network Type - Detectnet_v2
• TLT Version: v3.21.11-tf1.15.5-py3
• Training spec file: PeoplenetTrainingConfig.txt (3.0 KB)

• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

My original images have dimensions 4000 x 3000. I resized them to 1280 x 960 to make them smaller and a multiple of 16, using the following code:

from PIL import Image
import pandas as pd
import os

inputImageDir = "images"
inputLabelDir = "heridal_kitti_labels/label_2"

outputImageDir = "resized/image_2"
outputLabelDir = "resized/label_2"

# 4000 x 3000 -> 1280 x 960: both dimensions shrink by the same factor (3.125).
xScale = 4000 / 1280
yScale = 3000 / 960

for image in os.listdir(inputImageDir):
    im = Image.open(os.path.join(inputImageDir, image))
    resized = im.resize((1280, 960))
    resized.save(os.path.join(outputImageDir, image))

    # KITTI labels: columns 4-7 are xmin, ymin, xmax, ymax.
    labelfile = os.path.join(inputLabelDir, image.replace(".JPG", ".txt"))
    df = pd.read_csv(labelfile, sep=" ", header=None)
    df[4] = df[4] / xScale   # xmin
    df[6] = df[6] / xScale   # xmax
    df[5] = df[5] / yScale   # ymin
    df[7] = df[7] / yScale   # ymax

    outputlabelfile = os.path.join(outputLabelDir, image.replace(".JPG", ".txt"))
    df.to_csv(outputlabelfile, sep=" ", header=False, index=False)
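
To verify that the rescaled labels still line up with the resized images, a quick visual check is to draw the boxes from one resized label file onto the corresponding resized image. This is only a minimal sketch: it reuses the output paths above, assumes the standard KITTI column layout (columns 4-7 are xmin, ymin, xmax, ymax), and the output name bbox_check.jpg is chosen just for illustration.

from PIL import Image, ImageDraw
import pandas as pd
import os

# Pick one resized image and its matching label file.
sample = sorted(os.listdir("resized/image_2"))[0]
im = Image.open(os.path.join("resized/image_2", sample))
draw = ImageDraw.Draw(im)

labels = pd.read_csv(os.path.join("resized/label_2", sample.replace(".JPG", ".txt")),
                     sep=" ", header=None)

# Overlay each rescaled box; columns 4-7 are xmin, ymin, xmax, ymax.
for _, row in labels.iterrows():
    draw.rectangle([row[4], row[5], row[6], row[7]], outline="red", width=3)

im.save("bbox_check.jpg")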

The log is as follows (only the relevant portion):

...
INFO:tensorflow:Graph was finalized.
2022-02-14 07:14:15,159 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmp1vyo2zqc/model.ckpt-0
2022-02-14 07:14:15,474 [INFO] tensorflow: Restoring parameters from /tmp/tmp1vyo2zqc/model.ckpt-0
INFO:tensorflow:Running local_init_op.
2022-02-14 07:14:16,954 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2022-02-14 07:14:17,428 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2022-02-14 07:14:22,383 [INFO] tensorflow: Saving checkpoints for step-0.
INFO:tensorflow:epoch = 0.0, learning_rate = 4.9999994e-06, loss = 0.0456286, step = 0
2022-02-14 07:17:39,780 [INFO] tensorflow: epoch = 0.0, learning_rate = 4.9999994e-06, loss = 0.0456286, step = 0
2022-02-14 07:17:39,789 [INFO] iva.detectnet_v2.tfhooks.task_progress_monitor_hook: Epoch 0/120: loss: 0.04563 learning rate: 0.00000 Time taken: 0:00:00 ETA: 0:00:00
2022-02-14 07:17:39,789 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 0.006
2022-02-14 07:17:44,658 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 0.141
INFO:tensorflow:epoch = 0.030660377358490566, learning_rate = 5.0591784e-06, loss = 0.042033017, step = 26 (5.140 sec)
2022-02-14 07:17:44,919 [INFO] tensorflow: epoch = 0.030660377358490566, learning_rate = 5.0591784e-06, loss = 0.042033017, step = 26 (5.140 sec)
2022-02-14 07:17:47,717 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.174
INFO:tensorflow:epoch = 0.08136792452830188, learning_rate = 5.1585935e-06, loss = 0.037755508, step = 69 (5.240 sec)
2022-02-14 07:17:50,159 [INFO] tensorflow: epoch = 0.08136792452830188, learning_rate = 5.1585935e-06, loss = 0.037755508, step = 69 (5.240 sec)
2022-02-14 07:17:50,788 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.141
INFO:tensorflow:global_step/sec: 6.86711
2022-02-14 07:17:52,020 [INFO] tensorflow: global_step/sec: 6.86711
2022-02-14 07:17:53,886 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.069
INFO:tensorflow:epoch = 0.13089622641509432, learning_rate = 5.257579e-06, loss = 0.033465218, step = 111 (5.186 sec)
2022-02-14 07:17:55,344 [INFO] tensorflow: epoch = 0.13089622641509432, learning_rate = 5.257579e-06, loss = 0.033465218, step = 111 (5.186 sec)
2022-02-14 07:17:56,931 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.212
2022-02-14 07:17:59,973 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.218
INFO:tensorflow:epoch = 0.1804245283018868, learning_rate = 5.358469e-06, loss = 0.03268508, step = 153 (5.123 sec)
2022-02-14 07:18:00,468 [INFO] tensorflow: epoch = 0.1804245283018868, learning_rate = 5.358469e-06, loss = 0.03268508, step = 153 (5.123 sec)
INFO:tensorflow:global_step/sec: 8.16827
2022-02-14 07:18:02,303 [INFO] tensorflow: global_step/sec: 8.16827
2022-02-14 07:18:03,041 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.149
INFO:tensorflow:epoch = 0.22995283018867924, learning_rate = 5.4612906e-06, loss = 0.029245147, step = 195 (5.133 sec)
2022-02-14 07:18:05,601 [INFO] tensorflow: epoch = 0.22995283018867924, learning_rate = 5.4612906e-06, loss = 0.029245147, step = 195 (5.133 sec)
2022-02-14 07:18:06,094 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.190
2022-02-14 07:18:09,139 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.210
INFO:tensorflow:epoch = 0.2794811320754717, learning_rate = 5.5660894e-06, loss = 0.027445426, step = 237 (5.127 sec)
2022-02-14 07:18:10,728 [INFO] tensorflow: epoch = 0.2794811320754717, learning_rate = 5.5660894e-06, loss = 0.027445426, step = 237 (5.127 sec)
2022-02-14 07:18:12,199 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.171
INFO:tensorflow:global_step/sec: 8.17859
2022-02-14 07:18:12,573 [INFO] tensorflow: global_step/sec: 8.17859
2022-02-14 07:18:15,292 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.084
INFO:tensorflow:epoch = 0.3290094339622641, learning_rate = 5.6728945e-06, loss = 0.02999771, step = 279 (5.250 sec)
2022-02-14 07:18:15,978 [INFO] tensorflow: epoch = 0.3290094339622641, learning_rate = 5.6728945e-06, loss = 0.02999771, step = 279 (5.250 sec)
2022-02-14 07:18:18,590 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 7.580
INFO:tensorflow:epoch = 0.3773584905660377, learning_rate = 5.779136e-06, loss = 0.024050923, step = 320 (5.184 sec)
2022-02-14 07:18:21,162 [INFO] tensorflow: epoch = 0.3773584905660377, learning_rate = 5.779136e-06, loss = 0.024050923, step = 320 (5.184 sec)
2022-02-14 07:18:21,660 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.145
INFO:tensorflow:global_step/sec: 7.95422
2022-02-14 07:18:23,134 [INFO] tensorflow: global_step/sec: 7.95422
2022-02-14 07:18:24,730 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.143
INFO:tensorflow:epoch = 0.4268867924528302, learning_rate = 5.890034e-06, loss = 0.022420418, step = 362 (5.169 sec)
2022-02-14 07:18:26,331 [INFO] tensorflow: epoch = 0.4268867924528302, learning_rate = 5.890034e-06, loss = 0.022420418, step = 362 (5.169 sec)
2022-02-14 07:18:27,805 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.132
2022-02-14 07:18:30,920 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.028
INFO:tensorflow:epoch = 0.47641509433962265, learning_rate = 6.003055e-06, loss = 0.021905642, step = 404 (5.278 sec)
2022-02-14 07:18:31,609 [INFO] tensorflow: epoch = 0.47641509433962265, learning_rate = 6.003055e-06, loss = 0.021905642, step = 404 (5.278 sec)
INFO:tensorflow:global_step/sec: 8.04348
2022-02-14 07:18:33,577 [INFO] tensorflow: global_step/sec: 8.04348
2022-02-14 07:18:34,076 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 7.920
INFO:tensorflow:epoch = 0.5259433962264151, learning_rate = 6.118251e-06, loss = 0.020372454, step = 446 (5.176 sec)
2022-02-14 07:18:36,785 [INFO] tensorflow: epoch = 0.5259433962264151, learning_rate = 6.118251e-06, loss = 0.020372454, step = 446 (5.176 sec)
2022-02-14 07:18:37,164 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.098
2022-02-14 07:18:40,237 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.134
INFO:tensorflow:epoch = 0.5754716981132075, learning_rate = 6.235651e-06, loss = 0.019447327, step = 488 (5.180 sec)
2022-02-14 07:18:41,965 [INFO] tensorflow: epoch = 0.5754716981132075, learning_rate = 6.235651e-06, loss = 0.019447327, step = 488 (5.180 sec)
2022-02-14 07:18:43,326 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.096
INFO:tensorflow:global_step/sec: 8.09803
2022-02-14 07:18:43,950 [INFO] tensorflow: global_step/sec: 8.09803
2022-02-14 07:18:46,412 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.100
INFO:tensorflow:epoch = 0.625, learning_rate = 6.35531e-06, loss = 0.0180731, step = 530 (5.192 sec)
2022-02-14 07:18:47,157 [INFO] tensorflow: epoch = 0.625, learning_rate = 6.35531e-06, loss = 0.0180731, step = 530 (5.192 sec)
2022-02-14 07:18:49,498 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.102
INFO:tensorflow:epoch = 0.6745283018867925, learning_rate = 6.477259e-06, loss = 0.026863359, step = 572 (5.173 sec)
2022-02-14 07:18:52,330 [INFO] tensorflow: epoch = 0.6745283018867925, learning_rate = 6.477259e-06, loss = 0.026863359, step = 572 (5.173 sec)
2022-02-14 07:18:52,585 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.100
INFO:tensorflow:global_step/sec: 8.10352
2022-02-14 07:18:54,316 [INFO] tensorflow: global_step/sec: 8.10352
2022-02-14 07:18:55,680 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.078
INFO:tensorflow:epoch = 0.7240566037735848, learning_rate = 6.6015537e-06, loss = 0.017696362, step = 614 (5.199 sec)
2022-02-14 07:18:57,529 [INFO] tensorflow: epoch = 0.7240566037735848, learning_rate = 6.6015537e-06, loss = 0.017696362, step = 614 (5.199 sec)
2022-02-14 07:18:58,769 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.093
2022-02-14 07:19:01,851 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.112
INFO:tensorflow:epoch = 0.7735849056603773, learning_rate = 6.728228e-06, loss = 0.016738735, step = 656 (5.192 sec)
2022-02-14 07:19:02,722 [INFO] tensorflow: epoch = 0.7735849056603773, learning_rate = 6.728228e-06, loss = 0.016738735, step = 656 (5.192 sec)
INFO:tensorflow:global_step/sec: 8.09019
2022-02-14 07:19:04,699 [INFO] tensorflow: global_step/sec: 8.09019
2022-02-14 07:19:04,953 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.060
INFO:tensorflow:epoch = 0.8231132075471698, learning_rate = 6.8573395e-06, loss = 0.016266808, step = 698 (5.190 sec)
2022-02-14 07:19:07,912 [INFO] tensorflow: epoch = 0.8231132075471698, learning_rate = 6.8573395e-06, loss = 0.016266808, step = 698 (5.190 sec)
2022-02-14 07:19:08,044 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.088
2022-02-14 07:19:11,125 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.115
INFO:tensorflow:epoch = 0.8726415094339622, learning_rate = 6.988921e-06, loss = 0.01601935, step = 740 (5.189 sec)
2022-02-14 07:19:13,101 [INFO] tensorflow: epoch = 0.8726415094339622, learning_rate = 6.988921e-06, loss = 0.01601935, step = 740 (5.189 sec)
2022-02-14 07:19:14,215 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.092
INFO:tensorflow:global_step/sec: 8.0862
2022-02-14 07:19:15,087 [INFO] tensorflow: global_step/sec: 8.0862
2022-02-14 07:19:17,310 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.078
INFO:tensorflow:epoch = 0.9221698113207547, learning_rate = 7.123035e-06, loss = 0.016791396, step = 782 (5.205 sec)
2022-02-14 07:19:18,306 [INFO] tensorflow: epoch = 0.9221698113207547, learning_rate = 7.123035e-06, loss = 0.016791396, step = 782 (5.205 sec)
2022-02-14 07:19:20,411 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.063
INFO:tensorflow:epoch = 0.9716981132075472, learning_rate = 7.2597154e-06, loss = 0.0155552225, step = 824 (5.186 sec)
2022-02-14 07:19:23,492 [INFO] tensorflow: epoch = 0.9716981132075472, learning_rate = 7.2597154e-06, loss = 0.0155552225, step = 824 (5.186 sec)
2022-02-14 07:19:23,492 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 8.114
INFO:tensorflow:global_step/sec: 8.08974
2022-02-14 07:19:25,470 [INFO] tensorflow: global_step/sec: 8.08974

ec1fb179e025:37:55 [0] enqueue.cc:74 NCCL WARN Cuda failure 'out of memory'
ec1fb179e025:37:55 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
ec1fb179e025:37:55 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
ec1fb179e025:37:55 [0] NCCL INFO NET/IB : No device found.
ec1fb179e025:37:55 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.3<0>
ec1fb179e025:37:55 [0] NCCL INFO Using network Socket
NCCL version 2.9.9+cuda11.3

ec1fb179e025:37:55 [0] init.cc:891 NCCL WARN Cuda failure 'out of memory'
ec1fb179e025:37:55 [0] NCCL INFO init.cc:916 -> 1
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: ncclCommInitRank failed: unhandled cuda error
	 [[{{node HorovodAllreduce_cost_sums_person_cov_0}}]]
	 [[Assign_1/_7677]]
  (1) Unknown: ncclCommInitRank failed: unhandled cuda error
	 [[{{node HorovodAllreduce_cost_sums_person_cov_0}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 849, in <module>
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 838, in <module>
  File "<decorator-gen-2>", line 2, in main
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py", line 46, in wrapped_fn
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 827, in main
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 708, in run_experiment
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 644, in train_gridbox
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 155, in run_training_loop
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 696, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1426, in run
    run_metadata=run_metadata))
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/cost_function/cost_auto_weight_hook.py", line 206, in after_run
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: ncclCommInitRank failed: unhandled cuda error
	 [[node HorovodAllreduce_cost_sums_person_cov_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
	 [[Assign_1/_7677]]
  (1) Unknown: ncclCommInitRank failed: unhandled cuda error
	 [[node HorovodAllreduce_cost_sums_person_cov_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'HorovodAllreduce_cost_sums_person_cov_0':
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 838, in <module>
  File "<decorator-gen-2>", line 2, in main
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py", line 46, in wrapped_fn
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 827, in main
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 708, in run_experiment
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 589, in train_gridbox
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/cost_function/cost_auto_weight_hook.py", line 40, in build_cost_auto_weight_hook
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/cost_function/cost_auto_weight_hook.py", line 78, in __init__
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/cost_function/cost_auto_weight_hook.py", line 143, in _init_objective_weights
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/distribution/distribution.py", line 328, in allreduce
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 80, in allreduce
    summed_tensor_compressed = _allreduce(tensor_compressed)
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_ops.py", line 86, in _allreduce
    return MPI_LIB.horovod_allreduce(tensor, name=name)
  File "<string>", line 80, in horovod_allreduce
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

2022-02-14 12:49:29,048 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

I have already reduced batch_size to its minimum of 1 to use less memory. How else can I optimize memory usage to fix this problem?

The spec below is not correct:

output_image_width: 1248
output_image_height: 384

Since you are using peoplenet.tlt as the pretrained model, you can resize your images/labels to 960x544.

Then, train with

output_image_width: 960
output_image_height: 544
enable_auto_resize: true

Also, are you sure your labels are resized correctly?
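
For reference, if you do resize offline to 960x544, the x and y coordinates need different scale factors, because going from 4000 x 3000 to 960 x 544 changes the aspect ratio. A minimal sketch (directory names reused from your script above, standard KITTI column layout assumed):

from PIL import Image
import pandas as pd
import os

SRC_W, SRC_H = 4000, 3000   # original image size
DST_W, DST_H = 960, 544     # suggested training size for PeopleNet

xScale = SRC_W / DST_W      # ~4.17
yScale = SRC_H / DST_H      # ~5.51, different from xScale

for image in os.listdir("images"):
    im = Image.open(os.path.join("images", image))
    im.resize((DST_W, DST_H)).save(os.path.join("resized/image_2", image))

    df = pd.read_csv(os.path.join("heridal_kitti_labels/label_2", image.replace(".JPG", ".txt")),
                     sep=" ", header=None)
    df[4] = df[4] / xScale   # xmin
    df[6] = df[6] / xScale   # xmax
    df[5] = df[5] / yScale   # ymin
    df[7] = df[7] / yScale   # ymax
    df.to_csv(os.path.join("resized/label_2", image.replace(".JPG", ".txt")),
              sep=" ", header=False, index=False)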

BTW, are you running training with WSL?
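
It is also worth checking how much of the 4 GB is actually free right before you launch training; in your nvidia-smi output the desktop session (Xorg/gnome-shell) already holds roughly 700 MiB. A small sketch using nvidia-smi's query flags:

import subprocess

# Query total/used/free GPU memory (values in MiB) just before starting training.
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.total,memory.used,memory.free",
     "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True)
total, used, free = (int(v) for v in out.stdout.strip().split(", "))
print(f"GPU memory: {free} MiB free of {total} MiB ({used} MiB in use)")

Running headless (or logging out of the desktop session) frees that memory for training.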
