Training got killed before start

TLT Version → docker_tag: v3.21.08-py3
Network Type → Yolov4
Training Spec file : specfile.txt (5.3 KB)

Hi,

I am trying to train YOLOv4 on a custom dataset using a ResNet-18 pretrained model, but the training got killed as shown below.

2022-01-19 08:58:54,014 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/utils/tensor_utils.py:9: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

Epoch 1/80
2/160 […] - ETA: 3:56:09 - loss: 52.1715/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (21.293008). Check your callbacks.
% delta_t_median)
37/160 [=====>…] - ETA: 28:28 - loss: 52.6629Killed
2022-01-19 14:38:08,958 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Looking forward to your help.

The training was most likely killed because it ran out of memory.
Can you check nvidia-smi during training?
Also, which dGPU did you run on?

Training command:

/taoworkspace/dataset$ tao yolo_v4 train --gpus 1 -e /workspace/tao-experiments/specs/spec.txt -r /workspace/tao-experiments/results -k (my key)

I am using a single GPU.

Nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 5000     Off  | 00000000:17:00.0 Off |                  Off |
| 33%   27C    P8     8W / 230W |      1MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 5000     Off  | 00000000:B3:00.0 Off |                  Off |
| 33%   28C    P8     4W / 230W |     20MiB / 16122MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A      1335      G   /usr/lib/xorg/Xorg                  9MiB |
|    1   N/A  N/A      1414      G   /usr/bin/gnome-shell                6MiB |
+-----------------------------------------------------------------------------+

So, what should I do to avoid this issue?

Is it the nvidia-smi result when the training is running?

No.

This is the nvidia-smi result while the training is running:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 5000     Off  | 00000000:17:00.0 Off |                  Off |
| 33%   36C    P2    44W / 230W |  15756MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 5000     Off  | 00000000:B3:00.0 Off |                  Off |
| 33%   30C    P8     4W / 230W |     20MiB / 16122MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      4302      C   /usr/bin/python3.6                115MiB |
|    0   N/A  N/A      4357      C   python3.6                       15637MiB |
|    1   N/A  N/A      1335      G   /usr/lib/xorg/Xorg                  9MiB |
|    1   N/A  N/A      1414      G   /usr/bin/gnome-shell                6MiB |
+-----------------------------------------------------------------------------+

But the training got killed.

Hi, I am still waiting for a response from your side.

Please try to train with a lower batch size.
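In the YOLOv4 training spec, the batch size is the batch_size_per_gpu field in training_config. A minimal sketch, assuming the standard TAO 3.x YOLOv4 spec layout (keep the rest of your specfile.txt unchanged):

training_config {
  batch_size_per_gpu: 8    # example value: lowered to reduce GPU memory use on the 16 GiB RTX 5000
  num_epochs: 100
  # ...learning rate, regularizer, etc. as in your existing specfile.txt
}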

I set the batch size to 8, but the training speed is very low. It stops training for a minute, then moves a little bit, then stops again; even a single epoch has not been completed yet, as shown below.

2022-01-19 13:54:45,069 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/utils/tensor_utils.py:9: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

Epoch 1/100
2/160 […] - ETA: 2:40:21 - loss: 52.1415/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (6.909908). Check your callbacks.
% delta_t_median)
39/160 [======>…] - ETA: 19:34 - loss: 52.4700

Training is very slow. I am also checking with nvidia-smi during training.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 5000     Off  | 00000000:17:00.0 Off |                  Off |
| 33%   38C    P8    30W / 230W |   7168MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 5000     Off  | 00000000:B3:00.0 Off |                  Off |
| 33%   29C    P8     4W / 230W |     20MiB / 16122MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     10329      C   /usr/bin/python3.6                115MiB |
|    0   N/A  N/A     10390      C   python3.6                        7049MiB |
|    1   N/A  N/A      1335      G   /usr/lib/xorg/Xorg                  9MiB |
|    1   N/A  N/A      1414      G   /usr/bin/gnome-shell                6MiB |
+-----------------------------------------------------------------------------+

So, the original out-of-memory issue is gone.
For the slow training speed, please try:
#freeze_blocks: 0
n_workers: 4

Also, if possible, you can run with 2 GPUs. See the spec sketch below.
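For reference, a sketch of where those settings live in the spec, assuming the standard TAO 3.x YOLOv4 spec layout (adjust to your own specfile.txt):

training_config {
  n_workers: 4             # more data-loading workers
  # ...batch size, learning rate, and other training_config fields unchanged
}

yolov4_config {
  # freeze_blocks: 0       # commented out, as suggested above
  # ...other yolov4_config fields unchanged
}

With two GPUs, the train command stays the same except for --gpus 2.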

I set

#freeze_blocks: 0
n_workers: 4

and am using 2 GPUs, but the training speed is still slow.

Epoch 1/100
0196087d8816:66:139 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.8<0>
0196087d8816:66:139 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
0196087d8816:66:139 [0] NCCL INFO NET/IB : No device found.
0196087d8816:66:139 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.8<0>
0196087d8816:66:139 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
0196087d8816:67:138 [1] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.8<0>
0196087d8816:67:138 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
0196087d8816:67:138 [1] NCCL INFO NET/IB : No device found.
0196087d8816:67:138 [1] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.8<0>
0196087d8816:67:138 [1] NCCL INFO Using network Socket
0196087d8816:66:139 [0] NCCL INFO Channel 00/02 : 0 1
0196087d8816:66:139 [0] NCCL INFO Channel 01/02 : 0 1
0196087d8816:67:138 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
0196087d8816:67:138 [1] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1 [1] -1/-1/-1->1->0|0->1->-1/-1/-1
0196087d8816:67:138 [1] NCCL INFO Setting affinity for GPU 1 to ffff
0196087d8816:66:139 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
0196087d8816:66:139 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->-1|-1->0->1/-1/-1
0196087d8816:66:139 [0] NCCL INFO Setting affinity for GPU 0 to ffff
0196087d8816:67:138 [1] NCCL INFO Channel 00 : 1[b3000] → 0[17000] via P2P/IPC
0196087d8816:66:139 [0] NCCL INFO Channel 00 : 0[17000] → 1[b3000] via P2P/IPC
0196087d8816:67:138 [1] NCCL INFO Channel 01 : 1[b3000] → 0[17000] via P2P/IPC
0196087d8816:66:139 [0] NCCL INFO Channel 01 : 0[17000] → 1[b3000] via P2P/IPC
0196087d8816:67:138 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
0196087d8816:67:138 [1] NCCL INFO comm 0x7f49bf00f370 rank 1 nranks 2 cudaDev 1 busId b3000 - Init COMPLETE
0196087d8816:66:139 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
0196087d8816:66:139 [0] NCCL INFO comm 0x7fc5b43b9cb0 rank 0 nranks 2 cudaDev 0 busId 17000 - Init COMPLETE
0196087d8816:66:139 [0] NCCL INFO Launch mode Parallel
14/80 [====>…] - ETA: 17:21 - loss: 52.6657

Nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 5000     Off  | 00000000:17:00.0 Off |                  Off |
| 33%   40C    P2    44W / 230W |   7214MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 5000     Off  | 00000000:B3:00.0 Off |                  Off |
| 33%   44C    P2    39W / 230W |   9165MiB / 16122MiB |     25%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     11505      C   /usr/bin/python3.6                115MiB |
|    0   N/A  N/A     11568      C   python3.6                        7095MiB |
|    1   N/A  N/A      1335      G   /usr/lib/xorg/Xorg                  9MiB |
|    1   N/A  N/A      1414      G   /usr/bin/gnome-shell                6MiB |
|    1   N/A  N/A     11569      C   python3.6                        9143MiB |
+-----------------------------------------------------------------------------+

Please keep training. The training speed may be slow in the first several epochs.

As you can see, it still has not completed even a single epoch.

0196087d8816:66:139 [0] NCCL INFO comm 0x7fc5b43b9cb0 rank 0 nranks 2 cudaDev 0 busId 17000 - Init COMPLETE
0196087d8816:66:139 [0] NCCL INFO Launch mode Parallel
35/80 [============>…] - ETA: 10:21 - loss: 52.3755

0196087d8816:66:139 [0] NCCL INFO Launch mode Parallel
80/80 [==============================] - 1151s 14s/step - loss: 51.3615
Epoch 2/100
80/80 [==============================] - 1004s 13s/step - loss: 44.4567
Epoch 3/100
80/80 [==============================] - 1122s 14s/step - loss: 34.4783
Epoch 4/100
24/80 [========>…] - ETA: 15:03 - loss: 28.4328

Still slow

What is your requirement for the training speed?

I have already trained two other datasets, and each completed 100 epochs within 15-20 minutes. But on this dataset it takes 15-25 minutes to complete a single epoch, so the training will take a very long time if I run it for 100 epochs.

For the training on the “two more datasets”, do you mean you were using the same YOLOv4 network?

Additionally, please try the following (a spec sketch follows after the list):

  1. Set more n_workers, for example, 8.
  2. Use the tfrecord format instead. With tfrecords, please disable mosaic augmentation (mosaic_prob=0). See more info in the YOLOv4 — TAO Toolkit 3.22.05 documentation.
  3. Use AMP, since your GPU (RTX 5000) supports it.
  4. Try different values of max_queue_size.
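For reference, a sketch of where these settings live in the spec, with field names taken from the TAO 3.x YOLOv4 spec and the documentation linked above (the paths below are placeholders; adjust them to your setup):

dataset_config {
  data_sources {
    tfrecords_path: "/workspace/tao-experiments/data/tfrecords/train*"   # placeholder path
    image_directory_path: "/workspace/tao-experiments/data/training"     # placeholder path
  }
  # ...target class mapping, validation sources, etc. unchanged
}

augmentation_config {
  mosaic_prob: 0.0          # disable mosaic augmentation when training from tfrecords
  # ...other augmentation fields unchanged
}

training_config {
  n_workers: 8              # more data-loading workers
  max_queue_size: 16        # example value; try a few different sizes
  # ...other training_config fields unchanged
}

AMP is enabled on the command line rather than in the spec (the tao yolo_v4 train command accepts a --use_amp flag in TAO 3.x; please double-check against the documentation for your version).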

Yes
