Training got killed before start

TLT Version → docker_tag: v3.21.08-py3
Network Type → Yolov4
Training Spec file : specfile.txt (5.3 KB)

Hi,

I am trying to train YOLOv4 on a custom dataset using a ResNet-18 pretrained model, but the training got killed as shown below.

2022-01-19 08:58:54,014 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/utils/tensor_utils.py:9: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

Epoch 1/80
2/160 […] - ETA: 3:56:09 - loss: 52.1715/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (21.293008). Check your callbacks.
% delta_t_median)
37/160 [=====>…] - ETA: 28:28 - loss: 52.6629Killed
2022-01-19 14:38:08,958 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Looking forward to your help.

The training was most likely killed because it ran out of memory.
Can you check nvidia-smi during training?
Also, which dGPU did you run on?

Training command:

/taoworkspace/dataset$ tao yolo_v4 train --gpus 1 -e /workspace/tao-experiments/specs/spec.txt -r /workspace/tao-experiments/results -k (my key)

I am using a single GPU.

Nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 5000     Off  | 00000000:17:00.0 Off |                  Off |
| 33%   27C    P8     8W / 230W |      1MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 5000     Off  | 00000000:B3:00.0 Off |                  Off |
| 33%   28C    P8     4W / 230W |     20MiB / 16122MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A      1335      G   /usr/lib/xorg/Xorg                  9MiB |
|    1   N/A  N/A      1414      G   /usr/bin/gnome-shell                6MiB |
+-----------------------------------------------------------------------------+

So, what should I do to avoid this issue?

Is it the nvidia-smi result when the training is running?

No.

This is the nvidia-smi result while the training is running:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 5000     Off  | 00000000:17:00.0 Off |                  Off |
| 33%   36C    P2    44W / 230W |  15756MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 5000     Off  | 00000000:B3:00.0 Off |                  Off |
| 33%   30C    P8     4W / 230W |     20MiB / 16122MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      4302      C   /usr/bin/python3.6                115MiB |
|    0   N/A  N/A      4357      C   python3.6                       15637MiB |
|    1   N/A  N/A      1335      G   /usr/lib/xorg/Xorg                  9MiB |
|    1   N/A  N/A      1414      G   /usr/bin/gnome-shell                6MiB |
+-----------------------------------------------------------------------------+

But the training got killed.

Hi, I am still waiting for a response from your side.

Please try to train with a lower batch size.
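In the YOLOv4 training spec, the batch size is the batch_size_per_gpu field in training_config. A minimal sketch, assuming the standard TAO 3.x YOLOv4 spec layout (keep the rest of your specfile.txt unchanged):

training_config {
  batch_size_per_gpu: 8    # example value: lowered to reduce GPU memory use on the 16 GiB RTX 5000
  num_epochs: 100
  # ...learning rate, regularizer, etc. as in your existing specfile.txt
}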

I set the batch size to 8, but the training speed is very low. It stops training for a minute, then moves a little bit, then stops again; even a single epoch has not been completed yet, as shown below.

2022-01-19 13:54:45,069 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/utils/tensor_utils.py:9: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

Epoch 1/100
2/160 […] - ETA: 2:40:21 - loss: 52.1415/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (6.909908). Check your callbacks.
% delta_t_median)
39/160 [======>…] - ETA: 19:34 - loss: 52.4700

Training is very slow. I am also checking with nvidia-smi during training.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 5000     Off  | 00000000:17:00.0 Off |                  Off |
| 33%   38C    P8    30W / 230W |   7168MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 5000     Off  | 00000000:B3:00.0 Off |                  Off |
| 33%   29C    P8     4W / 230W |     20MiB / 16122MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     10329      C   /usr/bin/python3.6                115MiB |
|    0   N/A  N/A     10390      C   python3.6                        7049MiB |
|    1   N/A  N/A      1335      G   /usr/lib/xorg/Xorg                  9MiB |
|    1   N/A  N/A      1414      G   /usr/bin/gnome-shell                6MiB |
+-----------------------------------------------------------------------------+

So, the original out-of-memory issue is gone.
For the slow training speed, please try:
#freeze_blocks: 0
n_workers: 4

Also, if possible, you can run with 2 GPUs. See the spec sketch below.
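For reference, a sketch of where those settings live in the spec, assuming the standard TAO 3.x YOLOv4 spec layout (adjust to your own specfile.txt):

training_config {
  n_workers: 4             # more data-loading workers
  # ...batch size, learning rate, and other training_config fields unchanged
}

yolov4_config {
  # freeze_blocks: 0       # commented out, as suggested above
  # ...other yolov4_config fields unchanged
}

With two GPUs, the train command stays the same except for --gpus 2.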

I set

#freeze_blocks: 0
n_workers: 4

and am using 2 GPUs, but the training speed is still slow.

Epoch 1/100
0196087d8816:66:139 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.8<0>
0196087d8816:66:139 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
0196087d8816:66:139 [0] NCCL INFO NET/IB : No device found.
0196087d8816:66:139 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.8<0>
0196087d8816:66:139 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
0196087d8816:67:138 [1] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.8<0>
0196087d8816:67:138 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
0196087d8816:67:138 [1] NCCL INFO NET/IB : No device found.
0196087d8816:67:138 [1] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.8<0>
0196087d8816:67:138 [1] NCCL INFO Using network Socket
0196087d8816:66:139 [0] NCCL INFO Channel 00/02 : 0 1
0196087d8816:66:139 [0] NCCL INFO Channel 01/02 : 0 1
0196087d8816:67:138 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
0196087d8816:67:138 [1] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1 [1] -1/-1/-1->1->0|0->1->-1/-1/-1
0196087d8816:67:138 [1] NCCL INFO Setting affinity for GPU 1 to ffff
0196087d8816:66:139 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
0196087d8816:66:139 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->-1|-1->0->1/-1/-1
0196087d8816:66:139 [0] NCCL INFO Setting affinity for GPU 0 to ffff
0196087d8816:67:138 [1] NCCL INFO Channel 00 : 1[b3000] → 0[17000] via P2P/IPC
0196087d8816:66:139 [0] NCCL INFO Channel 00 : 0[17000] → 1[b3000] via P2P/IPC
0196087d8816:67:138 [1] NCCL INFO Channel 01 : 1[b3000] → 0[17000] via P2P/IPC
0196087d8816:66:139 [0] NCCL INFO Channel 01 : 0[17000] → 1[b3000] via P2P/IPC
0196087d8816:67:138 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
0196087d8816:67:138 [1] NCCL INFO comm 0x7f49bf00f370 rank 1 nranks 2 cudaDev 1 busId b3000 - Init COMPLETE
0196087d8816:66:139 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
0196087d8816:66:139 [0] NCCL INFO comm 0x7fc5b43b9cb0 rank 0 nranks 2 cudaDev 0 busId 17000 - Init COMPLETE
0196087d8816:66:139 [0] NCCL INFO Launch mode Parallel
14/80 [====>…] - ETA: 17:21 - loss: 52.6657

Nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 5000     Off  | 00000000:17:00.0 Off |                  Off |
| 33%   40C    P2    44W / 230W |   7214MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 5000     Off  | 00000000:B3:00.0 Off |                  Off |
| 33%   44C    P2    39W / 230W |   9165MiB / 16122MiB |     25%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     11505      C   /usr/bin/python3.6                115MiB |
|    0   N/A  N/A     11568      C   python3.6                        7095MiB |
|    1   N/A  N/A      1335      G   /usr/lib/xorg/Xorg                  9MiB |
|    1   N/A  N/A      1414      G   /usr/bin/gnome-shell                6MiB |
|    1   N/A  N/A     11569      C   python3.6                        9143MiB |
+-----------------------------------------------------------------------------+

Please keep training. The training speed may be slow in the first several epochs.

As you can see, it still has not completed even a single epoch.

0196087d8816:66:139 [0] NCCL INFO comm 0x7fc5b43b9cb0 rank 0 nranks 2 cudaDev 0 busId 17000 - Init COMPLETE
0196087d8816:66:139 [0] NCCL INFO Launch mode Parallel
35/80 [============>…] - ETA: 10:21 - loss: 52.3755

0196087d8816:66:139 [0] NCCL INFO Launch mode Parallel
80/80 [==============================] - 1151s 14s/step - loss: 51.3615
Epoch 2/100
80/80 [==============================] - 1004s 13s/step - loss: 44.4567
Epoch 3/100
80/80 [==============================] - 1122s 14s/step - loss: 34.4783
Epoch 4/100
24/80 [========>…] - ETA: 15:03 - loss: 28.4328

Still slow

What is your requirement for the training speed?

I have already trained two other datasets, and each completed 100 epochs within 15-20 minutes. But on this dataset it takes 15-25 minutes to complete a single epoch, so the training will take a very long time if I run it for 100 epochs.

For the training on the “two more datasets”, do you mean you were using the same YOLOv4 network?

Additionally, please try the following (a spec sketch follows after the list):

  1. Set more n_workers, for example, 8.
  2. Use the tfrecord format instead. With tfrecords, please disable mosaic augmentation (mosaic_prob=0). See more info in the YOLOv4 — TAO Toolkit 3.22.05 documentation.
  3. Use AMP, since your GPU (RTX 5000) supports it.
  4. Try different values of max_queue_size.
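For reference, a sketch of where these settings live in the spec, with field names taken from the TAO 3.x YOLOv4 spec and the documentation linked above (the paths below are placeholders; adjust them to your setup):

dataset_config {
  data_sources {
    tfrecords_path: "/workspace/tao-experiments/data/tfrecords/train*"   # placeholder path
    image_directory_path: "/workspace/tao-experiments/data/training"     # placeholder path
  }
  # ...target class mapping, validation sources, etc. unchanged
}

augmentation_config {
  mosaic_prob: 0.0          # disable mosaic augmentation when training from tfrecords
  # ...other augmentation fields unchanged
}

training_config {
  n_workers: 8              # more data-loading workers
  max_queue_size: 16        # example value; try a few different sizes
  # ...other training_config fields unchanged
}

AMP is enabled on the command line rather than in the spec (the tao yolo_v4 train command accepts a --use_amp flag in TAO 3.x; please double-check against the documentation for your version).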

Yes
