TLT Version → docker_tag: v3.21.08-py3
Network Type → Yolov4
Training Spec file : specfile.txt (5.3 KB)
Hi,
I am trying to train YOLOv4 on a custom dataset using a ResNet-18 pretrained model, but the training got killed as shown below.
2022-01-19 08:58:54,014 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/utils/tensor_utils.py:9: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.
Epoch 1/80
2/160 […] - ETA: 3:56:09 - loss: 52.1715/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (21.293008). Check your callbacks.
% delta_t_median)
37/160 [=====>…] - ETA: 28:28 - loss: 52.6629Killed
2022-01-19 14:38:08,958 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
Training command (run from /taoworkspace/dataset):
tao yolo_v4 train --gpus 1 -e /workspace/tao-experiments/specs/spec.txt -r /workspace/tao-experiments/results -k (my key)
I am using a single GPU.
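For reference, the launcher maps host paths into the container through ~/.tao_mounts.json, which is why the spec and results paths in the command start with /workspace/tao-experiments even though I run the command from /taoworkspace/dataset. My mounts file looks roughly like this (the source path below is a placeholder, not my exact directory):

{
    "Mounts": [
        {
            "source": "/path/on/host/tao-experiments",
            "destination": "/workspace/tao-experiments"
        }
    ]
}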
nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 5000     Off  | 00000000:17:00.0 Off |                  Off |
| 33%   27C    P8     8W / 230W |      1MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 5000     Off  | 00000000:B3:00.0 Off |                  Off |
| 33%   28C    P8     4W / 230W |     20MiB / 16122MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A      1335      G   /usr/lib/xorg/Xorg                  9MiB |
|    1   N/A  N/A      1414      G   /usr/bin/gnome-shell                6MiB |
+-----------------------------------------------------------------------------+
I set the batch size to 8, but the training speed is very low. It stalls for about a minute, moves a little bit, then stalls again; not even a single epoch has completed yet, as shown below.
2022-01-19 13:54:45,069 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/utils/tensor_utils.py:9: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.
Epoch 1/100
2/160 […] - ETA: 2:40:21 - loss: 52.1415/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (6.909908). Check your callbacks.
% delta_t_median)
39/160 [======>…] - ETA: 19:34 - loss: 52.4700
Training is very slow. I am also monitoring the GPUs with nvidia-smi during training:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 5000     Off  | 00000000:17:00.0 Off |                  Off |
| 33%   38C    P8    30W / 230W |   7168MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 5000     Off  | 00000000:B3:00.0 Off |                  Off |
| 33%   29C    P8     4W / 230W |     20MiB / 16122MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     10329      C   /usr/bin/python3.6                115MiB |
|    0   N/A  N/A     10390      C   python3.6                        7049MiB |
|    1   N/A  N/A      1335      G   /usr/lib/xorg/Xorg                  9MiB |
|    1   N/A  N/A      1414      G   /usr/bin/gnome-shell                6MiB |
+-----------------------------------------------------------------------------+
I have already trained two other datasets, and each completed 100 epochs within 15-20 minutes. With this dataset, however, a single epoch takes 15-25 minutes, so training to 100 epochs would take a very long time.
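For reference, the batch size and epoch count are set in the training_config section of the spec file. Mine follows the standard TAO YOLOv4 spec layout, roughly like this (only a sketch showing the values I mentioned; the learning-rate fields below are placeholders, and the full spec is attached above as specfile.txt):

training_config {
  batch_size_per_gpu: 8      # the batch size I set
  num_epochs: 100            # target epoch count for this run
  checkpoint_interval: 10    # placeholder, see attached spec for the actual value
  learning_rate {
    soft_start_cosine_annealing_schedule {
      min_learning_rate: 1e-7   # placeholder
      max_learning_rate: 1e-4   # placeholder
      soft_start: 0.3           # placeholder
    }
  }
}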