Object detection training duration

Hello,

I am currently using NVIDIA TLT to train a custom YOLOv4 object detection model (cspdarknet53 backbone) for 80 epochs on a GeForce GTX 1060 6GB; the estimated training duration is about six and a half days.

When using darknet (GitHub - AlexeyAB/darknet: YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)) to train the same model, on the same dataset, for the same number of epochs, I am able to train it in a day and a half.

Why does training take so much longer using TLT? Are there any settings I can tweak to reduce training time?

Thanks.

Could you please share the training log and spec file?

Sure (note that the log is incomplete, as training had to be resumed from epoch 30 and it hasn’t finished all 80 epochs yet).

output_log.txt (4.1 KB) yolo_v4_train_cspdarknet53_label1.txt (2.0 KB) yolov4_training_log_cspdarknet53.csv (1.1 KB)

Thanks for the info.
Have you ever tried a larger batch size? For example, bs=2 or 4.
If you have not tried it before, you can set up the experiment on another machine (since your GeForce GTX 1060 is occupied with the current training) or later, after your current training is done. You can also just use a smaller part of the training dataset.
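
For reference, in the TLT YOLOv4 training spec the batch size is usually controlled by batch_size_per_gpu under training_config. The snippet below is only a sketch based on the public TLT documentation, not taken from your spec file:

    training_config {
      batch_size_per_gpu: 2   # e.g. try 2 or 4 instead of 1
      num_epochs: 80
      # other fields (learning rate, regularizer, etc.) left as they are
    }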

Also, for your current training, please help capture the logs below:

  1. Open a first terminal and run the following command:
    $ nvidia-smi dmon

  2. Open a second terminal and run the following command:
    $ top
    Then press 1.

Please share the two logs with us.
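
If it is easier to share, the outputs can also be captured into files, for example (the file names are just suggestions):

    $ nvidia-smi dmon > nvidia-smi-dmon.txt     # stop with Ctrl-C after a minute or two
    $ top -b -n 1 > top_1.txt                   # -b takes a single non-interactive snapshot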

I’ve tried larger batch_size values, but training would error out. The only way I got it to start was by setting batch_size to 1.

nvidia-smi-dmon.txt (2.7 KB) top_1.txt (8.7 KB)

Was it an OOM error or a “Killed” message when you tried the larger batch size?

It was an OOM error for batch=4. It seems to be running fine now with batch=2. Is there anything else I could try besides increasing the batch size and reducing the size of the training dataset?

Please try MPS (Multi-Process Service).

Start MPS daemon process
nvidia-cuda-mps-control -d

Check MPS process
ps -ef | grep mps

Quit MPS daemon
echo quit | nvidia-cuda-mps-control
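
As a side note, NVIDIA’s MPS documentation generally recommends setting the GPU to exclusive-process compute mode before starting the daemon; this step is optional and only mentioned here as a hint:

nvidia-smi -i 0 -c EXCLUSIVE_PROCESS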

I am now training through a TLT docker image (nvcr.io/nvidia/tlt-streamanalytics:v3.0-dp-py3) using cnvrg.io on a server with one A100 GPU. Do I need to run MPS on the PC/server itself, or can I run it from within the docker image?

You can log in to the TLT docker image and run training directly inside the container. Meanwhile, start MPS.
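
If MPS runs on the host, the container also needs access to the MPS pipe directory and the host IPC namespace. A rough sketch of launching the container that way (the mount paths are illustrative, not taken from this thread):

    docker run --runtime=nvidia -it --rm \
        --ipc=host \
        -v /tmp/nvidia-mps:/tmp/nvidia-mps \
        -v /local/workspace:/workspace \
        nvcr.io/nvidia/tlt-streamanalytics:v3.0-dp-py3 /bin/bash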

Also, since you are now training on an A100 GPU, could you try a larger batch size?
To speed up training, normally you can:

  1. increase the batch size (but note that this may trade off mAP)
  2. use multiple GPUs (see the sketch below)

MPS is an option, but it will not increase the speed by much.
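
As a rough illustration of points 1 and 2, the training entrypoint inside the TLT 3.0 container accepts a --gpus flag. The paths and the key below are placeholders; only the spec file name is reused from the one shared earlier:

    # inside the tlt-streamanalytics container (paths and key are placeholders)
    yolo_v4 train -e /workspace/specs/yolo_v4_train_cspdarknet53_label1.txt \
                  -r /workspace/results \
                  -k <your_ngc_key> \
                  --gpus 2

With, for example, --gpus 2 and batch_size_per_gpu: 2, the effective batch size becomes 4, which is typically where most of the speed-up comes from.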