Mon Jun 21 09:15:19 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:27:00.0 Off |                    0 |
| N/A   75C    P0    34W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:83:00.0 Off |                    0 |
| N/A   65C    P0    32W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            Off  | 00000000:A3:00.0 Off |                    0 |
| N/A   56C    P0    29W /  70W |      0MiB / 15109MiB |      6%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            Off  | 00000000:C3:00.0 Off |                    0 |
| N/A   68C    P0    32W /  70W |   2188MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
Can you run the command below to check whether it works?
! yolo_v4 train -e $SPECS_DIR/yolo_v4_train_darknet53_kitti.txt \
                -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned_new \
                -k $KEY \
                --gpus 2 \
                --gpu_index 0 1
To narrow this down, can you run the training outside the Jupyter notebook?
Please run the command below on your host PC.
$ tlt yolo_v4 train -e /workspace/docker_path_to_yolo_v4_train_darknet53_kitti.txt \
                    -r /workspace/docker_path_to_experiment_dir_unpruned_new \
                    -k your_KEY \
                    --gpus 2 \
                    --gpu_index 0 1
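Note that the /workspace/... paths in the command above are paths inside the TLT docker. With the TLT 3.0 launcher, the host directories that map onto them are listed in ~/.tlt_mounts.json; a minimal sketch of such a file (the host paths are placeholders to replace with your own directories) would be:

$ cat > ~/.tlt_mounts.json <<'EOF'
{
    "Mounts": [
        {
            "source": "/path/on/host/to/specs",
            "destination": "/workspace/specs"
        },
        {
            "source": "/path/on/host/to/tlt-experiments",
            "destination": "/workspace/tlt-experiments"
        }
    ]
}
EOF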
Still getting stuck at:
To run with multigpu, please change --gpus based on the number of available GPUs in your machine.
Using TensorFlow backend.
I am not sure what happened in your environment. To narrow this down, can you run the default Jupyter notebook to train against the KITTI dataset instead of your own dataset?
I am training on the KITTI dataset only.
Will restarting the notebook help?
As per my request above, you have already run it outside the Jupyter notebook, right? See YOLO V4 not training - #26 by Morganh
That way, the training is not running inside the notebook.
One more problem I have been facing is that after running YOLO for a particular number of epochs, the system stops working… the server is the Triton inference server.
Hey, you have raised several issues here. You now mention “running yolo for a particular number of epochs”; does that mean the original issue (getting stuck) is gone?
If possible, can you give a summary of your experiments?
So… there are 2 main issues I would like to mention:
- YOLO is not working on multiple GPUs
- After running YOLOv4 on the Triton inference server for a particular number of epochs, let's say 10 epochs, it fails to start the 11th epoch and the server also crashes… which means we have to restart the server to start YOLOv4 again.
For item 1, “YOLO is not working on multiple GPUs”, please try to run another network with the default Jupyter notebooks and multiple GPUs, to narrow down what happened on your machine. For example, you can run the LPRNet notebook or the SSD notebook.
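For reference, a run along these lines from inside the SSD notebook would exercise 2-GPU training, assuming ssd accepts the same --gpus/--gpu_index options as yolo_v4; the spec file and result directory names below are only the defaults from the SSD notebook and may differ in your setup:

! ssd train -e $SPECS_DIR/ssd_train_resnet18_kitti.txt \
            -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
            -k $KEY \
            --gpus 2 \
            --gpu_index 0 1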
For item 2, do you mean you run the training in a docker (the Triton inference server)? Can you share the docker name?
Yes, I have pulled the TLT 3.0 image (docker pull nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3) from the NVIDIA website. I made a docker container using this image ID and have been accessing the YOLO notebook inside the container.
Are you asking me to share the docker version on the server?
The nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3 image is the TLT 3.0 docker, not the Triton inference server.
Why did you “make a docker container using this image ID”? Can you share the steps?
The server I am using is the TRITON INFERENCE SERVER, which has 4 GPUs.
step 1 - I pulled the image using the command docker pull nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3
step 2 - Once the image was pulled on the Triton server, I used the image ID to make the docker container
step 3 - I ran the docker container on a particular port number
step 4 - I went inside the docker container using the command docker exec -it nameofcontainer bash
step 5 - I started the Jupyter notebook inside the container
step 6 - I accessed the Jupyter notebook from my local machine using the IP of the server and the port number, so the notebook is reachable outside the docker container (see the sketch below)
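In other words, the setup above is roughly the following sketch; the container name, port number, and host data path are placeholders, and something like --gpus all (with the NVIDIA container toolkit installed) is assumed so the container can see the four T4s:

# step 1: pull the TLT 3.0 image
docker pull nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3

# steps 2-3: create a container from the image and keep it running,
# publishing a port for Jupyter and mounting a data directory
docker run -d --gpus all --name tlt3_container \
    -p 8888:8888 \
    -v /data/tlt-experiments:/workspace/tlt-experiments \
    nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3 sleep infinity

# step 4: enter the running container
docker exec -it tlt3_container bash

# step 5: start Jupyter inside the container so it is reachable from outside
jupyter notebook --ip 0.0.0.0 --port 8888 --allow-root

# step 6: from the local machine, open http://<server-ip>:8888 in a browser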
This is the Docker version installed on the Triton server -
Client: Docker Engine - Community
 Version:           20.10.5
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        55c4c88
 Built:             Tue Mar  2 20:18:05 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.6
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       8728dd2
  Built:            Fri Apr  9 22:44:13 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.4
  GitCommit:        05f951a3781f4f2c1911b05e61c160e9c30eaa8e
 nvidia:
  Version:          1.0.0-rc93
  GitCommit:        12644e614e25b05da6fd08a38ffa0cfe1903fdec
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
As mentioned above, please run the experiments below.
For item 1, “YOLO is not working on multiple GPUs”, please try to run another network with the default Jupyter notebooks and multiple GPUs, to narrow down what happened on your machine. For example, you can run the LPRNet notebook or the SSD notebook.
Any update on the above experiments? Can LPRNet or SSD run successfully with multiple GPUs?
BTW, in your step 2, which Triton server docker did you use? I need to check whether I can reproduce your error.
There has been no update from you for a while, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks
May I know which Triton server docker you used?