Mon Jun 21 09:15:19 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:27:00.0 Off |                    0 |
| N/A   75C    P0    34W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:83:00.0 Off |                    0 |
| N/A   65C    P0    32W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            Off  | 00000000:A3:00.0 Off |                    0 |
| N/A   56C    P0    29W /  70W |      0MiB / 15109MiB |      6%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            Off  | 00000000:C3:00.0 Off |                    0 |
| N/A   68C    P0    32W /  70W |   2188MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
Can you run the command below to check whether it works?
! yolo_v4 train -e $SPECS_DIR/yolo_v4_train_darknet53_kitti.txt \
                -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned_new \
                -k $KEY \
                --gpus 2 \
                --gpu_index 0 1
To narrow this down, can you run the training outside the Jupyter notebook?
Please run the command below on your host PC.
$ tlt yolo_v4 train -e /workspace/docker_path_to_yolo_v4_train_darknet53_kitti.txt \
                    -r /workspace/docker_path_to_experiment_dir_unpruned_new \
                    -k your_KEY \
                    --gpus 2 \
                    --gpu_index 0 1
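Note that the /workspace/... paths in the command above are paths inside the TLT docker. With the TLT 3.0 launcher, the host directories that map onto them are listed in ~/.tlt_mounts.json; a minimal sketch of such a file (the host paths are placeholders to replace with your own directories) would be:

$ cat > ~/.tlt_mounts.json <<'EOF'
{
    "Mounts": [
        {
            "source": "/path/on/host/to/specs",
            "destination": "/workspace/specs"
        },
        {
            "source": "/path/on/host/to/tlt-experiments",
            "destination": "/workspace/tlt-experiments"
        }
    ]
}
EOF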
Still getting stuck at:
To run with multigpu, please change --gpus based on the number of available GPUs in your machine.
Using TensorFlow backend.
I am not sure what happened in your environment. To narrow this down, can you run the default Jupyter notebook to train against the KITTI dataset instead of your own dataset?
I am training on the KITTI dataset only.
Will restarting the notebook help?
As per my request above, you have already run it outside the Jupyter notebook, right? See YOLO V4 not training - #26 by Morganh
That way, the training is not running inside the notebook.
One more problem I have been facing is that after running YOLO for a particular number of epochs, the system stops working… the server is the Triton inference server.
Hey, you have raised several issues here. You now mention “running yolo for a particular number of epochs”; does that mean the original issue (getting stuck) is gone?
If possible, can you give a summary of your experiments?
So… there are 2 main issues I would like to mention:
- YOLO is not working on multiple GPUs
- After running YOLOv4 on the Triton inference server for a particular number of epochs, let's say 10 epochs, it fails to start the 11th epoch and the server also crashes… which means we have to restart the server to start YOLOv4 again.
For item 1, “YOLO is not working on multiple GPUs”, please try to run another network with the default Jupyter notebooks and multiple GPUs, to narrow down what happened on your machine. For example, you can run the LPRNet notebook or the SSD notebook.
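For reference, a run along these lines from inside the SSD notebook would exercise 2-GPU training, assuming ssd accepts the same --gpus/--gpu_index options as yolo_v4; the spec file and result directory names below are only the defaults from the SSD notebook and may differ in your setup:

! ssd train -e $SPECS_DIR/ssd_train_resnet18_kitti.txt \
            -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
            -k $KEY \
            --gpus 2 \
            --gpu_index 0 1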
For item 2, do you mean you run the training in a docker (the Triton inference server)? Can you share the docker name?
Yes, I have pulled the TLT 3.0 image (docker pull nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3) from the NVIDIA website. I made a docker container using this image ID and have been accessing the YOLO notebook inside the container.
Are you asking me to share the docker version on the server?
The nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3 image is the TLT 3.0 docker, not the Triton inference server.
Why did you “make a docker container using this image ID”? Can you share the steps?
The server I am using is the TRITON INFERENCE SERVER, which has 4 GPUs.
step 1 - I pulled the image using the command docker pull nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3
step 2 - Once the image was pulled on the Triton server, I used the image ID to make the docker container
step 3 - I ran the docker container on a particular port number
step 4 - I went inside the docker container using the command docker exec -it nameofcontainer bash
step 5 - I started the Jupyter notebook inside the container
step 6 - I accessed the Jupyter notebook from my local machine using the IP of the server and the port number, so the notebook is reachable outside the docker container (see the sketch below)
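In other words, the setup above is roughly the following sketch; the container name, port number, and host data path are placeholders, and something like --gpus all (with the NVIDIA container toolkit installed) is assumed so the container can see the four T4s:

# step 1: pull the TLT 3.0 image
docker pull nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3

# steps 2-3: create a container from the image and keep it running,
# publishing a port for Jupyter and mounting a data directory
docker run -d --gpus all --name tlt3_container \
    -p 8888:8888 \
    -v /data/tlt-experiments:/workspace/tlt-experiments \
    nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3 sleep infinity

# step 4: enter the running container
docker exec -it tlt3_container bash

# step 5: start Jupyter inside the container so it is reachable from outside
jupyter notebook --ip 0.0.0.0 --port 8888 --allow-root

# step 6: from the local machine, open http://<server-ip>:8888 in a browser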
This is the Docker version installed on the Triton server -
Client: Docker Engine - Community
 Version:           20.10.5
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        55c4c88
 Built:             Tue Mar  2 20:18:05 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.6
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       8728dd2
  Built:            Fri Apr  9 22:44:13 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.4
  GitCommit:        05f951a3781f4f2c1911b05e61c160e9c30eaa8e
 nvidia:
  Version:          1.0.0-rc93
  GitCommit:        12644e614e25b05da6fd08a38ffa0cfe1903fdec
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
As mentioned above, please run the experiments below.
For item 1, “YOLO is not working on multiple GPUs”, please try to run another network with the default Jupyter notebooks and multiple GPUs, to narrow down what happened on your machine. For example, you can run the LPRNet notebook or the SSD notebook.
Any update on the above experiments? Can LPRNet or SSD run successfully with multiple GPUs?
BTW, in your step 2, which Triton server docker did you use? I need to check whether I can reproduce your error.
There has been no update from you for a while, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks
May I know which Triton server docker you used?