[TLT] YOLOv4 training fails. Training process assigned to CPU instead of GPU?

Please provide the following information when requesting support.

Hardware
NVIDIA GPU (RTX 3060 Ti, 12 GB)

Network Type
Yolo_v4

TLT Version
[docker image] nvcr.io/nvidia/tlt-streamanalytics:v3.0-dp-py3

Training spec file
The only change to the config file is the batch size, which is set to 1.

random_seed: 42
yolov4_config {
  big_anchor_shape: "[(114.94, 60.67), (159.06, 114.59), (297.59, 176.38)]"
  mid_anchor_shape: "[(42.99, 31.91), (79.57, 31.75), (56.80, 56.93)]"
  small_anchor_shape: "[(15.60, 13.88), (30.25, 20.25), (20.67, 49.63)]"
  box_matching_iou: 0.25
  arch: "resnet"
  nlayers: 18
  arch_conv_blocks: 2
  loss_loc_weight: 0.8
  loss_neg_obj_weights: 100.0
  loss_class_weights: 0.5
  label_smoothing: 0.0
  big_grid_xy_extend: 0.05
  mid_grid_xy_extend: 0.1
  small_grid_xy_extend: 0.2
  freeze_bn: false
  #freeze_blocks: 0
  force_relu: false
}
training_config {
  batch_size_per_gpu: 1
  num_epochs: 80
  enable_qat: false
  checkpoint_interval: 10
  learning_rate {
    soft_start_cosine_annealing_schedule {
      min_learning_rate: 1e-7
      max_learning_rate: 1e-4
      soft_start: 0.3
    }
  }
  regularizer {
    type: L1
    weight: 3e-5
  }
  optimizer {
    adam {
      epsilon: 1e-7
      beta1: 0.9
      beta2: 0.999
      amsgrad: false
    }
  }
  pretrain_model_path: "/workspace/tlt-experiments/yolo_v4/pretrained_resnet18/tlt_pretrained_object_detection_vresnet18/resnet_18.hdf5"
}
eval_config {
  average_precision_mode: SAMPLE
  batch_size: 1
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.001
  clustering_iou_threshold: 0.5
  top_k: 200
}
augmentation_config {
  hue: 0.1
  saturation: 1.5
  exposure: 1.5
  vertical_flip: 0
  horizontal_flip: 0.5
  jitter: 0.3
  output_width: 1248
  output_height: 384
  randomize_input_shape_period: 0
  mosaic_prob: 0.5
  mosaic_min_ratio: 0.2
}
dataset_config {
  data_sources: {
      label_directory_path: "/workspace/tlt-experiments/data/training/label_2"
      image_directory_path: "/workspace/tlt-experiments/data/training/image_2"
  }
  include_difficult_in_training: true
  target_class_mapping {
      key: "car"
      value: "car"
  }
  target_class_mapping {
      key: "pedestrian"
      value: "pedestrian"
  }
  target_class_mapping {
      key: "cyclist"
      value: "cyclist"
  }
  target_class_mapping {
      key: "van"
      value: "car"
  }
  target_class_mapping {
      key: "person_sitting"
      value: "pedestrian"
  }
  validation_data_sources: {
      label_directory_path: "/workspace/tlt-experiments/data/val/label"
      image_directory_path: "/workspace/tlt-experiments/data/val/image"
  }
}

How to reproduce the issue?
(I have had the image for a while)

nvidia-docker run --runtime=nvidia --gpus all --name=tlt-vision3 --entrypoint "" -it -v /home/telconet/dev/tlt/tlt-experiments:/workspace/tlt-experiments -v /home/telconet/dev/notebooks:/workspace/notebooks -p 8888:8888 tlt:220706 /bin/bash
yolo_v4 train -e /workspace/examples/yolo_v4/specs/yolo_v4_train_resnet18_kitti.txt -r /workspace/tlt-experiments/yolo_v4/experiment_dir_unpruned -k <myKey> --gpus 1 --log_file traininglog.txt

Dataset
images: Download
labels: Download

Model
resnet_18.hdf5

Additional Hardware-Software Info:
Host computer:
OS: Ubuntu 18.04
CPU: i5-10400F
MotherBoard: Z590-A PRO (MS-7D09)
NVIDIA driver: 465.19.01
CUDA version: V11.1.105, build cuda_11.1.TC455_06.29190527_0
TensorRT: 7.2.1-1+cuda11.1

Diagnostic
It seems that training is running on the CPU and system RAM instead of the GPU and VRAM.
CPU/RAM usage goes high during training, and the process is not listed in nvidia-smi.
Before failing, RAM reaches 100% usage.
The process fails on epoch 2.

Commands
(I ran all of these commands while the training process was executing.)
htop

nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:03:00.0  On |                  N/A |
| 32%   30C    P2    28W / 170W |    844MiB / 12053MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

ps -aux | grep -i yolo_v4

root      4519  0.0  0.0   4640   832 pts/2    S+   21:28   0:00 /bin/sh -c  bash -c 'CUDA_VISIBLE_DEVICES=0, TF_ENABLE_AUTO_MIXED_PRECISION=0 python3.6 /usr/local/lib/python3.6/dist-packages/iva/yolo_v4/scripts/train.pyc --experiment_spec_file /workspace/examples/yolo_v4/specs/yolo_v4_train_resnet18_kitti.txt --results_dir /workspace/tlt-experiments/yolo_v4/experiment_dir_unpruned --key <mySecretKey>'
root      4520  143 69.8 24483320 5614384 pts/2 Sl+ 21:28   1:03 python3.6 /usr/local/lib/python3.6/dist-packages/iva/yolo_v4/scripts/train.pyc --experiment_spec_file /workspace/examples/yolo_v4/specs/yolo_v4_train_resnet18_kitti.txt --results_dir /workspace/tlt-experiments/yolo_v4/experiment_dir_unpruned --key <my_secret_key>'

Question
What could I be missing?
Why could I be getting these flag values?

CUDA_VISIBLE_DEVICES=0, TF_ENABLE_AUTO_MIXED_PRECISION=0

How can I verify that nvidia-docker is working correctly?
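
Would something like the following be enough to verify it? (This is just my best guess at a sanity check; the CUDA image tag is only an example, any CUDA base image should do.)

# Run nvidia-smi inside a throwaway CUDA container; if the GPU table
# appears, the NVIDIA container runtime is exposing the GPU correctly.
docker run --rm --gpus all nvidia/cuda:11.1.1-base-ubuntu18.04 nvidia-smi

# Confirm the NVIDIA container toolkit packages are installed on the host.
dpkg -l | grep -i nvidia-container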


UPDATE 1
Executing nvidia-smi outside the Docker container now shows the processes:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:03:00.0  On |                  N/A |
| 31%   48C    P2    96W / 170W |   5324MiB / 12053MiB |     77%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2192      G   /usr/lib/xorg/Xorg                 18MiB |
|    0   N/A  N/A     28881      C   /usr/bin/python3.6                 99MiB |
|    0   N/A  N/A     28937      C   python3.6                        5203MiB |
+-----------------------------------------------------------------------------+

RAM usage is still bordering on 100%.
GPU usage is higher now.

I changed --gpus all to --gpus=all in the nvidia-docker run command. I’m not sure if that was the “fix”.
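
As an extra check (nothing TLT-specific, and the container name is the one from my run command above), I can ask TensorFlow inside the running container which devices it sees:

docker exec -it tlt-vision3 python3.6 -c "from tensorflow.python.client import device_lib; print([d.name for d in device_lib.list_local_devices()])"

If '/device:GPU:0' shows up in that list, TensorFlow inside the container can see the card.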

Please install the latest TAO (22.05) according to the TAO Toolkit Quick Start Guide - NVIDIA Docs:
$ pip3 install nvidia-tao

Then, run in a terminal:
$ tao yolov4 xxx

Or run inside the Docker container:
$ tao yolov4 run /bin/bash
then
# yolov4 train xxx
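
For reference, a minimal sketch of the launcher workflow (the mounts-file schema is what I recall from the Quick Start Guide, the paths are taken from your earlier run command, and the spec path is a placeholder, so adjust as needed):

# Tell the TAO launcher which host directories to mount into its container.
cat > ~/.tao_mounts.json <<'EOF'
{
  "Mounts": [
    {"source": "/home/telconet/dev/tlt/tlt-experiments", "destination": "/workspace/tlt-experiments"},
    {"source": "/home/telconet/dev/notebooks", "destination": "/workspace/notebooks"}
  ]
}
EOF

# Then launch training through the launcher (task name with an underscore).
tao yolo_v4 train -e /workspace/tlt-experiments/specs/yolo_v4_train_resnet18_kitti.txt \
  -r /workspace/tlt-experiments/yolo_v4/experiment_dir_unpruned \
  -k <yourKey> --gpus 1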

Hello @Morganh.
Is there a known issue with yolo_v4 on TLT, or is TLT simply no longer supported?

I’m using TLT instead of TAO for compatibility reasons, and I cannot simply update to TAO as I would like to.

PS: Would a yolo_v4 model trained under TAO be compatible with DeepStream 5.1?

TAO is just the renaming of TLT, as of August 2021.
The steps I mentioned above are meant to narrow down the issue.

Also, TAO 22.02 and 22.05 include improvements for yolov4.

Yes, you can deploy the .etlt model in DS.

Thank you @Morganh.
I just added more swap to the computer and TLT is working as expected. I will be adding RAM soon.
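
In case it helps anyone hitting the same wall, this is the standard Ubuntu swap-file recipe (a generic sketch; I won't swear these are the exact commands I ran, and the size is just what I picked):

sudo fallocate -l 24G /swapfile      # create a 24 GB swap file
sudo chmod 600 /swapfile             # restrict permissions
sudo mkswap /swapfile                # format it as swap
sudo swapon /swapfile                # enable it immediately
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab   # persist across reboots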

I am still confused about the use of RAM and VRAM.
Mid-training usage is:
RAM 6.47G/7.67G
SWAP 2.69G/24.0G
VRAM 5322MiB / 12053MiB

What is loaded into RAM and what is loaded into VRAM?
Is it possible to leverage the roughly 6.6 GB of VRAM that is currently going unused?
What benefit do I get from having so much VRAM?
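
For context, I am watching both sides with generic monitoring commands like these (nothing TLT-specific):

watch -n 5 free -h      # host RAM and swap usage, refreshed every 5 s
nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv -l 5   # VRAM and GPU utilization every 5 s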

Thank you.

Glad to know you can train now.

What was the original RAM and swap in your system?

Originally, when I wrote the post, I had:
8GB RAM
0 SWAP
12GB VRAM

Then I added swap to overcome the issue:
8GB RAM
24GB SWAP
12GB VRAM

And now I have bought some more RAM:
24GB RAM
24GB SWAP
12GB VRAM

Thanks for the info.