[TLT] YOLOv4 training fails. Training process assigned to CPU instead of GPU?

Please provide the following information when requesting support.

Hardware
NVIDIA GPU (RTX 3060 Ti, 12 GB)

Network Type
Yolo_v4

TLT Version
[docker image] nvcr.io/nvidia/tlt-streamanalytics:v3.0-dp-py3

Training spec file
The only change to the config file is the batch size, which is set to 1.

random_seed: 42
yolov4_config {
  big_anchor_shape: "[(114.94, 60.67), (159.06, 114.59), (297.59, 176.38)]"
  mid_anchor_shape: "[(42.99, 31.91), (79.57, 31.75), (56.80, 56.93)]"
  small_anchor_shape: "[(15.60, 13.88), (30.25, 20.25), (20.67, 49.63)]"
  box_matching_iou: 0.25
  arch: "resnet"
  nlayers: 18
  arch_conv_blocks: 2
  loss_loc_weight: 0.8
  loss_neg_obj_weights: 100.0
  loss_class_weights: 0.5
  label_smoothing: 0.0
  big_grid_xy_extend: 0.05
  mid_grid_xy_extend: 0.1
  small_grid_xy_extend: 0.2
  freeze_bn: false
  #freeze_blocks: 0
  force_relu: false
}
training_config {
  batch_size_per_gpu: 1
  num_epochs: 80
  enable_qat: false
  checkpoint_interval: 10
  learning_rate {
    soft_start_cosine_annealing_schedule {
      min_learning_rate: 1e-7
      max_learning_rate: 1e-4
      soft_start: 0.3
    }
  }
  regularizer {
    type: L1
    weight: 3e-5
  }
  optimizer {
    adam {
      epsilon: 1e-7
      beta1: 0.9
      beta2: 0.999
      amsgrad: false
    }
  }
  pretrain_model_path: "/workspace/tlt-experiments/yolo_v4/pretrained_resnet18/tlt_pretrained_object_detection_vresnet18/resnet_18.hdf5"
}
eval_config {
  average_precision_mode: SAMPLE
  batch_size: 1
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.001
  clustering_iou_threshold: 0.5
  top_k: 200
}
augmentation_config {
  hue: 0.1
  saturation: 1.5
  exposure: 1.5
  vertical_flip: 0
  horizontal_flip: 0.5
  jitter: 0.3
  output_width: 1248
  output_height: 384
  randomize_input_shape_period: 0
  mosaic_prob: 0.5
  mosaic_min_ratio: 0.2
}
dataset_config {
  data_sources: {
      label_directory_path: "/workspace/tlt-experiments/data/training/label_2"
      image_directory_path: "/workspace/tlt-experiments/data/training/image_2"
  }
  include_difficult_in_training: true
  target_class_mapping {
      key: "car"
      value: "car"
  }
  target_class_mapping {
      key: "pedestrian"
      value: "pedestrian"
  }
  target_class_mapping {
      key: "cyclist"
      value: "cyclist"
  }
  target_class_mapping {
      key: "van"
      value: "car"
  }
  target_class_mapping {
      key: "person_sitting"
      value: "pedestrian"
  }
  validation_data_sources: {
      label_directory_path: "/workspace/tlt-experiments/data/val/label"
      image_directory_path: "/workspace/tlt-experiments/data/val/image"
  }
}

How to reproduce the issue?
(I have had the image for a while)

nvidia-docker run --runtime=nvidia --gpus all --name=tlt-vision3 --entrypoint "" -it -v /home/telconet/dev/tlt/tlt-experiments:/workspace/tlt-experiments -v /home/telconet/dev/notebooks:/workspace/notebooks -p 8888:8888 tlt:220706 /bin/bash
yolo_v4 train -e /workspace/examples/yolo_v4/specs/yolo_v4_train_resnet18_kitti.txt -r /workspace/tlt-experiments/yolo_v4/experiment_dir_unpruned -k <myKey> --gpus 1 --log_file traininglog.txt

Dataset
images: Download
labels: Download

Model
resnet_18.hdf5

Additional Hardware-Software Info:
Host computer:
OS: Ubuntu 18.04
CPU: i5-10400F
MotherBoard: Z590-A PRO (MS-7D09)
NVIDIA driver: 465.19.01
CUDA version: V11.1.105, build cuda_11.1.TC455_06.29190527_0
TensorRT: 7.2.1-1+cuda11.1

Diagnostic
It seems that training is running on the CPU and system RAM instead of the GPU and VRAM.
CPU/RAM usage goes high during training, and the process is not listed in nvidia-smi.
Before failing, RAM reaches 100% usage.
The process fails on epoch 2.

Commands
(I ran all of these commands while the training process was executing.)
htop

nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:03:00.0  On |                  N/A |
| 32%   30C    P2    28W / 170W |    844MiB / 12053MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

ps -aux | grep -i yolo_v4

root      4519  0.0  0.0   4640   832 pts/2    S+   21:28   0:00 /bin/sh -c  bash -c 'CUDA_VISIBLE_DEVICES=0, TF_ENABLE_AUTO_MIXED_PRECISION=0 python3.6 /usr/local/lib/python3.6/dist-packages/iva/yolo_v4/scripts/train.pyc --experiment_spec_file /workspace/examples/yolo_v4/specs/yolo_v4_train_resnet18_kitti.txt --results_dir /workspace/tlt-experiments/yolo_v4/experiment_dir_unpruned --key <mySecretKey>'
root      4520  143 69.8 24483320 5614384 pts/2 Sl+ 21:28   1:03 python3.6 /usr/local/lib/python3.6/dist-packages/iva/yolo_v4/scripts/train.pyc --experiment_spec_file /workspace/examples/yolo_v4/specs/yolo_v4_train_resnet18_kitti.txt --results_dir /workspace/tlt-experiments/yolo_v4/experiment_dir_unpruned --key <my_secret_key>'

Question
What could I be missing?
Why could I be getting these flag values?

CUDA_VISIBLE_DEVICES=0, TF_ENABLE_AUTO_MIXED_PRECISION=0

How can I verify that nvidia-docker is working correctly?
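
Would something like the following be enough to verify it? (This is just my best guess at a sanity check; the CUDA image tag is only an example, any CUDA base image should do.)

# Run nvidia-smi inside a throwaway CUDA container; if the GPU table
# appears, the NVIDIA container runtime is exposing the GPU correctly.
docker run --rm --gpus all nvidia/cuda:11.1.1-base-ubuntu18.04 nvidia-smi

# Confirm the NVIDIA container toolkit packages are installed on the host.
dpkg -l | grep -i nvidia-container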


UPDATE 1
Executing nvidia-smi outside the Docker container now shows the processes:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:03:00.0  On |                  N/A |
| 31%   48C    P2    96W / 170W |   5324MiB / 12053MiB |     77%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2192      G   /usr/lib/xorg/Xorg                 18MiB |
|    0   N/A  N/A     28881      C   /usr/bin/python3.6                 99MiB |
|    0   N/A  N/A     28937      C   python3.6                        5203MiB |
+-----------------------------------------------------------------------------+

RAM usage is still bordering on 100%.
GPU usage is higher now.

I changed --gpus all to --gpus=all in the nvidia-docker run command. I’m not sure if that was the “fix”.
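
As an extra check (nothing TLT-specific, and the container name is the one from my run command above), I can ask TensorFlow inside the running container which devices it sees:

docker exec -it tlt-vision3 python3.6 -c "from tensorflow.python.client import device_lib; print([d.name for d in device_lib.list_local_devices()])"

If '/device:GPU:0' shows up in that list, TensorFlow inside the container can see the card.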

Please install the latest TAO (22.05) according to the TAO Toolkit Quick Start Guide - NVIDIA Docs:
$ pip3 install nvidia-tao

Then, run in a terminal:
$ tao yolov4 xxx

Or run inside the Docker container:
$ tao yolov4 run /bin/bash
then
# yolov4 train xxx
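
For reference, a minimal sketch of the launcher workflow (the mounts-file schema is what I recall from the Quick Start Guide, the paths are taken from your earlier run command, and the spec path is a placeholder, so adjust as needed):

# Tell the TAO launcher which host directories to mount into its container.
cat > ~/.tao_mounts.json <<'EOF'
{
  "Mounts": [
    {"source": "/home/telconet/dev/tlt/tlt-experiments", "destination": "/workspace/tlt-experiments"},
    {"source": "/home/telconet/dev/notebooks", "destination": "/workspace/notebooks"}
  ]
}
EOF

# Then launch training through the launcher (task name with an underscore).
tao yolo_v4 train -e /workspace/tlt-experiments/specs/yolo_v4_train_resnet18_kitti.txt \
  -r /workspace/tlt-experiments/yolo_v4/experiment_dir_unpruned \
  -k <yourKey> --gpus 1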

Hello @Morganh.
Is there a known issue with yolo_v4 on TLT, or is TLT simply no longer supported?

I’m using TLT instead of TAO for compatibility reasons, and I cannot simply update to TAO as I would like to.

PS: Would a yolo_v4 model trained under TAO be compatible with DeepStream 5.1?

TAO is just the renaming of TLT, as of August 2021.
The steps I mentioned above are meant to narrow down the issue.

Also, TAO 22.02 and 22.05 include improvements for yolov4.

Yes, you can deploy the .etlt model in DS.

Thank you @Morganh.
I just added more swap to the computer and TLT is working as expected. I will be adding RAM soon.
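
In case it helps anyone hitting the same wall, this is the standard Ubuntu swap-file recipe (a generic sketch; I won't swear these are the exact commands I ran, and the size is just what I picked):

sudo fallocate -l 24G /swapfile      # create a 24 GB swap file
sudo chmod 600 /swapfile             # restrict permissions
sudo mkswap /swapfile                # format it as swap
sudo swapon /swapfile                # enable it immediately
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab   # persist across reboots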

I am still confused about the use of RAM and VRAM.
Mid-training usage is:
RAM 6.47G/7.67G
SWAP 2.69G/24.0G
VRAM 5322MiB / 12053MiB

What is loaded into RAM and what is loaded into VRAM?
Is it possible to leverage the roughly 6.6 GB of VRAM that is currently going unused?
What benefit do I get from having so much VRAM?
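
For context, I am watching both sides with generic monitoring commands like these (nothing TLT-specific):

watch -n 5 free -h      # host RAM and swap usage, refreshed every 5 s
nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv -l 5   # VRAM and GPU utilization every 5 s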

Thank you.

Glad to know you can train now.

What was the original RAM and swap in your system?

Originally, when I wrote the post, I had:
8GB RAM
0 SWAP
12GB VRAM

Then I added swap to overcome the issue:
8GB RAM
24GB SWAP
12GB VRAM

And now I have bought some more RAM:
24GB RAM
24GB SWAP
12GB VRAM

Thanks for the info.