Please provide the following information when requesting support.
• Hardware
NVIDIA GPU (RTX 3060 Ti, 12 GB)
• Network Type
Yolo_v4
• TLT Version
[docker image] nvcr.io/nvidia/tlt-streamanalytics:v3.0-dp-py3
• Training spec file
The only change to the config file is the batch size, which is set to 1.
random_seed: 42
yolov4_config {
  big_anchor_shape: "[(114.94, 60.67), (159.06, 114.59), (297.59, 176.38)]"
  mid_anchor_shape: "[(42.99, 31.91), (79.57, 31.75), (56.80, 56.93)]"
  small_anchor_shape: "[(15.60, 13.88), (30.25, 20.25), (20.67, 49.63)]"
  box_matching_iou: 0.25
  arch: "resnet"
  nlayers: 18
  arch_conv_blocks: 2
  loss_loc_weight: 0.8
  loss_neg_obj_weights: 100.0
  loss_class_weights: 0.5
  label_smoothing: 0.0
  big_grid_xy_extend: 0.05
  mid_grid_xy_extend: 0.1
  small_grid_xy_extend: 0.2
  freeze_bn: false
  #freeze_blocks: 0
  force_relu: false
}
training_config {
  batch_size_per_gpu: 1
  num_epochs: 80
  enable_qat: false
  checkpoint_interval: 10
  learning_rate {
    soft_start_cosine_annealing_schedule {
      min_learning_rate: 1e-7
      max_learning_rate: 1e-4
      soft_start: 0.3
    }
  }
  regularizer {
    type: L1
    weight: 3e-5
  }
  optimizer {
    adam {
      epsilon: 1e-7
      beta1: 0.9
      beta2: 0.999
      amsgrad: false
    }
  }
  pretrain_model_path: "/workspace/tlt-experiments/yolo_v4/pretrained_resnet18/tlt_pretrained_object_detection_vresnet18/resnet_18.hdf5"
}
eval_config {
  average_precision_mode: SAMPLE
  batch_size: 1
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.001
  clustering_iou_threshold: 0.5
  top_k: 200
}
augmentation_config {
  hue: 0.1
  saturation: 1.5
  exposure: 1.5
  vertical_flip: 0
  horizontal_flip: 0.5
  jitter: 0.3
  output_width: 1248
  output_height: 384
  randomize_input_shape_period: 0
  mosaic_prob: 0.5
  mosaic_min_ratio: 0.2
}
dataset_config {
  data_sources: {
    label_directory_path: "/workspace/tlt-experiments/data/training/label_2"
    image_directory_path: "/workspace/tlt-experiments/data/training/image_2"
  }
  include_difficult_in_training: true
  target_class_mapping {
    key: "car"
    value: "car"
  }
  target_class_mapping {
    key: "pedestrian"
    value: "pedestrian"
  }
  target_class_mapping {
    key: "cyclist"
    value: "cyclist"
  }
  target_class_mapping {
    key: "van"
    value: "car"
  }
  target_class_mapping {
    key: "person_sitting"
    value: "pedestrian"
  }
  validation_data_sources: {
    label_directory_path: "/workspace/tlt-experiments/data/val/label"
    image_directory_path: "/workspace/tlt-experiments/data/val/image"
  }
}
• How to reproduce the issue?
(I have had the image for a while)
nvidia-docker run --runtime=nvidia --gpus all --name=tlt-vision3 --entrypoint "" -it -v /home/telconet/dev/tlt/tlt-experiments:/workspace/tlt-experiments -v /home/telconet/dev/notebooks:/workspace/notebooks -p 8888:8888 tlt:220706 /bin/bash
yolo_v4 train -e /workspace/examples/yolo_v4/specs/yolo_v4_train_resnet18_kitti.txt -r /workspace/tlt-experiments/yolo_v4/experiment_dir_unpruned -k <myKey> --gpus 1 --log_file traininglog.txt
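(Side note: a quick sanity check inside the container before launching training, just a sketch assuming the stock python3.6 / TensorFlow build that train.pyc runs on:)
# list the devices TensorFlow can see; a working setup should report a GPU device alongside the CPU
python3.6 -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"
# confirm the driver is reachable from inside the container
nvidia-smi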
• Dataset
images: Download
labels: Download
• Model
resnet_18.hdf5
• Additional Hardware/Software Info:
Host computer:
OS: Ubuntu 18.04
CPU: Intel Core i5-10400F
Motherboard: Z590-A PRO (MS-7D09)
NVIDIA driver: 465.19.01
CUDA version: V11.1.105, build cuda_11.1.TC455_06.29190527_0
TensorRT: 7.2.1-1+cuda11.1
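(For reference, these host versions can be re-collected with standard commands; sketch only:)
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv
nvcc --version
dpkg -l | grep -i tensorrt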
• Diagnostic
It seems that training is running on the CPU and system RAM instead of the GPU and VRAM.
CPU RAM usage climbs steadily during training, and the training process is not listed in nvidia-smi.
Before failing, RAM reaches 100% usage.
The process fails on epoch #2.
Commands (I ran all of the following while the training process was executing):
htop
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01 Driver Version: 465.19.01 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:03:00.0 On | N/A |
| 32% 30C P2 28W / 170W | 844MiB / 12053MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
ps -aux | grep -i yolo_v4
root 4519 0.0 0.0 4640 832 pts/2 S+ 21:28 0:00 /bin/sh -c bash -c 'CUDA_VISIBLE_DEVICES=0, TF_ENABLE_AUTO_MIXED_PRECISION=0 python3.6 /usr/local/lib/python3.6/dist-packages/iva/yolo_v4/scripts/train.pyc --experiment_spec_file /workspace/examples/yolo_v4/specs/yolo_v4_train_resnet18_kitti.txt --results_dir /workspace/tlt-experiments/yolo_v4/experiment_dir_unpruned --key <mySecretKey>'
root 4520 143 69.8 24483320 5614384 pts/2 Sl+ 21:28 1:03 python3.6 /usr/local/lib/python3.6/dist-packages/iva/yolo_v4/scripts/train.pyc --experiment_spec_file /workspace/examples/yolo_v4/specs/yolo_v4_train_resnet18_kitti.txt --results_dir /workspace/tlt-experiments/yolo_v4/experiment_dir_unpruned --key <my_secret_key>'
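(A way to keep watching whether the training PID ever claims GPU memory, sketched here with nvidia-smi's compute-apps query refreshing every second; an empty list means no compute process is on the GPU:)
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv -l 1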
• Question
What could I be missing?
Why am I getting these environment variable values?
CUDA_VISIBLE_DEVICES=0, TF_ENABLE_AUTO_MIXED_PRECISION=0
How can I verify that nvidia-docker is working correctly?
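(A possible verification, sketched with a throwaway CUDA container; the base image tag is only an example:)
# the nvidia runtime should appear in Docker's runtime list
docker info | grep -i runtime
# a disposable CUDA container should be able to run nvidia-smi
docker run --rm --gpus all nvidia/cuda:11.1.1-base-ubuntu18.04 nvidia-smi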
UPDATE 1
Executing nvidia-smi outside the Docker container lists the processes:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01 Driver Version: 465.19.01 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:03:00.0 On | N/A |
| 31% 48C P2 96W / 170W | 5324MiB / 12053MiB | 77% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2192 G /usr/lib/xorg/Xorg 18MiB |
| 0 N/A N/A 28881 C /usr/bin/python3.6 99MiB |
| 0 N/A N/A 28937 C python3.6 5203MiB |
+-----------------------------------------------------------------------------+
RAM use is still bordering 100%.
GPU usage is higher now.
I changed --gpus all to --gpus=all in the nvidia-docker run command. I’m not sure if that was the “fix”.
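(To confirm the running container really sees and uses the GPU now, a sketch using the container name from the run command above:)
# from the host: run nvidia-smi inside the training container
docker exec -it tlt-vision3 nvidia-smi
# from the host: watch utilization while training continues
watch -n 1 nvidia-smi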