Exec inside tao container

Hello,

How do I exec into the TAO container and check the training logs? The container is still running, but the notebook is closed.

Thank you!

Check the container ID with
$ docker ps

Then open a shell inside the container
$ docker exec -it <container-id> /bin/bash
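
If the training output was written to the container's stdout (an assumption; it depends on how the job was launched), docker logs may show it without opening a shell in the container. A minimal sketch, where follow_training_logs is a hypothetical helper name and 100 is an arbitrary default for the number of trailing lines:

```shell
# Hypothetical helper: stream the recent stdout/stderr of a running container.
# Assumes the training process prints its epoch/loss lines to the container's stdout.
# $1 = container ID (from `docker ps`), $2 = optional number of trailing lines.
follow_training_logs() {
  docker logs --follow --tail "${2:-100}" "$1"
}
```

Usage would look like this, with the ID taken from docker ps:
$ follow_training_logs <container-id>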

I want to check the training logs of the container.
I can see that the GPU is being used, but I cannot find the logs that show the epoch, acc, loss…

But if you have already closed the notebook, the training will stop.

Can you open the notebook again? The log from the previous run should still be there.

Which network did you run?

I did not open it with jupyter-notebook; instead, the notebook was opened in VS Code. When I reopened VS Code, I could not see the logs, but the container is still running.

What do you mean by “I did not open the jupyter-notebook”?
May I know how you triggered the training? In jupyter-notebook or in a terminal?

Sorry, I was not clear!
I did not run the “jupyter-notebook” command in the terminal; instead, I opened the .ipynb file in Visual Studio Code.
After VS Code closed, I reopened it, but I cannot see the logs in the output cell. The training is still running inside the container, as I can see from docker ps and nvidia-smi.
So now I want to see the logs of the running container, where it displays no_of_epoch, loss, val_loss, acc, val_acc …

Which network did you run?

detectnet_v2

OK, for the detectnet_v2 network, there is no file containing this info. But can you find the .tlt model in the result folder?
How did you confirm that the training is still ongoing?

Well, by running the nvidia-smi command, and also tao list and docker ps.
The .tlt files are being added to the result folder, but they have improper names like “model.step-152940.tlt”.
I need to know the other info as well!

Output from “tao list” or “docker ps” does not mean the training is still running.
For “nvidia-smi”, can you share the result?
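
One way to look at this from the host is to ask nvidia-smi only for the compute processes that currently hold GPU memory. This is a rough sketch; gpu_compute_procs and gpu_is_busy are hypothetical helper names, and a listed process is only a hint that something is computing, not proof that it is this training job:

```shell
# List compute processes currently using a GPU, one CSV line per process
# (PID, process name, GPU memory in use).
gpu_compute_procs() {
  nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader
}

# Succeed only if at least one compute process is listed.
gpu_is_busy() {
  [ -n "$(gpu_compute_procs)" ]
}
```

For example:
$ gpu_is_busy && gpu_compute_procs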

Well, you can say that for “docker ps”, but “tao list” gives the following output:

classification train -e /workspace/tao-experiments/classification/specs/classification_retrain_spec.cfg -r /workspace/tao-experiments/classification/output_retrain -k nvidia_tlt_pix88

Can you share your /workspace/tao-experiments/classification/specs/classification_retrain_spec.cfg?

BTW, can you log in to the container, run the command below, and share the result?
$ ps -aux
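
The same check can be scripted from the host by piping the container's process list through a small filter. A minimal sketch, assuming the training process shows the network name and the word "train" on its command line (that format is an assumption and may differ inside the TAO container); has_train_process is a hypothetical helper name:

```shell
# Hypothetical helper: read `ps aux` output on stdin and succeed if any line
# contains both the given name (e.g. detectnet_v2) and the word "train".
has_train_process() {
  awk -v name="$1" '
    index($0, name) && index($0, "train") { found = 1 }
    END { exit !found }
  '
}
```

It could be used like this:
$ docker exec <container-id> ps aux | has_train_process detectnet_v2 && echo "training process found"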

random_seed: 42
dataset_config {
  data_sources {
    tfrecords_path: "/workspace/tao-experiments/data/tfrecords/kitti_trainval/*"
    image_directory_path: "/workspace/tao-experiments/data/training"
  }
  image_extension: "jpg"

  target_class_mapping {
    key: "car"
    value: "car"
  }
  target_class_mapping {
    key: "bus"
    value: "bus"
  }
  target_class_mapping {
    key: "bike"
    value: "bike"
  }
  target_class_mapping {
    key: "auto"
    value: "auto"
  }
  target_class_mapping {
    key: "tractor"
    value: "tractor"
  }
  target_class_mapping {
    key: "truck"
    value: "truck"
  }
  target_class_mapping {
    key: "licence_plate_number"
    value: "lpn"
  }
  validation_fold: 0
}
augmentation_config {
  preprocessing {
    output_image_width: 960
    output_image_height: 544
    min_bbox_width: 1.0
    min_bbox_height: 1.0
    output_image_channel: 3
  }
  spatial_augmentation {
    hflip_probability: 0.5
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    hue_rotation_max: 25.0
    saturation_shift_max: 0.20000000298
    contrast_scale_max: 0.10000000149
    contrast_center: 0.5
  }
}
postprocessing_config {
  target_class_config {
    key: "car"
    value {
      clustering_config {
        clustering_algorithm: DBSCAN
        dbscan_confidence_threshold: 0.9
        coverage_threshold: 0.00499999988824
        dbscan_eps: 0.20000000298
        dbscan_min_samples: 0.0500000007451
        minimum_bounding_box_height: 20
      }
    }
  }
  target_class_config {
    key: "bus"
    value {
      clustering_config {
        clustering_algorithm: DBSCAN
        dbscan_confidence_threshold: 0.9
        coverage_threshold: 0.00499999988824
        dbscan_eps: 0.15000000596
        dbscan_min_samples: 0.0500000007451
        minimum_bounding_box_height: 20
      }
    }
  }
  target_class_config {
    key: "bike"
    value {
      clustering_config {
        clustering_algorithm: DBSCAN
        dbscan_confidence_threshold: 0.9
        coverage_threshold: 0.00749999983236
        dbscan_eps: 0.230000004172
        dbscan_min_samples: 0.0500000007451
        minimum_bounding_box_height: 20
      }
    }
  }
  target_class_config {
    key: "auto"
    value {
      clustering_config {
        clustering_algorithm: DBSCAN
        dbscan_confidence_threshold: 0.9
        coverage_threshold: 0.00749999983236
        dbscan_eps: 0.230000004172
        dbscan_min_samples: 0.0500000007451
        minimum_bounding_box_height: 20
      }
    }
  }
  target_class_config {
    key: "tractor"
    value {
      clustering_config {
        clustering_algorithm: DBSCAN
        dbscan_confidence_threshold: 0.9
        coverage_threshold: 0.00749999983236
        dbscan_eps: 0.230000004172
        dbscan_min_samples: 0.0500000007451
        minimum_bounding_box_height: 20
      }
    }
  }
  target_class_config {
    key: "truck"
    value {
      clustering_config {
        clustering_algorithm: DBSCAN
        dbscan_confidence_threshold: 0.9
        coverage_threshold: 0.00749999983236
        dbscan_eps: 0.230000004172
        dbscan_min_samples: 0.0500000007451
        minimum_bounding_box_height: 20
      }
    }
  }
  target_class_config {
    key: "lpn"
    value {
      clustering_config {
        clustering_algorithm: DBSCAN
        dbscan_confidence_threshold: 0.9
        coverage_threshold: 0.00749999983236
        dbscan_eps: 0.230000004172
        dbscan_min_samples: 0.0500000007451
        minimum_bounding_box_height: 20
      }
    }
  }
}
model_config {
  pretrained_model_file: "/workspace/tao-experiments/detectnet_v2/intermediate_trained_model.tlt"
  num_layers: 50
  use_batch_norm: true
  objective_set {
    bbox {
      scale: 35.0
      offset: 0.5
    }
    cov {
    }
  }
  arch: "resnet"
}
evaluation_config {
  validation_period_during_training: 10
  first_validation_epoch: 30
  minimum_detection_ground_truth_overlap {
    key: "car"
    value: 0.6
  }
  minimum_detection_ground_truth_overlap {
    key: "bus"
    value: 0.5
  }
  minimum_detection_ground_truth_overlap {
    key: "bike"
    value: 0.5
  }
  minimum_detection_ground_truth_overlap {
    key: "auto"
    value: 0.5
  }
  minimum_detection_ground_truth_overlap {
    key: "tractor"
    value: 0.5
  }
  minimum_detection_ground_truth_overlap {
    key: "truck"
    value: 0.5
  }
  minimum_detection_ground_truth_overlap {
    key: "lpn"
    value: 0.5
  }
  evaluation_box_config {
    key: "car"
    value {
      minimum_height: 20
      maximum_height: 9999
      minimum_width: 10
      maximum_width: 9999
    }
  }
  evaluation_box_config {
    key: "bus"
    value {
      minimum_height: 20
      maximum_height: 9999
      minimum_width: 10
      maximum_width: 9999
    }
  }
  evaluation_box_config {
    key: "bike"
    value {
      minimum_height: 20
      maximum_height: 9999
      minimum_width: 10
      maximum_width: 9999
    }
  }
  evaluation_box_config {
    key: "auto"
    value {
      minimum_height: 20
      maximum_height: 9999
      minimum_width: 10
      maximum_width: 9999
    }
  }
  evaluation_box_config {
    key: "tractor"
    value {
      minimum_height: 20
      maximum_height: 9999
      minimum_width: 10
      maximum_width: 9999
    }
  }
  evaluation_box_config {
    key: "truck"
    value {
      minimum_height: 20
      maximum_height: 9999
      minimum_width: 10
      maximum_width: 9999
    }
  }
  evaluation_box_config {
    key: "lpn"
    value {
      minimum_height: 20
      maximum_height: 9999
      minimum_width: 10
      maximum_width: 9999
    }
  }
  average_precision_mode: INTEGRATE
}
cost_function_config {
  target_classes {
    name: "car"
    class_weight: 2.0
    coverage_foreground_weight: 0.0500000007451
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  target_classes {
    name: "bus"
    class_weight: 4.0
    coverage_foreground_weight: 0.0500000007451
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 1.0
    }
  }
  target_classes {
    name: "bike"
    class_weight: 5.0
    coverage_foreground_weight: 0.0500000007451
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  target_classes {
    name: "auto"
    class_weight: 7.0
    coverage_foreground_weight: 0.0500000007451
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  target_classes {
    name: "tractor"
    class_weight: 8.0
    coverage_foreground_weight: 0.0500000007451
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  target_classes {
    name: "truck"
    class_weight: 3.0
    coverage_foreground_weight: 0.0500000007451
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  target_classes {
    name: "lpn"
    class_weight: 2.0
    coverage_foreground_weight: 0.0500000007451
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  enable_autoweighting: true
  max_objective_weight: 0.999899983406
  min_objective_weight: 9.99999974738e-05
}
training_config {
  batch_size_per_gpu: 4
  num_epochs: 140
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-06
      max_learning_rate: 5e-04
      soft_start: 0.10000000149
      annealing: 0.699999988079
    }
  }
  regularizer {
    type: L1
    weight: 3.00000002618e-09
  }
  optimizer {
    adam {
      epsilon: 9.99999993923e-09
      beta1: 0.899999976158
      beta2: 0.999000012875
    }
  }
  cost_scaling {
    initial_exponent: 20.0
    increment: 0.005
    decrement: 1.0
  }
  checkpoint_interval: 10
}
bbox_rasterizer_config {
  target_class_config {
    key: "car"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 0.40000000596
      cov_radius_y: 0.40000000596
      bbox_min_radius: 1.0
    }
  }
  target_class_config {
    key: "bus"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 1.0
      cov_radius_y: 1.0
      bbox_min_radius: 1.0
    }
  }
  target_class_config {
    key: "bike"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 1.0
      cov_radius_y: 1.0
      bbox_min_radius: 1.0
    }
  }
  target_class_config {
    key: "auto"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 1.0
      cov_radius_y: 1.0
      bbox_min_radius: 1.0
    }
  }
  target_class_config {
    key: "tractor"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 1.0
      cov_radius_y: 1.0
      bbox_min_radius: 1.0
    }
  }
  target_class_config {
    key: "truck"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 1.0
      cov_radius_y: 1.0
      bbox_min_radius: 1.0
    }
  }
  target_class_config {
    key: "lpn"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 1.0
      cov_radius_y: 1.0
      bbox_min_radius: 1.0
    }
  }
  deadzone_radius: 0.400000154972
}

From your training spec,

num_epochs: 140
validation_period_during_training: 10
first_validation_epoch: 30
checkpoint_interval: 10

With checkpoint_interval: 10, a .tlt model is saved every 10 epochs. Validation runs every 10 epochs, starting from the 30th epoch.

Can you check the .tlt files under your result folder?
You mentioned that there is an intermediate model, “model.step-152940.tlt”. That is a proper name, but there should be more .tlt model files.
For this .tlt model file, you can run tao evaluation against it to get the AP, mAP, etc.
But if you can find only one .tlt file, I am afraid the training has already stopped.
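
To see which checkpoints exist and which one is the latest, the step number in each file name can be compared. A sketch assuming the model.step-<N>.tlt naming mentioned above; checkpoint_step and latest_checkpoint are hypothetical helper names, and the directory argument is whichever folder holds your .tlt files:

```shell
# Print the step number embedded in a checkpoint name,
# e.g. /some/dir/model.step-152940.tlt -> 152940
checkpoint_step() {
  name="${1##*/}"              # drop the directory part
  name="${name#model.step-}"   # drop the prefix
  printf '%s\n' "${name%.tlt}" # drop the extension
}

# Print the checkpoint with the highest step number in directory $1.
# The body runs in a subshell so its variables do not leak into the caller.
latest_checkpoint() (
  best=""
  best_step=-1
  for f in "$1"/model.step-*.tlt; do
    [ -e "$f" ] || continue    # the glob matched nothing
    step="$(checkpoint_step "$f")"
    if [ "$step" -gt "$best_step" ]; then
      best_step="$step"
      best="$f"
    fi
  done
  [ -n "$best" ] && printf '%s\n' "$best"
)
```

For example:
$ latest_checkpoint <folder-with-tlt-files>
If the newest file stops changing over time, that is another hint that the training is no longer running.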

I suggest that you resume training.
See DetectNet_v2 — TAO Toolkit 3.21.11 documentation

DetectNet_v2 now supports resuming training from intermediate checkpoints. When a previously running training experiment is stopped prematurely, one may restart the training from the last checkpoint by simply re-running the detectnet_v2 training command with the same command line arguments as before. The trainer for detectnet_v2 finds the last saved checkpoint in the results directory and resumes the training from there. The interval at which the checkpoints are saved are defined by the checkpoint_interval parameter under the “training_config” for detectnet_v2.
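
When re-running the command, it may help to start it from a terminal and send the output to a file, so the epoch and loss lines stay available after the notebook or editor is closed. A minimal sketch; run_logged is a hypothetical helper name, and whether the tao launcher runs cleanly when detached like this is an assumption you would need to verify:

```shell
# Hypothetical helper: run a command detached from the terminal and keep its
# output in a log file. $1 = log file, remaining arguments = command to run.
run_logged() {
  log_file="$1"
  shift
  # nohup makes the command ignore the hangup signal;
  # stdout and stderr are both written to the log file.
  nohup "$@" >"$log_file" 2>&1 &
  echo "started PID $!, logging to $log_file"
}
```

With the same arguments as the original run (placeholders here, not real paths):
$ run_logged train.log tao detectnet_v2 train -e <spec-file> -r <results-dir> -k <key>
$ tail -f train.log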
