"tlt-train detectnet_v2" lead core dump

Hi, I just tried TLT and ran the detectnet_v2 example in the Jupyter notebook, and got a core dump in the step "Run TLT training". The output is as follows:

  0. Set up env variables

When using the purpose-built pretrained models from NGC, please make sure to set the $KEY environment variable to the key mentioned in the model overview. Failing to do so can lead to errors when trying to load them as pretrained models.

Note: Please make sure to remove any stray artifacts/files from the $USER_EXPERIMENT_DIR or $DATA_DOWNLOAD_DIR paths mentioned below that may have been generated by previous experiments. Leftover checkpoint files, for example, may interfere with creating a training graph for a new experiment.

Note: By default, this notebook is set up to run training using 1 GPU. To use more GPUs, please update the env variable $NUM_GPUS accordingly.

Setting up env variables for cleaner command line commands.

%env KEY=zhoi

%env USER_EXPERIMENT_DIR=/workspace/tlt-experiments/detectnet_v2

%env DATA_DOWNLOAD_DIR=/workspace/tlt-experiments/data

%env SPECS_DIR=/workspace/examples/detectnet_v2/specs

%env NUM_GPUS=1

env: KEY=zhoi
env: USER_EXPERIMENT_DIR=/workspace/tlt-experiments/detectnet_v2
env: DATA_DOWNLOAD_DIR=/workspace/tlt-experiments/data
env: SPECS_DIR=/workspace/examples/detectnet_v2/specs
env: NUM_GPUS=1
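
Since the later cells expand these variables inside shell commands, it can help to confirm that they are really set in the kernel before going further. The following is a minimal sketch; the helper name `missing_env_vars` is hypothetical, and the variable names are the ones set above.

```python
import os

# Variables that the notebook cell above sets with %env.
REQUIRED_VARS = ["KEY", "USER_EXPERIMENT_DIR", "DATA_DOWNLOAD_DIR", "SPECS_DIR", "NUM_GPUS"]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
print("Missing variables:", missing if missing else "none")
```

An empty result only shows that the variables exist; it does not check that the paths they point to are mounted in the container.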

  1. Prepare dataset and pre-trained model

We will be using the KITTI object detection dataset for this example. To find more details, please visit http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=2d. Please download both the left color images of the object dataset from here and the training labels for the object dataset from here, and place the zip files in $DATA_DOWNLOAD_DIR.

The data will then be extracted to have

training images in $DATA_DOWNLOAD_DIR/training/image_2
training labels in $DATA_DOWNLOAD_DIR/training/label_2
testing images in $DATA_DOWNLOAD_DIR/testing/image_2

Note: There are no labels for the testing images, therefore we use them just to visualize inferences for the trained model.

A. Verify downloaded dataset

Check the dataset is present

!mkdir -p $DATA_DOWNLOAD_DIR

!if [ ! -f $DATA_DOWNLOAD_DIR/data_object_image_2.zip ]; then echo 'Image zip file not found, please download.'; else echo 'Found Image zip file.';fi

!if [ ! -f $DATA_DOWNLOAD_DIR/data_object_label_2.zip ]; then echo 'Label zip file not found, please download.'; else echo 'Found Labels zip file.';fi

Found Image zip file.
Found Labels zip file.

Unpack the downloaded datasets to $DATA_DOWNLOAD_DIR.

The training images will be under $DATA_DOWNLOAD_DIR/training/image_2 and the labels will be under $DATA_DOWNLOAD_DIR/training/label_2.

The testing images will be under $DATA_DOWNLOAD_DIR/testing/image_2.

!unzip -u $DATA_DOWNLOAD_DIR/data_object_image_2.zip -d $DATA_DOWNLOAD_DIR

!unzip -u $DATA_DOWNLOAD_DIR/data_object_label_2.zip -d $DATA_DOWNLOAD_DIR

Archive: /workspace/tlt-experiments/data/data_object_image_2.zip
creating: /workspace/tlt-experiments/data/training/image_2/
extracting: /workspace/tlt-experiments/data/training/image_2/002480.png
extracting: /workspace/tlt-experiments/data/training/image_2/005952.png
extracting: /workspace/tlt-experiments/data/training/image_2/000709.png
extracting: /workspace/tlt-experiments/data/training/image_2/000814.png
extracting: /workspace/tlt-experiments/data/training/image_2/006192.png
extracting: /workspace/tlt-experiments/data/training/image_2/006017.png
extracting: /workspace/tlt-experiments/data/training/image_2/002731.png

extracting: /workspace/tlt-experiments/data/training/label_2/002777.txt
extracting: /workspace/tlt-experiments/data/training/label_2/001730.txt
extracting: /workspace/tlt-experiments/data/training/label_2/002740.txt
extracting: /workspace/tlt-experiments/data/training/label_2/002057.txt
extracting: /workspace/tlt-experiments/data/training/label_2/004455.txt

Verify

import os

DATA_DIR = os.environ.get('DATA_DOWNLOAD_DIR')
num_training_images = len(os.listdir(os.path.join(DATA_DIR, "training/image_2")))
num_training_labels = len(os.listdir(os.path.join(DATA_DIR, "training/label_2")))
num_testing_images = len(os.listdir(os.path.join(DATA_DIR, "testing/image_2")))
print("Number of images in the trainval set. {}".format(num_training_images))
print("Number of labels in the trainval set. {}".format(num_training_labels))
print("Number of images in the test set. {}".format(num_testing_images))

Number of images in the trainval set. 7481
Number of labels in the trainval set. 7481
Number of images in the test set. 7518
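
The counts above only show that the image and label directories hold the same number of files, not that every image has a label with the same name. A small sketch of that stricter check follows; the helper name `unmatched_stems` is hypothetical, and in the notebook the two lists would come from os.listdir() on training/image_2 and training/label_2.

```python
import os


def unmatched_stems(image_names, label_names):
    """Return (images without a label, labels without an image) as sorted stem lists."""
    image_stems = {os.path.splitext(name)[0] for name in image_names}
    label_stems = {os.path.splitext(name)[0] for name in label_names}
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)


# Example with in-memory names instead of a real directory listing.
no_label, no_image = unmatched_stems(
    ["000001.png", "000002.png", "000003.png"],
    ["000001.txt", "000003.txt", "000004.txt"],
)
print(no_label)  # ['000002']
print(no_image)  # ['000004']
```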

Sample kitti label.

!cat $DATA_DOWNLOAD_DIR/training/label_2/000110.txt

Car 0.27 0 2.50 862.65 129.39 1241.00 304.96 1.73 1.74 4.71 5.50 1.30 8.19 3.07
Car 0.68 3 -0.76 1184.97 141.54 1241.00 187.84 1.52 1.60 4.42 22.39 0.48 24.57 -0.03
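
Each line of a KITTI label file has 15 whitespace-separated fields: object type, truncation, occlusion, observation angle, the 2D box (left, top, right, bottom), the 3D dimensions (height, width, length), the 3D location (x, y, z) and the rotation around the Y axis. A minimal parsing sketch, with hypothetical names `KittiObject` and `parse_kitti_label`:

```python
from typing import NamedTuple, Tuple


class KittiObject(NamedTuple):
    obj_type: str
    truncated: float
    occluded: int
    alpha: float
    bbox: Tuple[float, ...]        # left, top, right, bottom in pixels
    dimensions: Tuple[float, ...]  # height, width, length in meters
    location: Tuple[float, ...]    # x, y, z in camera coordinates
    rotation_y: float


def parse_kitti_label(line):
    """Parse one line of a KITTI object label file."""
    parts = line.split()
    if len(parts) != 15:
        raise ValueError("expected 15 fields, got {}".format(len(parts)))
    values = [float(p) for p in parts[1:]]
    return KittiObject(
        obj_type=parts[0],
        truncated=values[0],
        occluded=int(values[1]),
        alpha=values[2],
        bbox=tuple(values[3:7]),
        dimensions=tuple(values[7:10]),
        location=tuple(values[10:13]),
        rotation_y=values[13],
    )


sample = "Car 0.27 0 2.50 862.65 129.39 1241.00 304.96 1.73 1.74 4.71 5.50 1.30 8.19 3.07"
obj = parse_kitti_label(sample)
print(obj.obj_type, obj.bbox)  # Car (862.65, 129.39, 1241.0, 304.96)
```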

B. Prepare tf records from kitti format dataset

Update the tfrecords spec file to take in your kitti format dataset
Create the tfrecords using the tlt-dataset-convert

Note: TfRecords only need to be generated once.

print("TFrecords conversion spec file for kitti training")

!cat $SPECS_DIR/detectnet_v2_tfrecords_kitti_trainval.txt

TFrecords conversion spec file for kitti training
kitti_config {
root_directory_path: "/workspace/tlt-experiments/data/training"
image_dir_name: "image_2"
label_dir_name: "label_2"
image_extension: ".png"
partition_mode: "random"
num_partitions: 2
val_split: 14
num_shards: 10
}
image_directory_path: "/workspace/tlt-experiments/data/training"

Creating a new directory for the output tfrecords dump.

print("Converting Tfrecords for kitti trainval dataset")

!tlt-dataset-convert -d $SPECS_DIR/detectnet_v2_tfrecords_kitti_trainval.txt \
                     -o $DATA_DOWNLOAD_DIR/tfrecords/kitti_trainval/kitti_trainval

Converting Tfrecords for kitti trainval dataset
2020-08-06 06:03:11.912597: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
Using TensorFlow backend.
2020-08-06 06:03:30,024 - iva.detectnet_v2.dataio.build_converter - INFO - Instantiating a kitti converter
2020-08-06 06:03:30,024 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Creating output directory /workspace/tlt-experiments/data/tfrecords/kitti_trainval
2020-08-06 06:03:30,055 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Num images in
Train: 6434 Val: 1047
2020-08-06 06:03:30,055 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Validation data in partition 0. Hence, while choosing the validationset during training choose validation_fold 0.
2020-08-06 06:03:30,062 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 0
WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataio/dataset_converter_lib.py:142: The name tf.python_io.TFRecordWriter is deprecated. Please use tf.io.TFRecordWriter instead.

2020-08-06 06:03:30,062 - tensorflow - WARNING - From /home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataio/dataset_converter_lib.py:142: The name tf.python_io.TFRecordWriter is deprecated. Please use tf.io.TFRecordWriter instead.

/usr/local/lib/python3.6/dist-packages/iva/detectnet_v2/dataio/kitti_converter_lib.py:273: VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.
2020-08-06 06:03:31,764 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 1
2020-08-06 06:03:33,241 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 2
2020-08-06 06:03:34,775 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 3
2020-08-06 06:03:36,417 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 4
2020-08-06 06:03:37,988 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 5
2020-08-06 06:03:39,480 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 6
2020-08-06 06:03:40,981 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 7
2020-08-06 06:03:42,589 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 8
2020-08-06 06:03:44,330 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 9
2020-08-06 06:03:45,944 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
b'van': 430
b'car': 3960
b'dontcare': 1552
b'truck': 164
b'cyclist': 244
b'tram': 74
b'misc': 131
b'pedestrian': 704
b'person_sitting': 51

2020-08-06 06:03:45,944 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 0
2020-08-06 06:03:55,416 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 1
2020-08-06 06:04:05,608 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 2
2020-08-06 06:04:15,401 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 3
2020-08-06 06:04:25,020 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 4
2020-08-06 06:04:35,083 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 5
2020-08-06 06:04:44,782 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 6
2020-08-06 06:04:54,471 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 7
2020-08-06 06:05:04,096 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 8
2020-08-06 06:05:14,277 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 9
2020-08-06 06:05:24,015 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
b'car': 24782
b'van': 2484
b'dontcare': 9743
b'pedestrian': 3783
b'tram': 437
b'cyclist': 1383
b'truck': 930
b'misc': 842
b'person_sitting': 171

2020-08-06 06:05:24,015 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Cumulative object statistics
2020-08-06 06:05:24,015 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
b'van': 2914
b'car': 28742
b'dontcare': 11295
b'truck': 1094
b'cyclist': 1627
b'tram': 511
b'misc': 973
b'pedestrian': 4487
b'person_sitting': 222

2020-08-06 06:05:24,016 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Class map.
Label in GT: Label in tfrecords file
b'Van': b'van'
b'Car': b'car'
b'DontCare': b'dontcare'
b'Truck': b'truck'
b'Cyclist': b'cyclist'
b'Tram': b'tram'
b'Misc': b'misc'
b'Pedestrian': b'pedestrian'
b'Person_sitting': b'person_sitting'
For the dataset_config in the experiment_spec, please use labels in the tfrecords file, while writing the classmap.

2020-08-06 06:05:24,016 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Tfrecords generation complete.
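
The numbers in this log can be cross-checked by hand. With val_split: 14 and 7481 images, 14% rounded down is 1047, which leaves 6434 for training and matches the "Train: 6434 Val: 1047" line. The per-partition object counts should also add up to the cumulative statistics. The sketch below does both checks; how the converter rounds internally is an assumption (rounding down reproduces the logged numbers), and the helper names are hypothetical.

```python
def split_counts(num_images, val_split_percent):
    """Return (train, val) sizes, assuming the validation share is rounded down."""
    num_val = num_images * val_split_percent // 100
    return num_images - num_val, num_val


def add_counts(*partitions):
    """Sum per-class object counts across partitions."""
    total = {}
    for counts in partitions:
        for name, count in counts.items():
            total[name] = total.get(name, 0) + count
    return total


# Values copied from the conversion log above.
partition_0 = {"van": 430, "car": 3960, "dontcare": 1552, "truck": 164, "cyclist": 244,
               "tram": 74, "misc": 131, "pedestrian": 704, "person_sitting": 51}
partition_1 = {"car": 24782, "van": 2484, "dontcare": 9743, "pedestrian": 3783, "tram": 437,
               "cyclist": 1383, "truck": 930, "misc": 842, "person_sitting": 171}

print(split_counts(7481, 14))  # (6434, 1047)
cumulative = add_counts(partition_0, partition_1)
print(cumulative["car"], cumulative["pedestrian"])  # 28742 4487
```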

!ls -rlt $DATA_DOWNLOAD_DIR/tfrecords/kitti_trainval/

total 7144
-rw-r--r-- 1 root root 104124 Aug 6 06:03 kitti_trainval-fold-000-of-002-shard-00000-of-00010
-rw-r--r-- 1 root root 99766 Aug 6 06:03 kitti_trainval-fold-000-of-002-shard-00001-of-00010
-rw-r--r-- 1 root root 102397 Aug 6 06:03 kitti_trainval-fold-000-of-002-shard-00002-of-00010
-rw-r--r-- 1 root root 99796 Aug 6 06:03 kitti_trainval-fold-000-of-002-shard-00003-of-00010
-rw-r--r-- 1 root root 100265 Aug 6 06:03 kitti_trainval-fold-000-of-002-shard-00004-of-00010
-rw-r--r-- 1 root root 104348 Aug 6 06:03 kitti_trainval-fold-000-of-002-shard-00005-of-00010
-rw-r--r-- 1 root root 99652 Aug 6 06:03 kitti_trainval-fold-000-of-002-shard-00006-of-00010
-rw-r--r-- 1 root root 100436 Aug 6 06:03 kitti_trainval-fold-000-of-002-shard-00007-of-00010
-rw-r--r-- 1 root root 99283 Aug 6 06:03 kitti_trainval-fold-000-of-002-shard-00008-of-00010
-rw-r--r-- 1 root root 111277 Aug 6 06:03 kitti_trainval-fold-000-of-002-shard-00009-of-00010
-rw-r--r-- 1 root root 624431 Aug 6 06:03 kitti_trainval-fold-001-of-002-shard-00000-of-00010
-rw-r--r-- 1 root root 628308 Aug 6 06:04 kitti_trainval-fold-001-of-002-shard-00001-of-00010
-rw-r--r-- 1 root root 628199 Aug 6 06:04 kitti_trainval-fold-001-of-002-shard-00002-of-00010
-rw-r--r-- 1 root root 620144 Aug 6 06:04 kitti_trainval-fold-001-of-002-shard-00003-of-00010
-rw-r--r-- 1 root root 613206 Aug 6 06:04 kitti_trainval-fold-001-of-002-shard-00004-of-00010
-rw-r--r-- 1 root root 631974 Aug 6 06:04 kitti_trainval-fold-001-of-002-shard-00005-of-00010
-rw-r--r-- 1 root root 618500 Aug 6 06:04 kitti_trainval-fold-001-of-002-shard-00006-of-00010
-rw-r--r-- 1 root root 626023 Aug 6 06:05 kitti_trainval-fold-001-of-002-shard-00007-of-00010
-rw-r--r-- 1 root root 630665 Aug 6 06:05 kitti_trainval-fold-001-of-002-shard-00008-of-00010
-rw-r--r-- 1 root root 630464 Aug 6 06:05 kitti_trainval-fold-001-of-002-shard-00009-of-00010
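
The listing can also be checked programmatically: with num_partitions: 2 and num_shards: 10 there should be ten shard files per fold. The sketch below parses the fold index out of the file names; the regular expression is inferred from the names in the listing, and the helper name `shards_per_fold` is hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from names such as
# kitti_trainval-fold-000-of-002-shard-00000-of-00010
SHARD_PATTERN = re.compile(r"-fold-(\d+)-of-(\d+)-shard-(\d+)-of-(\d+)$")


def shards_per_fold(filenames):
    """Count shard files per fold index."""
    counts = Counter()
    for name in filenames:
        match = SHARD_PATTERN.search(name)
        if match:
            counts[int(match.group(1))] += 1
    return dict(counts)


# Rebuild the 20 names from the listing instead of reading the directory.
names = [
    "kitti_trainval-fold-{:03d}-of-002-shard-{:05d}-of-00010".format(fold, shard)
    for fold in range(2)
    for shard in range(10)
]
print(shards_per_fold(names))  # {0: 10, 1: 10}
```

The fold 000 shards are roughly 100 KB each and the fold 001 shards roughly 620 KB each, which is consistent with fold 0 holding the 1047 validation images and fold 1 the 6434 training images.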

C. Download pre-trained model

Download the correct pretrained model from the NGC model registry for your experiment. Please note that for DetectNet_v2, the input is expected to be 0-1 normalized with input channels in RGB order. Therefore, for optimum results please download model templates from nvidia/tlt_pretrained_detectnet_v2. The templates are now organized as version strings. For example, to download a resnet18 model suitable for detectnet, please resolve to the NGC object shown as nvidia/tlt_pretrained_detectnet_v2:resnet18.

All other models expect input preprocessing with mean subtraction and input channels in BGR order. Thus, using them as pretrained weights may result in suboptimal performance.
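
To make the difference between the two preprocessing conventions concrete, here is a small sketch on a single pixel. It is not TLT code: the function names are hypothetical and the mean values are placeholders chosen only for illustration.

```python
def normalize_rgb01(pixel_rgb):
    """Scale an 8-bit RGB pixel to the 0-1 range, keeping RGB order."""
    return tuple(channel / 255.0 for channel in pixel_rgb)


def mean_subtract_bgr(pixel_rgb, mean_bgr):
    """Reorder an RGB pixel to BGR and subtract a per-channel mean."""
    r, g, b = pixel_rgb
    return tuple(value - mean for value, mean in zip((b, g, r), mean_bgr))


pixel = (255, 128, 0)
print(normalize_rgb01(pixel))
# Placeholder means (assumption, not taken from the notebook).
print(mean_subtract_bgr(pixel, (100.0, 110.0, 120.0)))  # (-100.0, 18.0, 135.0)
```

The two functions produce values in different ranges and channel orders for the same pixel, which is why weights trained under one convention are a poor starting point under the other.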

List models available in the model registry.

!ngc registry model list nvidia/tlt_pretrained_detectnet_v2:*

+-------+-------+-------+-------+-------+-------+-------+-------+-------+
| Versi | Accur | Epoch | Batch | GPU | Memor | File | Statu | Creat |
| on | acy | s | Size | Model | y Foo | Size | s | ed |
| | | | | | tprin | | | Date |
| | | | | | t | | | |
+-------+-------+-------+-------+-------+-------+-------+-------+-------+
| resne | 79.5 | 80 | 1 | V100 | 163.6 | 163.5 | UPLOA | Aug |
| t34 | | | | | | 5 MB | D_COM | 03, |
| | | | | | | | PLETE | 2020 |
| resne | 79.2 | 80 | 1 | V100 | 38.3 | 38.34 | UPLOA | Apr |
| t10 | | | | | | MB | D_COM | 29, |
| | | | | | | | PLETE | 2020 |
| resne | 79.0 | 80 | 1 | V100 | 89.0 | 89.02 | UPLOA | Apr |
| t18 | | | | | | MB | D_COM | 29, |
| | | | | | | | PLETE | 2020 |
| resne | 82.7 | 80 | 1 | V100 | 294.5 | 294.5 | UPLOA | Apr |
| t50 | | | | | | 3 MB | D_COM | 29, |
| | | | | | | | PLETE | 2020 |
| vgg16 | 82.2 | 80 | 1 | V100 | 113.2 | 113.2 | UPLOA | Apr |
| | | | | | | MB | D_COM | 29, |
| | | | | | | | PLETE | 2020 |
| vgg19 | 82.6 | 80 | 1 | V100 | 153.8 | 153.7 | UPLOA | Apr |
| | | | | | | 7 MB | D_COM | 29, |
| | | | | | | | PLETE | 2020 |
| mobil | 79.5 | 80 | 1 | V100 | 13.4 | 13.37 | UPLOA | Apr |
| enet_ | | | | | | MB | D_COM | 29, |
| v1 | | | | | | | PLETE | 2020 |
| mobil | 77.5 | 80 | 1 | V100 | 5.1 | 5.1 | UPLOA | Apr |
| enet_ | | | | | | MB | D_COM | 29, |
| v2 | | | | | | | PLETE | 2020 |
| googl | 82.2 | 80 | 1 | V100 | 47.7 | 47.74 | UPLOA | Apr |
| enet | | | | | | MB | D_COM | 29, |
| | | | | | | | PLETE | 2020 |
| squee | 65.67 | 80 | 1 | V100 | 6.5 | 6.46 | UPLOA | Apr |
| zenet | | | | | | MB | D_COM | 29, |
| | | | | | | | PLETE | 2020 |
| darkn | 76.44 | 80 | 1 | V100 | 467.3 | 467.3 | UPLOA | Apr |
| et53 | | | | | | 2 MB | D_COM | 29, |
| | | | | | | | PLETE | 2020 |
| darkn | 77.52 | 80 | 1 | V100 | 229.1 | 229.1 | UPLOA | Apr |
| et19 | | | | | | 5 MB | D_COM | 29, |
| | | | | | | | PLETE | 2020 |
+-------+-------+-------+-------+-------+-------+-------+-------+-------+

Create the target destination to download the model.

!mkdir -p $USER_EXPERIMENT_DIR/pretrained_resnet18/

Download the pretrained model from NGC

!ngc registry model download-version nvidia/tlt_pretrained_detectnet_v2:resnet18 \
    --dest $USER_EXPERIMENT_DIR/pretrained_resnet18

Downloaded 82.28 MB in 1m 38s, Download speed: 858.68 KB/s

Transfer id: tlt_pretrained_detectnet_v2_vresnet18 Download status: Completed.
Downloaded local path: /workspace/tlt-experiments/detectnet_v2/pretrained_resnet18/tlt_pretrained_detectnet_v2_vresnet18
Total files downloaded: 1
Total downloaded size: 82.28 MB
Started at: 2020-08-06 06:27:12.182955
Completed at: 2020-08-06 06:28:50.305902
Duration taken: 1m 38s

!ls -rlt $USER_EXPERIMENT_DIR/pretrained_resnet18/tlt_pretrained_detectnet_v2_vresnet18

total 91160
-rw------- 1 root root 93345248 Aug 6 06:28 resnet18.hdf5

  2. Provide training specification

    Tfrecords for the train datasets
    In order to use the newly generated tfrecords, update the dataset_config parameter in the spec file at $SPECS_DIR/detectnet_v2_train_resnet18_kitti.txt
    Update the fold number to use for evaluation. In case of random data split, please use fold 0 only
    For sequence-wise split, you may use any fold generated from the dataset convert tool
    Pre-trained models
    Augmentation parameters for on the fly data augmentation
    Other training (hyper-)parameters such as batch size, number of epochs, learning rate etc.
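
The target_class_mapping entries in the spec file fold several tfrecords labels into three training classes (van into car, person_sitting into pedestrian). The sketch below applies that mapping to the cumulative object counts from the conversion log, which cover both folds. Treating labels without a mapping entry as ignored is an assumption about how training handles them, and the helper name `map_counts` is hypothetical.

```python
# Mapping copied from the target_class_mapping entries in the spec file.
TARGET_CLASS_MAPPING = {
    "car": "car",
    "cyclist": "cyclist",
    "pedestrian": "pedestrian",
    "person_sitting": "pedestrian",
    "van": "car",
}

# Cumulative object counts copied from the tfrecords conversion log.
CUMULATIVE_COUNTS = {
    "van": 2914, "car": 28742, "dontcare": 11295, "truck": 1094, "cyclist": 1627,
    "tram": 511, "misc": 973, "pedestrian": 4487, "person_sitting": 222,
}


def map_counts(counts, mapping):
    """Aggregate per-label counts into target classes; unmapped labels are skipped."""
    mapped = {}
    for label, count in counts.items():
        target = mapping.get(label)
        if target is not None:
            mapped[target] = mapped.get(target, 0) + count
    return mapped


print(map_counts(CUMULATIVE_COUNTS, TARGET_CLASS_MAPPING))
# {'car': 31656, 'cyclist': 1627, 'pedestrian': 4709}
```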

!cat $SPECS_DIR/detectnet_v2_train_resnet18_kitti.txt

random_seed: 42
dataset_config {
data_sources {
tfrecords_path: "/workspace/tlt-experiments/data/tfrecords/kitti_trainval/*"
image_directory_path: "/workspace/tlt-experiments/data/training"
}
image_extension: "png"
target_class_mapping {
key: "car"
value: "car"
}
target_class_mapping {
key: "cyclist"
value: "cyclist"
}
target_class_mapping {
key: "pedestrian"
value: "pedestrian"
}
target_class_mapping {
key: "person_sitting"
value: "pedestrian"
}
target_class_mapping {
key: "van"
value: "car"
}
validation_fold: 0
}
augmentation_config {
preprocessing {
output_image_width: 1248
output_image_height: 384
min_bbox_width: 1.0
min_bbox_height: 1.0
output_image_channel: 3
}
spatial_augmentation {
hflip_probability: 0.5
zoom_min: 1.0
zoom_max: 1.0
translate_max_x: 8.0
translate_max_y: 8.0
}
color_augmentation {
hue_rotation_max: 25.0
saturation_shift_max: 0.20000000298
contrast_scale_max: 0.10000000149
contrast_center: 0.5
}
}
postprocessing_config {
target_class_config {
key: "car"
value {
clustering_config {
coverage_threshold: 0.00499999988824
dbscan_eps: 0.20000000298
dbscan_min_samples: 0.0500000007451
minimum_bounding_box_height: 20
}
}
}
target_class_config {
key: "cyclist"
value {
clustering_config {
coverage_threshold: 0.00499999988824
dbscan_eps: 0.15000000596
dbscan_min_samples: 0.0500000007451
minimum_bounding_box_height: 20
}
}
}
target_class_config {
key: "pedestrian"
value {
clustering_config {
coverage_threshold: 0.00749999983236
dbscan_eps: 0.230000004172
dbscan_min_samples: 0.0500000007451
minimum_bounding_box_height: 20
}
}
}
}
model_config {
pretrained_model_file: "/workspace/tlt-experiments/detectnet_v2/pretrained_resnet18/tlt_pretrained_detectnet_v2_vresnet18/resnet18.hdf5"
num_layers: 18
use_batch_norm: true
objective_set {
bbox {
scale: 35.0
offset: 0.5
}
cov {
}
}
training_precision {
backend_floatx: FLOAT32
}
arch: "resnet"
}
evaluation_config {
validation_period_during_training: 10
first_validation_epoch: 30
minimum_detection_ground_truth_overlap {
key: "car"
value: 0.699999988079
}
minimum_detection_ground_truth_overlap {
key: "cyclist"
value: 0.5
}
minimum_detection_ground_truth_overlap {
key: "pedestrian"
value: 0.5
}
evaluation_box_config {
key: "car"
value {
minimum_height: 20
maximum_height: 9999
minimum_width: 10
maximum_width: 9999
}
}
evaluation_box_config {
key: "cyclist"
value {
minimum_height: 20
maximum_height: 9999
minimum_width: 10
maximum_width: 9999
}
}
evaluation_box_config {
key: "pedestrian"
value {
minimum_height: 20
maximum_height: 9999
minimum_width: 10
maximum_width: 9999
}
}
average_precision_mode: INTEGRATE
}
cost_function_config {
target_classes {
name: "car"
class_weight: 1.0
coverage_foreground_weight: 0.0500000007451
objectives {
name: "cov"
initial_weight: 1.0
weight_target: 1.0
}
objectives {
name: "bbox"
initial_weight: 10.0
weight_target: 10.0
}
}
target_classes {
name: "cyclist"
class_weight: 8.0
coverage_foreground_weight: 0.0500000007451
objectives {
name: "cov"
initial_weight: 1.0
weight_target: 1.0
}
objectives {
name: "bbox"
initial_weight: 10.0
weight_target: 1.0
}
}
target_classes {
name: "pedestrian"
class_weight: 4.0
coverage_foreground_weight: 0.0500000007451
objectives {
name: "cov"
initial_weight: 1.0
weight_target: 1.0
}
objectives {
name: "bbox"
initial_weight: 10.0
weight_target: 10.0
}
}
enable_autoweighting: true
max_objective_weight: 0.999899983406
min_objective_weight: 9.99999974738e-05
}
training_config {
batch_size_per_gpu: 4
num_epochs: 120
learning_rate {
soft_start_annealing_schedule {
min_learning_rate: 5e-06
max_learning_rate: 5e-04
soft_start: 0.10000000149
annealing: 0.699999988079
}
}
regularizer {
type: L1
weight: 3.00000002618e-09
}
optimizer {
adam {
epsilon: 9.99999993923e-09
beta1: 0.899999976158
beta2: 0.999000012875
}
}
cost_scaling {
initial_exponent: 20.0
increment: 0.005
decrement: 1.0
}
checkpoint_interval: 10
}
bbox_rasterizer_config {
target_class_config {
key: "car"
value {
cov_center_x: 0.5
cov_center_y: 0.5
cov_radius_x: 0.40000000596
cov_radius_y: 0.40000000596
bbox_min_radius: 1.0
}
}
target_class_config {
key: "cyclist"
value {
cov_center_x: 0.5
cov_center_y: 0.5
cov_radius_x: 1.0
cov_radius_y: 1.0
bbox_min_radius: 1.0
}
}
target_class_config {
key: "pedestrian"
value {
cov_center_x: 0.5
cov_center_y: 0.5
cov_radius_x: 1.0
cov_radius_y: 1.0
bbox_min_radius: 1.0
}
}
deadzone_radius: 0.400000154972
}
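
The soft_start_annealing_schedule in training_config can be read as a three-phase schedule over training progress from 0 to 1: a ramp from min_learning_rate up to max_learning_rate during the first soft_start fraction, a plateau at max_learning_rate, and a decay back towards min_learning_rate from the annealing point onward. The sketch below assumes log-linear interpolation in the two ramp phases; the exact curve TLT uses is not shown in this post, so the shape is an assumption, and the spec values 0.10000000149 and 0.699999988079 are rounded to 0.1 and 0.7.

```python
import math


def soft_start_annealing_lr(progress, min_lr=5e-6, max_lr=5e-4,
                            soft_start=0.1, annealing=0.7):
    """Learning rate at a training progress in [0, 1].

    Assumes log-linear interpolation between min_lr and max_lr in the
    ramp-up and ramp-down phases (an assumption, not confirmed by the post).
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    log_min, log_max = math.log(min_lr), math.log(max_lr)
    if progress < soft_start:
        t = progress / soft_start
        return math.exp(log_min + t * (log_max - log_min))
    if progress < annealing:
        return max_lr
    t = (progress - annealing) / (1.0 - annealing)
    return math.exp(log_max + t * (log_min - log_max))


for p in (0.0, 0.05, 0.1, 0.5, 0.7, 1.0):
    print("progress={:.2f} lr={:.2e}".format(p, soft_start_annealing_lr(p)))
```

With num_epochs: 120, the plateau under these assumptions would span roughly epochs 12 to 84.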

  3. Run TLT training

    Provide the sample spec file and the output directory location for models

Note: The training may take hours to complete. Also, the remainder of the notebook assumes that the training was done in single-GPU mode. When running in multi-GPU mode, expect to update the pruning and inference steps with new pruning thresholds and updated parameters in clusterfile.json for optimum performance.

DetectNet_v2 now supports restarting from a checkpoint. In case the training job is killed prematurely, you may resume training from the closest checkpoint by simply re-running the same command line. Please make sure to use the same number of GPUs when restarting the training.

When running the training with NUM_GPUS>1, you may need to modify batch_size_per_gpu and learning_rate to get a similar mAP to a 1-GPU training run. In most cases, scaling down the batch size by a factor of NUM_GPUS or scaling up the learning rate by a factor of NUM_GPUS would be a good place to start.
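
As a worked example of the two suggested starting points, the sketch below takes the single-GPU values from the spec (batch_size_per_gpu: 4, max_learning_rate: 5e-04) and computes both options for a given GPU count. The function name and the dictionary keys "scale_batch" and "scale_lr" are hypothetical, and whether min_learning_rate should be scaled as well is not stated in the notebook.

```python
def multi_gpu_options(batch_size_per_gpu, max_learning_rate, num_gpus):
    """Return the two starting points suggested for multi-GPU runs.

    "scale_batch": divide the per-GPU batch size by num_gpus (at least 1).
    "scale_lr": multiply the learning rate by num_gpus.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return {
        "scale_batch": {
            "batch_size_per_gpu": max(1, batch_size_per_gpu // num_gpus),
            "max_learning_rate": max_learning_rate,
        },
        "scale_lr": {
            "batch_size_per_gpu": batch_size_per_gpu,
            "max_learning_rate": max_learning_rate * num_gpus,
        },
    }


options = multi_gpu_options(batch_size_per_gpu=4, max_learning_rate=5e-4, num_gpus=2)
print(options["scale_batch"])  # {'batch_size_per_gpu': 2, 'max_learning_rate': 0.0005}
print(options["scale_lr"])     # {'batch_size_per_gpu': 4, 'max_learning_rate': 0.001}
```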

!tlt-train detectnet_v2 -e $SPECS_DIR/detectnet_v2_train_resnet18_kitti.txt \
                        -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                        -k $KEY \
                        -n resnet18_detector \
                        --gpus $NUM_GPUS -v

Using TensorFlow backend.
2020-08-06 06:33:23.623207: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-08-06 06:33:26,629 [DEBUG] iva.detectnet_v2.scripts.train: Starting experiment.
2020-08-06 06:33:26.717976: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-08-06 06:33:26.748348: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-06 06:33:26.748898: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:01:00.0
2020-08-06 06:33:26.748927: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-08-06 06:33:26.748979: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-08-06 06:33:26.749861: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-08-06 06:33:26.750141: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-08-06 06:33:26.751464: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-08-06 06:33:26.752503: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-08-06 06:33:26.752603: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-08-06 06:33:26.752750: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-06 06:33:26.753385: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-06 06:33:26.753827: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-08-06 06:33:26.753860: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-08-06 06:33:27.413087: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-08-06 06:33:27.413134: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-08-06 06:33:27.413149: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-08-06 06:33:27.413364: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-06 06:33:27.413896: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-06 06:33:27.414397: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-06 06:33:27.414885: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9717 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2020-08-06 06:33:27,415 [INFO] iva.detectnet_v2.scripts.train: Loading experiment spec at /workspace/examples/detectnet_v2/specs/detectnet_v2_train_resnet18_kitti.txt.
2020-08-06 06:33:27,417 [INFO] iva.detectnet_v2.spec_handler.spec_loader: Merging specification from /workspace/examples/detectnet_v2/specs/detectnet_v2_train_resnet18_kitti.txt
2020-08-06 06:33:27,424 [DEBUG] iva.detectnet_v2.scripts.train: Training gridbox model.
2020-08-06 06:33:27,425 [DEBUG] iva.detectnet_v2.visualization.visualizer: Building visualizer.
2020-08-06 06:33:28,225 [INFO] iva.detectnet_v2.scripts.train: Cannot iterate over exactly 6434 samples with a batch size of 4; each epoch will therefore take one extra step.
2020-08-06 06:33:28,340 [DEBUG] iva.detectnet_v2.scripts.train: Building DetectNet V2 model
2020-08-06 06:33:31,076 [DEBUG] iva.detectnet_v2.model.detectnet_model: Loading weights from pretrained model file. /workspace/tlt-experiments/detectnet_v2/pretrained_resnet18/tlt_pretrained_detectnet_v2_vresnet18/resnet18.hdf5


Layer (type) Output Shape Param # Connected to

input_1 (InputLayer) (None, 3, 384, 1248) 0


conv1 (Conv2D) (None, 64, 192, 624) 9472 input_1[0][0]


bn_conv1 (BatchNormalization) (None, 64, 192, 624) 256 conv1[0][0]


activation_1 (Activation) (None, 64, 192, 624) 0 bn_conv1[0][0]


block_1a_conv_1 (Conv2D) (None, 64, 96, 312) 36928 activation_1[0][0]


block_1a_bn_1 (BatchNormalizati (None, 64, 96, 312) 256 block_1a_conv_1[0][0]


block_1a_relu_1 (Activation) (None, 64, 96, 312) 0 block_1a_bn_1[0][0]


block_1a_conv_2 (Conv2D) (None, 64, 96, 312) 36928 block_1a_relu_1[0][0]


block_1a_conv_shortcut (Conv2D) (None, 64, 96, 312) 4160 activation_1[0][0]


block_1a_bn_2 (BatchNormalizati (None, 64, 96, 312) 256 block_1a_conv_2[0][0]


block_1a_bn_shortcut (BatchNorm (None, 64, 96, 312) 256 block_1a_conv_shortcut[0][0]


add_1 (Add) (None, 64, 96, 312) 0 block_1a_bn_2[0][0]
block_1a_bn_shortcut[0][0]


block_1a_relu (Activation) (None, 64, 96, 312) 0 add_1[0][0]


block_1b_conv_1 (Conv2D) (None, 64, 96, 312) 36928 block_1a_relu[0][0]


block_1b_bn_1 (BatchNormalizati (None, 64, 96, 312) 256 block_1b_conv_1[0][0]


block_1b_relu_1 (Activation) (None, 64, 96, 312) 0 block_1b_bn_1[0][0]


block_1b_conv_2 (Conv2D) (None, 64, 96, 312) 36928 block_1b_relu_1[0][0]


block_1b_bn_2 (BatchNormalizati (None, 64, 96, 312) 256 block_1b_conv_2[0][0]


add_2 (Add) (None, 64, 96, 312) 0 block_1b_bn_2[0][0]
block_1a_relu[0][0]


block_1b_relu (Activation) (None, 64, 96, 312) 0 add_2[0][0]


block_2a_conv_1 (Conv2D) (None, 128, 48, 156) 73856 block_1b_relu[0][0]


block_2a_bn_1 (BatchNormalizati (None, 128, 48, 156) 512 block_2a_conv_1[0][0]


block_2a_relu_1 (Activation) (None, 128, 48, 156) 0 block_2a_bn_1[0][0]


block_2a_conv_2 (Conv2D) (None, 128, 48, 156) 147584 block_2a_relu_1[0][0]


block_2a_conv_shortcut (Conv2D) (None, 128, 48, 156) 8320 block_1b_relu[0][0]


block_2a_bn_2 (BatchNormalizati (None, 128, 48, 156) 512 block_2a_conv_2[0][0]


block_2a_bn_shortcut (BatchNorm (None, 128, 48, 156) 512 block_2a_conv_shortcut[0][0]


add_3 (Add) (None, 128, 48, 156) 0 block_2a_bn_2[0][0]
block_2a_bn_shortcut[0][0]


block_2a_relu (Activation) (None, 128, 48, 156) 0 add_3[0][0]


block_2b_conv_1 (Conv2D) (None, 128, 48, 156) 147584 block_2a_relu[0][0]


block_2b_bn_1 (BatchNormalizati (None, 128, 48, 156) 512 block_2b_conv_1[0][0]


block_2b_relu_1 (Activation) (None, 128, 48, 156) 0 block_2b_bn_1[0][0]


block_2b_conv_2 (Conv2D) (None, 128, 48, 156) 147584 block_2b_relu_1[0][0]


block_2b_bn_2 (BatchNormalizati (None, 128, 48, 156) 512 block_2b_conv_2[0][0]


add_4 (Add) (None, 128, 48, 156) 0 block_2b_bn_2[0][0]
block_2a_relu[0][0]


block_2b_relu (Activation) (None, 128, 48, 156) 0 add_4[0][0]


block_3a_conv_1 (Conv2D) (None, 256, 24, 78) 295168 block_2b_relu[0][0]


block_3a_bn_1 (BatchNormalizati (None, 256, 24, 78) 1024 block_3a_conv_1[0][0]


block_3a_relu_1 (Activation) (None, 256, 24, 78) 0 block_3a_bn_1[0][0]


block_3a_conv_2 (Conv2D) (None, 256, 24, 78) 590080 block_3a_relu_1[0][0]


block_3a_conv_shortcut (Conv2D) (None, 256, 24, 78) 33024 block_2b_relu[0][0]


block_3a_bn_2 (BatchNormalizati (None, 256, 24, 78) 1024 block_3a_conv_2[0][0]


block_3a_bn_shortcut (BatchNorm (None, 256, 24, 78) 1024 block_3a_conv_shortcut[0][0]


add_5 (Add) (None, 256, 24, 78) 0 block_3a_bn_2[0][0]
block_3a_bn_shortcut[0][0]


block_3a_relu (Activation) (None, 256, 24, 78) 0 add_5[0][0]


block_3b_conv_1 (Conv2D) (None, 256, 24, 78) 590080 block_3a_relu[0][0]


block_3b_bn_1 (BatchNormalizati (None, 256, 24, 78) 1024 block_3b_conv_1[0][0]


block_3b_relu_1 (Activation) (None, 256, 24, 78) 0 block_3b_bn_1[0][0]


block_3b_conv_2 (Conv2D) (None, 256, 24, 78) 590080 block_3b_relu_1[0][0]


block_3b_bn_2 (BatchNormalizati (None, 256, 24, 78) 1024 block_3b_conv_2[0][0]


add_6 (Add) (None, 256, 24, 78) 0 block_3b_bn_2[0][0]
block_3a_relu[0][0]


block_3b_relu (Activation) (None, 256, 24, 78) 0 add_6[0][0]


block_4a_conv_1 (Conv2D) (None, 512, 24, 78) 1180160 block_3b_relu[0][0]


block_4a_bn_1 (BatchNormalizati (None, 512, 24, 78) 2048 block_4a_conv_1[0][0]


block_4a_relu_1 (Activation) (None, 512, 24, 78) 0 block_4a_bn_1[0][0]


block_4a_conv_2 (Conv2D) (None, 512, 24, 78) 2359808 block_4a_relu_1[0][0]


block_4a_conv_shortcut (Conv2D) (None, 512, 24, 78) 131584 block_3b_relu[0][0]


block_4a_bn_2 (BatchNormalizati (None, 512, 24, 78) 2048 block_4a_conv_2[0][0]


block_4a_bn_shortcut (BatchNorm (None, 512, 24, 78) 2048 block_4a_conv_shortcut[0][0]


add_7 (Add) (None, 512, 24, 78) 0 block_4a_bn_2[0][0]
block_4a_bn_shortcut[0][0]


block_4a_relu (Activation) (None, 512, 24, 78) 0 add_7[0][0]


block_4b_conv_1 (Conv2D) (None, 512, 24, 78) 2359808 block_4a_relu[0][0]


block_4b_bn_1 (BatchNormalizati (None, 512, 24, 78) 2048 block_4b_conv_1[0][0]


block_4b_relu_1 (Activation) (None, 512, 24, 78) 0 block_4b_bn_1[0][0]


block_4b_conv_2 (Conv2D) (None, 512, 24, 78) 2359808 block_4b_relu_1[0][0]


block_4b_bn_2 (BatchNormalizati (None, 512, 24, 78) 2048 block_4b_conv_2[0][0]


add_8 (Add) (None, 512, 24, 78) 0 block_4b_bn_2[0][0]
block_4a_relu[0][0]


block_4b_relu (Activation) (None, 512, 24, 78) 0 add_8[0][0]


output_bbox (Conv2D) (None, 12, 24, 78) 6156 block_4b_relu[0][0]


output_cov (Conv2D) (None, 3, 24, 78) 1539 block_4b_relu[0][0]

Total params: 11,203,023
Trainable params: 11,193,295
Non-trainable params: 9,728


2020-08-06 06:33:38,530 [DEBUG] iva.detectnet_v2.scripts.train: DetectNet V2 model built.
2020-08-06 06:33:38,531 [DEBUG] iva.detectnet_v2.scripts.train: Building rasterizer.
2020-08-06 06:33:38,532 [DEBUG] iva.detectnet_v2.scripts.train: Rasterizers built.
2020-08-06 06:33:38,532 [DEBUG] iva.detectnet_v2.scripts.train: Building training graph.
2020-08-06 06:33:38,534 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2020-08-06 06:33:38,534 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2020-08-06 06:33:38,534 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2020-08-06 06:33:38,534 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 8, io threads: 16, compute threads: 8, buffered batches: 4
2020-08-06 06:33:38,534 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset size 6434, number of sources: 1, batch size per gpu: 4, steps: 1609
2020-08-06 06:33:38,665 [INFO] iva.detectnet_v2.dataloader.default_dataloader: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
2020-08-06 06:33:38.701156: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-06 06:33:38.701671: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:01:00.0
2020-08-06 06:33:38.701709: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-08-06 06:33:38.701755: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-08-06 06:33:38.701798: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-08-06 06:33:38.701832: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-08-06 06:33:38.701865: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-08-06 06:33:38.701898: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-08-06 06:33:38.701928: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-08-06 06:33:38.702003: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-06 06:33:38.702493: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-06 06:33:38.702931: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-08-06 06:33:38,961 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: shuffle: True - shard 0 of 1
2020-08-06 06:33:38,969 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: sampling 1 datasets with weights:
2020-08-06 06:33:38,969 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: source: 0 weight: 1.000000
2020-08-06 06:33:39,552 [INFO] iva.detectnet_v2.scripts.train: Found 6434 samples in training set
2020-08-06 06:33:39,552 [DEBUG] iva.detectnet_v2.scripts.train: Rasterizing tensors.
2020-08-06 06:33:39,890 [DEBUG] iva.detectnet_v2.scripts.train: Tensors rasterized.
2020-08-06 06:33:42,452 [DEBUG] iva.detectnet_v2.scripts.train: Training graph built.
2020-08-06 06:33:42,452 [DEBUG] iva.detectnet_v2.scripts.train: Building validation graph.
2020-08-06 06:33:42,453 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2020-08-06 06:33:42,453 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2020-08-06 06:33:42,453 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2020-08-06 06:33:42,453 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 8, io threads: 16, compute threads: 8, buffered batches: 4
2020-08-06 06:33:42,453 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset size 1047, number of sources: 1, batch size per gpu: 4, steps: 262
2020-08-06 06:33:42,487 [INFO] iva.detectnet_v2.dataloader.default_dataloader: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
2020-08-06 06:33:42,783 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: shuffle: False - shard 0 of 1
2020-08-06 06:33:42,790 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: sampling 1 datasets with weights:
2020-08-06 06:33:42,790 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: source: 0 weight: 1.000000

2020-08-06 06:33:43,186 [INFO] iva.detectnet_v2.scripts.train: Found 1047 samples in validation set
2020-08-06 06:33:43,186 [DEBUG] iva.detectnet_v2.scripts.train: Rasterizing tensors.
2020-08-06 06:33:43,523 [DEBUG] iva.detectnet_v2.scripts.train: Tensors rasterized.
2020-08-06 06:33:44,020 [DEBUG] iva.detectnet_v2.scripts.train: Validation graph built.
2020-08-06 06:33:45,505 [DEBUG] iva.detectnet_v2.scripts.train: Running training loop.
2020-08-06 06:33:45,507 [DEBUG] iva.detectnet_v2.scripts.train: Checkpoint interval: 10
2020-08-06 06:33:46.787521: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-06 06:33:46.788048: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:01:00.0
2020-08-06 06:33:46.788086: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-08-06 06:33:46.788158: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-08-06 06:33:46.788225: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-08-06 06:33:46.788269: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-08-06 06:33:46.788322: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-08-06 06:33:46.788375: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-08-06 06:33:46.788406: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-08-06 06:33:46.788478: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-06 06:33:46.788994: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-06 06:33:46.789454: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-08-06 06:33:46.790657: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-08-06 06:33:46.790676: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-08-06 06:33:46.790687: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-08-06 06:33:46.790787: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-06 06:33:46.791296: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-06 06:33:46.791752: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9717 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2020-08-06 06:34:20.220259: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-08-06 06:34:22.814302: I tensorflow/core/kernels/cuda_solvers.cc:159] Creating CudaSolver handles for stream 0x6d300f0
2020-08-06 06:34:22.814516: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-08-06 06:34:25.814576: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
/usr/local/bin/tlt-train: line 32: 554 Illegal instruction (core dumped) tlt-train-g1 ${PYTHON_ARGS[*]}


Any suggestions?

This may be because your CPU does not support some of the instructions the binary uses.
Reference:

OMG! Would it work if I installed a TensorFlow build compiled without AVX2? Or will you provide a compatible NGC version later?

It is not related to NGC.
Another reference: https://stackoverflow.com/questions/47068709/your-cpu-supports-instructions-that-this-tensorflow-binary-was-not-compiled-to-u
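
To confirm whether the CPU is the cause, you can check which instruction-set flags it reports. Below is a minimal sketch, assuming an x86 Linux host where `/proc/cpuinfo` has a `flags` line. The helper names `parse_cpu_flags` and `missing_flags` are made up for illustration, and the `avx`/`avx2` list is an assumption based on the question above; this thread does not state which instructions the TLT TensorFlow build actually needs.

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # x86 Linux lists features on a line that starts with "flags".
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(cpuinfo_text, required=("avx", "avx2")):
    """List the required flags that the CPU does not report.

    The default `required` tuple is an assumption, not a documented
    requirement of the TLT container.
    """
    flags = parse_cpu_flags(cpuinfo_text)
    return [name for name in required if name not in flags]


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            text = handle.read()
    except OSError:
        # Not a Linux host, or /proc is not readable.
        text = ""
    absent = missing_flags(text)
    if absent:
        print("CPU does not report:", ", ".join(absent))
    else:
        print("CPU reports all checked flags")
```

If a flag shows up as missing, an "Illegal instruction" crash in a binary compiled with that instruction set is the expected outcome. Run the check on the host, since a Docker container sees the same CPU as the host.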

Hi Morganh, thanks for your reply. Could I build TensorFlow from source without the AVX2 flag and use it in the TLT container?

I'm not sure whether that would work.

OK… I may try later. Thanks.