tlt-train error when using mobilenet_v2 with DetectNet

m.billson16 · November 22, 2019, 10:07am

Hello, I'm trying DetectNet_v2 with the mobilenet_v2 pretrained model, and I get this error when I run tlt-train detectnet_v2.

Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 10, in <module>
    sys.exit(main())
  File "./common/magnet_train.py", line 37, in main
  File "</usr/local/lib/python2.7/dist-packages/decorator.pyc:decorator-gen-2>", line 2, in main
  File "./detectnet_v2/utilities/timer.py", line 46, in wrapped_fn
  File "./detectnet_v2/scripts/train.py", line 632, in main
  File "./detectnet_v2/scripts/train.py", line 556, in run_experiment
  File "./detectnet_v2/scripts/train.py", line 479, in train_gridbox
  File "./detectnet_v2/scripts/train.py", line 353, in build_validation_graph
  File "./detectnet_v2/dataloader/default_dataloader.py", line 198, in get_dataset_tensors
  File "./detectnet_v2/dataloader/utilities.py", line 181, in extract_tfrecords_features
StopIteration

Do you have any idea?

This is detectnet_v2_train_mobilenet_v2_kitti.txt

random_seed: 42
dataset_config {
  data_sources {
    tfrecords_path: "/workspace/tlt-experiments/tfrecords/kitti_trainval/*"
    image_directory_path: "/workspace/tlt-experiments/data/training"
  }
  image_extension: "jpg"
  target_class_mapping {
    key: "Bola"
    value: "Bola"
  }
  validation_fold: 0
}
augmentation_config {
  preprocessing {
    output_image_width: 1280
    output_image_height: 720
    min_bbox_width: 1.0
    min_bbox_height: 1.0
    output_image_channel: 3
  }
  spatial_augmentation {
    hflip_probability: 0.5
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    hue_rotation_max: 25.0
    saturation_shift_max: 0.20000000298
    contrast_scale_max: 0.10000000149
    contrast_center: 0.5
  }
}
postprocessing_config {
  target_class_config {
    key: "Bola"
    value {
      clustering_config {
        coverage_threshold: 0.00499999988824
        dbscan_eps: 0.20000000298
        dbscan_min_samples: 0.0500000007451
        minimum_bounding_box_height: 20
      }
    }
  }
}
model_config {
  pretrained_model_file: "/workspace/tlt-experiments/pretrained_mobilenet_v2/tlt_mobilenet_v2_detectnet_v2_v1/mobilenet_v2.hdf5"
  num_layers: 18
  use_batch_norm: true
  activation {
    activation_type: "relu"
  }
  objective_set {
    bbox {
      scale: 35.0
      offset: 0.5
    }
    cov {
    }
  }
  training_precision {
    backend_floatx: FLOAT32
  }
  arch: "mobilenet_v2"
}
evaluation_config {
  validation_period_during_training: 10
  first_validation_epoch: 1
  minimum_detection_ground_truth_overlap {
    key: "Bola"
    value: 0.5
  }
  evaluation_box_config {
    key: "Bola"
    value {
      minimum_height: 20
      maximum_height: 9999
      minimum_width: 10
      maximum_width: 9999
    }
  }
  average_precision_mode: INTEGRATE
}
cost_function_config {
  target_classes {
    name: "Bola"
    class_weight: 1.0
    coverage_foreground_weight: 0.0500000007451
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  enable_autoweighting: true
  max_objective_weight: 0.999899983406
  min_objective_weight: 9.99999974738e-05
}
training_config {
  batch_size_per_gpu: 4
  num_epochs: 10
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-06
      max_learning_rate: 5e-04
      soft_start: 0.10000000149
      annealing: 0.699999988079
    }
  }
  regularizer {
    type: L1
    weight: 3.00000002618e-09
  }
  optimizer {
    adam {
      epsilon: 9.99999993923e-09
      beta1: 0.899999976158
      beta2: 0.999000012875
    }
  }
  cost_scaling {
    initial_exponent: 20.0
    increment: 0.005
    decrement: 1.0
  }
  checkpoint_interval: 10
}
bbox_rasterizer_config {
  target_class_config {
    key: "Bola"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 1.0
      cov_radius_y: 1.0
      bbox_min_radius: 1.0
    }
  }
  deadzone_radius: 0.400000154972
}

And here is my tlt-train detectnet_v2 command:

!tlt-train detectnet_v2 -e $SPECS_DIR/detectnet_v2_train_mobilenet_v2_kitti.txt \
                        -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                        -k $KEY \
                        -n mobilenet_v2_detector \
                        --gpus 1

Morganh · November 22, 2019, 3:14pm

Hi m.billson16,
Please double check your tfrecord files. Did you generate them via “tlt-dataset-convert”?
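
One quick way to double-check the files is to count how many records each shard actually contains. The sketch below walks the standard TFRecord framing (an 8-byte little-endian length, a 4-byte CRC, the payload, then another 4-byte CRC) without needing TensorFlow. The helper names are made up for illustration, and whether a shard with zero records is what triggers the StopIteration is only an assumption worth ruling out, not something confirmed from the TLT code.

```python
import glob
import struct


def count_tfrecord_records(path):
    """Count records in one TFRecord file by walking its length-prefixed framing."""
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(12)  # uint64 payload length + uint32 CRC of that length
            if not header:
                break  # clean end of file
            if len(header) < 12:
                raise ValueError("truncated record header in %s" % path)
            (length,) = struct.unpack("<Q", header[:8])
            body = f.read(length + 4)  # payload + uint32 CRC of the payload
            if len(body) < length + 4:
                raise ValueError("truncated record payload in %s" % path)
            count += 1
    return count


def empty_shards(pattern):
    """Return the files matching the glob pattern that contain no records."""
    return [p for p in sorted(glob.glob(pattern)) if count_tfrecord_records(p) == 0]
```

For example, calling empty_shards("/workspace/tlt-experiments/tfrecords/kitti_trainval/*") with the tfrecords_path from the spec above lists any shard files that were written without records. The CRC values are not verified here; the sketch only counts frames.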

m.billson16 · November 22, 2019, 3:30pm

Hello Morganh, thanks for the reply.
About the tfrecord files: yes, I generated them via tlt-dataset-convert.

Here is the output from tlt-dataset-convert:

Converting Tfrecords for kitti trainval dataset
Using TensorFlow backend.
2019-11-22 15:27:01,588 - iva.detectnet_v2.dataio.build_converter - INFO - Instantiating a kitti converter
2019-11-22 15:27:01,588 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Num images in
Train: 38 Val: 9
2019-11-22 15:27:01,588 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Validation data in partition 0. Hence, while choosing the validationset during training choose validation_fold 0.
2019-11-22 15:27:01,588 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 0
2019-11-22 15:27:01,588 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 1
2019-11-22 15:27:01,589 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 2
2019-11-22 15:27:01,589 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 3
2019-11-22 15:27:01,589 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 4
2019-11-22 15:27:01,589 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 5
2019-11-22 15:27:01,589 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 6
2019-11-22 15:27:01,589 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 7
2019-11-22 15:27:01,590 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 8
2019-11-22 15:27:01,590 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 9
/usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/dataio/kitti_converter_lib.py:266: VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.
2019-11-22 15:27:01,650 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
bola: 9

2019-11-22 15:27:01,650 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 0
2019-11-22 15:27:01,664 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 1
2019-11-22 15:27:01,677 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 2
2019-11-22 15:27:01,693 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 3
2019-11-22 15:27:01,708 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 4
2019-11-22 15:27:01,724 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 5
2019-11-22 15:27:01,738 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 6
2019-11-22 15:27:01,751 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 7
2019-11-22 15:27:01,765 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 8
2019-11-22 15:27:01,782 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 9
2019-11-22 15:27:01,829 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
bola: 38

2019-11-22 15:27:01,829 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Cumulative object statistics
2019-11-22 15:27:01,829 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
bola: 47

2019-11-22 15:27:01,830 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Class map.
Label in GT: Label in tfrecords file
Bola: bola
For the dataset_config in the experiment_spec, please use labels in the tfrecords file, while writing the classmap.

2019-11-22 15:27:01,830 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Tfrecords generation complete.

and here is my kitti config:

kitti_config {
  root_directory_path: "/workspace/tlt-experiments/data/training"
  image_dir_name: "Bola"
  label_dir_name: "Label_Bola"
  image_extension: ".jpg"
  partition_mode: "random"
  num_partitions: 2
  val_split: 20
  num_shards: 10
}
image_directory_path: "/workspace/tlt-experiments/data/training"

Morganh · November 22, 2019, 3:47pm

Please go through https://devtalk.nvidia.com/default/topic/1066112/transfer-learning-toolkit/unable-to-train-ssd-resnet-18-/post/5398854/#5398854 to get more hints.

In particular, check the KITTI label txt files. Each line should have 15 fields.
Also, for detectnet_v2, the tlt-train tool does not support training on images of multiple resolutions, or resizing images during training. All of the images must be resized offline to the final training size and the corresponding bounding boxes must be scaled accordingly.
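
As a rough sketch of the label side of that offline resize: in the KITTI format the bounding box is fields 5 to 8 of each line (left, top, right, bottom), so those are the only columns that need scaling. The function name and the two-decimal output format are my own choices for illustration; resizing the image itself is left to whatever tool you prefer.

```python
def scale_kitti_line(line, src_size, dst_size):
    """Rescale the bbox columns (x1, y1, x2, y2) of one 15-field KITTI label line.

    src_size and dst_size are (width, height) tuples in pixels.
    """
    src_w, src_h = src_size
    dst_w, dst_h = dst_size
    sx = dst_w / float(src_w)  # horizontal scale factor
    sy = dst_h / float(src_h)  # vertical scale factor

    fields = line.split()
    if len(fields) != 15:
        raise ValueError("expected 15 fields, got %d" % len(fields))

    x1, y1, x2, y2 = (float(v) for v in fields[4:8])
    fields[4:8] = [
        "%.2f" % (x1 * sx),
        "%.2f" % (y1 * sy),
        "%.2f" % (x2 * sx),
        "%.2f" % (y2 * sy),
    ]
    return " ".join(fields)
```

All other fields (class name, truncation, occlusion, and the 3D columns) are passed through unchanged.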

m.billson16 · November 24, 2019, 11:08am

Hello Morganh, thank you for the advice.
I have checked my tfrecords: all the images have the same resolution, 1280 X 720, and my KITTI labels have 15 fields.
The suggestions in https://devtalk.nvidia.com/default/topic/1066112/transfer-learning-toolkit/unable-to-train-ssd-resnet-18-/post/5398854/#5398854 still did not solve my error.
Do you have any idea?

m.billson16 · November 25, 2019, 8:23am

Hello Morganh, are there any correct MobileNet V2 specification files for DetectNet V2?
Also, is it a problem that my input dataset has a resolution of 1280 X 720?

Morganh · November 25, 2019, 8:29am

Hi m.billson16,
Could you run one experiment to narrow this down?
Please change your pretrained model to resnet18.hdf5 and modify the corresponding settings in the spec.

Morganh · November 25, 2019, 9:08am

Also, could you paste one of your KITTI label text files here?
And please make sure each line in your label file contains 15 fields.
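
If it helps, here is a minimal sketch for checking the field count across a whole label directory rather than by eye. The function name is hypothetical, and treating a label file with no lines as a problem is my assumption (it is at least worth a look).

```python
import glob
import os


def find_bad_label_lines(label_dir, expected_fields=15):
    """Return (path, line_number, message) for every suspicious KITTI label line."""
    problems = []
    for path in sorted(glob.glob(os.path.join(label_dir, "*.txt"))):
        with open(path) as f:
            lines = f.read().splitlines()
        if not any(line.strip() for line in lines):
            # Line number 0 marks a whole-file problem.
            problems.append((path, 0, "no label lines"))
            continue
        for lineno, line in enumerate(lines, 1):
            if not line.strip():
                continue  # ignore blank lines
            count = len(line.split())
            if count != expected_fields:
                problems.append(
                    (path, lineno, "%d fields, expected %d" % (count, expected_fields))
                )
    return problems
```

With the kitti_config posted earlier, the directory to pass would presumably be /workspace/tlt-experiments/data/training/Label_Bola. An empty result means every non-blank line split into exactly 15 whitespace-separated fields.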

m.billson16 · November 25, 2019, 9:35am

Hello Morganh, I have tried reducing the resolution from 1280 X 720 to 800 X 600. When I tried to convert the dataset to tfrecords, I got this error.

Traceback (most recent call last):
  File "/usr/local/bin/tlt-dataset-convert", line 10, in <module>
    sys.exit(main())
  File "./detectnet_v2/scripts/dataset_convert.py", line 64, in main
  File "./detectnet_v2/dataio/dataset_converter_lib.py", line 74, in convert
  File "./detectnet_v2/dataio/dataset_converter_lib.py", line 108, in _write_partitions
  File "./detectnet_v2/dataio/dataset_converter_lib.py", line 149, in _write_shard
  File "./detectnet_v2/dataio/kitti_converter_lib.py", line 169, in _create_example_proto
  File "./detectnet_v2/dataio/kitti_converter_lib.py", line 272, in _add_targets
TypeError: object of type 'int' has no len()

Do you have any idea?

This is the result when I tried to convert TFrecords for kitti trainval dataset

Converting Tfrecords for kitti trainval dataset
Using TensorFlow backend.
2019-11-25 09:32:34,874 - iva.detectnet_v2.dataio.build_converter - INFO - Instantiating a kitti converter
2019-11-25 09:32:34,874 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Num images in
Train: 9	Val: 1
2019-11-25 09:32:34,874 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Validation data in partition 0. Hence, while choosing the validationset during training choose validation_fold 0.
2019-11-25 09:32:34,874 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 0
2019-11-25 09:32:34,875 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 1
2019-11-25 09:32:34,875 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 2
2019-11-25 09:32:34,875 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 3
2019-11-25 09:32:34,875 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 4
2019-11-25 09:32:34,875 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 5
2019-11-25 09:32:34,875 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 6
2019-11-25 09:32:34,875 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 7
2019-11-25 09:32:34,876 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 8
2019-11-25 09:32:34,876 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 9
/usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/dataio/kitti_converter_lib.py:266: VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.
2019-11-25 09:32:34,883 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - 
Wrote the following numbers of objects:
bola: 1

2019-11-25 09:32:34,883 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 0
2019-11-25 09:32:34,883 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 1
2019-11-25 09:32:34,884 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 2
2019-11-25 09:32:34,884 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 3
2019-11-25 09:32:34,884 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 4
2019-11-25 09:32:34,884 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 5
2019-11-25 09:32:34,884 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 6
2019-11-25 09:32:34,884 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 7
2019-11-25 09:32:34,884 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 8
2019-11-25 09:32:34,884 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 9
Traceback (most recent call last):
  File "/usr/local/bin/tlt-dataset-convert", line 10, in <module>
    sys.exit(main())
  File "./detectnet_v2/scripts/dataset_convert.py", line 64, in main
  File "./detectnet_v2/dataio/dataset_converter_lib.py", line 74, in convert
  File "./detectnet_v2/dataio/dataset_converter_lib.py", line 108, in _write_partitions
  File "./detectnet_v2/dataio/dataset_converter_lib.py", line 149, in _write_shard
  File "./detectnet_v2/dataio/kitti_converter_lib.py", line 169, in _create_example_proto
  File "./detectnet_v2/dataio/kitti_converter_lib.py", line 272, in _add_targets
TypeError: object of type 'int' has no len()

m.billson16 · November 25, 2019, 9:39am

And this is one of my KITTI labels:

Bola 0.0 0 0.0 446 261 549 367 0.0 0.0 0.0 0.0 0.0 0.0 0.0

Morganh · November 25, 2019, 10:05am

Hi m.billson16,
Since there are only 10 images when you run “tlt-dataset-convert”, could you split them to find which image is the culprit?
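
The splitting can be done as a simple bisection. The sketch below assumes there is exactly one bad image/label pair and that any subset containing it makes the conversion fail. The fails callback is hypothetical: you would supply something that copies the given subset into a scratch directory, runs tlt-dataset-convert on it, and returns True if it crashes.

```python
def find_culprit(items, fails):
    """Bisect items down to the single element that makes fails(subset) return True.

    Assumes one culprit, and that every subset containing it fails.
    """
    items = list(items)
    if not fails(items):
        raise ValueError("the full set does not fail, nothing to bisect")
    while len(items) > 1:
        mid = len(items) // 2
        left, right = items[:mid], items[mid:]
        # Keep whichever half still reproduces the failure.
        items = left if fails(left) else right
    return items[0]
```

With 10 images this needs only a handful of conversion runs instead of one per image.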

Morganh · November 25, 2019, 10:16am

Also, note that for detectnet_v2, W and H must be multiples of 16 according to the TLT doc.
So you should use another resolution instead of 800 X 600, because 600 is not a multiple of 16.

Input size: C * W * H (where C = 1 or 3, W >= 480, H >= 272, and W, H are multiples of 16)
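
The rule quoted above is easy to check before converting anything. A small sketch (the function names are mine, not part of TLT):

```python
def check_input_size(width, height, channels=3):
    """Return a list of violations of the detectnet_v2 input size rule quoted above."""
    problems = []
    if channels not in (1, 3):
        problems.append("channels must be 1 or 3, got %d" % channels)
    if width < 480:
        problems.append("width %d is below 480" % width)
    if height < 272:
        problems.append("height %d is below 272" % height)
    if width % 16 != 0:
        problems.append("width %d is not a multiple of 16" % width)
    if height % 16 != 0:
        problems.append("height %d is not a multiple of 16" % height)
    return problems


def round_down_to_multiple(value, base=16):
    """Largest multiple of base that does not exceed value."""
    return (value // base) * base
```

Here check_input_size(800, 600) reports only the height, while 1280 X 720 and 1024 X 576 both pass. round_down_to_multiple(600) gives 592 as the nearest valid height below 600.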

m.billson16 · November 25, 2019, 10:50am

Hello Morganh, thank you for the help. I have tried another experiment: I changed the resolution from 800 X 600 to 1024 X 576, and I can now successfully convert the kitti trainval dataset to TFRecords. But I get the same error when I run TLT training.

Here is the error:

Using TensorFlow backend.
2019-11-25 10:47:32.266149: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-11-25 10:47:32.319302: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-25 10:47:32.319755: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x5e89d70 executing computations on platform CUDA. Devices:
2019-11-25 10:47:32.319787: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): GeForce GTX 950M, Compute Capability 5.0
2019-11-25 10:47:32.322018: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2593785000 Hz
2019-11-25 10:47:32.322477: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x5fa1f40 executing computations on platform Host. Devices:
2019-11-25 10:47:32.322503: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-11-25 10:47:32.322652: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce GTX 950M major: 5 minor: 0 memoryClockRate(GHz): 1.124
pciBusID: 0000:0a:00.0
totalMemory: 3.95GiB freeMemory: 3.65GiB
2019-11-25 10:47:32.322675: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-11-25 10:47:32.323227: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-25 10:47:32.323242: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-11-25 10:47:32.323260: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-11-25 10:47:32.323324: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3440 MB memory) -> physical GPU (device: 0, name: GeForce GTX 950M, pci bus id: 0000:0a:00.0, compute capability: 5.0)
2019-11-25 10:47:32,324 [INFO] iva.detectnet_v2.scripts.train: Loading experiment spec at /workspace/tlt-experiments/detectnet_v2_train_resnet18_kitti.txt.
2019-11-25 10:47:32,325 [INFO] iva.detectnet_v2.spec_handler.spec_loader: Merging specification from /workspace/tlt-experiments/detectnet_v2_train_resnet18_kitti.txt
WARNING:tensorflow:From ./detectnet_v2/dataloader/utilities.py:114: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`
2019-11-25 10:47:32,337 [WARNING] tensorflow: From ./detectnet_v2/dataloader/utilities.py:114: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`
2019-11-25 10:47:32,378 [INFO] iva.detectnet_v2.scripts.train: Cannot iterate over exactly 19 samples with a batch size of 4; each epoch will therefore take one extra step.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-11-25 10:47:32,384 [WARNING] tensorflow: From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/horovod/tensorflow/__init__.py:91: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
2019-11-25 10:47:32,398 [WARNING] tensorflow: From /usr/local/lib/python2.7/dist-packages/horovod/tensorflow/__init__.py:91: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            (None, 3, 576, 1024) 0                                            
__________________________________________________________________________________________________
conv1 (Conv2D)                  (None, 64, 288, 512) 9472        input_1[0][0]                    
__________________________________________________________________________________________________
bn_conv1 (BatchNormalization)   (None, 64, 288, 512) 256         conv1[0][0]                      
__________________________________________________________________________________________________
activation_1 (Activation)       (None, 64, 288, 512) 0           bn_conv1[0][0]                   
__________________________________________________________________________________________________
block_1a_conv_1 (Conv2D)        (None, 64, 144, 256) 36928       activation_1[0][0]               
__________________________________________________________________________________________________
block_1a_bn_1 (BatchNormalizati (None, 64, 144, 256) 256         block_1a_conv_1[0][0]            
__________________________________________________________________________________________________
activation_2 (Activation)       (None, 64, 144, 256) 0           block_1a_bn_1[0][0]              
__________________________________________________________________________________________________
block_1a_conv_2 (Conv2D)        (None, 64, 144, 256) 36928       activation_2[0][0]               
__________________________________________________________________________________________________
block_1a_conv_shortcut (Conv2D) (None, 64, 144, 256) 4160        activation_1[0][0]               
__________________________________________________________________________________________________
block_1a_bn_2 (BatchNormalizati (None, 64, 144, 256) 256         block_1a_conv_2[0][0]            
__________________________________________________________________________________________________
block_1a_bn_shortcut (BatchNorm (None, 64, 144, 256) 256         block_1a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_1 (Add)                     (None, 64, 144, 256) 0           block_1a_bn_2[0][0]              
                                                                 block_1a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
activation_3 (Activation)       (None, 64, 144, 256) 0           add_1[0][0]                      
__________________________________________________________________________________________________
block_1b_conv_1 (Conv2D)        (None, 64, 144, 256) 36928       activation_3[0][0]               
__________________________________________________________________________________________________
block_1b_bn_1 (BatchNormalizati (None, 64, 144, 256) 256         block_1b_conv_1[0][0]            
__________________________________________________________________________________________________
activation_4 (Activation)       (None, 64, 144, 256) 0           block_1b_bn_1[0][0]              
__________________________________________________________________________________________________
block_1b_conv_2 (Conv2D)        (None, 64, 144, 256) 36928       activation_4[0][0]               
__________________________________________________________________________________________________
block_1b_bn_2 (BatchNormalizati (None, 64, 144, 256) 256         block_1b_conv_2[0][0]            
__________________________________________________________________________________________________
add_2 (Add)                     (None, 64, 144, 256) 0           block_1b_bn_2[0][0]              
                                                                 activation_3[0][0]               
__________________________________________________________________________________________________
activation_5 (Activation)       (None, 64, 144, 256) 0           add_2[0][0]                      
__________________________________________________________________________________________________
block_2a_conv_1 (Conv2D)        (None, 128, 72, 128) 73856       activation_5[0][0]               
__________________________________________________________________________________________________
block_2a_bn_1 (BatchNormalizati (None, 128, 72, 128) 512         block_2a_conv_1[0][0]            
__________________________________________________________________________________________________
activation_6 (Activation)       (None, 128, 72, 128) 0           block_2a_bn_1[0][0]              
__________________________________________________________________________________________________
block_2a_conv_2 (Conv2D)        (None, 128, 72, 128) 147584      activation_6[0][0]               
__________________________________________________________________________________________________
block_2a_conv_shortcut (Conv2D) (None, 128, 72, 128) 8320        activation_5[0][0]               
__________________________________________________________________________________________________
block_2a_bn_2 (BatchNormalizati (None, 128, 72, 128) 512         block_2a_conv_2[0][0]            
__________________________________________________________________________________________________
block_2a_bn_shortcut (BatchNorm (None, 128, 72, 128) 512         block_2a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_3 (Add)                     (None, 128, 72, 128) 0           block_2a_bn_2[0][0]              
                                                                 block_2a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
activation_7 (Activation)       (None, 128, 72, 128) 0           add_3[0][0]                      
__________________________________________________________________________________________________
block_2b_conv_1 (Conv2D)        (None, 128, 72, 128) 147584      activation_7[0][0]               
__________________________________________________________________________________________________
block_2b_bn_1 (BatchNormalizati (None, 128, 72, 128) 512         block_2b_conv_1[0][0]            
__________________________________________________________________________________________________
activation_8 (Activation)       (None, 128, 72, 128) 0           block_2b_bn_1[0][0]              
__________________________________________________________________________________________________
block_2b_conv_2 (Conv2D)        (None, 128, 72, 128) 147584      activation_8[0][0]               
__________________________________________________________________________________________________
block_2b_bn_2 (BatchNormalizati (None, 128, 72, 128) 512         block_2b_conv_2[0][0]            
__________________________________________________________________________________________________
add_4 (Add)                     (None, 128, 72, 128) 0           block_2b_bn_2[0][0]              
                                                                 activation_7[0][0]               
__________________________________________________________________________________________________
activation_9 (Activation)       (None, 128, 72, 128) 0           add_4[0][0]                      
__________________________________________________________________________________________________
block_3a_conv_1 (Conv2D)        (None, 256, 36, 64)  295168      activation_9[0][0]               
__________________________________________________________________________________________________
block_3a_bn_1 (BatchNormalizati (None, 256, 36, 64)  1024        block_3a_conv_1[0][0]            
__________________________________________________________________________________________________
activation_10 (Activation)      (None, 256, 36, 64)  0           block_3a_bn_1[0][0]              
__________________________________________________________________________________________________
block_3a_conv_2 (Conv2D)        (None, 256, 36, 64)  590080      activation_10[0][0]              
__________________________________________________________________________________________________
block_3a_conv_shortcut (Conv2D) (None, 256, 36, 64)  33024       activation_9[0][0]               
__________________________________________________________________________________________________
block_3a_bn_2 (BatchNormalizati (None, 256, 36, 64)  1024        block_3a_conv_2[0][0]            
__________________________________________________________________________________________________
block_3a_bn_shortcut (BatchNorm (None, 256, 36, 64)  1024        block_3a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_5 (Add)                     (None, 256, 36, 64)  0           block_3a_bn_2[0][0]              
                                                                 block_3a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
activation_11 (Activation)      (None, 256, 36, 64)  0           add_5[0][0]                      
__________________________________________________________________________________________________
block_3b_conv_1 (Conv2D)        (None, 256, 36, 64)  590080      activation_11[0][0]              
__________________________________________________________________________________________________
block_3b_bn_1 (BatchNormalizati (None, 256, 36, 64)  1024        block_3b_conv_1[0][0]            
__________________________________________________________________________________________________
activation_12 (Activation)      (None, 256, 36, 64)  0           block_3b_bn_1[0][0]              
__________________________________________________________________________________________________
block_3b_conv_2 (Conv2D)        (None, 256, 36, 64)  590080      activation_12[0][0]              
__________________________________________________________________________________________________
block_3b_bn_2 (BatchNormalizati (None, 256, 36, 64)  1024        block_3b_conv_2[0][0]            
__________________________________________________________________________________________________
add_6 (Add)                     (None, 256, 36, 64)  0           block_3b_bn_2[0][0]              
                                                                 activation_11[0][0]              
__________________________________________________________________________________________________
activation_13 (Activation)      (None, 256, 36, 64)  0           add_6[0][0]                      
__________________________________________________________________________________________________
block_4a_conv_1 (Conv2D)        (None, 512, 36, 64)  1180160     activation_13[0][0]              
__________________________________________________________________________________________________
block_4a_bn_1 (BatchNormalizati (None, 512, 36, 64)  2048        block_4a_conv_1[0][0]            
__________________________________________________________________________________________________
activation_14 (Activation)      (None, 512, 36, 64)  0           block_4a_bn_1[0][0]              
__________________________________________________________________________________________________
block_4a_conv_2 (Conv2D)        (None, 512, 36, 64)  2359808     activation_14[0][0]              
__________________________________________________________________________________________________
block_4a_conv_shortcut (Conv2D) (None, 512, 36, 64)  131584      activation_13[0][0]              
__________________________________________________________________________________________________
block_4a_bn_2 (BatchNormalizati (None, 512, 36, 64)  2048        block_4a_conv_2[0][0]            
__________________________________________________________________________________________________
block_4a_bn_shortcut (BatchNorm (None, 512, 36, 64)  2048        block_4a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_7 (Add)                     (None, 512, 36, 64)  0           block_4a_bn_2[0][0]              
                                                                 block_4a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
activation_15 (Activation)      (None, 512, 36, 64)  0           add_7[0][0]                      
__________________________________________________________________________________________________
block_4b_conv_1 (Conv2D)        (None, 512, 36, 64)  2359808     activation_15[0][0]              
__________________________________________________________________________________________________
block_4b_bn_1 (BatchNormalizati (None, 512, 36, 64)  2048        block_4b_conv_1[0][0]            
__________________________________________________________________________________________________
activation_16 (Activation)      (None, 512, 36, 64)  0           block_4b_bn_1[0][0]              
__________________________________________________________________________________________________
block_4b_conv_2 (Conv2D)        (None, 512, 36, 64)  2359808     activation_16[0][0]              
__________________________________________________________________________________________________
block_4b_bn_2 (BatchNormalizati (None, 512, 36, 64)  2048        block_4b_conv_2[0][0]            
__________________________________________________________________________________________________
add_8 (Add)                     (None, 512, 36, 64)  0           block_4b_bn_2[0][0]              
                                                                 activation_15[0][0]              
__________________________________________________________________________________________________
activation_17 (Activation)      (None, 512, 36, 64)  0           add_8[0][0]                      
__________________________________________________________________________________________________
output_bbox (Conv2D)            (None, 4, 36, 64)    2052        activation_17[0][0]              
__________________________________________________________________________________________________
output_cov (Conv2D)             (None, 1, 36, 64)    513         activation_17[0][0]              
==================================================================================================
Total params: 11,197,893
Trainable params: 11,188,165
Non-trainable params: 9,728
__________________________________________________________________________________________________

target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
2019-11-25 10:47:46,752 [INFO] iva.detectnet_v2.scripts.train: Found 19 samples in training set
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 10, in <module>
    sys.exit(main())
  File "./common/magnet_train.py", line 37, in main
  File "</usr/local/lib/python2.7/dist-packages/decorator.pyc:decorator-gen-2>", line 2, in main
  File "./detectnet_v2/utilities/timer.py", line 46, in wrapped_fn
  File "./detectnet_v2/scripts/train.py", line 632, in main
  File "./detectnet_v2/scripts/train.py", line 556, in run_experiment
  File "./detectnet_v2/scripts/train.py", line 479, in train_gridbox
  File "./detectnet_v2/scripts/train.py", line 353, in build_validation_graph
  File "./detectnet_v2/dataloader/default_dataloader.py", line 198, in get_dataset_tensors
  File "./detectnet_v2/dataloader/utilities.py", line 181, in extract_tfrecords_features
StopIteration

Here is detectnet_v2_train_resnet18_kitti.txt

random_seed: 42
dataset_config {
  data_sources {
    tfrecords_path: "/workspace/tlt-experiments/tfrecords/kitti_trainval/*"
    image_directory_path: "/workspace/tlt-experiments/data/training"
  }
  image_extension: "jpg"
  target_class_mapping {
    key: "Bola"
    value: "Bola"
  }
  validation_fold: 0
}
augmentation_config {
  preprocessing {
    output_image_width: 1024
    output_image_height: 576
    min_bbox_width: 1.0
    min_bbox_height: 1.0
    output_image_channel: 3 
  }
  spatial_augmentation {
    hflip_probability: 0.5
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    hue_rotation_max: 25.0
    saturation_shift_max: 0.20000000298
    contrast_scale_max: 0.10000000149
    contrast_center: 0.5
  }
}
postprocessing_config {
  target_class_config {
    key: "Bola"
    value {
      clustering_config {
        coverage_threshold: 0.00499999988824
        dbscan_eps: 0.20000000298
        dbscan_min_samples: 0.0500000007451
        minimum_bounding_box_height: 20
      }
    }
  }
}
model_config {
  pretrained_model_file: "/workspace/tlt-experiments/pretrained_resnet18/tlt_resnet18_detectnet_v2_v1/resnet18.hdf5"
  num_layers: 18
  use_batch_norm: true
  activation {
    activation_type: "relu"
  }
  objective_set {
    bbox {
      scale: 35.0
      offset: 0.5
    }
    cov {
    }
  }
  training_precision {
    backend_floatx: FLOAT32
  }
  arch: "resnet"
}
evaluation_config {
  validation_period_during_training: 10
  first_validation_epoch: 1
  minimum_detection_ground_truth_overlap {
    key: "Bola"
    value: 0.699999988079
  }
  evaluation_box_config {
    key: "Bola"
    value {
      minimum_height: 20
      maximum_height: 9999
      minimum_width: 10
      maximum_width: 9999
    }
  }
  average_precision_mode: INTEGRATE
}
cost_function_config {
  target_classes {
    name: "Bola"
    class_weight: 1.0
    coverage_foreground_weight: 0.0500000007451
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  enable_autoweighting: true
  max_objective_weight: 0.999899983406
  min_objective_weight: 9.99999974738e-05
}
training_config {
  batch_size_per_gpu: 4
  num_epochs: 120
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-06
      max_learning_rate: 5e-04
      soft_start: 0.10000000149
      annealing: 0.699999988079
    }
  }
  regularizer {
    type: L1
    weight: 3.00000002618e-09
  }
  optimizer {
    adam {
      epsilon: 9.99999993923e-09
      beta1: 0.899999976158
      beta2: 0.999000012875
    }
  }
  cost_scaling {
    initial_exponent: 20.0
    increment: 0.005
    decrement: 1.0
  }
  checkpoint_interval: 10
}
bbox_rasterizer_config {
  target_class_config {
    key: "Bola"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 0.40000000596
      cov_radius_y: 0.40000000596
      bbox_min_radius: 1.0
    }
  }
  deadzone_radius: 0.67
}

Here is my detectnet_v2_tfrecords_kitti_trainval.txt

kitti_config {
  root_directory_path: "/workspace/tlt-experiments/data/training"
  image_dir_name: "Bola2"
  label_dir_name: "Label2"
  image_extension: ".jpg"
  partition_mode: "random"
  num_partitions: 2
  val_split: 5
  num_shards: 10
}
image_directory_path: "/workspace/tlt-experiments/data/training"

Do you have any idea?

Morganh · November 26, 2019, 2:58am

Hi m.billson16,
I can reproduce your error when generating tfrecord files from a very small number of images/labels via “tlt-dataset-convert”.

The root cause is that some tfrecord files’ size is zero.
Please double check your tfrecord files.
Thanks.

Also, please consider one of the solutions below:

Add more images/labels
or
increase the val_split value. The validation fold (defaults to fold=0) contains val_split% of the data, while the train fold contains (100-val_split)% of the data.
or
decrease num_shards
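
To double check the tfrecord files for zero-size shards, a short script along these lines may help. It is only a sketch: the helper name find_empty_files is hypothetical, and the glob pattern is the tfrecords_path from the training spec above, so adjust it if your files live elsewhere.

```python
import glob
import os


def find_empty_files(pattern):
    """Return the sorted paths matching `pattern` whose size is zero bytes."""
    return sorted(
        path for path in glob.glob(pattern)
        if os.path.isfile(path) and os.path.getsize(path) == 0
    )


# Pattern copied from tfrecords_path in the training spec (an assumption
# about where the shards were written).
pattern = "/workspace/tlt-experiments/tfrecords/kitti_trainval/*"
for path in find_empty_files(pattern):
    print("empty shard: " + path)
```

If any path is printed, that shard contains no records and the dataset should be regenerated with one of the changes listed above.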

m.billson16 · November 26, 2019, 4:41am

Hello Morganh, Thank you for your help. I will do another experiment.
For the images and labels, are there any minimum requirements? For example, should the dataset have more than 100 or 1000 images?
For the val_split, I think I will set it to 20.
For the num_shards, can I ask what it actually is? The supported values are between 1 and 20 and the default is 10, so what is the best value for generating the TFRecords?

m.billson16 · November 26, 2019, 7:01am

Hello Morganh. I tried your advice. I increased the images and labels to 150, decreased num_shards to 5, and set val_split to 20. But I still got this error when using tlt-dataset-convert. All my images have a resolution of 1024 x 576, and all my labels have 15 fields.

Converting Tfrecords for kitti trainval dataset
Using TensorFlow backend.
2019-11-26 07:05:02,698 - iva.detectnet_v2.dataio.build_converter - INFO - Instantiating a kitti converter
2019-11-26 07:05:02,699 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Num images in
Train: 120	Val: 30
2019-11-26 07:05:02,700 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Validation data in partition 0. Hence, while choosing the validationset during training choose validation_fold 0.
2019-11-26 07:05:02,700 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 0
/usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/dataio/kitti_converter_lib.py:266: VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.
2019-11-26 07:05:02,713 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 1
2019-11-26 07:05:02,717 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 2
2019-11-26 07:05:02,722 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 3
Traceback (most recent call last):
  File "/usr/local/bin/tlt-dataset-convert", line 10, in <module>
    sys.exit(main())
  File "./detectnet_v2/scripts/dataset_convert.py", line 64, in main
  File "./detectnet_v2/dataio/dataset_converter_lib.py", line 74, in convert
  File "./detectnet_v2/dataio/dataset_converter_lib.py", line 108, in _write_partitions
  File "./detectnet_v2/dataio/dataset_converter_lib.py", line 149, in _write_shard
  File "./detectnet_v2/dataio/kitti_converter_lib.py", line 169, in _create_example_proto
  File "./detectnet_v2/dataio/kitti_converter_lib.py", line 272, in _add_targets
TypeError: object of type 'int' has no len()

Do you have any idea?
I also want to ask how num_shards should be calculated. I think the main problem is that I set the wrong num_shards.

Morganh · November 26, 2019, 8:44am

Hi m.billson16,
The minimum requirement is 2 images.
One image is for training and the other is for validation. In that case, set (val_split, num_shards) to (50, 1).
val_images is (val_split)% of the total images. train_images is (100-val_split)% of the total images.

Please make sure both conditions below hold at the same time.

val_images >= num_shards
train_images >= num_shards

Your original issue results from “Train: 38 Val: 9”. The number of val images is less than num_shards (10).
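
The two conditions can be checked before running the converter with a few lines of Python. This is a rough sketch, not the converter's own logic: check_shards is a hypothetical helper, it assumes the validation count is val_split percent of the total rounded down, and the total of 47 images with val_split = 20 is an assumption chosen because it reproduces the 38/9 split.

```python
def check_shards(total_images, val_split, num_shards):
    """Return (train_images, val_images, ok) for a given split.

    Assumption: val_images is val_split percent of the total, rounded
    down. The converter's exact rounding may differ.
    """
    val_images = total_images * val_split // 100
    train_images = total_images - val_images
    ok = val_images >= num_shards and train_images >= num_shards
    return train_images, val_images, ok


print(check_shards(47, 20, 10))   # (38, 9, False): 9 val images < 10 shards
print(check_shards(150, 20, 5))   # (120, 30, True)
print(check_shards(2, 50, 1))     # (1, 1, True): the 2-image minimum
```

When ok is False, either lower num_shards, raise val_split, or add images until both counts reach num_shards.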

m.billson16 · November 26, 2019, 9:08am

Hello Morganh, actually I have updated my dataset. Now I have 150 images. With val_split = 20, that means I have 120 train_images and 30 val_images. I tried num_shards = 5 and still got the same error. Do you have any idea how to solve this?

Converting Tfrecords for kitti trainval dataset
Using TensorFlow backend.
2019-11-26 09:08:03,816 - iva.detectnet_v2.dataio.build_converter - INFO - Instantiating a kitti converter
2019-11-26 09:08:03,817 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Num images in
Train: 120	Val: 30
2019-11-26 09:08:03,817 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Validation data in partition 0. Hence, while choosing the validationset during training choose validation_fold 0.
2019-11-26 09:08:03,817 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 0
/usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/dataio/kitti_converter_lib.py:266: VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.
Traceback (most recent call last):
  File "/usr/local/bin/tlt-dataset-convert", line 10, in <module>
    sys.exit(main())
  File "./detectnet_v2/scripts/dataset_convert.py", line 64, in main
  File "./detectnet_v2/dataio/dataset_converter_lib.py", line 74, in convert
  File "./detectnet_v2/dataio/dataset_converter_lib.py", line 108, in _write_partitions
  File "./detectnet_v2/dataio/dataset_converter_lib.py", line 149, in _write_shard
  File "./detectnet_v2/dataio/kitti_converter_lib.py", line 169, in _create_example_proto
  File "./detectnet_v2/dataio/kitti_converter_lib.py", line 272, in _add_targets
TypeError: object of type 'int' has no len()

Morganh · November 26, 2019, 10:09am

Hi m.billson16,
OK, so the original issue “extract_tfrecords_features StopIteration” is resolved.
For the issue “TypeError: object of type ‘int’ has no len()”, could you please narrow it down by reducing the number of images/labels?
There is a high probability that something is wrong in the KITTI label folder or in one of the KITTI label text files.
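
One way to narrow it down without removing files by hand is to scan every label file for lines that look malformed. The sketch below is only a starting point: find_bad_kitti_lines is a hypothetical helper, the label directory is assumed from root_directory_path and label_dir_name in the converter spec above, and the checks (15 whitespace-separated fields, a non-numeric class name first, numbers after it) are guesses at common problems, not a confirmed cause of the TypeError.

```python
import glob


def is_number(value):
    """Return True if `value` parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def find_bad_kitti_lines(text, expected_fields=15):
    """Return a list of (line_number, reason) for suspicious label lines."""
    lines = text.splitlines()
    if not any(line.strip() for line in lines):
        return [(0, "file is empty")]
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            problems.append((number, "blank line"))
        elif len(fields) != expected_fields:
            problems.append((number, "expected %d fields, found %d"
                             % (expected_fields, len(fields))))
        elif is_number(fields[0]):
            problems.append((number, "class name looks numeric"))
        elif not all(is_number(value) for value in fields[1:]):
            problems.append((number, "non-numeric value after class name"))
    return problems


# Label directory assumed from the converter spec (root path + "Label2").
for path in sorted(glob.glob("/workspace/tlt-experiments/data/training/Label2/*.txt")):
    with open(path) as handle:
        for number, reason in find_bad_kitti_lines(handle.read()):
            print("%s:%d: %s" % (path, number, reason))
```

Any file reported here is a good first candidate to remove or fix before rerunning tlt-dataset-convert.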

m.billson16 · November 26, 2019, 10:18am

Hello Morganh, thanks again for your big help.
Actually, I tried another experiment using my previous dataset, which contains 47 images for training data, with val_split = 20 and num_shards = 2; the resolution is about 1280 x 720. I can successfully generate the TFRecords and even run tlt-train, but the problem is that the average precision is 0%. I expect that when I later deploy it to DeepStream via tlt-converter, my system will not detect the object that I want.

Validation cost: 0.000003
Mean average_precision (in %): 0.0000

class name      average precision (in %)
------------  --------------------------
Bola                                   0

Do you have any idea about this problem?
