Error training Faster RCNN model

Hi,

I am trying to train a Faster RCNN model (i.e., tlt_resnet10_faster_rcnn_v1).

When I run this command:

tlt-train faster_rcnn -e /workspace/nvidia_experiment/frcnn.config

I get this error:

2019-10-16 07:10:39,722 [INFO] /usr/local/lib/python2.7/dist-packages/iva/faster_rcnn/scripts/train.pyc: Loading pretrained weights from /workspace/nvidia_experiment/tlt_resnet10_faster_rcnn_v1/resnet10.h5
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 10, in <module>
    sys.exit(main())
  File "./common/magnet_train.py", line 30, in main
  File "./faster_rcnn/scripts/train.py", line 232, in main
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/network.py", line 1163, in load_weights
    reshape=reshape)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/saving.py", line 1130, in load_weights_from_hdf5_group_by_name
    ' element(s).')
ValueError: Layer #4 (named "block_1a_conv_1") expects 1 weight(s), but the saved weights have 2 element(s).

I am also getting the same error when I change the model to the resnet50_faster_rcnn model.

Any help on how to fix this would be great!

My config file is:

random_seed: 42
enc_key: "mYaPIkey"
verbose: True
network_config {
input_image_config {
image_type: RGB
image_channel_order: 'bgr'
    size_min {
min:600
}
    image_channel_mean {
        key: 'b'
        value: 103.939
}
    image_channel_mean {
        key: 'g'
        value: 116.779
}
    image_channel_mean {
        key: 'r'
        value: 123.68
}
    image_scaling_factor: 1.0
}
feature_extractor: "resnet:50"
anchor_box_config {
scale: 128.0
scale: 256.0
scale: 512.0
ratio: 1.0
ratio: 0.5
ratio: 2.0
}
freeze_bn: True
freeze_blocks: 1
freeze_blocks: 2
roi_mini_batch: 256
rpn_stride: 16
conv_bn_share_bias: False
roi_pooling_config {
pool_size: 7
pool_size_2x: True
}
all_projections: True
use_pooling: False
}


training_config {
kitti_data_config {
images_dir: '/workspace/nvidia_experiment/dataset/images'
labels_dir: '/workspace/nvidia_experiment/dataset/labels'
}
training_data_parser: 'raw_kitti'
data_augmentation {
use_augmentation: True
spatial_augmentation {
hflip_probability: 0.5
vflip_probability: 0.0
zoom_min: 1.0
zoom_max: 1.0
translate_max_x: 0
translate_max_y: 0
}
color_augmentation {
color_shift_stddev: 0.1
hue_rotation_max: 25.0
saturation_shift_max: 0.2
contrast_scale_max: 0.2
contrast_center: 0.5
}
}
num_epochs: 5
class_mapping {
key: 'customer'
value: 0
}

pretrained_model: ""
pretrained_weights: "/workspace/nvidia_experiment/tlt_resnet10_faster_rcnn_v1/resnet10.h5"
output_weights: "/workspace/nvidia_experiment/training_output/frcnn_resnet10_epochs5.tltw"
output_model: "/workspace/nvidia_experiment/training_output/frcnn_resnet10_epochs5.tlt"
rpn_min_overlap: 0.3
rpn_max_overlap: 0.7
classifier_min_overlap: 0.0
classifier_max_overlap: 0.5
gt_as_roi: False
std_scaling: 1.0
classifier_regr_std {
key: 'x'
value: 10.0
}
classifier_regr_std {
key: 'y'
value: 10.0
}
classifier_regr_std {
key: 'w'
value: 5.0
}
classifier_regr_std {
key: 'h'
value: 5.0
}

rpn_mini_batch: 256
rpn_pre_nms_top_N: 12000
rpn_nms_max_boxes: 2000
rpn_nms_overlap_threshold: 0.7
reg_config {
reg_type: 'L2'
weight_decay: 1e-4
}

optimizer {
adam {
lr: 0.00001
beta_1: 0.9
beta_2: 0.999
decay: 0.0
}
}

lr_scheduler {
step {
base_lr: 0.00001
gamma: 1.0
step_size: 30
}
}

lambda_rpn_regr: 1.0
lambda_rpn_class: 1.0
lambda_cls_regr: 1.0
lambda_cls_class: 1.0

}

Hi
In your spec file, the two lines below do not match. Could you please align them?
Line25 ==> feature_extractor: "resnet:50"
Line80 ==> pretrained_weights: "/workspace/nvidia_experiment/tlt_resnet10_faster_rcnn_v1/resnet10.h5"
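
For example, if you intend to start from the downloaded ResNet-10 pretrained weights, the aligned lines would look something like the sketch below (using the paths already in your spec); alternatively, keep feature_extractor: "resnet:50" and point pretrained_weights at a ResNet-50 .h5 file instead.

feature_extractor: "resnet:10"
pretrained_weights: "/workspace/nvidia_experiment/tlt_resnet10_faster_rcnn_v1/resnet10.h5"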

Indeed. My bad.

Done.

PS: the error still persists though…

Could you paste your latest spec? Thanks.

Here you go…

random_seed: 42
enc_key: "apiKey"
verbose: True
network_config {
input_image_config {
image_type: RGB
image_channel_order: 'bgr'
    size_min {
min:600
}
    image_channel_mean {
        key: 'b'
        value: 103.939
}
    image_channel_mean {
        key: 'g'
        value: 116.779
}
    image_channel_mean {
        key: 'r'
        value: 123.68
}
    image_scaling_factor: 1.0
}
feature_extractor: "resnet:10"
anchor_box_config {
scale: 128.0
scale: 256.0
scale: 512.0
ratio: 1.0
ratio: 0.5
ratio: 2.0
}
freeze_bn: True
freeze_blocks: 1
freeze_blocks: 2
roi_mini_batch: 256
rpn_stride: 16
conv_bn_share_bias: False
roi_pooling_config {
pool_size: 7
pool_size_2x: True
}
all_projections: True
use_pooling: False
}


training_config {
kitti_data_config {
images_dir: '/workspace/nvidia_experiment/dataset/images'
labels_dir: '/workspace/nvidia_experiment/dataset/labels'
}
training_data_parser: 'raw_kitti'
data_augmentation {
use_augmentation: True
spatial_augmentation {
hflip_probability: 0.5
vflip_probability: 0.0
zoom_min: 1.0
zoom_max: 1.0
translate_max_x: 0
translate_max_y: 0
}
color_augmentation {
color_shift_stddev: 0.1
hue_rotation_max: 25.0
saturation_shift_max: 0.2
contrast_scale_max: 0.2
contrast_center: 0.5
}
}
num_epochs: 5
class_mapping {
key: 'customer'
value: 0
}

pretrained_model: ""
pretrained_weights: "/workspace/nvidia_experiment/tlt_resnet10_faster_rcnn_v1/resnet10.h5"
output_weights: "/workspace/nvidia_experiment/training_output/frcnn_resnet10_epochs5.tltw"
output_model: "/workspace/nvidia_experiment/training_output/frcnn_resnet10_epochs5.tlt"
rpn_min_overlap: 0.3
rpn_max_overlap: 0.7
classifier_min_overlap: 0.0
classifier_max_overlap: 0.5
gt_as_roi: False
std_scaling: 1.0
classifier_regr_std {
key: 'x'
value: 10.0
}
classifier_regr_std {
key: 'y'
value: 10.0
}
classifier_regr_std {
key: 'w'
value: 5.0
}
classifier_regr_std {
key: 'h'
value: 5.0
}

rpn_mini_batch: 256
rpn_pre_nms_top_N: 12000
rpn_nms_max_boxes: 2000
rpn_nms_overlap_threshold: 0.7
reg_config {
reg_type: 'L2'
weight_decay: 1e-4
}

optimizer {
adam {
lr: 0.00001
beta_1: 0.9
beta_2: 0.999
decay: 0.0
}
}

lr_scheduler {
step {
base_lr: 0.00001
gamma: 1.0
step_size: 30
}
}

lambda_rpn_regr: 1.0
lambda_rpn_class: 1.0
lambda_cls_regr: 1.0
lambda_cls_class: 1.0

}

Hi pushkar,
Could you run tlt-train successfully with the default faster_rcnn notebook?

Hi Morgan,

I have not tried the notebook yet.

OK, you can run the notebook with your downloaded pre-trained model (.h5) to narrow down the issue.
I think the differences between it and your current setup are as below:

  1. The notebook uses the KITTI dataset.
  2. The notebook uses the default training spec.

You can try to narrow down the issue as mentioned above. I will look into your spec at the same time.
Thanks.

Will do.

Thanks!

Hi Morgan,

The Jupyter notebook runs fine…

That being said, I have already converted my dataset to KITTI format and have used TLT to fine-tune detectnet models.

Hi,

I was wondering if there were any updates to this?

Thanks

Hi pushkar,
What is your remaining issue? Do you mean you can run the default Jupyter notebook successfully, but cannot run with your new spec and your own dataset?

Hi Morgan,

That is correct.

It worked fine with the Jupyter notebook (on the dataset that the tutorial asks you to download), but it did not work with my dataset.

Just as a note: my dataset worked just fine with the resnet10_detectnet, resnet18_detectnet and resnet50_detectnet models using "tlt-train".

I am trying to use a resnet10_faster_rcnn model to train on this same dataset of mine using tlt-train, but I am getting the error I mentioned in #1.

Hi pushkar,
Could you please paste the full log you mentioned in #1? Thanks.

Hi Morgan,

Here you go…

root@7c8a1e37c38d:/workspace# tlt-train faster_rcnn -e /workspace/nvidia_experiment/frcnn.config
Using TensorFlow backend.
2019-10-28 14:41:43.653385: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-10-28 14:41:43.739853: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-10-28 14:41:43.740640: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x62110a0 executing computations on platform CUDA. Devices:
2019-10-28 14:41:43.740658: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): GeForce GTX 1070 with Max-Q Design, Compute Capability 6.1
2019-10-28 14:41:43.742567: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2208000000 Hz
2019-10-28 14:41:43.743261: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x627c3e0 executing computations on platform Host. Devices:
2019-10-28 14:41:43.743279: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-10-28 14:41:43.743585: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce GTX 1070 with Max-Q Design major: 6 minor: 1 memoryClockRate(GHz): 1.379
pciBusID: 0000:01:00.0
totalMemory: 7.93GiB freeMemory: 7.47GiB
2019-10-28 14:41:43.743601: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-10-28 14:41:43.744450: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-28 14:41:43.744463: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-10-28 14:41:43.744470: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-10-28 14:41:43.744725: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7266 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 6.1)
2019-10-28 14:41:43,749 [INFO] /usr/local/lib/python2.7/dist-packages/iva/faster_rcnn/scripts/train.pyc: valid_class_mapping: {u'customer': 0}
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-10-28 14:41:43,755 [WARNING] tensorflow: From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
/usr/local/lib/python2.7/dist-packages/iva/faster_rcnn/models/custom_layers.py:119: RuntimeWarning: divide by zero encountered in long_scalars
/usr/local/lib/python2.7/dist-packages/iva/faster_rcnn/models/custom_layers.py:121: RuntimeWarning: invalid value encountered in long_scalars
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            (None, 3, None, None 0                                            
__________________________________________________________________________________________________
conv1 (Conv2D)                  (None, 64, None, Non 9472        input_1[0][0]                    
__________________________________________________________________________________________________
bn_conv1 (BatchNormalization)   (None, 64, None, Non 256         conv1[0][0]                      
__________________________________________________________________________________________________
activation_1 (Activation)       (None, 64, None, Non 0           bn_conv1[0][0]                   
__________________________________________________________________________________________________
block_1a_conv_1 (Conv2D)        (None, 64, None, Non 36864       activation_1[0][0]               
__________________________________________________________________________________________________
block_1a_bn_1 (BatchNormalizati (None, 64, None, Non 256         block_1a_conv_1[0][0]            
__________________________________________________________________________________________________
activation_2 (Activation)       (None, 64, None, Non 0           block_1a_bn_1[0][0]              
__________________________________________________________________________________________________
block_1a_conv_2 (Conv2D)        (None, 64, None, Non 36864       activation_2[0][0]               
__________________________________________________________________________________________________
block_1a_conv_shortcut (Conv2D) (None, 64, None, Non 4096        activation_1[0][0]               
__________________________________________________________________________________________________
block_1a_bn_2 (BatchNormalizati (None, 64, None, Non 256         block_1a_conv_2[0][0]            
__________________________________________________________________________________________________
block_1a_bn_shortcut (BatchNorm (None, 64, None, Non 256         block_1a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_1 (Add)                     (None, 64, None, Non 0           block_1a_bn_2[0][0]              
                                                                 block_1a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
activation_3 (Activation)       (None, 64, None, Non 0           add_1[0][0]                      
__________________________________________________________________________________________________
block_2a_conv_1 (Conv2D)        (None, 128, None, No 73728       activation_3[0][0]               
__________________________________________________________________________________________________
block_2a_bn_1 (BatchNormalizati (None, 128, None, No 512         block_2a_conv_1[0][0]            
__________________________________________________________________________________________________
activation_4 (Activation)       (None, 128, None, No 0           block_2a_bn_1[0][0]              
__________________________________________________________________________________________________
block_2a_conv_2 (Conv2D)        (None, 128, None, No 147456      activation_4[0][0]               
__________________________________________________________________________________________________
block_2a_conv_shortcut (Conv2D) (None, 128, None, No 8192        activation_3[0][0]               
__________________________________________________________________________________________________
block_2a_bn_2 (BatchNormalizati (None, 128, None, No 512         block_2a_conv_2[0][0]            
__________________________________________________________________________________________________
block_2a_bn_shortcut (BatchNorm (None, 128, None, No 512         block_2a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_2 (Add)                     (None, 128, None, No 0           block_2a_bn_2[0][0]              
                                                                 block_2a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
activation_5 (Activation)       (None, 128, None, No 0           add_2[0][0]                      
__________________________________________________________________________________________________
block_3a_conv_1 (Conv2D)        (None, 256, None, No 294912      activation_5[0][0]               
__________________________________________________________________________________________________
block_3a_bn_1 (BatchNormalizati (None, 256, None, No 1024        block_3a_conv_1[0][0]            
__________________________________________________________________________________________________
activation_6 (Activation)       (None, 256, None, No 0           block_3a_bn_1[0][0]              
__________________________________________________________________________________________________
block_3a_conv_2 (Conv2D)        (None, 256, None, No 589824      activation_6[0][0]               
__________________________________________________________________________________________________
block_3a_conv_shortcut (Conv2D) (None, 256, None, No 32768       activation_5[0][0]               
__________________________________________________________________________________________________
block_3a_bn_2 (BatchNormalizati (None, 256, None, No 1024        block_3a_conv_2[0][0]            
__________________________________________________________________________________________________
block_3a_bn_shortcut (BatchNorm (None, 256, None, No 1024        block_3a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_3 (Add)                     (None, 256, None, No 0           block_3a_bn_2[0][0]              
                                                                 block_3a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
activation_7 (Activation)       (None, 256, None, No 0           add_3[0][0]                      
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 256, 4, 1)    0                                            
__________________________________________________________________________________________________
crop_and_resize_1 (CropAndResiz (256, 256, 14, 14)   0           activation_7[0][0]               
                                                                 input_2[0][0]                    
__________________________________________________________________________________________________
block_4a_conv_1 (Conv2D)        (256, 512, 7, 7)     1179648     crop_and_resize_1[0][0]          
__________________________________________________________________________________________________
block_4a_bn_1 (BatchNormalizati (256, 512, 7, 7)     2048        block_4a_conv_1[0][0]            
__________________________________________________________________________________________________
activation_8 (Activation)       (256, 512, 7, 7)     0           block_4a_bn_1[0][0]              
__________________________________________________________________________________________________
block_4a_conv_2 (Conv2D)        (256, 512, 7, 7)     2359296     activation_8[0][0]               
__________________________________________________________________________________________________
block_4a_conv_shortcut (Conv2D) (256, 512, 7, 7)     131072      crop_and_resize_1[0][0]          
__________________________________________________________________________________________________
block_4a_bn_2 (BatchNormalizati (256, 512, 7, 7)     2048        block_4a_conv_2[0][0]            
__________________________________________________________________________________________________
block_4a_bn_shortcut (BatchNorm (256, 512, 7, 7)     2048        block_4a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_4 (Add)                     (256, 512, 7, 7)     0           block_4a_bn_2[0][0]              
                                                                 block_4a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
activation_9 (Activation)       (256, 512, 7, 7)     0           add_4[0][0]                      
__________________________________________________________________________________________________
avg_pool (AveragePooling2D)     (256, 512, 1, 1)     0           activation_9[0][0]               
__________________________________________________________________________________________________
classifier_flatten (Flatten)    (256, 512)           0           avg_pool[0][0]                   
__________________________________________________________________________________________________
rpn_conv1 (Conv2D)              (None, 512, None, No 1180160     activation_7[0][0]               
__________________________________________________________________________________________________
dense_class (Dense)             (256, 1)             513         classifier_flatten[0][0]         
__________________________________________________________________________________________________
dense_regress (Dense)           (256, 0)             0           classifier_flatten[0][0]         
__________________________________________________________________________________________________
rpn_out_class (Conv2D)          (None, 9, None, None 4617        rpn_conv1[0][0]                  
__________________________________________________________________________________________________
rpn_out_regress (Conv2D)        (None, 36, None, Non 18468       rpn_conv1[0][0]                  
__________________________________________________________________________________________________
TF_reshape_2_class (TFReshape)  (1, 256, 1)          0           dense_class[0][0]                
__________________________________________________________________________________________________
TF_reshape_3_regr (TFReshape)   (1, 256, 0)          0           dense_regress[0][0]              
==================================================================================================
Total params: 6,119,726
Trainable params: 5,806,638
Non-trainable params: 313,088
__________________________________________________________________________________________________
2019-10-28 14:41:44,209 [INFO] /usr/local/lib/python2.7/dist-packages/iva/faster_rcnn/scripts/train.pyc: Loading pretrained weights from /workspace/nvidia_experiment/tlt_resnet10_faster_rcnn_v1/resnet10.h5
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 10, in <module>
    sys.exit(main())
  File "./common/magnet_train.py", line 30, in main
  File "./faster_rcnn/scripts/train.py", line 232, in main
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/network.py", line 1163, in load_weights
    reshape=reshape)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/saving.py", line 1130, in load_weights_from_hdf5_group_by_name
    ' element(s).')
ValueError: Layer #4 (named "block_1a_conv_1") expects 1 weight(s), but the saved weights have 2 element(s).

The background class is missing from the spec file. The class_mapping should always include a 'background' class.

See section 5.3 of the TLT docs:
For FasterRCNN, the class that is mapped to the largest number is always the 'background' class, due to the implementation. Also, if you want to ignore some classes in the dataset, simply map them to -1. In the previous example, there are 5 classes in the dataset: 'Car', 'Van', 'Person', 'Cyclist', 'Truck'. You want to group 'Car' and 'Van', so map them to 0. You also want to exclude 'Truck', so map 'Truck' to -1. Finally, add a dummy 'background' class that is mapped to the largest number (3).
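
Applied to your spec, which only has the 'customer' class, the class_mapping would look something like the sketch below, with the dummy 'background' class mapped to the next (largest) number, 1:

class_mapping {
key: 'customer'
value: 0
}
class_mapping {
# dummy background class, mapped to the largest number per the doc
key: 'background'
value: 1
}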

Yep. I missed that bit. Thanks!