IndexError: index 6 is out of bounds for axis 1 with size 6 while training with FasterRCNN.

root@3b0b9c604317:/home/samjth/NVIDIA_Transfer_Learning _Toolkit# tlt-train faster_rcnn -e specs/test_config.txt 
Using TensorFlow backend.
2019-11-29 13:06:46.235100: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-11-29 13:06:46.317166: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-29 13:06:46.317583: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x780d820 executing computations on platform CUDA. Devices:
2019-11-29 13:06:46.317604: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): GeForce GTX 1070, Compute Capability 6.1
2019-11-29 13:06:46.319434: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3192000000 Hz
2019-11-29 13:06:46.320171: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x792a210 executing computations on platform Host. Devices:
2019-11-29 13:06:46.320189: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-11-29 13:06:46.320274: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.7465
pciBusID: 0000:01:00.0
totalMemory: 7.93GiB freeMemory: 7.25GiB
2019-11-29 13:06:46.320291: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-11-29 13:06:46.320767: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-29 13:06:46.320781: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-11-29 13:06:46.320789: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-11-29 13:06:46.320846: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7056 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1)
2019-11-29 13:06:46,328 [INFO] /usr/local/lib/python2.7/dist-packages/iva/faster_rcnn/scripts/train.pyc: valid_class_mapping: {u'Closed-column-tip': 4, u'Stack-diameter-change-zone': 5, u'Platforms': 0, u'Stack-shells': 2, u'Ladders': 1, u'Stack-tip': 3, u'Flare-tip': 6}
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-11-29 13:06:46,334 [WARNING] tensorflow: From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py:3445: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
2019-11-29 13:06:46,576 [WARNING] tensorflow: From /usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py:3445: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            (None, 3, None, None 0                                            
__________________________________________________________________________________________________
block1_conv1 (Conv2D)           (None, 64, None, Non 1792        input_1[0][0]                    
__________________________________________________________________________________________________
block1_conv2 (Conv2D)           (None, 64, None, Non 36928       block1_conv1[0][0]               
__________________________________________________________________________________________________
block1_pool (MaxPooling2D)      (None, 64, None, Non 0           block1_conv2[0][0]               
__________________________________________________________________________________________________
block2_conv1 (Conv2D)           (None, 128, None, No 73856       block1_pool[0][0]                
__________________________________________________________________________________________________
block2_conv2 (Conv2D)           (None, 128, None, No 147584      block2_conv1[0][0]               
__________________________________________________________________________________________________
block2_pool (MaxPooling2D)      (None, 128, None, No 0           block2_conv2[0][0]               
__________________________________________________________________________________________________
block3_conv1 (Conv2D)           (None, 256, None, No 295168      block2_pool[0][0]                
__________________________________________________________________________________________________
block3_conv2 (Conv2D)           (None, 256, None, No 590080      block3_conv1[0][0]               
__________________________________________________________________________________________________
block3_conv3 (Conv2D)           (None, 256, None, No 590080      block3_conv2[0][0]               
__________________________________________________________________________________________________
block3_pool (MaxPooling2D)      (None, 256, None, No 0           block3_conv3[0][0]               
__________________________________________________________________________________________________
block4_conv1 (Conv2D)           (None, 512, None, No 1180160     block3_pool[0][0]                
__________________________________________________________________________________________________
block4_conv2 (Conv2D)           (None, 512, None, No 2359808     block4_conv1[0][0]               
__________________________________________________________________________________________________
block4_conv3 (Conv2D)           (None, 512, None, No 2359808     block4_conv2[0][0]               
__________________________________________________________________________________________________
block4_pool (MaxPooling2D)      (None, 512, None, No 0           block4_conv3[0][0]               
__________________________________________________________________________________________________
block5_conv1 (Conv2D)           (None, 512, None, No 2359808     block4_pool[0][0]                
__________________________________________________________________________________________________
block5_conv2 (Conv2D)           (None, 512, None, No 2359808     block5_conv1[0][0]               
__________________________________________________________________________________________________
block5_conv3 (Conv2D)           (None, 512, None, No 2359808     block5_conv2[0][0]               
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 256, 4, 1)    0                                            
__________________________________________________________________________________________________
crop_and_resize_1 (CropAndResiz (256, 512, 14, 14)   0           block5_conv3[0][0]               
                                                                 input_2[0][0]                    
__________________________________________________________________________________________________
classifier_pool (MaxPooling2D)  (256, 512, 7, 7)     0           crop_and_resize_1[0][0]          
__________________________________________________________________________________________________
classifier_flatten (Flatten)    (256, 25088)         0           classifier_pool[0][0]            
__________________________________________________________________________________________________
fc1 (Dense)                     (256, 4096)          102764544   classifier_flatten[0][0]         
__________________________________________________________________________________________________
dropout_1 (Dropout)             (256, 4096)          0           fc1[0][0]                        
__________________________________________________________________________________________________
fc2 (Dense)                     (256, 4096)          16781312    dropout_1[0][0]                  
__________________________________________________________________________________________________
dropout_2 (Dropout)             (256, 4096)          0           fc2[0][0]                        
__________________________________________________________________________________________________
rpn_conv1 (Conv2D)              (None, 512, None, No 2359808     block5_conv3[0][0]               
__________________________________________________________________________________________________
dense_class (Dense)             (256, 7)             28679       dropout_2[0][0]                  
__________________________________________________________________________________________________
dense_regress (Dense)           (256, 24)            98328       dropout_2[0][0]                  
__________________________________________________________________________________________________
rpn_out_class (Conv2D)          (None, 36, None, Non 18468       rpn_conv1[0][0]                  
__________________________________________________________________________________________________
rpn_out_regress (Conv2D)        (None, 144, None, No 73872       rpn_conv1[0][0]                  
__________________________________________________________________________________________________
TF_reshape_2_class (TFReshape)  (1, 256, 7)          0           dense_class[0][0]                
__________________________________________________________________________________________________
TF_reshape_3_regr (TFReshape)   (1, 256, 24)         0           dense_regress[0][0]              
==================================================================================================
Total params: 136,839,699
Trainable params: 136,579,539
Non-trainable params: 260,160
__________________________________________________________________________________________________
2019-11-29 13:06:46,636 [INFO] /usr/local/lib/python2.7/dist-packages/iva/faster_rcnn/scripts/train.pyc: Loading pretrained weights from /home/samjth/NVIDIA_Transfer_Learning _Toolkit/tlt_resnet50_faster_rcnn_v1/resnet50.h5
2019-11-29 13:06:46,850 [INFO] /usr/local/lib/python2.7/dist-packages/iva/faster_rcnn/scripts/train.pyc: Pretrained weights loaded!
2019-11-29 13:06:47,082 [INFO] /usr/local/lib/python2.7/dist-packages/iva/faster_rcnn/scripts/train.pyc: training example num: 1952
2019-11-29 13:06:47,533 [INFO] /usr/local/lib/python2.7/dist-packages/iva/faster_rcnn/scripts/train.pyc: Starting training
2019-11-29 13:06:47,533 [INFO] /usr/local/lib/python2.7/dist-packages/iva/faster_rcnn/scripts/train.pyc: Epoch 1/12
Found 1952 examples in training dataset, valid image extension isjpg, jpeg and png(case sensitive)

Compressed_class_mapping: {u'Closed-column-tip': 4, u'Stack-diameter-change-zone': 5, u'Stack-tip': 3, u'Stack-shells': 2, u'Ladders': 1, u'Platforms': 0, u'Flare-tip': 6}

Name mapping:{u'Ladders': u'Ladders', u'Platforms': u'Platforms', u'Stack-diameter-change-zone': u'Stack-diameter-change-zone', u'Closed-column-tip': u'Closed-column-tip', u'Stack-tip': u'Stack-tip', u'Stack-shells': u'Stack-shells', u'Flare-tip': u'Flare-tip'}

Training dataset stats(compressed via class mapping):

{u'Closed-column-tip': 238, u'Stack-diameter-change-zone': 196, u'Platforms': 3844, u'Stack-shells': 3704, u'Ladders': 3041, u'Stack-tip': 685, u'Flare-tip': 86}


WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
2019-11-29 13:06:56,728 [WARNING] tensorflow: From /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
2019-11-29 13:07:08.670807: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-11-29 13:07:08.979468: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.29GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-11-29 13:07:09.060185: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.29GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-11-29 13:07:09.603786: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.01GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
   3/1952 [..............................] - ETA: 4:14:00 - rpn_cls: 0.7031 - rpn_regr: 0.1813 - detector_cls: 1.6736 - detector_regr: 0.1816Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 10, in <module>
    sys.exit(main())
  File "./common/magnet_train.py", line 30, in main
  File "./faster_rcnn/scripts/train.py", line 287, in main
  File "./faster_rcnn/utils/roi_helpers.py", line 76, in calc_iou_np
IndexError: index 6 is out of bounds for axis 1 with size 6

Please check the attached FasterRCNN config file.
test_config.txt (3.72 KB)

The last class must be background. Please see the FasterRCNN spec section in the TLT user guide.
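The off-by-one behind the IndexError can be illustrated with a minimal NumPy sketch (this is not TLT's actual code, just the indexing mechanism): with 7 foreground classes and no background entry, per-class arrays end up one column short, so class id 6 falls out of bounds on axis 1.

```python
import numpy as np

num_classes = 7
# Without a background entry, axis 1 only has size 6:
no_background = np.zeros((4, num_classes - 1))

try:
    no_background[0, 6] = 1.0  # class id 6 -> IndexError
except IndexError as err:
    print(err)  # index 6 is out of bounds for axis 1 with size 6

# Reserving the last slot for background makes axis 1 large enough:
with_background = np.zeros((4, num_classes))
with_background[0, 6] = 1.0  # now valid
```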

I have added the background class in the config file and training has now started, but it's showing "No GT bboxes found in image" and also "No positive ROIs".

61/1952 [..............................] - ETA: 38:00 - rpn_cls: 0.5184 - rpn_regr: 0.2091 - detector_cls: 0.7940 - detector_regr: 0.2975No positive ROIs.
  66/1952 [>.............................] - ETA: 36:54 - rpn_cls: 0.5079 - rpn_regr: 0.2043 - detector_cls: 0.7728 - detector_regr: 0.2960No positive ROIs.
  67/1952 [>.............................] - ETA: 36:41 - rpn_cls: 0.5059 - rpn_regr: 0.2034 - detector_cls: 0.7687 - detector_regr: 0.2956No positive ROIs.
  71/1952 [>.............................] - ETA: 35:55 - rpn_cls: 0.4980 - rpn_regr: 0.1997 - detector_cls: 0.7531 - detector_regr: 0.2940No positive ROIs.
  83/1952 [>.............................] - ETA: 34:03 - rpn_cls: 0.4779 - rpn_regr: 0.1913 - detector_cls: 0.7123 - detector_regr: 0.2885No positive ROIs.
 107/1952 [>.............................] - ETA: 31:32 - rpn_cls: 0.4479 - rpn_regr: 0.1783 - detector_cls: 0.6531 - detector_regr: 0.2793No GT bboxes found in image/home/samjth/NVIDIA_Transfer_Learning _Toolkit/dataset/images/014m_NE_6820.jpg

 109/1952 [>.............................] - ETA: 31:24 - rpn_cls: 0.4457 - rpn_regr: 0.1774 - detector_cls: 0.6492 - detector_regr: 0.2787No GT bboxes found in image/home/samjth/NVIDIA_Transfer_Learning _Toolkit/dataset/images/010m_SS_598 (2).jpg

For “no GT bboxes”, please check the corresponding labels one by one.
For “no positive ROIs”, please refer to https://devtalk.nvidia.com/default/topic/1065592/transfer-learning-toolkit/faster-rcnn-roi-issue/
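Checking nearly 2000 label files one by one is tedious; a small script can flag the files with no usable box. A sketch, assuming KITTI-format labels (bbox in fields 5–8) in a hypothetical labels directory:

```python
import os
import tempfile

def boxes_in_label(path):
    """Count well-formed KITTI boxes (xmin < xmax, ymin < ymax) in one label file."""
    count = 0
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 8:
                continue
            xmin, ymin, xmax, ymax = map(float, fields[4:8])
            if xmax > xmin and ymax > ymin:
                count += 1
    return count

def find_empty_labels(label_dir):
    """Return label files in label_dir that contain no usable ground-truth box."""
    return [name for name in sorted(os.listdir(label_dir))
            if name.endswith('.txt')
            and boxes_in_label(os.path.join(label_dir, name)) == 0]

# Tiny demo on a throwaway directory:
demo = tempfile.mkdtemp()
with open(os.path.join(demo, 'good.txt'), 'w') as f:
    f.write('Ladders 0 0 0 10.0 20.0 30.0 40.0 0 0 0 0 0 0 0\n')
with open(os.path.join(demo, 'empty.txt'), 'w') as f:
    f.write('Ladders 0 0 0 30.0 40.0 10.0 20.0 0 0 0 0 0 0 0\n')  # inverted box
print(find_empty_labels(demo))  # ['empty.txt']
```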

I’m getting “No GT bboxes” for every image while training.

Hi samjith888,
Could you please paste several label files here?
For example,
$ cat 0001_label.txt

More, your spec file should set width/height as below:

size_height_width {
height: 2160
width: 4096
}

If you set it as below instead, it will resize the images and labels.

size_min {
min: 600
}

More, according to the requirement in the TLT user guide, W and H should be multiples of 32.
But 2160/32 = 67.5, so the height needs to be changed.
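The nearest valid height can be computed by rounding up to the next multiple of 32; a quick sketch:

```python
def round_up_to_multiple(value, multiple=32):
    """Round value up to the nearest multiple of `multiple` (FasterRCNN wants W % 32 == 0 and H % 32 == 0)."""
    return ((value + multiple - 1) // multiple) * multiple

print(round_up_to_multiple(2160))  # 2176 -- nearest valid height
print(round_up_to_multiple(4096))  # 4096 -- already a multiple of 32
```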

What should I change? Please suggest a final solution to this error.

Please try as below.

size_height_width {
height: 2176
width: 4096
}

Where should I put the above data in my config file?

size_min {
min:600
}

I can see that size_min { min: 600 } is in the following two sections of the config file. Do I need to replace both sections?

random_seed: 42
enc_key: "<your_enc_key>"
verbose: True
network_config {
  input_image_config {
    image_type: RGB
    image_channel_order: 'bgr'
    size_min {
      min: 600
    }
  }
network_config {
  input_image_config {
    image_type: RGB
    image_channel_order: 'bgr'
    size_min {
      min: 600
    }

Hi samjith888,
There are two “network_config” blocks in your spec file. You added a repeated one by mistake.
Please check the file and remove it.
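Accidental duplicates like this are easy to miss by eye in a long spec. A rough way to spot them is counting how often each block name opens at the top level of the file; a sketch using simple brace-depth tracking (a heuristic, not a real protobuf-text parser):

```python
import re
from collections import Counter

def count_top_level_blocks(spec_text):
    """Count occurrences of `name {` at zero brace depth (rough heuristic)."""
    counts = Counter()
    depth = 0
    for line in spec_text.splitlines():
        stripped = line.strip()
        m = re.match(r'(\w+)\s*\{', stripped)
        if m and depth == 0:
            counts[m.group(1)] += 1
        depth += stripped.count('{') - stripped.count('}')
    return counts

spec = """\
random_seed: 42
network_config {
  input_image_config {
    image_type: RGB
  }
}
network_config {
}
"""
dupes = {k: v for k, v in count_top_level_blocks(spec).items() if v > 1}
print(dupes)  # {'network_config': 2}
```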

I have removed the extra network_config from the spec file and edited it using the information above, but I'm now getting the following error: “Resource exhausted: OOM when allocating tensor with shape[1,1024,136,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc”. I already hit the same error while training with DetectNet, which is not resolved yet. Please suggest a solution for this.

2019-12-02 07:50:49.625268: W tensorflow/core/common_runtime/bfc_allocator.cc:271] ****************************************************************************************************
2019-12-02 07:50:49.625291: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at conv_ops.cc:446 : Resource exhausted: OOM when allocating tensor with shape[1,1024,136,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 10, in <module>
    sys.exit(main())
  File "./common/magnet_train.py", line 30, in main
  File "./faster_rcnn/scripts/train.py", line 309, in main
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1217, in train_on_batch
    outputs = self.train_function(ins)
  File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1439, in __call__
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,1024,136,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node block_3a_bn_shortcut/batchnorm/add_1}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[{{node loss_1/add_61}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Hi samjith888
Can you run nvidia-smi and paste the result here?
$ nvidia-smi

The result is below:

root@3b0b9c604317:/home/samjth/NVIDIA_Transfer_Learning _Toolkit# nvidia-smi
Mon Dec  2 08:15:04 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.26       Driver Version: 440.26       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 00000000:01:00.0  On |                  N/A |
|  0%   38C    P8    17W / 200W |    191MiB /  8116MiB |      6%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Hi samjith888,
The FasterRCNN app will resize the input images on the fly during training/evaluation/inference when the images’ sizes differ from what is specified in the experiment spec.
So, for FasterRCNN, could you set a lower input size in the spec and retry?

Change below

size_height_width {
height: 2176
width: 4096
}

to

size_height_width {
height: 544
width: 1024
}
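The memory saving from this change is easy to estimate: activation maps scale with the input area, so shrinking the input shrinks every feature map proportionally. A rough sketch (float32 assumed; the tensor shape is taken from the OOM message above):

```python
def tensor_bytes(shape, dtype_bytes=4):
    """Approximate memory for one dense float32 tensor."""
    n = 1
    for d in shape:
        n *= d
    return n * dtype_bytes

# The tensor that failed to allocate in the OOM traceback:
oom = tensor_bytes((1, 1024, 136, 256))
print(oom / 2**20)  # 136.0 MiB for a single activation

# Shrinking the input from 2176x4096 to 544x1024 reduces the area,
# and hence every activation map, by a factor of 16:
print((2176 * 4096) / (544 * 1024))  # 16.0
```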

Hi Morganh,

I have trained the model after changing the parameters in the spec file, but got mAP = 0.3684 at epoch 14, which is low accuracy. I have also noticed that the label.txt values have changed

Stack-shells 0.0 0 0.0 1494.294528 2.90952 2040.758272 2160.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Stack-shells 0.0 0 0.0 2842.312704 0.0 3525.218304 2160.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

to

Ladders 0 0 0 3442.40 86.15 3634.27 1091.55 0 0 0 0 0 0 0
Platforms 0 0 0 1090.98 20.34 3895.21 1311.55 0 0 0 0 0 0 0
Stack-shells 0 0 0 2843.30 0.00 3440.48 2089.02 0 0 0 0 0 0 0
Stack-shells 0 0 0 1516.00 77.19 2077.95 2129.76 0 0 0 0 0 0 0
Stack-shells 0 0 0 1148.16 19.90 3311.54 2147.41 0 0 0 0 0 0 0

This means the bbox values have changed, along with other values, and extra classes were added. All of this happened after training. Please check my spec file.
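Whether the label files were actually rewritten on disk can be checked directly, e.g. by recording a checksum of each label file before training and comparing afterwards; a sketch with a hypothetical labels directory:

```python
import hashlib
import os
import tempfile

def label_checksums(label_dir):
    """Map each label file name to the MD5 of its contents."""
    sums = {}
    for name in sorted(os.listdir(label_dir)):
        if name.endswith('.txt'):
            with open(os.path.join(label_dir, name), 'rb') as f:
                sums[name] = hashlib.md5(f.read()).hexdigest()
    return sums

# Demo: snapshot the labels, simulate a file being modified, compare.
labels = tempfile.mkdtemp()
with open(os.path.join(labels, '0001.txt'), 'w') as f:
    f.write('Stack-shells 0.0 0 0.0 1494.29 2.91 2040.76 2160.0 0 0 0 0 0 0 0\n')

before = label_checksums(labels)
with open(os.path.join(labels, '0001.txt'), 'w') as f:  # "training" rewrites it
    f.write('Stack-shells 0 0 0 1516.00 77.19 2077.95 2129.76 0 0 0 0 0 0 0\n')
after = label_checksums(labels)

changed = [n for n in before if before[n] != after.get(n)]
print(changed)  # ['0001.txt']
```

In practice you would call label_checksums on the real dataset directory before and after running tlt-train.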

Hi samjith888,
During FasterRCNN training, nothing is written to the label txt files.
Why do you think the label.txt files changed?
Can you give more detailed info?