How to resize KITTI dataset images and labels

xhuv_NV · March 10, 2020, 5:07am

I have learned that to train with TLT, all images in the dataset must be the same size. But in the KITTI dataset, image sizes vary. Is there a script that resizes all the KITTI images and their labels together?

Morganh · March 10, 2020, 5:37am

Hi xhuv,
In TLT version 1.0.1, for the detectnet_v2 and ssd networks, the tlt-train tool does not support training on images of multiple resolutions or resizing images during training. All images must be resized offline to the final training size, and the corresponding bounding boxes must be scaled accordingly.
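
The offline resize described above could be sketched roughly as follows. This is only an illustration, not a TLT-provided tool: the function names, paths, and target size are hypothetical, Pillow is assumed to be installed for the image step, and the labels are assumed to follow the standard KITTI column order (2D box in columns 4 to 7). The 3D fields are left untouched, on the assumption that only the 2D boxes matter for this training.

```python
from pathlib import Path

# KITTI label columns: type, truncated, occluded, alpha,
# bbox_left, bbox_top, bbox_right, bbox_bottom, then the 3D fields.
BBOX_COLUMNS = (4, 5, 6, 7)


def scale_kitti_label_line(line, scale_x, scale_y):
    """Return one KITTI label line with its 2D box scaled by (scale_x, scale_y)."""
    fields = line.split()
    for col in BBOX_COLUMNS:
        # Columns 4 and 6 are x coordinates; 5 and 7 are y coordinates.
        scale = scale_x if col in (4, 6) else scale_y
        fields[col] = f"{float(fields[col]) * scale:.2f}"
    return " ".join(fields)


def resize_sample(image_path, label_path, out_image_path, out_label_path, target_size):
    """Resize one image to target_size=(width, height) and rescale its labels."""
    from PIL import Image  # assumption: Pillow is available

    target_w, target_h = target_size
    with Image.open(image_path) as img:
        src_w, src_h = img.size
        img.resize((target_w, target_h)).save(out_image_path)

    # Scale factors come from the original size of this particular image.
    scale_x, scale_y = target_w / src_w, target_h / src_h
    lines = Path(label_path).read_text().splitlines()
    scaled = [scale_kitti_label_line(l, scale_x, scale_y) for l in lines if l.strip()]
    Path(out_label_path).write_text("\n".join(scaled) + "\n")
```

Calling resize_sample once per image/label pair, with the same target_size for every pair, would give a dataset in which all images share one resolution and the boxes still line up with the objects.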

For the KITTI dataset, the image sizes are mostly the same, so resizing is not needed.
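
To check how uniform the image sizes actually are in a local copy of the dataset, something like the sketch below might help. It assumes the images are PNG files in a single directory (the directory path is whatever applies locally) and reads the width and height straight from each file's PNG header, so no imaging library is needed. The function names are made up for this example.

```python
import struct
from collections import Counter
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(path):
    """Read (width, height) from a PNG file's IHDR chunk without decoding it."""
    with open(path, "rb") as f:
        header = f.read(24)
    # Layout: 8-byte signature, 4-byte chunk length, b"IHDR", width, height.
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError(f"{path} is not a valid PNG file")
    return struct.unpack(">II", header[16:24])


def count_image_sizes(image_dir):
    """Count how many PNG images in image_dir have each (width, height)."""
    return Counter(png_size(p) for p in sorted(Path(image_dir).glob("*.png")))
```

Printing count_image_sizes(...).most_common() for the training image directory lists each distinct resolution with the number of images that have it, which shows at a glance whether any images deviate from the dominant size.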

xhuv_NV · March 10, 2020, 7:36pm

With the original dataset without resizing, when I run

!tlt-train detectnet_v2 -e $SPECS_DIR/train.txt \
                        -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                        -k $KEY \
                        -n resnet18_detector

the training completes, but it shows zero average precision after 120 epochs:

Using TensorFlow backend.
--------------------------------------------------------------------------
[[5279,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: 629ffbf9ff63

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
2020-03-10 17:56:15.977105: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-03-10 17:56:16.112012: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-10 17:56:16.112770: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x5f44020 executing computations on platform CUDA. Devices:
2020-03-10 17:56:16.112787: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): GeForce RTX 2070, Compute Capability 7.5
2020-03-10 17:56:16.114468: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3408000000 Hz
2020-03-10 17:56:16.114843: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x5facdb0 executing computations on platform Host. Devices:
2020-03-10 17:56:16.114860: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2020-03-10 17:56:16.115005: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.86
pciBusID: 0000:01:00.0
totalMemory: 7.76GiB freeMemory: 7.19GiB
2020-03-10 17:56:16.115023: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-03-10 17:56:16.116016: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-03-10 17:56:16.116030: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2020-03-10 17:56:16.116038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2020-03-10 17:56:16.116112: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6998 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-03-10 17:56:16,116 [INFO] iva.detectnet_v2.scripts.train: Loading experiment spec at /workspace/spec_files/train.txt.
2020-03-10 17:56:16,117 [INFO] iva.detectnet_v2.spec_handler.spec_loader: Merging specification from /workspace/spec_files/train.txt
WARNING:tensorflow:From ./detectnet_v2/dataloader/utilities.py:114: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`
2020-03-10 17:56:16,125 [WARNING] tensorflow: From ./detectnet_v2/dataloader/utilities.py:114: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`
2020-03-10 17:56:16,192 [INFO] iva.detectnet_v2.scripts.train: Cannot iterate over exactly 6359 samples with a batch size of 16; each epoch will therefore take one extra step.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2020-03-10 17:56:16,199 [WARNING] tensorflow: From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/horovod/tensorflow/__init__.py:91: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
2020-03-10 17:56:16,211 [WARNING] tensorflow: From /usr/local/lib/python2.7/dist-packages/horovod/tensorflow/__init__.py:91: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            (None, 3, 128, 512)  0                                            
__________________________________________________________________________________________________
conv1 (Conv2D)                  (None, 64, 64, 256)  9472        input_1[0][0]                    
__________________________________________________________________________________________________
bn_conv1 (BatchNormalization)   (None, 64, 64, 256)  256         conv1[0][0]                      
__________________________________________________________________________________________________
activation_1 (Activation)       (None, 64, 64, 256)  0           bn_conv1[0][0]                   
__________________________________________________________________________________________________
block_1a_conv_1 (Conv2D)        (None, 64, 32, 128)  36928       activation_1[0][0]               
__________________________________________________________________________________________________
block_1a_bn_1 (BatchNormalizati (None, 64, 32, 128)  256         block_1a_conv_1[0][0]            
__________________________________________________________________________________________________
activation_2 (Activation)       (None, 64, 32, 128)  0           block_1a_bn_1[0][0]              
__________________________________________________________________________________________________
block_1a_conv_2 (Conv2D)        (None, 64, 32, 128)  36928       activation_2[0][0]               
__________________________________________________________________________________________________
block_1a_conv_shortcut (Conv2D) (None, 64, 32, 128)  4160        activation_1[0][0]               
__________________________________________________________________________________________________
block_1a_bn_2 (BatchNormalizati (None, 64, 32, 128)  256         block_1a_conv_2[0][0]            
__________________________________________________________________________________________________
block_1a_bn_shortcut (BatchNorm (None, 64, 32, 128)  256         block_1a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_1 (Add)                     (None, 64, 32, 128)  0           block_1a_bn_2[0][0]              
                                                                 block_1a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
activation_3 (Activation)       (None, 64, 32, 128)  0           add_1[0][0]                      
__________________________________________________________________________________________________
block_1b_conv_1 (Conv2D)        (None, 64, 32, 128)  36928       activation_3[0][0]               
__________________________________________________________________________________________________
block_1b_bn_1 (BatchNormalizati (None, 64, 32, 128)  256         block_1b_conv_1[0][0]            
__________________________________________________________________________________________________
activation_4 (Activation)       (None, 64, 32, 128)  0           block_1b_bn_1[0][0]              
__________________________________________________________________________________________________
block_1b_conv_2 (Conv2D)        (None, 64, 32, 128)  36928       activation_4[0][0]               
__________________________________________________________________________________________________
block_1b_bn_2 (BatchNormalizati (None, 64, 32, 128)  256         block_1b_conv_2[0][0]            
__________________________________________________________________________________________________
add_2 (Add)                     (None, 64, 32, 128)  0           block_1b_bn_2[0][0]              
                                                                 activation_3[0][0]               
__________________________________________________________________________________________________
activation_5 (Activation)       (None, 64, 32, 128)  0           add_2[0][0]                      
__________________________________________________________________________________________________
block_2a_conv_1 (Conv2D)        (None, 128, 16, 64)  73856       activation_5[0][0]               
__________________________________________________________________________________________________
block_2a_bn_1 (BatchNormalizati (None, 128, 16, 64)  512         block_2a_conv_1[0][0]            
__________________________________________________________________________________________________
activation_6 (Activation)       (None, 128, 16, 64)  0           block_2a_bn_1[0][0]              
__________________________________________________________________________________________________
block_2a_conv_2 (Conv2D)        (None, 128, 16, 64)  147584      activation_6[0][0]               
__________________________________________________________________________________________________
block_2a_conv_shortcut (Conv2D) (None, 128, 16, 64)  8320        activation_5[0][0]               
__________________________________________________________________________________________________
block_2a_bn_2 (BatchNormalizati (None, 128, 16, 64)  512         block_2a_conv_2[0][0]            
__________________________________________________________________________________________________
block_2a_bn_shortcut (BatchNorm (None, 128, 16, 64)  512         block_2a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_3 (Add)                     (None, 128, 16, 64)  0           block_2a_bn_2[0][0]              
                                                                 block_2a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
activation_7 (Activation)       (None, 128, 16, 64)  0           add_3[0][0]                      
__________________________________________________________________________________________________
block_2b_conv_1 (Conv2D)        (None, 128, 16, 64)  147584      activation_7[0][0]               
__________________________________________________________________________________________________
block_2b_bn_1 (BatchNormalizati (None, 128, 16, 64)  512         block_2b_conv_1[0][0]            
__________________________________________________________________________________________________
activation_8 (Activation)       (None, 128, 16, 64)  0           block_2b_bn_1[0][0]              
__________________________________________________________________________________________________
block_2b_conv_2 (Conv2D)        (None, 128, 16, 64)  147584      activation_8[0][0]               
__________________________________________________________________________________________________
block_2b_bn_2 (BatchNormalizati (None, 128, 16, 64)  512         block_2b_conv_2[0][0]            
__________________________________________________________________________________________________
add_4 (Add)                     (None, 128, 16, 64)  0           block_2b_bn_2[0][0]              
                                                                 activation_7[0][0]               
__________________________________________________________________________________________________
activation_9 (Activation)       (None, 128, 16, 64)  0           add_4[0][0]                      
__________________________________________________________________________________________________
block_3a_conv_1 (Conv2D)        (None, 256, 8, 32)   295168      activation_9[0][0]               
__________________________________________________________________________________________________
block_3a_bn_1 (BatchNormalizati (None, 256, 8, 32)   1024        block_3a_conv_1[0][0]            
__________________________________________________________________________________________________
activation_10 (Activation)      (None, 256, 8, 32)   0           block_3a_bn_1[0][0]              
__________________________________________________________________________________________________
block_3a_conv_2 (Conv2D)        (None, 256, 8, 32)   590080      activation_10[0][0]              
__________________________________________________________________________________________________
block_3a_conv_shortcut (Conv2D) (None, 256, 8, 32)   33024       activation_9[0][0]               
__________________________________________________________________________________________________
block_3a_bn_2 (BatchNormalizati (None, 256, 8, 32)   1024        block_3a_conv_2[0][0]            
__________________________________________________________________________________________________
block_3a_bn_shortcut (BatchNorm (None, 256, 8, 32)   1024        block_3a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_5 (Add)                     (None, 256, 8, 32)   0           block_3a_bn_2[0][0]              
                                                                 block_3a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
activation_11 (Activation)      (None, 256, 8, 32)   0           add_5[0][0]                      
__________________________________________________________________________________________________
block_3b_conv_1 (Conv2D)        (None, 256, 8, 32)   590080      activation_11[0][0]              
__________________________________________________________________________________________________
block_3b_bn_1 (BatchNormalizati (None, 256, 8, 32)   1024        block_3b_conv_1[0][0]            
__________________________________________________________________________________________________
activation_12 (Activation)      (None, 256, 8, 32)   0           block_3b_bn_1[0][0]              
__________________________________________________________________________________________________
block_3b_conv_2 (Conv2D)        (None, 256, 8, 32)   590080      activation_12[0][0]              
__________________________________________________________________________________________________
block_3b_bn_2 (BatchNormalizati (None, 256, 8, 32)   1024        block_3b_conv_2[0][0]            
__________________________________________________________________________________________________
add_6 (Add)                     (None, 256, 8, 32)   0           block_3b_bn_2[0][0]              
                                                                 activation_11[0][0]              
__________________________________________________________________________________________________
activation_13 (Activation)      (None, 256, 8, 32)   0           add_6[0][0]                      
__________________________________________________________________________________________________
block_4a_conv_1 (Conv2D)        (None, 512, 8, 32)   1180160     activation_13[0][0]              
__________________________________________________________________________________________________
block_4a_bn_1 (BatchNormalizati (None, 512, 8, 32)   2048        block_4a_conv_1[0][0]            
__________________________________________________________________________________________________
activation_14 (Activation)      (None, 512, 8, 32)   0           block_4a_bn_1[0][0]              
__________________________________________________________________________________________________
block_4a_conv_2 (Conv2D)        (None, 512, 8, 32)   2359808     activation_14[0][0]              
__________________________________________________________________________________________________
block_4a_conv_shortcut (Conv2D) (None, 512, 8, 32)   131584      activation_13[0][0]              
__________________________________________________________________________________________________
block_4a_bn_2 (BatchNormalizati (None, 512, 8, 32)   2048        block_4a_conv_2[0][0]            
__________________________________________________________________________________________________
block_4a_bn_shortcut (BatchNorm (None, 512, 8, 32)   2048        block_4a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_7 (Add)                     (None, 512, 8, 32)   0           block_4a_bn_2[0][0]              
                                                                 block_4a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
activation_15 (Activation)      (None, 512, 8, 32)   0           add_7[0][0]                      
__________________________________________________________________________________________________
block_4b_conv_1 (Conv2D)        (None, 512, 8, 32)   2359808     activation_15[0][0]              
__________________________________________________________________________________________________
block_4b_bn_1 (BatchNormalizati (None, 512, 8, 32)   2048        block_4b_conv_1[0][0]            
__________________________________________________________________________________________________
activation_16 (Activation)      (None, 512, 8, 32)   0           block_4b_bn_1[0][0]              
__________________________________________________________________________________________________
block_4b_conv_2 (Conv2D)        (None, 512, 8, 32)   2359808     activation_16[0][0]              
__________________________________________________________________________________________________
block_4b_bn_2 (BatchNormalizati (None, 512, 8, 32)   2048        block_4b_conv_2[0][0]            
__________________________________________________________________________________________________
add_8 (Add)                     (None, 512, 8, 32)   0           block_4b_bn_2[0][0]              
                                                                 activation_15[0][0]              
__________________________________________________________________________________________________
activation_17 (Activation)      (None, 512, 8, 32)   0           add_8[0][0]                      
__________________________________________________________________________________________________
output_bbox (Conv2D)            (None, 12, 8, 32)    6156        activation_17[0][0]              
__________________________________________________________________________________________________
output_cov (Conv2D)             (None, 3, 8, 32)     1539        activation_17[0][0]              
==================================================================================================
Total params: 11,203,023
Trainable params: 11,183,823
Non-trainable params: 19,200
__________________________________________________________________________________________________
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
2020-03-10 17:56:34,721 [INFO] iva.detectnet_v2.scripts.train: Found 6359 samples in training set
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
2020-03-10 17:56:50,115 [INFO] iva.detectnet_v2.scripts.train: Found 1122 samples in validation set
INFO:tensorflow:Create CheckpointSaverHook.
2020-03-10 17:57:03,418 [INFO] tensorflow: Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2020-03-10 17:57:04,534 [INFO] tensorflow: Graph was finalized.
2020-03-10 17:57:04.535111: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-03-10 17:57:04.535159: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-03-10 17:57:04.535187: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2020-03-10 17:57:04.535194: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2020-03-10 17:57:04.535313: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6998 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
INFO:tensorflow:Running local_init_op.
2020-03-10 17:57:07,660 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2020-03-10 17:57:08,314 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2020-03-10 17:57:36,451 [INFO] tensorflow: Saving checkpoints for step-0.
2020-03-10 17:58:19.675048: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2020-03-10 17:58:19.947042: I tensorflow/core/kernels/cuda_solvers.cc:159] Creating CudaSolver handles for stream 0x5fee740
INFO:tensorflow:epoch = 0.0, loss = 0.08104235, step = 0
2020-03-10 17:58:24,004 [INFO] tensorflow: epoch = 0.0, loss = 0.08104235, step = 0
2020-03-10 17:58:24,006 [INFO] /usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/tfhooks/task_progress_monitor_hook.pyc: Epoch 0/120: loss: 0.08104 Time taken: 0:00:00 ETA: 0:00:00
INFO:tensorflow:epoch = 0.005025125628140704, loss = 0.07796858, step = 2 (12.107 sec)
2020-03-10 17:58:36,111 [INFO] tensorflow: epoch = 0.005025125628140704, loss = 0.07796858, step = 2 (12.107 sec)
2020-03-10 17:58:37,909 [INFO] /usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/tfhooks/sample_counter_hook.pyc: Samples / sec: 16.226
INFO:tensorflow:global_step/sec: 1.8675
2020-03-10 17:58:44,889 [INFO] tensorflow: global_step/sec: 1.8675
INFO:tensorflow:epoch = 0.10050251256281408, loss = 0.049496885, step = 40 (8.868 sec)
2020-03-10 17:58:44,979 [INFO] tensorflow: epoch = 0.10050251256281408, loss = 0.049496885, step = 40 (8.868 sec)
2020-03-10 17:58:45,711 [INFO] /usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/tfhooks/sample_counter_hook.pyc: Samples / sec: 51.275
2020-03-10 17:58:47,717 [INFO] /usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/tfhooks/sample_counter_hook.pyc: Samples / sec: 199.452
INFO:tensorflow:global_step/sec: 12.3764
2020-03-10 17:58:48,040 [INFO] tensorflow: global_step/sec: 12.3764
2020-03-10 17:58:49,728 [INFO] /usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/tfhooks/sample_counter_hook.pyc: Samples / sec: 198.929
INFO:tensorflow:epoch = 0.2613065326633166, loss = 0.013850397, step = 104 (5.148 sec)
2020-03-10 17:58:50,126 [INFO] tensorflow: epoch = 0.2613065326633166, loss = 0.013850397, step = 104 (5.148 sec)
INFO:tensorflow:global_step/sec: 12.4458
2020-03-10 17:58:51,173 [INFO] tensorflow: global_step/sec: 12.4458
2020-03-10 17:58:51,737 [INFO] /usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/tfhooks/sample_counter_hook.pyc: Samples / sec: 199.081
2020-03-10 17:58:53,742 [INFO] /usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/tfhooks/sample_counter_hook.pyc: Samples / sec: 199.494
INFO:tensorflow:global_step/sec: 12.4714
2020-03-10 17:58:54,301 [INFO] tensorflow: global_step/sec: 12.4714
INFO:tensorflow:epoch = 0.4221105527638191, loss = 0.0035675862, step = 168 (5.131 sec)
2020-03-10 17:58:55,257 [INFO] tensorflow: epoch = 0.4221105527638191, loss = 0.0035675862, step = 168 (5.131 sec)
2020-03-10 17:58:55,746 [INFO] /usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/tfhooks/sample_counter_hook.pyc: Samples / sec: 199.704
INFO:tensorflow:global_step/sec: 12.3694
2020-03-10 17:58:57,453 [INFO] tensorflow: global_step/sec: 12.3694
2020-03-10 17:58:57,777 [INFO] /usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/tfhooks/sample_counter_hook.pyc: Samples / sec: 196.937
2020-03-10 17:58:59,794 [INFO] /usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/tfhooks/sample_counter_hook.pyc: Samples / sec: 198.315
INFO:tensorflow:epoch = 0.5804020100502513, loss = 0.0027891295, step = 231 (5.102 sec)
2020-03-10 17:59:00,359 [INFO] tensorflow: epoch = 0.5804020100502513, loss = 0.0027891295, step = 231 (5.102 sec)
INFO:tensorflow:global_step/sec: 12.3756
2020-03-10 17:59:00,605 [INFO] tensorflow: global_step/sec: 12.3756
2020-03-10 17:59:01,814 [INFO] /usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/tfhooks/sample_counter_hook.pyc: Samples / sec: 198.064
INFO:tensorflow:global_step/sec: 12.4258
2020-03-10 17:59:03,743 [INFO] tensorflow: global_step/sec: 12.4258
2020-03-10 17:59:03,827 [INFO] /usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/tfhooks/sample_counter_hook.pyc: Samples / sec: 198.771
INFO:tensorflow:epoch = 0.7412060301507538, loss = 0.0019985866, step = 295 (5.156 sec)
2020-03-10 17:59:05,515 [INFO] tensorflow: epoch = 0.7412060301507538, loss = 0.0019985866, step = 295 (5.156 sec)
2020-03-10 17:59:05,841 [INFO] /usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/tfhooks/sample_counter_hook.pyc: Samples / sec: 198.595
INFO:tensorflow:global_step/sec: 12.3988
2020-03-10 17:59:06,889 [INFO] tensorflow: global_step/sec: 12.3988
2020-03-10 17:59:07,858 [INFO] /usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/tfhooks/sample_counter_hook.pyc: Samples / sec: 198.285
2020-03-10 17:59:09,864 [INFO] /usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/tfhooks/sample_counter_hook.pyc: Samples / sec: 199.485
INFO:tensorflow:global_step/sec: 12.4174
2020-03-10 17:59:10,030 [INFO] tensorflow: global_step/sec: 12.4174
INFO:tensorflow:epoch = 0.8994974874371859, loss = 0.0016591488, step = 358 (5.083 sec)
2020-03-10 17:59:10,598 [INFO] tensorflow: epoch = 0.8994974874371859, loss = 0.0016591488, step = 358 (5.083 sec)
2020-03-10 17:59:11,883 [INFO] /usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/tfhooks/sample_counter_hook.pyc: Samples / sec: 198.096
INFO:tensorflow:global_step/sec: 12.3317
2020-03-10 17:59:13,192 [INFO] tensorflow: global_step/sec: 12.3317
2020-03-10 17:59:14,363 [INFO] iva.detectnet_v2.evaluation.evaluation: step 0 / 70, 0.00s/step
/usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/evaluation/metadata.py:38: UserWarning: One or more metadata field(s) are missing from ground_truth batch_data, and will be replaced with defaults: ['frame/camera_location']
2020-03-10 17:59:21,060 [INFO] iva.detectnet_v2.evaluation.evaluation: step 10 / 70, 0.67s/step
2020-03-10 17:59:23,421 [INFO] iva.detectnet_v2.evaluation.evaluation: step 20 / 70, 0.24s/step
2020-03-10 17:59:25,764 [INFO] iva.detectnet_v2.evaluation.evaluation: step 30 / 70, 0.23s/step
2020-03-10 17:59:28,108 [INFO] iva.detectnet_v2.evaluation.evaluation: step 40 / 70, 0.23s/step
2020-03-10 17:59:30,458 [INFO] iva.detectnet_v2.evaluation.evaluation: step 50 / 70, 0.24s/step
2020-03-10 17:59:32,803 [INFO] iva.detectnet_v2.evaluation.evaluation: step 60 / 70, 0.23s/step
Matching predictions to ground truth, class 1/3.: 100%|#| 8/8 [00:00<00:00, 8790.79it/s]
Matching predictions to ground truth, class 2/3.: 100%|#| 9/9 [00:00<00:00, 8167.19it/s]
Matching predictions to ground truth, class 3/3.: 100%|#| 1/1 [00:00<00:00, 889.57it/s]
/usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/evaluation/compute_metrics.py:717: RuntimeWarning: invalid value encountered in true_divide
Epoch 1/120
=========================

Validation cost: 0.000008
Mean average_precision (in %): 0.0000

class name      average precision (in %)
------------  --------------------------
car                                    0
cyclist                                0
pedestrian                             0

Median Inference Time: 0.003606
INFO:tensorflow:epoch = 1.0, loss = 0.00052329456, step = 398 (24.667 sec)
2020-03-10 17:59:35,265 [INFO] tensorflow: epoch = 1.0, loss = 0.00052329456, step = 398 (24.667 sec)
2020-03-10 17:59:35,265 [INFO] /usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/tfhooks/task_progress_monitor_hook.pyc: Epoch 1/120: loss: 0.00052 Time taken: 0:01:21.906016 ETA: 2:42:26.815904
2020-03-10 17:59:35,359 [INFO] /usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/tfhooks/sample_counter_hook.pyc: Samples / sec: 17.039
2020-03-10 17:59:37,375 [INFO] /usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/tfhooks/sample_counter_hook.pyc: Samples / sec: 198.457
INFO:tensorflow:global_step/sec: 1.58629
2020-03-10 17:59:37,778 [INFO] tensorflow: global_step/sec: 1.58629
2020-03-10 17:59:39,390 [INFO] /usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/tfhooks/sample_counter_hook.pyc: Samples / sec: 198.539
INFO:tensorflow:epoch = 1.1582914572864322, loss = 0.00058055145, step = 461 (5.096 sec)
2020-03-10 17:59:40,361 [INFO] tensorflow: epoch = 1.1582914572864322, loss = 0.00058055145, step = 461 (5.096 sec)
INFO:tensorflow:global_step/sec: 12.3628
2020-03-10 17:59:40,933 [INFO] tensorflow: global_step/sec: 12.3628
2020-03-10 17:59:41,421 [INFO] /usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/tfhooks/sample_counter_hook.pyc: Samples / sec: 196.973
2020-03-10 17:59:43,440 [INFO] /usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/tfhooks/sample_counter_hook.pyc: Samples / sec: 198.187
INFO:tensorflow:global_step/sec: 12.2945
2020-03-10 17:59:44,105 [INFO] tensorflow: global_step/sec: 12.2945
INFO:tensorflow:epoch = 1.3165829145728642, loss = 0.00052331056, step = 524 (5.131 sec)
2020-03-10 17:59:45,492 [INFO] tensorflow: epoch = 1.3165829145728642, loss = 0.00052331056, step = 524 (5.131 sec)
2020-03-10 17:59:45,493 [INFO] /usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/tfhooks/sample_counter_hook.pyc: Samples / sec: 194.860
INFO:tensorflow:global_step/sec: 12.3807
2020-03-10 17:59:47,255 [INFO] tensorflow: global_step/sec: 12.3807
2020-03-10 17:59:47,499 [INFO] /usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/tfhooks/sample_counter_hook.pyc: Samples / sec: 199.339
2020-03-10 17:59:49,515 [INFO] /usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/tfhooks/sample_counter_hook.pyc: Samples / sec: 198.521
INFO:tensorflow:global_step/sec: 12.3445
2020-03-10 17:59:50,414 [INFO] tensorflow: global_step/sec: 12.3445
INFO:tensorflow:epoch = 1.4748743718592965, loss = 0.00052351336, step = 587 (5.101 sec)
2020-03-10 17:59:50,594 [INFO] tensorflow: epoch = 1.4748743718592965, loss = 0.00052351336, step = 587 (5.101 sec)
2020-03-10 17:59:51,553 [INFO] /usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/tfhooks/sample_counter_hook.pyc: Samples / sec: 196.228
INFO:tensorflow:global_step/sec: 12.3396
2020-03-10 17:59:53,575 [INFO] tensorflow: global_step/sec: 12.3396
2020-03-10 17:59:53,575 [INFO] /usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/tfhooks/sample_counter_hook.pyc: Samples / sec: 197.819
2020-03-10 17:59:55,607 [INFO] /usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/tfhooks/sample_counter_hook.pyc: Samples / sec: 196.891
INFO:tensorflow:epoch = 1.6331658291457287, loss = 0.00052330946, step = 650 (5.095 sec)
2020-03-10 17:59:55,688 [INFO] tensorflow: epoch = 1.6331658291457287, loss = 0.00052330946, step = 650 (5.095 sec)
---------------------------------------------------------------------
---------------------------------------------------------------------

Epoch 120/120
=========================

Validation cost: 0.000006
Mean average_precision (in %): 0.0000

class name      average precision (in %)
------------  --------------------------
car                                    0
cyclist                                0
pedestrian                             0

Median Inference Time: 0.003395
Time taken to run iva.detectnet_v2.scripts.train:main: 1:10:33.431308.

My train.txt file looks like this:

random_seed: 42
model_config {
  pretrained_model_file: "/workspace/pretrained_model/tlt_resnet18_detectnet_v2_v1/resnet18.hdf5"
  num_layers: 18
  
  freeze_blocks: 0
  arch: "resnet"
  use_batch_norm: true
  activation {
    activation_type: "relu"
  }
  dropout_rate: 0.1
  objective_set: {
    cov {}
    bbox {
      scale: 35.0
      offset: 0.5
    }
  }
  training_precision {
    backend_floatx: FLOAT32
  }
}

bbox_rasterizer_config {
  target_class_config {
    key: "car"
    value: {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 0.4
      cov_radius_y: 0.4
      bbox_min_radius: 1.0
    }
  }
  target_class_config {
    key: "pedestrian"
    value: {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 1.0
      cov_radius_y: 1.0
      bbox_min_radius: 1.0
    }
  }
  target_class_config {
    key: "cyclist"
    value: {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 1.0
      cov_radius_y: 1.0
      bbox_min_radius: 1.0
    }
  }
  deadzone_radius: 0.67
}

cost_function_config {
  target_classes {
    name: "car"
    class_weight: 1.0
    coverage_foreground_weight: 0.05
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  target_classes {
    name: "pedestrian"
    class_weight: 1.0
    coverage_foreground_weight: 0.05
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 1.0
    }
  }
  target_classes {
    name: "cyclist"
    class_weight: 1.0
    coverage_foreground_weight: 0.05
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  enable_autoweighting: True
  max_objective_weight: 0.9999
  min_objective_weight: 0.0001
}

training_config {
  batch_size_per_gpu: 16
  num_epochs: 120
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-6
      max_learning_rate: 5e-4
      soft_start: 0.1
      annealing: 0.7
    }
  }
  regularizer {
    type: L1
    weight: 3e-9
  }
  optimizer {
    adam {
      epsilon: 1e-08
      beta1: 0.9
      beta2: 0.999
    }
  }
  cost_scaling {
    enabled: False
    initial_exponent: 20.0
    increment: 0.005
    decrement: 1.0
  }
  checkpoint_interval: 10
}

augmentation_config {
  preprocessing {
    output_image_width: 512
    output_image_height: 128
    output_image_channel: 3
    min_bbox_width: 1.0
    min_bbox_height: 1.0
  }
  spatial_augmentation {
    hflip_probability: 0.5
    vflip_probability: 0.0
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    color_shift_stddev: 0.0
    hue_rotation_max: 25.0
    saturation_shift_max: 0.2
    contrast_scale_max: 0.1
    contrast_center: 0.5
  }
}

postprocessing_config {
  target_class_config {
    key: "car"
    value: {
      clustering_config {
        coverage_threshold: 0.005
        dbscan_eps: 0.13
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 1
      }
    }
  }
  target_class_config {
    key: "pedestrian"
    value: {
      clustering_config {
        coverage_threshold: 0.005
        dbscan_eps: 0.15
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 1
      }
    }
  }
  target_class_config {
    key: "cyclist"
    value: {
      clustering_config {
        coverage_threshold: 0.005
        dbscan_eps: 0.15
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 1
      }
    }
  }
}

dataset_config {
  data_sources: {
    tfrecords_path: "/workspace/tf_records/*"
    image_directory_path: "/workspace/dataset/KITTI_original/training"
  }
  image_extension: "jpg"
  target_class_mapping {
      key: "car"
      value: "car"
  }
  target_class_mapping {
      key: "pedestrian"
      value: "pedestrian"
  }
  target_class_mapping {
      key: "cyclist"
      value: "cyclist"
  }
  validation_fold: 0
}

evaluation_config {
  validation_period_during_training: 10
  first_validation_epoch: 1
  minimum_detection_ground_truth_overlap {
    key: "car"
    value: 0.7
  }
  minimum_detection_ground_truth_overlap {
    key: "pedestrian"
    value: 0.5
  }
  minimum_detection_ground_truth_overlap {
    key: "cyclist"
    value: 0.5
  }
  evaluation_box_config {
    key: "car"
    value {
      minimum_height: 4
      maximum_height: 9999
      minimum_width: 4
      maximum_width: 9999
    }
  }
  evaluation_box_config {
    key: "pedestrian"
    value {
      minimum_height: 4
      maximum_height: 9999
      minimum_width: 4
      maximum_width: 9999
    }
  }
  evaluation_box_config {
    key: "cyclist"
    value {
      minimum_height: 4
      maximum_height: 9999
      minimum_width: 4
      maximum_width: 9999
    }
  }
}

I am not sure whether I am doing something wrong. Please help.

Morganh · March 11, 2020, 1:48am

The settings below in your training spec do not match the actual size of the KITTI images, which is about 1248x384. Please change them to 1248x384.

output_image_width: 512
output_image_height: 128

Also, the TLT 1.0.1 docker includes Jupyter notebooks for reference, along with default training specs for the KITTI dataset.
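
For anyone hitting the same zero-AP symptom, one way to catch this kind of mismatch before training is to compare the output size in the spec with the real image size. The sketch below is only an illustration in plain Python. The helper names read_output_size and size_mismatch are made up here and are not part of TLT, the two fields are pulled out of the spec text with a regular expression instead of a real protobuf parser, and the dataset size (1248, 384) is typed in by hand from the value quoted above rather than measured from the images.

```python
import re


def read_output_size(spec_text):
    """Return (width, height) from output_image_width/height in the spec, or None."""
    width = re.search(r"output_image_width:\s*(\d+)", spec_text)
    height = re.search(r"output_image_height:\s*(\d+)", spec_text)
    if width is None or height is None:
        return None
    return int(width.group(1)), int(height.group(1))


def size_mismatch(spec_text, dataset_size):
    """Return a message if the spec size differs from dataset_size, else None."""
    spec_size = read_output_size(spec_text)
    if spec_size is None:
        return "output_image_width/height not found in spec"
    if spec_size != dataset_size:
        # Concatenating the two 2-tuples gives the four values for the message.
        return "spec says %dx%d but images are %dx%d" % (spec_size + dataset_size)
    return None


# Fragment of the spec from this thread; the dataset size is an assumption
# taken from the reply above, not read from the image files.
spec = """
augmentation_config {
  preprocessing {
    output_image_width: 512
    output_image_height: 128
    output_image_channel: 3
  }
}
"""

print(size_mismatch(spec, (1248, 384)))
```

With the 512x128 values from the posted spec this reports a mismatch, and it returns None once the two fields are changed to 1248 and 384.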

xhuv_NV · March 12, 2020, 6:46am

Thanks. I have updated those parameters and training now works well, with the following result:

Epoch 120/120
=========================

Validation cost: 0.000082
Mean average_precision (in %): 71.5213

class name      average precision (in %)
------------  --------------------------
car                              85.4542
cyclist                          56.6675
pedestrian                       72.442

Median Inference Time: 0.011675
Time taken to run iva.detectnet_v2.scripts.train:main: 5:36:31.637871.

But when I run evaluation:

!tlt-evaluate detectnet_v2 -e $SPECS_DIR/train.txt \
                           -m $USER_EXPERIMENT_DIR/experiment_dir_unpruned/weights/resnet10_detector.tlt \
                           -k $KEY

It throws this error:

Using TensorFlow backend.
2020-03-12 06:39:52,966 [INFO] iva.detectnet_v2.spec_handler.spec_loader: Merging specification from /workspace/spec_files/train.txt
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2020-03-12 06:39:53,506 [WARNING] tensorflow: From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2020-03-12 06:39:54.628252: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-03-12 06:39:54.735884: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-12 06:39:54.736622: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x8097390 executing computations on platform CUDA. Devices:
2020-03-12 06:39:54.736657: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): GeForce RTX 2070, Compute Capability 7.5
2020-03-12 06:39:54.758318: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3408000000 Hz
2020-03-12 06:39:54.758873: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x81000e0 executing computations on platform Host. Devices:
2020-03-12 06:39:54.758909: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2020-03-12 06:39:54.759088: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.86
pciBusID: 0000:01:00.0
totalMemory: 7.76GiB freeMemory: 6.89GiB
2020-03-12 06:39:54.759118: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-03-12 06:39:54.760118: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-03-12 06:39:54.760132: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2020-03-12 06:39:54.760139: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2020-03-12 06:39:54.760237: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6707 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
/usr/local/lib/python2.7/dist-packages/keras/engine/saving.py:292: UserWarning: No training configuration found in save file: the model was *not* compiled. Compile it manually.
  warnings.warn('No training configuration found in save file: '
WARNING:tensorflow:From ./detectnet_v2/dataloader/utilities.py:114: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`
2020-03-12 06:39:55,753 [WARNING] tensorflow: From ./detectnet_v2/dataloader/utilities.py:114: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
2020-03-12 06:40:01,867 [INFO] /usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/evaluation/build_evaluator.pyc: Found 1122 samples in validation set
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            (None, 3, 384, 1248) 0                                            
__________________________________________________________________________________________________
conv1 (Conv2D)                  (None, 64, 192, 624) 9472        input_1[0][0]                    
__________________________________________________________________________________________________
bn_conv1 (BatchNormalization)   (None, 64, 192, 624) 256         conv1[0][0]                      
__________________________________________________________________________________________________
activation_1 (Activation)       (None, 64, 192, 624) 0           bn_conv1[0][0]                   
__________________________________________________________________________________________________
block_1a_conv_1 (Conv2D)        (None, 64, 96, 312)  36928       activation_1[0][0]               
__________________________________________________________________________________________________
block_1a_bn_1 (BatchNormalizati (None, 64, 96, 312)  256         block_1a_conv_1[0][0]            
__________________________________________________________________________________________________
activation_2 (Activation)       (None, 64, 96, 312)  0           block_1a_bn_1[0][0]              
__________________________________________________________________________________________________
block_1a_conv_2 (Conv2D)        (None, 64, 96, 312)  36928       activation_2[0][0]               
__________________________________________________________________________________________________
block_1a_conv_shortcut (Conv2D) (None, 64, 96, 312)  4160        activation_1[0][0]               
__________________________________________________________________________________________________
block_1a_bn_2 (BatchNormalizati (None, 64, 96, 312)  256         block_1a_conv_2[0][0]            
__________________________________________________________________________________________________
block_1a_bn_shortcut (BatchNorm (None, 64, 96, 312)  256         block_1a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_1 (Add)                     (None, 64, 96, 312)  0           block_1a_bn_2[0][0]              
                                                                 block_1a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
activation_3 (Activation)       (None, 64, 96, 312)  0           add_1[0][0]                      
__________________________________________________________________________________________________
block_1b_conv_1 (Conv2D)        (None, 64, 96, 312)  36928       activation_3[0][0]               
__________________________________________________________________________________________________
block_1b_bn_1 (BatchNormalizati (None, 64, 96, 312)  256         block_1b_conv_1[0][0]            
__________________________________________________________________________________________________
activation_4 (Activation)       (None, 64, 96, 312)  0           block_1b_bn_1[0][0]              
__________________________________________________________________________________________________
block_1b_conv_2 (Conv2D)        (None, 64, 96, 312)  36928       activation_4[0][0]               
__________________________________________________________________________________________________
block_1b_bn_2 (BatchNormalizati (None, 64, 96, 312)  256         block_1b_conv_2[0][0]            
__________________________________________________________________________________________________
add_2 (Add)                     (None, 64, 96, 312)  0           block_1b_bn_2[0][0]              
                                                                 activation_3[0][0]               
__________________________________________________________________________________________________
activation_5 (Activation)       (None, 64, 96, 312)  0           add_2[0][0]                      
__________________________________________________________________________________________________
block_2a_conv_1 (Conv2D)        (None, 128, 48, 156) 73856       activation_5[0][0]               
__________________________________________________________________________________________________
block_2a_bn_1 (BatchNormalizati (None, 128, 48, 156) 512         block_2a_conv_1[0][0]            
__________________________________________________________________________________________________
activation_6 (Activation)       (None, 128, 48, 156) 0           block_2a_bn_1[0][0]              
__________________________________________________________________________________________________
block_2a_conv_2 (Conv2D)        (None, 128, 48, 156) 147584      activation_6[0][0]               
__________________________________________________________________________________________________
block_2a_conv_shortcut (Conv2D) (None, 128, 48, 156) 8320        activation_5[0][0]               
__________________________________________________________________________________________________
block_2a_bn_2 (BatchNormalizati (None, 128, 48, 156) 512         block_2a_conv_2[0][0]            
__________________________________________________________________________________________________
block_2a_bn_shortcut (BatchNorm (None, 128, 48, 156) 512         block_2a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_3 (Add)                     (None, 128, 48, 156) 0           block_2a_bn_2[0][0]              
                                                                 block_2a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
activation_7 (Activation)       (None, 128, 48, 156) 0           add_3[0][0]                      
__________________________________________________________________________________________________
block_2b_conv_1 (Conv2D)        (None, 128, 48, 156) 147584      activation_7[0][0]               
__________________________________________________________________________________________________
block_2b_bn_1 (BatchNormalizati (None, 128, 48, 156) 512         block_2b_conv_1[0][0]            
__________________________________________________________________________________________________
activation_8 (Activation)       (None, 128, 48, 156) 0           block_2b_bn_1[0][0]              
__________________________________________________________________________________________________
block_2b_conv_2 (Conv2D)        (None, 128, 48, 156) 147584      activation_8[0][0]               
__________________________________________________________________________________________________
block_2b_bn_2 (BatchNormalizati (None, 128, 48, 156) 512         block_2b_conv_2[0][0]            
__________________________________________________________________________________________________
add_4 (Add)                     (None, 128, 48, 156) 0           block_2b_bn_2[0][0]              
                                                                 activation_7[0][0]               
__________________________________________________________________________________________________
activation_9 (Activation)       (None, 128, 48, 156) 0           add_4[0][0]                      
__________________________________________________________________________________________________
block_3a_conv_1 (Conv2D)        (None, 256, 24, 78)  295168      activation_9[0][0]               
__________________________________________________________________________________________________
block_3a_bn_1 (BatchNormalizati (None, 256, 24, 78)  1024        block_3a_conv_1[0][0]            
__________________________________________________________________________________________________
activation_10 (Activation)      (None, 256, 24, 78)  0           block_3a_bn_1[0][0]              
__________________________________________________________________________________________________
block_3a_conv_2 (Conv2D)        (None, 256, 24, 78)  590080      activation_10[0][0]              
__________________________________________________________________________________________________
block_3a_conv_shortcut (Conv2D) (None, 256, 24, 78)  33024       activation_9[0][0]               
__________________________________________________________________________________________________
block_3a_bn_2 (BatchNormalizati (None, 256, 24, 78)  1024        block_3a_conv_2[0][0]            
__________________________________________________________________________________________________
block_3a_bn_shortcut (BatchNorm (None, 256, 24, 78)  1024        block_3a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_5 (Add)                     (None, 256, 24, 78)  0           block_3a_bn_2[0][0]              
                                                                 block_3a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
activation_11 (Activation)      (None, 256, 24, 78)  0           add_5[0][0]                      
__________________________________________________________________________________________________
block_3b_conv_1 (Conv2D)        (None, 256, 24, 78)  590080      activation_11[0][0]              
__________________________________________________________________________________________________
block_3b_bn_1 (BatchNormalizati (None, 256, 24, 78)  1024        block_3b_conv_1[0][0]            
__________________________________________________________________________________________________
activation_12 (Activation)      (None, 256, 24, 78)  0           block_3b_bn_1[0][0]              
__________________________________________________________________________________________________
block_3b_conv_2 (Conv2D)        (None, 256, 24, 78)  590080      activation_12[0][0]              
__________________________________________________________________________________________________
block_3b_bn_2 (BatchNormalizati (None, 256, 24, 78)  1024        block_3b_conv_2[0][0]            
__________________________________________________________________________________________________
add_6 (Add)                     (None, 256, 24, 78)  0           block_3b_bn_2[0][0]              
                                                                 activation_11[0][0]              
__________________________________________________________________________________________________
activation_13 (Activation)      (None, 256, 24, 78)  0           add_6[0][0]                      
__________________________________________________________________________________________________
block_4a_conv_1 (Conv2D)        (None, 512, 24, 78)  1180160     activation_13[0][0]              
__________________________________________________________________________________________________
block_4a_bn_1 (BatchNormalizati (None, 512, 24, 78)  2048        block_4a_conv_1[0][0]            
__________________________________________________________________________________________________
activation_14 (Activation)      (None, 512, 24, 78)  0           block_4a_bn_1[0][0]              
__________________________________________________________________________________________________
block_4a_conv_2 (Conv2D)        (None, 512, 24, 78)  2359808     activation_14[0][0]              
__________________________________________________________________________________________________
block_4a_conv_shortcut (Conv2D) (None, 512, 24, 78)  131584      activation_13[0][0]              
__________________________________________________________________________________________________
block_4a_bn_2 (BatchNormalizati (None, 512, 24, 78)  2048        block_4a_conv_2[0][0]            
__________________________________________________________________________________________________
block_4a_bn_shortcut (BatchNorm (None, 512, 24, 78)  2048        block_4a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_7 (Add)                     (None, 512, 24, 78)  0           block_4a_bn_2[0][0]              
                                                                 block_4a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
activation_15 (Activation)      (None, 512, 24, 78)  0           add_7[0][0]                      
__________________________________________________________________________________________________
block_4b_conv_1 (Conv2D)        (None, 512, 24, 78)  2359808     activation_15[0][0]              
__________________________________________________________________________________________________
block_4b_bn_1 (BatchNormalizati (None, 512, 24, 78)  2048        block_4b_conv_1[0][0]            
__________________________________________________________________________________________________
activation_16 (Activation)      (None, 512, 24, 78)  0           block_4b_bn_1[0][0]              
__________________________________________________________________________________________________
block_4b_conv_2 (Conv2D)        (None, 512, 24, 78)  2359808     activation_16[0][0]              
__________________________________________________________________________________________________
block_4b_bn_2 (BatchNormalizati (None, 512, 24, 78)  2048        block_4b_conv_2[0][0]            
__________________________________________________________________________________________________
add_8 (Add)                     (None, 512, 24, 78)  0           block_4b_bn_2[0][0]              
                                                                 activation_15[0][0]              
__________________________________________________________________________________________________
activation_17 (Activation)      (None, 512, 24, 78)  0           add_8[0][0]                      
__________________________________________________________________________________________________
output_bbox (Conv2D)            (None, 12, 24, 78)   6156        activation_17[0][0]              
__________________________________________________________________________________________________
output_cov (Conv2D)             (None, 3, 24, 78)    1539        activation_17[0][0]              
==================================================================================================
Total params: 11,203,023
Trainable params: 11,183,823
Non-trainable params: 19,200
__________________________________________________________________________________________________
INFO:tensorflow:Graph was finalized.
2020-03-12 06:40:09,368 [INFO] tensorflow: Graph was finalized.
2020-03-12 06:40:09.369112: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-03-12 06:40:09.369166: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-03-12 06:40:09.369174: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2020-03-12 06:40:09.369180: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2020-03-12 06:40:09.369294: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6707 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
INFO:tensorflow:Running local_init_op.
2020-03-12 06:40:10,590 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2020-03-12 06:40:10,887 [INFO] tensorflow: Done running local_init_op.
2020-03-12 06:40:12,780 [INFO] iva.detectnet_v2.evaluation.evaluation: step 0 / 71, 0.00s/step
2020-03-12 06:40:16.856709: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2020-03-12 06:40:17.693190: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-03-12 06:40:17.697520: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "/usr/local/bin/tlt-evaluate", line 10, in <module>
    sys.exit(main())
  File "./common/magnet_evaluate.py", line 38, in main
  File "</usr/local/lib/python2.7/dist-packages/decorator.pyc:decorator-gen-2>", line 2, in main
  File "./detectnet_v2/utilities/timer.py", line 46, in wrapped_fn
  File "./detectnet_v2/scripts/evaluate.py", line 126, in main
  File "./detectnet_v2/evaluation/evaluation.py", line 156, in evaluate
  File "./detectnet_v2/evaluation/evaluation.py", line 116, in _get_validation_iterator
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node resnet18_nopool_bn_detectnet_v2/conv1/convolution (defined at /opt/nvidia/third_party/keras/tensorflow_backend.py:93) ]]
	 [[node strided_slice_352 (defined at ./detectnet_v2/model/utilities.py:53) ]]

Caused by op u'resnet18_nopool_bn_detectnet_v2/conv1/convolution', defined at:
  File "/usr/local/bin/tlt-evaluate", line 10, in <module>
    sys.exit(main())
  File "./common/magnet_evaluate.py", line 38, in main
  File "</usr/local/lib/python2.7/dist-packages/decorator.pyc:decorator-gen-2>", line 2, in main
  File "./detectnet_v2/utilities/timer.py", line 46, in wrapped_fn
  File "./detectnet_v2/scripts/evaluate.py", line 119, in main
  File "./detectnet_v2/evaluation/build_evaluator.py", line 124, in build_evaluator_for_trained_gridbox
  File "./detectnet_v2/model/utilities.py", line 26, in _fn_wrapper
  File "./detectnet_v2/model/detectnet_model.py", line 617, in build_validation_graph
  File "./detectnet_v2/model/utilities.py", line 26, in _fn_wrapper
  File "./detectnet_v2/model/detectnet_model.py", line 577, in build_inference_graph
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/base_layer.py", line 457, in __call__
    output = self.call(inputs, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/network.py", line 564, in call
    output_tensors, _, _ = self.run_internal_graph(inputs, masks)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/network.py", line 721, in run_internal_graph
    layer.call(computed_tensor, **kwargs))
  File "/usr/local/lib/python2.7/dist-packages/keras/layers/convolutional.py", line 171, in call
    dilation_rate=self.dilation_rate)
  File "/opt/nvidia/third_party/keras/tensorflow_backend.py", line 93, in conv2d
    data_format=tf_data_format)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_ops.py", line 851, in convolution
    return op(input, filter)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_ops.py", line 966, in __call__
    return self.conv_op(inp, filter)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_ops.py", line 591, in __call__
    return self.call(inp, filter)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_ops.py", line 208, in __call__
    name=self.name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 1026, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node resnet18_nopool_bn_detectnet_v2/conv1/convolution (defined at /opt/nvidia/third_party/keras/tensorflow_backend.py:93) ]]
	 [[node strided_slice_352 (defined at ./detectnet_v2/model/utilities.py:53) ]]

Morganh · March 12, 2020, 7:11am

According to your training spec, you were training a resnet18 tlt model.
Why is resnet10_detector.tlt in your tlt-evaluate command?

xhuv_NV · March 12, 2020, 7:33am

I previously tried resnet18. Currently, the training spec I trained with looks like this:

random_seed: 42
model_config {
  pretrained_model_file: "/workspace/pretrained_model/tlt_resnet10_detectnet_v2_v1/resnet10.hdf5"
  num_layers: 18
  
  freeze_blocks: 0
  arch: "resnet"
  use_batch_norm: true
  activation {
    activation_type: "relu"
  }
  dropout_rate: 0.1
  objective_set: {
    cov {}
    bbox {
      scale: 35.0
      offset: 0.5
    }
  }
  training_precision {
    backend_floatx: FLOAT32
 }
}

bbox_rasterizer_config {
  target_class_config {
    key: "car"
    value: {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 0.4
      cov_radius_y: 0.4
      bbox_min_radius: 1.0
    }
  }
  target_class_config {
    key: "pedestrian"
    value: {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 1.0
      cov_radius_y: 1.0
      bbox_min_radius: 1.0
    }
  }
  target_class_config {
    key: "cyclist"
    value: {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 1.0
      cov_radius_y: 1.0
      bbox_min_radius: 1.0
    }
  }
  deadzone_radius: 0.67
}

cost_function_config {
  target_classes {
    name: "car"
    class_weight: 1.0
    coverage_foreground_weight: 0.05
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  target_classes {
    name: "pedestrian"
    class_weight: 1.0
    coverage_foreground_weight: 0.05
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 1.0
    }
  }
  target_classes {
    name: "cyclist"
    class_weight: 1.0
    coverage_foreground_weight: 0.05
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  enable_autoweighting: True
  max_objective_weight: 0.9999
  min_objective_weight: 0.0001
}

training_config {
  batch_size_per_gpu: 16
  num_epochs: 120
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-6
      max_learning_rate: 5e-4
      soft_start: 0.1
      annealing: 0.7
    }
  }
  regularizer {
    type: L1
    weight: 3e-9
  }
  optimizer {
    adam {
      epsilon: 1e-08
      beta1: 0.9
      beta2: 0.999
    }
  }
  cost_scaling {
    enabled: False
    initial_exponent: 20.0
    increment: 0.005
    decrement: 1.0
  }
  checkpoint_interval: 10
}

augmentation_config {
  preprocessing {
    output_image_width: 1248
    output_image_height: 384
    output_image_channel: 3
    min_bbox_width: 1.0
    min_bbox_height: 1.0
  }
  spatial_augmentation {
    hflip_probability: 0.5
    vflip_probability: 0.0
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    color_shift_stddev: 0.0
    hue_rotation_max: 25.0
    saturation_shift_max: 0.2
    contrast_scale_max: 0.1
    contrast_center: 0.5
  }
}

postprocessing_config {
  target_class_config {
    key: "car"
    value: {
      clustering_config {
        coverage_threshold: 0.005
        dbscan_eps: 0.13
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 1
      }
    }
  }
  target_class_config {
    key: "pedestrian"
    value: {
      clustering_config {
        coverage_threshold: 0.005
        dbscan_eps: 0.15
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 1
      }
    }
  }
  target_class_config {
    key: "cyclist"
    value: {
      clustering_config {
        coverage_threshold: 0.005
        dbscan_eps: 0.15
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 1
      }
    }
  }
}

dataset_config {
  data_sources: {
    tfrecords_path: "/workspace/tf_records/*"
    image_directory_path: "/workspace/dataset/KITTI_original/training"
  }
  image_extension: "jpg"
  target_class_mapping {
      key: "car"
      value: "car"
  }
  target_class_mapping {
      key: "pedestrian"
      value: "pedestrian"
  }
  target_class_mapping {
      key: "cyclist"
      value: "cyclist"
  }
  validation_fold: 0
}

evaluation_config {
  validation_period_during_training: 10
  first_validation_epoch: 1
  minimum_detection_ground_truth_overlap {
    key: "car"
    value: 0.7
  }
  minimum_detection_ground_truth_overlap {
    key: "pedestrian"
    value: 0.5
  }
  minimum_detection_ground_truth_overlap {
    key: "cyclist"
    value: 0.5
  }
  evaluation_box_config {
    key: "car"
    value {
      minimum_height: 4
      maximum_height: 9999
      minimum_width: 4
      maximum_width: 9999
    }
  }
  evaluation_box_config {
    key: "pedestrian"
    value {
      minimum_height: 4
      maximum_height: 9999
      minimum_width: 4
      maximum_width: 9999
    }
  }
  evaluation_box_config {
    key: "cyclist"
    value {
      minimum_height: 4
      maximum_height: 9999
      minimum_width: 4
      maximum_width: 9999
    }
  }
}
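
The preprocessing block in this spec fixes the network input at 1248x384, and earlier in the thread it was noted that images must be resized offline with their bounding boxes scaled accordingly. Below is a minimal sketch of the label half of that step. It assumes the standard KITTI label layout, where fields 4 to 7 of each line are the bbox left, top, right, bottom in pixels. `scale_kitti_label_line` is a hypothetical helper, not a TLT tool, and the image itself would still need to be resized to the same target size with an image library.

```python
def scale_kitti_label_line(line, src_size, dst_size):
    """Scale the 2D bbox of one KITTI label line from src_size to dst_size.

    src_size and dst_size are (width, height) tuples in pixels.
    Hypothetical helper for illustration; not part of TLT.
    """
    src_w, src_h = src_size
    dst_w, dst_h = dst_size
    sx = dst_w / float(src_w)  # horizontal scale factor
    sy = dst_h / float(src_h)  # vertical scale factor

    fields = line.split()
    # Assumed KITTI layout: fields 4..7 are bbox left, top, right, bottom.
    left, top, right, bottom = (float(v) for v in fields[4:8])
    fields[4:8] = [
        "%.2f" % (left * sx),
        "%.2f" % (top * sy),
        "%.2f" % (right * sx),
        "%.2f" % (bottom * sy),
    ]
    # All other fields (class, truncation, 3D info) are left untouched.
    return " ".join(fields)


# Made-up label line and source size, scaled to the 1248x384 size in the spec.
example = "car 0.00 0 0.00 100.00 50.00 200.00 150.00 1.50 1.60 3.90 1.00 1.70 20.00 0.00"
print(scale_kitti_label_line(example, (1242, 375), (1248, 384)))
```

The same two scale factors must be used for the image and its labels, otherwise the boxes drift away from the objects.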

xhuv_NV · March 12, 2020, 7:45am

Sorry, there might be an issue with num_layers: 18 in the spec. I’ll correct it and post updates.
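
The spec posted above pairs a resnet10.hdf5 pretrained file with num_layers: 18. A minimal sketch of a sanity check for that kind of mismatch, assuming the backbone depth can be read from a "resnetNN" substring in the pretrained file path. `check_backbone_depth` is a hypothetical helper that works on the spec text with regular expressions; it is not part of TLT and does not parse the full spec format.

```python
import re


def check_backbone_depth(spec_text):
    """Return (num_layers, depth_in_path, consistent) for a spec string.

    Hypothetical helper for illustration; not part of TLT.
    """
    layers = re.search(r"num_layers:\s*(\d+)", spec_text)
    path = re.search(r'pretrained_model_file:\s*"([^"]+)"', spec_text)
    if layers is None or path is None:
        raise ValueError("spec is missing num_layers or pretrained_model_file")

    num_layers = int(layers.group(1))
    # Assumption: the last "resnetNN" in the path names the backbone depth.
    depths = re.findall(r"resnet(\d+)", path.group(1))
    path_depth = int(depths[-1]) if depths else None
    return num_layers, path_depth, path_depth == num_layers


# The two relevant lines from the spec posted in this thread.
spec = '''
model_config {
  pretrained_model_file: "/workspace/pretrained_model/tlt_resnet10_detectnet_v2_v1/resnet10.hdf5"
  num_layers: 18
}
'''
num_layers, path_depth, ok = check_backbone_depth(spec)
if not ok:
    print("num_layers is %s but the pretrained file suggests %s" % (num_layers, path_depth))
```

Running a check like this before tlt-train would flag the mismatch before a long training run starts.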

xhuv_NV · March 12, 2020, 4:34pm

The error remains the same:

Using TensorFlow backend.
2020-03-12 16:13:47,534 [INFO] iva.detectnet_v2.spec_handler.spec_loader: Merging specification from /workspace/spec_files/train.txt
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2020-03-12 16:13:48,077 [WARNING] tensorflow: From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2020-03-12 16:13:48.729739: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-03-12 16:13:48.831875: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-12 16:13:48.832559: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x7664140 executing computations on platform CUDA. Devices:
2020-03-12 16:13:48.832594: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): GeForce RTX 2070, Compute Capability 7.5
2020-03-12 16:13:48.858352: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3408000000 Hz
2020-03-12 16:13:48.859118: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x76cce90 executing computations on platform Host. Devices:
2020-03-12 16:13:48.859139: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2020-03-12 16:13:48.859275: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.86
pciBusID: 0000:01:00.0
totalMemory: 7.76GiB freeMemory: 6.95GiB
2020-03-12 16:13:48.859292: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-03-12 16:13:48.860084: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-03-12 16:13:48.860097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2020-03-12 16:13:48.860105: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2020-03-12 16:13:48.860186: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6764 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
/usr/local/lib/python2.7/dist-packages/keras/engine/saving.py:292: UserWarning: No training configuration found in save file: the model was *not* compiled. Compile it manually.
  warnings.warn('No training configuration found in save file: '
WARNING:tensorflow:From ./detectnet_v2/dataloader/utilities.py:114: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`
2020-03-12 16:13:49,968 [WARNING] tensorflow: From ./detectnet_v2/dataloader/utilities.py:114: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
2020-03-12 16:13:56,039 [INFO] /usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/evaluation/build_evaluator.pyc: Found 1122 samples in validation set
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            (None, 3, 384, 1248) 0                                            
__________________________________________________________________________________________________
conv1 (Conv2D)                  (None, 64, 192, 624) 9472        input_1[0][0]                    
__________________________________________________________________________________________________
bn_conv1 (BatchNormalization)   (None, 64, 192, 624) 256         conv1[0][0]                      
__________________________________________________________________________________________________
activation_1 (Activation)       (None, 64, 192, 624) 0           bn_conv1[0][0]                   
__________________________________________________________________________________________________
block_1a_conv_1 (Conv2D)        (None, 64, 96, 312)  36928       activation_1[0][0]               
__________________________________________________________________________________________________
block_1a_bn_1 (BatchNormalizati (None, 64, 96, 312)  256         block_1a_conv_1[0][0]            
__________________________________________________________________________________________________
activation_2 (Activation)       (None, 64, 96, 312)  0           block_1a_bn_1[0][0]              
__________________________________________________________________________________________________
block_1a_conv_2 (Conv2D)        (None, 64, 96, 312)  36928       activation_2[0][0]               
__________________________________________________________________________________________________
block_1a_conv_shortcut (Conv2D) (None, 64, 96, 312)  4160        activation_1[0][0]               
__________________________________________________________________________________________________
block_1a_bn_2 (BatchNormalizati (None, 64, 96, 312)  256         block_1a_conv_2[0][0]            
__________________________________________________________________________________________________
block_1a_bn_shortcut (BatchNorm (None, 64, 96, 312)  256         block_1a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_1 (Add)                     (None, 64, 96, 312)  0           block_1a_bn_2[0][0]              
                                                                 block_1a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
activation_3 (Activation)       (None, 64, 96, 312)  0           add_1[0][0]                      
__________________________________________________________________________________________________
block_2a_conv_1 (Conv2D)        (None, 128, 48, 156) 73856       activation_3[0][0]               
__________________________________________________________________________________________________
block_2a_bn_1 (BatchNormalizati (None, 128, 48, 156) 512         block_2a_conv_1[0][0]            
__________________________________________________________________________________________________
activation_4 (Activation)       (None, 128, 48, 156) 0           block_2a_bn_1[0][0]              
__________________________________________________________________________________________________
block_2a_conv_2 (Conv2D)        (None, 128, 48, 156) 147584      activation_4[0][0]               
__________________________________________________________________________________________________
block_2a_conv_shortcut (Conv2D) (None, 128, 48, 156) 8320        activation_3[0][0]               
__________________________________________________________________________________________________
block_2a_bn_2 (BatchNormalizati (None, 128, 48, 156) 512         block_2a_conv_2[0][0]            
__________________________________________________________________________________________________
block_2a_bn_shortcut (BatchNorm (None, 128, 48, 156) 512         block_2a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_2 (Add)                     (None, 128, 48, 156) 0           block_2a_bn_2[0][0]              
                                                                 block_2a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
activation_5 (Activation)       (None, 128, 48, 156) 0           add_2[0][0]                      
__________________________________________________________________________________________________
block_3a_conv_1 (Conv2D)        (None, 256, 24, 78)  295168      activation_5[0][0]               
__________________________________________________________________________________________________
block_3a_bn_1 (BatchNormalizati (None, 256, 24, 78)  1024        block_3a_conv_1[0][0]            
__________________________________________________________________________________________________
activation_6 (Activation)       (None, 256, 24, 78)  0           block_3a_bn_1[0][0]              
__________________________________________________________________________________________________
block_3a_conv_2 (Conv2D)        (None, 256, 24, 78)  590080      activation_6[0][0]               
__________________________________________________________________________________________________
block_3a_conv_shortcut (Conv2D) (None, 256, 24, 78)  33024       activation_5[0][0]               
__________________________________________________________________________________________________
block_3a_bn_2 (BatchNormalizati (None, 256, 24, 78)  1024        block_3a_conv_2[0][0]            
__________________________________________________________________________________________________
block_3a_bn_shortcut (BatchNorm (None, 256, 24, 78)  1024        block_3a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_3 (Add)                     (None, 256, 24, 78)  0           block_3a_bn_2[0][0]              
                                                                 block_3a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
activation_7 (Activation)       (None, 256, 24, 78)  0           add_3[0][0]                      
__________________________________________________________________________________________________
block_4a_conv_1 (Conv2D)        (None, 512, 24, 78)  1180160     activation_7[0][0]               
__________________________________________________________________________________________________
block_4a_bn_1 (BatchNormalizati (None, 512, 24, 78)  2048        block_4a_conv_1[0][0]            
__________________________________________________________________________________________________
activation_8 (Activation)       (None, 512, 24, 78)  0           block_4a_bn_1[0][0]              
__________________________________________________________________________________________________
block_4a_conv_2 (Conv2D)        (None, 512, 24, 78)  2359808     activation_8[0][0]               
__________________________________________________________________________________________________
block_4a_conv_shortcut (Conv2D) (None, 512, 24, 78)  131584      activation_7[0][0]               
__________________________________________________________________________________________________
block_4a_bn_2 (BatchNormalizati (None, 512, 24, 78)  2048        block_4a_conv_2[0][0]            
__________________________________________________________________________________________________
block_4a_bn_shortcut (BatchNorm (None, 512, 24, 78)  2048        block_4a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_4 (Add)                     (None, 512, 24, 78)  0           block_4a_bn_2[0][0]              
                                                                 block_4a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
activation_9 (Activation)       (None, 512, 24, 78)  0           add_4[0][0]                      
__________________________________________________________________________________________________
output_bbox (Conv2D)            (None, 12, 24, 78)   6156        activation_9[0][0]               
__________________________________________________________________________________________________
output_cov (Conv2D)             (None, 3, 24, 78)    1539        activation_9[0][0]               
==================================================================================================
Total params: 4,926,543
Trainable params: 4,911,183
Non-trainable params: 15,360
__________________________________________________________________________________________________
INFO:tensorflow:Graph was finalized.
2020-03-12 16:14:03,516 [INFO] tensorflow: Graph was finalized.
2020-03-12 16:14:03.516861: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-03-12 16:14:03.516900: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-03-12 16:14:03.516909: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2020-03-12 16:14:03.516916: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2020-03-12 16:14:03.516994: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6764 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
INFO:tensorflow:Running local_init_op.
2020-03-12 16:14:04,576 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2020-03-12 16:14:04,847 [INFO] tensorflow: Done running local_init_op.
2020-03-12 16:14:06,510 [INFO] iva.detectnet_v2.evaluation.evaluation: step 0 / 71, 0.00s/step
2020-03-12 16:14:10.638651: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2020-03-12 16:14:11.514059: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-03-12 16:14:11.540502: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "/usr/local/bin/tlt-evaluate", line 10, in <module>
    sys.exit(main())
  File "./common/magnet_evaluate.py", line 38, in main
  File "</usr/local/lib/python2.7/dist-packages/decorator.pyc:decorator-gen-2>", line 2, in main
  File "./detectnet_v2/utilities/timer.py", line 46, in wrapped_fn
  File "./detectnet_v2/scripts/evaluate.py", line 126, in main
  File "./detectnet_v2/evaluation/evaluation.py", line 156, in evaluate
  File "./detectnet_v2/evaluation/evaluation.py", line 116, in _get_validation_iterator
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node resnet10_nopool_bn_detectnet_v2/conv1/convolution (defined at /opt/nvidia/third_party/keras/tensorflow_backend.py:93) ]]
	 [[node strided_slice_355 (defined at ./detectnet_v2/model/utilities.py:53) ]]

Caused by op u'resnet10_nopool_bn_detectnet_v2/conv1/convolution', defined at:
  File "/usr/local/bin/tlt-evaluate", line 10, in <module>
    sys.exit(main())
  File "./common/magnet_evaluate.py", line 38, in main
  File "</usr/local/lib/python2.7/dist-packages/decorator.pyc:decorator-gen-2>", line 2, in main
  File "./detectnet_v2/utilities/timer.py", line 46, in wrapped_fn
  File "./detectnet_v2/scripts/evaluate.py", line 119, in main
  File "./detectnet_v2/evaluation/build_evaluator.py", line 124, in build_evaluator_for_trained_gridbox
  File "./detectnet_v2/model/utilities.py", line 26, in _fn_wrapper
  File "./detectnet_v2/model/detectnet_model.py", line 617, in build_validation_graph
  File "./detectnet_v2/model/utilities.py", line 26, in _fn_wrapper
  File "./detectnet_v2/model/detectnet_model.py", line 577, in build_inference_graph
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/base_layer.py", line 457, in __call__
    output = self.call(inputs, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/network.py", line 564, in call
    output_tensors, _, _ = self.run_internal_graph(inputs, masks)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/network.py", line 721, in run_internal_graph
    layer.call(computed_tensor, **kwargs))
  File "/usr/local/lib/python2.7/dist-packages/keras/layers/convolutional.py", line 171, in call
    dilation_rate=self.dilation_rate)
  File "/opt/nvidia/third_party/keras/tensorflow_backend.py", line 93, in conv2d
    data_format=tf_data_format)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_ops.py", line 851, in convolution
    return op(input, filter)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_ops.py", line 966, in __call__
    return self.conv_op(inp, filter)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_ops.py", line 591, in __call__
    return self.call(inp, filter)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_ops.py", line 208, in __call__
    name=self.name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 1026, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node resnet10_nopool_bn_detectnet_v2/conv1/convolution (defined at /opt/nvidia/third_party/keras/tensorflow_backend.py:93) ]]
	 [[node strided_slice_355 (defined at ./detectnet_v2/model/utilities.py:53) ]]

Morganh · March 13, 2020, 4:40am

Please try https://devtalk.nvidia.com/default/topic/1066050/transfer-learning-toolkit/error-with-cudnn-when-attempting-to-perform-inference-after-training-an-ssd-model-with-tlt/post/5398505/#5398505
