Error with cuDNN when attempting to perform inference after training an SSD model with TLT

I trained an SSD model using TLT, and when I try to run inference with it I get the error shown below. It appears that cuDNN failed to initialize, but I can’t work out why.

Can anyone comment as to what has caused this error and/or how to get past it? Thanks in advance for any insight or suggestions.

# tlt-infer ssd -i test_images/handgun_shooter -o test_images/output -e specs/ssd_resnet10_weapons_train.txt -m output/ssd_20191104_unpruned/weights/ssd_resnet10_epoch_225.tlt -k ${NGC_API_KEY}
Using TensorFlow backend.
2019-11-05 16:39:25,879 [INFO] iva.ssd.scripts.inference: Loading experiment spec at specs/ssd_resnet10_weapons_train.txt.
2019-11-05 16:39:25,881 [INFO] /usr/local/lib/python2.7/dist-packages/iva/ssd/utils/spec_loader.pyc: Merging specification from specs/ssd_resnet10_weapons_train.txt
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-11-05 16:39:26,260 [WARNING] tensorflow: From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-11-05 16:39:27.240404: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-11-05 16:39:27.347941: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-05 16:39:27.349339: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x946f980 executing computations on platform CUDA. Devices:
2019-11-05 16:39:27.349356: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): GeForce GTX 1660 Ti, Compute Capability 7.5
2019-11-05 16:39:27.374523: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3000000000 Hz
2019-11-05 16:39:27.375076: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x94d86a0 executing computations on platform Host. Devices:
2019-11-05 16:39:27.375114: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-11-05 16:39:27.375235: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce GTX 1660 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:01:00.0
totalMemory: 5.77GiB freeMemory: 4.88GiB
2019-11-05 16:39:27.375249: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-11-05 16:39:27.376397: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-05 16:39:27.376412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-11-05 16:39:27.376420: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-11-05 16:39:27.376663: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4699 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
/usr/local/lib/python2.7/dist-packages/keras/engine/saving.py:292: UserWarning: No training configuration found in save file: the model was *not* compiled. Compile it manually.
  warnings.warn('No training configuration found in save file: '
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
Input (InputLayer)              (None, 3, 768, 1024) 0                                            
__________________________________________________________________________________________________
conv1 (Conv2D)                  (None, 64, 384, 512) 9472        Input[0][0]                      
__________________________________________________________________________________________________
bn_conv1 (BatchNormalization)   (None, 64, 384, 512) 256         conv1[0][0]                      
__________________________________________________________________________________________________
activation_19 (Activation)      (None, 64, 384, 512) 0           bn_conv1[0][0]                   
__________________________________________________________________________________________________
block_1a_conv_1 (Conv2D)        (None, 64, 192, 256) 36928       activation_19[0][0]              
__________________________________________________________________________________________________
block_1a_bn_1 (BatchNormalizati (None, 64, 192, 256) 256         block_1a_conv_1[0][0]            
__________________________________________________________________________________________________
activation_20 (Activation)      (None, 64, 192, 256) 0           block_1a_bn_1[0][0]              
__________________________________________________________________________________________________
block_1a_conv_2 (Conv2D)        (None, 64, 192, 256) 36928       activation_20[0][0]              
__________________________________________________________________________________________________
block_1a_conv_shortcut (Conv2D) (None, 64, 192, 256) 4160        activation_19[0][0]              
__________________________________________________________________________________________________
block_1a_bn_2 (BatchNormalizati (None, 64, 192, 256) 256         block_1a_conv_2[0][0]            
__________________________________________________________________________________________________
block_1a_bn_shortcut (BatchNorm (None, 64, 192, 256) 256         block_1a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_9 (Add)                     (None, 64, 192, 256) 0           block_1a_bn_2[0][0]              
                                                                 block_1a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
activation_21 (Activation)      (None, 64, 192, 256) 0           add_9[0][0]                      
__________________________________________________________________________________________________
block_2a_conv_1 (Conv2D)        (None, 128, 96, 128) 73856       activation_21[0][0]              
__________________________________________________________________________________________________
block_2a_bn_1 (BatchNormalizati (None, 128, 96, 128) 512         block_2a_conv_1[0][0]            
__________________________________________________________________________________________________
activation_22 (Activation)      (None, 128, 96, 128) 0           block_2a_bn_1[0][0]              
__________________________________________________________________________________________________
block_2a_conv_2 (Conv2D)        (None, 128, 96, 128) 147584      activation_22[0][0]              
__________________________________________________________________________________________________
block_2a_conv_shortcut (Conv2D) (None, 128, 96, 128) 8320        activation_21[0][0]              
__________________________________________________________________________________________________
block_2a_bn_2 (BatchNormalizati (None, 128, 96, 128) 512         block_2a_conv_2[0][0]            
__________________________________________________________________________________________________
block_2a_bn_shortcut (BatchNorm (None, 128, 96, 128) 512         block_2a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_10 (Add)                    (None, 128, 96, 128) 0           block_2a_bn_2[0][0]              
                                                                 block_2a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
activation_23 (Activation)      (None, 128, 96, 128) 0           add_10[0][0]                     
__________________________________________________________________________________________________
block_3a_conv_1 (Conv2D)        (None, 256, 48, 64)  295168      activation_23[0][0]              
__________________________________________________________________________________________________
block_3a_bn_1 (BatchNormalizati (None, 256, 48, 64)  1024        block_3a_conv_1[0][0]            
__________________________________________________________________________________________________
activation_24 (Activation)      (None, 256, 48, 64)  0           block_3a_bn_1[0][0]              
__________________________________________________________________________________________________
block_3a_conv_2 (Conv2D)        (None, 256, 48, 64)  590080      activation_24[0][0]              
__________________________________________________________________________________________________
block_3a_conv_shortcut (Conv2D) (None, 256, 48, 64)  33024       activation_23[0][0]              
__________________________________________________________________________________________________
block_3a_bn_2 (BatchNormalizati (None, 256, 48, 64)  1024        block_3a_conv_2[0][0]            
__________________________________________________________________________________________________
block_3a_bn_shortcut (BatchNorm (None, 256, 48, 64)  1024        block_3a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_11 (Add)                    (None, 256, 48, 64)  0           block_3a_bn_2[0][0]              
                                                                 block_3a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
activation_25 (Activation)      (None, 256, 48, 64)  0           add_11[0][0]                     
__________________________________________________________________________________________________
block_4a_conv_1 (Conv2D)        (None, 512, 48, 64)  1180160     activation_25[0][0]              
__________________________________________________________________________________________________
block_4a_bn_1 (BatchNormalizati (None, 512, 48, 64)  2048        block_4a_conv_1[0][0]            
__________________________________________________________________________________________________
activation_26 (Activation)      (None, 512, 48, 64)  0           block_4a_bn_1[0][0]              
__________________________________________________________________________________________________
block_4a_conv_2 (Conv2D)        (None, 512, 48, 64)  2359808     activation_26[0][0]              
__________________________________________________________________________________________________
block_4a_conv_shortcut (Conv2D) (None, 512, 48, 64)  131584      activation_25[0][0]              
__________________________________________________________________________________________________
block_4a_bn_2 (BatchNormalizati (None, 512, 48, 64)  2048        block_4a_conv_2[0][0]            
__________________________________________________________________________________________________
block_4a_bn_shortcut (BatchNorm (None, 512, 48, 64)  2048        block_4a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_12 (Add)                    (None, 512, 48, 64)  0           block_4a_bn_2[0][0]              
                                                                 block_4a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
activation_27 (Activation)      (None, 512, 48, 64)  0           add_12[0][0]                     
__________________________________________________________________________________________________
expand_conv1 (Conv2D)           (None, 1024, 48, 64) 4719616     activation_27[0][0]              
__________________________________________________________________________________________________
expand1_relu (ReLU)             (None, 1024, 48, 64) 0           expand_conv1[0][0]               
__________________________________________________________________________________________________
expand_conv2 (Conv2D)           (None, 1024, 48, 64) 1049600     expand1_relu[0][0]               
__________________________________________________________________________________________________
expand2_relu (ReLU)             (None, 1024, 48, 64) 0           expand_conv2[0][0]               
__________________________________________________________________________________________________
additional_map0_0 (Conv2D)      (None, 256, 48, 64)  262400      expand2_relu[0][0]               
__________________________________________________________________________________________________
additional_map0_0_relu (ReLU)   (None, 256, 48, 64)  0           additional_map0_0[0][0]          
__________________________________________________________________________________________________
additional_map0_1 (Conv2D)      (None, 512, 24, 32)  1180160     additional_map0_0_relu[0][0]     
__________________________________________________________________________________________________
additional_map0_1_relu (ReLU)   (None, 512, 24, 32)  0           additional_map0_1[0][0]          
__________________________________________________________________________________________________
additional_map1_0 (Conv2D)      (None, 128, 24, 32)  65664       additional_map0_1_relu[0][0]     
__________________________________________________________________________________________________
additional_map1_0_relu (ReLU)   (None, 128, 24, 32)  0           additional_map1_0[0][0]          
__________________________________________________________________________________________________
additional_map1_1 (Conv2D)      (None, 256, 12, 16)  295168      additional_map1_0_relu[0][0]     
__________________________________________________________________________________________________
additional_map1_1_relu (ReLU)   (None, 256, 12, 16)  0           additional_map1_1[0][0]          
__________________________________________________________________________________________________
additional_map2_0 (Conv2D)      (None, 128, 12, 16)  32896       additional_map1_1_relu[0][0]     
__________________________________________________________________________________________________
additional_map2_0_relu (ReLU)   (None, 128, 12, 16)  0           additional_map2_0[0][0]          
__________________________________________________________________________________________________
additional_map2_1 (Conv2D)      (None, 256, 6, 8)    295168      additional_map2_0_relu[0][0]     
__________________________________________________________________________________________________
additional_map2_1_relu (ReLU)   (None, 256, 6, 8)    0           additional_map2_1[0][0]          
__________________________________________________________________________________________________
additional_map3_0 (Conv2D)      (None, 128, 6, 8)    32896       additional_map2_1_relu[0][0]     
__________________________________________________________________________________________________
additional_map3_0_relu (ReLU)   (None, 128, 6, 8)    0           additional_map3_0[0][0]          
__________________________________________________________________________________________________
additional_map3_1 (Conv2D)      (None, 256, 3, 4)    295168      additional_map3_0_relu[0][0]     
__________________________________________________________________________________________________
additional_map3_1_relu (ReLU)   (None, 256, 3, 4)    0           additional_map3_1[0][0]          
__________________________________________________________________________________________________
ssd_conf_0 (Conv2D)             (None, 12, 96, 128)  13836       activation_23[0][0]              
__________________________________________________________________________________________________
ssd_conf_1 (Conv2D)             (None, 12, 48, 64)   55308       activation_27[0][0]              
__________________________________________________________________________________________________
ssd_conf_2 (Conv2D)             (None, 12, 24, 32)   55308       additional_map0_1_relu[0][0]     
__________________________________________________________________________________________________
ssd_conf_3 (Conv2D)             (None, 12, 12, 16)   27660       additional_map1_1_relu[0][0]     
__________________________________________________________________________________________________
ssd_conf_4 (Conv2D)             (None, 12, 6, 8)     27660       additional_map2_1_relu[0][0]     
__________________________________________________________________________________________________
ssd_conf_5 (Conv2D)             (None, 12, 3, 4)     27660       additional_map3_1_relu[0][0]     
__________________________________________________________________________________________________
permute_25 (Permute)            (None, 96, 128, 12)  0           ssd_conf_0[0][0]                 
__________________________________________________________________________________________________
permute_27 (Permute)            (None, 48, 64, 12)   0           ssd_conf_1[0][0]                 
__________________________________________________________________________________________________
permute_29 (Permute)            (None, 24, 32, 12)   0           ssd_conf_2[0][0]                 
__________________________________________________________________________________________________
permute_31 (Permute)            (None, 12, 16, 12)   0           ssd_conf_3[0][0]                 
__________________________________________________________________________________________________
permute_33 (Permute)            (None, 6, 8, 12)     0           ssd_conf_4[0][0]                 
__________________________________________________________________________________________________
permute_35 (Permute)            (None, 3, 4, 12)     0           ssd_conf_5[0][0]                 
__________________________________________________________________________________________________
ssd_loc_0 (Conv2D)              (None, 24, 96, 128)  27672       activation_23[0][0]              
__________________________________________________________________________________________________
ssd_loc_1 (Conv2D)              (None, 24, 48, 64)   110616      activation_27[0][0]              
__________________________________________________________________________________________________
ssd_loc_2 (Conv2D)              (None, 24, 24, 32)   110616      additional_map0_1_relu[0][0]     
__________________________________________________________________________________________________
ssd_loc_3 (Conv2D)              (None, 24, 12, 16)   55320       additional_map1_1_relu[0][0]     
__________________________________________________________________________________________________
ssd_loc_4 (Conv2D)              (None, 24, 6, 8)     55320       additional_map2_1_relu[0][0]     
__________________________________________________________________________________________________
ssd_loc_5 (Conv2D)              (None, 24, 3, 4)     55320       additional_map3_1_relu[0][0]     
__________________________________________________________________________________________________
conf_reshape_0 (Reshape)        (None, 73728, 1, 2)  0           permute_25[0][0]                 
__________________________________________________________________________________________________
conf_reshape_1 (Reshape)        (None, 18432, 1, 2)  0           permute_27[0][0]                 
__________________________________________________________________________________________________
conf_reshape_2 (Reshape)        (None, 4608, 1, 2)   0           permute_29[0][0]                 
__________________________________________________________________________________________________
conf_reshape_3 (Reshape)        (None, 1152, 1, 2)   0           permute_31[0][0]                 
__________________________________________________________________________________________________
conf_reshape_4 (Reshape)        (None, 288, 1, 2)    0           permute_33[0][0]                 
__________________________________________________________________________________________________
conf_reshape_5 (Reshape)        (None, 72, 1, 2)     0           permute_35[0][0]                 
__________________________________________________________________________________________________
permute_26 (Permute)            (None, 96, 128, 24)  0           ssd_loc_0[0][0]                  
__________________________________________________________________________________________________
permute_28 (Permute)            (None, 48, 64, 24)   0           ssd_loc_1[0][0]                  
__________________________________________________________________________________________________
permute_30 (Permute)            (None, 24, 32, 24)   0           ssd_loc_2[0][0]                  
__________________________________________________________________________________________________
permute_32 (Permute)            (None, 12, 16, 24)   0           ssd_loc_3[0][0]                  
__________________________________________________________________________________________________
permute_34 (Permute)            (None, 6, 8, 24)     0           ssd_loc_4[0][0]                  
__________________________________________________________________________________________________
permute_36 (Permute)            (None, 3, 4, 24)     0           ssd_loc_5[0][0]                  
__________________________________________________________________________________________________
ssd_anchor_0 (AnchorBoxes)      (None, 12288, 6, 8)  0           ssd_loc_0[0][0]                  
__________________________________________________________________________________________________
ssd_anchor_1 (AnchorBoxes)      (None, 3072, 6, 8)   0           ssd_loc_1[0][0]                  
__________________________________________________________________________________________________
ssd_anchor_2 (AnchorBoxes)      (None, 768, 6, 8)    0           ssd_loc_2[0][0]                  
__________________________________________________________________________________________________
ssd_anchor_3 (AnchorBoxes)      (None, 192, 6, 8)    0           ssd_loc_3[0][0]                  
__________________________________________________________________________________________________
ssd_anchor_4 (AnchorBoxes)      (None, 48, 6, 8)     0           ssd_loc_4[0][0]                  
__________________________________________________________________________________________________
ssd_anchor_5 (AnchorBoxes)      (None, 12, 6, 8)     0           ssd_loc_5[0][0]                  
__________________________________________________________________________________________________
mbox_conf (Concatenate)         (None, 98280, 1, 2)  0           conf_reshape_0[0][0]             
                                                                 conf_reshape_1[0][0]             
                                                                 conf_reshape_2[0][0]             
                                                                 conf_reshape_3[0][0]             
                                                                 conf_reshape_4[0][0]             
                                                                 conf_reshape_5[0][0]             
__________________________________________________________________________________________________
loc_reshape_0 (Reshape)         (None, 73728, 1, 4)  0           permute_26[0][0]                 
__________________________________________________________________________________________________
loc_reshape_1 (Reshape)         (None, 18432, 1, 4)  0           permute_28[0][0]                 
__________________________________________________________________________________________________
loc_reshape_2 (Reshape)         (None, 4608, 1, 4)   0           permute_30[0][0]                 
__________________________________________________________________________________________________
loc_reshape_3 (Reshape)         (None, 1152, 1, 4)   0           permute_32[0][0]                 
__________________________________________________________________________________________________
loc_reshape_4 (Reshape)         (None, 288, 1, 4)    0           permute_34[0][0]                 
__________________________________________________________________________________________________
loc_reshape_5 (Reshape)         (None, 72, 1, 4)     0           permute_36[0][0]                 
__________________________________________________________________________________________________
anchor_reshape_0 (Reshape)      (None, 73728, 1, 8)  0           ssd_anchor_0[0][0]               
__________________________________________________________________________________________________
anchor_reshape_1 (Reshape)      (None, 18432, 1, 8)  0           ssd_anchor_1[0][0]               
__________________________________________________________________________________________________
anchor_reshape_2 (Reshape)      (None, 4608, 1, 8)   0           ssd_anchor_2[0][0]               
__________________________________________________________________________________________________
anchor_reshape_3 (Reshape)      (None, 1152, 1, 8)   0           ssd_anchor_3[0][0]               
__________________________________________________________________________________________________
anchor_reshape_4 (Reshape)      (None, 288, 1, 8)    0           ssd_anchor_4[0][0]               
__________________________________________________________________________________________________
anchor_reshape_5 (Reshape)      (None, 72, 1, 8)     0           ssd_anchor_5[0][0]               
__________________________________________________________________________________________________
mbox_conf_sigmoid (Activation)  (None, 98280, 1, 2)  0           mbox_conf[0][0]                  
__________________________________________________________________________________________________
mbox_loc (Concatenate)          (None, 98280, 1, 4)  0           loc_reshape_0[0][0]              
                                                                 loc_reshape_1[0][0]              
                                                                 loc_reshape_2[0][0]              
                                                                 loc_reshape_3[0][0]              
                                                                 loc_reshape_4[0][0]              
                                                                 loc_reshape_5[0][0]              
__________________________________________________________________________________________________
mbox_priorbox (Concatenate)     (None, 98280, 1, 8)  0           anchor_reshape_0[0][0]           
                                                                 anchor_reshape_1[0][0]           
                                                                 anchor_reshape_2[0][0]           
                                                                 anchor_reshape_3[0][0]           
                                                                 anchor_reshape_4[0][0]           
                                                                 anchor_reshape_5[0][0]           
__________________________________________________________________________________________________
concatenate_3 (Concatenate)     (None, 98280, 1, 14) 0           mbox_conf_sigmoid[0][0]          
                                                                 mbox_loc[0][0]                   
                                                                 mbox_priorbox[0][0]              
__________________________________________________________________________________________________
ssd_predictions (Reshape)       (None, 98280, 14)    0           concatenate_3[0][0]              
==================================================================================================
Total params: 13,769,880
Trainable params: 13,754,520
Non-trainable params: 15,360
__________________________________________________________________________________________________
WARNING:tensorflow:From ./ssd/box_coder/output_decoder_layer.py:83: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
2019-11-05 16:39:29,109 [WARNING] tensorflow: From ./ssd/box_coder/output_decoder_layer.py:83: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
  0%|          | 0/133 [00:00<?, ?it/s]2019-11-05 16:39:30.768956: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-11-05 16:39:30.777215: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
  0%|          | 0/133 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/bin/tlt-infer", line 10, in <module>
    sys.exit(main())
  File "./common/magnet_infer.py", line 32, in main
  File "./ssd/scripts/inference.py", line 173, in main
  File "./ssd/scripts/inference.py", line 141, in inference
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1169, in predict
    steps=steps)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training_arrays.py", line 294, in predict_loop
    batch_outs = f(ins_batch)
  File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1439, in __call__
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[{{node model_1/conv1/convolution}}]]

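This turned out to be a GPU out-of-memory issue. By default TensorFlow reserves most of the GPU's memory up front (the log above shows the device being created with 4699 MB), which most likely left too little free memory for cuDNN to create its handle. Setting the TF_FORCE_GPU_ALLOW_GROWTH environment variable to true, so that TensorFlow allocates GPU memory on demand instead, solved it: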

$ export TF_FORCE_GPU_ALLOW_GROWTH=true
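
If you would rather not export the variable for the whole shell session, it can also be set for a single command only. The following is a minimal sketch, not something from the TLT documentation: `run_with_gpu_growth` is a hypothetical helper name, and the `printenv` call is just a harmless way to confirm that the variable reaches the child process.

```shell
# Hypothetical helper: run one command with TF_FORCE_GPU_ALLOW_GROWTH=true
# in that command's environment only. For external programs such as
# tlt-infer, the variable is not left set in the calling shell.
run_with_gpu_growth() {
    TF_FORCE_GPU_ALLOW_GROWTH=true "$@"
}

# Sanity check: the child process should see the value "true".
run_with_gpu_growth printenv TF_FORCE_GPU_ALLOW_GROWTH
```

With the helper defined, the original command would be run as `run_with_gpu_growth tlt-infer ssd ...` with the same arguments as in the question.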