Issues using Retail Object Detection as pre-trained weights

Hi, I’m running into two issues when trying to use the NVIDIA Retail Object Detection model as pre-trained weights:

  1. I’m not able to train with multiple GPUs.
  2. With --gpus 1, the first epoch completes, but training fails at the start of the second epoch, apparently from running out of memory.

Any suggestions would be greatly appreciated. Thanks!

• Machine: GCP Vertex Notebook (Debian 10 + python 3.7 + Driver Version: 510.47.03 + CUDA Version: 11.6)
• Hardware: V100 x 2
• Network Type: EfficientDet TF2
• TLT Version: nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf2.9.1
• Training spec file

data:
  loader:
    prefetch_size: 4
    shuffle_file: True
  max_instances_per_image: 100
  skip_crowd_during_training: True
  image_size: '640x640'
  num_classes: 4
  train_tfrecords:
    - '/workspace/efficientdet/tfrecords/train/train-*'
  val_tfrecords:
    - '/workspace/efficientdet/tfrecords/val/val-*'
  val_json_file: '/workspace/efficientdet/datasets/ap_od_03292023_val/labels.json'
train:
  optimizer:
    name: 'sgd'
    momentum: 0.9
  lr_schedule:
    name: 'cosine'
    warmup_epoch: 5
    warmup_init: 0.0001
    learning_rate: 0.2
  amp: True
  checkpoint: "/workspace/efficientdet/efficientdet-d5_038.tlt"
  num_examples_per_epoch: 26972
  moving_average_decay: 0.999
  batch_size: 1
  checkpoint_interval: 10
  l2_weight_decay: 0.00004
  l1_weight_decay: 0.0
  clip_gradients_norm: 10.0
  image_preview: True
  qat: False
  random_seed: 42
  pruned_model_path: ''
  num_epochs: 200
model:
  name: 'efficientdet-d5'
  aspect_ratios: '[(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)]'
  anchor_scale: 4
  min_level: 3
  max_level: 7
  num_scales: 3
  freeze_bn: False
  freeze_blocks: []
augment:
  rand_hflip: True
  random_crop_min_scale: 0.1
  random_crop_max_scale: 2
  auto_color_distortion: False
  auto_translate_xy: True
evaluate:
  batch_size: 1
  num_samples: 4391
  max_detections_per_image: 100
  model_path: ''
export:
  max_batch_size: 8
  dynamic_batch_size: True
  min_score_thresh: 0.4
  model_path: ""
  output_path: ""
inference:
  model_path: ""
  image_dir: ""
  output_dir: ""
  dump_label: False
  batch_size: 1
prune:
  model_path: ""
  normalizer: 'max'
  output_path: ""
  equalization_criterion: 'union'
  granularity: 8
  threshold: 0.5
  min_num_filters: 16
  excluded_layers: []
key: 'nvidia-tlt'
results_dir: '/workspace/efficientdet/experiment_dir_unpruned'
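As a sanity check on the spec above, here is a small sketch of how the number of steps per epoch follows from num_examples_per_epoch and batch_size. The per-GPU division is my assumption of how work is split across ranks in multi-GPU training, not something confirmed by the TAO docs:

```python
import math

# Values taken from the spec file above.
num_examples_per_epoch = 26972
batch_size = 1   # per-GPU batch size from train.batch_size
num_gpus = 2     # V100 x 2 (assumed split across ranks)

# Assumed: each rank processes num_examples_per_epoch / num_gpus examples,
# so adding GPUs shortens the epoch rather than the total work.
steps_per_epoch = math.ceil(num_examples_per_epoch / (batch_size * num_gpus))
print(steps_per_epoch)  # 13486
```

With batch_size 1 this means roughly 13.5k optimizer steps per epoch per GPU, which is why a single epoch already takes a long time on this dataset.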

• How to reproduce the issue?

  • command:
docker run -it --rm --gpus all -v /home/jupyter:/workspace --shm-size=32gb nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf2.9.1 efficientdet_tf2 train -e /workspace/efficientdet/specs/train.yaml --gpus 1
  • multi-GPU error log:
 class-3-bn-4 (BatchNormalization)  (None, 40, 40, 288)  1152   ['class-3[1][0]']
 class-3-bn-5 (BatchNormalization)  (None, 20, 20, 288)  1152   ['class-3[2][0]']
 class-3-bn-6 (BatchNormalization)  (None, 10, 10, 288)  1152   ['class-3[3][0]']
 class-3-bn-7 (BatchNormalization)  (None, 5, 5, 288)    1152   ['class-3[4][0]']
 box-3-bn-3 (BatchNormalization)    (None, 80, 80, 288)  1152   ['box-3[0][0]']
 box-3-bn-4 (BatchNormalization)    (None, 40, 40, 288)  1152   ['box-3[1][0]']
 box-3-bn-5 (BatchNormalization)    (None, 20, 20, 288)  1152   ['box-3[2][0]']
 box-3-bn-6 (BatchNormalization)    (None, 10, 10, 288)  1152   ['box-3[3][0]']
 box-3-bn-7 (BatchNormalization)    (None, 5, 5, 288)    1152   ['box-3[4][0]']
 activation_59 (Activation)         (None, 80, 80, 288)  0      ['class-3-bn-3[0][0]']
 activation_63 (Activation)         (None, 40, 40, 288)  0      ['class-3-bn-4[0][0]']
 activation_67 (Activation)         (None, 20, 20, 288)  0      ['class-3-bn-5[0][0]']
 activation_71 (Activation)         (None, 10, 10, 288)  0      ['class-3-bn-6[0][0]']
 activation_75 (Activation)         (None, 5, 5, 288)    0      ['class-3-bn-7[0][0]']
 activation_79 (Activation)         (None, 80, 80, 288)  0      ['box-3-bn-3[0][0]']
 activation_83 (Activation)         (None, 40, 40, 288)  0      ['box-3-bn-4[0][0]']
 activation_87 (Activation)         (None, 20, 20, 288)  0      ['box-3-bn-5[0][0]']
 activation_91 (Activation)         (None, 10, 10, 288)  0      ['box-3-bn-6[0][0]']
 activation_95 (Activation)         (None, 5, 5, 288)    0      ['box-3-bn-7[0][0]']
 class-predict (SeparableConv2D)    multiple             12996  ['activation_59[0][0]',
                                                                 'activation_63[0][0]',
                                                                 'activation_67[0][0]',
                                                                 'activation_71[0][0]',
                                                                 'activation_75[0][0]']
 box-predict (SeparableConv2D)      multiple             12996  ['activation_79[0][0]',
                                                                 'activation_83[0][0]',
                                                                 'activation_87[0][0]',
                                                                 'activation_91[0][0]',
                                                                 'activation_95[0][0]']
==================================================================================================
Total params: 33,657,021
Trainable params: 33,429,629
Non-trainable params: 227,392
__________________________________________________________________________________________________
LR schedule method: cosine
Use SGD optimizer
/usr/local/lib/python3.8/dist-packages/keras/backend.py:450: UserWarning: `tf.keras.backend.set_learning_phase` is deprecated and will be removed after 2020-10-11. To update it, simply pass a True/False value to the `training` argument of the `__call__` method of your layer or model.
  warnings.warn('`tf.keras.backend.set_learning_phase` is deprecated and '
WARNING:tensorflow:`period` argument is deprecated. Please use `save_freq` to specify the frequency in number of batches seen.
Epoch 1/200
WARNING:tensorflow:AutoGraph could not transform <function HvdMovingAverage.update_average.<locals>._update at 0x7f89f00b0e50> and will run it as-is.
Cause: Unable to locate the source code of <function HvdMovingAverage.update_average.<locals>._update at 0x7f89f00b0e50>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <function HvdMovingAverage.update_average.<locals>._apply_moving at 0x7f89f00b0670> and will run it as-is.
Cause: Unable to locate the source code of <function HvdMovingAverage.update_average.<locals>._apply_moving at 0x7f89f00b0670>. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
[the same AutoGraph warning repeats for `_update` and `_apply_moving` at 0x7f6534255430, 0x7f655b1dc8b0, 0x7f88be7adaf0, 0x7f88be7adee0, 0x7f655ac5ce50, and 0x7f655ab83430]
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node 30fb12a8e965 exited on signal 9 (Killed).
--------------------------------------------------------------------------
Sending telemetry data.
Telemetry data couldn't be sent, but the command ran successfully.
[Error]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL
  • epoch 2 error log:
activation_58 (Activation)     (None, 80, 80, 288)  0           ['class-2-bn-3[0][0]']                                                                     

activation_62 (Activation)     (None, 40, 40, 288)  0           ['class-2-bn-4[0][0]']                                                                       

activation_66 (Activation)     (None, 20, 20, 288)  0           ['class-2-bn-5[0][0]']                                                                    

activation_70 (Activation)     (None, 10, 10, 288)  0           ['class-2-bn-6[0][0]']                                                                       

activation_74 (Activation)     (None, 5, 5, 288)    0           ['class-2-bn-7[0][0]']                                                                       

activation_78 (Activation)     (None, 80, 80, 288)  0           ['box-2-bn-3[0][0]']                                                                         

activation_82 (Activation)     (None, 40, 40, 288)  0           ['box-2-bn-4[0][0]']                                                                         

activation_86 (Activation)     (None, 20, 20, 288)  0           ['box-2-bn-5[0][0]']                                                                         

activation_90 (Activation)     (None, 10, 10, 288)  0           ['box-2-bn-6[0][0]']                                                                         

activation_94 (Activation)     (None, 5, 5, 288)    0           ['box-2-bn-7[0][0]']                                                                         

class-3 (SeparableConv2D)      multiple             85824       ['activation_58[0][0]',                                                                    
                                                                 'activation_62[0][0]',                                                                     
                                                                 'activation_66[0][0]',                                                                    
                                                                 'activation_70[0][0]',                                                                     
                                                                 'activation_74[0][0]']                                                                     

box-3 (SeparableConv2D)        multiple             85824       ['activation_78[0][0]',                                                                    
                                                                 'activation_82[0][0]',                                                                     
                                                                 'activation_86[0][0]',                                                                     
                                                                 'activation_90[0][0]',                                                                     
                                                                 'activation_94[0][0]']                                                                     

class-3-bn-3 (BatchNormalizati  (None, 80, 80, 288)  1152       ['class-3[0][0]']                                                                           
on)                                                                                                                                                         

class-3-bn-4 (BatchNormalizati  (None, 40, 40, 288)  1152       ['class-3[1][0]']                                                                          
on)                                                                                    

class-3-bn-5 (BatchNormalizati  (None, 20, 20, 288)  1152       ['class-3[2][0]']                                                                          
on)                                                      

class-3-bn-6 (BatchNormalizati  (None, 10, 10, 288)  1152       ['class-3[3][0]']                                                                          
on)                                                                                              

class-3-bn-7 (BatchNormalizati  (None, 5, 5, 288)   1152        ['class-3[4][0]']
on)                                                                                                                          
                                                                                                           
box-3-bn-3 (BatchNormalization  (None, 80, 80, 288)  1152       ['box-3[0][0]']                                              
)                                                                                                                                                           

box-3-bn-4 (BatchNormalization  (None, 40, 40, 288)  1152       ['box-3[1][0]']                                                                            
)                                                                                                                                                           

box-3-bn-5 (BatchNormalization  (None, 20, 20, 288)  1152       ['box-3[2][0]']                                                                            
)                                                                                                                                                           

box-3-bn-6 (BatchNormalization  (None, 10, 10, 288)  1152       ['box-3[3][0]']                                                                            
)                                                                                                                                                           

box-3-bn-7 (BatchNormalization  (None, 5, 5, 288)   1152        ['box-3[4][0]']                                                                            
)                                                                                                                                                           

activation_59 (Activation)     (None, 80, 80, 288)  0           ['class-3-bn-3[0][0]']                                                                       

activation_63 (Activation)     (None, 40, 40, 288)  0           ['class-3-bn-4[0][0]']                                                                       

activation_67 (Activation)     (None, 20, 20, 288)  0           ['class-3-bn-5[0][0]']                                                                       

activation_71 (Activation)     (None, 10, 10, 288)  0           ['class-3-bn-6[0][0]']                                                                       

activation_75 (Activation)     (None, 5, 5, 288)    0           ['class-3-bn-7[0][0]']                                                                       

activation_79 (Activation)     (None, 80, 80, 288)  0           ['box-3-bn-3[0][0]']                                                                         

activation_83 (Activation)     (None, 40, 40, 288)  0           ['box-3-bn-4[0][0]']                                                                         

activation_87 (Activation)     (None, 20, 20, 288)  0           ['box-3-bn-5[0][0]']                                                                         

activation_91 (Activation)     (None, 10, 10, 288)  0           ['box-3-bn-6[0][0]']                                                                         

activation_95 (Activation)     (None, 5, 5, 288)    0           ['box-3-bn-7[0][0]']                                                                         

class-predict (SeparableConv2D  multiple            12996       ['activation_59[0][0]',                                                                    
)                                                                'activation_63[0][0]',                                                                     
                                                                 'activation_67[0][0]',                                                                     
                                                                 'activation_71[0][0]',                                                                     
                                                                 'activation_75[0][0]']                                                                    

box-predict (SeparableConv2D)  multiple             12996       ['activation_79[0][0]',                                                                     
                                                                 'activation_83[0][0]',                                                                     
                                                                 'activation_87[0][0]',                                                                     
                                                                 'activation_91[0][0]',
                                                                 'activation_95[0][0]']
                                                                                                                                                             

==================================================================================================                                                          
Total params: 33,657,021                                  
Trainable params: 33,429,629                                                                                                                                
Non-trainable params: 227,392                                                                                                                               

__________________________________________________________________________________________________
LR schedule method: cosine                                                                                                                                   

Use SGD optimizer                                                   
WARNING:tensorflow:`period` argument is deprecated. Please use `save_freq` to specify the frequency in number of batches seen.
WARNING:tensorflow:`period` argument is deprecated. Please use `save_freq` to specify the frequency in number of batches seen.

Epoch 1/200                              

WARNING:tensorflow:AutoGraph could not transform <function HvdMovingAverage.update_average.<locals>._update at 0x7fac1430e430> and will run it as-is.       
Cause: Unable to locate the source code of <function HvdMovingAverage.update_average.<locals>._update at 0x7fac1430e430>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code   

To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <function HvdMovingAverage.update_average.<locals>._apply_moving at 0x7fac4da988b0> and will run it as-is. 

Cause: Unable to locate the source code of <function HvdMovingAverage.update_average.<locals>._apply_moving at 0x7fac4da988b0>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code                          

To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <function HvdMovingAverage.update_average.<locals>._update at 0x7fac4d557e50> and will run it as-is.      
Cause: Unable to locate the source code of <function HvdMovingAverage.update_average.<locals>._update at 0x7fac4d557e50>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code  

To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <function HvdMovingAverage.update_average.<locals>._apply_moving at 0x7fac4d4fc430> and will run it as-is.

Cause: Unable to locate the source code of <function HvdMovingAverage.update_average.<locals>._apply_moving at 0x7fac4d4fc430>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code                          

To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert

6/26972 [..............................] - ETA: 3:59:32 - det_loss: 1.3752 - cls_loss: 0.6926 - box_loss: 0.0137 - reg_l2_loss: 0.2557 - reg_l1_loss: 0.0000e+00 - loss: 1.6308 - learning_rate: 1.0371e-04 - gradient_norm: nan
WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.5248s vs `on_train_batch_end` time: 3.8910s). Check your callbacks.
26972/26972 [==============================] - ETA: 0s - det_loss: 0.8992 - cls_loss: 0.4544 - box_loss: 0.0089 - reg_l2_loss: 0.2541 - reg_l1_loss: 0.0000e+00 - loss: 1.1534 - learning_rate: 0.0201 - gradient_norm: nanNone

26972/26972 [==============================] - 11382s 403ms/step - det_loss: 0.8992 - cls_loss: 0.4544 - box_loss: 0.0089 - reg_l2_loss: 0.2541 - reg_l1_loss: 0.0000e+00 - loss: 1.1534 - learning_rate: 0.0201 - gradient_norm: nan - val_det_loss: 0.7016 - val_cls_loss: 0.3337 - val_box_loss: 0.0074 - val_loss: 0.9463            

Epoch 2/200
2088/26972 [=>............................] - ETA: 2:40:13 - det_loss: 0.8325 - cls_loss: 0.3928 - box_loss: 0.0088 - reg_l2_loss: 0.2435 - reg_l1_loss: 0.0000e+00 - loss: 1.0760 - learning_rate: 0.0416 - gradient_norm: 1.3853
Killed
Sending telemetry data.
Telemetry data couldn't be sent, but the command ran successfully.
[Error]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL

Could you try the command below?
docker run --runtime=nvidia -it --rm --gpus all -v /home/jupyter:/workspace --shm-size=32gb nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf2.9.1 efficientdet_tf2 train -e /workspace/efficientdet/specs/train.yaml --gpus 1

Not working, same issue. I think --gpus all already uses the NVIDIA runtime.

error log:

Epoch 1/300
WARNING:tensorflow:AutoGraph could not transform <function HvdMovingAverage.update_average.<locals>._update at 0x7f705c591430> and will run it as-is.
Cause: Unable to locate the source code of <function HvdMovingAverage.update_average.<locals>._update at 0x7f705c591430>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <function HvdMovingAverage.update_average.<locals>._apply_moving at 0x7f708c5588b0> and will run it as-is.
Cause: Unable to locate the source code of <function HvdMovingAverage.update_average.<locals>._apply_moving at 0x7f708c5588b0>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <function HvdMovingAverage.update_average.<locals>._update at 0x7f7075f9ce50> and will run it as-is.
Cause: Unable to locate the source code of <function HvdMovingAverage.update_average.<locals>._update at 0x7f7075f9ce50>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <function HvdMovingAverage.update_average.<locals>._apply_moving at 0x7f7075ec1430> and will run it as-is.
Cause: Unable to locate the source code of <function HvdMovingAverage.update_average.<locals>._apply_moving at 0x7f7075ec1430>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
6/26972 [..............................] - ETA: 3:39:29 - det_loss: 1.0984 - cls_loss: 0.5159 - box_loss: 0.0117 - reg_l2_loss: 0.2557 - reg_l1_loss: 0.0000e+00 - loss: 1.3541 - learning_rate: 1.0371e-04 - gradient_norm: 10.0000
WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.4799s vs `on_train_batch_end` time: 4.1284s). Check your callbacks.
26972/26972 [==============================] - ETA: 0s - det_loss: 0.9064 - cls_loss: 0.4599 - box_loss: 0.0089 - reg_l2_loss: 0.2546 - reg_l1_loss: 0.0000e+00 - loss: 1.1610 - learning_rate: 0.0201 - gradient_norm: nan
26972/26972 [==============================] - 11935s 422ms/step - det_loss: 0.9064 - cls_loss: 0.4599 - box_loss: 0.0089 - reg_l2_loss: 0.2546 - reg_l1_loss: 0.0000e+00 - loss: 1.1611 - learning_rate: 0.0201 - gradient_norm: nan - val_det_loss: 0.6999 - val_cls_loss: 0.3439 - val_box_loss: 0.0071 - val_loss: 0.9462
Epoch 2/300
 1858/26972 [=>............................] - ETA: 2:50:18 - det_loss: 0.8381 - cls_loss: 0.3934 - box_loss: 0.0089 - reg_l2_loss: 0.2459 - reg_l1_loss: 0.0000e+00 - loss: 1.0840 - learning_rate: 0.0415 - gradient_norm: 1.3892

error log from gcp:

Jun 29 13:28:17 zhongjin-ap-od-training-nvidia-retail-od cron[1400]: /usr/sbin/sendmail: Cannot allocate memory\r\n

Please try setting a lower learning rate, for example learning_rate: 0.1.
And set a smaller clip_gradients_norm, for example 5.0.
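For reference, the corresponding edits in the train section of the spec would look like this (a sketch using the suggested example values, not tuned for this dataset):

```yaml
train:
  lr_schedule:
    name: 'cosine'
    warmup_epoch: 5
    warmup_init: 0.0001
    learning_rate: 0.1       # lowered from 0.2
  clip_gradients_norm: 5.0   # lowered from 10.0
```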

Still not working, same error. I also tried a smaller image size (416x416), and I downloaded the updated pre-trained weights from here. The issue persists.

Could you retry with a small part of the training dataset instead?
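One way to do that without touching the original data is to copy a subset of the tfrecord shards into a separate directory and point train_tfrecords at it. A rough sketch (the directory names and shard layout are hypothetical; it creates dummy shard files under /tmp so it can run anywhere):

```shell
#!/bin/sh
# Sketch: build a reduced training set from a subset of tfrecord shards.
# In the real setup TRAIN_DIR would be /workspace/efficientdet/tfrecords/train;
# here we create dummy shards so the script is self-contained.
TRAIN_DIR=/tmp/tao_train_demo
SUBSET_DIR=/tmp/tao_train_subset
rm -rf "$TRAIN_DIR" "$SUBSET_DIR"
mkdir -p "$TRAIN_DIR" "$SUBSET_DIR"
for i in 0 1 2 3 4 5 6 7 8 9; do
    touch "$TRAIN_DIR/train-0000$i-of-00010.tfrecord"
done
# Keep only the first 3 of 10 shards (~30% of the data in this dummy layout).
ls "$TRAIN_DIR" | sort | head -n 3 | while read -r f; do
    cp "$TRAIN_DIR/$f" "$SUBSET_DIR/"
done
# Then point train_tfrecords at "$SUBSET_DIR/train-*" in the spec and scale
# num_examples_per_epoch down to match the reduced set.
echo "subset shards: $(ls "$SUBSET_DIR" | wc -l)"
```

Since the shards are roughly equal-sized, this keeps the subset representative without re-generating tfrecords.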

Please change the settings to the values below.

prefetch_size: 1
shuffle_file: false

Please set a lower value for L2 regularization: l2_weight_decay: 0.000004

Please set amp: false as well.
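Putting the suggestions above together, the spec changes would be (a sketch; the values are the suggested ones and should be verified against your baseline):

```yaml
data:
  loader:
    prefetch_size: 1          # was 4
    shuffle_file: False       # was True
train:
  amp: False                  # was True
  l2_weight_decay: 0.000004   # was 0.00004
```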

I changed to using the validation set (4,391 images) as the training set, and it seems to work; it finished a few epochs. I'm curious about the dataset size limit, if you have a rough number. Is there a plan to optimize memory usage?

Could you please modify the parameters mentioned above and retry with the entire dataset? That should reduce memory usage.

I tried; still OOM. I also verified that if I reduce the training set size, I can then use multiple GPUs.

OK, when it runs successfully, how many training images and validation images are there? And what is the output of $ nvidia-smi?
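Since the "Killed" message and the sendmail "Cannot allocate memory" error point at host RAM rather than GPU memory, it may help to log both while training runs. A rough sketch (run it in a second terminal; the log path, interval, and iteration count are arbitrary examples):

```shell
#!/bin/sh
# Log host RAM (and, where available, GPU memory) alongside training.
LOG=/tmp/mem_watch.log
: > "$LOG"
for i in 1 2 3; do   # use `while true; do ...; done` for real monitoring
    date >> "$LOG"
    # `free -m` Mem: row columns: total used free shared buff/cache available
    free -m | awk '/^Mem:/ {print "host_used_mb=" $3 " host_avail_mb=" $7}' >> "$LOG"
    # Uncomment on a machine with NVIDIA drivers:
    # nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader >> "$LOG"
    sleep 1
done
cat "$LOG"
```

If host_avail_mb trends toward zero across epochs while GPU memory stays flat, the OOM killer (the "Killed" line) is the culprit, not the GPUs.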

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks

Maybe there is an issue with the training images? Are there any corrupt images or images of a different kind? I suggest using part of the training images to narrow it down.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.