Issues using Retail Object Detection as pre-trained weights

Hi, I’m running into two issues when trying to use the NVIDIA Retail Object Detection model as pre-trained weights:

  1. I’m not able to train with multiple GPUs.
  2. With --gpus 1, the first epoch completes, but training fails at the start of the second epoch, apparently from running out of memory.

Any suggestions would be greatly appreciated. Thanks!

• Machine: GCP Vertex Notebook (Debian 10 + python 3.7 + Driver Version: 510.47.03 + CUDA Version: 11.6)
• Hardware: V100 x 2
• Network Type: EfficientDet TF2
• TLT Version: nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf2.9.1
• Training spec file

data:
  loader:
    prefetch_size: 4
    shuffle_file: True
  max_instances_per_image: 100
  skip_crowd_during_training: True
  image_size: '640x640'
  num_classes: 4
  train_tfrecords:
    - '/workspace/efficientdet/tfrecords/train/train-*'
  val_tfrecords:
    - '/workspace/efficientdet/tfrecords/val/val-*'
  val_json_file: '/workspace/efficientdet/datasets/ap_od_03292023_val/labels.json'
train:
  optimizer:
    name: 'sgd'
    momentum: 0.9
  lr_schedule:
    name: 'cosine'
    warmup_epoch: 5
    warmup_init: 0.0001
    learning_rate: 0.2
  amp: True
  checkpoint: "/workspace/efficientdet/efficientdet-d5_038.tlt"
  num_examples_per_epoch: 26972
  moving_average_decay: 0.999
  batch_size: 1
  checkpoint_interval: 10
  l2_weight_decay: 0.00004
  l1_weight_decay: 0.0
  clip_gradients_norm: 10.0
  image_preview: True
  qat: False
  random_seed: 42
  pruned_model_path: ''
  num_epochs: 200
model:
  name: 'efficientdet-d5'
  aspect_ratios: '[(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)]'
  anchor_scale: 4
  min_level: 3
  max_level: 7
  num_scales: 3
  freeze_bn: False
  freeze_blocks: []
augment:
  rand_hflip: True
  random_crop_min_scale: 0.1
  random_crop_max_scale: 2
  auto_color_distortion: False
  auto_translate_xy: True
evaluate:
  batch_size: 1
  num_samples: 4391
  max_detections_per_image: 100
  model_path: ''
export:
  max_batch_size: 8
  dynamic_batch_size: True
  min_score_thresh: 0.4
  model_path: ""
  output_path: ""
inference:
  model_path: ""
  image_dir: ""
  output_dir: ""
  dump_label: False
  batch_size: 1
prune:
  model_path: ""
  normalizer: 'max'
  output_path: ""
  equalization_criterion: 'union'
  granularity: 8
  threshold: 0.5
  min_num_filters: 16
  excluded_layers: []
key: 'nvidia-tlt'
results_dir: '/workspace/efficientdet/experiment_dir_unpruned'
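As a sanity check on the spec above, here is a small sketch of how the number of steps per epoch follows from num_examples_per_epoch and batch_size. The per-GPU division is my assumption of how work is split across ranks in multi-GPU training, not something confirmed by the TAO docs:

```python
import math

# Values taken from the spec file above.
num_examples_per_epoch = 26972
batch_size = 1   # per-GPU batch size from train.batch_size
num_gpus = 2     # V100 x 2 (assumed split across ranks)

# Assumed: each rank processes num_examples_per_epoch / num_gpus examples,
# so adding GPUs shortens the epoch rather than the total work.
steps_per_epoch = math.ceil(num_examples_per_epoch / (batch_size * num_gpus))
print(steps_per_epoch)  # 13486
```

With batch_size 1 this means roughly 13.5k optimizer steps per epoch per GPU, which is why a single epoch already takes a long time on this dataset.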

• How to reproduce the issue?

  • command:
docker run -it --rm --gpus all -v /home/jupyter:/workspace --shm-size=32gb nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf2.9.1 efficientdet_tf2 train -e /workspace/efficientdet/specs/train.yaml --gpus 1
  • multi-GPU error log:
 class-3-bn-4 (BatchNormalization)  (None, 40, 40, 288)  1152   ['class-3[1][0]']
 class-3-bn-5 (BatchNormalization)  (None, 20, 20, 288)  1152   ['class-3[2][0]']
 class-3-bn-6 (BatchNormalization)  (None, 10, 10, 288)  1152   ['class-3[3][0]']
 class-3-bn-7 (BatchNormalization)  (None, 5, 5, 288)    1152   ['class-3[4][0]']
 box-3-bn-3 (BatchNormalization)    (None, 80, 80, 288)  1152   ['box-3[0][0]']
 box-3-bn-4 (BatchNormalization)    (None, 40, 40, 288)  1152   ['box-3[1][0]']
 box-3-bn-5 (BatchNormalization)    (None, 20, 20, 288)  1152   ['box-3[2][0]']
 box-3-bn-6 (BatchNormalization)    (None, 10, 10, 288)  1152   ['box-3[3][0]']
 box-3-bn-7 (BatchNormalization)    (None, 5, 5, 288)    1152   ['box-3[4][0]']
 activation_59 (Activation)         (None, 80, 80, 288)  0      ['class-3-bn-3[0][0]']
 activation_63 (Activation)         (None, 40, 40, 288)  0      ['class-3-bn-4[0][0]']
 activation_67 (Activation)         (None, 20, 20, 288)  0      ['class-3-bn-5[0][0]']
 activation_71 (Activation)         (None, 10, 10, 288)  0      ['class-3-bn-6[0][0]']
 activation_75 (Activation)         (None, 5, 5, 288)    0      ['class-3-bn-7[0][0]']
 activation_79 (Activation)         (None, 80, 80, 288)  0      ['box-3-bn-3[0][0]']
 activation_83 (Activation)         (None, 40, 40, 288)  0      ['box-3-bn-4[0][0]']
 activation_87 (Activation)         (None, 20, 20, 288)  0      ['box-3-bn-5[0][0]']
 activation_91 (Activation)         (None, 10, 10, 288)  0      ['box-3-bn-6[0][0]']
 activation_95 (Activation)         (None, 5, 5, 288)    0      ['box-3-bn-7[0][0]']
 class-predict (SeparableConv2D)    multiple             12996  ['activation_59[0][0]',
                                                                 'activation_63[0][0]',
                                                                 'activation_67[0][0]',
                                                                 'activation_71[0][0]',
                                                                 'activation_75[0][0]']
 box-predict (SeparableConv2D)      multiple             12996  ['activation_79[0][0]',
                                                                 'activation_83[0][0]',
                                                                 'activation_87[0][0]',
                                                                 'activation_91[0][0]',
                                                                 'activation_95[0][0]']
==================================================================================================
Total params: 33,657,021
Trainable params: 33,429,629
Non-trainable params: 227,392
__________________________________________________________________________________________________
LR schedule method: cosine
Use SGD optimizer
/usr/local/lib/python3.8/dist-packages/keras/backend.py:450: UserWarning: `tf.keras.backend.set_learning_phase` is deprecated and will be removed after 2020-10-11. To update it, simply pass a True/False value to the `training` argument of the `__call__` method of your layer or model.
  warnings.warn('`tf.keras.backend.set_learning_phase` is deprecated and '
WARNING:tensorflow:`period` argument is deprecated. Please use `save_freq` to specify the frequency in number of batches seen.
Epoch 1/200
WARNING:tensorflow:AutoGraph could not transform <function HvdMovingAverage.update_average.<locals>._update at 0x7f89f00b0e50> and will run it as-is.
Cause: Unable to locate the source code of <function HvdMovingAverage.update_average.<locals>._update at 0x7f89f00b0e50>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <function HvdMovingAverage.update_average.<locals>._apply_moving at 0x7f89f00b0670> and will run it as-is.
Cause: Unable to locate the source code of <function HvdMovingAverage.update_average.<locals>._apply_moving at 0x7f89f00b0670>. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
[the same AutoGraph warning repeats for `_update` and `_apply_moving` at 0x7f6534255430, 0x7f655b1dc8b0, 0x7f88be7adaf0, 0x7f88be7adee0, 0x7f655ac5ce50, and 0x7f655ab83430]
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node 30fb12a8e965 exited on signal 9 (Killed).
--------------------------------------------------------------------------
Sending telemetry data.
Telemetry data couldn't be sent, but the command ran successfully.
[Error]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL
  • epoch 2 error log:
activation_58 (Activation)     (None, 80, 80, 288)  0           ['class-2-bn-3[0][0]']                                                                     

activation_62 (Activation)     (None, 40, 40, 288)  0           ['class-2-bn-4[0][0]']                                                                       

activation_66 (Activation)     (None, 20, 20, 288)  0           ['class-2-bn-5[0][0]']                                                                    

activation_70 (Activation)     (None, 10, 10, 288)  0           ['class-2-bn-6[0][0]']                                                                       

activation_74 (Activation)     (None, 5, 5, 288)    0           ['class-2-bn-7[0][0]']                                                                       

activation_78 (Activation)     (None, 80, 80, 288)  0           ['box-2-bn-3[0][0]']                                                                         

activation_82 (Activation)     (None, 40, 40, 288)  0           ['box-2-bn-4[0][0]']                                                                         

activation_86 (Activation)     (None, 20, 20, 288)  0           ['box-2-bn-5[0][0]']                                                                         

activation_90 (Activation)     (None, 10, 10, 288)  0           ['box-2-bn-6[0][0]']                                                                         

activation_94 (Activation)     (None, 5, 5, 288)    0           ['box-2-bn-7[0][0]']                                                                         

class-3 (SeparableConv2D)      multiple             85824       ['activation_58[0][0]',                                                                    
                                                                 'activation_62[0][0]',                                                                     
                                                                 'activation_66[0][0]',                                                                    
                                                                 'activation_70[0][0]',                                                                     
                                                                 'activation_74[0][0]']                                                                     

box-3 (SeparableConv2D)        multiple             85824       ['activation_78[0][0]',                                                                    
                                                                 'activation_82[0][0]',                                                                     
                                                                 'activation_86[0][0]',                                                                     
                                                                 'activation_90[0][0]',                                                                     
                                                                 'activation_94[0][0]']                                                                     

class-3-bn-3 (BatchNormalizati  (None, 80, 80, 288)  1152       ['class-3[0][0]']                                                                           
on)                                                                                                                                                         

class-3-bn-4 (BatchNormalizati  (None, 40, 40, 288)  1152       ['class-3[1][0]']                                                                          
on)                                                                                    

class-3-bn-5 (BatchNormalizati  (None, 20, 20, 288)  1152       ['class-3[2][0]']                                                                          
on)                                                      

class-3-bn-6 (BatchNormalizati  (None, 10, 10, 288)  1152       ['class-3[3][0]']                                                                          
on)                                                                                              

class-3-bn-7 (BatchNormalizati  (None, 5, 5, 288)   1152        ['class-3[4][0]']
on)                                                                                                                          
                                                                                                           
box-3-bn-3 (BatchNormalization  (None, 80, 80, 288)  1152       ['box-3[0][0]']                                              
)                                                                                                                                                           

box-3-bn-4 (BatchNormalization  (None, 40, 40, 288)  1152       ['box-3[1][0]']                                                                            
)                                                                                                                                                           

box-3-bn-5 (BatchNormalization  (None, 20, 20, 288)  1152       ['box-3[2][0]']                                                                            
)                                                                                                                                                           

box-3-bn-6 (BatchNormalization  (None, 10, 10, 288)  1152       ['box-3[3][0]']                                                                            
)                                                                                                                                                           

box-3-bn-7 (BatchNormalization  (None, 5, 5, 288)   1152        ['box-3[4][0]']                                                                            
)                                                                                                                                                           

activation_59 (Activation)     (None, 80, 80, 288)  0           ['class-3-bn-3[0][0]']                                                                       

activation_63 (Activation)     (None, 40, 40, 288)  0           ['class-3-bn-4[0][0]']                                                                       

activation_67 (Activation)     (None, 20, 20, 288)  0           ['class-3-bn-5[0][0]']                                                                       

activation_71 (Activation)     (None, 10, 10, 288)  0           ['class-3-bn-6[0][0]']                                                                       

activation_75 (Activation)     (None, 5, 5, 288)    0           ['class-3-bn-7[0][0]']                                                                       

activation_79 (Activation)     (None, 80, 80, 288)  0           ['box-3-bn-3[0][0]']                                                                         

activation_83 (Activation)     (None, 40, 40, 288)  0           ['box-3-bn-4[0][0]']                                                                         

activation_87 (Activation)     (None, 20, 20, 288)  0           ['box-3-bn-5[0][0]']                                                                         

activation_91 (Activation)     (None, 10, 10, 288)  0           ['box-3-bn-6[0][0]']                                                                         

activation_95 (Activation)     (None, 5, 5, 288)    0           ['box-3-bn-7[0][0]']                                                                         

class-predict (SeparableConv2D  multiple            12996       ['activation_59[0][0]',                                                                    
)                                                                'activation_63[0][0]',                                                                     
                                                                 'activation_67[0][0]',                                                                     
                                                                 'activation_71[0][0]',                                                                     
                                                                 'activation_75[0][0]']                                                                    

box-predict (SeparableConv2D)  multiple             12996       ['activation_79[0][0]',                                                                     
                                                                 'activation_83[0][0]',                                                                     
                                                                 'activation_87[0][0]',                                                                     
                                                                 'activation_91[0][0]',
                                                                 'activation_95[0][0]']
                                                                                                                                                             

==================================================================================================                                                          
Total params: 33,657,021                                  
Trainable params: 33,429,629                                                                                                                                
Non-trainable params: 227,392                                                                                                                               

__________________________________________________________________________________________________
LR schedule method: cosine                                                                                                                                   

Use SGD optimizer                                                   
WARNING:tensorflow:`period` argument is deprecated. Please use `save_freq` to specify the frequency in number of batches seen.
WARNING:tensorflow:`period` argument is deprecated. Please use `save_freq` to specify the frequency in number of batches seen.

Epoch 1/200                              

WARNING:tensorflow:AutoGraph could not transform <function HvdMovingAverage.update_average.<locals>._update at 0x7fac1430e430> and will run it as-is.       
Cause: Unable to locate the source code of <function HvdMovingAverage.update_average.<locals>._update at 0x7fac1430e430>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code   

To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <function HvdMovingAverage.update_average.<locals>._apply_moving at 0x7fac4da988b0> and will run it as-is. 

Cause: Unable to locate the source code of <function HvdMovingAverage.update_average.<locals>._apply_moving at 0x7fac4da988b0>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code                          

To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <function HvdMovingAverage.update_average.<locals>._update at 0x7fac4d557e50> and will run it as-is.      
Cause: Unable to locate the source code of <function HvdMovingAverage.update_average.<locals>._update at 0x7fac4d557e50>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code  

To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <function HvdMovingAverage.update_average.<locals>._apply_moving at 0x7fac4d4fc430> and will run it as-is.

Cause: Unable to locate the source code of <function HvdMovingAverage.update_average.<locals>._apply_moving at 0x7fac4d4fc430>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code                          

To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert

6/26972 [..............................] - ETA: 3:59:32 - det_loss: 1.3752 - cls_loss: 0.6926 - box_loss: 0.0137 - reg_l2_loss: 0.2557 - reg_l1_loss: 0.0000e+00 - loss: 1.6308 - learning_rate: 1.0371e-04 - gradient_norm: nan
WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.5248s vs `on_train_batch_end` time: 3.8910s). Check your callbacks.
26972/26972 [==============================] - ETA: 0s - det_loss: 0.8992 - cls_loss: 0.4544 - box_loss: 0.0089 - reg_l2_loss: 0.2541 - reg_l1_loss: 0.0000e+00 - loss: 1.1534 - learning_rate: 0.0201 - gradient_norm: nanNone

26972/26972 [==============================] - 11382s 403ms/step - det_loss: 0.8992 - cls_loss: 0.4544 - box_loss: 0.0089 - reg_l2_loss: 0.2541 - reg_l1_loss: 0.0000e+00 - loss: 1.1534 - learning_rate: 0.0201 - gradient_norm: nan - val_det_loss: 0.7016 - val_cls_loss: 0.3337 - val_box_loss: 0.0074 - val_loss: 0.9463            

Epoch 2/200
2088/26972 [=>............................] - ETA: 2:40:13 - det_loss: 0.8325 - cls_loss: 0.3928 - box_loss: 0.0088 - reg_l2_loss: 0.2435 - reg_l1_loss: 0.0000e+00 - loss: 1.0760 - learning_rate: 0.0416 - gradient_norm: 1.3853
Killed
Sending telemetry data.
Telemetry data couldn't be sent, but the command ran successfully.
[Error]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL

Could you try the command below?
docker run --runtime=nvidia -it --rm --gpus all -v /home/jupyter:/workspace --shm-size=32gb nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf2.9.1 efficientdet_tf2 train -e /workspace/efficientdet/specs/train.yaml --gpus 1

Not working, same issue. I think --gpus all already uses the NVIDIA runtime.

error log:

Epoch 1/300
WARNING:tensorflow:AutoGraph could not transform <function HvdMovingAverage.update_average.<locals>._update at 0x7f705c591430> and will run it as-is.
Cause: Unable to locate the source code of <function HvdMovingAverage.update_average.<locals>._update at 0x7f705c591430>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <function HvdMovingAverage.update_average.<locals>._apply_moving at 0x7f708c5588b0> and will run it as-is.
Cause: Unable to locate the source code of <function HvdMovingAverage.update_average.<locals>._apply_moving at 0x7f708c5588b0>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <function HvdMovingAverage.update_average.<locals>._update at 0x7f7075f9ce50> and will run it as-is.
Cause: Unable to locate the source code of <function HvdMovingAverage.update_average.<locals>._update at 0x7f7075f9ce50>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <function HvdMovingAverage.update_average.<locals>._apply_moving at 0x7f7075ec1430> and will run it as-is.
Cause: Unable to locate the source code of <function HvdMovingAverage.update_average.<locals>._apply_moving at 0x7f7075ec1430>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
6/26972 [..............................] - ETA: 3:39:29 - det_loss: 1.0984 - cls_loss: 0.5159 - box_loss: 0.0117 - reg_l2_loss: 0.2557 - reg_l1_loss: 0.0000e+00 - loss: 1.3541 - learning_rate: 1.0371e-04 - gradient_norm: 10.0000
WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.4799s vs `on_train_batch_end` time: 4.1284s). Check your callbacks.
26972/26972 [==============================] - ETA: 0s - det_loss: 0.9064 - cls_loss: 0.4599 - box_loss: 0.0089 - reg_l2_loss: 0.2546 - reg_l1_loss: 0.0000e+00 - loss: 1.1610 - learning_rate: 0.0201 - gradient_norm: nan
26972/26972 [==============================] - 11935s 422ms/step - det_loss: 0.9064 - cls_loss: 0.4599 - box_loss: 0.0089 - reg_l2_loss: 0.2546 - reg_l1_loss: 0.0000e+00 - loss: 1.1611 - learning_rate: 0.0201 - gradient_norm: nan - val_det_loss: 0.6999 - val_cls_loss: 0.3439 - val_box_loss: 0.0071 - val_loss: 0.9462
Epoch 2/300
 1858/26972 [=>............................] - ETA: 2:50:18 - det_loss: 0.8381 - cls_loss: 0.3934 - box_loss: 0.0089 - reg_l2_loss: 0.2459 - reg_l1_loss: 0.0000e+00 - loss: 1.0840 - learning_rate: 0.0415 - gradient_norm: 1.3892

error log from gcp:

Jun 29 13:28:17 zhongjin-ap-od-training-nvidia-retail-od cron[1400]: /usr/sbin/sendmail: Cannot allocate memory\r\n

Please try setting a lower learning rate, for example learning_rate: 0.1.
And set a smaller clip_gradients_norm, for example 5.0.
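For reference, the corresponding edits in the train section of the spec would look like this (a sketch using the suggested example values, not tuned for this dataset):

```yaml
train:
  lr_schedule:
    name: 'cosine'
    warmup_epoch: 5
    warmup_init: 0.0001
    learning_rate: 0.1       # lowered from 0.2
  clip_gradients_norm: 5.0   # lowered from 10.0
```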

Still not working, same error. I also tried a smaller image size (416x416), and I downloaded the updated pre-trained weights from here. The issue persists.

Could you retry with a small part of the training dataset instead?
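One way to do that without touching the original data is to copy a subset of the tfrecord shards into a separate directory and point train_tfrecords at it. A rough sketch (the directory names and shard layout are hypothetical; it creates dummy shard files under /tmp so it can run anywhere):

```shell
#!/bin/sh
# Sketch: build a reduced training set from a subset of tfrecord shards.
# In the real setup TRAIN_DIR would be /workspace/efficientdet/tfrecords/train;
# here we create dummy shards so the script is self-contained.
TRAIN_DIR=/tmp/tao_train_demo
SUBSET_DIR=/tmp/tao_train_subset
rm -rf "$TRAIN_DIR" "$SUBSET_DIR"
mkdir -p "$TRAIN_DIR" "$SUBSET_DIR"
for i in 0 1 2 3 4 5 6 7 8 9; do
    touch "$TRAIN_DIR/train-0000$i-of-00010.tfrecord"
done
# Keep only the first 3 of 10 shards (~30% of the data in this dummy layout).
ls "$TRAIN_DIR" | sort | head -n 3 | while read -r f; do
    cp "$TRAIN_DIR/$f" "$SUBSET_DIR/"
done
# Then point train_tfrecords at "$SUBSET_DIR/train-*" in the spec and scale
# num_examples_per_epoch down to match the reduced set.
echo "subset shards: $(ls "$SUBSET_DIR" | wc -l)"
```

Since the shards are roughly equal-sized, this keeps the subset representative without re-generating tfrecords.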

Please change the settings to the values below.

prefetch_size: 1
shuffle_file: false

Please set a lower value for L2 regularization: l2_weight_decay: 0.000004

Please set amp: false as well.
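Putting the suggestions above together, the spec changes would be (a sketch; the values are the suggested ones and should be verified against your baseline):

```yaml
data:
  loader:
    prefetch_size: 1          # was 4
    shuffle_file: False       # was True
train:
  amp: False                  # was True
  l2_weight_decay: 0.000004   # was 0.00004
```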

I changed to using the validation set (4,391 images) as the training set, and it seems to work; it finished a few epochs. I'm curious about the dataset size limit, if you have a rough number. Is there a plan to optimize memory usage?

Could you please modify the parameters mentioned above and retry with the entire dataset? That should reduce memory usage.

I tried; still OOM. I also verified that if I reduce the training set size, I can then use multiple GPUs.

OK, when it runs successfully, how many training images and validation images are there? And what is the output of $ nvidia-smi?
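Since the "Killed" message and the sendmail "Cannot allocate memory" error point at host RAM rather than GPU memory, it may help to log both while training runs. A rough sketch (run it in a second terminal; the log path, interval, and iteration count are arbitrary examples):

```shell
#!/bin/sh
# Log host RAM (and, where available, GPU memory) alongside training.
LOG=/tmp/mem_watch.log
: > "$LOG"
for i in 1 2 3; do   # use `while true; do ...; done` for real monitoring
    date >> "$LOG"
    # `free -m` Mem: row columns: total used free shared buff/cache available
    free -m | awk '/^Mem:/ {print "host_used_mb=" $3 " host_avail_mb=" $7}' >> "$LOG"
    # Uncomment on a machine with NVIDIA drivers:
    # nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader >> "$LOG"
    sleep 1
done
cat "$LOG"
```

If host_avail_mb trends toward zero across epochs while GPU memory stays flat, the OOM killer (the "Killed" line) is the culprit, not the GPUs.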

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks

Maybe there is an issue with the training images? Are there any corrupt images or images of a different kind? I suggest using part of the training images to narrow it down.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.