TLT - retrain TrafficCamNet with customized data: precision is 0

Hi experts,

The training kicks off, but the precision I get is always 0. Can someone help me with this? I am really new to this field.

Thanks,
Kai
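
(Side note on the Docker warning at the top of the log below: the message suggests adding a "user" entry to the DockerOptions portion of ~/.tlt_mounts.json to keep host file permissions. A minimal sketch of that file is shown here; the mount paths are placeholders for my setup, and the UID/GID values should come from `id -u` and `id -g`:)

```json
{
    "Mounts": [
        {
            "source": "/home/kai/tlt-experiments",
            "destination": "/workspace/tlt-experiments"
        }
    ],
    "DockerOptions": {
        "user": "1000:1000"
    }
}
```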

2021-06-15 20:40:56,001 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the ~/.tlt_mounts.json file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/cost_function/cost_auto_weight_hook.py:43: The name tf.train.SessionRunHook is deprecated. Please use tf.estimator.SessionRunHook instead.

WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/checkpoint_saver_hook.py:25: The name tf.train.CheckpointSaverHook is deprecated. Please use tf.estimator.CheckpointSaverHook instead.

WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py:67: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py:67: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

2021-06-15 12:41:04,768 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:117: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

2021-06-15 12:41:04,768 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:143: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

2021-06-15 12:41:05,442 [INFO] __main__: Loading experiment spec at /workspace/tlt-experiments/detectnet_v2/specs/detectnet_v2_retrain_trafficcamnet_car_kitti.txt.
2021-06-15 12:41:05,444 [INFO] iva.detectnet_v2.spec_handler.spec_loader: Merging specification from /workspace/tlt-experiments/detectnet_v2/specs/detectnet_v2_retrain_trafficcamnet_car_kitti.txt
2021-06-15 12:41:05,600 [WARNING] tensorflow: From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/cost_function/cost_auto_weight_hook.py:107: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

2021-06-15 12:41:05,601 [WARNING] tensorflow: From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/cost_function/cost_auto_weight_hook.py:110: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

2021-06-15 12:41:05,604 [WARNING] tensorflow: From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/cost_function/cost_auto_weight_hook.py:113: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.

2021-06-15 12:41:05,638 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

2021-06-15 12:41:05,640 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

2021-06-15 12:41:06,641 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

2021-06-15 12:41:07,900 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

2021-06-15 12:41:07,900 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

2021-06-15 12:41:08,207 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

/usr/local/lib/python3.6/dist-packages/keras/engine/saving.py:292: UserWarning: No training configuration found in save file: the model was *not* compiled. Compile it manually.
  warnings.warn('No training configuration found in save file: '
2021-06-15 12:41:12,277 [WARNING] tensorflow: From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/objectives/bbox_objective.py:61: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

2021-06-15 12:41:12,277 [INFO] tensorflow: DriveNet default L1 loss function will be used.
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            (None, 3, 544, 960)  0                                            
__________________________________________________________________________________________________
conv1 (Conv2D)                  (None, 64, 272, 480) 9472        input_1[0][0]                    
__________________________________________________________________________________________________
activation_1 (Activation)       (None, 64, 272, 480) 0           conv1[0][0]                      
__________________________________________________________________________________________________
block_1a_conv_1 (Conv2D)        (None, 64, 136, 240) 36928       activation_1[0][0]               
__________________________________________________________________________________________________
block_1a_relu_1 (Activation)    (None, 64, 136, 240) 0           block_1a_conv_1[0][0]            
__________________________________________________________________________________________________
block_1a_conv_2 (Conv2D)        (None, 64, 136, 240) 36928       block_1a_relu_1[0][0]            
__________________________________________________________________________________________________
block_1a_conv_shortcut (Conv2D) (None, 64, 136, 240) 4160        activation_1[0][0]               
__________________________________________________________________________________________________
add_1 (Add)                     (None, 64, 136, 240) 0           block_1a_conv_2[0][0]            
                                                                 block_1a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
block_1a_relu (Activation)      (None, 64, 136, 240) 0           add_1[0][0]                      
__________________________________________________________________________________________________
block_1b_conv_1 (Conv2D)        (None, 64, 136, 240) 36928       block_1a_relu[0][0]              
__________________________________________________________________________________________________
block_1b_relu_1 (Activation)    (None, 64, 136, 240) 0           block_1b_conv_1[0][0]            
__________________________________________________________________________________________________
block_1b_conv_2 (Conv2D)        (None, 64, 136, 240) 36928       block_1b_relu_1[0][0]            
__________________________________________________________________________________________________
block_1b_conv_shortcut (Conv2D) (None, 64, 136, 240) 4160        block_1a_relu[0][0]              
__________________________________________________________________________________________________
add_2 (Add)                     (None, 64, 136, 240) 0           block_1b_conv_2[0][0]            
                                                                 block_1b_conv_shortcut[0][0]     
__________________________________________________________________________________________________
block_1b_relu (Activation)      (None, 64, 136, 240) 0           add_2[0][0]                      
__________________________________________________________________________________________________
block_2a_conv_1 (Conv2D)        (None, 128, 68, 120) 73856       block_1b_relu[0][0]              
__________________________________________________________________________________________________
block_2a_relu_1 (Activation)    (None, 128, 68, 120) 0           block_2a_conv_1[0][0]            
__________________________________________________________________________________________________
block_2a_conv_2 (Conv2D)        (None, 128, 68, 120) 147584      block_2a_relu_1[0][0]            
__________________________________________________________________________________________________
block_2a_conv_shortcut (Conv2D) (None, 128, 68, 120) 8320        block_1b_relu[0][0]              
__________________________________________________________________________________________________
add_3 (Add)                     (None, 128, 68, 120) 0           block_2a_conv_2[0][0]            
                                                                 block_2a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
block_2a_relu (Activation)      (None, 128, 68, 120) 0           add_3[0][0]                      
__________________________________________________________________________________________________
block_2b_conv_1 (Conv2D)        (None, 128, 68, 120) 147584      block_2a_relu[0][0]              
__________________________________________________________________________________________________
block_2b_relu_1 (Activation)    (None, 128, 68, 120) 0           block_2b_conv_1[0][0]            
__________________________________________________________________________________________________
block_2b_conv_2 (Conv2D)        (None, 128, 68, 120) 147584      block_2b_relu_1[0][0]            
__________________________________________________________________________________________________
block_2b_conv_shortcut (Conv2D) (None, 128, 68, 120) 16512       block_2a_relu[0][0]              
__________________________________________________________________________________________________
add_4 (Add)                     (None, 128, 68, 120) 0           block_2b_conv_2[0][0]            
                                                                 block_2b_conv_shortcut[0][0]     
__________________________________________________________________________________________________
block_2b_relu (Activation)      (None, 128, 68, 120) 0           add_4[0][0]                      
__________________________________________________________________________________________________
block_3a_conv_1 (Conv2D)        (None, 256, 34, 60)  295168      block_2b_relu[0][0]              
__________________________________________________________________________________________________
block_3a_relu_1 (Activation)    (None, 256, 34, 60)  0           block_3a_conv_1[0][0]            
__________________________________________________________________________________________________
block_3a_conv_2 (Conv2D)        (None, 256, 34, 60)  590080      block_3a_relu_1[0][0]            
__________________________________________________________________________________________________
block_3a_conv_shortcut (Conv2D) (None, 256, 34, 60)  33024       block_2b_relu[0][0]              
__________________________________________________________________________________________________
add_5 (Add)                     (None, 256, 34, 60)  0           block_3a_conv_2[0][0]            
                                                                 block_3a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
block_3a_relu (Activation)      (None, 256, 34, 60)  0           add_5[0][0]                      
__________________________________________________________________________________________________
block_3b_conv_1 (Conv2D)        (None, 256, 34, 60)  590080      block_3a_relu[0][0]              
__________________________________________________________________________________________________
block_3b_relu_1 (Activation)    (None, 256, 34, 60)  0           block_3b_conv_1[0][0]            
__________________________________________________________________________________________________
block_3b_conv_2 (Conv2D)        (None, 256, 34, 60)  590080      block_3b_relu_1[0][0]            
__________________________________________________________________________________________________
block_3b_conv_shortcut (Conv2D) (None, 256, 34, 60)  65792       block_3a_relu[0][0]              
__________________________________________________________________________________________________
add_6 (Add)                     (None, 256, 34, 60)  0           block_3b_conv_2[0][0]            
                                                                 block_3b_conv_shortcut[0][0]     
__________________________________________________________________________________________________
block_3b_relu (Activation)      (None, 256, 34, 60)  0           add_6[0][0]                      
__________________________________________________________________________________________________
block_4a_conv_1 (Conv2D)        (None, 512, 34, 60)  1180160     block_3b_relu[0][0]              
__________________________________________________________________________________________________
block_4a_relu_1 (Activation)    (None, 512, 34, 60)  0           block_4a_conv_1[0][0]            
__________________________________________________________________________________________________
block_4a_conv_2 (Conv2D)        (None, 512, 34, 60)  2359808     block_4a_relu_1[0][0]            
__________________________________________________________________________________________________
block_4a_conv_shortcut (Conv2D) (None, 512, 34, 60)  131584      block_3b_relu[0][0]              
__________________________________________________________________________________________________
add_7 (Add)                     (None, 512, 34, 60)  0           block_4a_conv_2[0][0]            
                                                                 block_4a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
block_4a_relu (Activation)      (None, 512, 34, 60)  0           add_7[0][0]                      
__________________________________________________________________________________________________
block_4b_conv_1 (Conv2D)        (None, 512, 34, 60)  2359808     block_4a_relu[0][0]              
__________________________________________________________________________________________________
block_4b_relu_1 (Activation)    (None, 512, 34, 60)  0           block_4b_conv_1[0][0]            
__________________________________________________________________________________________________
block_4b_conv_2 (Conv2D)        (None, 512, 34, 60)  2359808     block_4b_relu_1[0][0]            
__________________________________________________________________________________________________
block_4b_conv_shortcut (Conv2D) (None, 512, 34, 60)  262656      block_4a_relu[0][0]              
__________________________________________________________________________________________________
add_8 (Add)                     (None, 512, 34, 60)  0           block_4b_conv_2[0][0]            
                                                                 block_4b_conv_shortcut[0][0]     
__________________________________________________________________________________________________
block_4b_relu (Activation)      (None, 512, 34, 60)  0           add_8[0][0]                      
__________________________________________________________________________________________________
output_bbox (Conv2D)            (None, 4, 34, 60)    2052        block_4b_relu[0][0]              
__________________________________________________________________________________________________
output_cov (Conv2D)             (None, 1, 34, 60)    513         block_4b_relu[0][0]              
==================================================================================================
Total params: 11,527,557
Trainable params: 11,527,557
Non-trainable params: 0
__________________________________________________________________________________________________
2021-06-15 12:41:12,324 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2021-06-15 12:41:12,324 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2021-06-15 12:41:12,324 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2021-06-15 12:41:12,325 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 4, io threads: 8, compute threads: 4, buffered batches: 4
2021-06-15 12:41:12,325 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset size 188, number of sources: 1, batch size per gpu: 4, steps: 47
2021-06-15 12:41:12,370 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

2021-06-15 12:41:12,413 [WARNING] tensorflow: Entity <bound method DriveNetTFRecordsParser.__call__ of <iva.detectnet_v2.dataloader.drivenet_dataloader.DriveNetTFRecordsParser object at 0x7fe565e4c748>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method DriveNetTFRecordsParser.__call__ of <iva.detectnet_v2.dataloader.drivenet_dataloader.DriveNetTFRecordsParser object at 0x7fe565e4c748>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2021-06-15 12:41:12,441 [INFO] iva.detectnet_v2.dataloader.default_dataloader: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
2021-06-15 12:41:12,793 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: shuffle: True - shard 0 of 1
2021-06-15 12:41:12,801 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: sampling 1 datasets with weights:
2021-06-15 12:41:12,801 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: source: 0 weight: 1.000000
2021-06-15 12:41:12,821 [WARNING] tensorflow: Entity <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7fe54c227a90>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7fe54c227a90>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2021-06-15 12:41:13,516 [INFO] __main__: Found 188 samples in training set
2021-06-15 12:41:13,669 [WARNING] tensorflow: From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/rasterizers/bbox_rasterizer.py:347: The name tf.bincount is deprecated. Please use tf.math.bincount instead.

2021-06-15 12:41:13,828 [WARNING] tensorflow: From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/training/training_proto_utilities.py:89: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

2021-06-15 12:41:13,851 [WARNING] tensorflow: From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/training/training_proto_utilities.py:36: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

2021-06-15 12:41:13,956 [WARNING] tensorflow: From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/cost_function/cost_functions.py:17: The name tf.log is deprecated. Please use tf.math.log instead.

2021-06-15 12:41:13,970 [WARNING] tensorflow: From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/cost_function/cost_auto_weight_hook.py:235: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.

2021-06-15 12:41:13,975 [WARNING] tensorflow: From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/model/detectnet_model.py:574: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.

2021-06-15 12:41:15,597 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2021-06-15 12:41:15,597 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2021-06-15 12:41:15,597 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2021-06-15 12:41:15,597 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 4, io threads: 8, compute threads: 4, buffered batches: 4
2021-06-15 12:41:15,598 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset size 30, number of sources: 1, batch size per gpu: 4, steps: 8
2021-06-15 12:41:15,612 [WARNING] tensorflow: Entity <bound method DriveNetTFRecordsParser.__call__ of <iva.detectnet_v2.dataloader.drivenet_dataloader.DriveNetTFRecordsParser object at 0x7fe565e4c7f0>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method DriveNetTFRecordsParser.__call__ of <iva.detectnet_v2.dataloader.drivenet_dataloader.DriveNetTFRecordsParser object at 0x7fe565e4c7f0>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2021-06-15 12:41:15,639 [INFO] iva.detectnet_v2.dataloader.default_dataloader: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
2021-06-15 12:41:15,981 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: shuffle: False - shard 0 of 1
2021-06-15 12:41:16,139 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: sampling 1 datasets with weights:
2021-06-15 12:41:16,139 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: source: 0 weight: 1.000000
WARNING:tensorflow:Entity <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7fe54c64f550>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7fe54c64f550>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2021-06-15 12:41:16,159 [WARNING] tensorflow: Entity <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7fe54c64f550>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7fe54c64f550>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2021-06-15 12:41:16,667 [INFO] __main__: Found 30 samples in validation set
WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/validation_hook.py:40: The name tf.summary.FileWriterCache is deprecated. Please use tf.compat.v1.summary.FileWriterCache instead.

2021-06-15 12:41:17,127 [WARNING] tensorflow: From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/validation_hook.py:40: The name tf.summary.FileWriterCache is deprecated. Please use tf.compat.v1.summary.FileWriterCache instead.

WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py:105: The name tf.train.Scaffold is deprecated. Please use tf.compat.v1.train.Scaffold instead.

2021-06-15 12:41:18,224 [WARNING] tensorflow: From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py:105: The name tf.train.Scaffold is deprecated. Please use tf.compat.v1.train.Scaffold instead.

WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/common/graph/initializers.py:14: The name tf.local_variables_initializer is deprecated. Please use tf.compat.v1.local_variables_initializer instead.

2021-06-15 12:41:18,224 [WARNING] tensorflow: From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/common/graph/initializers.py:14: The name tf.local_variables_initializer is deprecated. Please use tf.compat.v1.local_variables_initializer instead.

WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/common/graph/initializers.py:15: The name tf.tables_initializer is deprecated. Please use tf.compat.v1.tables_initializer instead.

2021-06-15 12:41:18,225 [WARNING] tensorflow: From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/common/graph/initializers.py:15: The name tf.tables_initializer is deprecated. Please use tf.compat.v1.tables_initializer instead.

WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/common/graph/initializers.py:16: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

2021-06-15 12:41:18,226 [WARNING] tensorflow: From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/common/graph/initializers.py:16: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/utils.py:59: The name tf.train.LoggingTensorHook is deprecated. Please use tf.estimator.LoggingTensorHook instead.

2021-06-15 12:41:18,229 [WARNING] tensorflow: From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/utils.py:59: The name tf.train.LoggingTensorHook is deprecated. Please use tf.estimator.LoggingTensorHook instead.

WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/utils.py:60: The name tf.train.StopAtStepHook is deprecated. Please use tf.estimator.StopAtStepHook instead.

2021-06-15 12:41:18,229 [WARNING] tensorflow: From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/utils.py:60: The name tf.train.StopAtStepHook is deprecated. Please use tf.estimator.StopAtStepHook instead.

WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/utils.py:74: The name tf.train.StepCounterHook is deprecated. Please use tf.estimator.StepCounterHook instead.

2021-06-15 12:41:18,230 [WARNING] tensorflow: From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/utils.py:74: The name tf.train.StepCounterHook is deprecated. Please use tf.estimator.StepCounterHook instead.

INFO:tensorflow:Create CheckpointSaverHook.
2021-06-15 12:41:18,230 [INFO] tensorflow: Create CheckpointSaverHook.
WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/utils.py:100: The name tf.train.SummarySaverHook is deprecated. Please use tf.estimator.SummarySaverHook instead.

2021-06-15 12:41:18,230 [WARNING] tensorflow: From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/utils.py:100: The name tf.train.SummarySaverHook is deprecated. Please use tf.estimator.SummarySaverHook instead.

WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/training/utilities.py:140: The name tf.train.SingularMonitoredSession is deprecated. Please use tf.compat.v1.train.SingularMonitoredSession instead.

2021-06-15 12:41:18,231 [WARNING] tensorflow: From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/training/utilities.py:140: The name tf.train.SingularMonitoredSession is deprecated. Please use tf.compat.v1.train.SingularMonitoredSession instead.

INFO:tensorflow:Graph was finalized.
2021-06-15 12:41:18,996 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2021-06-15 12:41:20,726 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2021-06-15 12:41:21,486 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2021-06-15 12:41:27,227 [INFO] tensorflow: Saving checkpoints for step-0.
INFO:tensorflow:epoch = 0.0, loss = 0.06009046, step = 0
2021-06-15 12:41:47,496 [INFO] tensorflow: epoch = 0.0, loss = 0.06009046, step = 0
2021-06-15 12:41:47,499 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 0/120: loss: 0.06009 Time taken: 0:00:00 ETA: 0:00:00
2021-06-15 12:41:47,499 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 0.535
INFO:tensorflow:global_step/sec: 0.819422
2021-06-15 12:41:52,379 [INFO] tensorflow: global_step/sec: 0.819422
INFO:tensorflow:epoch = 0.1276595744680851, loss = 0.060291, step = 6 (7.296 sec)
2021-06-15 12:41:54,792 [INFO] tensorflow: epoch = 0.1276595744680851, loss = 0.060291, step = 6 (7.296 sec)
INFO:tensorflow:global_step/sec: 1.23529
2021-06-15 12:41:55,617 [INFO] tensorflow: global_step/sec: 1.23529
INFO:tensorflow:global_step/sec: 2.31361
2021-06-15 12:41:57,346 [INFO] tensorflow: global_step/sec: 2.31361
INFO:tensorflow:global_step/sec: 2.4405
2021-06-15 12:41:58,985 [INFO] tensorflow: global_step/sec: 2.4405
INFO:tensorflow:epoch = 0.40425531914893614, loss = 0.06011029, step = 19 (5.486 sec)
2021-06-15 12:42:00,278 [INFO] tensorflow: epoch = 0.40425531914893614, loss = 0.06011029, step = 19 (5.486 sec)
INFO:tensorflow:global_step/sec: 2.23885
2021-06-15 12:42:00,772 [INFO] tensorflow: global_step/sec: 2.23885
INFO:tensorflow:global_step/sec: 2.36717
2021-06-15 12:42:02,462 [INFO] tensorflow: global_step/sec: 2.36717
2021-06-15 12:42:02,465 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 4.457
INFO:tensorflow:global_step/sec: 2.39177
2021-06-15 12:42:04,134 [INFO] tensorflow: global_step/sec: 2.39177
INFO:tensorflow:epoch = 0.6808510638297872, loss = 0.059858073, step = 32 (5.612 sec)
2021-06-15 12:42:05,890 [INFO] tensorflow: epoch = 0.6808510638297872, loss = 0.059858073, step = 32 (5.612 sec)
INFO:tensorflow:global_step/sec: 2.27592
2021-06-15 12:42:05,892 [INFO] tensorflow: global_step/sec: 2.27592
INFO:tensorflow:global_step/sec: 2.71606
2021-06-15 12:42:07,364 [INFO] tensorflow: global_step/sec: 2.71606
INFO:tensorflow:global_step/sec: 2.53659
2021-06-15 12:42:08,941 [INFO] tensorflow: global_step/sec: 2.53659
INFO:tensorflow:global_step/sec: 2.4402
2021-06-15 12:42:10,581 [INFO] tensorflow: global_step/sec: 2.4402
INFO:tensorflow:epoch = 0.9787234042553191, loss = 0.060073316, step = 46 (5.586 sec)
2021-06-15 12:42:11,476 [INFO] tensorflow: epoch = 0.9787234042553191, loss = 0.060073316, step = 46 (5.586 sec)
20c71d84dbaa:40:52 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.3<0>
20c71d84dbaa:40:52 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
20c71d84dbaa:40:52 [0] NCCL INFO NET/IB : No device found.
20c71d84dbaa:40:52 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.3<0>
20c71d84dbaa:40:52 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
20c71d84dbaa:40:52 [0] NCCL INFO Channel 00/32 :    0
20c71d84dbaa:40:52 [0] NCCL INFO Channel 01/32 :    0
20c71d84dbaa:40:52 [0] NCCL INFO Channel 02/32 :    0
20c71d84dbaa:40:52 [0] NCCL INFO Channel 03/32 :    0
20c71d84dbaa:40:52 [0] NCCL INFO Channel 04/32 :    0
20c71d84dbaa:40:52 [0] NCCL INFO Channel 05/32 :    0
20c71d84dbaa:40:52 [0] NCCL INFO Channel 06/32 :    0
20c71d84dbaa:40:52 [0] NCCL INFO Channel 07/32 :    0
20c71d84dbaa:40:52 [0] NCCL INFO Channel 08/32 :    0
20c71d84dbaa:40:52 [0] NCCL INFO Channel 09/32 :    0
20c71d84dbaa:40:52 [0] NCCL INFO Channel 10/32 :    0
20c71d84dbaa:40:52 [0] NCCL INFO Channel 11/32 :    0
20c71d84dbaa:40:52 [0] NCCL INFO Channel 12/32 :    0
20c71d84dbaa:40:52 [0] NCCL INFO Channel 13/32 :    0
20c71d84dbaa:40:52 [0] NCCL INFO Channel 14/32 :    0
20c71d84dbaa:40:52 [0] NCCL INFO Channel 15/32 :    0
20c71d84dbaa:40:52 [0] NCCL INFO Channel 16/32 :    0
20c71d84dbaa:40:52 [0] NCCL INFO Channel 17/32 :    0
20c71d84dbaa:40:52 [0] NCCL INFO Channel 18/32 :    0
20c71d84dbaa:40:52 [0] NCCL INFO Channel 19/32 :    0
20c71d84dbaa:40:52 [0] NCCL INFO Channel 20/32 :    0
20c71d84dbaa:40:52 [0] NCCL INFO Channel 21/32 :    0
20c71d84dbaa:40:52 [0] NCCL INFO Channel 22/32 :    0
20c71d84dbaa:40:52 [0] NCCL INFO Channel 23/32 :    0
20c71d84dbaa:40:52 [0] NCCL INFO Channel 24/32 :    0
20c71d84dbaa:40:52 [0] NCCL INFO Channel 25/32 :    0
20c71d84dbaa:40:52 [0] NCCL INFO Channel 26/32 :    0
20c71d84dbaa:40:52 [0] NCCL INFO Channel 27/32 :    0
20c71d84dbaa:40:52 [0] NCCL INFO Channel 28/32 :    0
20c71d84dbaa:40:52 [0] NCCL INFO Channel 29/32 :    0
20c71d84dbaa:40:52 [0] NCCL INFO Channel 30/32 :    0
20c71d84dbaa:40:52 [0] NCCL INFO Channel 31/32 :    0
20c71d84dbaa:40:52 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [1] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [2] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [3] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [4] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [5] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [6] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [7] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [8] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [9] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [10] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [11] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [12] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [13] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [14] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [15] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [16] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [17] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [18] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [19] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [20] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [21] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [22] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [23] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [24] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [25] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [26] -1/-1/-1->0->-1|-1->0->-1/-
20c71d84dbaa:40:52 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
20c71d84dbaa:40:52 [0] NCCL INFO comm 0x7fe55c37efb0 rank 0 nranks 1 cudaDev 0 busId 80 - Init COMPLETE
2021-06-15 12:42:11,890 [INFO] iva.detectnet_v2.evaluation.evaluation: step 0 / 7, 0.00s/step
Epoch 1/120
=========================

Validation cost: 0.000956
Mean average_precision (in %): 0.0000

class name      average precision (in %)
------------  --------------------------
car                                    0

Median Inference Time: 0.061750
INFO:tensorflow:epoch = 1.0, loss = 0.0009655007, step = 47 (9.809 sec)
2021-06-15 12:42:21,285 [INFO] tensorflow: epoch = 1.0, loss = 0.0009655007, step = 47 (9.809 sec)
2021-06-15 12:42:21,286 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 1/120: loss: 0.00097 Time taken: 0:00:40.904097 ETA: 1:21:07.587496
INFO:tensorflow:global_step/sec: 0.358527
2021-06-15 12:42:21,737 [INFO] tensorflow: global_step/sec: 0.358527
2021-06-15 12:42:22,145 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 5.081
INFO:tensorflow:global_step/sec: 2.5942
2021-06-15 12:42:23,279 [INFO] tensorflow: global_step/sec: 2.5942
INFO:tensorflow:global_step/sec: 2.74829
2021-06-15 12:42:24,735 [INFO] tensorflow: global_step/sec: 2.74829
INFO:tensorflow:global_step/sec: 2.57983
2021-06-15 12:42:26,285 [INFO] tensorflow: global_step/sec: 2.57983
INFO:tensorflow:epoch = 1.297872340425532, loss = 0.0010678559, step = 61 (5.399 sec)
2021-06-15 12:42:26,684 [INFO] tensorflow: epoch = 1.297872340425532, loss = 0.0010678559, step = 61 (5.399 sec)
INFO:tensorflow:global_step/sec: 2.56197
2021-06-15 12:42:27,846 [INFO] tensorflow: global_step/sec: 2.56197
INFO:tensorflow:global_step/sec: 2.22231
2021-06-15 12:42:29,646 [INFO] tensorflow: global_step/sec: 2.22231
INFO:tensorflow:global_step/sec: 2.41353
2021-06-15 12:42:31,304 [INFO] tensorflow: global_step/sec: 2.41353
INFO:tensorflow:epoch = 1.574468085106383, loss = 0.000981944, step = 74 (5.451 sec)
2021-06-15 12:42:32,135 [INFO] tensorflow: epoch = 1.574468085106383, loss = 0.000981944, step = 74 (5.451 sec)
2021-06-15 12:42:32,135 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 10.010
INFO:tensorflow:global_step/sec: 2.5232
2021-06-15 12:42:32,889 [INFO] tensorflow: global_step/sec: 2.5232
INFO:tensorflow:global_step/sec: 2.41112
2021-06-15 12:42:34,548 [INFO] tensorflow: global_step/sec: 2.41112
INFO:tensorflow:global_step/sec: 2.4484
2021-06-15 12:42:36,182 [INFO] tensorflow: global_step/sec: 2.4484
INFO:tensorflow:epoch = 1.872340425531915, loss = 0.0012640116, step = 88 (5.691 sec)
2021-06-15 12:42:37,826 [INFO] tensorflow: epoch = 1.872340425531915, loss = 0.0012640116, step = 88 (5.691 sec)
INFO:tensorflow:global_step/sec: 2.4312
2021-06-15 12:42:37,827 [INFO] tensorflow: global_step/sec: 2.4312
INFO:tensorflow:global_step/sec: 2.53872
2021-06-15 12:42:39,403 [INFO] tensorflow: global_step/sec: 2.53872
2021-06-15 12:42:40,132 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 2/120: loss: 0.00124 Time taken: 0:00:18.815971 ETA: 0:37:00.284538
INFO:tensorflow:global_step/sec: 2.68034
2021-06-15 12:42:40,895 [INFO] tensorflow: global_step/sec: 2.68034
2021-06-15 12:42:42,141 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 9.994
INFO:tensorflow:global_step/sec: 2.51051
2021-06-15 12:42:42,488 [INFO] tensorflow: global_step/sec: 2.51051
INFO:tensorflow:epoch = 2.1702127659574466, loss = 0.0015662777, step = 102 (5.410 sec)
2021-06-15 12:42:43,235 [INFO] tensorflow: epoch = 2.1702127659574466, loss = 0.0015662777, step = 102 (5.410 sec)
INFO:tensorflow:global_step/sec: 2.59284
2021-06-15 12:42:44,031 [INFO] tensorflow: global_step/sec: 2.59284
INFO:tensorflow:global_step/sec: 2.59594
2021-06-15 12:42:45,572 [INFO] tensorflow: global_step/sec: 2.59594
INFO:tensorflow:global_step/sec: 2.30278
2021-06-15 12:42:47,309 [INFO] tensorflow: global_step/sec: 2.30278
INFO:tensorflow:epoch = 2.4680851063829787, loss = 0.000651872, step = 116 (5.758 sec)
2021-06-15 12:42:48,993 [INFO] tensorflow: epoch = 2.4680851063829787, loss = 0.000651872, step = 116 (5.758 sec)
INFO:tensorflow:global_step/sec: 2.37283
2021-06-15 12:42:48,994 [INFO] tensorflow: global_step/sec: 2.37283
INFO:tensorflow:global_step/sec: 2.45762
2021-06-15 12:42:50,622 [INFO] tensorflow: global_step/sec: 2.45762
INFO:tensorflow:global_step/sec: 2.40992
2021-06-15 12:42:52,282 [INFO] tensorflow: global_step/sec: 2.40992
2021-06-15 12:42:52,283 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 9.860
INFO:tensorflow:global_step/sec: 2.53232
2021-06-15 12:42:53,861 [INFO] tensorflow: global_step/sec: 2.53232
INFO:tensorflow:epoch = 2.7659574468085104, loss = 0.00043913763, step = 130 (5.573 sec)
2021-06-15 12:42:54,567 [INFO] tensorflow: epoch = 2.7659574468085104, loss = 0.00043913763, step = 130 (5.573 sec)
INFO:tensorflow:global_step/sec: 2.67132
2021-06-15 12:42:55,359 [INFO] tensorflow: global_step/sec: 2.67132
INFO:tensorflow:global_step/sec: 2.4208
2021-06-15 12:42:57,011 [INFO] tensorflow: global_step/sec: 2.4208
INFO:tensorflow:global_step/sec: 2.75674
2021-06-15 12:42:58,462 [INFO] tensorflow: global_step/sec: 2.75674
2021-06-15 12:42:58,878 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 3/120: loss: 0.00096 Time taken: 0:00:18.719233 ETA: 0:36:30.150265
INFO:tensorflow:epoch = 3.0638297872340425, loss = 0.00080436235, step = 144 (5.759 sec)
2021-06-15 12:43:00,325 [INFO] tensorflow: epoch = 3.0638297872340425, loss = 0.00080436235, step = 144 (5.759 sec)
INFO:tensorflow:global_step/sec: 2.14549
2021-06-15 12:43:00,327 [INFO] tensorflow: global_step/sec: 2.14549
INFO:tensorflow:global_step/sec: 2.64888
2021-06-15 12:43:01,837 [INFO] tensorflow: global_step/sec: 2.64888
2021-06-15 12:43:02,252 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 10.031
INFO:tensorflow:global_step/sec: 2.31762
2021-06-15 12:43:03,563 [INFO] tensorflow: global_step/sec: 2.31762
INFO:tensorflow:global_step/sec: 2.56349
2021-06-15 12:43:05,123 [INFO] tensorflow: global_step/sec: 2.56349
INFO:tensorflow:epoch = 3.361702127659574, loss = 0.00094284175, step = 158 (5.568 sec)
2021-06-15 12:43:05,893 [INFO] tensorflow: epoch = 3.361702127659574, loss = 0.00094284175, step = 158 (5.568 sec)
INFO:tensorflow:global_step/sec: 2.60153
2021-06-15 12:43:06,661 [INFO] tensorflow: global_step/sec: 2.60153
INFO:tensorflow:global_step/sec: 2.4485
2021-06-15 12:43:08,294 [INFO] tensorflow: global_step/sec: 2.4485
INFO:tensorflow:global_step/sec: 2.49224
2021-06-15 12:43:09,899 [INFO] tensorflow: global_step/sec: 2.49224
INFO:tensorflow:epoch = 3.6595744680851063, loss = 0.0015221415, step = 172 (5.468 sec)
2021-06-15 12:43:11,362 [INFO] tensorflow: epoch = 3.6595744680851063, loss = 0.0015221415, step = 172 (5.468 sec)
INFO:tensorflow:global_step/sec: 2.73267
2021-06-15 12:43:11,363 [INFO] tensorflow: global_step/sec: 2.73267
2021-06-15 12:43:12,095 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 10.160
INFO:tensorflow:global_step/sec: 2.60963
2021-06-15 12:43:12,896 [INFO] tensorflow: global_step/sec: 2.60963
INFO:tensorflow:global_step/sec: 2.61323
2021-06-15 12:43:14,426 [INFO] tensorflow: global_step/sec: 2.61323
INFO:tensorflow:global_step/sec: 2.44023
2021-06-15 12:43:16,066 [INFO] tensorflow: global_step/sec: 2.44023
INFO:tensorflow:epoch = 3.957446808510638, loss = 0.0007857586, step = 186 (5.499 sec)
2021-06-15 12:43:16,861 [INFO] tensorflow: epoch = 3.957446808510638, loss = 0.0007857586, step = 186 (5.499 sec)
INFO:tensorflow:global_step/sec: 2.48045
2021-06-15 12:43:17,678 [INFO] tensorflow: global_step/sec: 2.48045
2021-06-15 12:43:17,680 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 4/120: loss: 0.00131 Time taken: 0:00:18.802958 ETA: 0:36:21.143074
INFO:tensorflow:global_step/sec: 2.394
2021-06-15 12:43:19,349 [INFO] tensorflow: global_step/sec: 2.394
INFO:tensorflow:global_step/sec: 2.37207
2021-06-15 12:43:21,035 [INFO] tensorflow: global_step/sec: 2.37207
2021-06-15 12:43:22,289 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 9.810
INFO:tensorflow:epoch = 4.25531914893617, loss = 0.0013742528, step = 200 (5.792 sec)
2021-06-15 12:43:22,653 [INFO] tensorflow: epoch = 4.25531914893617, loss = 0.0013742528, step = 200 (5.792 sec)
INFO:tensorflow:global_step/sec: 2.47044
2021-06-15 12:43:22,654 [INFO] tensorflow: global_step/sec: 2.47044
INFO:tensorflow:global_step/sec: 2.37571
2021-06-15 12:43:24,338 [INFO] tensorflow: global_step/sec: 2.37571
INFO:tensorflow:global_step/sec: 2.25377
2021-06-15 12:43:26,113 [INFO] tensorflow: global_step/sec: 2.25377
INFO:tensorflow:global_step/sec: 2.50197
2021-06-15 12:43:27,712 [INFO] tensorflow: global_step/sec: 2.50197
INFO:tensorflow:epoch = 4.531914893617021, loss = 0.001304254, step = 213 (5.494 sec)
2021-06-15 12:43:28,147 [INFO] tensorflow: epoch = 4.531914893617021, loss = 0.001304254, step = 213 (5.494 sec)
INFO:tensorflow:global_step/sec: 2.47797
2021-06-15 12:43:29,326 [INFO] tensorflow: global_step/sec: 2.47797
INFO:tensorflow:global_step/sec: 2.4851
2021-06-15 12:43:30,936 [INFO] tensorflow: global_step/sec: 2.4851
INFO:tensorflow:global_step/sec: 2.25075
2021-06-15 12:43:32,713 [INFO] tensorflow: global_step/sec: 2.25075
2021-06-15 12:43:32,714 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 9.593
INFO:tensorflow:epoch = 4.829787234042553, loss = 0.0007802945, step = 227 (5.834 sec)
2021-06-15 12:43:33,980 [INFO] tensorflow: epoch = 4.829787234042553, loss = 0.0007802945, step = 227 (5.834 sec)
INFO:tensorflow:global_step/sec: 2.4641
2021-06-15 12:43:34,336 [INFO] tensorflow: global_step/sec: 2.4641
INFO:tensorflow:global_step/sec: 2.64397
2021-06-15 12:43:35,849 [INFO] tensorflow: global_step/sec: 2.64397
2021-06-15 12:43:37,131 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 5/120: loss: 0.00082 Time taken: 0:00:19.484348 ETA: 0:37:20.700027
INFO:tensorflow:global_step/sec: 2.27821
2021-06-15 12:43:37,605 [INFO] tensorflow: global_step/sec: 2.27821
INFO:tensorflow:global_step/sec: 2.55934
2021-06-15 12:43:39,168 [INFO] tensorflow: global_step/sec: 2.55934
INFO:tensorflow:epoch = 5.127659574468085, loss = 0.0010832879, step = 241 (5.597 sec)
2021-06-15 12:43:39,577 [INFO] tensorflow: epoch = 5.127659574468085, loss = 0.0010832879, step = 241 (5.597 sec)
INFO:tensorflow:global_step/sec: 2.34384
2021-06-15 12:43:40,874 [INFO] tensorflow: global_step/sec: 2.34384
INFO:tensorflow:global_step/sec: 2.47716
2021-06-15 12:43:42,489 [INFO] tensorflow: global_step/sec: 2.47716
2021-06-15 12:43:42,930 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 9.788
INFO:tensorflow:global_step/sec: 2.3761
2021-06-15 12:43:44,172 [INFO] tensorflow: global_step/sec: 2.3761
INFO:tensorflow:epoch = 5.425531914893617, loss = 0.001108178, step = 255 (5.735 sec)
2021-06-15 12:43:45,312 [INFO] tensorflow: epoch = 5.425531914893617, loss = 0.001108178, step = 255 (5.735 sec)
INFO:tensorflow:global_step/sec: 2.619
2021-06-15 12:43:45,700 [INFO] tensorflow: global_step/sec: 2.619
INFO:tensorflow:global_step/sec: 2.39603
2021-06-15 12:43:47,369 [INFO] tensorflow: global_step/sec: 2.39603
INFO:tensorflow:global_step/sec: 2.46389
2021-06-15 12:43:48,993 [INFO] tensorflow: global_step/sec: 2.46389
INFO:tensorflow:global_step/sec: 2.70787
2021-06-15 12:43:50,470 [INFO] tensorflow: global_step/sec: 2.70787
INFO:tensorflow:epoch = 5.723404255319148, loss = 0.00061569334, step = 269 (5.557 sec)
2021-06-15 12:43:50,868 [INFO] tensorflow: epoch = 5.723404255319148, loss = 0.00061569334, step = 269 (5.557 sec)
INFO:tensorflow:global_step/sec: 2.56318
2021-06-15 12:43:52,030 [INFO] tensorflow: global_step/sec: 2.56318
2021-06-15 12:43:52,733 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 10.202
INFO:tensorflow:global_step/sec: 2.78304
2021-06-15 12:43:53,468 [INFO] tensorflow: global_step/sec: 2.78304
INFO:tensorflow:global_step/sec: 2.26568
2021-06-15 12:43:55,233 [INFO] tensorflow: global_step/sec: 2.26568
2021-06-15 12:43:55,632 [INFO] iva.detectnet_v2.evaluation.evaluation: step 0 / 7, 0.00s/step
Epoch 6/120
=========================

Validation cost: 0.000950
Mean average_precision (in %): 0.0000

class name      average precision (in %)
------------  --------------------------
car                                    0

Median Inference Time: 0.073559
INFO:tensorflow:epoch = 6.0, loss = 0.00092883315, step = 282 (17.548 sec)
2021-06-15 12:44:08,417 [INFO] tensorflow: epoch = 6.0, loss = 0.00092883315, step = 282 (17.548 sec)
2021-06-15 12:44:08,418 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 6/120: loss: 0.00093 Time taken: 0:00:31.336126 ETA: 0:59:32.318347
INFO:tensorflow:global_step/sec: 0.287938
2021-06-15 12:44:09,125 [INFO] tensorflow: global_step/sec: 0.287938
INFO:tensorflow:global_step/sec: 2.52453
2021-06-15 12:44:10,709 [INFO] tensorflow: global_step/sec: 2.52453
INFO:tensorflow:global_step/sec: 2.29179
2021-06-15 12:44:12,455 [INFO] tensorflow: global_step/sec: 2.29179
INFO:tensorflow:epoch = 6.297872340425532, loss = 0.001180372, step = 296 (5.678 sec)
2021-06-15 12:44:14,094 [INFO] tensorflow: epoch = 6.297872340425532, loss = 0.001180372, step = 296 (5.678 sec)
INFO:tensorflow:global_step/sec: 2.43751
2021-06-15 12:44:14,096 [INFO] tensorflow: global_step/sec: 2.43751
2021-06-15 12:44:15,337 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 4.424
INFO:tensorflow:global_step/sec: 2.41797
2021-06-15 12:44:15,750 [INFO] tensorflow: global_step/sec: 2.41797
INFO:tensorflow:global_step/sec: 2.65673
2021-06-15 12:44:17,256 [INFO] tensorflow: global_step/sec: 2.65673
INFO:tensorflow:global_step/sec: 2.4071
2021-06-15 12:44:18,917 [INFO] tensorflow: global_step/sec: 2.4071
INFO:tensorflow:epoch = 6.595744680851063, loss = 0.0012715331, step = 310 (5.551 sec)
2021-06-15 12:44:19,645 [INFO] tensorflow: epoch = 6.595744680851063, loss = 0.0012715331, step = 310 (5.551 sec)
INFO:tensorflow:global_step/sec: 2.57286
2021-06-15 12:44:20,472 [INFO] tensorflow: global_step/sec: 2.57286
INFO:tensorflow:global_step/sec: 2.87132
2021-06-15 12:44:21,865 [INFO] tensorflow: global_step/sec: 2.87132
INFO:tensorflow:global_step/sec: 2.26257

I have attached the specs and one example from the dataset (the full dataset is around 230 images like this one).


frame_000150.xml (529 Bytes)
frame_000150.txt (40 Bytes)
detectnet_v2_retrain_trafficcamnet_car_kitti.txt (3.3 KB)
detectnet_v2_train_trafficcamnet_car_kitti.txt (3.3 KB)

Which tlt docker did you use? 3.0-dp-py3 or 3.0-py3?

Configuration of the TLT Instance
dockers: ['nvcr.io/nvidia/tlt-streamanalytics', 'nvcr.io/nvidia/tlt-pytorch']
format_version: 1.0
tlt_version: 3.0
published_date: 02/02/2021

nvcr.io/nvidia/tlt-streamanalytics v3.0-dp-py3 a865982b80a3 4 months ago 15.5GB

It should be the 3.0-dp-py3 docker. Your images are 1920x1080, but your training spec is 960x544. With this version of the docker, the end user must resize the images and labels offline to 960x544 so that they match the training spec.
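If you go the offline-resize route, the bounding boxes in the KITTI label files must be scaled by the same factors as the images. A minimal sketch of the label side (the function name and the two-decimal formatting are illustrative, not part of TLT; the images themselves can be resized with any tool such as PIL or OpenCV):

```python
# Rescale KITTI detection labels when resizing images offline,
# e.g. from 1920x1080 down to the 960x544 set in the training spec.

def scale_kitti_line(line, sx, sy):
    """Scale the bbox fields (xmin, ymin, xmax, ymax) of one KITTI label line."""
    f = line.split()
    # KITTI detection format: class, truncated, occluded, alpha,
    # then the 2D bbox at fields 4..7 (the remaining 3D fields are
    # unused for DetectNet_v2 and left as-is).
    for i, s in zip((4, 5, 6, 7), (sx, sy, sx, sy)):
        f[i] = "{:.2f}".format(float(f[i]) * s)
    return " ".join(f)

# Scale factors for 1920x1080 -> 960x544.
sx, sy = 960 / 1920.0, 544 / 1080.0
print(scale_kitti_line("car 0 0 0 100 200 300 400 0 0 0 0 0 0 0", sx, sy))
```

Note that width and height scale by different factors here (0.5 vs. 544/1080), so each y coordinate must use the height factor.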

I suggest you use the latest 3.0-py3 docker. With that version, the end user does not need to resize images/labels manually; just set the option below. See https://docs.nvidia.com/tlt/tlt-user-guide/text/object_detection/detectnet_v2.html#input-requirement:

The train tool does not support training on images of multiple resolutions. However, the dataloader does support resizing images to the input resolution defined in the specification file. This can be enabled by setting the enable_auto_resize parameter to true in the augmentation_config module of the spec file.

How should I force the tlt command to use the 3.0-py3 docker? When I run the !tlt detectnet_v2 dataset_convert command, it automatically downloads the 3.0-dp-py3 docker, I believe.

See https://docs.nvidia.com/tlt/tlt-user-guide/text/tlt_quick_start_guide.html#installing-tlt

If you have an older version of the nvidia-tlt launcher installed, upgrade to the latest version by running the following command.

pip3 install --upgrade nvidia-tlt

Then, verify with one of the ways below.

  1. In Python:
       import tlt
       print(tlt.__version__)
     This should report 0.1.4.

  2. Run tlt info --verbose and check "docker_tag:".

Many thanks. However, I have updated tlt to the right version and changed the config file, but it still fails. This is really my first time doing custom training :-(

tlt info --verbose:

Configuration of the TLT Instance

dockers: 		
	nvidia/tlt-streamanalytics: 			
		docker_registry: nvcr.io
		docker_tag: v3.0-py3
		tasks: 
			1. augment
			2. bpnet
			3. classification
			4. detectnet_v2
			5. dssd
			6. emotionnet
			7. faster_rcnn
			8. fpenet
			9. gazenet
			10. gesturenet
			11. heartratenet
			12. lprnet
			13. mask_rcnn
			14. multitask_classification
			15. retinanet
			16. ssd
			17. unet
			18. yolo_v3
			19. yolo_v4
			20. tlt-converter
	nvidia/tlt-pytorch: 			
		docker_registry: nvcr.io
		docker_tag: v3.0-py3
		tasks: 
			1. speech_to_text
			2. speech_to_text_citrinet
			3. text_classification
			4. question_answering
			5. token_classification
			6. intent_slot_classification
			7. punctuation_and_capitalization
format_version: 1.0
tlt_version: 3.0
published_date: 04/16/2021
  preprocessing {
    output_image_width: 960
    output_image_height: 544
    crop_right: 960
    crop_bottom: 544
    min_bbox_width: 1.0
    min_bbox_height: 1.0
    output_image_channel: 3
    enable_auto_resize: True
  }

How did you generate tfrecords /workspace/tlt-experiments/data/tfrecords/kitti_trainval/* ?
Do you have log?

Can you modify image_extension: ".PNG" to image_extension: ".png" and retry?
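If the image files themselves are also named *.PNG, one way to rename them in bulk is the loop below (the directory path is an assumption based on your spec; adjust it to your image_dir_name):

```shell
# Rename *.PNG to *.png in the image directory (path is an assumption).
IMAGE_DIR=/workspace/tlt-experiments/data/training/image_2
for f in "$IMAGE_DIR"/*.PNG; do
  [ -e "$f" ] || continue   # skip if nothing matches the glob
  mv -- "$f" "${f%.PNG}.png"
done
```

Remember that the image_extension value in the tfrecords spec must then match ".png".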

Here is the full log:

TFrecords conversion spec file for kitti training
kitti_config {
  root_directory_path: "/workspace/tlt-experiments/data/training"
  image_dir_name: "image_2"
  label_dir_name: "label_2_kitti"
  image_extension: ".PNG"
  partition_mode: "random"
  num_partitions: 2
  val_split: 14
  num_shards: 4
}
image_directory_path: "/workspace/tlt-experiments/data/training"
!ls $$LOCAL_DATA_DIR/tfrecords/kitti_trainval/
-fold-000-of-002-shard-00000-of-00004  -fold-001-of-002-shard-00000-of-00004
-fold-000-of-002-shard-00001-of-00004  -fold-001-of-002-shard-00001-of-00004
-fold-000-of-002-shard-00002-of-00004  -fold-001-of-002-shard-00002-of-00004
-fold-000-of-002-shard-00003-of-00004  -fold-001-of-002-shard-00003-of-00004
!rm -rf $$LOCAL_DATA_DIR/tfrecords/kitti_trainval/
# Creating a new directory for the output tfrecords dump.
print("Converting Tfrecords for kitti trainval dataset")
!tlt detectnet_v2 dataset_convert \
                  -d $SPECS_DIR/detectnet_v2_tfrecords_kitti_trainval.txt \
                  -o $DATA_DOWNLOAD_DIR/tfrecords/kitti_trainval/kitti_trainval
Converting Tfrecords for kitti trainval dataset
2021-06-16 12:40:53,881 [INFO] root: Registry: ['nvcr.io']
2021-06-16 12:40:53,970 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the ~/.tlt_mounts.json file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
2021-06-16 04:41:02,460 - iva.detectnet_v2.dataio.build_converter - INFO - Instantiating a kitti converter
2021-06-16 04:41:02,461 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Creating output directory /workspace/tlt-experiments/data/tfrecords/kitti_trainval
2021-06-16 04:41:02,462 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Num images in
Train: 188	Val: 30
2021-06-16 04:41:02,462 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Validation data in partition 0. Hence, while choosing the validationset during training choose validation_fold 0.
2021-06-16 04:41:02,462 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 0
WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataio/dataset_converter_lib.py:142: The name tf.python_io.TFRecordWriter is deprecated. Please use tf.io.TFRecordWriter instead.

2021-06-16 04:41:02,462 - tensorflow - WARNING - From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataio/dataset_converter_lib.py:142: The name tf.python_io.TFRecordWriter is deprecated. Please use tf.io.TFRecordWriter instead.

/usr/local/lib/python3.6/dist-packages/iva/detectnet_v2/dataio/kitti_converter_lib.py:273: VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.
2021-06-16 04:41:02,477 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 1
2021-06-16 04:41:02,484 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 2
2021-06-16 04:41:02,491 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 3
2021-06-16 04:41:02,500 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - 
Wrote the following numbers of objects:
b'car': 59

2021-06-16 04:41:02,500 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 0
2021-06-16 04:41:02,546 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 1
2021-06-16 04:41:02,592 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 2
2021-06-16 04:41:02,638 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 3
2021-06-16 04:41:02,684 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - 
Wrote the following numbers of objects:
b'car': 338

2021-06-16 04:41:02,684 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Cumulative object statistics
2021-06-16 04:41:02,684 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - 
Wrote the following numbers of objects:
b'car': 397

2021-06-16 04:41:02,684 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Class map. 
Label in GT: Label in tfrecords file 
b'car': b'car'
For the dataset_config in the experiment_spec, please use labels in the tfrecords file, while writing the classmap.

2021-06-16 04:41:02,684 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Tfrecords generation complete.
2021-06-16 12:41:04,028 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
$LOCAL_DATA_DIR/tfrecords/kitti_trainval/
!ls -rlt $LOCAL_DATA_DIR/tfrecords/kitti_trainval/
total 160
-rw-r--r-- 1 root root  4597 Jun 16 12:41 kitti_trainval-fold-000-of-002-shard-00000-of-00004
-rw-r--r-- 1 root root  4655 Jun 16 12:41 kitti_trainval-fold-000-of-002-shard-00001-of-00004
-rw-r--r-- 1 root root  4771 Jun 16 12:41 kitti_trainval-fold-000-of-002-shard-00002-of-00004
-rw-r--r-- 1 root root  5869 Jun 16 12:41 kitti_trainval-fold-000-of-002-shard-00003-of-00004
-rw-r--r-- 1 root root 30559 Jun 16 12:41 kitti_trainval-fold-001-of-002-shard-00000-of-00004
-rw-r--r-- 1 root root 31081 Jun 16 12:41 kitti_trainval-fold-001-of-002-shard-00001-of-00004
-rw-r--r-- 1 root root 30791 Jun 16 12:41 kitti_trainval-fold-001-of-002-shard-00002-of-00004
-rw-r--r-- 1 root root 30385 Jun 16 12:41 kitti_trainval-fold-001-of-002-shard-00003-of-00004

FYI, I have resized all the images offline to 960x544, changed the image file extensions from .PNG to .png, and modified the config file accordingly, but the problem is still there. Really confused! The tfrecords look correct and the images have been resized offline. Where else should I look?
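One sanity check worth running at this point (a sketch of my own, not a TLT tool): verify that every label box actually fits inside 960x544. If the images were resized but the label boxes were not scaled with them, xmax/ymax will still run up toward 1920/1080, the ground truth will never line up with the network output, and precision can stay at zero.

```python
# Sketch: flag KITTI boxes that fall outside the training resolution.
import glob
import os

def out_of_bounds_boxes(label_dir, width=960, height=544):
    """Return (filename, line_number) pairs for KITTI boxes that do not
    fit inside width x height. Columns 4-7 are xmin, ymin, xmax, ymax."""
    bad = []
    for path in sorted(glob.glob(os.path.join(label_dir, "*.txt"))):
        with open(path) as f:
            for n, line in enumerate(f, start=1):
                parts = line.split()
                if len(parts) < 8:
                    continue
                xmin, ymin, xmax, ymax = map(float, parts[4:8])
                if xmin < 0 or ymin < 0 or xmax > width or ymax > height:
                    bad.append((os.path.basename(path), n))
    return bad
```

For example, out_of_bounds_boxes("/workspace/tlt-experiments/data/training/label_2_kitti") should return an empty list if the offline resize scaled the labels correctly.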

Can you modify the values below and retry?

  1. minimum_bounding_box_height: 15 (change to 5)

  2. evaluation_box_config {
       key: "car"
       value {
         minimum_height: 20 (change to 5)
         maximum_height: 9999
         minimum_width: 10 (change to 5)
         maximum_width: 9999
       }
     }

Still not working. I suspect it is very silly mistake…

What are you using in "pretrained_model_file" now?

Also, can you set a higher learning rate, as below?
min_learning_rate: 5e-6
max_learning_rate: 5e-4
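For intuition, here is a rough sketch of how I understand the soft_start_annealing_schedule to behave (my reading of the docs; the exact interpolation inside TLT may differ): the learning rate climbs log-linearly from min to max over the soft_start fraction of training, holds at max, then decays back toward min after the annealing point.

```python
# Sketch of the soft-start annealing schedule (an interpretation, not
# TLT's actual implementation). progress runs from 0.0 to 1.0.
import math

def soft_start_annealing_lr(progress, min_lr=5e-6, max_lr=5e-4,
                            soft_start=0.1, annealing=0.7):
    log_min, log_max = math.log(min_lr), math.log(max_lr)
    if progress < soft_start:
        # ramp up log-linearly from min_lr to max_lr
        t = progress / soft_start
        return math.exp(log_min + t * (log_max - log_min))
    if progress < annealing:
        # hold at the peak learning rate
        return max_lr
    # decay back down after the annealing point
    t = (progress - annealing) / (1.0 - annealing)
    return math.exp(log_max - t * (log_max - log_min))
```

With the thread's spec values (soft_start: 0.1, annealing: 0.7), the schedule peaks at max_learning_rate between 10% and 70% of training.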

/workspace/tlt-experiments/detectnet_v2/pretrained_trafficcamnet/tlt_trafficcamnet_vunpruned_v1.0/resnet18_trafficcamnet.tlt

I have changed use_batch_norm: to true, and now the precision starts to accumulate, although it is still very small… I will continue to experiment and see how it improves.

Now the training finishes after 120 epochs, but the precision only managed to get up to 5%. Do I need more training data and more epochs, or are there other parameters to fine-tune?

I attached the current conf below

random_seed: 42
dataset_config {
  data_sources {
    tfrecords_path: "/workspace/tlt-experiments/data/tfrecords/kitti_trainval/*"
    image_directory_path: "/workspace/tlt-experiments/data/training"
  }
  image_extension: "png"
  target_class_mapping {
    key: "car"
    value: "car"
  }

  validation_fold: 0
}

model_config {
  pretrained_model_file: "/workspace/tlt-experiments/detectnet_v2/pretrained_trafficcamnet/tlt_trafficcamnet_vunpruned_v1.0/resnet18_trafficcamnet.tlt"
  num_layers: 18
  use_batch_norm: true
  objective_set {
    bbox {
      scale: 35.0
      offset: 0.5
    }
    cov {
    }
  }
  training_precision {
    backend_floatx: FLOAT32
  }
  arch: "resnet"
  all_projections: true
}

augmentation_config {
  preprocessing {
    output_image_width: 960
    output_image_height: 544
    crop_right: 960
    crop_bottom: 544
    min_bbox_width: 1.0
    min_bbox_height: 1.0
    output_image_channel: 3
    enable_auto_resize: false
  }
  spatial_augmentation {
    hflip_probability: 0.5
    zoom_min: 1
    zoom_max: 1
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    hue_rotation_max: 25.0
    saturation_shift_max: 0.20000000298
    contrast_scale_max: 0.10000000149
    contrast_center: 0.5
  }
}

postprocessing_config {
  target_class_config {
    key: "car"
    value {
      clustering_config {
        clustering_algorithm: DBSCAN
        dbscan_confidence_threshold: 0.7
        coverage_threshold: 0.00499999988824
        dbscan_eps: 0.20000000298
        dbscan_min_samples: 0.0500000007451
        minimum_bounding_box_height: 5
      }
    }
  }
}

evaluation_config {
  validation_period_during_training: 5
  first_validation_epoch: 1
  minimum_detection_ground_truth_overlap {
    key: "car"
    value: 0.699999988079
  }
  evaluation_box_config {
    key: "car"
    value {
      minimum_height: 5
      maximum_height: 9999
      minimum_width: 5
      maximum_width: 9999
    }
  }
  average_precision_mode: INTEGRATE
}

cost_function_config {
  target_classes {
    name: "car"
    class_weight: 1.0
    coverage_foreground_weight: 0.0500000007451
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  enable_autoweighting: true
  max_objective_weight: 0.999899983406
  min_objective_weight: 9.99999974738e-05
}

training_config {
  batch_size_per_gpu: 4
  num_epochs: 20
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-6
      max_learning_rate: 5e-4
      soft_start: 0.10000000149
      annealing: 0.699999988079
    }
  }
  regularizer {
    type: L1
    weight: 3.00000002618e-09
  }
  optimizer {
    adam {
      epsilon: 9.99999993923e-09
      beta1: 0.899999976158
      beta2: 0.999000012875
    } 
  }
  cost_scaling {
    initial_exponent: 20.0
    increment: 0.005
    decrement: 1.0
  }
  checkpoint_interval: 10
}

bbox_rasterizer_config {
  target_class_config {
  key: "car"
  value {
    cov_center_x: 0.5
    cov_center_y: 0.5
    cov_radius_x: 0.40000000596
    cov_radius_y: 0.40000000596
    bbox_min_radius: 1.0
    }
  }
  deadzone_radius: 0.400000154972
}

Glad to see the original issue is gone.
To improve the AP, you can add more training data and run more experiments on it — for example, more epochs, or tuning the batch size, learning rate, etc.
More info on small objects can be found in the NVIDIA TAO documentation.


Many thanks for your support! It has been a long couple of days!