TLT v2 error when running detectnet_v2

We are training detectnet_v2 with TLT v2 after adding two classes to the model, and the run fails with the error below. With TLT v1, detectnet trains fine, and SSD works with both TLT v1 and v2.

Error:

target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
2020-05-11 14:43:42,877 [INFO] iva.detectnet_v2.scripts.train: Found 1272 samples in validation set
Traceback (most recent call last):
File “/usr/local/bin/tlt-train-g1”, line 8, in
sys.exit(main())
File “./common/magnet_train.py”, line 47, in main
File “”, line 2, in main
File “./detectnet_v2/utilities/timer.py”, line 46, in wrapped_fn
File “./detectnet_v2/scripts/train.py”, line 667, in main
File “./detectnet_v2/scripts/train.py”, line 591, in run_experiment
File “./detectnet_v2/scripts/train.py”, line 525, in train_gridbox
File “./detectnet_v2/scripts/train.py”, line 142, in run_training_loop
File “./detectnet_v2/training/utilities.py”, line 143, in get_singular_monitored_session
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py”, line 1021, in init
stop_grace_period_secs=stop_grace_period_secs)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py”, line 650, in init
self._sess = self._coordinated_creator.create_session()
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py”, line 812, in create_session
hook.after_create_session(self.tf_sess, self.coord)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/basic_session_run_hooks.py”, line 568, in after_create_session
self._save(session, global_step)
File “./detectnet_v2/tfhooks/checkpoint_saver_hook.py”, line 77, in _save
File “./detectnet_v2/tfhooks/checkpoint_saver_hook.py”, line 110, in _save_encrypted_checkpoint
IOError: [Errno 2] No such file or directory: ‘output/model.step-0.ckzip’
Traceback (most recent call last):
File “/usr/local/bin/tlt-train-g1”, line 8, in
sys.exit(main())
File “./common/magnet_train.py”, line 47, in main
File “”, line 2, in main
File “./detectnet_v2/utilities/timer.py”, line 46, in wrapped_fn
File “./detectnet_v2/scripts/train.py”, line 667, in main
File “./detectnet_v2/scripts/train.py”, line 591, in run_experiment
File “./detectnet_v2/scripts/train.py”, line 525, in train_gridbox
Traceback (most recent call last):
File “/usr/local/bin/tlt-train-g1”, line 8, in
File “./detectnet_v2/scripts/train.py”, line 144, in run_training_loop
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py”, line 676, in run
run_metadata=run_metadata)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py”, line 1270, in run
raise six.reraise(*original_exc_info)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py”, line 1255, in run
return self._sess.run(*args, **kwargs)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py”, line 1327, in run
run_metadata=run_metadata)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py”, line 1091, in run
return self._sess.run(*args, **kwargs)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py”, line 929, in run
sys.exit(main())
File “./common/magnet_train.py”, line 47, in main
File “”, line 2, in main
File “./detectnet_v2/utilities/timer.py”, line 46, in wrapped_fn
File “./detectnet_v2/scripts/train.py”, line 667, in main
File “./detectnet_v2/scripts/train.py”, line 591, in run_experiment
File “./detectnet_v2/scripts/train.py”, line 525, in train_gridbox
File “./detectnet_v2/scripts/train.py”, line 144, in run_training_loop
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py”, line 676, in run
run_metadata=run_metadata)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py”, line 1270, in run
raise six.reraise(*original_exc_info)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py”, line 1255, in run
return self._sess.run(*args, **kwargs)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py”, line 1327, in run
run_metadata=run_metadata)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py”, line 1091, in run
return self._sess.run(*args, **kwargs)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py”, line 929, in run
run_metadata_ptr)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py”, line 1152, in _run
run_metadata_ptr)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py”, line 1152, in _run
feed_dict_tensor, options, run_metadata)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py”, line 1328, in _do_run
run_metadata)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py”, line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError feed_dict_tensor, options, run_metadata)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py”, line 1328, in _do_run
run_metadata)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py”, line 1348, in _do_call
: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
[[node DistributedAdamOptimizer_Allreduce/HorovodAllreduce_gradients_resnet18_nopool_bn_detectnet_v2_bn_conv1_FusedBatchNorm_grad_tuple_control_dependency_1_0 (defined at :78) ]]
[[node DistributedAdamOptimizer_Allreduce/HorovodAllreduce_gradients_resnet18_nopool_bn_detectnet_v2_block_3a_bn_2_FusedBatchNorm_grad_tuple_control_dependency_1_0 (defined at :78) ]]

Caused by op u’DistributedAdamOptimizer_Allreduce/HorovodAllreduce_gradients_resnet18_nopool_bn_detectnet_v2_bn_conv1_FusedBatchNorm_grad_tuple_control_dependency_1_0’, defined at:
File “/usr/local/bin/tlt-train-g1”, line 8, in
sys.exit(main())
File “./common/magnet_train.py”, line 47, in main
File “”, line 2, in main
File “./detectnet_v2/utilities/timer.py”, line 46, in wrapped_fn
File “./detectnet_v2/scripts/train.py”, line 667, in main
File “./detectnet_v2/scripts/train.py”, line 591, in run_experiment
File “./detectnet_v2/scripts/train.py”, line 500, in train_gridbox
File “./detectnet_v2/scripts/train.py”, line 353, in build_training_graph
File “./detectnet_v2/model/detectnet_model.py”, line 531, in build_training_graph
File “./detectnet_v2/training/train_op_generator.py”, line 60, in get_train_op
File “./detectnet_v2/training/train_op_generator.py”, line 75, in _get_train_op_without_cost_scaling
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py”, line 403, in minimize
grad_loss=grad_loss)
File “/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/init.py”, line 230, in compute_gradients
avg_grads = self._allreduce_grads(grads)
File “/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/init.py”, line 209, in allreduce_grads
for grad in grads]
File “/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/init.py”, line 88, in allreduce
summed_tensor_compressed = _allreduce(tensor_compressed)
File “/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/mpi_ops.py”, line 91, in _allreduce
return MPI_LIB.horovod_allreduce(tensor, name=name)
File “”, line 78, in horovod_allreduce
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py”, line 788, in _apply_op_helper
op_def=op_def)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py”, line 507, in new_func
return func(*args, **kwargs)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py”, line 3300, in create_op
op_def=op_def)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py”, line 1801, in init
self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
[[node DistributedAdamOptimizer_Allreduce/HorovodAllreduce_gradients_resnet18_nopool_bn_detectnet_v2_bn_conv1_FusedBatchNorm_grad_tuple_control_dependency_1_0 (defined at :78) ]]
[[node DistributedAdamOptimizer_Allreduce/HorovodAllreduce_gradients_resnet18_nopool_bn_detectnet_v2_block_3a_bn_2_FusedBatchNorm_grad_tuple_control_dependency_1_0 (defined at :78) ]]

raise type(e)(node_def, op, message)

tensorflow.python.framework.errors_impl.UnknownError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
[[node DistributedAdamOptimizer_Allreduce/HorovodAllreduce_gradients_resnet18_nopool_bn_detectnet_v2_bn_conv1_FusedBatchNorm_grad_tuple_control_dependency_2_0 (defined at :78) ]]
[[node DistributedAdamOptimizer_Allreduce/HorovodAllreduce_gradients_resnet18_nopool_bn_detectnet_v2_block_3a_bn_2_FusedBatchNorm_grad_tuple_control_dependency_1_0 (defined at :78) ]]

Caused by op u’DistributedAdamOptimizer_Allreduce/HorovodAllreduce_gradients_resnet18_nopool_bn_detectnet_v2_bn_conv1_FusedBatchNorm_grad_tuple_control_dependency_2_0’, defined at:
File “/usr/local/bin/tlt-train-g1”, line 8, in
sys.exit(main())
File “./common/magnet_train.py”, line 47, in main
File “”, line 2, in main
File “./detectnet_v2/utilities/timer.py”, line 46, in wrapped_fn
File “./detectnet_v2/scripts/train.py”, line 667, in main
File “./detectnet_v2/scripts/train.py”, line 591, in run_experiment
File “./detectnet_v2/scripts/train.py”, line 500, in train_gridbox
File “./detectnet_v2/scripts/train.py”, line 353, in build_training_graph
File “./detectnet_v2/model/detectnet_model.py”, line 531, in build_training_graph
File “./detectnet_v2/training/train_op_generator.py”, line 60, in get_train_op
File “./detectnet_v2/training/train_op_generator.py”, line 75, in _get_train_op_without_cost_scaling
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py”, line 403, in minimize
grad_loss=grad_loss)
File “/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/init.py”, line 230, in compute_gradients
avg_grads = self._allreduce_grads(grads)
File “/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/init.py”, line 209, in allreduce_grads
for grad in grads]
File “/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/init.py”, line 88, in allreduce
summed_tensor_compressed = _allreduce(tensor_compressed)
File “/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/mpi_ops.py”, line 91, in _allreduce
return MPI_LIB.horovod_allreduce(tensor, name=name)
File “”, line 78, in horovod_allreduce
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py”, line 788, in _apply_op_helper
op_def=op_def)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py”, line 507, in new_func
return func(*args, **kwargs)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py”, line 3300, in create_op
op_def=op_def)
File “/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py”, line 1801, in init
self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
[[node DistributedAdamOptimizer_Allreduce/HorovodAllreduce_gradients_resnet18_nopool_bn_detectnet_v2_bn_conv1_FusedBatchNorm_grad_tuple_control_dependency_2_0 (defined at :78) ]]
[[node DistributedAdamOptimizer_Allreduce/HorovodAllreduce_gradients_resnet18_nopool_bn_detectnet_v2_block_3a_bn_2_FusedBatchNorm_grad_tuple_control_dependency_1_0 (defined at :78) ]]


Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[35227,1],1]
Exit code: 1
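
The Horovod message says the real exception should appear in the log before the first shutdown message. In the output above, that appears to be the `IOError` for `output/model.step-0.ckzip`, which may mean the `output` directory is not present where that rank is running. Since the ranks interleave their tracebacks, here is a minimal sketch for pulling that first non-Horovod exception out of a saved log. The function name `first_root_exception` and the regular expression are my own assumptions for illustration; they are not part of TLT or Horovod.

```python
import re

# Matches a final traceback line of the form "SomeError: message" or
# "pkg.module.SomeError: message". This pattern is an assumption about
# how Python exception lines look in the log; adjust it as needed.
EXC_LINE = re.compile(r"^\s*[A-Za-z_][\w.]*(?:Error|Exception): .*$")


def first_root_exception(log_text):
    """Return the first exception line that is not the Horovod shutdown notice.

    Returns None when no such line is found.
    """
    for line in log_text.splitlines():
        if EXC_LINE.match(line) and "Horovod has been shut down" not in line:
            return line.strip()
    return None
```

Calling `first_root_exception(open("train.log").read())` on a saved copy of the log (the file name `train.log` is hypothetical) skips the repeated `UnknownError` lines and returns the earlier exception line, if there is one.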