Horovod shutdown when trying to train the NVIDIA sample Efficientnet_B0 network

TensorRT Version: 8.0.1
NVIDIA GPU Driver Version: 470.82.01

Other sample networks trained fine on this same setup, but training the Efficientnet_B0 network from NVIDIA NGC fails with the error below. Has anyone else seen this? If so, could you please share any information?
Thank you very much,
tensorflow.python.framework.errors_impl.UnknownError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
[[{{node training/SGD/DistributedSGD_Allreduce/cond_86/HorovodAllreduce_training_SGD_cond_86_Merge_0}}]]

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
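
The Horovod message above says that if an exception caused the shutdown, it should appear in the log before the first shutdown message. A minimal sketch of how one might search a saved training log for that earlier exception is shown here. The function name, the keyword pattern, and the 20-line window are all assumptions chosen for illustration and are not part of Horovod or the NGC sample.

```python
import re

# Text taken from the error message in this post.
SHUTDOWN_MARKER = "Horovod has been shut down"

# Assumed keywords that often mark a root-cause line; adjust as needed.
ERROR_PATTERN = re.compile(r"(Traceback|Error|Exception|OOM|Killed)", re.IGNORECASE)


def lines_before_shutdown(log_lines, context=20):
    """Return suspicious lines from the `context` lines that precede
    the first Horovod shutdown message, or [] if none is found."""
    for idx, line in enumerate(log_lines):
        if SHUTDOWN_MARKER in line:
            # Look only at the window just before the first shutdown message.
            window = log_lines[max(0, idx - context):idx]
            return [l for l in window if ERROR_PATTERN.search(l)]
    return []
```

To use it, pass the lines of the combined output from all ranks, for example the result of `splitlines()` on the contents of a saved log file. Any lines it returns are only candidates for the original failure on one of the ranks and still need to be read in context.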

Is this a training question rather than a DeepStream-related one?
