Horovod shutdown when trying to train the NVIDIA sample Efficientnet_B0 network

T4 GPU
deepstream:6.0-triton
TensorRT Version=8.0.1
NVIDIA GPU Driver Version 470.82.01

Other sample networks trained fine on this same setup, but training the Efficientnet_B0 network from NVIDIA NGC fails with the error below. Has anyone else seen this? If so, could you please share any information?
Thank you very much,
Brandt
tensorflow.python.framework.errors_impl.UnknownError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
[[{{node training/SGD/DistributedSGD_Allreduce/cond_86/HorovodAllreduce_training_SGD_cond_86_Merge_0}}]]

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
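
The error text says the real exception, if there was one, should appear in the log before the first shutdown message. A minimal sketch of how one might pull those earlier lines out of a saved training log is shown here. The function name `lines_before_shutdown`, the keyword pattern, and the log file path passed on the command line are all assumptions for illustration, not part of Horovod or the NGC sample.

```python
import re
import sys

# Marker copied from the error message above.
SHUTDOWN_MARKER = "Horovod has been shut down"

# Assumed heuristic: keywords that often show up on a root-cause line.
SUSPECT = re.compile(r"Traceback|Error|Exception|out of memory|OOM", re.IGNORECASE)


def lines_before_shutdown(log_text):
    """Return suspicious lines that occur before the first shutdown message."""
    lines = log_text.splitlines()
    # Index of the first shutdown message, or end of log if it never appears.
    first_shutdown = next(
        (i for i, line in enumerate(lines) if SHUTDOWN_MARKER in line),
        len(lines),
    )
    return [line for line in lines[:first_shutdown] if SUSPECT.search(line)]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name): python find_cause.py train.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for hit in lines_before_shutdown(fh.read()):
            print(hit)
```

With several ranks writing to the same log, output can interleave, so the lines this prints are only candidates and still need to be read in context.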

Is this a training question rather than a DeepStream-related one?
