An error occurred when running TLT training

Hi, training ran completely normally yesterday, but now it fails with the following error:

Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.

[[53698,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
Host: 2d312733f74d

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.

[2d312733f74d:01866] 3 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[2d312733f74d:01866] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
2021-02-26 07:08:53,606 [INFO] /usr/local/lib/python2.7/dist-packages/iva/faster_rcnn/spec_loader/spec_loader.pyc: Loading experiment spec at ./specs/default_spec_resnet50.txt.
2021-02-26 07:08:53,624 [INFO] /usr/local/lib/python2.7/dist-packages/iva/faster_rcnn/spec_loader/spec_loader.pyc: Loading experiment spec at ./specs/default_spec_resnet50.txt.
2021-02-26 07:08:53,662 [INFO] /usr/local/lib/python2.7/dist-packages/iva/faster_rcnn/spec_loader/spec_loader.pyc: Loading experiment spec at ./specs/default_spec_resnet50.txt.
2021-02-26 07:08:53,670 [INFO] /usr/local/lib/python2.7/dist-packages/iva/faster_rcnn/spec_loader/spec_loader.pyc: Loading experiment spec at ./specs/default_spec_resnet50.txt.
Traceback (most recent call last):
File "/usr/local/bin/tlt-train-g1", line 8, in <module>
sys.exit(main())
File "./common/magnet_train.py", line 33, in main
File "./faster_rcnn/scripts/train.py", line 56, in main
File "./faster_rcnn/models/utils.py", line 215, in build_or_resume_model
File "./faster_rcnn/data_loader/inputs_loader.py", line 74, in __init__
File "./detectnet_v2/dataloader/default_dataloader.py", line 201, in get_dataset_tensors
File "./detectnet_v2/dataloader/utilities.py", line 181, in extract_tfrecords_features
StopIteration
(The same traceback is printed by each of the other three training processes.)

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[53698,1],2]
Exit code: 1

Please check your tfrecord files. The StopIteration raised in extract_tfrecords_features usually means the dataloader could not read any records from the tfrecord files referenced in your training spec, so verify that those files still exist and are not empty, and regenerate them if needed.
References:
[Urgent] Can't run `tlt-evaluate faster_rcnn` for exported model - #9 by cogbot
Training detectnet_v2 Issue - #8 by Morganh
tlt-train error when deploy mobilenet_v2 by using DetectNet - #17 by Morganh
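For a quick sanity check, you can count the records in each tfrecord shard before retrying training. This is only a minimal sketch: it assumes TensorFlow 1.x (as in the TLT container) and uses a placeholder glob pattern, so substitute the tfrecords path from your own spec file.

```python
# Count records per tfrecord shard; missing or empty shards make the TLT
# dataloader raise StopIteration, as in the traceback above.
# Assumes TensorFlow 1.x; the glob pattern below is a placeholder.
import glob
import tensorflow as tf

pattern = "/workspace/tlt-experiments/tfrecords/kitti_trainval/*"  # hypothetical path

shards = sorted(glob.glob(pattern))
if not shards:
    print("No tfrecord files match %s" % pattern)

total = 0
for path in shards:
    count = sum(1 for _ in tf.python_io.tf_record_iterator(path))
    print("%-70s %6d records" % (path, count))
    total += count

print("Total records: %d" % total)
```

If the pattern matches no files, or every shard reports 0 records, regenerate the tfrecords before running tlt-train again.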