Hi,
I am getting the below error when I run “TLT MaskRCNN example usecase” on a custom-made dataset. My dataset includes images with size 545x800 and polygon annotations. Interestingly when I run tlt-train on the COCO dataset, there is no error and it runs smoothly with two gpus but pointing to the custom-made train.record data will create the issue.
For multi-GPU, change --gpus based on your machine.
2021-03-12 19:57:46.485641: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-03-12 19:57:46.502109: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
[MaskRCNN] INFO : Loading weights from /workspace/server/tlt-experiments/maskrcnn/experiment_dir_unpruned/model.step-0.tlt
[MaskRCNN] INFO : Loading weights from /workspace/server/tlt-experiments/maskrcnn/experiment_dir_unpruned/model.step-0.tlt
[MaskRCNN] INFO : Horovod successfully initialized …
[MaskRCNN] INFO : Create EncryptCheckpointSaverHook.
[MaskRCNN] INFO : =================================
[MaskRCNN] INFO : Start training cycle 01
[MaskRCNN] INFO : =================================
[MaskRCNN] INFO : Using Dataset Sharding with Horovod
2021-03-12 19:57:57.070815: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-03-12 19:57:57.129558: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0a:00.0
2021-03-12 19:57:57.129606: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-03-12 19:57:57.131500: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-03-12 19:57:57.132423: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-03-12 19:57:57.133049: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-03-12 19:57:57.135068: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-03-12 19:57:57.136282: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-03-12 19:57:57.143905: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-03-12 19:57:57.147315: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-03-12 19:57:57.324039: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-03-12 19:57:57.354097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:41:00.0
2021-03-12 19:57:57.354165: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-03-12 19:57:57.355455: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-03-12 19:57:57.356925: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-03-12 19:57:57.357271: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-03-12 19:57:57.359074: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-03-12 19:57:57.360853: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-03-12 19:57:57.365959: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-03-12 19:57:57.370115: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: multilevel_propose_rois/level_2/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: multilevel_propose_rois/level_3/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: multilevel_propose_rois/level_4/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: multilevel_propose_rois/level_5/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS… Scope: multilevel_propose_rois/level_6/
2021-03-12 19:58:00.626653: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:0a:00.0
2021-03-12 19:58:00.626735: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-03-12 19:58:00.626825: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-03-12 19:58:00.626867: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-03-12 19:58:00.626906: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-03-12 19:58:00.626943: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-03-12 19:58:00.626981: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-03-12 19:58:00.627018: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-03-12 19:58:00.630744: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-03-12 19:58:00.630799: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-03-12 19:58:01.020046: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:41:00.0
2021-03-12 19:58:01.020159: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-03-12 19:58:01.020306: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-03-12 19:58:01.020345: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-03-12 19:58:01.020381: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-03-12 19:58:01.020414: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-03-12 19:58:01.020447: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-03-12 19:58:01.020480: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-03-12 19:58:01.024365: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-03-12 19:58:01.024439: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-03-12 19:58:01.063165: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-03-12 19:58:01.063231: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2021-03-12 19:58:01.063242: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2021-03-12 19:58:01.070051: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22514 MB memory) → physical GPU (device: 0, name: TITAN RTX, pci bus id: 0000:0a:00.0, compute capability: 7.5)
2021-03-12 19:58:01.425943: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-03-12 19:58:01.426022: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2021-03-12 19:58:01.426030: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2021-03-12 19:58:01.429791: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21801 MB memory) → physical GPU (device: 0, name: TITAN RTX, pci bus id: 0000:41:00.0, compute capability: 7.5)
Parsing Inputs…
[MaskRCNN] INFO : [Training Compute Statistics] 372.7 GFLOPS/image
Using TensorFlow backend.
4 ops no flops stats due to incomplete shapes.
Traceback (most recent call last):
File “/usr/local/bin/tlt-train-g1”, line 8, in
sys.exit(main())
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py”, line 58, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py”, line 187, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py”, line 90, in run_executer
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/distributed_executer.py”, line 393, in train_and_eval
File “/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 1161, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 1195, in _train_model_default
saving_listeners)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py”, line 1490, in _train_with_estimator_spec
log_step_count_steps=log_step_count_steps) as mon_sess:
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 584, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 1014, in init
stop_grace_period_secs=stop_grace_period_secs)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py”, line 713, in init
h.begin()
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/hooks/pretrained_restore_hook.py”, line 209, in begin
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/hooks/pretrained_restore_hook.py”, line 113, in assign_from_checkpoint
ValueError: Total size of new array must be unchanged for box_head/class-predict/kernel lh_shape: [(1024, 1)], rh_shape: [(1024, 2)]
[MaskRCNN] ERROR : Job finished with an uncaught exception: FAILURE
2021-03-12 19:58:08.250839: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:41:00.0
2021-03-12 19:58:08.250934: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-03-12 19:58:08.251019: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-03-12 19:58:08.251035: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-03-12 19:58:08.251049: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-03-12 19:58:08.251061: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-03-12 19:58:08.251072: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-03-12 19:58:08.251085: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-03-12 19:58:08.252041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-03-12 19:58:08.252090: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-03-12 19:58:08.252097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2021-03-12 19:58:08.252102: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2021-03-12 19:58:08.253112: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device
(/job:localhost/replica:0/task:0/device:GPU:0 with 21801 MB memory) → physical GPU (device: 0, name: TITAN RTX, pci bus id: 0000:41:00.0, compute capability: 7.5)
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[3548,1],0]
Exit code: 1