I’m running the TAO Toolkit (tao-toolkit-tf:v3.21.08-py3) on an HPC cluster using Singularity. The setup was straightforward.
Everything seems to be working; e.g., nvidia-smi (inside the Singularity container)
shows the GPUs. I can also access the GPUs with a test Python script:
from __future__ import print_function
import tensorflow as tf

print(tf.__version__)

# Build a trivial graph: a string constant and a 1x2 * 2x1 matmul.
hello = tf.constant('Hello, TensorFlow!')
matrix1 = tf.constant([[3., 3.]])
matrix2 = tf.constant([[2.], [2.]])
product = tf.matmul(matrix1, matrix2)

# TF 1.x session API; creating the session initializes the CUDA context.
sess = tf.Session()
print(sess.run(hello))
print(sess.run(product))
sess.close()
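For completeness, a similar sketch using the TF 1.x device_lib API (a standard call, nothing container-specific) also enumerates the GPUs:

from tensorflow.python.client import device_lib

# List every device TensorFlow can see; GPUs appear with device_type 'GPU'.
gpus = [d.name for d in device_lib.list_local_devices() if d.device_type == 'GPU']
print(gpus)  # e.g. ['/device:GPU:0']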
But when I run my TAO command, mask_rcnn train -e ./specs.txt -k key_value -d /model_output/, I get the following error message:
Using TensorFlow backend.
--------------------------------------------------------------------------
A process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.
The process that invoked fork was:
Local host: [[0,1],0] (PID 103639)
If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:117: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:143: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
[MaskRCNN] INFO : Loading pretrained model...
WARNING:tensorflow:OMP_NUM_THREADS is no longer used by the default Keras config. To configure the number of threads, use tf.config.threading APIs.
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 222, in <module>
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 218, in main
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 85, in run_executer
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 360, in train_and_eval
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 273, in get_training_hooks
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 219, in load_pretrained_model
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/utils/model_loader.py", line 49, in load_keras_model
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 366, in load_tf_keras_model
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/saving/save.py", line 143, in load_model
return hdf5_format.load_model_from_hdf5(filepath, custom_objects, compile)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/saving/hdf5_format.py", line 165, in load_model_from_hdf5
load_weights_from_hdf5_group(f['model_weights'], model.layers)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/saving/hdf5_format.py", line 693, in load_weights_from_hdf5_group
K.batch_set_value(weight_value_tuples)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/backend.py", line 3259, in batch_set_value
get_session().run(assign_ops, feed_dict=feed_dict)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/backend.py", line 483, in get_session
session = _get_session(op_input_list)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/backend.py", line 455, in _get_session
config=get_default_session_config())
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1585, in __init__
super(Session, self).__init__(target, graph, config=config)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 699, in __init__
self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: all CUDA-capable devices are busy or unavailable
[MaskRCNN] ERROR : Job finished with an uncaught exception: `FAILURE`
What could be the problem?
System Info:
- Python 3.6.9 (default, Jan 26 2021, 15:33:00) [GCC 8.4.0] on Linux
- NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0
- singularity version 3.8.0-1.el7
- The cluster environment sets the GPUs to EXCLUSIVE_PROCESS compute mode
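For reference, this is a small sketch of how I read the compute mode from inside the container (the subprocess wrapper is my own; the nvidia-smi query fields are standard):

import subprocess

# Query the per-GPU compute mode; the cluster policy reports Exclusive_Process.
out = subprocess.check_output(
    ['nvidia-smi', '--query-gpu=index,compute_mode', '--format=csv,noheader']
).decode()
print(out)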