Tao-toolkit on Cluster with GPU in EXCLUSIVE_PROCESS: CUDA runtime implicit initialization on GPU:0 failed

I’m running the TAO Toolkit (tao-toolkit-tf:v3.21.08-py3) on an HPC cluster using Singularity. The setup was straightforward.

Everything seems to be working, e.g. nvidia-smi (run inside the Singularity container) shows the GPUs. I can also access the GPUs from a test Python script:

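from __future__ import print_function
import tensorflow as tf

# Print the TensorFlow version shipped in the container.
vers = tf.__version__
print(vers)

# Build a tiny graph: a string constant and a 1x2 by 2x1 matrix product.
hello = tf.constant('Hello, TensorFlow!')
matrix1 = tf.constant([[3., 3.]])
matrix2 = tf.constant([[2.],[2.]])
product = tf.matmul(matrix1, matrix2)

# Creating a TF1-style session initializes the visible GPUs.
sess = tf.Session()
print(sess.run(hello))
print(sess.run(product))
sess.close()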

But when I run my TAO command, mask_rcnn train -e ./specs.txt -k key_value -d /model_output/, I get the following error message:

Using TensorFlow backend.
--------------------------------------------------------------------------
A process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:          [[0,1],0] (PID 103639)

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:117: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:143: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

[MaskRCNN] INFO    : Loading pretrained model...
WARNING:tensorflow:OMP_NUM_THREADS is no longer used by the default Keras config. To configure the number of threads, use tf.config.threading APIs.
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 222, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 218, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/train.py", line 85, in run_executer
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 360, in train_and_eval
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 273, in get_training_hooks
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/executer/distributed_executer.py", line 219, in load_pretrained_model
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/utils/model_loader.py", line 49, in load_keras_model
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 366, in load_tf_keras_model
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/saving/save.py", line 143, in load_model
    return hdf5_format.load_model_from_hdf5(filepath, custom_objects, compile)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/saving/hdf5_format.py", line 165, in load_model_from_hdf5
    load_weights_from_hdf5_group(f['model_weights'], model.layers)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/saving/hdf5_format.py", line 693, in load_weights_from_hdf5_group
    K.batch_set_value(weight_value_tuples)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/backend.py", line 3259, in batch_set_value
    get_session().run(assign_ops, feed_dict=feed_dict)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/backend.py", line 483, in get_session
    session = _get_session(op_input_list)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/backend.py", line 455, in _get_session
    config=get_default_session_config())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1585, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 699, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: all CUDA-capable devices are busy or unavailable

[MaskRCNN] ERROR   : Job finished with an uncaught exception: `FAILURE`

What could be the problem?

System Info:

  • Python 3.6.9 (default, Jan 26 2021, 15:33:00) [GCC 8.4.0] on Linux
  • NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0
  • singularity version 3.8.0-1.el7
  • The cluster environment sets the GPUs to EXCLUSIVE_PROCESS compute mode
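
For anyone who wants to confirm the compute mode from inside the container, here is a minimal sketch. It assumes nvidia-smi is on the PATH and accepts --query-gpu=index,compute_mode; the helper names are my own and not part of TAO, and the sample string only illustrates the expected CSV shape.

```python
import subprocess


def parse_compute_modes(csv_text):
    """Map GPU index -> compute mode from 'index, compute_mode' CSV lines."""
    modes = {}
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, mode = [field.strip() for field in line.split(",", 1)]
        modes[int(index)] = mode
    return modes


def query_compute_modes():
    """Run nvidia-smi (assumed to be on PATH) and return {gpu_index: mode}."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,compute_mode", "--format=csv,noheader"],
        universal_newlines=True,  # text mode; works on Python 3.6
    )
    return parse_compute_modes(out)


# Illustrative input only, not captured from the cluster in this thread.
sample = "0, Exclusive_Process\n1, Exclusive_Process\n"
print(parse_compute_modes(sample))
```

On a real node, call query_compute_modes() instead of parsing the sample string.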

Can you show the output of nvidia-smi?

I am now able to run the container, but only if I pass the --gpu_index=1 flag.
I still cannot access GPU:0. Maybe it’s a problem with some environment variables? Or is it related to the known issue with mask_rcnn (Release Notes — TAO Toolkit 3.22.05 documentation)?

It seems that your machine has two identical 2080 GPUs. It does not make sense that one works and the other does not. How about running with the --gpu_index=0 flag? Other users have not hit this error, and multi-GPU training does work.

No, there is no such known issue. You can try other networks to narrow it down. Or, as mentioned above, I am afraid there is something wrong with your first device.

I have found the cause of the problem. On the cluster, the GPUs run in EXCLUSIVE_PROCESS mode, so each GPU accepts only one process. But when I start TAO with mask_rcnn train -e /workspace/stemHarvest/specs.txt -k key_val -d /workspace/stemHarvest/model_output/ --gpus 3, four processes are actually started (see the nvidia-smi output below). The additional /usr/bin/python3.6 process occupies a GPU of its own, which causes the exception.

Workaround: add the flag --gpu_index 0 2 3 with the indices of the GPUs to train on. Note: I excluded GPU 1 from the list because that GPU runs the /usr/bin/python3.6 process.
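
The index list for the workaround can also be built from nvidia-smi output rather than by hand. The sketch below assumes nvidia-smi is on the PATH and accepts --query-gpu=index,uuid and --query-compute-apps=gpu_uuid,pid; the helper names and sample UUIDs are made up for illustration. It only reports GPUs that hold a process at the moment of the query, so it helps once the extra process is already running, not before the launch.

```python
import subprocess


def _nvidia_smi(query_flag):
    """Return nvidia-smi CSV output for one query flag (assumes nvidia-smi on PATH)."""
    return subprocess.check_output(
        ["nvidia-smi", query_flag, "--format=csv,noheader"],
        universal_newlines=True,
    )


def free_gpu_indices(gpu_csv, apps_csv):
    """Return sorted indices of GPUs with no compute process attached.

    gpu_csv:  lines of 'index, uuid'    (from --query-gpu=index,uuid)
    apps_csv: lines of 'gpu_uuid, pid'  (from --query-compute-apps=gpu_uuid,pid)
    """
    busy = {line.split(",")[0].strip() for line in apps_csv.splitlines() if line.strip()}
    free = []
    for line in gpu_csv.splitlines():
        if not line.strip():
            continue
        index, uuid = [field.strip() for field in line.split(",", 1)]
        if uuid not in busy:
            free.append(int(index))
    return sorted(free)


def gpu_flags(indices):
    """Build the '--gpus N --gpu_index i j k' arguments used in this thread."""
    return ["--gpus", str(len(indices)), "--gpu_index"] + [str(i) for i in indices]


# Illustrative input only: four GPUs, with GPU 1 already holding a process.
# Real use: free_gpu_indices(_nvidia_smi("--query-gpu=index,uuid"),
#                            _nvidia_smi("--query-compute-apps=gpu_uuid,pid"))
sample_gpus = "0, GPU-aaa\n1, GPU-bbb\n2, GPU-ccc\n3, GPU-ddd\n"
sample_apps = "GPU-bbb, 3036\n"
print(" ".join(gpu_flags(free_gpu_indices(sample_gpus, sample_apps))))
```

With the sample input this prints the same flags as the workaround, --gpus 3 --gpu_index 0 2 3.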

Questions:

  • What is the purpose of this /usr/bin/python3.6 process? Can we run it on the CPU to free the GPU?
  • Is there a way to change the GPU used by the mask_rcnn export command? I hit the same issue there and have found no workaround yet.

Fri Oct 29 10:43:29 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:04:00.0 Off |                  N/A |
| 35%   55C    P2    45W / 180W |   7921MiB /  8119MiB |     92%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 00000000:06:00.0 Off |                  N/A |
| 27%   28C    P8     6W / 180W |      0MiB /  8119MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    Off  | 00000000:07:00.0 Off |                  N/A |
| 28%   30C    P8     6W / 180W |      0MiB /  8119MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 1080    Off  | 00000000:08:00.0 Off |                  N/A |
| 27%   29C    P8     5W / 180W |      0MiB /  8119MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX 1080    Off  | 00000000:0C:00.0 Off |                  N/A |
| 27%   27C    P8     5W / 180W |    105MiB /  8119MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX 1080    Off  | 00000000:0D:00.0 Off |                  N/A |
| 40%   62C    P2    52W / 180W |   7945MiB /  8119MiB |     65%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  GeForce GTX 1080    Off  | 00000000:0E:00.0 Off |                  N/A |
| 35%   53C    P2    40W / 180W |   7945MiB /  8119MiB |     10%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  GeForce GTX 1080    Off  | 00000000:0F:00.0 Off |                  N/A |
| 27%   28C    P8     6W / 180W |      2MiB /  8119MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3080      C   python3.6                        7903MiB |
|    4   N/A  N/A      3036      C   /usr/bin/python3.6                103MiB |
|    5   N/A  N/A      3081      C   python3.6                        7927MiB |
|    6   N/A  N/A      3082      C   python3.6                        7927MiB |
+-----------------------------------------------------------------------------+

Note:

I think it’s indeed related to this known issue (from Release Notes — TAO Toolkit 3.22.05 documentation):

When using MaskRCNN, please make sure GPU 0 is free.

In my case, that’s not true. I need GPU 1 to be free…

Thanks for the info, I will check further and try to reproduce.

Export can run on any GPU if you have N GPUs, but it is a single-GPU process.

Actually, that process belongs to the training started by “mask_rcnn train xxx”. You can run “$ ps -aux” to confirm. In your case, you can allow more than one process per GPU.
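
As an alternative to scanning the ps output by eye, the parent and command line of a PID reported by nvidia-smi can be read from /proc. This is a sketch that assumes a Linux /proc filesystem and that the PID is visible in the container's PID namespace; describe_process is a hypothetical helper, not a TAO function.

```python
import os


def describe_process(pid):
    """Return (parent_pid, command_line) for a PID by reading Linux /proc."""
    ppid = None
    with open("/proc/{}/status".format(pid)) as status_file:
        for line in status_file:
            if line.startswith("PPid:"):
                ppid = int(line.split()[1])
                break
    with open("/proc/{}/cmdline".format(pid), "rb") as cmd_file:
        # Arguments are NUL-separated in /proc/<pid>/cmdline.
        parts = cmd_file.read().split(b"\0")
    command = " ".join(p.decode("utf-8", "replace") for p in parts if p)
    return ppid, command


# Demonstrated on the current interpreter; in practice pass a PID from nvidia-smi.
parent, command = describe_process(os.getpid())
print(parent, command)
```

If the parent of the extra /usr/bin/python3.6 process turns out to be the mask_rcnn train launcher, that matches the explanation above.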
