CUDA error when creating more than one session using TensorFlow

Hello Nvidia forums!

I have encountered this strange behavior with a new Jetson TX2 running the latest JetPack and the following configuration:

  • cuda-toolkit
  • tensorrt

Whenever I create more than one session in TensorFlow (I have tested with my own fresh builds in the range r1.5 - r1.8 and the wheels provided by the NVIDIA team), TensorFlow reports that it failed to create the session: [reported here](https://github.com/tensorflow/tensorflow/issues/19482)

Is there any tool or command-line utility that I could use to investigate on my own? (dmesg reports nothing.)

Thanks

Hi,

TensorFlow also requires the cuDNN package.
Please install the cuDNN library from the JetPack installer.

Thanks.

cuDNN is already installed; otherwise it could not create the first session…

Hi,

Sorry for the missing information.

Not sure if this issue is caused by an incorrect setting when building.
Could you test a public wheel file to check whether the issue also occurs?

For example,
https://devtalk.nvidia.com/default/topic/1031300
https://github.com/peterlee0127/tensorflow-nvJetson

Thanks.

Hi,

In my JetPack 3.2 / Python 3.6.3 environment, the TensorFlow multi-session problem was triggered by apt-get dist-upgrade.
I downgraded the packages and that solved it.

Downgrade:

apt-get install python3-update-manager=1:16.04.3 update-manager=1:16.04.3 update-manager-core=1:16.04.3 update-notifier-common=3.168

Only for my Python 3.6.3 (because add-apt-repository doesn’t work with Python 3.6):

head -n1 /usr/bin/add-apt-repository
sed -i 's/^#! \/usr\/bin\/python3$/#! \/usr\/bin\/python3\.5/g' /usr/bin/add-apt-repository
head -n1 /usr/bin/add-apt-repository
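The sed one-liner above can be sanity-checked on a throwaway file before touching the real script. This sketch rewrites the same shebang on a temp file (the temp file is just an illustration; the real target is /usr/bin/add-apt-repository):

```shell
# Create a scratch file with the same shebang as add-apt-repository.
tmp=$(mktemp)
printf '#! /usr/bin/python3\nprint("ok")\n' > "$tmp"
head -n1 "$tmp"
# Rewrite the shebang to pin python3.5, exactly as done above.
sed -i 's/^#! \/usr\/bin\/python3$/#! \/usr\/bin\/python3\.5/' "$tmp"
head -n1 "$tmp"    # now reads: #! /usr/bin/python3.5
rm -f "$tmp"
```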

I found the installable package versions with the following command.

apt-cache policy python3-update-manager update-manager update-manager-core update-notifier-common

My error log:

ubuntu@tegra-ubuntu:~/notebooks/github/realtime_object_detection$ python object_detection.py 
Model found. Proceed.
Loading frozen model into memory
2018-06-08 10:36:18.409650: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:865] ARM64 does not support NUMA - returning NUMA node zero
2018-06-08 10:36:18.409813: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 0 with properties: 
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.66GiB freeMemory: 5.93GiB
2018-06-08 10:36:18.409865: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2018-06-08 10:36:21.072831: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5050 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Loading label map
Building Graph
2018-06-08 10:36:59.790357: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2018-06-08 10:36:59.790501: E tensorflow/core/common_runtime/direct_session.cc:167] Internal: CUDA runtime implicit initialization on GPU:0 failed. Status: unknown error
Traceback (most recent call last):
  File "object_detection.py", line 302, in <module>
    main()
  File "object_detection.py", line 298, in main
    detection(graph, category, score, expand)
  File "object_detection.py", line 181, in detection
    with tf.Session(graph=detection_graph,config=config) as sess:
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1522, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 638, in __init__
    self._session = tf_session.TF_NewDeprecatedSession(opts, status)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/apport_python_hook.py", line 63, in apport_excepthook
    from apport.fileutils import likely_packaged, get_recent_crashes
  File "/usr/lib/python3/dist-packages/apport/__init__.py", line 5, in <module>
    from apport.report import Report
  File "/usr/lib/python3/dist-packages/apport/report.py", line 30, in <module>
    import apport.fileutils
  File "/usr/lib/python3/dist-packages/apport/fileutils.py", line 23, in <module>
    from apport.packaging_impl import impl as packaging
  File "/usr/lib/python3/dist-packages/apport/packaging_impl.py", line 23, in <module>
    import apt
  File "/usr/lib/python3/dist-packages/apt/__init__.py", line 23, in <module>
    import apt_pkg
ModuleNotFoundError: No module named 'apt_pkg'

Original exception was:
Traceback (most recent call last):
  File "object_detection.py", line 302, in <module>
    main()
  File "object_detection.py", line 298, in main
    detection(graph, category, score, expand)
  File "object_detection.py", line 181, in detection
    with tf.Session(graph=detection_graph,config=config) as sess:
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1522, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 638, in __init__
    self._session = tf_session.TF_NewDeprecatedSession(opts, status)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

I wrote above that downgrading the packages solved it, but that seems to be incorrect.

NG:

import tensorflow as tf

with tf.Session():
    pass
with tf.Session():
    pass

OK:

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config):
    pass
with tf.Session(config=config):
    pass

It seems that setting “gpu_options.allow_growth = True” on the first session is what matters.

Hi,

You may be hitting the same issue as these topics:
https://devtalk.nvidia.com/default/topic/1029742/jetson-tx2/tensorflow-1-6-not-working-with-jetpack-3-2/post/5241765/#5241765
https://devtalk.nvidia.com/default/topic/1029742/jetson-tx2/tensorflow-1-6-not-working-with-jetpack-3-2/post/5242249/#5242249

Thanks.

Hi,

Yes, it is the same problem as those topics.
And maybe this one as well:
https://devtalk.nvidia.com/default/topic/1036339/jetson-tx2/tensorflow-fails-to-create-a-session-and-issue-with-docker/

I looked at the source code of TensorFlow r1.8.0 and found these comments.

tensorflow/python/client/session.py

1541      If no `graph` argument is specified when constructing the session,
1542      the default graph will be launched in the session. If you are
1543      using more than one graph (created with `tf.Graph()` in the same
1544      process, you will have to use different sessions for each graph,
1545      but each graph can be used in multiple sessions. In this case, it
1546      is often clearer to pass the graph to be launched explicitly to
1547      the session constructor.

tensorflow/core/common_runtime/session_factory.cc

88      // NOTE(mrry): This implementation assumes that the domains (in
89      // terms of acceptable SessionOptions) of the registered
90      // SessionFactory implementations do not overlap. This is fine for
91      // now, but we may need an additional way of distinguishing
92      // different runtimes (such as an additional session option) if
93      // the number of sessions grows.
94      // TODO(mrry): Consider providing a system-default fallback option
95      // in this case.

Multiple sessions created in the same thread use only the first session’s options.
In my understanding, the correct code is:

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config):
    pass
with tf.Session():
    pass

In the issue https://github.com/tensorflow/tensorflow/issues/19482 from #1,
there are two tf.Session() calls in the main thread: the first is in the load_frozenmodel() function and the second is in the detection() function.

And the error occurred in the detection() function.

As a solution, the config needs to be added to the tf.Session() call in the load_frozenmodel() function.
The code is:

def load_frozenmodel():
    ...
    input_graph = tf.Graph()
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = allow_memory_growth
    with tf.Session(graph=input_graph, config=config):

By the way, because a config is required separately for each tf.Session() created in another thread,

gpu_worker = SessionWorker("GPU",detection_graph,config)
cpu_worker = SessionWorker("CPU",detection_graph,config)

please keep this part.
I thought the options applied per process, so I deleted it, but then I got the error again.

Thanks for the update. : )