Failed to initialize cuDNN when running TensorFlow: CUDNN_STATUS_INTERNAL_ERROR

Hi,

I am developing an object detection application using TensorFlow on the Jetson Xavier. During development, the application ran fine on the devkit, and I had no trouble installing and setting up the environment. However, after moving to the production device, which is a third-party Jetson, I get this error with the same settings:

tensorflow/stream_executor/cuda/cuda_dnn.cc:330] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

This is strange because this error usually seems to be related to running out of memory. I tried setting allow_growth, but it did not resolve the issue. While monitoring resources, usage never exceeded 20% before the error was raised. None of the other threads I found gave me a resolution, or I am missing something.
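As a rough way to double-check the memory reading, the sketch below parses /proc/meminfo-style text and reports the fraction in use. It assumes the 20% figure refers to memory and relies on the fact that Jetson modules share one physical memory pool between CPU and GPU; the function names are made up for illustration.

```python
import re


def parse_meminfo(text):
    """Return {field: kB} parsed from /proc/meminfo-style text."""
    values = {}
    for line in text.splitlines():
        match = re.match(r"^(\w+):\s+(\d+)\s*kB", line)
        if match:
            values[match.group(1)] = int(match.group(2))
    return values


def used_fraction(meminfo):
    """Fraction of system memory in use, based on MemTotal and MemAvailable."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return (total - available) / total


def read_used_fraction(path="/proc/meminfo"):
    """Read the live value on a Linux system."""
    with open(path) as handle:
        return used_fraction(parse_meminfo(handle.read()))
```

Calling read_used_fraction() just before sess.run() gives a value to compare against whatever the monitoring tool reported.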

Settings:

  • JetPack: 4.3 (L4T 32.3.1) (slightly modified by the provider)
  • Platform: AGX Xavier 16GB
  • CUDA: 10.0.326
  • cuDNN: 7.6.3.28
  • Python: 3.6.9
  • TensorFlow: tested with all available versions for JetPack 4.3 (1.15, 2.0, 2.1)

Test script:

import cv2
import numpy as np
import os
import six.moves.urllib as urllib
import sys
import tarfile
import tensorflow as tf
import zipfile

from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

from collections import defaultdict
from io import StringIO
#from matplotlib import pyplot as plt
from PIL import Image
from object_detection.utils import label_map_util
from utils import visualization_utils as vis_util

PATH_TO_CKPT = '/home/nvidia/Desktop/ComputerVision/prototype/frozen_inference_graph_mobilessd.pb'
PATH_TO_LABELS = '/home/nvidia/Desktop/ComputerVision/prototype/labelmap.pbtxt'
NUM_CLASSES = 1
#os.environ['CUDA_VISIBLE_DEVICES'] = '0'
# Ask TensorFlow to grow GPU memory on demand instead of reserving it up front
physical_devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)
# Load the frozen detection graph
detection_graph = tf.Graph()
with detection_graph.as_default():
    od_graph_def = tf.compat.v1.GraphDef()
    with tf.compat.v1.gfile.GFile(PATH_TO_CKPT, 'rb') as fid:
        serialized_graph = fid.read()
        od_graph_def.ParseFromString(serialized_graph)
        tf.import_graph_def(od_graph_def, name='')

label_map = label_map_util.load_labelmap(PATH_TO_LABELS)
categories = label_map_util.convert_label_map_to_categories(
    label_map, max_num_classes=NUM_CLASSES, use_display_name=True)
category_index = label_map_util.create_category_index(categories)

def load_image_into_numpy_array(image):
    (im_width, im_height) = image.size
    return np.array(image.getdata()).reshape(
        (im_height, im_width, 3)).astype(np.uint8)

cap = cv2.VideoCapture('/dev/video0')
print(1)
gpu_options = tf.compat.v1.GPUOptions(allow_growth=True)
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
config.log_device_placement = True
config.gpu_options.per_process_gpu_memory_fraction = 0.75
with detection_graph.as_default():
    with tf.compat.v1.Session(config=config, graph=detection_graph) as sess:
        sess.as_default()
        while True:
            ret, image_np = cap.read()
            if not ret:
                # Stop if the camera did not deliver a frame
                break
            # Expand dimensions since the model expects images to have shape: [1, None, None, 3]
            image_np_expanded = np.expand_dims(image_np, axis=0)
            # Extract image tensor
            image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
            # Extract detection boxes
            boxes = detection_graph.get_tensor_by_name('detection_boxes:0')
            # Extract detection scores
            scores = detection_graph.get_tensor_by_name('detection_scores:0')
            # Extract detection classes
            classes = detection_graph.get_tensor_by_name('detection_classes:0')
            # Extract number of detections
            num_detections = detection_graph.get_tensor_by_name('num_detections:0')
            # Run the actual detection (this is the call that fails)
            (boxes, scores, classes, num_detections) = sess.run(
                [boxes, scores, classes, num_detections],
                feed_dict={image_tensor: image_np_expanded})
            cv2.imshow('object detection', cv2.resize(image_np, (800, 600)))
            if cv2.waitKey(25) & 0xFF == ord('q'):
                cv2.destroyAllWindows()
                break

This test script only opens a video feed and starts a TensorFlow session. The sess.run() call makes the script crash. Importing the modules on their own works without error.

Full output:

2020-08-25 10:05:44.565764: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0

2020-08-25 10:05:47.958092: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-08-25 10:05:47.961467: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
2020-08-25 10:05:52.198216: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-08-25 10:05:52.205445: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-08-25 10:05:52.205693: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: Xavier computeCapability: 7.2
coreClock: 1.377GHz coreCount: 8 deviceMemorySize: 31.18GiB deviceMemoryBandwidth: 82.08GiB/s
2020-08-25 10:05:52.205843: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-08-25 10:05:52.206094: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-08-25 10:05:52.209330: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-08-25 10:05:52.210735: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-08-25 10:05:52.216991: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-08-25 10:05:52.220891: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-08-25 10:05:52.221238: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-08-25 10:05:52.221729: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-08-25 10:05:52.222237: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-08-25 10:05:52.222357: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
[ WARN:0] global /home/nvidia/host/build_opencv/nv_opencv/modules/videoio/src/cap_gstreamer.cpp (1757) handleMessage OpenCV | GStreamer warning: Embedded video playback halted; module source reported: Could not read from resource.
[ WARN:0] global /home/nvidia/host/build_opencv/nv_opencv/modules/videoio/src/cap_gstreamer.cpp (886) open OpenCV | GStreamer warning: unable to start pipeline
[ WARN:0] global /home/nvidia/host/build_opencv/nv_opencv/modules/videoio/src/cap_gstreamer.cpp (480) isPipelinePlaying OpenCV | GStreamer warning: GStreamer: pipeline have not been created
1
2020-08-25 10:05:59.122872: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency
2020-08-25 10:05:59.124287: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3d0960a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-08-25 10:05:59.124381: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-08-25 10:05:59.225984: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-08-25 10:05:59.226697: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3cffc390 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-08-25 10:05:59.226815: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Xavier, Compute Capability 7.2
2020-08-25 10:05:59.227757: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-08-25 10:05:59.227980: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:00:00.0 name: Xavier computeCapability: 7.2
coreClock: 1.377GHz coreCount: 8 deviceMemorySize: 31.18GiB deviceMemoryBandwidth: 82.08GiB/s
2020-08-25 10:05:59.228149: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-08-25 10:05:59.228242: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-08-25 10:05:59.228357: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-08-25 10:05:59.228508: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-08-25 10:05:59.228648: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-08-25 10:05:59.228785: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-08-25 10:05:59.228907: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-08-25 10:05:59.229383: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-08-25 10:05:59.229829: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-08-25 10:05:59.229946: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-08-25 10:05:59.230186: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-08-25 10:06:01.668683: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-08-25 10:06:01.668798: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
2020-08-25 10:06:01.668852: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
2020-08-25 10:06:01.669660: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-08-25 10:06:01.670227: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:948] ARM64 does not support NUMA - returning NUMA node zero
2020-08-25 10:06:01.670534: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 23945 MB memory) -> physical GPU (device: 0, name: Xavier, pci bus id: 0000:00:00.0, compute capability: 7.2)
2020-08-25 10:06:06.605448: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-08-25 10:06:07.657404: E tensorflow/stream_executor/cuda/cuda_dnn.cc:330] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-08-25 10:06:08.415134: E tensorflow/stream_executor/cuda/cuda_dnn.cc:330] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1367, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1352, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1445, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node FeatureExtractor/MobilenetV2/Conv/Conv2D}}]]
[[Postprocessor/BatchMultiClassNonMaxSuppression/map/while/Exit_8/_75]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node FeatureExtractor/MobilenetV2/Conv/Conv2D}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/nvidia/Desktop/ComputerVision/drone/tools/dronetest.py", line 86, in <module>
    feed_dict={image_tensor: image_np_expanded})
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 960, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1183, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1361, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1386, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node FeatureExtractor/MobilenetV2/Conv/Conv2D (defined at home/nvidia/Desktop/ComputerVision/drone/tools/dronetest.py:34) ]]
[[Postprocessor/BatchMultiClassNonMaxSuppression/map/while/Exit_8/_75]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node FeatureExtractor/MobilenetV2/Conv/Conv2D (defined at home/nvidia/Desktop/ComputerVision/drone/tools/dronetest.py:34) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'FeatureExtractor/MobilenetV2/Conv/Conv2D':
  File "home/nvidia/Desktop/ComputerVision/drone/tools/dronetest.py", line 34, in <module>
    tf.import_graph_def(od_graph_def, name='')
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/importer.py", line 405, in import_graph_def
    producer_op_list=producer_op_list)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/importer.py", line 513, in _import_graph_def_internal
    _ProcessNewOps(graph)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/importer.py", line 243, in _ProcessNewOps
    for new_op in graph._add_new_tf_operations(compute_devices=False): # pylint: disable=protected-access
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3459, in _add_new_tf_operations
    for c_op in c_api_util.new_tf_operations(self)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3459, in <listcomp>
    for c_op in c_api_util.new_tf_operations(self)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3347, in _create_op_from_tf_operation
    ret = Operation(c_op, self)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1756, in __init__
    self._traceback = tf_stack.extract_stack()
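The exception text suggests looking for a warning printed earlier in the log. With this much informational output, a small filter can help. The sketch below keeps only the warning, error, and fatal lines. It assumes the log prefix format seen in the output above (timestamp, one-letter level, source file and line, closing bracket); the function name is made up for illustration.

```python
import re

# Matches the TensorFlow C++ log prefix seen above, e.g.
# "2020-08-25 10:06:07.657404: E tensorflow/.../cuda_dnn.cc:330] message"
LOG_LINE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): "
    r"(?P<level>[IWEF]) (?P<source>\S+)\] (?P<message>.*)$"
)


def warnings_and_errors(log_text):
    """Return (level, source, message) for every W, E, or F log line."""
    hits = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match and match.group("level") in ("W", "E", "F"):
            hits.append(
                (match.group("level"), match.group("source"), match.group("message"))
            )
    return hits
```

Applied to the output above, this leaves the bogomips warning and the two "Could not create cudnn handle" errors, which confirms that cuDNN handle creation is the first real failure.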

Much appreciated!

Hi,

Could you try the following configuration first to see if it helps?

config.gpu_options.allow_growth = True

Thanks.

Hi,
Thanks for your response. The script above already includes this configuration, and the result is the same with or without it.

Hi,

We suspect this issue comes from CUDA, since no cuDNN trace is found in the log.
Would you mind running the cuDNN sample to see if the issue can be reproduced outside of TensorFlow?

$ cd /usr/src/cudnn_samples_v7/mnistCUDNN/
$ sudo make
$ ./mnistCUDNN 
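If building the sample is inconvenient, a similar check can be sketched from Python by calling cuDNN directly through ctypes, without TensorFlow. This is only a minimal sketch: it assumes the library can be loaded as libcudnn.so.7 (the name shown in the TensorFlow log) and it only creates and destroys a handle. The function name is made up for illustration.

```python
import ctypes


def try_create_cudnn_handle(library_name="libcudnn.so.7"):
    """Try to create and destroy a cuDNN handle; return (ok, detail)."""
    try:
        cudnn = ctypes.CDLL(library_name)
    except OSError as exc:
        return False, "could not load {}: {}".format(library_name, exc)

    # Declare the C signatures of the cuDNN calls used below
    cudnn.cudnnGetErrorString.restype = ctypes.c_char_p
    cudnn.cudnnGetErrorString.argtypes = [ctypes.c_int]
    cudnn.cudnnGetVersion.restype = ctypes.c_size_t
    cudnn.cudnnCreate.argtypes = [ctypes.POINTER(ctypes.c_void_p)]
    cudnn.cudnnDestroy.argtypes = [ctypes.c_void_p]

    handle = ctypes.c_void_p()
    status = cudnn.cudnnCreate(ctypes.byref(handle))
    if status != 0:  # 0 is CUDNN_STATUS_SUCCESS
        name = cudnn.cudnnGetErrorString(status).decode()
        return False, "cudnnCreate failed: {}".format(name)

    cudnn.cudnnDestroy(handle)
    return True, "handle created, cuDNN version {}".format(cudnn.cudnnGetVersion())
```

If try_create_cudnn_handle() reports a failed cudnnCreate here as well, the problem lies in the CUDA/cuDNN installation rather than in TensorFlow.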

Thanks.

Hi,
Testing cuDNN did indeed raise an error. This did not happen with the devkit, and I have already tried reflashing the device a couple of times. This is the output of running the mnistCUDNN sample as root after deleting the cache at ~/.nv/:

cudnnGetVersion() : 7603 , CUDNN_VERSION from cudnn.h : 7603 (7.6.3)
Host compiler version : GCC 7.5.0
There are 1 CUDA capable devices on your machine :
device 0 : sms 8 Capabilities 7.2, SmClock 1377.0 Mhz, MemSize (Mb) 31927, MemClock 1377.0 Mhz, Ecc=0, boardGroupID=0
Using device 0

Testing single precision
CUDNN failure
Error: CUDNN_STATUS_INTERNAL_ERROR
mnistCUDNN.cpp:394
Aborting...
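The first line of this output shows that the runtime library and the cudnn.h header agree on the version, so a header/library mismatch is unlikely to be the cause. For cuDNN 7 the integer is encoded as major * 1000 + minor * 100 + patch. A small sketch of the decoding (the function name is made up for illustration):

```python
def decode_cudnn_version(version):
    """Split a cuDNN 7 style version integer into (major, minor, patch)."""
    major, rest = divmod(version, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


# The value reported by cudnnGetVersion() in the output above
print(decode_cudnn_version(7603))  # prints (7, 6, 3)
```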

Hi,

I manually reinstalled the cuDNN library from the packages downloaded on the host, and everything works fine now. I don't remember whether I had to do this manually with the devkit. Thanks for the help; the thread can be closed.
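For anyone comparing a working devkit against a third-party module, listing the installed cuDNN packages on both devices may help spot a difference. The sketch below queries dpkg from Python. It assumes a Debian-based L4T image where the packages match the pattern libcudnn*; that pattern and the function names are assumptions, not something confirmed in this thread.

```python
import subprocess


def parse_package_list(text):
    """Turn 'name version' lines into a {name: version} dict."""
    packages = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            packages[parts[0]] = parts[1]
    return packages


def installed_cudnn_packages(pattern="libcudnn*"):
    """Ask dpkg-query for installed packages matching the pattern."""
    try:
        result = subprocess.run(
            ["dpkg-query", "-W", "-f", "${Package} ${Version}\n", pattern],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,
        )
    except FileNotFoundError:
        # dpkg-query is not available on this system
        return {}
    return parse_package_list(result.stdout)
```

Running installed_cudnn_packages() on both devices and comparing the dictionaries shows whether the package sets or versions differ.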

Good to know this.
Thanks for the update.