TensorFlow Memory Error

Hello, I seem to be hitting an error when running TensorFlow-based models. From general Google searches it looks like a GPU memory issue, but none of the fixes that worked for other architectures have helped here.

I am running the TX2 with the latest JetPack (3.1):
#define CUDNN_MAJOR 6
#define CUDNN_MINOR 0
#define CUDNN_PATCHLEVEL 21
Cuda compilation tools, release 8.0, V8.0.72
bazel-0.5.1

import tensorflow as tf
tf.__version__
'1.2.1'

and this is the error:

2017-08-09 11:02:04.230621: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:879] could not open file to read NUMA node: /sys/bus/pci/devices/0000:00:00.0/numa_node
Your kernel may have been built without NUMA support.
2017-08-09 11:02:04.230736: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties: 
name: NVIDIA Tegra X2
major: 6 minor: 2 memoryClockRate (GHz) 1.3005
pciBusID 0000:00:00.0
Total memory: 7.67GiB
Free memory: 3.24GiB
2017-08-09 11:02:04.230790: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0 
2017-08-09 11:02:04.230815: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   Y 
2017-08-09 11:02:04.230856: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0)
2017-08-09 11:02:04.230891: I tensorflow/core/common_runtime/gpu/gpu_device.cc:642] Could not identify NUMA node of /job:localhost/replica:0/task:0/gpu:0, defaulting to 0.  Your kernel may not have been built with NUMA support.
2017-08-09 11:02:05.303591: E tensorflow/stream_executor/cuda/cuda_dnn.cc:359] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2017-08-09 11:02:05.303663: E tensorflow/stream_executor/cuda/cuda_dnn.cc:326] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
2017-08-09 11:02:05.303694: F tensorflow/core/kernels/conv_ops.cc:671] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms) 
Aborted (core dumped)

I have tried previous versions of JetPack as well as TensorFlow with the same error, and have tried completely separate TensorFlow models.

Hi,

Please try enabling the following option:

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)

Thanks.

Nope, no dice unfortunately, with much the same error.

I also previously tried:

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4
...

With various values, both high and low, and received the same error.
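For reference, that fraction is applied against the total memory TensorFlow reports (7.67 GiB in the log above), so the values tried map to rough absolute caps; a quick sketch, with the total hard-coded from the log:

```python
# Rough absolute cap implied by per_process_gpu_memory_fraction,
# using the 7.67 GiB "Total memory" from the TensorFlow log above.
TOTAL_GIB = 7.67

def cap_gib(fraction):
    """GPU memory TensorFlow would limit itself to, in GiB."""
    return round(TOTAL_GIB * fraction, 2)

for f in (0.2, 0.4, 0.8):
    print(f, "->", cap_gib(f), "GiB")
# 0.4 -> 3.07 GiB, close to the 3.24 GiB reported free in the first log
```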

Hi,

Two things we want to confirm first:

1. Did you follow the steps below to build TensorFlow on your TX2?

2. Does this error also occur with MNIST?

We want to confirm first whether this is a resource issue (memory, …) or a framework issue (build architecture, …).
Thanks.

  1. For the first build I followed that, but using the JetsonHacks scripts from GitHub (basically those instructions); for TF 1.2, however, I followed Andrey1984’s instructions in this post: https://devtalk.nvidia.com/default/topic/1016294/tensorflow-1-2-0-gpu-on-tx2/

and the build prompts on this https://gist.github.com/csarron/a265280010faeecae3e8c204c5749a67

  2. No, I haven't seen this error on MNIST or other examples. (Simple tests, such as the one at the bottom of the link you posted, run correctly with no errors.)

Hi,

NUMA is for multi-GPU setups.
It looks like your model wants to enable multi-GPU options.
But NUMA is turned off (by default) when building, and the TX2 only has one GPU.

Could you try disabling the related option in your source and check again?

I rebuilt TensorFlow following the instructions you provided, disabling the NUMA node. It didn't work; I even tried with the original and the additional config options.

python detect.py h786poj.jpg weights.npz out3.jpg
[[ 0.40392157  0.44313725  0.4627451  ...,  0.17254902  0.17647059
   0.18431373]
 [ 0.42352941  0.45882353  0.47058824 ...,  0.17647059  0.18431373
   0.19215686]
 [ 0.44705882  0.47058824  0.47843137 ...,  0.18431373  0.19215686
   0.20392157]
 ..., 
 [ 0.58431373  0.58431373  0.59215686 ...,  0.54117647  0.5254902
   0.50588235]
 [ 0.58823529  0.57647059  0.56470588 ...,  0.52941176  0.51764706
   0.49411765]
 [ 0.59215686  0.56862745  0.54117647 ...,  0.5254902   0.51372549
   0.49411765]]
2017-08-14 12:41:15.428733: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:856] ARM has no NUMA node, hardcoding to return zero
2017-08-14 12:41:15.428852: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties: 
name: NVIDIA Tegra X2
major: 6 minor: 2 memoryClockRate (GHz) 1.3005
pciBusID 0000:00:00.0
Total memory: 7.67GiB
Free memory: 4.59GiB
2017-08-14 12:41:15.428947: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0 
2017-08-14 12:41:15.428998: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   Y 
2017-08-14 12:41:15.429024: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0)
2017-08-14 12:41:15.554341: I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 1 visible devices
2017-08-14 12:41:15.554406: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 4 visible devices
2017-08-14 12:41:15.555134: I tensorflow/compiler/xla/service/service.cc:198] XLA service 0x30c2d40 executing computations on platform Host. Devices:
2017-08-14 12:41:15.555184: I tensorflow/compiler/xla/service/service.cc:206]   StreamExecutor device (0): <undefined>, <undefined>
2017-08-14 12:41:15.555849: I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 1 visible devices
2017-08-14 12:41:15.555894: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 4 visible devices
2017-08-14 12:41:15.556610: I tensorflow/compiler/xla/service/service.cc:198] XLA service 0x3112ff0 executing computations on platform CUDA. Devices:
2017-08-14 12:41:15.556652: I tensorflow/compiler/xla/service/service.cc:206]   StreamExecutor device (0): NVIDIA Tegra X2, Compute Capability 6.2
2017-08-14 12:41:18.518643: E tensorflow/stream_executor/cuda/cuda_dnn.cc:359] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2017-08-14 12:41:18.518718: E tensorflow/stream_executor/cuda/cuda_dnn.cc:326] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
2017-08-14 12:41:18.518749: F tensorflow/core/kernels/conv_ops.cc:671] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms) 
Aborted (core dumped)

Hi,

Thanks for testing. It looks more complicated than a build issue.
Could you tell me how to reproduce it? Are you using public GitHub code, or could you share your source?

Thanks.

Hey,

Yeah, sure. It's using this as standard:

https://github.com/matthewearl/deep-anpr

Happy to share the weights if needed. Training works correctly; the issue only appears when running detect.py.

Hi,

Yes, please share the weights with us.
Thanks.

Hi,

If you have a TensorFlow x86 environment, could you also give it a try there?
Thanks.

Here are the weights on Google Drive:

https://drive.google.com/file/d/0B3TAQ6gwtBNmZVVDTnAySHB3eU0/view?usp=sharing

Thanks.

We will try to reproduce this issue, and update more information to you later.

Hi,

Good news!
I can run the deep_anpr sample with this whl (JetPack 3.1):

sudo pip install tensorflow-1.3.0rc0-cp27-cp27mu-linux_aarch64.whl
sudo reboot
cd [deep_anpr folder]
./detect.py in.jpeg weights.npz out.jpg

Thanks.

Hmm, it hasn't worked on this Jetson; there is a slightly different error in the last line:

2017-08-21 10:07:38.406485: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:879] could not open file to read NUMA node: /sys/bus/pci/devices/0000:00:00.0/numa_node
Your kernel may have been built without NUMA support.
2017-08-21 10:07:38.406598: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: 
name: NVIDIA Tegra X2
major: 6 minor: 2 memoryClockRate (GHz) 1.3005
pciBusID 0000:00:00.0
Total memory: 7.67GiB
Free memory: 5.67GiB
2017-08-21 10:07:38.406654: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 
2017-08-21 10:07:38.406683: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y 
2017-08-21 10:07:38.406708: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0)
2017-08-21 10:07:38.406740: I tensorflow/core/common_runtime/gpu/gpu_device.cc:657] Could not identify NUMA node of /job:localhost/replica:0/task:0/gpu:0, defaulting to 0.  Your kernel may not have been built with NUMA support.
2017-08-21 10:07:39.100041: E tensorflow/stream_executor/cuda/cuda_dnn.cc:371] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2017-08-21 10:07:39.100116: E tensorflow/stream_executor/cuda/cuda_dnn.cc:338] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
2017-08-21 10:07:39.100146: F tensorflow/core/kernels/conv_ops.cc:672] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms) 
Aborted (core dumped)

Weirdly enough, our other Jetson works following the same install instructions and versions. I think our next step will be to reflash the non-working Jetson and reinstall from scratch to make sure nothing was done differently.

Hi,

The link in comment #14 is built with cuDNNv6. Please flash device with JetPack3.1.
Thanks.
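One way to double-check which cuDNN a device actually has is to parse the version macros from /usr/include/cudnn.h (the same defines quoted at the top of the thread). A small sketch, run here against a pasted sample rather than the real header:

```python
import re

def cudnn_version(header_text):
    """Parse (major, minor, patch) from the contents of cudnn.h."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        m = re.search(r"#define\s+" + name + r"\s+(\d+)", header_text)
        parts.append(int(m.group(1)) if m else None)
    return tuple(parts)

# Sample copied from the top of the thread; on the device you would
# read /usr/include/cudnn.h instead.
sample = """\
#define CUDNN_MAJOR 6
#define CUDNN_MINOR 0
#define CUDNN_PATCHLEVEL 21
"""
print(cudnn_version(sample))  # -> (6, 0, 21)
```

If this prints a major version other than 6, the wheel and the installed cuDNN don't match.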

Reflashed the Jetson and installed TensorFlow with that .whl, and it worked!

Thanks for the help.

Hi AastaLLL,

I have the same issue with https://github.com/igul222/improved_wgan_training
when running gan_mnist.py:

python gan_mnist.py

I got this…

nvidia@tegra-ubuntu:~/improved_wgan_training$ python gan_mnist.py 
Uppercase local vars:
	BATCH_SIZE: 50
	CRITIC_ITERS: 5
	DIM: 64
	ITERS: 200000
	LAMBDA: 10
	MODE: wgan-gp
	OUTPUT_DIM: 784
2017-10-17 05:31:07.423028: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:857] ARM64 does not support NUMA - returning NUMA node zero
2017-10-17 05:31:07.423174: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: 
name: NVIDIA Tegra X2
major: 6 minor: 2 memoryClockRate (GHz) 1.3005
pciBusID 0000:00:00.0
Total memory: 7.67GiB
Free memory: 2.72GiB
2017-10-17 05:31:07.423231: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 
2017-10-17 05:31:07.423259: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y 
2017-10-17 05:31:07.423286: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0)
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py:175: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use `tf.global_variables_initializer` instead.
2017-10-17 05:31:12.753831: E tensorflow/stream_executor/cuda/cuda_dnn.cc:371] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2017-10-17 05:31:12.753904: E tensorflow/stream_executor/cuda/cuda_dnn.cc:338] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
2017-10-17 05:31:12.753941: F tensorflow/core/kernels/conv_ops.cc:672] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms) 
Aborted (core dumped)

I tried #14, but it isn't working because I already have the same version installed:

nvidia@tegra-ubuntu:~/tensorflow-tx2$ sudo pip install tensorflow-1.3.0-cp27-cp27mu-linux_aarch64.whl
The directory '/home/nvidia/.cache/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/home/nvidia/.cache/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Requirement already satisfied: tensorflow==1.3.0 from file:///home/nvidia/tensorflow-tx2/tensorflow-1.3.0-cp27-cp27mu-linux_aarch64.whl in /usr/local/lib/python2.7/dist-packages
Requirement already satisfied: protobuf>=3.3.0 in /usr/local/lib/python2.7/dist-packages (from tensorflow==1.3.0)
Requirement already satisfied: six>=1.10.0 in /usr/local/lib/python2.7/dist-packages (from tensorflow==1.3.0)
Requirement already satisfied: tensorflow-tensorboard<0.2.0,>=0.1.0 in /usr/local/lib/python2.7/dist-packages (from tensorflow==1.3.0)
Requirement already satisfied: wheel in /usr/lib/python2.7/dist-packages (from tensorflow==1.3.0)
Requirement already satisfied: backports.weakref>=1.0rc1 in /usr/local/lib/python2.7/dist-packages (from tensorflow==1.3.0)
Requirement already satisfied: numpy>=1.11.0 in /usr/lib/python2.7/dist-packages (from tensorflow==1.3.0)
Requirement already satisfied: mock>=2.0.0 in /usr/local/lib/python2.7/dist-packages (from tensorflow==1.3.0)
Requirement already satisfied: setuptools in /usr/local/lib/python2.7/dist-packages (from protobuf>=3.3.0->tensorflow==1.3.0)
Requirement already satisfied: werkzeug>=0.11.10 in /usr/local/lib/python2.7/dist-packages (from tensorflow-tensorboard<0.2.0,>=0.1.0->tensorflow==1.3.0)
Requirement already satisfied: html5lib==0.9999999 in /usr/local/lib/python2.7/dist-packages (from tensorflow-tensorboard<0.2.0,>=0.1.0->tensorflow==1.3.0)
Requirement already satisfied: markdown>=2.6.8 in /usr/local/lib/python2.7/dist-packages (from tensorflow-tensorboard<0.2.0,>=0.1.0->tensorflow==1.3.0)
Requirement already satisfied: bleach==1.5.0 in /usr/local/lib/python2.7/dist-packages (from tensorflow-tensorboard<0.2.0,>=0.1.0->tensorflow==1.3.0)
Requirement already satisfied: pbr>=0.11 in /usr/local/lib/python2.7/dist-packages (from mock>=2.0.0->tensorflow==1.3.0)
Requirement already satisfied: funcsigs>=1; python_version < "3.3" in /usr/local/lib/python2.7/dist-packages (from mock>=2.0.0->tensorflow==1.3.0)

Hi,

Could you run this command and share the result with us?

ll /home/nvidia/.cache/pip/

Please remember to flash the TX2 with JetPack 3.1; this wheel file is built against the JetPack 3.1 package.

Thanks.