Does anyone know how to solve a ResourceExhaustedError on the TX2?

(?, 40, 68, 192)
Tensor("Softmax:0", shape=(?, 2), dtype=float32)
Traceback (most recent call last):
File "./test.py", line 167, in <module>
sess.run(tf.global_variables_initializer(), options=run_options)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[522240,192] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: truncated_normal/TruncatedNormal = TruncatedNormal[T=DT_INT32, dtype=DT_FLOAT, seed=0, seed2=0, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]]

Current usage from device: /job:localhost/replica:0/task:0/device:GPU:0, allocator: GPU_0_bfc
864.0KiB from random_normal_2/RandomStandardNormal
288.0KiB from random_normal_1/RandomStandardNormal
Remaining 1 nodes with 6.2KiB

Caused by op u'truncated_normal/TruncatedNormal', defined at:
File "./test.py", line 136, in <module>
w4 = tf.Variable(tf.truncated_normal([fc1_unit, fc2_unit]))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/random_ops.py", line 174, in truncated_normal
shape_tensor, dtype, seed=seed1, seed2=seed2)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_random_ops.py", line 850, in truncated_normal
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[522240,192] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: truncated_normal/TruncatedNormal = TruncatedNormal[T=DT_INT32, dtype=DT_FLOAT, seed=0, seed2=0, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]]

Current usage from device: /job:localhost/replica:0/task:0/device:GPU:0, allocator: GPU_0_bfc
864.0KiB from random_normal_2/RandomStandardNormal
288.0KiB from random_normal_1/RandomStandardNormal
Remaining 1 nodes with 6.2KiB

I got these messages after running machine-learning test code on the TX2.
- I have 6GB of my 28GB of storage left, and about 3GB of memory free.
- I tested the TensorFlow code with only one image.
Is there any way I can solve this problem?
On my desktop computer (Ubuntu 16.04, with the same Python and TensorFlow versions as the Jetson TX2), the code runs without any error.
I think there must be some solutions, such as:

  1. Similar to nvidia-smi's exclusive compute mode on a desktop (which is not supported on the TX2), there may be a way to manage how the GPU is used.
  2. I could erase some files, since I installed many libraries I no longer use now that I have settled on Python and TensorFlow as my framework, but I am not sure which files are safe to remove.

If anyone has had a similar problem and knows one of the solutions above, or any other solution, please help me. Thank you.
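For scale, the failing allocation in the traceback is quite large on its own: the conv output shape (?, 40, 68, 192) flattens to 40 × 68 × 192 = 522240 inputs feeding a 192-unit layer, which matches the reported shape[522240,192]. A quick back-of-the-envelope check in Python:

```python
# The OOM error reports a float32 tensor of shape [522240, 192].
# 522240 is exactly the flattened conv output: 40 * 68 * 192.
fc1_unit = 40 * 68 * 192                # 522240
fc2_unit = 192
bytes_needed = fc1_unit * fc2_unit * 4  # float32 = 4 bytes per element

print(fc1_unit)                         # 522240
print(round(bytes_needed / 2**20, 1))   # 382.5 -> ~382.5 MiB for w4 alone
```

So that single fully-connected weight matrix needs roughly 382 MiB of contiguous GPU memory, which a fragmented or mostly-used shared-memory device may not be able to provide even when total free memory looks sufficient.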

Hi,

First of all, ResourceExhaustedError indicates running out of memory.
To get more information, could you monitor device memory with tegrastats at the same time?

sudo ./tegrastats

Suggestions for your questions:
1. Could you share the type of your desktop GPU?
The TX2 has 8GB of memory (although it is shared between the CPU and GPU) and is comparable to many GPUs in the gaming segment.

2. If you don't load those libraries into memory, they won't take up any memory.

3. Could you check which batch size you are using?
If you only run inference on ONE image at a time, it's recommended to set the batch size to 1 to save memory.
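On a shared-memory device like the TX2 it can also help to keep TensorFlow's BFC allocator from reserving most of the device memory up front. A minimal configuration sketch for the TF 1.x API that your traceback shows (the memory fraction is only an illustrative value, not a recommendation):

```python
import tensorflow as tf

config = tf.ConfigProto()
# Allocate GPU memory on demand instead of grabbing it all at session creation.
config.gpu_options.allow_growth = True
# Alternatively, cap the fraction of device memory TensorFlow may use
# (0.4 is an arbitrary illustrative value).
config.gpu_options.per_process_gpu_memory_fraction = 0.4

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
```

Since the TX2's CPU and GPU share the same physical RAM, capping the TensorFlow allocation also leaves room for the rest of the system.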

Thanks.

My desktop GPU is a GTX 1080 with 8GB, so I don't know what the difference is.
I'm using 3GB/8GB of memory on the TX2.
Below are the results of sudo ~/tegrastats:
RAM 3105/7846MB (lfb 12x4MB) CPU [0%@1230,off,off,0%@1218,0%@1224,0%@1229] EMC_FREQ 6%@665 GR3D_FREQ 11%@140 APE 150 BCPU@41.5C MCPU@41.5C GPU@40.5C PLL@41.5C Tboard@37C Tdiode@38.5C PMIC@100C thermal@41.1C VDD_IN 2111/2111 VDD_CPU 230/230 VDD_GPU 153/153 VDD_SOC 460/460 VDD_WIFI 0/0 VDD_DDR 422/422
RAM 3095/7846MB (lfb 12x4MB) CPU [24%@345,off,off,9%@345,9%@346,7%@345] EMC_FREQ 21%@204 GR3D_FREQ 0%@140 APE 150 BCPU@41.5C MCPU@41.5C GPU@40C PLL@41.5C Tboard@37C Tdiode@38.5C PMIC@100C thermal@40.9C VDD_IN 1996/2053 VDD_CPU 230/230 VDD_GPU 153/153 VDD_SOC 460/460 VDD_WIFI 0/0 VDD_DDR 364/393
RAM 3093/7846MB (lfb 12x4MB) CPU [21%@345,off,off,15%@345,13%@345,11%@345] EMC_FREQ 21%@204 GR3D_FREQ 0%@140 APE 150 BCPU@41.5C MCPU@41.5C GPU@40C PLL@41.5C Tboard@37C Tdiode@38.5C PMIC@100C thermal@40.9C VDD_IN 1995/2034 VDD_CPU 230/230 VDD_GPU 153/153 VDD_SOC 460/460 VDD_WIFI 0/0 VDD_DDR 364/383
RAM 3093/7846MB (lfb 12x4MB) CPU [16%@345,off,off,22%@345,28%@345,21%@345] EMC_FREQ 21%@204 GR3D_FREQ 3%@140 APE 150 BCPU@41.5C MCPU@41.5C GPU@40.5C PLL@41.5C Tboard@37C Tdiode@38.5C PMIC@100C thermal@40.9C VDD_IN 2034/2034 VDD_CPU 230/230 VDD_GPU 153/153 VDD_SOC 460/460 VDD_WIFI 0/0 VDD_DDR 384/383
RAM 3093/7846MB (lfb 12x4MB) CPU [27%@345,off,off,20%@345,21%@345,20%@345] EMC_FREQ 21%@204 GR3D_FREQ 1%@140 APE 150 BCPU@41.5C MCPU@41.5C GPU@40C PLL@41.5C Tboard@37C Tdiode@38.5C PMIC@100C thermal@40.9C VDD_IN 2072/2041 VDD_CPU 230/230 VDD_GPU 153/153 VDD_SOC 460/460 VDD_WIFI 0/0 VDD_DDR 403/387
RAM 3093/7846MB (lfb 12x4MB) CPU [18%@345,off,off,8%@345,10%@345,12%@345] EMC_FREQ 21%@204 GR3D_FREQ 0%@140 APE 150 BCPU@41.5C MCPU@41.5C GPU@40C PLL@41.5C Tboard@37C Tdiode@38.5C PMIC@100C thermal@40.9C VDD_IN 1995/2033 VDD_CPU 230/230 VDD_GPU 153/153 VDD_SOC 460/460 VDD_WIFI 0/0 VDD_DDR 345/380
RAM 3093/7846MB (lfb 12x4MB) CPU [19%@345,off,off,9%@346,6%@345,13%@345] EMC_FREQ 21%@204 GR3D_FREQ 0%@140 APE 150 BCPU@41.5C MCPU@41.5C GPU@40C PLL@41.5C Tboard@37C Tdiode@38.5C PMIC@100C thermal@40.9C VDD_IN 1958/2023 VDD_CPU 230/230 VDD_GPU 153/153 VDD_SOC 460/460 VDD_WIFI 0/0 VDD_DDR 345/375
RAM 3093/7846MB (lfb 12x4MB) CPU [10%@345,off,off,16%@345,12%@345,4%@345] EMC_FREQ 16%@204 GR3D_FREQ 0%@140 APE 150 BCPU@41.5C MCPU@41.5C GPU@40C PLL@41.5C Tboard@37C Tdiode@38.5C PMIC@100C thermal@40.9C VDD_IN 1958/2014 VDD_CPU 230/230 VDD_GPU 153/153 VDD_SOC 460/460 VDD_WIFI 0/0 VDD_DDR 345/371

Hi,

Could you share the log of TensorFlow session creation with us?
For example:

name: NVIDIA Tegra X2
major: 6 minor: 2 memoryClockRate (GHz) 1.3005
pciBusID 0000:00:00.0
Total memory: 7.67GiB
Free memory: 5.30GiB

2017-07-26 17:21:02.457343: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0
2017-07-26 17:21:02.457374: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y

Also, could you share where your TensorFlow package is from?
Did you build it from source or download it from a public website?
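In case it helps capture that log: in TF 1.x the GPU discovery lines (device name, total/free memory, DMA matrix) are printed to stderr when the first session is created, and placement logging can be enabled explicitly. A small diagnostic sketch, assuming the TF 1.x API from the traceback above:

```python
import tensorflow as tf

# Creating the first Session emits the GPU discovery log to stderr;
# log_device_placement additionally reports the device each op runs on.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

# Enumerate the devices TensorFlow can actually see (CPU and, if detected, GPU).
for dev in sess.list_devices():
    print(dev.name)
```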

Thanks.