SSD: works well on CPU but fails on GPU

Using a Jetson TX2 with JetPack 4.3, Python 3.5.2, TensorFlow 1.9.0

Hello,
when I run SSD inference on a 300×300 picture on the CPU with this config:
config = tf.ConfigProto(device_count={"GPU": 0})
it takes 2.89 s and gives me the result.
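For reference, the relevant part of my test_SSD.py looks roughly like this (a sketch only; the input/output tensor names and the input shape below are placeholders, not the exact names in my graph):

# Rough sketch of the CPU-only run in test_SSD.py.
# Tensor names and the input shape are placeholders for my actual model.
import numpy as np
import tensorflow as tf

config = tf.ConfigProto(device_count={"GPU": 0})   # hide the GPU, force CPU execution

with tf.Session(config=config) as sess:
    saver = tf.train.import_meta_graph('./ssd_ckpt/ssd300.ckpt.meta')
    saver.restore(sess, './ssd_ckpt/ssd300.ckpt')  # assumed checkpoint prefix
    graph = tf.get_default_graph()
    ssd_input = graph.get_tensor_by_name('image_input:0')           # placeholder name
    ssd_output = graph.get_tensor_by_name('decoded_predictions:0')  # placeholder name
    # stand-in for the real preprocessed 300x300 image
    image_resized = np.zeros((1, 300, 300, 3), dtype=np.float32)
    y_pred = sess.run([ssd_output], feed_dict={ssd_input: image_resized})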

But when I run the inference on the GPU, the following error occurs:

2019-05-17 10:46:40.312189: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.54GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-05-17 10:46:41.100993: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.14GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-05-17 10:46:41.158767: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.37GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-05-17 10:46:41.343459: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.18GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-05-17 10:46:41.429324: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.06GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-05-17 10:46:41.508449: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-05-17 10:46:41.650280: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at where_op.cc:286 : Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true / nonzero indices.  temp_storage_bytes: 767, status: too many resources requested for launch
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1322, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true / nonzero indices.  temp_storage_bytes: 767, status: too many resources requested for launch
	 [[Node: decoded_predictions/loop_over_batch/while/loop_over_classes/while/boolean_mask/Where = Where[T=DT_BOOL, _device="/job:localhost/replica:0/task:0/device:GPU:0"](decoded_predictions/loop_over_batch/while/loop_over_classes/while/boolean_mask/Reshape_1)]]
	 [[Node: decoded_predictions/loop_over_batch/while/loop_over_classes/while/cond/strided_slice/_209 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1016_...ided_slice", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopdecoded_predictions/loop_over_batch/while/loop_over_classes/while/TensorArrayReadV3/_158)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test_SSD.py", line 41, in <module>
    y_pred = sess.run([ssd_output], feed_dict = {ssd_input: image_resized})
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true / nonzero indices.  temp_storage_bytes: 767, status: too many resources requested for launch
	 [[Node: decoded_predictions/loop_over_batch/while/loop_over_classes/while/boolean_mask/Where = Where[T=DT_BOOL, _device="/job:localhost/replica:0/task:0/device:GPU:0"](decoded_predictions/loop_over_batch/while/loop_over_classes/while/boolean_mask/Reshape_1)]]
	 [[Node: decoded_predictions/loop_over_batch/while/loop_over_classes/while/cond/strided_slice/_209 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1016_...ided_slice", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopdecoded_predictions/loop_over_batch/while/loop_over_classes/while/TensorArrayReadV3/_158)]]

Caused by op 'decoded_predictions/loop_over_batch/while/loop_over_classes/while/boolean_mask/Where', defined at:
  File "test_SSD.py", line 9, in <module>
    saver = tf.train.import_meta_graph('./ssd_ckpt/ssd300.ckpt.meta')
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1960, in import_meta_graph
    **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/meta_graph.py", line 744, in import_scoped_meta_graph
    producer_op_list=producer_op_list)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 432, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/importer.py", line 442, in import_graph_def
    _ProcessNewOps(graph)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/importer.py", line 234, in _ProcessNewOps
    for new_op in graph._add_new_tf_operations(compute_devices=False):  # pylint: disable=protected-access
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3563, in _add_new_tf_operations
    for c_op in c_api_util.new_tf_operations(self)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3563, in <listcomp>
    for c_op in c_api_util.new_tf_operations(self)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3450, in _create_op_from_tf_operation
    ret = Operation(c_op, self)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1740, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InternalError (see above for traceback): WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true / nonzero indices.  temp_storage_bytes: 767, status: too many resources requested for launch
	 [[Node: decoded_predictions/loop_over_batch/while/loop_over_classes/while/boolean_mask/Where = Where[T=DT_BOOL, _device="/job:localhost/replica:0/task:0/device:GPU:0"](decoded_predictions/loop_over_batch/while/loop_over_classes/while/boolean_mask/Reshape_1)]]
	 [[Node: decoded_predictions/loop_over_batch/while/loop_over_classes/while/cond/strided_slice/_209 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1016_...ided_slice", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopdecoded_predictions/loop_over_batch/while/loop_over_classes/while/TensorArrayReadV3/_158)]]

Why does this happen?

Hi,

Could you check your JetPack version again? We haven't released JetPack 4.3 yet.

A possible issue is that your TensorFlow wasn't built with GPU support.
It's recommended to reflash your system with JetPack 4.2 and install our official TensorFlow package:
https://devtalk.nvidia.com/default/topic/1038957/jetson-tx2/tensorflow-for-jetson-tx2-/
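
To quickly check whether your current TensorFlow build can actually see the GPU, something like this should work (a minimal sketch):

# Minimal check that TensorFlow was built with CUDA and can see the GPU.
import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.test.is_built_with_cuda())   # True if the wheel was built with CUDA support
print(tf.test.is_gpu_available())     # True if a GPU device can be initialized
print([d.name for d in device_lib.list_local_devices()])  # should include /device:GPU:0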

Thanks.

Sorry, I made a mistake; the version is JetPack 3.3.

I think my TensorFlow has GPU support, because I installed it from the official package.
pip list shows this:
Package Version
tensorflow-gpu 1.9.0+nv18.8

Would it help if I reflash my system with a newer JetPack version? I'm just afraid it would introduce other issues.

Thanks

Also, my system runs the official tf_trt_models examples fine, so I think GPU support is OK.

Hello

Hello, are you still there?

Hi,

Sorry for the late update.
I double-checked your issue today. It looks like you are running out of memory:

2019-05-17 10:46:41.650280: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at where_op.cc:286 : Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true / nonzero indices.  temp_storage_bytes: 767, status: too many resources requested for launch

So a possible fix is to add some swap space or to use a less complex model.
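
If you want to keep the model on the GPU, another knob worth trying (not a guaranteed fix) is to stop TensorFlow from pre-allocating most of the GPU memory, for example:

# Sketch: let the TensorFlow session allocate GPU memory on demand
# instead of pre-allocating, and optionally cap the fraction it may use.
import tensorflow as tf

gpu_options = tf.GPUOptions(allow_growth=True,
                            per_process_gpu_memory_fraction=0.5)  # tune for your board
config = tf.ConfigProto(gpu_options=gpu_options)

with tf.Session(config=config) as sess:
    pass  # run the SSD inference here as before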

Thanks.

Thank you very much!

I have another problem that confuses me.

I notice that whenever a model runs in GPU mode, it consumes a lot of memory on the Jetson.
For example, this SSD model works well on the CPU but hits an out-of-memory error on the GPU.

Is it common for GPU mode to consume more memory than CPU mode?
Or am I just making some mistake when I load my model?

Thanks.

Hi,

It is known that TensorFlow duplicates the model in GPU mode.
So, in general, it takes twice as much memory as CPU mode, or more.

It’s recommended to use TensorRT instead.
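For example, a frozen copy of your SSD graph can usually be optimized with the TF-TRT converter that ships in our TensorFlow package, roughly like this (the file name and output node name below are placeholders for your model):

# Rough sketch of optimizing a frozen TensorFlow graph with TF-TRT.
# 'ssd300_frozen.pb' and the output node name are placeholders for your model.
import tensorflow as tf
import tensorflow.contrib.tensorrt as trt

with tf.gfile.GFile('ssd300_frozen.pb', 'rb') as f:
    frozen_graph = tf.GraphDef()
    frozen_graph.ParseFromString(f.read())

trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=['decoded_predictions'],      # placeholder output node name
    max_batch_size=1,
    max_workspace_size_bytes=1 << 25,
    precision_mode='FP16')                # FP16 suits the TX2 well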
Thanks.