SSD: works well on CPU but fails on GPU

Using a Jetson TX2 with JetPack 4.3, Python 3.5.2, TensorFlow 1.9.0

Hello,
when I run SSD inference on a 300×300 picture on the CPU with this config:
config = tf.ConfigProto(device_count={"GPU": 0})
it takes 2.89 s and gives me the result.
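For reference, the relevant part of my test_SSD.py looks roughly like this (a sketch only; the input/output tensor names and the input shape below are placeholders, not the exact names in my graph):

# Rough sketch of the CPU-only run in test_SSD.py.
# Tensor names and the input shape are placeholders for my actual model.
import numpy as np
import tensorflow as tf

config = tf.ConfigProto(device_count={"GPU": 0})   # hide the GPU, force CPU execution

with tf.Session(config=config) as sess:
    saver = tf.train.import_meta_graph('./ssd_ckpt/ssd300.ckpt.meta')
    saver.restore(sess, './ssd_ckpt/ssd300.ckpt')  # assumed checkpoint prefix
    graph = tf.get_default_graph()
    ssd_input = graph.get_tensor_by_name('image_input:0')           # placeholder name
    ssd_output = graph.get_tensor_by_name('decoded_predictions:0')  # placeholder name
    # stand-in for the real preprocessed 300x300 image
    image_resized = np.zeros((1, 300, 300, 3), dtype=np.float32)
    y_pred = sess.run([ssd_output], feed_dict={ssd_input: image_resized})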

But when I run the inference on the GPU, the following error occurs:

2019-05-17 10:46:40.312189: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.54GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-05-17 10:46:41.100993: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.14GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-05-17 10:46:41.158767: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.37GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-05-17 10:46:41.343459: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.18GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-05-17 10:46:41.429324: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.06GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-05-17 10:46:41.508449: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-05-17 10:46:41.650280: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at where_op.cc:286 : Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true / nonzero indices.  temp_storage_bytes: 767, status: too many resources requested for launch
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1322, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true / nonzero indices.  temp_storage_bytes: 767, status: too many resources requested for launch
	 [[Node: decoded_predictions/loop_over_batch/while/loop_over_classes/while/boolean_mask/Where = Where[T=DT_BOOL, _device="/job:localhost/replica:0/task:0/device:GPU:0"](decoded_predictions/loop_over_batch/while/loop_over_classes/while/boolean_mask/Reshape_1)]]
	 [[Node: decoded_predictions/loop_over_batch/while/loop_over_classes/while/cond/strided_slice/_209 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1016_...ided_slice", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopdecoded_predictions/loop_over_batch/while/loop_over_classes/while/TensorArrayReadV3/_158)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test_SSD.py", line 41, in <module>
    y_pred = sess.run([ssd_output], feed_dict = {ssd_input: image_resized})
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true / nonzero indices.  temp_storage_bytes: 767, status: too many resources requested for launch
	 [[Node: decoded_predictions/loop_over_batch/while/loop_over_classes/while/boolean_mask/Where = Where[T=DT_BOOL, _device="/job:localhost/replica:0/task:0/device:GPU:0"](decoded_predictions/loop_over_batch/while/loop_over_classes/while/boolean_mask/Reshape_1)]]
	 [[Node: decoded_predictions/loop_over_batch/while/loop_over_classes/while/cond/strided_slice/_209 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1016_...ided_slice", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopdecoded_predictions/loop_over_batch/while/loop_over_classes/while/TensorArrayReadV3/_158)]]

Caused by op 'decoded_predictions/loop_over_batch/while/loop_over_classes/while/boolean_mask/Where', defined at:
  File "test_SSD.py", line 9, in <module>
    saver = tf.train.import_meta_graph('./ssd_ckpt/ssd300.ckpt.meta')
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1960, in import_meta_graph
    **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/meta_graph.py", line 744, in import_scoped_meta_graph
    producer_op_list=producer_op_list)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 432, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/importer.py", line 442, in import_graph_def
    _ProcessNewOps(graph)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/importer.py", line 234, in _ProcessNewOps
    for new_op in graph._add_new_tf_operations(compute_devices=False):  # pylint: disable=protected-access
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3563, in _add_new_tf_operations
    for c_op in c_api_util.new_tf_operations(self)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3563, in <listcomp>
    for c_op in c_api_util.new_tf_operations(self)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3450, in _create_op_from_tf_operation
    ret = Operation(c_op, self)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1740, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InternalError (see above for traceback): WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true / nonzero indices.  temp_storage_bytes: 767, status: too many resources requested for launch
	 [[Node: decoded_predictions/loop_over_batch/while/loop_over_classes/while/boolean_mask/Where = Where[T=DT_BOOL, _device="/job:localhost/replica:0/task:0/device:GPU:0"](decoded_predictions/loop_over_batch/while/loop_over_classes/while/boolean_mask/Reshape_1)]]
	 [[Node: decoded_predictions/loop_over_batch/while/loop_over_classes/while/cond/strided_slice/_209 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1016_...ided_slice", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopdecoded_predictions/loop_over_batch/while/loop_over_classes/while/TensorArrayReadV3/_158)]]

Why does this happen?

Hi,

Could you check your JetPack version again? We haven't released JetPack 4.3 yet.

A possible issue is that your TensorFlow wasn't built with GPU support.
It's recommended to reflash your system with JetPack 4.2 and install our official TensorFlow package:
https://devtalk.nvidia.com/default/topic/1038957/jetson-tx2/tensorflow-for-jetson-tx2-/
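
To quickly check whether your current TensorFlow build can actually see the GPU, something like this should work (a minimal sketch):

# Minimal check that TensorFlow was built with CUDA and can see the GPU.
import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.test.is_built_with_cuda())   # True if the wheel was built with CUDA support
print(tf.test.is_gpu_available())     # True if a GPU device can be initialized
print([d.name for d in device_lib.list_local_devices()])  # should include /device:GPU:0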

Thanks.

Sorry, I made a mistake; the version is JetPack 3.3.

I think my TensorFlow has GPU support, because I installed it from the official package.
pip list shows this:
Package Version
tensorflow-gpu 1.9.0+nv18.8

Would it help if I reflash my system with a newer JetPack version? I'm just afraid it would introduce other issues.

Thanks

Also, my system runs the official tf_trt_models examples fine, so I think GPU support is OK.

Hello

Hello, are you still there?

Hi,

Sorry for the late update.
I double-checked your issue today. It looks like you are running out of memory:

2019-05-17 10:46:41.650280: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at where_op.cc:286 : Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true / nonzero indices.  temp_storage_bytes: 767, status: too many resources requested for launch

So a possible fix is to add some swap space or to use a less complex model.
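
If you want to keep the model on the GPU, another knob worth trying (not a guaranteed fix) is to stop TensorFlow from pre-allocating most of the GPU memory, for example:

# Sketch: let the TensorFlow session allocate GPU memory on demand
# instead of pre-allocating, and optionally cap the fraction it may use.
import tensorflow as tf

gpu_options = tf.GPUOptions(allow_growth=True,
                            per_process_gpu_memory_fraction=0.5)  # tune for your board
config = tf.ConfigProto(gpu_options=gpu_options)

with tf.Session(config=config) as sess:
    pass  # run the SSD inference here as before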

Thanks.

Thank you very much!

I have another problem that confuses me.

I notice that whenever a model runs in GPU mode, it consumes a lot of memory on the Jetson.
For example, this SSD model works well on the CPU but hits an out-of-memory error on the GPU.

Is it common for GPU mode to consume more memory than CPU mode?
Or am I just making some mistake when I load my model?

Thanks.

Hi,

It is known that TensorFlow duplicates the model in GPU mode.
So, in general, it takes twice as much memory as CPU mode, or more.

It’s recommended to use TensorRT instead.
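For example, a frozen copy of your SSD graph can usually be optimized with the TF-TRT converter that ships in our TensorFlow package, roughly like this (the file name and output node name below are placeholders for your model):

# Rough sketch of optimizing a frozen TensorFlow graph with TF-TRT.
# 'ssd300_frozen.pb' and the output node name are placeholders for your model.
import tensorflow as tf
import tensorflow.contrib.tensorrt as trt

with tf.gfile.GFile('ssd300_frozen.pb', 'rb') as f:
    frozen_graph = tf.GraphDef()
    frozen_graph.ParseFromString(f.read())

trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=['decoded_predictions'],      # placeholder output node name
    max_batch_size=1,
    max_workspace_size_bytes=1 << 25,
    precision_mode='FP16')                # FP16 suits the TX2 well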
Thanks.