General Question about the Jetson's GPU/CPU Shared Memory Usage

Hi All,

There is something I don't understand about the Jetson's memory usage:

The TX2 has 8 GB of memory shared between GPU and CPU, but how is this memory (dynamically) divided and addressed?

For example, a TensorFlow model that takes around ~2.5 GB runs fine on a 4 GB GPU, but on the Jetson it throws memory errors when starting inference, even though TensorFlow reports that ~5 GB of memory is still free.

What is the reason for this behavior? Is it possible to reserve more of the 8 GB for the GPU and reduce the CPU's share, or something like that?

Is there a maximum model size the Jetson can handle? Something like 500 MB? 1 GB? 2 GB?

Or are there additional dependencies that I am not aware of?

I would be really thankful if someone with deeper knowledge of the Jetson's memory usage could shed some light on this.

Thanks!
Gustav

Hi,

We don't limit Jetson memory to any particular process.
The CPU and GPU can both allocate it as long as the resource is available.

The reason TensorFlow crashes is that TF tries to allocate a huge amount of memory at once.
https://devtalk.nvidia.com/default/topic/1029742/jetson-tx2/tensorflow-1-6-not-working-with-jetpack-3-2/post/5242249/#5242249

This leads to an error on a shared-memory system. Here is a workaround for your reference.

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand instead of all at once
session = tf.Session(config=config)

Thanks.

Hi @AastaLLL,

I always configure my TF session with allow_growth set to True; this does not solve the issue.

Are there any other reasons or solutions you can think of?

Hi,

Could you share the error log with us?

Hi AastaLLL,

I trained a Mask R-CNN model with a MobileNet V1 backbone. I am able to run it without the GPU on the Jetson TX2, with CUDA_VISIBLE_DEVICES set to '-1'.
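For reference, this is how the CPU-only run is forced; note the variable has to be set before TensorFlow is imported (or exported in the shell beforehand):

```python
import os

# Hide all CUDA devices so TensorFlow falls back to the CPU.
# This must happen before `import tensorflow`.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
```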

My session config looks like this:

config = tf.ConfigProto(allow_soft_placement=True)
config.gpu_options.allow_growth = True

But on the GPU it crashes with the following error log:

2018-05-28 09:31:39.483435: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

2018-05-28 09:31:39.643266: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.07GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

2018-05-28 09:31:39.807578: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

2018-05-28 09:31:39.847436: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.14GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

2018-05-28 09:31:40.529444: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.91GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

2018-05-28 09:31:42.243801: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.32GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

> FPS: 0.0

2018-05-28 09:31:44.285335: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at where_op.cc:331 : Internal: WhereOp: Could not launch cub::DeviceSelect::Flagged to copy indices out, status: too many resources requested for launch

Traceback (most recent call last):
  File "run_objectdetection.py", line 204, in <module>
    detection(model)
  File "run_objectdetection.py", line 140, in detection
    output_dict = sess.run(tensor_dict, feed_dict={image_tensor: vs.expanded()})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: WhereOp: Could not launch cub::DeviceSelect::Flagged to copy indices out, status: too many resources requested for launch
     [[Node: ClipToWindow/Where = Where[T=DT_BOOL, _device="/job:localhost/replica:0/task:0/device:GPU:0"](ClipToWindow/Greater)]]
     [[Node: BatchMultiClassNonMaxSuppression_1/map/while/Identity/_159 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1167_BatchMultiClassNonMaxSuppression_1/map/while/Identity", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopBatchMultiClassNonMaxSuppression_1/map/while/TensorArrayReadV3_4/_31)]]

Caused by op u'ClipToWindow/Where', defined at:
  File "run_objectdetection.py", line 203, in <module>
    SPLIT_MODEL, SSD_SHAPE).prepare_od_model()
  File "/home/nvidia/realtime_object_detection/stuff/helper.py", line 177, in prepare_od_model
    self.load_frozenmodel()
  File "/home/nvidia/realtime_object_detection/stuff/helper.py", line 157, in load_frozenmodel
    tf.import_graph_def(od_graph_def, name='')
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 432, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/importer.py", line 513, in import_graph_def
    _ProcessNewOps(graph)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/importer.py", line 303, in _ProcessNewOps
    for new_op in graph._add_new_tf_operations(compute_devices=False):  # pylint: disable=protected-access
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3540, in _add_new_tf_operations
    for c_op in c_api_util.new_tf_operations(self)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3428, in _create_op_from_tf_operation
    ret = Operation(c_op, self)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access
InternalError (see above for traceback): WhereOp: Could not launch cub::DeviceSelect::Flagged to copy indices out, status: too many resources requested for launch
     [[Node: ClipToWindow/Where = Where[T=DT_BOOL, _device="/job:localhost/replica:0/task:0/device:GPU:0"](ClipToWindow/Greater)]]
     [[Node: BatchMultiClassNonMaxSuppression_1/map/while/Identity/_159 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1167_BatchMultiClassNonMaxSuppression_1/map/while/Identity", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopBatchMultiClassNonMaxSuppression_1/map/while/TensorArrayReadV3_4/_31)]]

This I don't understand, as the memory is shared between the GPU and CPU. Can the CPU allocate more memory than the GPU? Is it possible to adjust or change this behavior?

Hi,

We are not sure which allocation function call TensorFlow uses.

For malloc and cudaMalloc, almost all of the available physical memory (~7.x GB) can be allocated.
There is no extra limitation on the Jetson.
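If you want to check this on your own board, one way is a small probe that calls cudaMalloc with growing sizes until it fails. This is a rough sketch via ctypes; the library name "libcudart.so" and the probe granularity are assumptions, and the function returns None when no CUDA runtime is present:

```python
import ctypes

def max_cuda_malloc_mb(step_mb=256, limit_mb=8192):
    """Probe the largest single cudaMalloc that succeeds, in MB.

    Tries increasingly large allocations, freeing each one, until
    cudaMalloc returns an error. Returns None when no CUDA runtime
    library can be loaded.
    """
    try:
        cudart = ctypes.CDLL("libcudart.so")
    except OSError:
        return None  # no CUDA runtime on this system
    ptr = ctypes.c_void_p()
    best = 0
    size = step_mb
    while size <= limit_mb:
        status = cudart.cudaMalloc(ctypes.byref(ptr),
                                   ctypes.c_size_t(size * 1024 * 1024))
        if status != 0:  # cudaSuccess == 0
            break
        cudart.cudaFree(ptr)  # release before trying a bigger chunk
        best = size
        size += step_mb
    return best
```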

Thanks.

Gustav,

Are you getting "arm-smmu 12000000.iommu: Unhandled context fault: iova" errors thrown by the kernel when memory allocation fails? We cannot allocate more than approx. 4GB of GPU memory for CUDA on the TX2, and I think there might be kernel bug(s) preventing 64-bit address translation or mapping from working correctly in the NVIDIA L4T 28.1 kernel. Can you try running MemtestG80 requesting more than 4GB on your TX2?

Please see here for more details: https://devtalk.nvidia.com/default/topic/1002486/jetson-tx2/iommu-unhandled-context-fault-on-pci-device-dma/2

If someone from Nvidia can take a look at this problem, it would be much appreciated!
It can be easily reproduced by just running MemtestG80 on TX2.

-albertr

Hi, albertr

Do you see this error with TensorFlow?
If not, could you file a new topic describing your issue?

Thanks.

@AastaLLL, can you think of any reason why a model would load with CUDA_VISIBLE_DEVICES set to '-1' but crash with normal device visibility?

My hardware and kernel knowledge is far too limited to understand this. It would be great if you could help solve this!

Hi AastaLLL,

I mentioned it only because you said the following:

“For malloc and cudaMalloc, it can allocate almost all the available physical memory (~7.x G).”

In my testing, cudaMalloc cannot allocate more than 4GB on the TX2, and I was wondering if Gustav might have run into the same error. This can be checked by looking in the kernel log for "arm-smmu 12000000.iommu: Unhandled context fault: iova" errors. The steps to reproduce this are listed in the following post: https://devtalk.nvidia.com/default/topic/1002486/jetson-tx2/iommu-unhandled-context-fault-on-pci-device-dma/post/5263542/#5263542

-albertr

Hi, albertr

We did have some limitations on memory allocation in the past.
However, that limitation was removed as of rel-28, which is included in JetPack 3.1.

You can find more detail in this topic:
https://devtalk.nvidia.com/default/topic/1013464/jetson-tx2/gpu-out-of-memory-when-the-total-ram-usage-is-2-8g/

Thanks.

Hi, gustavvz

Setting CUDA_VISIBLE_DEVICES=-1 means no GPU is available, so the model will be created on the CPU instead.

From this information, the error occurs when the model is created on the GPU.
We are not sure whether there is any swap memory in your environment.
Please remember that swap space can only be used by the CPU.
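To double-check whether any swap is configured at all, you can read SwapTotal from /proc/meminfo. A minimal sketch, assuming a Linux system:

```python
def swap_total_kb(meminfo="/proc/meminfo"):
    """Return SwapTotal from /proc/meminfo in kB (0 means no swap).

    Returns None on systems without /proc/meminfo (non-Linux).
    """
    try:
        with open(meminfo) as f:
            for line in f:
                if line.startswith("SwapTotal:"):
                    # format: "SwapTotal:    2097148 kB"
                    return int(line.split()[1])
    except FileNotFoundError:
        return None
    return None
```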

Thanks.

Hi @AastaLLL, no, I don't have any swap in my environment.
The CPU and GPU share only the 8GB of memory.

Are you sure there is no limitation on GPU memory allocation, in any form?

Could it be due to the CUDA version used?

For example, the tensorflow/models/research/deeplab models only run on the Jetson with CUDA 8 but fail with CUDA 9.
Any ideas on that?

Thank you!

AastaLLL, thanks! The forum thread you pointed me to is really helpful. We ran your test code posted in that thread and confirmed that it can allocate 7.7GB non-contiguous, but only around 4GB contiguous. I thought you mentioned that this limitation was already removed in the L4T 28.1 kernel / JetPack 3.1 release? Or is the 4GB limit still present in JetPack 3.1? Is it a requirement to have swap enabled to get past the 4GB limit? Can you please clarify?

-albertr

I am not able to load a model that is smaller than 2GB in total into GPU memory.
How is that possible, if 4GB contiguous should be doable?

This may not all apply (it is under the TK1), but should be of interest (this can probably be adapted for your case):
https://devtalk.nvidia.com/default/topic/770634/jetson-tk1/large-coherent-dma-blocks/

What it comes down to is that some physical devices (in this case a GPU) need contiguous physical memory. Addresses translated by a memory manager won’t do the trick for some cases. If you were to allocate a large chunk of memory before the system is up and running, then you can probably reserve a larger amount for your GPU (I have not set this up on the TX2, I couldn’t tell you what applies or not for your case).

If you have swap, then other processes can use swap and give up some of their physical RAM. It won't matter, though, if memory has been fragmented. Allocating on the kernel command line (the "APPEND" key/value pair in "/boot/extlinux/extlinux.conf") can take advantage of unfragmented physical memory before any other programs run.
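For illustration only, such an APPEND entry might look like the following; the cma= value is a hypothetical example of reserving a larger contiguous-memory (CMA) pool at boot, so check the L4T documentation for which parameters your release actually supports:

```
LABEL primary
      MENU LABEL primary kernel
      LINUX /boot/Image
      APPEND ${cbootargs} root=/dev/mmcblk0p1 rw rootwait cma=512M
```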

Hi,

Sorry, we are not familiar with TensorFlow's detailed implementation.
But are you using a wheel compiled for CUDA 9.0?
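A quick way to sanity-check the combination is to print the installed CUDA toolkit version next to whether the TensorFlow wheel was built with CUDA at all. This is a sketch; the /usr/local/cuda/version.txt path is the usual CUDA 8/9 layout, and both helpers return None when the component is missing:

```python
import os

def installed_cuda_version(version_file="/usr/local/cuda/version.txt"):
    """Read the CUDA toolkit version string, e.g. 'CUDA Version 9.0.252'.

    Returns None when no CUDA toolkit is installed at the default path.
    """
    if not os.path.exists(version_file):
        return None
    with open(version_file) as f:
        return f.read().strip()

def tf_built_with_cuda():
    """Report whether the installed TensorFlow wheel was built with CUDA.

    Returns None when TensorFlow is not importable in this environment.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return tf.test.is_built_with_cuda()

if __name__ == "__main__":
    print("CUDA toolkit:", installed_cuda_version())
    print("TF built with CUDA:", tf_built_with_cuda())
```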

Thanks.

Hi,

If you have set the allow_growth configuration, TensorFlow should not allocate one big chunk and should not hit the error.

Thanks.

AastaLLL, can you clarify the 4GB limitation?

Hi, albertr

We are checking this issue internally.
We will update you with more information later.

Thanks.