General Question about the Jetson's GPU/CPU Shared Memory Usage

Hi All,

There is something I don't understand about the Jetson's memory usage:

The TX2 has 8 GB of memory shared between GPU and CPU, but how is this memory (dynamically) divided and addressed?

For example, a TensorFlow model that takes around ~2.5 GB runs fine on a 4 GB GPU, but on the Jetson it throws memory errors when starting inference, even though TensorFlow reports that ~5 GB of memory is still free.

What is the reason for this behavior? Is it possible to reserve more of the 8 GB for the GPU and reduce the CPU's share, or something like that?

Is there a maximum model size the Jetson can handle? Something like 500 MB? 1 GB? 2 GB?

Or are there additional dependencies that I am not aware of?

I would be really thankful if someone with deeper knowledge of the Jetson's memory usage could shed some light on this.

Thanks!
Gustav

Hi,

We don't limit Jetson memory to any particular process.
The CPU and GPU can both allocate it as long as the resource is available.

The reason TensorFlow crashes is that TF tries to allocate a huge amount of memory at once.
https://devtalk.nvidia.com/default/topic/1029742/jetson-tx2/tensorflow-1-6-not-working-with-jetpack-3-2/post/5242249/#5242249

This leads to an error on a shared-memory system. Here is a workaround for your reference.

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand instead of all at once
session = tf.Session(config=config)

Thanks.

Hi @AastaLLL,

I always configure my TF session with allow_growth set to True; this does not solve the issue.

Are there any other reasons or solutions you can think of?

Hi,

Could you share the error log with us?

Hi AastaLLL,

I trained a Mask R-CNN model with a MobileNet V1 backbone. I am able to run it without the GPU on the Jetson TX2, with CUDA_VISIBLE_DEVICES set to '-1'.
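For reference, this is how the CPU-only run is forced; note the variable has to be set before TensorFlow is imported (or exported in the shell beforehand):

```python
import os

# Hide all CUDA devices so TensorFlow falls back to the CPU.
# This must happen before `import tensorflow`.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
```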

My session config looks like this:

config = tf.ConfigProto(allow_soft_placement=True)
config.gpu_options.allow_growth = True

But on the GPU it crashes with the following error log:

2018-05-28 09:31:39.483435: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

2018-05-28 09:31:39.643266: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.07GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

2018-05-28 09:31:39.807578: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

2018-05-28 09:31:39.847436: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.14GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

2018-05-28 09:31:40.529444: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.91GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

2018-05-28 09:31:42.243801: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.32GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

> FPS: 0.0

2018-05-28 09:31:44.285335: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at where_op.cc:331 : Internal: WhereOp: Could not launch cub::DeviceSelect::Flagged to copy indices out, status: too many resources requested for launch

Traceback (most recent call last):
  File "run_objectdetection.py", line 204, in <module>
    detection(model)
  File "run_objectdetection.py", line 140, in detection
    output_dict = sess.run(tensor_dict, feed_dict={image_tensor: vs.expanded()})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: WhereOp: Could not launch cub::DeviceSelect::Flagged to copy indices out, status: too many resources requested for launch
     [[Node: ClipToWindow/Where = Where[T=DT_BOOL, _device="/job:localhost/replica:0/task:0/device:GPU:0"](ClipToWindow/Greater)]]
     [[Node: BatchMultiClassNonMaxSuppression_1/map/while/Identity/_159 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1167_BatchMultiClassNonMaxSuppression_1/map/while/Identity", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopBatchMultiClassNonMaxSuppression_1/map/while/TensorArrayReadV3_4/_31)]]

Caused by op u'ClipToWindow/Where', defined at:
  File "run_objectdetection.py", line 203, in <module>
    SPLIT_MODEL, SSD_SHAPE).prepare_od_model()
  File "/home/nvidia/realtime_object_detection/stuff/helper.py", line 177, in prepare_od_model
    self.load_frozenmodel()
  File "/home/nvidia/realtime_object_detection/stuff/helper.py", line 157, in load_frozenmodel
    tf.import_graph_def(od_graph_def, name='')
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 432, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/importer.py", line 513, in import_graph_def
    _ProcessNewOps(graph)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/importer.py", line 303, in _ProcessNewOps
    for new_op in graph._add_new_tf_operations(compute_devices=False):  # pylint: disable=protected-access
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3540, in _add_new_tf_operations
    for c_op in c_api_util.new_tf_operations(self)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3428, in _create_op_from_tf_operation
    ret = Operation(c_op, self)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access
InternalError (see above for traceback): WhereOp: Could not launch cub::DeviceSelect::Flagged to copy indices out, status: too many resources requested for launch
     [[Node: ClipToWindow/Where = Where[T=DT_BOOL, _device="/job:localhost/replica:0/task:0/device:GPU:0"](ClipToWindow/Greater)]]
     [[Node: BatchMultiClassNonMaxSuppression_1/map/while/Identity/_159 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1167_BatchMultiClassNonMaxSuppression_1/map/while/Identity", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopBatchMultiClassNonMaxSuppression_1/map/while/TensorArrayReadV3_4/_31)]]

This I don't understand, as the memory is shared between the GPU and CPU. Can the CPU allocate more memory than the GPU? Is it possible to adjust or change this behavior?

Hi,

We are not sure which allocation function call TensorFlow uses.

For malloc and cudaMalloc, almost all of the available physical memory (~7.x GB) can be allocated.
There is no extra limitation on the Jetson.
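If you want to check this on your own board, one way is a small probe that calls cudaMalloc with growing sizes until it fails. This is a rough sketch via ctypes; the library name "libcudart.so" and the probe granularity are assumptions, and the function returns None when no CUDA runtime is present:

```python
import ctypes

def max_cuda_malloc_mb(step_mb=256, limit_mb=8192):
    """Probe the largest single cudaMalloc that succeeds, in MB.

    Tries increasingly large allocations, freeing each one, until
    cudaMalloc returns an error. Returns None when no CUDA runtime
    library can be loaded.
    """
    try:
        cudart = ctypes.CDLL("libcudart.so")
    except OSError:
        return None  # no CUDA runtime on this system
    ptr = ctypes.c_void_p()
    best = 0
    size = step_mb
    while size <= limit_mb:
        status = cudart.cudaMalloc(ctypes.byref(ptr),
                                   ctypes.c_size_t(size * 1024 * 1024))
        if status != 0:  # cudaSuccess == 0
            break
        cudart.cudaFree(ptr)  # release before trying a bigger chunk
        best = size
        size += step_mb
    return best
```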

Thanks.

Gustav,

Are you getting "arm-smmu 12000000.iommu: Unhandled context fault: iova" errors thrown by the kernel when memory allocation fails? We cannot allocate more than approx. 4GB of GPU memory for CUDA on the TX2, and I think there might be kernel bug(s) preventing 64-bit address translation or mapping from working correctly in the NVIDIA L4T 28.1 kernel. Can you try running MemtestG80 requesting more than 4GB on your TX2?

Please see here for more details: https://devtalk.nvidia.com/default/topic/1002486/jetson-tx2/iommu-unhandled-context-fault-on-pci-device-dma/2

If someone from Nvidia can take a look at this problem, it would be much appreciated!
It can be easily reproduced by just running MemtestG80 on TX2.

-albertr

Hi, albertr

Do you see this error with TensorFlow?
If not, could you file a new topic describing your issue?

Thanks.

@AastaLLL, can you think of any reason why a model would load with CUDA_VISIBLE_DEVICES set to '-1' but crash with normal device visibility?

My hardware and kernel knowledge is far too limited to understand this. It would be great if you could help solve this!

Hi AastaLLL,

I mentioned it only because you said the following:

“For malloc and cudaMalloc, it can allocate almost all the available physical memory (~7.x G).”

In my testing, cudaMalloc cannot allocate more than 4GB on the TX2, and I was wondering if Gustav might have run into the same error. This can be checked by looking in the kernel log for "arm-smmu 12000000.iommu: Unhandled context fault: iova" errors. The steps to reproduce this are listed in the following post: https://devtalk.nvidia.com/default/topic/1002486/jetson-tx2/iommu-unhandled-context-fault-on-pci-device-dma/post/5263542/#5263542

-albertr

Hi, albertr

We did have some limitations on memory allocation in the past.
However, that limitation was removed as of rel-28, which is included in JetPack 3.1.

You can find more detail in this topic:
https://devtalk.nvidia.com/default/topic/1013464/jetson-tx2/gpu-out-of-memory-when-the-total-ram-usage-is-2-8g/

Thanks.

Hi, gustavvz

Setting CUDA_VISIBLE_DEVICES=-1 means no GPU is available, so the model will be created on the CPU instead.

From this information, the error occurs when the model is created on the GPU.
We are not sure whether there is any swap memory in your environment.
Please remember that swap space can only be used by the CPU.
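To double-check whether any swap is configured at all, you can read SwapTotal from /proc/meminfo. A minimal sketch, assuming a Linux system:

```python
def swap_total_kb(meminfo="/proc/meminfo"):
    """Return SwapTotal from /proc/meminfo in kB (0 means no swap).

    Returns None on systems without /proc/meminfo (non-Linux).
    """
    try:
        with open(meminfo) as f:
            for line in f:
                if line.startswith("SwapTotal:"):
                    # format: "SwapTotal:    2097148 kB"
                    return int(line.split()[1])
    except FileNotFoundError:
        return None
    return None
```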

Thanks.

Hi @AastaLLL, no, I don't have any swap in my environment.
The CPU and GPU share only the 8GB of memory.

Are you sure there is no limitation on GPU memory allocation, in any form?

Could it be due to the CUDA version used?

For example, the tensorflow/models/research/deeplab models only run on the Jetson with CUDA 8 but fail with CUDA 9.
Any ideas on that?

Thank you!

AastaLLL, thanks! The forum thread you pointed me to is really helpful. We ran your test code posted in that thread and confirmed that it can allocate 7.7GB non-contiguous, but only around 4GB contiguous. I thought you mentioned that this limitation was already removed in the L4T 28.1 kernel / JetPack 3.1 release? Or is the 4GB limit still present in JetPack 3.1? Is it a requirement to have swap enabled to get past the 4GB limit? Can you please clarify?

-albertr

I am not able to load a model that is smaller than 2GB in total into GPU memory.
How is that possible, if 4GB contiguous should be doable?

This may not all apply (it is under the TK1), but should be of interest (this can probably be adapted for your case):
https://devtalk.nvidia.com/default/topic/770634/jetson-tk1/large-coherent-dma-blocks/

What it comes down to is that some physical devices (in this case a GPU) need contiguous physical memory. Addresses translated by a memory manager won’t do the trick for some cases. If you were to allocate a large chunk of memory before the system is up and running, then you can probably reserve a larger amount for your GPU (I have not set this up on the TX2, I couldn’t tell you what applies or not for your case).

If you have swap, then other processes can use swap and give up some of their physical RAM. It won't matter, though, if memory has been fragmented. Allocating on the kernel command line (the "APPEND" key/value pair in "/boot/extlinux/extlinux.conf") can take advantage of unfragmented physical memory before any other programs run.
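For illustration only, such an APPEND entry might look like the following; the cma= value is a hypothetical example of reserving a larger contiguous-memory (CMA) pool at boot, so check the L4T documentation for which parameters your release actually supports:

```
LABEL primary
      MENU LABEL primary kernel
      LINUX /boot/Image
      APPEND ${cbootargs} root=/dev/mmcblk0p1 rw rootwait cma=512M
```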

Hi,

Sorry, we are not familiar with TensorFlow's detailed implementation.
But are you using a wheel compiled for CUDA 9.0?
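A quick way to sanity-check the combination is to print the installed CUDA toolkit version next to whether the TensorFlow wheel was built with CUDA at all. This is a sketch; the /usr/local/cuda/version.txt path is the usual CUDA 8/9 layout, and both helpers return None when the component is missing:

```python
import os

def installed_cuda_version(version_file="/usr/local/cuda/version.txt"):
    """Read the CUDA toolkit version string, e.g. 'CUDA Version 9.0.252'.

    Returns None when no CUDA toolkit is installed at the default path.
    """
    if not os.path.exists(version_file):
        return None
    with open(version_file) as f:
        return f.read().strip()

def tf_built_with_cuda():
    """Report whether the installed TensorFlow wheel was built with CUDA.

    Returns None when TensorFlow is not importable in this environment.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return tf.test.is_built_with_cuda()

if __name__ == "__main__":
    print("CUDA toolkit:", installed_cuda_version())
    print("TF built with CUDA:", tf_built_with_cuda())
```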

Thanks.

Hi,

If you have set the allow_growth configuration, TensorFlow should not allocate one big chunk and should not hit the error.

Thanks.

AastaLLL, can you clarify the 4GB limitation?

Hi, albertr

We are checking this issue internally.
We will update you with more information later.

Thanks.