Distributed Inception - CUDA allocation error

Hi guys,

I'm trying to train an Inception architecture from scratch in a distributed setting.
Right now, my configuration uses only one parameter server (ps) on a Jetson TX2 and one worker on another Jetson.

- The sources come from the TensorFlow GitHub: https://github.com/tensorflow/models/tree/master/research/inception
- The Jetsons use the latest JetPack and your latest official version of TensorFlow.
- I use Python 3.5.2.
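
In TensorFlow terms, this setup corresponds roughly to the following cluster spec (the host names are placeholders; the script builds the real one from its --ps_hosts and --worker_hosts flags, if I read it correctly):

import tensorflow as tf

# one parameter server on the first TX2, one worker on the second
# (placeholder host names)
cluster = tf.train.ClusterSpec({
    'ps': ['jetson-ps:2222'],
    'worker': ['jetson-worker:2222'],
})

# each process starts its own server with its role and index,
# e.g. job_name='ps', task_index=0 on the parameter server
server = tf.train.Server(cluster, job_name='worker', task_index=0)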

I encounter the following error on the worker device.

2018-09-27 15:51:53.972490: E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to alloc 2304 bytes on host: CUDA_ERROR_UNKNOWN
2018-09-27 15:51:53.978684: W ./tensorflow/core/common_runtime/gpu/pool_allocator.h:195] could not allocate pinned host memory of size: 2304

For the record:
- Training runs fine on PCs.
- Training runs with no error on the Jetsons if I don't define a CUDA device (so only on CPUs).
- Training works on a single Jetson with GPU (no distributed setting), but it uses a lot of memory and needs to swap 800 MB.

I tried the following solutions:
- I reduced everything to a minimum (batch_size, size of the queue, etc.).
- I allowed GPU memory growth.

In inception_distributed_train.py:

sess_config = tf.ConfigProto(
          allow_soft_placement=True,
          log_device_placement=FLAGS.log_device_placement)

I put:

sess_config.gpu_options.allow_growth = True

This does not seem to change anything (distributed or not).
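
For context, sess_config is consumed further down in the same file, roughly like this (paraphrased from memory, so the exact Supervisor arguments may differ):

import tensorflow as tf

# minimal sketch of where sess_config ends up in inception_distributed_train.py
# (paraphrased; the real Supervisor call has more arguments)
def make_training_session(server, is_chief, train_dir, init_op, global_step, saver):
    sess_config = tf.ConfigProto(allow_soft_placement=True,
                                 log_device_placement=False)
    sess_config.gpu_options.allow_growth = True  # the option I added

    sv = tf.train.Supervisor(is_chief=is_chief,
                             logdir=train_dir,
                             init_op=init_op,
                             global_step=global_step,
                             saver=saver)
    # the config is only handed to the session created here
    return sv.prepare_or_wait_for_session(server.target, config=sess_config)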

The last thing I tested was to run the ps on a Jetson and the worker on the PC.
I got the following error on the ps (Jetson):

2018-09-28 07:23:54.782972: W tensorflow/core/common_runtime/bfc_allocator.cc:279] <allocator contains no memory>
2018-09-28 07:23:54.783154: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at assign_op.h:117 : Resource exhausted: OOM when allocating tensor with shape[768] and type float on /job:ps/replica:0/task:0/device:CPU:0 by allocator cuda_host_bfc
2018-09-28 07:23:54.783164: E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to alloc 4194304 bytes on host: CUDA_ERROR_UNKNOWN
2018-09-28 07:23:54.783405: W ./tensorflow/core/common_runtime/gpu/pool_allocator.h:195] could not allocate pinned host memory of size: 4194304

But it works if I don’t use the GPU on the Jetson.

I would appreciate any help with these errors.
Thanks in advance,

Regards,
Paul

PS: Sorry for my English, it's not my native language.

Hi,

Two things we want you to know first:
1. It's NOT recommended to use Jetson for training, since it is designed primarily for inference.
2. We don't have experience with distributed training on Jetson. It may have some issues.

Based on your log, the error comes from allocating a big chunk of memory.
It should be avoided by setting gpu_options.allow_growth.
Please make sure the configuration is applied on all devices, both master and slaves.
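
For example, something like this in every process that creates a session (a minimal sketch, not the exact Inception code):

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand

with tf.Session(config=config) as sess:
    # trivial op, just to force the session and GPU to initialize
    print(sess.run(tf.constant(42)))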

Thanks.

Hi AastaLLL,

Thank you for your reply.

I am aware that distributed training is not a typical use case for Jetsons.
The reason I am doing it is that I am currently working on a research project.
In this project, we want to perform embedded learning in the field.
Since the capabilities of a single Jetson are limited, we seek to distribute this learning across a cluster of Jetsons.

Concerning gpu_options.allow_growth:
Enabling this option does not seem to change anything (even in non-distributed training).
This problem (allow_growth not working) does not seem to be related to distribution.

Do you know why the option does not work?

Thanks again.

Hi Zarathoustra,

The allow_growth option will cause TensorFlow to allocate memory as needed, but it is still possible to run into memory issues.

In some scenarios we have found that using a different GPU memory allocator (rather than the default bfc allocator) will work. However, this may come with a performance tradeoff.

Could you try using a different allocator by running the following in the shell you launch TensorFlow from?

export TF_GPU_ALLOCATOR="cuda_malloc"
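
Equivalently, it can be set from Python, as long as it happens before TensorFlow initializes the GPU (a minimal sketch):

import os

# must be set before TensorFlow creates any GPU device
os.environ['TF_GPU_ALLOCATOR'] = 'cuda_malloc'

import tensorflow as tf  # import after setting the variable, to be safe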

Thanks,
John

Hi John,

Thanks for the tip; I had never thought about changing the allocator.
After several tests, I observe the same behavior whichever allocator I choose.

I finally found my mistake: the configuration was applied in the wrong place.
Even though TensorFlow reported that allow_growth was enabled, assigning the option to each training session is not the right way to proceed.

For users interested in the solution: you must set the GPU options on the server instance.
It should look like this:

config = tf.ConfigProto()
config.gpu_options.allow_growth = True

server = tf.train.Server(
      ...,
      config=config)

Strangely, this method is not used in the source code.
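
For completeness, here is a fuller sketch of how it fits together in my scripts (the host names are placeholders for my two Jetsons, not the exact Inception code):

import tensorflow as tf

cluster = tf.train.ClusterSpec({
    'ps': ['jetson-ps:2222'],
    'worker': ['jetson-worker:2222'],
})

config = tf.ConfigProto()
config.gpu_options.allow_growth = True

# the config now applies to the server process itself,
# not only to the sessions that later connect to it
server = tf.train.Server(cluster,
                         job_name='worker',  # 'ps' on the parameter server
                         task_index=0,
                         config=config)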

In any case, thank you for your time. I really appreciate that you also help solve unconventional problems.

Best Regards,
Paul