Hi guys,
I'm trying to train an Inception architecture from scratch in a distributed setting.
Right now, my configuration uses only one parameter server (ps) on a Jetson TX2 and one worker on another Jetson.
-The sources come from the TensorFlow GitHub: https://github.com/tensorflow/models/tree/master/research/inception
-The Jetsons run the latest JetPack and your latest official TensorFlow build
-I use Python 3.5.2
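For reference, I launch the two processes with the standard commands from the repository's README; the hostnames, port, and batch size below are placeholders for my two Jetsons, not my exact values:

```shell
# On the Jetson acting as parameter server:
bazel-bin/inception/imagenet_distributed_train \
  --job_name='ps' --task_id=0 \
  --ps_hosts='jetson-ps:2222' \
  --worker_hosts='jetson-worker:2222'

# On the Jetson acting as worker:
bazel-bin/inception/imagenet_distributed_train \
  --batch_size=8 \
  --data_dir=$HOME/imagenet-data \
  --job_name='worker' --task_id=0 \
  --ps_hosts='jetson-ps:2222' \
  --worker_hosts='jetson-worker:2222'
```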
I encounter the following error on the worker device.
2018-09-27 15:51:53.972490: E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to alloc 2304 bytes on host: CUDA_ERROR_UNKNOWN
2018-09-27 15:51:53.978684: W ./tensorflow/core/common_runtime/gpu/pool_allocator.h:195] could not allocate pinned host memory of size: 2304
For the record:
-Training runs fine on PCs.
-Training runs with no error on the Jetsons if I don’t define a CUDA device (so CPU only).
-Training works on a single Jetson with GPU (no distributed setting) but uses a lot of memory and needs to swap about 800 MB.
I tried the following solutions:
-I reduced everything to a minimum (batch_size, queue sizes, etc.)
-I allowed GPU memory growth
In inception_distributed_train.py, where the session config is built:

sess_config = tf.ConfigProto(
    allow_soft_placement=True,
    log_device_placement=FLAGS.log_device_placement)

I added:

sess_config.gpu_options.allow_growth = True

This does not seem to change anything (distributed or not).
The last thing I tested was running the ps on a Jetson and the worker on a PC.
I get the following error on the ps (Jetson):
2018-09-28 07:23:54.782972: W tensorflow/core/common_runtime/bfc_allocator.cc:279] <allocator contains no memory>
2018-09-28 07:23:54.783154: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at assign_op.h:117 : Resource exhausted: OOM when allocating tensor with shape[768] and type float on /job:ps/replica:0/task:0/device:CPU:0 by allocator cuda_host_bfc
2018-09-28 07:23:54.783164: E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to alloc 4194304 bytes on host: CUDA_ERROR_UNKNOWN
2018-09-28 07:23:54.783405: W ./tensorflow/core/common_runtime/gpu/pool_allocator.h:195] could not allocate pinned host memory of size: 4194304
But it works if I don’t use the GPU on the Jetson.
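When I say I don’t use the GPU, I mean hiding it from CUDA before launching the process (assuming the standard CUDA environment variable):

```shell
# Hide the GPU from CUDA/TensorFlow so the process falls back to CPU only
export CUDA_VISIBLE_DEVICES=""
```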
I would appreciate any solution for these errors.
Thanks in advance,
Regards,
Paul
PS: Sorry for my English, it’s not my native language.