Using GPU in 2 processes (keras) in parallel - crash

Hi,

Is it possible to run 2 Keras processes that use the GPU in parallel?

In particular, using the same code:

  • When I run one process on the Xavier, all is good.
  • When I run 2 Keras processes in parallel on an Amazon host with a Tesla V100, all is good.
  • Using the exact same code on the Xavier, both processes crash and exit at the same point.

Log from the Xavier:


Using TensorFlow backend.
tf.estimator package not installed.
tf.estimator package not installed.
2019-01-28 21:26:59.851661: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:924] ARM64 does not support NUMA - returning NUMA node zero
2019-01-28 21:26:59.851902: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Xavier major: 7 minor: 2 memoryClockRate(GHz): 1.5
pciBusID: 0000:00:00.0
totalMemory: 15.45GiB freeMemory: 7.96GiB
2019-01-28 21:26:59.851988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-01-28 21:27:00.514434: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-28 21:27:00.514553: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-01-28 21:27:00.514629: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-01-28 21:27:00.514879: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7388 MB memory) -> physical GPU (device: 0, name: Xavier, pci bus id: 0000:00:00.0, compute capability: 7.2)
Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Xavier, pci bus id: 0000:00:00.0, compute capability: 7.2
2019-01-28 21:27:00.516523: I tensorflow/core/common_runtime/direct_session.cc:307] Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Xavier, pci bus id: 0000:00:00.0, compute capability: 7.2

2019-01-28 21:27:00.751763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-01-28 21:27:00.751913: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-28 21:27:00.751968: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-01-28 21:27:00.752003: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-01-28 21:27:00.752144: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7388 MB memory) -> physical GPU (device: 0, name: Xavier, pci bus id: 0000:00:00.0, compute capability: 7.2)
/home/nvidia/.local/lib/python3.6/site-packages/keras/engine/saving.py:292: UserWarning: No training configuration found in save file: the model was not compiled. Compile it manually.
warnings.warn('No training configuration found in save file: '
['person', 'bicycle', 'car', 'motorbike', 'aeroplane', 'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'sofa', 'pottedplant', 'bed', 'diningtable', 'toilet', 'tvmonitor', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush']
mycode.py:74: FutureWarning: arrays to stack must be passed as a "sequence" type such as list or tuple. Support for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an error in the future.
yield (np.stack(map(lambda x:x[1], batch)), # images
Killed


Any ideas?

Hi,

Jetson Xavier has only ONE GPU.

By default, TensorFlow claims nearly all of the available GPU memory, which may cause other GPU applications to crash.
Try limiting the maximum amount of GPU memory each process can use.

I'm not sure whether this configuration can be set from Keras, but it works for TensorFlow users:

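import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand
config.gpu_options.per_process_gpu_memory_fraction = 0.4  # upper limit per process
session = tf.Session(config=config)  # plus any other Session arguments you need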
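
For Keras, here is a minimal sketch of how the same settings might be applied, assuming Keras runs on the TensorFlow 1.x backend (as the log above suggests). It hands the configured session to Keras through `set_session` from `keras.backend`. The names `fraction_per_process` and `configure_keras_gpu` are hypothetical helpers, not part of either library. The first helper assumes the fraction is measured against total device memory, which is my understanding of TF 1.x but worth verifying on your setup. Under that assumption, 0.4 of 15.45 GiB is about 6.2 GiB per process, so with two processes and 7.96 GiB free a value smaller than 0.4 may be needed.

```python
def fraction_per_process(free_gib, total_gib, num_processes, headroom=0.9):
    """Suggest a per_process_gpu_memory_fraction for several processes.

    Assumption: the fraction is relative to TOTAL device memory, so the
    currently free share is split evenly and scaled by a safety headroom.
    """
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")
    if not 0 < free_gib <= total_gib:
        raise ValueError("expected 0 < free_gib <= total_gib")
    return headroom * free_gib / total_gib / num_processes


def configure_keras_gpu(memory_fraction):
    """Register a memory-capped TF 1.x session as the Keras session.

    Call this once per process, before building or loading any model.
    """
    if not 0.0 < memory_fraction <= 1.0:
        raise ValueError("memory_fraction must be in (0, 1]")

    # Imported here so the helper above works without TensorFlow installed.
    import tensorflow as tf
    from keras import backend as K

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True  # allocate on demand
    config.gpu_options.per_process_gpu_memory_fraction = memory_fraction
    K.set_session(tf.Session(config=config))
```

With the numbers from the Xavier log, each process would call `configure_keras_gpu(fraction_per_process(7.96, 15.45, num_processes=2))` at startup. Free memory on the Xavier changes with whatever else is running, so the value is only a starting point.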

Thanks.