Using GPU in 2 processes (keras) in parallel - crash

dannykario · January 28, 2019, 11:24pm

Hi,

Is it possible to run two Keras processes that use the GPU in parallel?

In particular, using the same code:

When I run one process on the Xavier, all is good.
When I run two Keras processes in parallel on an Amazon host with a Tesla V100, all is good.
When I run two processes in parallel with the exact same code on the Xavier, both crash and exit at the same point.

Log from the Xavier:

Using TensorFlow backend.
tf.estimator package not installed.
tf.estimator package not installed.
2019-01-28 21:26:59.851661: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:924] ARM64 does not support NUMA - returning NUMA node zero
2019-01-28 21:26:59.851902: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Xavier major: 7 minor: 2 memoryClockRate(GHz): 1.5
pciBusID: 0000:00:00.0
totalMemory: 15.45GiB freeMemory: 7.96GiB
2019-01-28 21:26:59.851988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-01-28 21:27:00.514434: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-28 21:27:00.514553: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-01-28 21:27:00.514629: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-01-28 21:27:00.514879: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7388 MB memory) → physical GPU (device: 0, name: Xavier, pci bus id: 0000:00:00.0, compute capability: 7.2)
Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 → device: XLA_CPU device
/job:localhost/replica:0/task:0/device:XLA_GPU:0 → device: XLA_GPU device
/job:localhost/replica:0/task:0/device:GPU:0 → device: 0, name: Xavier, pci bus id: 0000:00:00.0, compute capability: 7.2
2019-01-28 21:27:00.516523: I tensorflow/core/common_runtime/direct_session.cc:307] Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 → device: XLA_CPU device
/job:localhost/replica:0/task:0/device:XLA_GPU:0 → device: XLA_GPU device
/job:localhost/replica:0/task:0/device:GPU:0 → device: 0, name: Xavier, pci bus id: 0000:00:00.0, compute capability: 7.2

2019-01-28 21:27:00.751763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-01-28 21:27:00.751913: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-28 21:27:00.751968: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-01-28 21:27:00.752003: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-01-28 21:27:00.752144: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7388 MB memory) → physical GPU (device: 0, name: Xavier, pci bus id: 0000:00:00.0, compute capability: 7.2)
/home/nvidia/.local/lib/python3.6/site-packages/keras/engine/saving.py:292: UserWarning: No training configuration found in save file: the model was not compiled. Compile it manually.
warnings.warn('No training configuration found in save file: '
['person', 'bicycle', 'car', 'motorbike', 'aeroplane', 'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'sofa', 'pottedplant', 'bed', 'diningtable', 'toilet', 'tvmonitor', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush']
mycode.py:74: FutureWarning: arrays to stack must be passed as a "sequence" type such as list or tuple. Support for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an error in the future.
yield (np.stack(map(lambda x:x[1], batch)), # images
Killed

Any ideas?

AastaLLL · January 29, 2019, 3:16am

Hi,

Jetson Xavier has only ONE GPU.

By default, TensorFlow claims all of the GPU memory, which may cause other GPU applications to crash.
Try limiting the maximum amount of memory each app can access.
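
As a rough illustration of how such a limit might be chosen: the log above reports totalMemory 15.45GiB and freeMemory 7.96GiB, and on Jetson boards that memory is shared between the CPU and the GPU. The sketch below splits the free memory evenly between processes and expresses each share as a fraction of total memory. It assumes that TensorFlow 1.x interprets per_process_gpu_memory_fraction relative to total device memory; the helper name per_process_fraction and the 10% headroom are made up for illustration.

```python
def per_process_fraction(free_gib, total_gib, n_processes, headroom=0.9):
    """Return the share of total memory each process may claim.

    The free memory is scaled by `headroom` (an arbitrary safety margin),
    split evenly between the processes, and expressed as a fraction of
    the total device memory.
    """
    if n_processes < 1:
        raise ValueError("n_processes must be at least 1")
    if not 0 < free_gib <= total_gib:
        raise ValueError("expected 0 < free_gib <= total_gib")
    usable_gib = free_gib * headroom
    return usable_gib / total_gib / n_processes


# Values taken from the Xavier log above; two Keras processes.
fraction = per_process_fraction(free_gib=7.96, total_gib=15.45, n_processes=2)
print(round(fraction, 2))  # prints 0.23
```

Under the same assumption, two processes capped at 0.4 each could together claim up to about 12.4 GiB, more than the 7.96 GiB reported free, so a smaller value may be needed when other software is already using memory.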

I'm not sure whether this configuration can be applied from Keras, but it works for TensorFlow users:

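import tensorflow as tf  # TensorFlow 1.x API

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand
config.gpu_options.per_process_gpu_memory_fraction = 0.4  # cap each process at 40%
session = tf.Session(config=config)  # pass any other Session arguments as needed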
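
For Keras on the TensorFlow backend (the log above shows "Using TensorFlow backend."), the same options can probably be applied by handing Keras a pre-configured session. This is a minimal sketch, assuming standalone Keras 2.x on TensorFlow 1.x; the helper name limit_keras_gpu_memory is made up for illustration, and the code has not been tested on a Xavier.

```python
def limit_keras_gpu_memory(fraction=0.4, allow_growth=True):
    """Give Keras a TensorFlow session with a capped GPU memory budget.

    Assumes standalone Keras 2.x with the TensorFlow 1.x backend.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in the range (0, 1]")

    # Imported here so the check above runs even where TensorFlow is absent.
    import tensorflow as tf
    from keras.backend.tensorflow_backend import set_session

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = allow_growth
    config.gpu_options.per_process_gpu_memory_fraction = fraction
    set_session(tf.Session(config=config))


# Call once at the start of each process, before any model is loaded:
# limit_keras_gpu_memory(fraction=0.4)
```

The call has to happen before Keras creates its own default session, which is why it belongs at the top of each process, ahead of loading the model.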

Thanks.
