Multi-GPU training not working

intelligence.expertise · April 12, 2020, 8:49am

Dear community,

I have a problem running TensorFlow (TF) on 2 GPUs: TF recognizes both GPUs and each of them works separately, but they do not work simultaneously.

Here is the test code:

import tensorflow as tf

mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# physical_devices = tf.config.list_physical_devices('GPU')
# tf.config.set_visible_devices(physical_devices, 'GPU')

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10)])
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'])
    model.fit(x_train, y_train, epochs=5)

The problem is that the computation hangs at the following point:

2020-04-11 22:45:54.156999: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7245 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:08:00.0, compute capability: 6.1)
2020-04-11 22:45:54.157457: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-11 22:45:54.158053: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 7624 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1070, pci bus id: 0000:09:00.0, compute capability: 6.1)
Train on 60000 samples
Epoch 1/5
2020-04-11 22:45:59.774139: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10

From this point on nothing happens, and I have to kill the process or reboot the PC.
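The log above shows both GPUs being created before the hang. Here is a minimal sketch for checking that programmatically, using only the Python standard library. The names `DEVICE_LINE`, `created_gpus` and `SAMPLE_LOG` are made up for illustration, and the pattern has only been matched against the two log lines quoted here, so other TF versions may format the line differently.

```python
import re

# Pattern for the "Created TensorFlow device" lines quoted in the log above.
# Assumption: the line format is the one printed by this TF 2.1 run.
DEVICE_LINE = re.compile(
    r"Created TensorFlow device \(\S*device:GPU:(\d+) with (\d+) MB memory\)"
    r" -> physical GPU \(device: \d+, name: ([^,]+), pci bus id: ([0-9a-fA-F:.]+)"
)


def created_gpus(log_text):
    """Return one dict per GPU device that TensorFlow reports creating."""
    return [
        {
            "index": int(m.group(1)),
            "memory_mb": int(m.group(2)),
            "name": m.group(3),
            "pci_bus_id": m.group(4),
        }
        for m in DEVICE_LINE.finditer(log_text)
    ]


# The two device-creation lines copied from the log in this post.
SAMPLE_LOG = """\
2020-04-11 22:45:54.156999: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7245 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:08:00.0, compute capability: 6.1)
2020-04-11 22:45:54.158053: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 7624 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1070, pci bus id: 0000:09:00.0, compute capability: 6.1)
"""

for gpu in created_gpus(SAMPLE_LOG):
    print(gpu)
```

Both devices appear in the result, so the hang happens after device creation and not during GPU detection.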

I have already tried several things that I found on the internet. Specifically:

I tested each of the GPU slots on my motherboard separately. Each one worked, so the slots themselves are fine.
I tested each of my GPUs separately. To do so, I used the function set_visible_devices, which let me train the model on each GPU on its own. I confirmed each case with nvidia-smi.
I started with TensorFlow 2.0. It did not work, so I updated to TF 2.1. The problem remains.
I purged and reinstalled the Nvidia drivers (430.50), then updated them to 440.64. The problem remains.
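A minimal sketch of the single-GPU check described above, assuming TF 2.1 where `tf.config.list_physical_devices` and `tf.config.set_visible_devices` are available. The helper names `select_single_gpu` and `restrict_to_gpu` are hypothetical and not part of TensorFlow.

```python
def select_single_gpu(gpus, index):
    """Return a one-element list holding the GPU at `index`, or raise if absent."""
    if not 0 <= index < len(gpus):
        raise ValueError(
            f"GPU index {index} out of range; found {len(gpus)} GPU(s)"
        )
    return [gpus[index]]


def restrict_to_gpu(index):
    """Make only one physical GPU visible to TensorFlow.

    This has to run before TensorFlow initializes the GPUs; changing the
    visible devices afterwards raises a RuntimeError.
    """
    # Imported here so that select_single_gpu can be used without TensorFlow.
    import tensorflow as tf

    gpus = tf.config.list_physical_devices('GPU')
    tf.config.set_visible_devices(select_single_gpu(gpus, index), 'GPU')
    # With one visible GPU this should list a single logical device.
    return tf.config.list_logical_devices('GPU')
```

Calling `restrict_to_gpu(0)` or `restrict_to_gpu(1)` at the top of the test script, before the model is built, limits training to that one card, which can then be confirmed with nvidia-smi.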

Any advice would be welcome. Thanks in advance for your help!

P.S. Initially, my two GPUs were connected via SLI. But I read that TF does not need SLI, so I removed it.
