I have a problem with TensorFlow (TF) computation on 2 GPUs: both GPUs are recognized by TF and each of them works separately, but they do not work simultaneously.
Here is the test code:
import tensorflow as tf

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# physical_devices = tf.config.list_physical_devices('GPU')
# tf.config.set_visible_devices(physical_devices, 'GPU')

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10)])
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
The problem is that the computation hangs at the following point:
2020-04-11 22:45:54.156999: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7245 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:08:00.0, compute capability: 6.1)
2020-04-11 22:45:54.157457: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-11 22:45:54.158053: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 7624 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1070, pci bus id: 0000:09:00.0, compute capability: 6.1)
Train on 60000 samples
Epoch 1/5
2020-04-11 22:45:59.774139: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
From this point on nothing happens, and I have to kill the process or reboot the PC.
I have already tried several things that I found on the internet. Specifically:
I tested each of the GPU ports on my motherboard separately. Each one worked, which means the ports are fine.
I tested each of my GPUs separately. To do so, I used the function set_visible_devices, and with it I managed to train the model on each of the GPUs separately. Each of the cases was checked with
I started with TensorFlow 2.0. It did not work, so I updated to TF 2.1. The problem remains.
I purged and reinstalled the Nvidia drivers (430.50), then updated them to 440.64. The problem remains.
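The per-GPU check mentioned above can be sketched roughly as follows. This is a minimal illustration, not the exact script that was run: the helper names `pick_device` and `restrict_to_gpu` are hypothetical, and it assumes TF 2.1 or later, where `tf.config.list_physical_devices` and `tf.config.set_visible_devices` are available.

```python
def pick_device(devices, index):
    """Return the device at `index`, failing loudly if it does not exist."""
    if not 0 <= index < len(devices):
        raise ValueError(
            f"GPU index {index} out of range; found {len(devices)} GPU(s)")
    return devices[index]


def restrict_to_gpu(index):
    """Make only one physical GPU visible to TensorFlow (hypothetical helper).

    This has to run before TensorFlow initializes the GPUs, i.e. before any
    op, model, or strategy is created; otherwise set_visible_devices raises
    a RuntimeError.
    """
    # Imported here so pick_device stays usable without TensorFlow installed.
    import tensorflow as tf

    gpus = tf.config.list_physical_devices('GPU')
    chosen = pick_device(gpus, index)
    # Hide every other GPU from this process.
    tf.config.set_visible_devices([chosen], 'GPU')
    return chosen
```

Calling `restrict_to_gpu(0)` at the very top of the test script (and building the model without MirroredStrategy) would train on the first card only; a second run with `restrict_to_gpu(1)` would cover the other one.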
Any advice would be welcome, so thanks in advance for your help!
P.S. Initially, my two GPUs were connected via SLI. But I read that TF does not need SLI, so I removed it.