Slow 1080Ti compared to GTX960 running tensorflow

After installing TensorFlow/Keras on my Dell XPS laptop with a GTX960M (640 CUDA cores) running Ubuntu 16.04, I also installed it on an Ubuntu 16.04 desktop (i7-3930 with a GeForce 1080Ti with 11GB GDDR5X and 64GB of DDR3 memory). Running 25 epochs of the MNIST demo code takes 0.64 sec on the laptop but twice as long on the desktop with the 1080Ti. Similarly, the Boston house-prices example takes 2x longer on the 1080Ti.

Initially I thought the problem was which PCI Express slot the 1080Ti card was placed in, but that is not the reason: after 3 complete re-installations of Ubuntu, the 2x slower performance on the 1080Ti card has not improved. The 1080Ti card now sits in slot 0 (PCIE16_1); the only other card is my wifi card, which sits in a PCIE16_2 slot.

In googling performance concerns about TensorFlow, I’ve read that the input may need to be optimized or that one may need to compile TensorFlow from source with additional nvcc options. Since I’m running the exact same R-Keras code on both machines, I suspect compiling from scratch may be needed. Can anyone advise on how to do this? There are no such options in the install_keras() function. Alternatively, any other suggestions to boost the performance of the 1080Ti are welcome.

Below I’ve pasted the first few lines of the output from the R-Keras code. My naive conclusion is that the 1080Ti card is not running as fast as it could be (highlighted in orange). Note: the nvidia-smi output shows that “Volatile GPU-Util” on the laptop hits 58%, while the 1080Ti card never goes above 10%, but I don’t know if this is a true reflection of the load.
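One way to check whether the utilization figure reflects the real load is to sample it repeatedly while training runs, rather than reading a single nvidia-smi snapshot. The following is only a minimal sketch, assuming nvidia-smi is on the PATH; the helper names parse_utilization and sample_gpu_utilization are made up for illustration.

```python
import subprocess
import time


def parse_utilization(csv_text):
    """Parse the query output below: one integer percentage per GPU, one per line."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def sample_gpu_utilization(samples=10, interval_sec=0.5):
    """Poll nvidia-smi several times and return one list of per-GPU readings per poll."""
    readings = []
    for _ in range(samples):
        out = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=utilization.gpu",
             "--format=csv,noheader,nounits"],
            universal_newlines=True,
        )
        readings.append(parse_utilization(out))
        time.sleep(interval_sec)
    return readings


# Usage (in a second terminal while the training script is running):
#   for reading in sample_gpu_utilization():
#       print(reading)
```

If the readings stay low for the whole run, the GPU is probably idle most of the time, waiting on the host rather than computing.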

Laptop with 960M:
2018-04-16 20:02:06.924282: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-04-16 20:02:06.992276: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-16 20:02:06.992639: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: GeForce GTX 960M major: 5 minor: 0 memoryClockRate(GHz): 1.0975
pciBusID: 0000:01:00.0
totalMemory: 1.96GiB freeMemory: 1.59GiB
2018-04-16 20:02:06.992653: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-04-16 20:02:07.451438: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-16 20:02:07.451460: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0
2018-04-16 20:02:07.451485: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N
2018-04-16 20:02:07.451681: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1351 MB memory) -> physical GPU (device: 0, name: GeForce GTX 960M, pci bus id: 0000:01:00.0, compute capability: 5.0)
60000/60000 [==============================] - 2s 36us/step - loss: 0.2539 - acc: 0.9266
Epoch 2/25
60000/60000 [==============================] - 1s 23us/step - loss: 0.1053 - acc: 0.9692

Desktop with 1080Ti:
Epoch 1/25
2018-04-16 20:04:48.824036: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-16 20:04:48.824472: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.645
pciBusID: 0000:02:00.0
totalMemory: 10.91GiB freeMemory: 10.39GiB
2018-04-16 20:04:48.824493: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-04-16 20:04:49.075027: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-16 20:04:49.075071: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0
2018-04-16 20:04:49.075081: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N
2018-04-16 20:04:49.075346: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10058 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:02:00.0, compute capability: 6.1)
60000/60000 [==============================] - 3s 56us/step - loss: 0.2586 - acc: 0.9253
Epoch 2/25
60000/60000 [==============================] - 3s 45us/step

Also, here is what I get when I run TensorFlow on the GPU vs. the CPU on both the laptop and the desktop with the 1080Ti:

15 epochs

laptop: 19.79 sec with GPU // 37.57 sec with CPU

1080Ti: 40.17 sec with GPU // 51.7 sec with CPU <- very disappointing!
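The ratios implied by these timings can be worked out directly. This is just arithmetic on the numbers reported above; the dictionary layout and function name are illustrative.

```python
# Timings reported above for 15 epochs, in seconds.
timings = {
    "laptop (GTX960M)": {"gpu": 19.79, "cpu": 37.57},
    "desktop (1080Ti)": {"gpu": 40.17, "cpu": 51.7},
}


def gpu_speedup(t):
    """How many times faster the GPU run is than the CPU run on the same machine."""
    return t["cpu"] / t["gpu"]


for name, t in timings.items():
    print("%s: GPU run is %.2fx faster than CPU run" % (name, gpu_speedup(t)))

# Cross-machine ratios: desktop time divided by laptop time.
gpu_ratio = timings["desktop (1080Ti)"]["gpu"] / timings["laptop (GTX960M)"]["gpu"]
cpu_ratio = timings["desktop (1080Ti)"]["cpu"] / timings["laptop (GTX960M)"]["cpu"]
print("desktop/laptop, GPU runs: %.2fx" % gpu_ratio)
print("desktop/laptop, CPU runs: %.2fx" % cpu_ratio)
```

The GPU helps by about 1.9x on the laptop but only about 1.3x on the desktop, and the desktop is slower than the laptop even in the CPU-only runs.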

I note that the runtime without GPU (so CPU only) is significantly higher (1.38x) on the desktop than on the laptop. This seems suspicious to me. The CPU in the desktop is an i7-3930K Sandy Bridge-E 6-core at 3.2GHz (3.8GHz Turbo), correct? What is the CPU in the laptop?

I wonder whether two different builds of the TensorFlow executable were used, e.g. one a debug build and the other a release build.

The laptop is a Dell XPS 9550, so it should be a Skylake i7-6700HQ, quad core, up to 3.5GHz.

The desktop is an i7-3930 Sandy Bridge 6-core on a Gigabyte X79-UP4 motherboard.

Update: some Google searches show this to be somewhat common. It looks like I have to compile TensorFlow from source with various options that I don't fully grasp yet.

i7-3930 Sandy Bridge has no support for AVX2, while Skylake i7-6700HQ does. It is possible that this accounts for the desktop machine being slower, because based on core count and core frequency alone, I would expect the desktop machine to come out ahead of the laptop for a CPU-only run of Tensorflow.
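To confirm which of these instruction sets each machine actually exposes, the CPU flags can be read from /proc/cpuinfo on Linux. A minimal sketch; the function names are hypothetical, and the list of features checked (avx, avx2, fma) is an assumption based on the instructions named in the TensorFlow log message.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of feature flags from /proc/cpuinfo-style text (first CPU only)."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


def missing_features(cpuinfo_text, wanted=("avx", "avx2", "fma")):
    """List the wanted features that the CPU does not report."""
    flags = cpu_flags(cpuinfo_text)
    return [feature for feature in wanted if feature not in flags]


# Usage on Linux:
#   with open("/proc/cpuinfo") as f:
#       print(missing_features(f.read()))
```

An empty list means the CPU reports all three features, so a build compiled to use them could run on that machine.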

It definitely seems like a good idea to research the various configuration settings to build TensorFlow optimally for each of these two platforms.

On the Sandy Bridge desktop (i7-3930K) I compiled TensorFlow 1.7 from scratch (the biggest issue was getting the right version of bazel; only 0.11 worked, and versions 0.12 and 0.52 failed).

I found this URL for building atop CUDA 9.1, but I have CUDA 9.0 installed: https://github.com/tensorflow/tensorflow/issues/15656

bazel build --config=opt --config=cuda --incompatible_load_argument_is_label=false //tensorflow/tools/pip_package:build_pip_package

But after building a new virtual env and installing the freshly compiled tensorflow, I only gained about 1 second in speed. A larger improvement (3 sec) resulted from customizing the session with this code:

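import tensorflow as tf
import keras
from keras.backend.tensorflow_backend import set_session

# Cap this process at half of the GPU memory and let the allocation grow on demand.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.5)
gpu_options.allow_growth = True
gpu_options.force_gpu_compatible = True
sess = tf.Session(config=tf.ConfigProto(inter_op_parallelism_threads=2048,
                                        gpu_options=gpu_options))
# Register the session with Keras; otherwise Keras keeps using its default session.
set_session(sess)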

I ran the code below 4 times for benchmarking:

  • With default pip installed tensorflow 1.7: mean=39.85 sec, sd=0.17 sec
  • With default pip installed tensorflow 1.7 and customized session: mean=37.05 sec, sd=0.23 sec
  • With freshly compiled tensorflow 1.7 and customized session: mean=36.225 sec, sd=0.2 sec
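For anyone wanting to repeat this kind of measurement, a small helper can time a workload several times and report the mean and standard deviation. This is only a sketch: the benchmark function is hypothetical, and it uses the sample standard deviation, which may or may not be how the sd values above were computed.

```python
import statistics
import time


def benchmark(fn, runs=4):
    """Call fn() `runs` times and return (mean, sample standard deviation) in seconds."""
    durations = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - t0)
    return statistics.mean(durations), statistics.stdev(durations)


# Usage, with any zero-argument callable that runs the workload once:
#   mean_sec, sd_sec = benchmark(run_training, runs=4)
#   print("mean=%.2f sec, sd=%.2f sec" % (mean_sec, sd_sec))
```

Here run_training is a placeholder for whatever function builds and fits the model; building a fresh model inside it keeps each run independent of the previous one.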

Additionally, I inserted the MSI GeForce GTX 1080 Ti ARMOR 11G into a Gigabyte Z77MX motherboard with an i7-3770 (Ivy Bridge).
With the pip installed tensorflow (but without the customized session): mean=34.4 sec, sd=0.32 sec

Both motherboards have Gen 3 PCI Express slots, but since the slightly newer CPU of the 2nd desktop and the new-ish laptop are faster, I’m thinking that maybe it’s time to upgrade my desktop’s motherboard and CPU :-(

As a reminder, the laptop is still almost 2x faster than any of my desktops using any version of TensorFlow or session settings (20 sec for the same code).

FYI, here is the tensorflow-keras code used for benchmarking.

import time

from keras import layers, models
from keras.datasets import mnist
from keras.utils import to_categorical

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Two-layer dense classifier for flattened 28x28 MNIST digits.
network = models.Sequential()
network.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
network.add(layers.Dense(10, activation='softmax'))

network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])

# Flatten the images and scale pixel values to [0, 1].
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255

# One-hot encode the labels.
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

# Time training plus evaluation.
t0 = time.time()
network.fit(train_images, train_labels, epochs=15, batch_size=128)
test_loss, test_acc = network.evaluate(test_images, test_labels)
t1 = time.time()
print(t1 - t0)

print('test_acc:', test_acc)

Update: after building a new desktop with an i7-8700 (6 cores) on an MSI Z370M gaming motherboard, the 1080Ti GPU is now 30-40% faster than the laptop’s GTX960M.

However, for logistical reasons, I replaced the MSI 1080Ti GPU with a similar EVGA GPU card. The MSI card was about 25mm taller than the EVGA card; the EVGA fits better into a new, smaller PC case (Corsair Air 240). I have not tested the EVGA GPU on the old desktop running an i7-3930 CPU on a Gigabyte UP4 motherboard inside a full-sized case. Both desktops have 64GB of memory.