GV100 performance issues

Hi,

I have been lucky enough to win a GV100. It now lives happily next to my old trusty Titan X (Maxwell).
I expected it to perform at least twice as fast as the Titan X, but alas, it shows only a 25% improvement.
I am mainly using Keras with a TensorFlow 1.13.1 backend and CUDA 10.0. The computer is running an E5-1650 0 @ 3.20GHz, and the data is stored on a fast M.2 drive. I was wondering if anyone can help me improve the performance, or at least steer me in the right direction on how to find the bottleneck, or point me to a better forum for the question.

Best regards,
Moshe

In general, there are several possible performance bottlenecks you would want to look into.

Even with a fast disk it is possible to be CPU bound due to the input preprocessing pipeline (particularly for image data with complex augmentations). One easy way to test this is to take your model training script, leave the preprocessing steps alone, and replace the model with a trivial, one-layer “model” that connects inputs to logits. If this simplified script runs at a speed similar to your full model, then you are IO bound and need to find ways to optimize your data loading (perhaps by doing more preprocessing offline, or by doing augmentations on the GPU using a library like DALI).

Another possibility is that your model has many very small layers. In this case, there might not be enough parallel work to saturate a highly parallel V100. To diagnose this, you can collect a profile of your training script using Nsight Systems (or nvprof). When viewing the timeline, if the kernels are often very short (a few tens of microseconds) and the gaps between kernels (when you zoom in) are very wide, you will want to find ways to increase the computational intensity of your model. For example, you might need to increase the batch size, or, if you are training an RNN, use a cuDNN RNN cell (such as Keras’s CuDNNLSTM), as these are optimized to expose parallelism.
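
For reference, a typical invocation might look like the following (train.py here is just a placeholder for your actual training script):

nvprof python train.py
nsys profile -o training_profile python train.py

The second command writes a report file that can be opened in the Nsight Systems GUI to inspect the kernel timeline.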

https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/index.html

Thank you for your reply.
So, following your advice, I wrote this little work of art that does no IO and no CPU work:

import numpy as np

import keras
from keras.models import Model
from keras.layers import Dense, Input
from keras import backend as K

import warnings
warnings.filterwarnings("ignore")

class DataGenerator:
    # Yields constant in-memory batches, so each step does no disk IO
    # and almost no CPU work.
    def create_train(self, batch_size, shape):
        assert shape[2] == 3
        while True:
            batch_images1 = np.ones((batch_size, shape[0], shape[1], shape[2])).astype("float")
            batch_labels = np.zeros((batch_size, 28))
            yield batch_images1, batch_labels

train_datagen = DataGenerator()


def create_model(input_shape, n_out):
    # Trivial one-layer "model": global average pooling straight into logits.
    a = Input(shape=input_shape)
    x = keras.layers.GlobalAveragePooling2D()(a)
    b = Dense(n_out, activation="softmax")(x)
    model = Model(inputs=a, outputs=b)

    return model


model = create_model(
    input_shape=(512,512,3),
    n_out=28)

model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['acc'])

model.summary()

epochs = 3
batch_size = 10

train_generator = train_datagen.create_train(
    batch_size, (512,512,3))
validation_generator = train_datagen.create_train(
    batch_size, (512,512,3))
K.set_value(model.optimizer.lr, 0.0001)

history = model.fit_generator(
    train_generator,
    steps_per_epoch=10000//batch_size,
    validation_data=validation_generator,
    validation_steps=20,
    epochs=epochs,
    verbose=1)

The results were the same with both cards:

m@dl4:~/retina/models$ CUDA_VISIBLE_DEVICES="1" python t_gv100.py                                                                                             
Using TensorFlow backend.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 512, 512, 3)       0         
_________________________________________________________________
global_average_pooling2d_1 ( (None, 3)                 0         
_________________________________________________________________
dense_1 (Dense)              (None, 28)                112       
=================================================================
Total params: 112
Trainable params: 112
Non-trainable params: 0
_________________________________________________________________
2019-06-21 12:43:32.077198: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties: 
name: GeForce GTX TITAN X major: 5 minor: 2 memoryClockRate(GHz): 1.076
pciBusID: 0000:03:00.0
totalMemory: 11.93GiB freeMemory: 11.82GiB
2019-06-21 12:43:32.077237: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2019-06-21 12:43:32.443339: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-21 12:43:32.443389: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 
2019-06-21 12:43:32.443399: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N 
2019-06-21 12:43:32.444021: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11436 MB memory) -> physical GPU (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:03:00.0, compute capability: 5.2)
Epoch 1/3
1000/1000 [==============================] - 64s 64ms/step - loss: 0.0365 - acc: 1.0000 - val_loss: 0.0364 - val_acc: 1.0000
Epoch 2/3
1000/1000 [==============================] - 63s 63ms/step - loss: 0.0364 - acc: 1.0000 - val_loss: 0.0364 - val_acc: 1.0000
Epoch 3/3
1000/1000 [==============================] - 64s 64ms/step - loss: 0.0364 - acc: 1.0000 - val_loss: 0.0364 - val_acc: 1.0000

m@dl4:~/retina/models$ CUDA_VISIBLE_DEVICES="0" python t_gv100.py 
Using TensorFlow backend.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 512, 512, 3)       0         
_________________________________________________________________
global_average_pooling2d_1 ( (None, 3)                 0         
_________________________________________________________________
dense_1 (Dense)              (None, 28)                112       
=================================================================
Total params: 112
Trainable params: 112
Non-trainable params: 0
_________________________________________________________________
2019-06-21 12:38:35.289458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties: 
name: Quadro GV100 major: 7 minor: 0 memoryClockRate(GHz): 1.627
pciBusID: 0000:04:00.0
totalMemory: 31.72GiB freeMemory: 31.41GiB
2019-06-21 12:38:35.289504: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2019-06-21 12:38:35.673814: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-21 12:38:35.673864: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 
2019-06-21 12:38:35.673874: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N 
2019-06-21 12:38:35.675236: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30472 MB memory) -> physical GPU (device: 0, name: Quadro GV100, pci bus id: 0000:04:00.0, compute capability: 7.0)
Epoch 1/3
1000/1000 [==============================] - 64s 64ms/step - loss: 0.0365 - acc: 1.0000 - val_loss: 0.0364 - val_acc: 1.0000
Epoch 2/3
1000/1000 [==============================] - 64s 64ms/step - loss: 0.0364 - acc: 1.0000 - val_loss: 0.0364 - val_acc: 1.0000
Epoch 3/3
1000/1000 [==============================] - 65s 65ms/step - loss: 0.0364 - acc: 1.0000 - val_loss: 0.0364 - val_acc: 1.0000

This still seems wrong to me… I’ll try to run nvprof next. What should I run it on, this script or my real model?

Your script does actually still include CPU work, because it must convert the data from numpy arrays and copy it to the GPU. For a trivial model like this one, those conversions and copies will dominate the step time, so we’d expect your V100 to perform about the same as older GPUs.

To improve IO performance, you want to avoid Python in the IO pipeline entirely, so fit_generator() is not a great choice (even though it attempts to overlap the Python IO with model execution). Using the tf.data API would be a better choice.
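
As a minimal sketch of what that could look like (image_paths, labels, and parse_fn are illustrative names, not from your script; this assumes JPEG files on disk and TF 1.x APIs):

import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

def parse_fn(path, label):
    # All decoding and preprocessing stays in TensorFlow ops,
    # so no Python code runs per example.
    image = tf.io.read_file(path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize_images(image, (512, 512))
    image = tf.cast(image, tf.float32) / 255.0
    return image, label

# Placeholder inputs; substitute your real file list and label array.
image_paths = ["images/0.jpg", "images/1.jpg"]
labels = [[0.0] * 28, [0.0] * 28]
batch_size = 10

train_dataset = tf.data.Dataset.from_tensor_slices((image_paths, labels))
train_dataset = train_dataset.shuffle(buffer_size=10000)
train_dataset = train_dataset.map(parse_fn, num_parallel_calls=AUTOTUNE)
train_dataset = train_dataset.batch(batch_size)

With tf.keras, a dataset built like this can be passed directly to model.fit() (with steps_per_epoch set), with the prefetching described below appended to the end of the pipeline.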

With tf.data you will also want to enable CPU prefetching to ensure CPU and GPU processing are overlapped. This can be done with something like the following:

train_dataset = train_dataset.prefetch(tf.data.experimental.AUTOTUNE)

GPU prefetching should also be enabled in the input pipeline to ensure PCIe transfers and GPU processing are overlapped (note that prefetch_to_device should be the final transformation in the pipeline):

train_dataset = train_dataset.apply(tf.data.experimental.prefetch_to_device('/gpu:0'))

One additional thing that might be worth trying is to force pinned memory allocations, which can improve host/device transfer performance. (This can hurt performance on machines that do not have ample host memory, because it reduces the host’s ability to page memory out.)

config = tf.ConfigProto()
config.gpu_options.force_gpu_compatible = True # Force pinned memory
sess = tf.Session(config=config)
tf.keras.backend.set_session(sess)

Even with these steps in place, your trivial model will still be IO/memcpy limited, but you should see much higher throughput. Applying this optimized IO pipeline back to your original model should then provide enough IO performance for the V100 to stretch its legs and finally outrun your Maxwell Titan X.

Thank you very much for this; it is really useful information. I have probably wasted days of my life by now by not implementing it…
I found a tutorial that contains a very good explanation of what you said, including how to implement it in Keras - see “Load images with tf.data | TensorFlow Core”.
Just for future reference, in case anyone else asks the same question. It still means a lot of restructuring in my script, so it will take me a while to implement.

Thank you again!