Low performance on P6000 with AMD 1920x

Hi,

I built a brand new setup solely for deep-learning purposes. After putting all the parts together, I ran a few benchmarks, and the results seem quite low compared to what I could find online.

The documentation I followed can be found here: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html

Here is the setup I have:

Hardware:
Motherboard: MSI X399 Gaming Pro Carbon AC
RAM: 4 × 16 GB
CPU: AMD Threadripper 1920X
GPU: NVIDIA Quadro P6000

Software:
OS: Ubuntu Server 19.04

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.40       Driver Version: 430.40       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro P6000        Off  | 00000000:41:00.0 Off |                  Off |
| 26%   37C    P8     9W / 250W |  22807MiB / 24449MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1832      G   /usr/lib/xorg/Xorg                             8MiB |
|    0      2004      G   /usr/bin/gnome-shell                           4MiB |
|    0      5186      C   /opt/anaconda3/envs/PythonGPU/bin/python   22781MiB |
+-----------------------------------------------------------------------------+
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Apr_24_19:10:27_PDT_2019
Cuda compilation tools, release 10.1, V10.1.168

Benchmarks:

I tried this test: https://github.com/keras-team/keras/blob/master/examples/mnist_mlp.py

It gives the following results:

60000 train samples
10000 test samples
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 512)               401920    
_________________________________________________________________
dropout_1 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 512)               262656    
_________________________________________________________________
dropout_2 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 10)                5130      
=================================================================
Total params: 669,706
Trainable params: 669,706
Non-trainable params: 0
_________________________________________________________________
Train on 60000 samples, validate on 10000 samples
Epoch 1/20
60000/60000 [==============================] - 6s 95us/step - loss: 0.2449 - acc: 0.9255 - val_loss: 0.1094 - val_acc: 0.9663
Epoch 2/20
60000/60000 [==============================] - 4s 65us/step - loss: 0.1024 - acc: 0.9690 - val_loss: 0.0872 - val_acc: 0.9734
Epoch 3/20
60000/60000 [==============================] - 4s 73us/step - loss: 0.0757 - acc: 0.9768 - val_loss: 0.0836 - val_acc: 0.9750
Epoch 4/20
60000/60000 [==============================] - 4s 66us/step - loss: 0.0611 - acc: 0.9812 - val_loss: 0.0663 - val_acc: 0.9806
Epoch 5/20
60000/60000 [==============================] - 5s 87us/step - loss: 0.0512 - acc: 0.9843 - val_loss: 0.0662 - val_acc: 0.9826
Epoch 6/20
60000/60000 [==============================] - 4s 69us/step - loss: 0.0438 - acc: 0.9871 - val_loss: 0.0725 - val_acc: 0.9812
Epoch 7/20
60000/60000 [==============================] - 4s 68us/step - loss: 0.0381 - acc: 0.9891 - val_loss: 0.0753 - val_acc: 0.9821
Epoch 8/20
60000/60000 [==============================] - 5s 84us/step - loss: 0.0337 - acc: 0.9902 - val_loss: 0.0769 - val_acc: 0.9821
Epoch 9/20
60000/60000 [==============================] - 5s 78us/step - loss: 0.0317 - acc: 0.9905 - val_loss: 0.0853 - val_acc: 0.9820
Epoch 10/20
60000/60000 [==============================] - 4s 71us/step - loss: 0.0279 - acc: 0.9920 - val_loss: 0.0774 - val_acc: 0.9835
Epoch 11/20
60000/60000 [==============================] - 5s 83us/step - loss: 0.0267 - acc: 0.9921 - val_loss: 0.0779 - val_acc: 0.9854
Epoch 12/20
60000/60000 [==============================] - 5s 78us/step - loss: 0.0238 - acc: 0.9933 - val_loss: 0.1056 - val_acc: 0.9806
Epoch 13/20
60000/60000 [==============================] - 5s 81us/step - loss: 0.0258 - acc: 0.9929 - val_loss: 0.0870 - val_acc: 0.9835
Epoch 14/20
60000/60000 [==============================] - 5s 79us/step - loss: 0.0219 - acc: 0.9939 - val_loss: 0.1002 - val_acc: 0.9834
Epoch 15/20
60000/60000 [==============================] - 5s 76us/step - loss: 0.0206 - acc: 0.9943 - val_loss: 0.0910 - val_acc: 0.9833
Epoch 16/20
60000/60000 [==============================] - 5s 81us/step - loss: 0.0210 - acc: 0.9942 - val_loss: 0.0963 - val_acc: 0.9841
Epoch 17/20
60000/60000 [==============================] - 4s 69us/step - loss: 0.0185 - acc: 0.9949 - val_loss: 0.0958 - val_acc: 0.9854
Epoch 18/20
60000/60000 [==============================] - 5s 82us/step - loss: 0.0185 - acc: 0.9951 - val_loss: 0.1040 - val_acc: 0.9839
Epoch 19/20
60000/60000 [==============================] - 4s 70us/step - loss: 0.0186 - acc: 0.9948 - val_loss: 0.1011 - val_acc: 0.9838
Epoch 20/20
60000/60000 [==============================] - 5s 88us/step - loss: 0.0184 - acc: 0.9951 - val_loss: 0.0974 - val_acc: 0.9856
Test loss: 0.09737630939959879
Test accuracy: 0.9856


I then ran this test: https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_train.py

It gave the following results:

I0806 17:02:55.815466 140026390669120 basic_session_run_hooks.py:606] Saving checkpoints for 0 into /tmp/cifar10_train/model.ckpt.
2019-08-06 17:02:56.218330: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2019-08-06 17:02:56.687217: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-08-06 17:02:58.196827: step 0, loss = 4.68 (391.8 examples/sec; 0.327 sec/batch)
2019-08-06 17:03:00.722650: step 10, loss = 4.60 (506.8 examples/sec; 0.253 sec/batch)
2019-08-06 17:03:03.028074: step 20, loss = 4.52 (555.2 examples/sec; 0.231 sec/batch)
2019-08-06 17:03:05.343251: step 30, loss = 4.38 (552.9 examples/sec; 0.232 sec/batch)
2019-08-06 17:03:07.656660: step 40, loss = 4.39 (553.3 examples/sec; 0.231 sec/batch)
2019-08-06 17:03:09.945660: step 50, loss = 4.36 (559.2 examples/sec; 0.229 sec/batch)
2019-08-06 17:03:12.268526: step 60, loss = 4.28 (551.0 examples/sec; 0.232 sec/batch)
2019-08-06 17:03:14.569985: step 70, loss = 4.19 (556.2 examples/sec; 0.230 sec/batch)
2019-08-06 17:03:16.854503: step 80, loss = 4.12 (560.3 examples/sec; 0.228 sec/batch)
2019-08-06 17:03:19.187332: step 90, loss = 4.14 (548.7 examples/sec; 0.233 sec/batch)
I0806 17:03:21.569198 140026390669120 basic_session_run_hooks.py:692] global_step/sec: 4.27834
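As a sanity check on these numbers: cifar10_train.py uses a batch size of 128 by default, so examples/sec should equal 128 divided by sec/batch, and the hook's global_step/sec times 128 should land in the same range. A quick sketch (assuming the script's default batch size):

```python
# Cross-check the throughput numbers printed by cifar10_train.py.
# Assumes the script's default batch size of 128.
BATCH_SIZE = 128

def examples_per_sec(sec_per_batch, batch_size=BATCH_SIZE):
    """Throughput implied by one batch duration."""
    return batch_size / sec_per_batch

# Step 20 above reports 0.231 sec/batch:
print(round(examples_per_sec(0.231), 1))  # ~554, matching the ~555 examples/sec in the log

# The hook's global_step/sec figure implies roughly the same throughput:
print(round(4.27834 * BATCH_SIZE, 1))  # ~548 examples/sec
```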

Question:

The performance seems pretty low compared to what I can see on various benchmark sites. How can I improve it? Are these results consistent with a Quadro P6000?

Thank you!

Hi,

For testing purposes, I completely reinstalled the system, this time with Ubuntu 18.10 and kernel 4.18.0-25.

Again, I thoroughly followed the instructions from https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html

I now get even worse performance than before.

Here are the details:

nvidia-smi
Wed Aug  7 14:08:34 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro P6000        On   | 00000000:41:00.0 Off |                  Off |
| 26%   35C    P8    18W / 250W |      1MiB / 24448MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Performance with CIFAR10 test:

I0807 14:12:09.224721 140453946832704 session_manager.py:502] Done running local_init_op.
I0807 14:12:09.578963 140453946832704 basic_session_run_hooks.py:606] Saving checkpoints for 0 into /tmp/cifar10_train/model.ckpt.
2019-08-07 14:12:14.432445: step 0, loss = 4.66 (229.2 examples/sec; 0.558 sec/batch)
2019-08-07 14:12:55.661491: step 10, loss = 4.57 (31.0 examples/sec; 4.123 sec/batch)
2019-08-07 14:13:37.720794: step 20, loss = 4.48 (30.4 examples/sec; 4.206 sec/batch)
2019-08-07 14:14:20.119666: step 30, loss = 4.41 (30.2 examples/sec; 4.240 sec/batch)

I’m using Anaconda 3 and added the following two packages:
tensorflow-gpu
tensorflow_datasets
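One thing worth verifying at this point is whether the CUDA libraries TensorFlow tries to load at runtime are even discoverable on the system. A minimal stdlib sketch (library names assumed from TF's usual dso_loader messages, like the ones earlier in this thread):

```python
# Check whether the shared libraries TensorFlow's dso_loader looks for
# are discoverable on this machine. find_library() consults the ldconfig
# cache on Linux and returns None when a library cannot be found.
from ctypes.util import find_library

for name in ("cudart", "cublas", "cudnn"):
    path = find_library(name)
    print(name, "->", path if path else "NOT FOUND")
```

If any of these come back NOT FOUND, TF will silently fall back to the CPU, which would match the ~30 examples/sec above.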

Please advise, thanks!
nvidia-bug-report.log.gz (1.2 MB)

Additional info:

./bin/x86_64/linux/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Quadro P6000"
  CUDA Driver Version / Runtime Version          10.1 / 10.1
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 24448 MBytes (25635848192 bytes)
  (30) Multiprocessors, (128) CUDA Cores/MP:     3840 CUDA Cores
  GPU Max Clock rate:                            1645 MHz (1.64 GHz)
  Memory Clock rate:                             4513 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 65 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.1, NumDevs = 1
Result = PASS

And:

./bin/x86_64/linux/release/bandwidthTest 
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Quadro P6000
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			13.1

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			4.8

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(GB/s)
   32000000			381.7

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
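Those bandwidth figures can be sanity-checked against theory. Using deviceQuery's memory clock and bus width (and assuming a DDR factor of 2 on the reported clock), the VRAM peak works out to ~433 GB/s; PCIe 3.0 x16 with 128b/130b encoding tops out around 15.75 GB/s per direction:

```python
# Sanity-check bandwidthTest against theoretical peaks.
# deviceQuery above reports a 4513 MHz memory clock and a 384-bit bus;
# a DDR factor of 2 on that clock is assumed here.
mem_clock_hz = 4513e6
bus_width_bytes = 384 / 8
peak_vram = mem_clock_hz * 2 * bus_width_bytes / 1e9  # GB/s
print(f"theoretical VRAM peak: {peak_vram:.1f} GB/s")
print(f"measured D2D: 381.7 GB/s ({381.7 / peak_vram:.0%} of peak)")

# PCIe 3.0 x16: 8 GT/s per lane, 16 lanes, 128b/130b encoding
pcie3_x16 = 8e9 * 16 * (128 / 130) / 8 / 1e9  # GB/s, per direction
print(f"PCIe 3.0 x16 peak: {pcie3_x16:.2f} GB/s")
```

By that yardstick the D2D and H2D numbers look healthy, but D2H at 4.8 GB/s seems low; on a Threadripper it may be worth checking the PCIe link width/generation under load and whether the pinned buffer sits on the memory of the die the GPU is attached to (cross-die transfers go over Infinity Fabric).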

For what it’s worth, here is the CIFAR-10 test on my Lenovo X220 laptop:

I0807 21:35:22.315685 139777471268672 session_manager.py:502] Done running local_init_op.
I0807 21:35:22.895225 139777471268672 basic_session_run_hooks.py:606] Saving checkpoints for 0 into /tmp/cifar10_train/model.ckpt.
2019-08-07 21:35:28.726388: W tensorflow/core/framework/allocator.cc:107] Allocation of 21196800 exceeds 10% of system memory.
2019-08-07 21:36:09.557077: step 0, loss = 4.66 (26.9 examples/sec; 4.757 sec/batch)

Hi,

For what it’s worth (and for the few who courageously read this far), I managed to get performance back to the values posted in the original post (~500 examples/sec).

To do so, on Ubuntu 18.10 / kernel 4.18, here is what I did:
- purge all NVIDIA/CUDA packages
- install NVIDIA driver 418.67
- install cudatoolkit=10.0 and cudnn=7.6.0 inside conda

I’m leaning towards a compatibility issue between the CUDA/cuDNN libraries, the driver, and TF (v1.14.1, inside conda).
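If it helps anyone: as far as I can tell, the prebuilt tensorflow-gpu 1.14 wheels are built against CUDA 10.0 / cuDNN 7.4, not the CUDA 10.1 that the system-wide install provided, which would explain why pinning cudatoolkit=10.0 inside conda fixed it. A small illustrative check (the version table is a hand-picked excerpt based on the published TF build configurations, not exhaustive):

```python
# Illustrative TF-wheel-vs-CUDA-toolkit compatibility check.
# The table is a small excerpt and an assumption, not exhaustive.
TF_BUILD_CUDA = {
    "1.13.1": "10.0",
    "1.14.0": "10.0",  # built against CUDA 10.0 / cuDNN 7.4
}

def wheel_matches_toolkit(tf_version, cuda_version):
    """True if the prebuilt wheel was built against this CUDA toolkit."""
    return TF_BUILD_CUDA.get(tf_version) == cuda_version

print(wheel_matches_toolkit("1.14.0", "10.1"))  # False: the mismatch I hit
print(wheel_matches_toolkit("1.14.0", "10.0"))  # True: the conda fix
```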

I’m still curious: for those of you running a P6000, what performance do you get with the CIFAR-10 benchmark?

Thanks!