Tesla K20 vs Titan X performance for the same code

I have Tesla K20 and Titan X cards in my workstation.
I’m running neural network simulations with the Theano library (CUDA 7.5 + cuDNN v3); the dataset is ~600 MB.

Here are some performance results:

Single simulation
(first number is GPU utilization, second is time to completion):
Titan X: 35%, 12.9 min
K20: 80%, 9.3 min

Two simulations
(identical independent instances of the code, running in parallel):
Titan X: 55%, 17 min
K20: 95%, 16.6 min

Three simulations:
Titan X: 65%, 22 min
K20: 99%, 24.6 min

Four simulations:
Titan X: 70%, 25.8 min
K20: crashes (cannot allocate memory)

Utilization info is from the Nvidia Control Panel’s GPU utilization graph. By the way, where can I see GPU memory usage?

Can anyone explain these differences?

Why isn’t the Titan X utilized more fully for the single simulation? Why is it slower in the single-simulation case? And why can’t the Tesla handle 4 simulations? 4 copies of the dataset (2.4 GB) should fit in its memory (~5 GB), right?

The K20 has less memory than the Titan X, so as you increase memory demands, the K20 will run out of memory before the Titan X does.

You’ve multiplied the dataset size by 4, but that does not mean the dataset is the only memory demand that Theano/cuDNN places on the GPU.

You can view memory usage with nvidia-smi in a console window. Use nvidia-smi --help to understand the various options, or, on Linux, there should be a man page for it.
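
For example, something like the following should work on recent drivers (a sketch; the exact query fields can vary between nvidia-smi versions, so check nvidia-smi --help-query-gpu if they are rejected):

nvidia-smi
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv -l 5
nvidia-smi --query-compute-apps=pid,used_memory --format=csv

The first gives a one-shot overview, the second loops every 5 seconds reporting per-GPU memory, and the third lists per-process memory (GPUs in WDDM mode on Windows may report this as N/A).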

If your Titan X is also driving a display, that may cause it to be slower. And you don’t mention whether this is Linux or Windows, but your Titan X is likely to be somewhat slower on Windows due to the WDDM driver model.
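
On Windows you can check which driver model each GPU is using with something like this (a sketch; the driver_model.* query fields are Windows-only and depend on the nvidia-smi version):

nvidia-smi --query-gpu=index,name,driver_model.current,driver_model.pending --format=csv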

Thanks! I’m on Windows. Neither card is used to display output.
Here’s the nvidia-smi output:

C:\Program Files\NVIDIA Corporation\NVSMI>nvidia-smi.exe
Thu Nov 12 16:06:29 2015

+------------------------------------------------------+
| NVIDIA-SMI 354.35     Driver Version: 354.35         |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT... WDDM  | 0000:01:00.0     Off |                  N/A |
| 27%   67C    P2    79W / 250W |   1654MiB / 12288MiB |     22%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro K600        WDDM  | 0000:02:00.0     Off |                  N/A |
| 25%   48C    P8    N/A /  N/A |    412MiB /  1024MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K20c          TCC  | 0000:03:00.0     Off |                    0 |
| 41%   55C    P0    72W / 225W |   1424MiB /  4799MiB |     45%      Default |
+-------------------------------+----------------------+----------------------+

This is a snapshot taken while a single simulation is running on each card.

What I’d like to understand is why the Tesla is faster in this case, and why the Titan X is not utilized more fully.

TCC vs. WDDM may make a difference. Try putting the Titan in TCC mode; you should be able to do this with nvidia-smi (see the sketch below).
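
Something along these lines, run from an administrator console (a sketch: -dm 1 selects TCC and -dm 0 selects WDDM, GPU index 0 is the Titan X per the output above, a reboot is needed for the change to take effect, and TCC cannot be used on a GPU that is driving a display):

nvidia-smi -i 0 -dm 1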

Wow, that really helped! Now the Titan X is almost twice as fast as before, and is significantly faster than the K20 even in the single-simulation case. The utilization didn’t change, though - still only 35%.

Thanks for the tip!

I wonder if it can go even faster if I switch from the Tesla driver (354.35) to the latest GeForce driver (358.91)…