Pascal Titan X benchmark thread

Got it installed and working in Windows 7 x64. Was able to switch to the TCC driver which was very nice.

The first test I ran was AllanMac’s random memory read benchmark, and WOW!

TITAN X (Pascal) : 28 SM : 12158 MB
Probing from: 256 - 5120 MB ...
alloc MB, probe MB,    msecs,     GB/s
     256,    14336,    39.98,   350.20
     512,    14336,    40.34,   347.05
     768,    14336,    40.49,   345.75
    1024,    14336,    40.53,   345.41
    1280,    14336,    40.53,   345.45
    1536,    14336,    40.60,   344.81
    1792,    14336,    40.63,   344.61
    2048,    14336,    40.67,   344.24
    2304,    14336,    40.66,   344.32
    2560,    14336,    40.65,   344.37
    2816,    14336,    40.67,   344.28
    3072,    14336,    40.68,   344.14
    3328,    14336,    40.67,   344.20
    3584,    14336,    40.69,   344.05
    3840,    14336,    40.68,   344.11
    4096,    14336,    40.68,   344.12
    4352,    14336,    40.69,   344.06
    4608,    14336,    40.65,   344.42
    4864,    14336,    40.69,   344.09
    5120,    14336,    40.71,   343.94

No dropoff!!!
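
For anyone who wants to poke at this themselves, the gist of a random-read probe looks something like the sketch below. This is my own minimal illustration, not AllanMac’s code; the kernel, launch parameters, and hash constants are all made up for the example.

    // Minimal random-read bandwidth probe sketch (illustration only, not
    // AllanMac's benchmark). Each thread walks pseudo-random offsets and
    // accumulates, so the compiler cannot eliminate the loads. Note that a
    // scattered 4-byte read still moves a full memory sector, so GB/s of
    // requested bytes understates the actual DRAM traffic.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void random_read(const unsigned int* buf, size_t words,
                                unsigned int* sink, int reads_per_thread)
    {
        unsigned int x = (blockIdx.x * blockDim.x + threadIdx.x) * 2654435761u + 1u;
        unsigned int acc = 0;
        for (int i = 0; i < reads_per_thread; ++i) {
            x = x * 1664525u + 1013904223u;    // LCG: next pseudo-random offset
            acc += buf[x % words];             // scattered read
        }
        if (acc == 0xFFFFFFFFu) *sink = acc;   // defeats dead-code elimination
    }

    int main()
    {
        const size_t bytes = size_t(1024) << 20;   // 1 GB allocation
        const size_t words = bytes / sizeof(unsigned int);
        unsigned int *buf, *sink;
        cudaMalloc(&buf, bytes);
        cudaMalloc(&sink, sizeof(unsigned int));
        cudaMemset(buf, 1, bytes);

        const int blocks = 4096, threads = 256, reads = 1024;
        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);
        cudaEventRecord(t0);
        random_read<<<blocks, threads>>>(buf, words, sink, reads);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        float ms;
        cudaEventElapsedTime(&ms, t0, t1);
        double gb = double(blocks) * threads * reads * sizeof(unsigned int) / 1e9;
        printf("%8.2f msecs, %8.2f GB/s\n", ms, gb / (ms / 1e3));
        return 0;
    }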

Will be running other tests throughout the day, so if there are any requests, post them here.

Oh and CUDA-Z shows 12.5 Teraflops for 32-bit float compute.

Woot!

Jimmy P’s bandwidth test:

TITAN X (Pascal) @ 480.480 GB/s

N                [GB/s]   [perc]    [usec]   test
1048576          240.25    50.00      17.5   Pass
2097152          292.82    60.94      28.6   Pass
4194304          322.16    67.05      52.1   Pass
8388608          342.68    71.32      97.9   Pass
16777216         355.41    73.97     188.8   Pass
33554432         361.96    75.33     370.8   Pass
67108864         365.12    75.99     735.2   Pass
134217728        366.94    76.37    1463.1   Pass

Non-power-of-two tests!

N                [GB/s]   [perc]    [usec]   test
14680102         353.38    73.55     166.2   Pass
14680119         353.40    73.55     166.2   Pass
18875600         348.98    72.63     216.4   Pass
7434886          214.26    44.59     138.8   Pass
13324075         320.04    66.61     166.5   Pass
15764213         333.42    69.39     189.1   Pass
1850154           85.38    17.77      86.7   Pass
4991241          194.24    40.43     102.8   Pass

The test probably needs to be tweaked for Pascal, but the results are a bit better than those for the GTX 1080 in terms of percentage of peak.
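
For reference, the core of a test like this boils down to a timed device-to-device copy kernel; below is a minimal sketch (my own, not Jimmy P’s actual code, with made-up launch parameters).

    // Minimal device-to-device copy bandwidth sketch (not the actual test).
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void copy_kernel(const float* __restrict__ in,
                                float* __restrict__ out, size_t n)
    {
        for (size_t i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n; i += size_t(gridDim.x) * blockDim.x)
            out[i] = in[i];                    // one read + one write per element
    }

    int main()
    {
        const size_t n = 134217728;            // element count, as in the table
        float *in, *out;
        cudaMalloc(&in,  n * sizeof(float));
        cudaMalloc(&out, n * sizeof(float));
        cudaMemset(in, 0, n * sizeof(float));

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);
        cudaEventRecord(t0);
        copy_kernel<<<1024, 256>>>(in, out, n);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        float ms;
        cudaEventElapsedTime(&ms, t0, t1);
        // Each element moves twice over the bus: one load, one store.
        double gbps = 2.0 * n * sizeof(float) / 1e9 / (ms / 1e3);
        printf("N=%zu  %.2f GB/s  %.1f usec\n", n, gbps, ms * 1e3);
        return 0;
    }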

Bandwidth test from the CUDA 8.0 SDK:

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: TITAN X (Pascal)
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     11855.6

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     12883.6

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     344028.2

Result = PASS
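
Those ~12 GB/s host transfer numbers are about what PCIe 3.0 x16 with pinned memory delivers. The measurement itself reduces to a pattern like this minimal sketch (my own version, not the SDK source):

    // Timing a pinned host-to-device transfer; sketch only, not the SDK code.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        const size_t bytes = 33554432;         // 32 MB, as in the SDK test
        float *h, *d;
        cudaMallocHost(&h, bytes);             // pinned (page-locked) host buffer
        cudaMalloc(&d, bytes);

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);
        cudaEventRecord(t0);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        float ms;
        cudaEventElapsedTime(&ms, t0, t1);
        printf("H2D: %.1f MB/s\n", bytes / 1e6 / (ms / 1e3));
        cudaFreeHost(h);
        cudaFree(d);
        return 0;
    }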

Generation and evaluation of all 13! arrangements of an array in memory against a linear test function, in 0.75 seconds:

Using single GPU TITAN X (Pascal)

Starting value= -7919.02 , goal value= 111.493, number of floating point values in array= 13
Objective: To minimize the absolute difference between a processed starting value and the target value.
The order in which the values are fed into the test function produces a set of distinct results dependent on that order.
In this test case there should be only one optimal value and one corresponding permutation of inputs which generates that value.
The results will be validated by CPU std::next_permutation(), and the performance difference between the CUDA and CPU implementations will be compared.
NOTE: CPU version may take a long time to finish!

Starting GPU testing:

Will evaluate 6227020800 permutations of array and return an optimal permutation and the optimal value associated with that permutation.

num_blx= 47508, adj_size= 1

Testing 13! version.
GPU timing: 0.759 seconds.
GPU answer is: 8783.86

Permutation as determined by the CUDA implementation is as follows:
Start value= -7919.02
Using idx # 4 ,input value= -12345.7, current working return value= -8645.24
Using idx # 8 ,input value= -1111.2, current working return value= -8700.8
Using idx # 1 ,input value= -333.145, current working return value= -8728.56
Using idx # 6 ,input value= -27.79, current working return value= -8730.29
Using idx # 12 ,input value= -42.0099, current working return value= -8732.29
Using idx # 11 ,input value= -1.57, current working return value= -8732.38
Using idx # 9 ,input value= 0.90003, current working return value= -8732.32
Using idx # 5 ,input value= 2.47, current working return value= -8732.1
Using idx # 10 ,input value= 10.1235, current working return value= -8731.42
Using idx # 7 ,input value= 8.888, current working return value= -8730.61
Using idx # 2 ,input value= 7.1119, current working return value= -8729.19
Using idx # 3 ,input value= 127.001, current working return value= -8703.79
Using idx # 0 ,input value= 31.4234, current working return value= -8672.37

Absolute difference(-8672.37-111.493)= 8783.86

As a point of reference, the Maxwell Titan X took about 1.25 seconds for the same task and the GTX 1080 about 1.05 seconds.
The CPU std::next_permutation() for the same task on an overclocked 4.5 GHz i7 took 128 seconds.

Source code:

https://sites.google.com/site/cudapermutations/
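
The usual trick for brute-forcing all n! orderings on a GPU is to decode each linear work-item index into a distinct permutation via the factorial number system (Lehmer code). A sketch of that decode step is below; it is my own illustration, and the linked source may do it differently.

    // Decode a linear index in [0, n!) into the idx-th lexicographic
    // permutation of {0, 1, ..., n-1} via the factorial number system.
    // Illustrative sketch only.
    __device__ void index_to_permutation(unsigned long long idx, int n, int* perm)
    {
        int pool[13];                          // n <= 13 keeps n! within 64 bits
        for (int i = 0; i < n; ++i) pool[i] = i;   // remaining unused elements

        unsigned long long fact = 1;
        for (int i = 2; i < n; ++i) fact *= i;     // (n-1)!

        for (int i = 0; i < n; ++i) {
            int digit = (int)(idx / fact);         // next factorial digit
            idx %= fact;
            perm[i] = pool[digit];
            for (int j = digit; j < n - 1 - i; ++j)
                pool[j] = pool[j + 1];             // remove the chosen element
            if (n - 1 - i > 1) fact /= (n - 1 - i);
        }
    }

Each thread can then evaluate the test function on its own permutation and take part in a max/min reduction over the 6,227,020,800 results.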

The TITAN X is a BEAST!

For thrust::sort() of an array of 134,217,728 randomly generated 32-bit floating-point values, including all memory copies in both directions, averaged over 8 iterations:

Num Elements= 134217728

Using device number 0 which is a TITAN X (Pascal)

number of sorting iterations= 8
GPU timing average: 0.131625 seconds including all memory copies both directions.
CPU timing: 12.82 seconds.

Error= 0

Max Difference= 0

This is slightly better than the Maxwell Titan X, though of that 131 ms roughly 85 ms is memory copy time, so the sort alone takes about 46 ms.
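
For anyone wanting to reproduce the measurement, the timed region presumably wraps the copies and the sort together, along these lines (a minimal sketch with made-up data generation, not the exact test harness):

    // thrust::sort timing sketch including both memory copies; sketch only.
    #include <cstdio>
    #include <cstdlib>
    #include <vector>
    #include <cuda_runtime.h>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/copy.h>

    int main()
    {
        const size_t n = 134217728;
        std::vector<float> h(n);
        for (size_t i = 0; i < n; ++i)
            h[i] = rand() / float(RAND_MAX) * 2e6f - 1e6f;   // random floats

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);
        cudaEventRecord(t0);

        thrust::device_vector<float> d(h.begin(), h.end()); // H2D copy
        thrust::sort(d.begin(), d.end());
        thrust::copy(d.begin(), d.end(), h.begin());        // D2H copy

        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        float ms;
        cudaEventElapsedTime(&ms, t0, t1);
        printf("sort + copies: %.3f seconds\n", ms / 1e3);
        return 0;
    }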

CudaaduC wrote:

No dropoff!!!

Now that’s more like it! Regarding the recent discussion about memory P-states, is the result in the original post using the GDDR5X at full speed?

... so if there are any requests, post them here.

Looking at the latest CompuBench “Ocean Surface Simulation” results, is it possible to explore why (in this one test) the AMD Fury X can still match the Pascal Titan X?

thanks, and great work!

Unless I am seriously mistaken, the Fury X has an HBM memory subsystem with higher throughput than the GDDR5X memory subsystem of the Titan X (Pascal): 512 GB/s versus 480 GB/s, so roughly 7% more bandwidth.

While the lack of a memory performance drop-off is great news, I am a bit disappointed about the memory bandwidth of the new Titan X in absolute terms, as the FLOPS have grown a lot more than the bandwidth compared to the old Titan X.

I just ran some minimum value reduction kernels, and the best bandwidth number I have seen yet using the grid-stride method is about 368 GB/s. That is 76% of the 480 GB/s maximum. The Maxwell Titan X was able to reach about 90% of its maximum at best (305 GB/s out of 336 GB/s).
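
For context, the grid-stride pattern in question looks roughly like the sketch below (illustrative only, not the exact benchmark kernel); effective bandwidth is then n * sizeof(float) divided by kernel time, since the input is read exactly once.

    // Grid-stride minimum reduction sketch; illustration only. Each thread
    // strides over the input, the block reduces in shared memory, and block
    // results are folded into one value with an atomic.
    #include <cfloat>
    #include <cuda_runtime.h>

    __global__ void min_reduce(const float* __restrict__ in, size_t n, float* out)
    {
        __shared__ float smem[256];            // assumes blockDim.x == 256
        float v = FLT_MAX;
        for (size_t i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n; i += size_t(gridDim.x) * blockDim.x)
            v = fminf(v, in[i]);               // one coalesced read per element

        smem[threadIdx.x] = v;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                smem[threadIdx.x] = fminf(smem[threadIdx.x], smem[threadIdx.x + s]);
            __syncthreads();
        }
        // Bit-pattern atomicMin is order-preserving for non-negative floats
        // only; full range needs a sign-flip trick or a second pass.
        if (threadIdx.x == 0)
            atomicMin((int*)out, __float_as_int(smem[0]));
    }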

NVVP does not recognize the Pascal GTX Titan X yet, so I cannot profile on that GPU. I assume this will be fixed when they finally release CUDA 8.

In general the memory-bound applications I have been running are about 15-20% faster than when using the Maxwell Titan X, which is about the same as the difference in these memory bandwidth numbers.

Could you give this one a try, kinda curious if it still works on newer cards:

http://www.skybuck.org/CUDA/BandwidthTest/version%200.16/Packed/TestCudaMemoryBandwidthPerformance.rar

Then post log and/or graphs.

From the very brief testing I did, it seems that the memory bandwidth issues noticed on the GTX 1080 likely affect the TITAN X too. The memory clock behaved similarly, stuck at 4513 MHz, with measured bandwidth around 75% of peak.

njuffa wrote:

While the lack of a memory performance drop-off is great news, I am a bit disappointed about the memory bandwidth of the new Titan X in absolute terms, as the FLOPS have grown a lot more than the bandwidth compared to the old Titan X.

CudaaduC wrote:

NVVP does not recognize the Pascal GTX Titan X yet, so I cannot profile on that GPU. I assume this will be fixed when they finally release CUDA 8.

pszilard wrote:

The memory clock behaved similarly, stuck at 4513 MHz, with measured bandwidth around 75% of peak.

Given that some (many?) of us would be happy to trade a few of those 12 TFLOPs for extra bandwidth, here’s a thought experiment you might be able to try, once the tools catch up:

Given the voltage and power limits for the Pascal Titan X, and that versions (?) of GDDR5X can run much faster than the advertised 10 Gbps (and that NVIDIA might be using the best available GDDR5X chips in this pricey prosumer flagship), if you drop your GPU clock by 20% (and lower GPU voltage to match), how high can you crank those GDDR5X chips?

Does the Pascal Titan X support fast FP16? That’s the real question as to the value of this card. If it does, then it’s good value. If not, better to have a farm of cheaper cards.

Pretty quiet thread. So does everyone assume it has slow FP16? No one has even mentioned it anywhere I have looked.

@LukeCuda, the Pascal TITAN X is an sm_61 device so it’s going to have the same FP16 support as the other 10-series GeForce devices.
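
In other words, the half-precision intrinsics compile and run on sm_61; they are just not fast (on these parts the paired FP16 rate is a tiny fraction of FP32 throughput; 1/64, if I recall correctly). A minimal sketch of the paired-operand path, illustration only:

    // half2 FMA sketch for sm_61 (compile with: nvcc -arch=sm_61 ...).
    // This shows the programming model, not speed: FP16x2 throughput on
    // sm_61 is far below FP32.
    #include <cuda_fp16.h>

    __global__ void hfma2_kernel(const __half2* a, const __half2* b,
                                 __half2* c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = __hfma2(a[i], b[i], c[i]);  // two FP16 FMAs per instruction
    }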

Also, it looks like NVIDIA just updated the CUDA GPUs page:

Too many TITAN cards, NVIDIA.

Here’s a large list of TITAN names.

So the name of the Pascal Titan is “NVIDIA Titan X” and the old Maxwell version is “GeForce GTX Titan X”? But the actual Pascal card boasts a giant self-illuminated “GeForce GTX” logo in glowing green. The release photo showing this comes directly from the new NVIDIA Pascal Titan X page.

Thanks Allan! I contacted NVIDIA, but they have been ‘researching’ for a week!

So the Titan X is worthless for deep learning if you’re building an AI GPU farm! Good job, NVIDIA.

LukeCuda - my guess is NVIDIA had to choose between GP100 and GP104 as a base for deriving this version of the Titan. Since (affordable) HBM is not quite ready (and since GV100 is not far off), the logical choice was to bake a bigger GP104. Hopefully we’ll get some love next time round.

So my one real question is what the new memory setup does to cuFFT speeds. Most of my project is based around doing lots of single-precision FFTs. Since these are generally memory limited, I am curious whether the extra memory bandwidth of this Titan (vs., say, the Maxwell Titan) makes a difference in any meaningful way. I was really holding out hope for HBM on the Titan, but I can understand the many reasons why they went with GDDR5X as a middle ground.
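
Once a card is in hand, that question is easy to answer directly by timing a batched single-precision transform; here is a minimal cuFFT sketch (the transform size and batch count are placeholders, not suggestions for your workload):

    // Minimal single-precision C2C cuFFT timing sketch; sizes are placeholders.
    #include <cstdio>
    #include <cufft.h>
    #include <cuda_runtime.h>

    int main()
    {
        const int n = 1 << 20, batch = 64;     // placeholder size and batch
        cufftComplex* d;
        cudaMalloc(&d, sizeof(cufftComplex) * size_t(n) * batch);

        cufftHandle plan;
        cufftPlan1d(&plan, n, CUFFT_C2C, batch);

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);
        cudaEventRecord(t0);
        cufftExecC2C(plan, d, d, CUFFT_FORWARD);   // in-place forward transform
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        float ms;
        cudaEventElapsedTime(&ms, t0, t1);
        printf("%d x %d-point C2C: %.3f ms\n", batch, n, ms);

        cufftDestroy(plan);
        cudaFree(d);
        return 0;
    }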