Grim memory bandwidth GTX 1080

Maybe I am missing something, but I am seeing really bad memory bandwidth performance relative to the Titan X:

CUDA 8.0, compiled for compute 6.1, Windows 7, most recent driver 368.39

Jimmy P’s bandwidth test:

GeForce GTX 1080 @ 320.320 GB/s

 N           [GB/s]   [perc]   [usec]   test
 1048576     167.58    52.32     25.0   Pass
 2097152     201.28    62.84     41.7   Pass
 4194304     220.49    68.83     76.1   Pass
 8388608     233.88    73.01    143.5   Pass
 16777216    241.08    75.26    278.4   Pass
 33554432    244.60    76.36    548.7   Pass
 67108864    246.62    76.99   1088.5   Pass
 134217728   247.62    77.30   2168.1   Pass

 Non-base 2 tests!

 N           [GB/s]   [perc]   [usec]   test
 14680102    241.17    75.29    243.5   Pass
 14680119    241.07    75.26    243.6   Pass
 18875600    239.46    74.76    315.3   Pass
 7434886     168.07    52.47    176.9   Pass
 13324075    224.64    70.13    237.2   Pass
 15764213    232.21    72.49    271.6   Pass
 1850154      78.47    24.50     94.3   Pass
 4991241     155.81    48.64    128.1   Pass

And worst of all, AllanMac’s random memory read test:

GeForce GTX 1080 : 20 SM : 8192 MB
Probing from: 256 - 5120 MB ...
alloc MB, probe MB,    msecs,     GB/s
     256,    14336,    86.03,   162.73
     512,    14336,    88.77,   157.72
     768,    14336,    93.01,   150.52
    1024,    14336,    95.23,   147.02
    1280,    14336,  1351.07,    10.36
    1536,    14336,  2346.30,     5.97
    1792,    14336,  3096.08,     4.52
    2048,    14336,  3678.75,     3.81
    2304,    14336,  4140.27,     3.38
    2560,    14336,  4519.34,     3.10
    2816,    14336,  4832.45,     2.90
    3072,    14336,  5097.10,     2.75
    3328,    14336,  5321.54,     2.63
    3584,    14336,  5519.53,     2.54
    3840,    14336,  5696.16,     2.46
    4096,    14336,  5849.73,     2.39
    4352,    14336,  5987.34,     2.34
    4608,    14336,  6105.49,     2.29
    4864,    14336,  6212.00,     2.25
    5120,    14336,  6307.61,     2.22

The second test is just terrible; even the GTX 980 under WDDM does not drop off until about 2304 MB.

The only applications which run faster than on the Titan X are purely compute-bound, and even then the difference is only about 25%.

Extremely disappointing so far, but maybe someone can comment on possible fixes?

Am I missing something?

Tested my super-simple 1024-bin histogram code and saw the same issue: about 240 GB/s for the GTX 1080 and about 305 GB/s for the GTX Titan X.

Not even close to the purported 320 GB/s; the best I have measured is about 247 GB/s.

I wonder if the GTX 10x0 cards default to executing CUDA kernels at a lower MEM clock similar to the Maxwell v2 cards?

Here was the incantation that boosted the MEM speed.
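
For reference, the incantation is presumably the application-clocks interface of nvidia-smi, something along these lines (the 5005/1911 MHz values are the ones quoted in the reply below):

nvidia-smi -q -d SUPPORTED_CLOCKS   // list the supported memory,graphics clock pairs
nvidia-smi -ac 5005,1911            // set application clocks: memory MHz, graphics MHz
nvidia-smi -q -d CLOCK              // confirm the clocks the GPU is actually running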

Allan,

I already set that first thing to 5005 for memory and 1911 for compute via NVSMI. All posted results are after the manual boost to both the memory and compute clocks.

Did WDDM 2.0 somehow get into my Win7 machine?

I have had zero exposure to GDDR5X. I assume this is a card bought at retail, not an engineering sample or some other non-standard part. Some thoughts:

(1) Make sure the test uses a sufficient grid size. I would recommend initially aiming for 20 “waves” of thread blocks, at 256 threads per thread block (see the sketch after this list). GDDR5X may have higher latency compared to GDDR5, requiring an increase in concurrency to reach the full bandwidth.

(2) In the past, memory subsystem performance was sometimes limited by mechanisms operating at core frequency, and full memory throughput was only reached with the core clocked above default, so run at the highest application clock that nvidia-smi allows setting.

(3) GDDR requires “training” of the receivers in the memory interface. If there is an electrically noisy wire, it may result in frequent re-training, sapping throughput. Such issues may be more prevalent on some cards but not others, or with memory from some vendors but not others. I wouldn’t draw conclusions until having tried at least two cards from two different vendors.

(4) The configuration of the GPU memory interface is usually done via VBIOS (I think the driver may override, but not sure), so make sure you have the latest version installed for your particular card. Given that what is shipping now are initial runs of new hardware, the VBIOS settings may have started out on the conservative side.
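
A minimal sketch of what suggestion (1) means in practice, assuming a plain streaming-copy kernel rather than the actual test code: derive the grid size from the SM count so that roughly 20 waves of 256-thread blocks are in flight.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void copyKernel(const float4 *src, float4 *dst, size_t n)
{
    // grid-stride loop, so any grid size streams over the whole buffer
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        dst[i] = src[i];
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int threads = 256;
    // one "wave" = as many 256-thread blocks as the GPU can keep resident at
    // once (assuming the kernel itself is not the occupancy limiter)
    int blocksPerWave = prop.multiProcessorCount *
                        (prop.maxThreadsPerMultiProcessor / threads);
    int blocks = 20 * blocksPerWave;        // aim for ~20 waves of thread blocks

    size_t n = 1ull << 26;                  // 64M float4 elements = 1 GiB per buffer
    float4 *src, *dst;
    cudaMalloc(&src, n * sizeof(float4));
    cudaMalloc(&dst, n * sizeof(float4));

    copyKernel<<<blocks, threads>>>(src, dst, n);
    cudaDeviceSynchronize();
    printf("launched %d blocks (%d per wave)\n", blocks, blocksPerWave);
    return 0;
}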

One more dumb question… do you actually see the MEM clock reaching 5005 MHz in GPU-Z (or equivalent)?

Also, with CUDA 8 RC, bandwidthTest should now be in your path so:

bandwidthTest --dtod --mode=range --start=1073741824 --end=1073741824 --increment=1

should be close to the max?

nvidia-smi -rac          // resets app clocks
bandwidthTest ... -> 166439 MB/s
nvidia-smi -ac 3505,1531 // max app clocks
bandwidthTest ... -> 193423 MB/s

And probe_bw with friendly arguments can produce numbers much closer to theoretical:

> probe_bw 0 64 128 1 1024
GeForce GTX 980 : 16 SM : 4096 MB
Probing from: 64 - 128 MB ...
alloc MB, probe MB,    msecs,     GB/s
      64,   458752,  2195.56,   204.05
      65,   458752,  2197.35,   203.88
      66,   458752,  2199.09,   203.72
      67,   458752,  2200.84,   203.56
      68,   458752,  2202.49,   203.41

If the 1080 can’t do this then that’s a surprise to us all!

Trying with a lower memory clock but a high core clock might also be instructive. A higher re-try rate for memory transactions at high memory clocks seems at least a possible explanation for degraded memory performance.
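
If nvidia-smi exposes more than one supported memory clock for this card (on GeForce parts it may not), that experiment would look roughly like the following, where <lower_mem_clock> is a placeholder for whichever lower memory clock is listed:

nvidia-smi -q -d SUPPORTED_CLOCKS       // see which memory,graphics pairs are allowed
nvidia-smi -ac <lower_mem_clock>,1911   // lower memory clock, keep the core clock high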

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v8.0\bin\win64\Release>bandwidthTest --dtod --mode=range --start=1073741824 --end=1073741824 --increment=1
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GTX 1080
 Range Mode

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   1073741824                   234456.7

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

GPU-Z shows a number close to 2505 MHz when running the bandwidth test, which seems to be about half of what it is supposed to be. The situation is the same with GPU-Z on the Titan X, which shows 1700 MHz.

Maybe GDDR5X reports quarter rate clocks?

Perhaps try probe_bw with a tiny starting allocation?

You can see where the cache no longer impacts bandwidth:

C:\temp\probe_bw\f91b67c112bcba98649d>probe_bw 0 4 128 4 1024
GeForce GTX 980 : 16 SM : 4096 MB
Probing from: 4 - 128 MB ...
alloc MB, probe MB,    msecs,     GB/s
       4,   458752,   916.47,   488.83
       8,   458752,  1489.29,   300.81
      12,   458752,  1730.24,   258.92
      16,   458752,  1867.19,   239.93
      20,   458752,  1951.63,   229.55
      24,   458752,  2009.09,   222.99
      ...

Whatever the case, your bandwidthTest’s 240 GB/s is a very reliable measurement.

I would expect that copying 1GB would demonstrate peak performance… :(
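
For anyone who wants to cross-check the bandwidthTest figure without the CUDA samples, a minimal event-timed device-to-device copy would look roughly like this (my own sketch; it counts both the read and the write, since copying 1 GiB moves 2 GiB across the memory bus, which I believe is also how bandwidthTest computes its D2D number):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 1ull << 30;        // 1 GiB, same size as the range test above
    void *src, *dst;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);   // warm-up

    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("D2D: %.1f GB/s\n", 2.0 * bytes / (ms * 1e-3) / 1e9);  // read + write traffic
    return 0;
}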

Not sure if this adds anything to the discussion but:

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GTX 1080
 Shmoo Mode

................................................................................
.
 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   1024                         1003.9
   2048                         1907.3
   3072                         2861.0
   4096                         3814.7
   5120                         4768.4
   6144                         5722.0
   7168                         6675.7
   8192                         8030.9
   9216                         8174.4
   10240                        4890.6
   11264                        4561.1
   12288                        4087.2
   13312                        3492.3
   14336                        3423.4
   15360                        3365.9
   16384                        15258.8
   17408                        16212.5
   18432                        17166.1
   19456                        17257.0
   20480                        19073.5
   22528                        22085.1
   24576                        22888.2
   26624                        26100.6
   28672                        28108.3
   30720                        30116.0
   32768                        32123.8
   34816                        34131.5
   36864                        34332.3
   38912                        38147.0
   40960                        40154.7
   43008                        42162.4
   45056                        44170.2
   47104                        46177.9
   49152                        48185.7
   51200                        50193.4
   61440                        57220.5
   71680                        64247.5
   81920                        73315.5
   92160                        80427.0
   102400                       91233.4
   204800                       158945.7
   307200                       228881.8
   409600                       275865.1
   512000                       314290.2
   614400                       365480.0
   716800                       389467.8
   819200                       404876.7
   921600                       430836.4
   1024000                      408261.9
   1126400                      251455.7
   2174976                      203706.3
   3223552                      212331.4
   4272128                      215448.6
   5320704                      215476.8
   6369280                      218371.4
   7417856                      216130.8
   8466432                      221226.0
   9515008                      217319.5
   10563584                     216847.6
   11612160                     219838.2
   12660736                     223411.3
   13709312                     223659.2
   14757888                     224031.8
   15806464                     220378.2
   16855040                     219698.8
   18952192                     221324.2
   21049344                     224974.7
   23146496                     221910.3
   25243648                     221325.7
   27340800                     225368.4
   29437952                     224040.3
   31535104                     224541.7
   33632256                     226193.4
   37826560                     222195.4
   42020864                     222205.9
   46215168                     224758.4
   50409472                     214537.4
   54603776                     222770.9
   58798080                     221794.3
   62992384                     221816.2
   67186688                     222724.9

Result = PASS

I should have mentioned that this was using an old (CUDA 5.5) version of bandwidthTest.

GeForce GTX 1080 : 20 SM : 8192 MB
Probing from: 4 - 128 MB ...
alloc MB, probe MB,    msecs,     GB/s
       4,   458752,   803.50,   557.56
       8,   458752,  1439.18,   311.29
      12,   458752,  1801.37,   248.70
      16,   458752,  2023.22,   221.43
      20,   458752,  2166.84,   206.75
      24,   458752,  2266.93,   197.62
      28,   458752,  2340.18,   191.44
      32,   458752,  2396.21,   186.96
      36,   458752,  2440.00,   183.61
      40,   458752,  2475.62,   180.97
      44,   458752,  2504.87,   178.85
      48,   458752,  2529.43,   177.11
      52,   458752,  2550.20,   175.67
      56,   458752,  2568.16,   174.44
      60,   458752,  2583.81,   173.39
      64,   458752,  2597.54,   172.47
      68,   458752,  2609.51,   171.68
      72,   458752,  2620.09,   170.99
      76,   458752,  2629.85,   170.35
      80,   458752,  2638.49,   169.79
      84,   458752,  2646.42,   169.29
      88,   458752,  2653.44,   168.84
      92,   458752,  2659.81,   168.43
      96,   458752,  2665.84,   168.05
     100,   458752,  2671.18,   167.72
     104,   458752,  2676.26,   167.40
     108,   458752,  2680.84,   167.11
     112,   458752,  2685.08,   166.85
     116,   458752,  2689.13,   166.60
     120,   458752,  2692.79,   166.37
     124,   458752,  2696.12,   166.16
     128,   458752,  2699.45,   165.96

I have a GTX Titan X connected to the display using WDDM, and here is the output for the larger test in the same workstation:

GeForce GTX TITAN X : 24 SM : 12288 MB
Probing from: 256 - 3072 MB ...
alloc MB, probe MB,    msecs,     GB/s
     256,    14336,    62.56,   223.77
     512,    14336,    82.18,   170.35
     768,    14336,    91.44,   153.10
    1024,    14336,    96.08,   145.71
    1280,    14336,    98.88,   141.58
    1536,    14336,   100.59,   139.18
    1792,    14336,   101.92,   137.36
    2048,    14336,   102.97,   135.97
    2304,    14336,   453.11,    30.90
    2560,    14336,   829.54,    16.88
    2816,    14336,  1144.72,    12.23
    3072,    14336,  1394.51,    10.04

Then the same test with the same parameters on the GTX 1080, which is not connected to the display:

GeForce GTX 1080 : 20 SM : 8192 MB
Probing from: 256 - 3072 MB ...
alloc MB, probe MB,    msecs,     GB/s
     256,    14336,    86.04,   162.72
     512,    14336,    88.60,   158.01
     768,    14336,    92.89,   150.72
    1024,    14336,    95.05,   147.29
    1280,    14336,  1349.74,    10.37
    1536,    14336,  2346.90,     5.97
    1792,    14336,  3094.10,     4.52
    2048,    14336,  3670.97,     3.81
    2304,    14336,  4133.45,     3.39
    2560,    14336,  4512.10,     3.10
    2816,    14336,  4826.02,     2.90
    3072,    14336,  5089.95,     2.75

And then my other Titan X in TCC mode, from an earlier measurement:

GeForce GTX TITAN X : 24 SM : 12287 MB
Probing from: 256 - 5120 MB ...
alloc MB, probe MB,    msecs,     GB/s
     256,    14336,    65.84,   212.63
     512,    14336,    73.56,   190.33
     768,    14336,    81.83,   171.10
    1024,    14336,    85.98,   162.83
    1280,    14336,    88.48,   158.24
    1536,    14336,    90.14,   155.32
    1792,    14336,    91.33,   153.29
    2048,    14336,    92.24,   151.78
    2304,    14336,   158.51,    88.32
    2560,    14336,   255.79,    54.73
    2816,    14336,   343.30,    40.78
    3072,    14336,   418.63,    33.44

So while the TCC driver helps, even without it the Titan X using WDDM and connected to the display is vastly better than the GTX 1080. Not even close.

I mean, please: 10 GB/s for allocations above 1024 MB on the ‘faster’ GTX 1080, when the Titan X is still around 160 GB/s? This is no trivial difference.

I feel like a fool for dropping $1,500 on two of these ‘Founder Edition’ GPUs.

I share your annoyance but I also feel like there must be something we’re still missing. How can this card be performing so well in gaming benchmarks? I know it is supposed to have improved delta colour compression but still.

Per Allan’s suggestion, I changed the type for the random read from 32-bit int to uint2 and uint4:

uint2:

GeForce GTX 1080 : 20 SM : 8192 MB
Probing from: 256 - 3072 MB ...
alloc MB, probe MB,    msecs,     GB/s
     256,    14336,    60.40,   231.77
     512,    14336,    60.68,   230.73
     768,    14336,    60.90,   229.90
    1024,    14336,    60.93,   229.78
    1280,    14336,   674.32,    20.76
    1536,    14336,  1173.48,    11.93
    1792,    14336,  1547.71,     9.05
    2048,    14336,  1836.11,     7.62
    2304,    14336,  2067.65,     6.77
    2560,    14336,  2256.71,     6.20
    2816,    14336,  2413.75,     5.80
    3072,    14336,  2545.81,     5.50

uint4:

GeForce GTX 1080 : 20 SM : 8192 MB
Probing from: 256 - 3072 MB ...
alloc MB, probe MB,    msecs,     GB/s
     256,    14336,    59.43,   235.56
     512,    14336,    59.74,   234.35
     768,    14336,    59.91,   233.70
    1024,    14336,    59.96,   233.50
    1280,    14336,   336.78,    41.57
    1536,    14336,   585.28,    23.92
    1792,    14336,   771.99,    18.14
    2048,    14336,   916.03,    15.28
    2304,    14336,  1032.12,    13.56
    2560,    14336,  1126.32,    12.43
    2816,    14336,  1204.77,    11.62
    3072,    14336,  1270.85,    11.02

So Allan was correct that this improved performance a bit, but there is still that huge drop-off after 1024 MB, which does not occur on Maxwell GPUs until allocations exceed 2048 MB.
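
For readers not following along with the code, the change amounts to widening each random gather. A bare sketch of the uint4 version of the idea (not AllanMac’s actual probe code; buf, idx, and out are illustrative names):

__global__ void randomRead16(const uint4 *buf, const unsigned *idx,
                             unsigned *out, size_t n)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        uint4 v = buf[idx[i]];              // one 16-byte load from a random location
        out[i] = v.x ^ v.y ^ v.z ^ v.w;     // fold the value so the load is not optimized away
    }
}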

Any “memory bound” application I have runs much slower on the GTX 1080 compared to the Titan X. Purely compute-bound applications are 20-25% faster on the GTX 1080, but since I work in medical imaging this matters much less than memory bandwidth.

Quick, where are the conspiracy theorists telling us about evil ploys …

I looked at the JEDEC GDDR5X specification, and it does mention QDR as well as DDR operation, so seeing GPU-Z report mem clock as 1/4 of the effective data rate of 10 GHz seems plausible.
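
As a quick sanity check on the numbers: the 2505 MHz that GPU-Z reports, times 4 for QDR, is roughly 10 Gbps per pin, and 10 Gbps across the GTX 1080’s 256-bit bus divided by 8 bits per byte gives 320 GB/s, matching the card’s rated bandwidth.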

I have not seen a detailed analysis of GDDR5X as used with GPUs anywhere yet, and my own in-depth experience with DRAM stopped with single data rate SDRAM. So not sure what we may be missing.

I wonder whether the VBIOS on these cards may be using some conservative fail-safe settings rather than performance settings. Is a newer VBIOS available from the vendor site?

As for feeling like a fool, is simply returning (or worst case, RMAing) the cards not an option?

In an older thread Genoil remarked that the P100 has a changed TLB page size and perhaps changed TLB logic.

In March, txbob said “With modifications to TLB structure (e.g. page sizes, TLB cache size/structure) in future GPUs, it should be possible theoretically to ameliorate this behavior. I can’t make any forward looking statements about future GPUs or plans at this time…” Perhaps we’ll learn more about these possible ameliorations now that P100 and GP104 have shipped.

Except here the observed behavior seems to point to “apeioration” rather than amelioration (I knew knowledge of Latin would come in handy some day :-)

The 160 GB/sec in your first test is suspiciously half of 320 GB/sec.

I found the following bandwidth test data reported for a GTX 1080 under Linux, which looks more like what one would expect (https://www.pugetsystems.com/labs/hpc/GTX-1080-CUDA-performance-on-Linux-Ubuntu-16-04-preliminary-results-nbody-and-NAMD-803/)

Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			278829.1

You may want to boot your system into Linux to see whether the results for your system show material differences when operating under Linux vs Windows.

278.8 GB/s is much better!