Grim memory bandwidth GTX 1080

Well, on CPUs memory access becomes slower once the page tables grow larger than the LLC (last-level cache). We probably see the same behaviour on GPUs. The 1080 has a smaller cache than the Titan X, so that could explain it. Compare with a 980 non-Ti if you can.
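A rough back-of-envelope, assuming 4 KB pages and something like 8 bytes of translation data per page (both guesses): a 2 GB working set is 2 GB / 4 KB = 524,288 pages, i.e. roughly 4 MB of page-table entries, which already exceeds a 2 MB L2. With 2 MB pages the same working set needs only 1,024 entries (~8 KB), so the translations fit easily.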

the more important question is how to use those 2 MB pages???
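One possibility (pure speculation on my part, since CUDA does not expose page size) is that fewer, larger allocations give the driver a chance to back memory with big pages and keep the number of mappings small. A minimal sub-allocation sketch along those lines:

#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>

// Hypothetical bump allocator: one big cudaMalloc up front, hand out
// 2 MB-aligned chunks from it. The 2 MB alignment is purely an assumption;
// the driver ultimately decides what page size backs the allocation.
struct DevicePool {
    char*  base = nullptr;
    size_t size = 0;
    size_t used = 0;

    bool init(size_t bytes) {
        if (cudaMalloc(reinterpret_cast<void**>(&base), bytes) != cudaSuccess)
            return false;
        size = bytes;
        return true;
    }
    void* alloc(size_t bytes) {
        const size_t align  = size_t(2) << 20;              // 2 MB granularity (assumed)
        const size_t offset = (used + align - 1) & ~(align - 1);
        if (offset + bytes > size) return nullptr;           // pool exhausted
        used = offset + bytes;
        return base + offset;
    }
    void destroy() { cudaFree(base); base = nullptr; size = used = 0; }
};

int main() {
    DevicePool pool;
    if (!pool.init(size_t(4) << 30)) { printf("pool allocation failed\n"); return 1; }
    float* a = static_cast<float*>(pool.alloc(size_t(1) << 30));  // 1 GB sub-buffer
    float* b = static_cast<float*>(pool.alloc(size_t(1) << 30));  // 1 GB sub-buffer
    printf("a = %p, b = %p\n", static_cast<void*>(a), static_cast<void*>(b));
    pool.destroy();
    return 0;
}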

Some nvprof profiling reveals that the GPU is not being fully utilized (80% DRAM utilization versus 100% for the GTX 980). This leads me to believe there is some limiting/throttling happening here that does not happen with Maxwell under WDDM.

My current machine is not set up to dual-boot, but within a day or two I will see if the OS makes a difference. In the past, even with WDDM, there was not a huge difference, but maybe this is a different case.

I did notice that GTX 1080 thrust::sort() performance seems a bit better than the Titan X, but for sparse LU, the discrete knapsack problem, and other DAG problems which have a good amount of global memory traffic, the Titan X is much faster.

Hopefully some magic driver or update will address this issue.

Strange, because according to Tech Tested the GTX 1080 seems to have better compute performance than the Titan X, and TV-L1 optical flow calculation is memory-bandwidth bound.

I have some more results. I’ve just run the RabbitCT benchmark on all three cards:

              256^3       512^3        1024^3
GTX 980 Ti    0.4056s     0.421201s    2.2152s
GTX TITAN X   0.4056s     0.421201s    2.3556s
GTX 1080      0.4056s     0.4368s      1.9656s

I think that the 256^3 reconstruction is limited by either PCI-Express bandwidth or possibly single-threaded CPU performance.

The 1024^3 reconstruction is approaching the point where it is limited by texturing throughput, but memory bandwidth utilization is also extremely high, and I suspect that latency might also be playing a part here. The time for the GTX 1080 corresponds to about 190 GT/s, which is around the theoretical texturing throughput of the GTX TITAN X.

The time for the 512^3 reconstruction should be 1/8th of the time for the 1024^3 reconstruction but the cache miss rate is much higher which would appear to be eating up memory bandwidth and/or increasing latency.

So maybe the best way to read this is that at 512^3 it is memory bandwidth limited and the GTX TITAN X wins but at 1024^3 it is compute limited and the GTX 1080 wins?

That’s too strong a conclusion to make from the measurements. A more direct way to estimate if you’re bandwidth or compute limited is to change the memory and GPU clock rates (using nvidia-smi) on the same board running the same test each time. If you’re completely bandwidth limited, the benchmark should be linear in memory clock speed and insensitive to GPU core speed. Completely compute limited would be the converse.
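Something along these lines, re-run once per clock setting, would do (the kernel is just a stand-in streaming copy, and the nvidia-smi application-clocks flag in the comment is only honoured on boards that allow it):

#include <cuda_runtime.h>
#include <cstdio>

// Stand-in bandwidth-ish kernel: streaming copy, one float per thread.
__global__ void streamCopy(const float* __restrict__ in, float* __restrict__ out, size_t n) {
    size_t i = blockIdx.x * size_t(blockDim.x) + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    // Run this binary once per clock setting, e.g. after
    //   nvidia-smi -ac <memMHz>,<graphicsMHz>   (application clocks, where supported)
    // and plot the time against the two clocks to see which one the benchmark tracks.
    const size_t n = size_t(64) << 20;                   // 64M floats = 256 MB per buffer
    float *in = nullptr, *out = nullptr;
    cudaMalloc(reinterpret_cast<void**>(&in),  n * sizeof(float));
    cudaMalloc(reinterpret_cast<void**>(&out), n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int block = 256;
    const int grid  = int((n + block - 1) / block);
    streamCopy<<<grid, block>>>(in, out, n);             // warm-up
    cudaEventRecord(start);
    for (int rep = 0; rep < 20; ++rep)
        streamCopy<<<grid, block>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    const double gbytes = 20.0 * 2.0 * n * sizeof(float) / 1e9;   // read + write per rep
    printf("%.2f ms, %.1f GB/s\n", ms, gbytes / (ms / 1e3));

    cudaFree(in); cudaFree(out);
    return 0;
}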

Concurring with SPWorley, such analysis is best based on a 2D shmoo plot using multiple core and multiple memory clock settings, say five to six in each dimension. This traditional approach is impacted somewhat by the fact that for various GPUs, memory throughput is not entirely independent of core clock.

Thanks for the suggestion - I’ll definitely try that. My guess is that in both cases it won’t be completely independent of either clock speed though.

If TV-L1 optical flow were bandwidth bound, wouldn't the AMD card be the fastest, with its much faster HBM memory (512 GB/s)? I guess the change in architecture is a big factor between AMD and NVIDIA though.

Lol, after NVIDIA's last GTX 970 RAM-gate I was waiting for/expecting something like this! :)

However, in this case it seems like a driver issue to me… something similar seemed to have happened last time and it was fixed?!

Anyway just for kicks, could you run my bandwidth test too:

http://www.skybuck.org/CUDA/BandwidthTest/version%200.16/Packed/

(I would also be interested in memory latency tests; I wrote another program for that too, but it seems to crash currently… the one above in the link seems to work fine on my system. It hasn't been updated for newer systems/GPUs but might still work!)

@Luke: We observed that the performance of our in-house TV-L1 CUDA implementation seems to scale roughly linearly with the memory bandwidth of the GPU. That might be because most operations in the algorithm are pointwise or stencil operations, so the arithmetic intensity is low.
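For illustration (a generic sketch, not our actual implementation), a typical update step looks something like this; with roughly 2 FLOPs per 12 bytes of traffic, the kernel runs at memory speed on any recent GPU:

#include <cuda_runtime.h>
#include <cstdio>

// Pointwise update typical of TV-L1-style solvers: each thread reads two
// floats, does two FLOPs, writes one float. Arithmetic intensity is well
// under 1 FLOP/byte, so the achievable rate is set by memory bandwidth.
__global__ void pointwiseUpdate(const float* __restrict__ u,
                                const float* __restrict__ grad,
                                float* __restrict__ out,
                                float tau, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = u[i] - tau * grad[i];
}

int main() {
    const int n = 1 << 24;                        // 16M elements
    float *u, *grad, *out;
    cudaMalloc(reinterpret_cast<void**>(&u),    n * sizeof(float));
    cudaMalloc(reinterpret_cast<void**>(&grad), n * sizeof(float));
    cudaMalloc(reinterpret_cast<void**>(&out),  n * sizeof(float));
    pointwiseUpdate<<<(n + 255) / 256, 256>>>(u, grad, out, 0.25f, n);
    cudaDeviceSynchronize();
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(u); cudaFree(grad); cudaFree(out);
    return 0;
}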

I will do one better and compare against a tiny laptop GTX 980M using Windows 8.1 with WDDM; even then the mobile 980M destroys the desktop 1080 for random reads with an allocation size > 1024 MB:

Windows 8.1 WDDM laptop GTX 980M:

GeForce GTX 980M : 12 SM : 4096 MB
Probing from: 256 - 2560 MB ...
alloc MB, probe MB,    msecs,     GB/s
     256,    14336,   102.03,   137.22
     512,    14336,   102.90,   136.06
     768,    14336,   103.16,   135.71
    1024,    14336,   102.99,   135.93
    1280,    14336,   103.06,   135.84
    1536,    14336,   103.16,   135.72
    1792,    14336,   103.13,   135.75
    2048,    14336,   103.14,   135.74
    2304,    14336,   223.73,    62.57
    2560,    14336,   403.05,    34.74

The desktop GTX 1080:

GeForce GTX 1080 : 20 SM : 8192 MB
Probing from: 256 - 5120 MB ...
alloc MB, probe MB,    msecs,     GB/s
     256,    14336,    86.03,   162.73
     512,    14336,    88.77,   157.72
     768,    14336,    93.01,   150.52
    1024,    14336,    95.23,   147.02
    1280,    14336,  1351.07,    10.36
    1536,    14336,  2346.30,     5.97
    1792,    14336,  3096.08,     4.52
    2048,    14336,  3678.75,     3.81
    2304,    14336,  4140.27,     3.38
    2560,    14336,  4519.34,     3.10

The 2-year-old mobile GTX 980M under Windows 8.1 is over 35 times faster than the desktop GTX 1080 for random memory reads at an allocation size of 2048 MB, which is a fairly common size for our applications.

Also, the L2 cache of the GTX 980M is not larger than that of the GTX 1080, so that is not the issue.

Any ideas why the GTX 1080 memory bandwidth is this poor?
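For reference, the access pattern being measured is just uniformly random reads scattered across the allocation. A minimal sketch of such a probe (not allanmac's actual code, and the 32-byte-per-read accounting below is a rough assumption) would be:

#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <cstdio>

// Each thread performs many uniformly random 4-byte reads across `words`
// elements. If address translation is the bottleneck, GB/s should fall off a
// cliff once the allocation outgrows what the TLB can cover.
__global__ void randomReads(const unsigned* __restrict__ data, size_t words,
                            unsigned long long seed, int iters, unsigned* sink) {
    curandStatePhilox4_32_10_t rng;
    curand_init(seed, blockIdx.x * blockDim.x + threadIdx.x, 0, &rng);
    unsigned acc = 0;
    for (int i = 0; i < iters; ++i)
        acc ^= data[size_t(curand(&rng)) % words];
    if (acc == 0xdeadbeefu) *sink = acc;      // keep the loads from being optimized away
}

int main() {
    for (size_t mb = 256; mb <= 2560; mb += 256) {
        const size_t words = (mb << 20) / sizeof(unsigned);
        unsigned *data = nullptr, *sink = nullptr;
        cudaMalloc(reinterpret_cast<void**>(&data), words * sizeof(unsigned));
        cudaMalloc(reinterpret_cast<void**>(&sink), sizeof(unsigned));
        cudaMemset(data, 0x5a, words * sizeof(unsigned));

        const int blocks = 1024, threads = 256, iters = 1024;
        cudaEvent_t t0, t1;
        cudaEventCreate(&t0); cudaEventCreate(&t1);
        randomReads<<<blocks, threads>>>(data, words, 1234ULL, iters, sink);  // warm-up
        cudaEventRecord(t0);
        randomReads<<<blocks, threads>>>(data, words, 5678ULL, iters, sink);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);

        float ms = 0.f;
        cudaEventElapsedTime(&ms, t0, t1);
        // Count one 32-byte DRAM sector per random 4-byte read (rough accounting).
        const double gb = double(blocks) * threads * iters * 32.0 / 1e9;
        printf("%5zu MB : %9.2f ms : %8.2f GB/s\n", mb, ms, gb / (ms / 1e3));

        cudaEventDestroy(t0); cudaEventDestroy(t1);
        cudaFree(data); cudaFree(sink);
    }
    return 0;
}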

maybe you should file a bug

Yes, I just did, thanks, as this is a big deal for my company.
Bug #1776406

From what I’ve heard, the TLB fix will come with a July Windows 10 update / CUDA 8 production release, but I don’t know if that is the same issue. On Linux, we see the 1070 performing better in random memory access than the 1080: reasonable bandwidth for the 1070 (83% of max) but only 60% for the 1080. This is using the ethash OpenCL kernel, which shows behavior similar to @allanmac’s random test, which he actually wrote to verify my claims :)

TL;DR

I get peak bandwidth of ~275 GB/s on a 1080 versus ~285 GB/s on a stock Titan X, and I see none of the performance drop for random reads on the 1080 reported by CudaaduC.

So you may want to double-check before getting too pissed :)

Also, the driver sucks really badly; hopefully a new one will be out soon.

Instead of ‘TL;DR’, if you had read the posts you would have seen that I ran through a number of tests, and in all of them the GTX 1080 bandwidth was far worse than the GTX Titan X in the same machine. The random memory reads are the most extreme example.

This looks like an OS-related problem, as none of the Linux folks are seeing this behavior.

You will be happy to note that for my lovely brute-force problems, of which you are a big fan, the GTX 1080 is about 30-40% faster than the Titan X.

The driver sucks, and apparently not only on Linux. In any case, the hardware does not have “grim memory bandwidth”. ;)

Me, not really, but hopefully your application will be happier after a new driver is released.

These Paswell boards are pretty decent, though.

On a related note, unless I’m imagining things or my colleagues made some changes that I can’t figure out, my measurements now give 231 GB/s on the 1080. Weird.

That is around the number I am seeing, ~237 GB/s, for the CUDA 8 bandwidth test.
You are probably right that it is a driver issue, possibly with complications from this new type of memory.

Will post if I hear anything back from the bug report.

Anyone tested for partition camping on GDDR5X? Is it an issue these days?

I remember it still being visible on Kepler.

Did I understand correctly that global address hashing was meant to be the solution to partition camping? How does it actually work? Surely, whatever the hash function you can still find an access pattern that ends up probing a single partition? How did NVIDIA decide which access patterns were likely or unlikely to occur in order to design a hash function? Doesn’t a hash just make it more difficult for a CUDA programmer to know whether their access pattern causes partition camping?

Is it reasonable to assume that purely sequential access will still have minimal partition camping even with hashing?
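One way I could imagine probing it empirically (just a sketch, not a definitive test; the stride range is a guess, since the hash is undocumented) is to time reads at increasing power-of-two strides and look for strides where bandwidth collapses far beyond what the loss of coalescing alone would explain:

#include <cuda_runtime.h>
#include <cstdio>

// Strided-read probe: each thread reads one float at (tid * stride) % n.
// A 4-byte stride is the fully coalesced baseline; on a partition-camping-prone
// layout, certain large power-of-two strides would funnel all traffic into one
// partition and show a disproportionate slowdown.
__global__ void stridedRead(const float* __restrict__ in, float* out,
                            size_t n, size_t strideWords) {
    size_t tid = blockIdx.x * size_t(blockDim.x) + threadIdx.x;
    out[tid] = in[(tid * strideWords) % n];
}

int main() {
    const size_t n            = size_t(256) << 20;       // 256M floats = 1 GB
    const size_t threadsTotal = size_t(1) << 24;          // 16M threads
    float *in = nullptr, *out = nullptr;
    cudaMalloc(reinterpret_cast<void**>(&in),  n * sizeof(float));
    cudaMalloc(reinterpret_cast<void**>(&out), threadsTotal * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    const int block = 256;
    const int grid  = int(threadsTotal / block);
    for (size_t strideBytes = 4; strideBytes <= (size_t(1) << 20); strideBytes *= 2) {
        cudaEvent_t t0, t1;
        cudaEventCreate(&t0); cudaEventCreate(&t1);
        stridedRead<<<grid, block>>>(in, out, n, strideBytes / 4);   // warm-up
        cudaEventRecord(t0);
        stridedRead<<<grid, block>>>(in, out, n, strideBytes / 4);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        float ms = 0.f;
        cudaEventElapsedTime(&ms, t0, t1);
        printf("stride %7zu B : %8.3f ms\n", strideBytes, ms);
        cudaEventDestroy(t0); cudaEventDestroy(t1);
    }
    cudaFree(in); cudaFree(out);
    return 0;
}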