Nvidia Pascal TITAN Xp, TITAN X, GeForce GTX 1080 Ti, GTX 1080, GTX 1070, GTX 1060, GTX 1050 & GT 1030

Skip ahead to 15:25 in this GTC talk and you can hear an Nvidia engineer talking about the dpXa instructions:

http://on-demand.gputechconf.com/gtc/2016/video/S6136.html

One interesting thing he says is that this instruction will be enabled in their “inference products”, which are coming out soon. Someone seemed to mention that the P100 does not have it, so perhaps they’ll do some lower-end Tesla spins like they did with the Maxwell Teslas.

Thank you very much for posting these benchmarks. This is exactly what I was looking for ! ^_^
I hope to see similar benches of the 1070 soon :-D.

Just to give credit where credit is due: I simply linked to the benchmark results posted by the AMBER team; they did all the work. I would think there is potential upside to the preliminary GTX 1080 results reported for AMBER, as I see multiple graphics card vendors racing to create vendor-overclocked GTX 1080 variants with custom cooling and power supply solutions. For now, their holy grail seems to be reliable operation at >= 2 GHz for the core.

On the other hand, when I read that some are considering pumping voltage up to 1.25 volts, I wonder how long those GPUs are supposed to last.

Here are some OpenCL benchmarks on Linux.

Long enough before it becomes obsolete.

I expect product update cycles to slow considerably across the semiconductor industry; Intel has already pretty much stated as much by extending/abandoning their classical tick-tock scheme. So the GTX 1080 might not become obsolete quite as quickly as some may think.

I don’t know what exactly the nominal voltage for TSMC’s 16FF+ process is, but would think it is likely somewhere between 0.8 and 0.9 volts, so pushing the voltage to 1.25 V seems like asking for trouble to me. I am all for using optimized power supply and cooling solutions to boost performance, on the other hand, as that should be beneficial to overall reliability as well.

Do the dp4a and dp2a instructions operate on signed or unsigned integers?

I managed to get my hands on a GTX 1080 and I have the same erratic fan speed problem that others have been reporting, but apparently a fix is on the way in the next driver release.

I’ve clocked it doing about 190 GT/s. That’s some way short of the theoretical 257 GT/s, but faster than a pair of 4 GB GTX 670s.

Thank you everyone for posting up reviews. I’ve decided on my next Pascal microarchitecture card :-D.

@Gogar, the CUDA 8.0 PTX docs are pretty clear on signedness support. You can have signed, unsigned, or mixed-sign operands.
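
For anyone who wants to try it, here’s a minimal sketch of the three signedness variants. The __dp4a intrinsic is in CUDA 8 (sm_61 and up); the mixed-sign form is shown via inline PTX since I’m not sure the C intrinsic overloads cover it, so treat that part as an assumption:

// Minimal sketch: dp4a signedness variants (CUDA 8, compile with -arch=sm_61).
// __dp4a computes a 4-way byte dot product accumulated into a 32-bit value.
__global__ void dp4a_demo(const int *a, const int *b, int *out)
{
    int i = threadIdx.x;

    // Signed x signed: all four packed bytes treated as s8.
    int s = __dp4a(a[i], b[i], 0);

    // Unsigned x unsigned: all four packed bytes treated as u8.
    unsigned int u = __dp4a((unsigned int)a[i], (unsigned int)b[i], 0u);

    // Mixed sign via inline PTX (dp4a.s32.u32): a's bytes signed, b's unsigned.
    int m;
    asm("dp4a.s32.u32 %0, %1, %2, %3;" : "=r"(m) : "r"(a[i]), "r"(b[i]), "r"(0));

    out[i] = s + (int)u + m;
}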

Thanks Scott! Didn’t expect it to be documented yet. That’s amazing, by the way; I’m kind of in awe at the possibilities.

http://www.nvidia.com/download/driverResults.aspx/103913/en-us

http://www.nvidia.com/download/driverResults.aspx/103909/en-us

Nvidia GeForce 368.39 WHQL driver for GTX 1080 & GTX 1070.

I finally got to play with a 1080 today and I thought I’d post my findings.

So the int8/16 dot product instructions (dp4a and dp2a) are indeed full throughput and work just as advertised. The fp16 support, however, has a lot of functionality that isn’t documented:

HFMA2.<MRG_H0|MRG_H1|F32>.<FTZ|FMZ>.<SAT> d, a.<H0_H0|H1_H1|F32>, <->b.<H0_H0|H1_H1>, <->c.<H0_H0|H1_H1|F32>;

<HADD2|HMUL2>.<MRG_H0|MRG_H1|F32>.<FTZ|FMZ>.<SAT> d, a.<H0_H0|H1_H1|F32>, <->b.<H0_H0|H1_H1>;

On sm_61 this of course runs at 1/64 throughput (the instruction, not the math) so all this is of limited use. But it might be good to be aware of it if you get hold of hardware that isn’t crippled (sm_60, sm_62?).

Interestingly, there’s a lot of mixed fp32/fp16 precision support, though the mode I was most interested in (fp16x2 dot product with fp32 accumulate) isn’t supported. Any time you mix in fp32 registers, the instruction can only work on one of the packed fp16 values in the other registers (as far as I can tell anyway).

The H0_H0 and H1_H1 flags are there to facilitate a SIMD matrix multiply outer product, but are also used to select the packed values in single throughput mode. The merge flags don’t currently work on sm_61, but it’s clear they’re meant to facilitate BFI-like functionality.
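
If you’d rather poke at this from CUDA C than from SASS, here’s a minimal sketch using the half2 intrinsics in cuda_fp16.h, which should compile down to the HFMA2/HMUL2 instructions above (the __low2half2 broadcast is my assumption for how the H0_H0 selection surfaces at the source level):

#include <cuda_fp16.h>

// Sketch: packed fp16x2 math (sm_53 and up). On sm_61 this path runs at the
// crippled 1/64 instruction rate mentioned above, so it's for experimentation only.
__global__ void half2_demo(const __half2 *a, const __half2 *b,
                           const __half2 *c, __half2 *out)
{
    int i = threadIdx.x;

    __half2 fma = __hfma2(a[i], b[i], c[i]);   // d = a * b + c, two lanes at once

    // __low2half2 replicates the low half into both lanes, roughly the
    // H0_H0 source selection visible in the SASS encoding.
    __half2 outer = __hmul2(__low2half2(a[i]), b[i]);

    out[i] = __hadd2(fma, outer);
}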

Oh, and one other thing… it looks like there’s a performance bug in the compiler. If you’re loading from memory in fp16 and then converting to fp32 for compute, ptxas tries to be clever and uses HADD2 instead of F2F to do the conversion. But on sm_61 hardware this is a 16x slower pathway. On sm_60 hardware it makes a lot more sense, since there it’s a 4x speedup. I’ve already submitted a bug for this.
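
For reference, the pattern that trips it looks roughly like this (hypothetical kernel; the interesting part is just the widening conversion after the fp16 load):

#include <cuda_fp16.h>

// Sketch: fp16 in memory, fp32 math. ptxas may emit HADD2 instead of F2F
// for the __half2float conversion, a win on sm_60 but ~16x slower on sm_61.
__global__ void half_load_float_math(const __half *in, float *out, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = __half2float(in[i]);  // the fp16 -> fp32 conversion in question
    out[i] = x * scale + 1.0f;      // all the actual math stays in fp32
}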

Thanks Scott, very interesting results. Do you know what Nvidia’s intent was with the dp instructions? Are they going to make an sgemm routine that utilizes this, or are they leaving it to people like you?

I think Nvidia’s intent is pretty clear in that the dp instructions were meant for fast deep learning inference (in both cuDNN and cuBLAS). Google just built dedicated 8-bit hardware for the same purpose. It would be interesting to compare the performance of the two. I would expect Google’s chip to be lower power, but Nvidia likely has a speed edge. Also, the GPU is likely a lot more general-purpose and probably able to implement a wider range of SOTA networks.

But it’s an interesting challenge to adapt these instructions for training purposes. There are several recent papers on ultra-low-precision training that indicate this is possible (and even desirable for the added regularization).

Yes, the Google chip looks interesting as well, but I didn’t think they had said what their intent was. The early reports seemed to indicate that it would only be available through Google’s cloud offering, and you would not be able to buy them outright. It did seem disingenuous when Google claimed a 10x speedup over GPUs, since they likely didn’t use Pascal as a reference, and certainly not the 8-bit instructions. I don’t believe they announced the process node either, but for the volumes they’re making it seems hard to justify the cost of 14 nm. Then again, they may not care at all about cost.

Thanks for the deviceQuery output.
It says that the watchdog kernel timeout killer is enabled. This implies that GP104’s sm_61 does not support compute preemption like sm_60’s P100. The P100 whitepaper does promise this (“both interactive graphics tasks and interactive debuggers can run in concert with long-running compute tasks.”) The GTX 1080 whitepaper discusses preemption in much more detail but does not specifically give the same promise as that lone clear P100 sentence.
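
By the way, you can check that flag programmatically instead of eyeballing the deviceQuery listing; a minimal sketch with the runtime API:

#include <cstdio>
#include <cuda_runtime.h>

// Sketch: query whether the kernel run-time watchdog is active on device 0.
int main()
{
    int timeout = 0;
    cudaDeviceGetAttribute(&timeout, cudaDevAttrKernelExecTimeout, 0);
    printf("Run time limit on kernels: %s\n", timeout ? "Yes" : "No");
    return 0;
}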

It could just be a driver issue, with the hardware supporting the feature but the driver simply not exposing it yet.

https://devtalk.nvidia.com/default/topic/938369/cuda-programming-and-performance/cuda-8-errors-when-using-two-1080-gpus-in-multithreading-way/post/4889786/#4889786

Here’s the deviceQuery output using CUDA 8.0 RC’s deviceQuery. The deviceQuery from CUDA 6.5 doesn’t support it properly and should be ignored.

Ah, that report does NOT list the watchdog killer enabled. That’s promising, though it may just be that the display at the time was using an IGP and not the GTX 1080.

I’m still wondering whether there is a lone fp16x2 unit, or whether the operations are being emulated/microcoded with ~10-11 fp32 conversion and math ops.

If HFMA2(a,b,c) and HMUL2(a,b) run at the same rate, then it’s probably not being emulated.
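
A crude way to test that (a sketch, not a careful microbenchmark; it times a dependent chain of each instruction with clock64() from a single thread):

#include <cuda_fp16.h>

// Sketch: compare the per-instruction cost of dependent HFMA2 vs HMUL2 chains.
// If the two come out about equal, the fp16x2 path is probably native silicon
// rather than a microcoded fp32 conversion + math sequence.
__global__ void rate_test(__half2 *out, long long *cycles)
{
    __half2 a = __float2half2_rn(1.0001f);
    __half2 b = __float2half2_rn(0.9999f);

    long long t0 = clock64();
    #pragma unroll
    for (int i = 0; i < 256; ++i)
        a = __hfma2(a, b, a);        // dependent HFMA2 chain
    long long t1 = clock64();
    #pragma unroll
    for (int i = 0; i < 256; ++i)
        a = __hmul2(a, b);           // dependent HMUL2 chain
    long long t2 = clock64();

    out[0] = a;                      // keep the chains from being optimized away
    cycles[0] = t1 - t0;
    cycles[1] = t2 - t1;
}

Launch it as rate_test<<<1, 1>>>(out, cycles) and compare the two cycle counts.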

Hopefully the final CUDA 8.0 docs include an updated throughput table and a Pascal “Tuning Guide”.