Nvidia Pascal TITAN Xp, TITAN X, GeForce GTX 1080 Ti, GTX 1080, GTX 1070, GTX 1060, GTX 1050 & GT 1030

Yes, cc6.1 supports FP16 operations (add, multiply) natively; it just isn’t a very fast path. Whenever the proper documentation for cc6.1 comes out (alas, it is not in the CUDA 8 RC docs; I was hoping it would be, so I am now waiting for the final CUDA 8 docs) this will be evident.

What you’ll find is that FP16 is natively supported but is not a fast path. This means the throughput will not be the 2x-of-FP32 throughput that you will see reported for cc6.0 (e.g. Tesla P100). mfatica chose his words carefully:

Note this is different than saying:

“There is no fp16 in GP104.”

which would not be a correct statement, as you’ve now discovered.

But then we effectively have more memory, correct? We at least get that.

Yes (although I wouldn’t say you have more memory, I would say you have the possibility for more parameter storage), I was not trying to suggest there is no value, just trying to clear up what I thought might be some confusion. I was really responding to robik’s posting, which now unfortunately is on a previous forum page.

Single rate fp16! I’ll take it.

Any throughput numbers? 128 ops/clock? I’m going to bet it’s… 16 ops/clock.

Well played, @mfatica.

@txbob No problem, I figured that.

@allanmac ditto, glass half full here, it’s pascal for me

An Anandtech’er believes there is just one FP16x2 unit on the GP104 SMP.

Ok, I think I can confirm that Anandtech’er’s post:

asm("fma.rn.f16x2 %0, %1, %2, %3;" : "=r"(d) : "r"(a), "r"(b), "r"(c));

--:1:2:-:2      HFMA2 R0, R0, R2, R3;
01:-:-:-:1      MOV R2, param_0[0];
--:-:-:Y:2      MOV R3, param_0[1];
02:1:-:-:1      STG.E [R2], R0;

The assembler is putting barrier flags on the HFMA2 which usually means the instruction isn’t implemented on a cuda core, but on some shared resource like the SFUs. Then depending on the arch you’ll have more or less of this shared resource and that controls the throughput.

Now I’m worried about the int8 performance… I’ll look at that next. But that’s a bit trickier since nvdisasm is currently seg faulting on any code that contains dp4a or dp2a (probably just missing string table entries).

Ok… that’s a relief. Here’s the asm from dp4a:

asm("dp4a.u32.u32 %0, %1, %2, %3;" : "=r"(d) : "r"(a), "r"(b), "r"(c));

# 0x001fc400fe2007f6
# 0x4c98078000870001 --:-:-:-:6      MOV R1, c[0x0][0x20];
# 0x4c98078005270000 --:-:-:-:1      MOV R0, c[0x0][0x148];
# 0x4c98078005470005 --:-:-:-:1      MOV R5, c[0x0][0x150];
# 0x001fc800fe8007f1
# 0x4c98078005070002 --:-:-:-:1      MOV R2, c[0x0][0x140];
# 0x4c98078005170003 --:-:-:-:4      MOV R3, c[0x0][0x144];
# 0x53d8028005370000 --:-:-:-:2      dp4a.u32.u32 R0, R0, c[0x0][0x14c], R5;
# 0x001ffc00ffe000f1
# 0xeedc200000070200 --:1:-:-:1      STG.E [R2], R0;
# 0xe30000000007000f --:-:-:-:f      EXIT;

The 2 clocks on dp4a is just to satisfy the STG dependency.

Also note that dp4a is loading an operand directly from a constant. Only cuda core instructions can do that. So the only question is if this is a full throughput instruction, or only half throughput like VMAD.S8.S8 currently is. But either way this could really speed things up.

Ok, just compiled 4 dp4a’s in a row with no dependencies, and the stall counts are all being set to 1, which means it’s likely a true full-throughput instruction. That means the 1080 has 8228-8873 GFLOPS * 4 = 33-36 Tops of int8. Or, I think, nvidia likes to call these DLops (deep learning ops).

It’s interesting that dp4a/dp2a are marked as sm_61+ and not in sm_60 (GP100).

I talked to nvidia about this at GTC. It seems sm_60 was designed first and these instructions didn’t quite make it into the tape-out.

https://devtalk.nvidia.com/default/topic/938369/cuda-programming-and-performance/cuda-8-errors-when-using-two-1080-gpus-in-multithreading-way/post/4889786/#4889786

Finally a CUDA 8.0 devicequery that properly supports Pascal.

Anyone know if FP16 atomics might also be supported?

I read that FP64 atomics will be supported, but the CUDA 8.0 RC docs haven’t yet been updated to reflect it.

GTX 1070 reviews are online now at your favourite review sites, for those that want a Pascal GP104 card at a cheaper price.

The sm_60+ ATOM.ADD.F64 intrinsic is defined in the sm_60_atomic_functions.hpp file:

__SM_60_ATOMIC_FUNCTIONS_DECL__ double atomicAdd(double *address, double val)
{
  return __dAtomicAdd(address, val);
}

I don’t see any mention of fp16x2 atomics despite their availability in sm_52 for GLSL:

https://www.opengl.org/registry/specs/NV/shader_atomic_fp16_vector.txt

Icare3D’s blog post too.

I’m more interested in the CUDA performance.
Anyone in possession of the 1080 care to do any CUDA benches vs the 980 Ti / Titan X? :)

It seems they put the general gaming performance of the 1070 slightly ahead of the Titan X.
Now I’m interested in what the new architecture, and the cuts here and there, spell w.r.t. CUDA performance.

I’m astounded the GTX 1080 has 1/64 FP16 performance.

Can anyone indicate whether this was solely a money-making decision to gimp the 1080? Or are there technical reasons to leave it out, such as die size, etc.?

From what I read on other sites it might just be a software switch.

Now it looks very likely that performance is almost exactly defined by ALUs * frequency and memory bandwidth, i.e. it’s almost the same as Maxwell 5.2 with only slight changes inside the SM. Thanks to high frequencies, even the 1070 is pretty close to the Titan X in computation speed, but the new cards have lower memory speeds than the 980 Ti / Titan X.

Given the lack of details being published, I’m starting to believe this too.

Can someone with a 1080 do some performance benchmarks to expose the performance profile of this card?
I’m particularly interested in memory access, transfer latencies and what type of execution increases can be had.

The new line (1070/1080) has reduced memory bandwidth, reduced memory bus width, and fewer CUDA cores than the 980 Ti / Titan X.
Also, as I understand it, GDDR5X has increased random-access latency over GDDR5?

In January 2016, JEDEC standardized GDDR5X SGRAM.[2] GDDR5X targets a transfer rate of 10 to 14 Gbit/s, twice that of GDDR5. Essentially, it provides the memory controller the option to use either a double data rate mode that has a prefetch of 8n, or a quad data rate mode that has a prefetch of 16n.[3] GDDR5 only has a double data rate mode with an 8n prefetch.[4]

It seemingly makes up for that with increased clock speeds and lower power usage via transistor gate shrinkage.
Am I missing something? I’m going to be even more skeptical of the 1070’s performance…

And why won’t Nvidia comment immediately about FP16 on the 1070/1080?