Nvidia Pascal TITAN Xp, TITAN X, GeForce GTX 1080 Ti, GTX 1080, GTX 1070, GTX 1060, GTX 1050 & GT 1030

Yes, cc6.1 supports FP16 operations (add, multiply) natively; it just isn’t a very fast path. Whenever the proper documentation for cc6.1 comes out (alas, it is not in the CUDA 8 RC docs; I was hoping it would be, so I am now waiting for the final CUDA 8 docs) this will be evident.

What you’ll find is that FP16 is natively supported but is not a fast path. This means the throughput will not be the 2x-of-FP32 throughput that you will see reported for cc6.0 (e.g. Tesla P100). mfatica chose his words carefully:

Note this is different than saying:

“There is no fp16 in GP104.”

which would not be a correct statement, as you’ve now discovered.

But then we effectively have more memory, correct? We at least get that.

Yes (although I wouldn’t say you have more memory, I would say you have the possibility for more parameter storage), I was not trying to suggest there is no value, just trying to clear up what I thought might be some confusion. I was really responding to robik’s posting, which now unfortunately is on a previous forum page.

Single rate fp16! I’ll take it.

Any throughput numbers? 128 ops/clock? I’m going to bet it’s… 16 ops/clock.

Well played, @mfatica.

@txbob No problem, I figured that.

@allanmac ditto, glass half full here, it’s pascal for me

An Anandtech’er believes there is just one FP16x2 unit on the GP104 SMP.

Ok, I think I can confirm that Anandtech’er’s post:

asm("fma.rn.f16x2 %0, %1, %2, %3;" : "=r"(d) : "r"(a), "r"(b), "r"(c));

--:1:2:-:2      HFMA2 R0, R0, R2, R3;
01:-:-:-:1      MOV R2, param_0[0];
--:-:-:Y:2      MOV R3, param_0[1];
02:1:-:-:1      STG.E [R2], R0;

The assembler is putting barrier flags on the HFMA2 which usually means the instruction isn’t implemented on a cuda core, but on some shared resource like the SFUs. Then depending on the arch you’ll have more or less of this shared resource and that controls the throughput.

Now I’m worried about the int8 performance… I’ll look at that next. But that’s a bit trickier since nvdisasm is currently seg faulting on any code that contains dp4a or dp2a (probably just missing string table entries).

Ok… that’s a relief. Here’s the asm from dp4a:

asm("dp4a.u32.u32 %0, %1, %2, %3;" : "=r"(d) : "r"(a), "r"(b), "r"(c));

# 0x001fc400fe2007f6
# 0x4c98078000870001 --:-:-:-:6      MOV R1, c[0x0][0x20];
# 0x4c98078005270000 --:-:-:-:1      MOV R0, c[0x0][0x148];
# 0x4c98078005470005 --:-:-:-:1      MOV R5, c[0x0][0x150];
# 0x001fc800fe8007f1
# 0x4c98078005070002 --:-:-:-:1      MOV R2, c[0x0][0x140];
# 0x4c98078005170003 --:-:-:-:4      MOV R3, c[0x0][0x144];
# 0x53d8028005370000 --:-:-:-:2      dp4a.u32.u32 R0, R0, c[0x0][0x14c], R5;
# 0x001ffc00ffe000f1
# 0xeedc200000070200 --:1:-:-:1      STG.E [R2], R0;
# 0xe30000000007000f --:-:-:-:f      EXIT;

The 2 clocks on dp4a is just to satisfy the STG dependency.

Also note that dp4a is loading an operand directly from a constant. Only cuda core instructions can do that. So the only question is if this is a full throughput instruction, or only half throughput like VMAD.S8.S8 currently is. But either way this could really speed things up.

Ok, just compiled 4 dp4a’s in a row with no dependencies, and the stall counts are all being set to 1, which means it’s likely a true full-throughput instruction. That means the 1080 has 8228-8873 GFLOPS * 4 = 33-36 Tops of int8. Or, I think, nvidia likes to call these DLops (deep learning ops).

It’s interesting that dp4a/dp2a are marked as sm_61+ and not in sm_60 (GP100).

I talked to nvidia about this at GTC. It seems sm_60 was designed first and these instructions didn’t quite make it into the tape-out.

https://devtalk.nvidia.com/default/topic/938369/cuda-programming-and-performance/cuda-8-errors-when-using-two-1080-gpus-in-multithreading-way/post/4889786/#4889786

Finally a CUDA 8.0 devicequery that properly supports Pascal.

Anyone know if FP16 atomics might also be supported?

I read that FP64 atomics will be supported, but the CUDA 8.0 RC docs haven’t yet been updated to reflect it.

GTX 1070 reviews are online now at your favourite review sites, for those that want a Pascal GP104 card at a cheaper price.

The sm_60+ ATOM.ADD.F64 intrinsic is defined in the sm_60_atomic_functions.hpp file:

__SM_60_ATOMIC_FUNCTIONS_DECL__ double atomicAdd(double *address, double val)
{
  return __dAtomicAdd(address, val);
}

I don’t see any mention of fp16x2 atomics despite their availability in sm_52 for GLSL:

https://www.opengl.org/registry/specs/NV/shader_atomic_fp16_vector.txt

Icare3D’s blog post too.

I’m more interested in the CUDA performance.
Anyone in possession of the 1080 care to do any CUDA benches vs the 980 Ti / Titan X? :)

It seems they put the general gaming performance of the 1070 slightly ahead of the Titan X.
Now I’m interested in what the new architecture, and the cuts here and there, spell w.r.t. CUDA performance.

I’m astounded the GTX 1080 has 1/64 FP16 performance.

Can anyone indicate whether this was solely a money-making decision to gimp the 1080? Or are there technical reasons to leave it out, such as die size, etc.?

From what I read on other sites it might just be a software switch.

Now it looks very likely that performance is almost exactly defined by ALUs * frequency and memory bandwidth, i.e. it’s almost the same as Maxwell 5.2 with only slight changes inside the SM. Thanks to high frequencies, even the 1070 is pretty close to the Titan X in computation speed, but the new cards have lower memory speeds than the 980 Ti / Titan X.

Given the lack of details being published, I’m starting to believe this too.

Can someone with a 1080 do some performance benchmarks to expose the performance profile of this card?
I’m particularly interested in memory access, transfer latencies and what type of execution increases can be had.

The new line (1070/1080) has reduced memory bandwidth, reduced memory bus width, and fewer CUDA cores than the 980 Ti / Titan X.
Also, as I understand it, GDDR5X has increased random-access latency over GDDR5?

In January 2016, JEDEC standardized GDDR5X SGRAM.[2] GDDR5X targets a transfer rate of 10 to 14 Gbit/s, twice that of GDDR5. Essentially, it provides the memory controller the option to use either a double data rate mode that has a prefetch of 8n, or a quad data rate mode that has a prefetch of 16n.[3] GDDR5 only has a double data rate mode with an 8n prefetch.[4]

It seemingly makes up for that with increased clock speeds and lower power usage via transistor gate shrinkage.
Am I missing something? I’m going to be even more skeptical of the 1070’s performance…

And why won’t Nvidia comment immediately about FP16 on the 1070/1080?