Yes, cc6.1 supports FP16 operations (add, multiply) natively; it just isn’t a very fast path. Whenever the proper documentation for cc6.1 comes out (alas, it is not in the CUDA 8 RC docs; I was hoping it would be, so I’m now waiting for the final CUDA 8 docs) this will be evident.
What you’ll find is that FP16 is natively supported but is not a fast path. That means the throughput will not be the 2x-over-FP32 throughput you will see reported for cc6.0 (e.g. Tesla P100). mfatica chose his words carefully:
Note that this is different from saying:
“There is no fp16 in GP104.”
which would not be a correct statement, as you’ve now discovered.
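If anyone wants to verify this themselves, here is a minimal sketch (the kernel name and structure are mine, not from any NVIDIA sample) that exercises the native FP16 path through the half2 intrinsics in cuda_fp16.h. Building it with nvcc -arch=sm_61 and dumping the SASS with cuobjdump should show HFMA2 instructions being issued on a GTX 1080, just not at the 2x-over-FP32 rate the cc6.0 parts get:

```
#include <cuda_fp16.h>

// Minimal FP16 throughput probe (illustrative sketch, not a tuned benchmark).
// Each __hfma2 performs two FP16 fused multiply-adds on a packed half2
// operand and should appear as HFMA2 in the SASS for sm_53 and later.
__global__ void hfma2_probe(const __half2* a, const __half2* b, __half2* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        __half2 x   = a[i];
        __half2 y   = b[i];
        __half2 acc = __float2half2_rn(0.0f);
        // A short dependent FMA chain so the compiler cannot fold it away.
        for (int k = 0; k < 16; ++k)
            acc = __hfma2(x, y, acc);
        out[i] = acc;
    }
}
```

Timing that against an equivalent float2 version is the easiest way to see how slow the cc6.1 FP16 path actually is.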
Yes (although I wouldn’t say you have more memory; I would say you have the possibility for more parameter storage). I was not trying to suggest there is no value, just trying to clear up what I thought might be some confusion. I was really responding to robik’s posting, which is now unfortunately on a previous forum page.
The assembler is putting barrier flags on the HFMA2, which usually means the instruction isn’t implemented on a CUDA core but on some shared resource like the SFUs. Then, depending on the arch, you’ll have more or less of this shared resource, and that controls the throughput.
Now I’m worried about the int8 performance… I’ll look at that next. But that’s a bit trickier, since nvdisasm is currently segfaulting on any code that contains dp4a or dp2a (probably just missing string table entries).
Also note that dp4a is loading an operand directly from a constant. Only CUDA core instructions can do that. So the only question is whether this is a full-throughput instruction or only half throughput like VMAD.S8.S8 currently is. But either way this could really speed things up.
Ok, I just compiled 4 dp4a’s in a row with no dependencies and the stall counts are all being set to 1. This means it’s likely a true full-throughput instruction, which would give the 1080 8228–8873 GFLOPS (FP32) * 4 = 33–36 TOPS of int8. Or I think NVIDIA likes to call these DL ops (deep learning ops).
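For anyone who wants to poke at this on a 1080: CUDA 8 exposes the instruction as the __dp4a intrinsic for sm_61 and up. A minimal sketch (my own naming, not a tuned benchmark) that should compile down to DP4A:

```
// Build with CUDA 8, nvcc -arch=sm_61 (GP104/GP102/GP106).
// __dp4a treats each 32-bit operand as four packed 8-bit values, computes the
// 4-wide dot product, and adds it to a 32-bit accumulator in one instruction.
__global__ void dp4a_probe(const int* a, const int* b, int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int x   = a[i];
        int y   = b[i];
        int acc = 0;
        // Short chain of dp4a ops; in real use the packed bytes would be
        // int8 weights/activations.
        for (int k = 0; k < 16; ++k)
            acc = __dp4a(x, y, acc);
        out[i] = acc;
    }
}
```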
I"m more interested in the Cuda performance.
Anyone in possession of a 1080 care to do any CUDA benchmarks vs. the 980 Ti / Titan X? :)
It seems they put the general gaming performance of the 1070 slightly ahead of the Titan X.
Now I’m interested in what the new architecture, and the cuts here and there, spell w.r.t. CUDA performance.
I’m astounded the GTX 1080 has 1/64-rate FP16 performance.
Can anyone indicate whether this was solely a money-making decision to gimp the 1080? Or are there technical reasons to leave it out, such as die size, etc.?
From what I read on other sites it might just be a software switch.
Now it looks very likely that performance is almost exactly defined by ALU count * frequency and memory bandwidth, i.e. it’s almost the same as Maxwell cc5.2 with only slight changes inside the SM. Thanks to high frequencies, even the 1070 is pretty close to the Titan X in computation speed, but the new cards have lower memory bandwidth than the 980 Ti / Titan X.
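A quick back-of-the-envelope check with the published core counts and boost clocks (numbers pulled from the spec sheets, not measured by me) supports that view:

```
#include <cstdio>

// Rough peak FP32 throughput: CUDA cores * 2 ops per FMA * boost clock (GHz).
// Published boost clocks; real sustained clocks will differ.
int main()
{
    struct Card { const char* name; int cores; double ghz; };
    const Card cards[] = {
        { "GTX 1080   ", 2560, 1.733 },
        { "GTX 1070   ", 1920, 1.683 },
        { "GTX Titan X", 3072, 1.075 },
        { "GTX 980 Ti ", 2816, 1.075 },
    };
    for (const Card& c : cards)
        printf("%s  %.2f TFLOPS FP32 peak\n", c.name, c.cores * 2 * c.ghz / 1000.0);
    return 0;
}
```

That puts the 1070 at roughly 6.5 TFLOPS against about 6.6 for the Titan X, i.e. very close on raw ALU throughput, while the bandwidth numbers go the other way.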
Given the lack of details being published, I’m starting to believe this too.
Can someone with a 1080 run some benchmarks to expose the performance profile of this card?
I’m particularly interested in memory access, transfer latencies, and what kind of execution speedups can be had.
The new line (1070/1080) has reduced memory bandwidth, a narrower memory bus, and fewer CUDA cores than the 980 Ti / Titan X.
Also, as I understand it, GDDR5X has increased random access latency over GDDR5?
In January 2016, JEDEC standardized GDDR5X SGRAM.[2] GDDR5X targets a transfer rate of 10 to 14 Gbit/s, twice that of GDDR5. Essentially, it provides the memory controller the option to use either a double data rate mode that has a prefetch of 8n, or a quad data rate mode that has a prefetch of 16n.[3] GDDR5 only has a double data rate mode with an 8n prefetch.[4]
It seemingly makes up for that with increased clock speeds and lower power usage via the smaller process node.
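To put rough numbers on the bandwidth side (again from the published specs, a quick sanity check rather than a measurement):

```
#include <cstdio>

// Peak memory bandwidth = (bus width in bits / 8) * effective data rate (Gbit/s per pin).
int main()
{
    struct Card { const char* name; int bus_bits; double gbps; };
    const Card cards[] = {
        { "GTX 1080 (GDDR5X)", 256, 10.0 },
        { "GTX 1070 (GDDR5) ", 256,  8.0 },
        { "GTX 980 Ti       ", 384,  7.0 },
        { "GTX Titan X      ", 384,  7.0 },
    };
    for (const Card& c : cards)
        printf("%s  %.0f GB/s\n", c.name, c.bus_bits / 8.0 * c.gbps);
    return 0;
}
```

So the 1080 gives up a little bandwidth and the 1070 quite a bit relative to the 980 Ti / Titan X, even though both gain on raw ALU throughput.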
Am I missing something? I’m going to be even more speculative about the 1070’s performance…
And why won’t NVIDIA comment immediately about FP16 on the 1070/1080?