I don’t see any differences at all. Why do you think it has fp16, and especially 2xfp16, operations? They are easily emulated by PTX, like the Kepler SIMD commands that were dropped in Maxwell.
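A minimal sketch of what that emulation amounts to (the helper name is mine): storage stays fp16, but the actual math is promoted to fp32, which is roughly what the compiled PTX boils down to on hardware without native fp16 ALUs.

```cuda
#include <cuda_fp16.h>

// Sketch of fp16 "emulation": keep values in half storage, do the
// arithmetic in fp32 via conversion intrinsics. Hardware without native
// fp16 ALUs executes essentially this sequence anyway.
__device__ __half hadd_emulated(__half a, __half b) {
    return __float2half(__half2float(a) + __half2float(b));
}
```

So the mere presence of fp16 instructions at the PTX level tells you nothing about whether the chip has native fp16 ALUs.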
P03 is obviously a GP100 engine. The only thing I don’t know is its L2 cache size, but 128 KB of register file per scheduler and 64 KB of shared memory per SM is definitely GP100. So he tries to guess how GP104 was made from GP100 by disabling some parts? Of course it wasn’t; GP104 packs its ALUs much more densely than GP100.
OTOH, the difference between SM 5.2 and 6.0 isn’t really that big. SM 5.2 shared a single 96 KB shared memory block among 4 schedulers; in 6.0 there are two 64 KB blocks, each shared by only 2 schedulers. That’s all. So Nvidia increased the shared memory bandwidth, but in a different way than in Kepler. There are already plenty of resources shared by only 2 schedulers (L1$, DP ALUs…), so SM 6.0 just decreased the sharing a little.
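Once real hardware shows up, the runtime will tell you which layout a part actually has. A quick check via the standard device-properties query:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Print the compute capability and the shared memory actually exposed
// per SM and per block, for device 0.
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("sm_%d%d: %zu KB shared per SM, %zu KB max per block\n",
           prop.major, prop.minor,
           prop.sharedMemPerMultiprocessor / 1024,
           prop.sharedMemPerBlock / 1024);
    return 0;
}
```

On sm_52 this reports 96 KB per SM (with a 48 KB per-block limit); a GP100-style SM should report 64 KB per SM instead.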
Article phrase “Was introduced in Pascal, 2-way SIMD (Single Instruction, Multiple Data) of FP16 specifications, will be taken over even GP104” probably just reflects his misunderstading - there is no universal “Pascal” SM at all
fp16x2 support is actually pretty difficult to detect. I won’t trust anyone else’s opinion on the matter till I have one in hand, or until there’s some official word on it.
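One rough way to test it once you do have hardware (just a sketch, names and launch sizes are mine): time a dependent chain of fp16x2 FMAs against the same chain in fp32. If the ratio comes out near 1, HFMA2 is presumably native and full-rate; if it’s many times above 1, fp16x2 is throttled or emulated.

```cuda
#include <cstdio>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

#define CHAIN 1024  // length of the dependent FMA chain per thread

// Dependent chain of fp16x2 FMAs; needs -arch=sm_53 or newer to compile.
// a and b are runtime arguments so the compiler can't constant-fold the loop.
__global__ void h2_chain(__half2* out, float a, float b) {
    __half2 ha = __float2half2_rn(a), hb = __float2half2_rn(b);
    __half2 x = ha;
    #pragma unroll
    for (int i = 0; i < CHAIN; ++i)
        x = __hfma2(x, ha, hb);
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;  // keep every thread's work live
}

// The same chain in fp32 as the reference.
__global__ void f_chain(float* out, float a, float b) {
    float x = a;
    #pragma unroll
    for (int i = 0; i < CHAIN; ++i)
        x = __fmaf_rn(x, a, b);
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

int main() {
    const int GRID = 256, BLOCK = 256;
    __half2* dh; float* df;
    cudaMalloc(&dh, GRID * BLOCK * sizeof(__half2));
    cudaMalloc(&df, GRID * BLOCK * sizeof(float));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    // Warm-up launches so driver/JIT overhead isn't in the measurement.
    h2_chain<<<GRID, BLOCK>>>(dh, 1.0f, 0.125f);
    f_chain<<<GRID, BLOCK>>>(df, 1.0f, 0.125f);
    cudaDeviceSynchronize();

    float ms_h2, ms_f32;
    cudaEventRecord(t0);
    h2_chain<<<GRID, BLOCK>>>(dh, 1.0f, 0.125f);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms_h2, t0, t1);

    cudaEventRecord(t0);
    f_chain<<<GRID, BLOCK>>>(df, 1.0f, 0.125f);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms_f32, t0, t1);

    printf("fp16x2: %.3f ms  fp32: %.3f ms  ratio: %.2f\n",
           ms_h2, ms_f32, ms_h2 / ms_f32);
    return 0;
}
```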
So the only real question now seems to be whether GP104 is a die shrink of sm_52 or of sm_53. I would guess (or maybe hope) it’s sm_53. At the very least we do have sm_53 hardware right now and can practice writing kernels for when fp16x2 finally ships widely in a Pascal part.
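For instance, a toy fp16x2 AXPY (just a sketch, the names are mine) that compiles for sm_53 today and should carry over unchanged:

```cuda
#include <cuda_fp16.h>

// Toy fp16x2 AXPY: y = alpha*x + y, each thread handling one __half2
// (a pair of fp16 values), so n2 is the element count in half2 units.
__global__ void h2_axpy(int n2, float alpha, const __half2* x, __half2* y) {
    __half2 a2 = __float2half2_rn(alpha);   // broadcast alpha into both halves
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2)
        y[i] = __hfma2(a2, x[i], y[i]);     // two fp16 FMAs per instruction on sm_53+
}
```

Launch with something like `h2_axpy<<<(n2 + 255) / 256, 256>>>(n2, 2.0f, d_x, d_y)`.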
Apparently the GTX 1080 will have a texture sampling rate of 277 GTexels/sec. That’s ridiculous! If I can realize that in my application it will be amazing, but without a corresponding increase in memory bandwidth it will be a real challenge.
Unfortunately, if fp16 costs too many transistors (needing two 11x11-bit multipliers for the 10-bit mantissa plus the implicit leading bit), then int16 is even more costly (needing two 16x16 multipliers). Four 8x8-bit multipliers are likely about the same transistor complexity as fp16.
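Rough count, assuming multiplier area scales with the product of the operand widths: two 11x11s ≈ 242 unit cells, four 8x8s ≈ 256, and two 16x16s ≈ 512, so int8x4 comes out about even with fp16x2 while int16x2 roughly doubles the multiplier cost.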
Yeah, NVIDIA’s riding that deep learning express with no competitors in sight (except maybe Nervana in 2017, but no one uses Neon even though it’s the performance frontrunner; sorry, Scott Gray. Functional API when?). $129k for 170 TFLOPS, what the !%@#&^.
I just got through a bunch of code cleanup/refactoring/unit-test writing. It should be much easier to wrap my work in an API with the next release. Neon now has a lot more engineers working on it and will start getting much nicer pretty quickly. This will include a full graph backend this summer.
Nvidia is charging that much simply because it can… but I think they’re mostly motivating people to find clever ways to max out the usefulness of the consumer cards. Also, I wouldn’t count AMD out.
Will the GTX 1080 support the FP64 atomic add instruction introduced with the GP100? That sounded pretty useful to avoid loss of precision in the final step of a reduction kernel.
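Even if it doesn’t, the CAS-loop fallback from the CUDA C Programming Guide gives correct (if slower, especially under contention) FP64 atomic adds on anything sm_20 and up:

```cuda
#include <cuda_runtime.h>

// CAS-based double-precision atomicAdd fallback, as given in the CUDA C
// Programming Guide. Named distinctly here so it doesn't clash with the
// native double atomicAdd when compiling for sm_60+.
__device__ double atomicAddDouble(double* address, double val) {
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // Reinterpret the bits, add in fp64, and try to swap the result in;
        // retry if another thread changed the value in the meantime.
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);  // previous value, like native atomicAdd
}
```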
While the 96KB shared memory is attractive, it seems a bit disappointing if the GP104 is indeed more similar to Maxwell than to the GP100. I’m guessing I wasn’t the only one who was already gearing up to start utilizing FP16 GEMMs.
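FWIW, fp16-storage GEMM is already reachable through cuBLAS regardless of how GP104 turns out: cublasSgemmEx takes __half inputs/outputs with fp32 compute on Maxwell and up, while true half-math cublasHgemm needs sm_53. A minimal sketch, using the CUDA 8 style type enums and assuming d_A, d_B, d_C are column-major device buffers already filled in:

```cuda
#include <cublas_v2.h>
#include <cuda_fp16.h>

// fp16-storage GEMM sketch: C = 1.0*A*B + 0.0*C, with A, B, C stored as
// __half and the accumulation done in fp32.
void half_gemm(cublasHandle_t handle, int m, int n, int k,
               const __half* d_A, const __half* d_B, __half* d_C) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                  &alpha, d_A, CUDA_R_16F, m,
                          d_B, CUDA_R_16F, k,
                  &beta,  d_C, CUDA_R_16F, m);
}
```

Half storage alone already halves the memory traffic of the GEMM even when the math units stay fp32.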