So I guess it’s a relief for everyone that sm_61 retains sm_52’s 96 KB of shared memory?
Based on today’s reviews and the deviceQuery screenshot by @NVD, the summary of sm_61 so far is:
- 128 cores per SMP
- 64K registers
- 96KB of shared memory with a 48KB block limit
- maximum of 1024 threads per block
- maximum of 2048 resident threads per SMP
- half and half2 operations (*)
It sure seems similar to sm_52.
Hopefully there aren’t any lurking surprises for CUDA devs. :)
(*) Unconfirmed in GP104. Update: not included in GP104.
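For anyone who wants to see how those per-SM limits interact, here is a rough sketch that plugs the numbers from the list above into a back-of-the-envelope occupancy bound. The function name `resident_blocks` and the example kernels are hypothetical, and the model deliberately ignores register/shared-memory allocation granularity and the hardware cap on blocks per SM, so treat the results as an upper bound rather than what a real occupancy calculator would report.

```python
# Per-SM limits taken from the sm_61 summary in this thread.
REGS_PER_SM = 64 * 1024            # 32-bit registers
SMEM_PER_SM = 96 * 1024            # bytes
SMEM_PER_BLOCK_MAX = 48 * 1024     # bytes
THREADS_PER_BLOCK_MAX = 1024
THREADS_PER_SM = 2048


def resident_blocks(threads_per_block, regs_per_thread, smem_per_block):
    """Upper bound on resident blocks per SM for a hypothetical kernel.

    Assumption: allocation granularity and the hardware limit on blocks
    per SM are ignored, so real occupancy can be lower than this.
    """
    if threads_per_block > THREADS_PER_BLOCK_MAX:
        raise ValueError("too many threads per block")
    if smem_per_block > SMEM_PER_BLOCK_MAX:
        raise ValueError("block exceeds the 48 KB shared memory limit")
    by_threads = THREADS_PER_SM // threads_per_block
    by_regs = REGS_PER_SM // (threads_per_block * regs_per_thread)
    # A block that uses no shared memory is not limited by it.
    by_smem = SMEM_PER_SM // smem_per_block if smem_per_block else by_threads
    return min(by_threads, by_regs, by_smem)


# 1024 threads, 32 registers each, the full 48 KB: every limit allows 2 blocks.
print(resident_blocks(1024, 32, 48 * 1024))   # 2
# 256 threads, 64 registers each, 16 KB: registers are the bottleneck.
print(resident_blocks(256, 64, 16 * 1024))    # 4
```

The first case is the interesting one: with 96 KB per SM and a 48 KB block limit, two maximum-size blocks can be resident at once, provided each thread stays at or below 32 registers.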
I don’t see any differences at all. Why do you think it has fp16, and especially 2xfp16, operations? They are easily emulated by PTX, like the Kepler SIMD instructions that were dropped in Maxwell.
@BZ, the half/half2 operations are in sm_53 and GP100 but you’re right — I haven’t yet seen any explicit confirmation that fp16/fp16x2 are in GP104.
I’ll strike that line and put an asterisk next to it! :)
Meanwhile, here are two always glorious (estimated) block diagrams of GP104 by Hiroshige Goto (from his article here):
p03 is obviously a GP100 engine. The only thing I don’t know is its L2 cache size, but 128 KB of register memory per scheduler and 64 KB of shared memory per SM is definitely GP100. So is he trying to guess how GP104 was made from GP100 by disabling some parts? Of course it wasn’t: GP104 packs ALUs much more densely than GP100.
OTOH, the difference between SM 5.2 and 6.0 isn’t really that big. SM 5.2 shares a single 96 KB shared memory block among 4 schedulers; in 6.0 there are two 64 KB blocks, each shared by only 2 schedulers. That’s all. So Nvidia increased the shared memory bandwidth, but in a different way than in Kepler. There are already plenty of resources shared by only 2 schedulers (L1$, DP ALUs…), so SM 6.0 just decreased sharing a little.
The article’s phrase “Was introduced in Pascal, 2-way SIMD (Single Instruction, Multiple Data) of FP16 specifications, will be taken over even GP104” probably just reflects his misunderstanding: there is no universal “Pascal” SM at all.
It’s starting to look like a good time to invest in gold bullion. :-(
fp16x2 support is actually pretty difficult to detect. I won’t trust anyone else’s opinion on the matter until I have one in hand or there’s some official word on it.
So the only real question now seems to be whether GP104 is a die shrink of sm_52 or sm_53. I would guess (or maybe hope) it’s sm_53. At the very least we do have sm_53 right now and can practice writing kernels for when it finally is released widely in a Pascal part.
If the CUDA 8 EAP can generate SM 6.1 SASS, it can be checked even without Pascal hardware.
Apparently the GTX 1080 will have a texture sampling rate of 277 GTexels/sec. That’s ridiculous! If I can realize that in my application it will be amazing, but without a corresponding increase in memory bandwidth it will be a real challenge.
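To put a number on that challenge, here is a quick sketch of how much DRAM traffic each sampled texel could afford at the quoted rate. The 320 GB/s memory bandwidth is an assumption on my part (it is not stated anywhere in this thread), and the helper names are made up for illustration.

```python
TEXEL_RATE = 277e9       # texels/s, the figure quoted in this thread
MEM_BANDWIDTH = 320e9    # bytes/s, ASSUMED value, not from this thread


def dram_bytes_per_texel(texel_rate, bandwidth):
    """DRAM bytes available per sampled texel at the peak sampling rate."""
    return bandwidth / texel_rate


def required_hit_rate(bytes_per_texel, texel_rate, bandwidth):
    """Fraction of texel traffic the caches must absorb to sustain the peak rate."""
    needed = bytes_per_texel * texel_rate
    return max(0.0, 1.0 - bandwidth / needed)


print(f"{dram_bytes_per_texel(TEXEL_RATE, MEM_BANDWIDTH):.2f}")   # 1.16
# Hypothetical 4-byte (e.g. RGBA8) texels:
print(f"{required_hit_rate(4, TEXEL_RATE, MEM_BANDWIDTH):.2f}")   # 0.71
```

Under those assumptions there is only a bit more than one byte of DRAM bandwidth per texel, so a kernel sampling 4-byte texels would need roughly 71% of its texel traffic served from cache to get anywhere near the peak rate.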
Great idea – I tried sm_60 through sm_65 weeks ago.
All were rejected. :)
We can all try again when CUDA 8.0 RC is available.
No mention of fp16 in that whitepaper. :(
In lieu of fp16x2 I will accept 8-bit and 16-bit normalized integer 32-bit wide SIMD operations (e.g. 8-bit normalized: 255*255 = 255).
Please include add, sub, mul and mad ops.
That will be all. :)
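In case the "255*255 = 255" shorthand is unclear, here is a small software model of what such packed ops could look like: four normalized 8-bit lanes in one 32-bit word, with add, sub, mul and mad. All of the names (`vadd4`, `vmul4`, and so on) are hypothetical, and the saturating behaviour of add/sub/mad is my assumption, not something any hardware is confirmed to do.

```python
def unorm8_mul(a, b):
    """Normalized 8-bit multiply: 0..255 stands for 0.0..1.0, round to nearest."""
    return (a * b + 127) // 255


def unpack4(word):
    """Split a 32-bit word into four 8-bit lanes, lowest byte first."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def pack4(lanes):
    """Pack four 8-bit lanes (lowest byte first) into a 32-bit word."""
    word = 0
    for i, lane in enumerate(lanes):
        word |= (lane & 0xFF) << (8 * i)
    return word


def clamp(v):
    """Saturate to the 0..255 lane range (assumed behaviour)."""
    return max(0, min(255, v))


def vadd4(x, y):
    return pack4([clamp(a + b) for a, b in zip(unpack4(x), unpack4(y))])


def vsub4(x, y):
    return pack4([clamp(a - b) for a, b in zip(unpack4(x), unpack4(y))])


def vmul4(x, y):
    return pack4([unorm8_mul(a, b) for a, b in zip(unpack4(x), unpack4(y))])


def vmad4(x, y, z):
    lanes = zip(unpack4(x), unpack4(y), unpack4(z))
    return pack4([clamp(unorm8_mul(a, b) + c) for a, b, c in lanes])


print(unorm8_mul(255, 255))                  # 255, i.e. 1.0 * 1.0 = 1.0
print(hex(vmul4(0x80808080, 0x80808080)))    # 0x40404040, about 0.5 * 0.5
```

Multiplying by 0xFFFFFFFF acts as a lane-wise multiply by 1.0 and returns the other operand unchanged, which is the property that makes the normalized format convenient for blending-style math.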
Unfortunately, if fp16 costs too many transistors (needing two 11x11-bit multipliers) then int16 is even more costly (needing two 16x16-bit multipliers). Four 8x8-bit multipliers are likely about the same transistor complexity as fp16.
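A very crude way to sanity-check that comparison is to count partial-product bits, on the assumption that an n x n array multiplier scales roughly with n*n. This is only a proxy of my own choosing: it ignores exponent handling, normalization, rounding, and any Booth-style encoding, so it says nothing definitive about real transistor counts.

```python
def partial_product_bits(width, count):
    """Crude area proxy: `count` multipliers, each forming width*width partial-product bits."""
    return count * width * width


# fp16 stores 10 fraction bits; the implicit leading 1 makes an 11-bit significand.
FP16_SIGNIFICAND_BITS = 10 + 1

costs = {
    "fp16x2": partial_product_bits(FP16_SIGNIFICAND_BITS, 2),   # 242
    "int16x2": partial_product_bits(16, 2),                      # 512
    "int8x4": partial_product_bits(8, 4),                        # 256
}

for name, bits in costs.items():
    print(name, bits)
```

By this measure four 8x8 multipliers (256) land within a few percent of two 11x11 multipliers (242), while two 16x16 multipliers (512) cost about twice as much.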
Ah, but dumping the SASS shows Maxwell already has 16 bit multipliers, right?
An extra add or two and a shift should make them suitable for normalized integer ops. An add here, a shift there, and next thing you know…
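For the 8-bit case, the "add here, shift there" version is a well-known trick: the divide by 255 that a normalized multiply needs can be replaced by two adds and two shifts. The sketch below checks it exhaustively against a plain rounded division; the function names are just illustrative, and I have not tried to show that any particular GPU implements it this way.

```python
def unorm8_mul_shift(a, b):
    """Normalized 8-bit multiply using only a multiply, two adds and two shifts."""
    t = a * b + 128
    return (t + (t >> 8)) >> 8


def unorm8_mul_ref(a, b):
    """Reference: round(a * b / 255) in exact integer arithmetic."""
    return (2 * a * b + 255) // 510


mismatches = sum(
    unorm8_mul_shift(a, b) != unorm8_mul_ref(a, b)
    for a in range(256)
    for b in range(256)
)
print(mismatches)                    # 0
print(unorm8_mul_shift(255, 255))    # 255
```

All 65,536 input pairs agree with the reference, so for 8-bit lanes the normalization really is just an add and a shift on top of an ordinary integer multiply.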
Ha, I wasn’t entirely serious about my request for normalized ints… just lamenting the lack of fp16x2 in the GP104. =)
Four days till release and still no info on double-rate FP16? The suspense is killing me (and the US economy).
If NVIDIA’s stock performance for the last 12 months is any indication (up over 100% since this time last year), then the US economy is doing well enough to avoid NIRP.
Yeah, NVIDIA’s riding that deep learning express with no competitors in sight (except maybe Nervana in 2017, but no one uses Neon even though it’s the performance frontrunner. Sorry, Scott Gray. Functional API when?). $129k for 170 TFLOPS, what the !%@#&^.
I just got through a bunch of code cleanup, refactoring, and unit test writing. It should be much easier to wrap my work in an API with the next release. Neon now has a lot more engineers working on it and will start getting much nicer pretty quickly. This will include a full graph backend this summer.
Nvidia is just charging that much because it can… but I think they’re just motivating people to find clever ways to max out the usefulness of the consumer cards. Also, I wouldn’t count AMD out.
Will the GTX 1080 support the FP64 atomic add instruction introduced with the GP100? That sounded pretty useful to avoid loss of precision in the final step of a reduction kernel.
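To illustrate the precision concern, here is a host-side sketch of what happens when partial sums are combined into a single-precision accumulator versus a double-precision one. It only models the rounding, not the atomics themselves; `to_f32` and `accumulate` are hypothetical helpers, and the input values are chosen to make the effect obvious rather than taken from any real kernel.

```python
import struct


def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(partials, single_precision):
    """Add partial sums one at a time, as a final-step accumulator would."""
    acc = 0.0
    for p in partials:
        acc = acc + p
        if single_precision:
            acc = to_f32(acc)   # emulate rounding of an FP32 accumulator
    return acc


# One large partial sum followed by many small ones (illustrative values).
partials = [2.0 ** 24] + [1.0] * 100

print(accumulate(partials, single_precision=True))    # 16777216.0
print(accumulate(partials, single_precision=False))   # 16777316.0
```

In single precision every one of the 100 small contributions is rounded away once the accumulator reaches 2^24, while the double-precision accumulator keeps all of them, which is the case an FP64 atomic add in the last stage of a reduction would help with.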
While the 96KB shared memory is attractive, it seems a bit disappointing if the GP104 is indeed more similar to Maxwell than to the GP100. I’m guessing I wasn’t the only one who was already gearing up to start utilizing FP16 GEMMs.