Nvidia Pascal TITAN Xp, TITAN X, GeForce GTX 1080 Ti, GTX 1080, GTX 1070, GTX 1060, GTX 1050 & GT 1030

allanmac · May 17, 2016, 9:37pm

So I guess it’s a relief for everyone that sm_61 retains sm_52’s 96 KB of shared memory?

Based on today’s reviews and the deviceQuery screenshot by @NVD, the summary of sm_61 so far is:

128 cores per SMP
64K registers
96KB of shared memory with a 48KB block limit
maximum of 1024 threads per block
maximum of 2048 resident threads per SMP
~~half and half2 operations~~ (*)

It sure seems similar to sm_52.

Hopefully there aren’t any lurking surprises for CUDA devs. :)

(*) ~~Unconfirmed in GP104~~ Not included in GP104
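For anyone who wants to see what those limits imply per block, here is a minimal sketch that treats the figures in the summary above as the only constraints. The helper name `blocks_per_sm` is hypothetical, and the model deliberately ignores register/shared-memory allocation granularity and any separate cap on resident blocks per SM, so treat it as back-of-envelope arithmetic rather than a real occupancy calculator.

```python
# Figures copied from the sm_61 summary in this post.
REGISTERS_PER_SM = 64 * 1024            # 32-bit registers
SHARED_BYTES_PER_SM = 96 * 1024
MAX_SHARED_BYTES_PER_BLOCK = 48 * 1024
MAX_THREADS_PER_BLOCK = 1024
MAX_THREADS_PER_SM = 2048


def blocks_per_sm(threads_per_block, regs_per_thread, shared_bytes_per_block):
    """Hypothetical helper: resident blocks per SM under the listed limits only.

    Assumption: no allocation granularity and no extra per-SM block cap.
    """
    if not 0 < threads_per_block <= MAX_THREADS_PER_BLOCK:
        return 0  # block cannot launch at all
    if shared_bytes_per_block > MAX_SHARED_BYTES_PER_BLOCK:
        return 0  # exceeds the 48 KB per-block limit
    limits = [MAX_THREADS_PER_SM // threads_per_block]
    if regs_per_thread > 0:
        limits.append(REGISTERS_PER_SM // (threads_per_block * regs_per_thread))
    if shared_bytes_per_block > 0:
        limits.append(SHARED_BYTES_PER_SM // shared_bytes_per_block)
    return min(limits)


if __name__ == "__main__":
    # Registers available per thread when all 2048 threads are resident.
    print(REGISTERS_PER_SM // MAX_THREADS_PER_SM)
    # Two 1024-thread blocks, each using the full 48 KB, fill the 96 KB.
    print(blocks_per_sm(1024, 32, 48 * 1024))
```

Under this simplified model, a kernel has to stay at or below 32 registers per thread to keep 2048 threads resident, and it takes at least two resident blocks to touch all 96 KB of shared memory.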

BulatZiganshin · May 17, 2016, 10:22pm

I don’t see any differences at all. Why do you think that it has fp16 and especially 2xfp16 operations? They are easily emulated in PTX, like the Kepler SIMD instructions that were dropped in Maxwell.

allanmac · May 17, 2016, 10:33pm

@BZ, the half/half2 operations are in sm_53 and GP100 but you’re right — I haven’t yet seen any explicit confirmation that fp16/fp16x2 are in GP104.

I’ll strike that line and put an asterisk next to it! :)

Meanwhile, here are two always glorious (estimated) block diagrams of GP104 by Hiroshige Goto (from his article here):

(PDF)

BulatZiganshin · May 17, 2016, 11:01pm

p03 is obviously a GP100 engine. The only thing that I don’t know is its L2 cache size, but 128 KB of register memory per scheduler and 64 KB of shared memory per SM is definitely GP100. So he tries to guess how GP104 was made from GP100 by disabling some parts? Of course it wasn’t made that way; GP104 packs ALUs much more densely than GP100.

OTOH, the difference between SM 5.2 and 6.0 isn’t really that big. SM 5.2 shares a single 96 KB shared memory block among 4 schedulers; in 6.0 there are two 64 KB blocks, each shared by only 2 schedulers. That’s all. So Nvidia increased the shared memory bandwidth, but in a different way than in Kepler. There are already plenty of resources shared by only 2 schedulers (L1$, DP ALUs…), so SM 6.0 just decreased sharing a little.

The article’s phrase “Was introduced in Pascal, 2-way SIMD (Single Instruction, Multiple Data) of FP16 specifications, will be taken over even GP104” probably just reflects his misunderstanding: there is no universal “Pascal” SM at all.

SPWorley · May 18, 2016, 12:46am

It’s starting to look like a good time to invest in gold bullion. :-(

scottgray · May 18, 2016, 1:01am

fp16x2 support is actually pretty difficult to detect. I won’t trust anyone else’s opinion on the matter until I have one in hand, or there’s some official word on it.

So the only question now really seems to be whether GP104 is a die shrink of sm_52 or sm_53. I would guess (or maybe hope) it’s sm_53. At the very least we do have sm_53 right now and can practice writing kernels for when it is finally released widely in a Pascal part.

BulatZiganshin · May 18, 2016, 11:20am

If the CUDA 8 EAP can generate SM 6.1 SASS, this can be checked even without Pascal hardware.

NVD · May 18, 2016, 3:53pm

http://www.geforce.com/hardware/10series/geforce-gtx-1070

GTX 1070 specs released.

sam_hawker · May 18, 2016, 4:01pm

Apparently the GTX 1080 will have a texture sampling rate of 277 GTexels/sec. That’s ridiculous! If I can realize that in my application it will be amazing but without a corresponding increase in memory bandwidth it will be a real challenge.
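A rough sketch of why that is a challenge. Only the 277 GTexels/sec figure comes from this post; the 320 GB/s memory bandwidth and the 4 bytes per texel are assumptions plugged in for illustration, and the model simply charges every sampled texel its full size in fetch traffic. The function names are hypothetical.

```python
def required_gb_per_s(gtexels_per_s, bytes_per_texel):
    """Fetch traffic (GB/s) if every sampled texel had to be read from somewhere."""
    return gtexels_per_s * bytes_per_texel


def min_cache_hit_rate(gtexels_per_s, bytes_per_texel, dram_gb_per_s):
    """Fraction of texel fetches that must hit in cache to sustain peak rate.

    Assumption: misses are served at exactly dram_gb_per_s and nothing
    else competes for memory bandwidth.
    """
    needed = required_gb_per_s(gtexels_per_s, bytes_per_texel)
    if needed <= dram_gb_per_s:
        return 0.0
    return 1.0 - dram_gb_per_s / needed


if __name__ == "__main__":
    # 277 GTexels/s is from the post; 4 B/texel and 320 GB/s are assumed inputs.
    print(required_gb_per_s(277, 4))
    print(min_cache_hit_rate(277, 4, 320))
```

With those assumed inputs the texture units could ask for far more data than DRAM can deliver, so the peak rate is only reachable when most fetches hit in the texture/L2 caches.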

allanmac · May 18, 2016, 4:19pm

Great idea – I tried sm_60 through sm_65 weeks ago.

All were rejected. :)

We can all try again when CUDA 8.0 RC is available.

BulatZiganshin · May 18, 2016, 9:42pm

allanmac · May 18, 2016, 11:08pm

No mention of fp16 in that whitepaper. :(

allanmac · May 19, 2016, 6:06pm

In lieu of fp16x2 I will accept 8-bit and 16-bit normalized integer 32-bit wide SIMD operations (e.g. 8-bit normalized: 255*255 = 255).

Please include add, sub, mul and mad ops.

That will be all. :)
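To make the “255*255 = 255” example concrete, here is a minimal sketch of an 8-bit normalized multiply and a 4-lane version packed into a 32-bit word. The function names are hypothetical, and the round-to-nearest convention `(a*b + 127) // 255` is one common choice, not something specified in the request above.

```python
def unorm8_mul(a, b):
    """Multiply two 8-bit normalized values (0..255 represents 0.0..1.0)."""
    return (a * b + 127) // 255  # divide by 255 with rounding to nearest


def unorm8x4_mul(x, y):
    """Lane-wise normalized multiply of four 8-bit values packed in 32 bits."""
    out = 0
    for lane in range(4):
        shift = 8 * lane
        a = (x >> shift) & 0xFF
        b = (y >> shift) & 0xFF
        out |= unorm8_mul(a, b) << shift
    return out


if __name__ == "__main__":
    print(unorm8_mul(255, 255))                        # 1.0 * 1.0 stays 1.0
    print(hex(unorm8x4_mul(0xFFFFFFFF, 0x80402010)))   # multiplying by 1.0 per lane
```

The same idea extends to two 16-bit lanes by dividing by 65535 instead of 255.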

SPWorley · May 21, 2016, 7:21am

Unfortunately, if fp16 costs too many transistors (needing two 11x11-bit multipliers), then int16 is even more costly (needing two 16x16 multipliers). Four 8x8-bit multipliers are likely about the same transistor complexity as fp16.
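A crude way to sanity-check that comparison is to count partial-product bits, assuming a plain n x n array multiplier costs roughly n*n cells. That proxy is an assumption for illustration (real multiplier designs and the rest of the fp16 datapath are not modeled), and the helper name is hypothetical.

```python
def partial_product_bits(width, lanes):
    """Rough area proxy: lanes * width^2 partial-product bits."""
    return lanes * width * width


# Multiplier widths and lane counts as stated in the post above.
costs = {
    "fp16x2 (two 11x11)": partial_product_bits(11, 2),
    "int16x2 (two 16x16)": partial_product_bits(16, 2),
    "int8x4 (four 8x8)": partial_product_bits(8, 4),
}

if __name__ == "__main__":
    for name, bits in costs.items():
        print(f"{name}: {bits}")
```

By this proxy the four 8x8 multipliers land within a few percent of the two 11x11 ones, while the two 16x16 multipliers cost roughly twice as much, which lines up with the estimate in the post.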

allanmac · May 21, 2016, 1:39pm

Ah, but dumping the SASS shows Maxwell already has 16 bit multipliers, right?

An extra add or two and a shift should make them suitable for normalized integer ops. An add here, a shift there, and next thing you know…

Ha, I wasn’t entirely serious about my request for normalized ints… just lamenting the lack of fp16x2 in the GP104. =)

Clochette · May 22, 2016, 4:51pm

Four days till release and still no info on double-rate FP16? The suspense is killing me (and the US economy).

CudaaduC · May 22, 2016, 6:00pm

If NVIDIA’s stock performance over the last 12 months is any indication (up over 100% since this time last year):

http://bigcharts.marketwatch.com/quickchart/quickchart.asp?symb=nvda&insttype=&freq=&show=

then the US economy is doing well enough to avoid NIRP.

Clochette · May 22, 2016, 6:52pm

Yeah, NVIDIA’s riding that deep learning express with no competitors in sight (except maybe Nervana in 2017, but no one uses Neon even though it is the performance frontrunner, sorry Scott Gray. Functional API when?). $129k for 170 TFLOPS, what the !%@#&^.

scottgray · May 22, 2016, 8:01pm

I just got through a bunch of code cleanup/ refactoring/ unit test writing. It should be much easier to wrap my work in an API with the next release. Neon now has a lot more engineers working on it and will start getting much nicer pretty quickly. This will include a full graph backend this summer.

Nvidia is just charging that much because it can… but I think they’re just motivating people to find clever ways to max out the usefulness of the consumer cards. Also, I wouldn’t count AMD out.

Gogar · May 25, 2016, 7:42pm

Will the GTX 1080 support the FP64 atomic add instruction introduced with the GP100? That sounded pretty useful to avoid loss of precision in the final step of a reduction kernel.

While the 96KB shared memory is attractive, it seems a bit disappointing if the GP104 is indeed more similar to Maxwell than to the GP100. I’m guessing I wasn’t the only one who was already gearing up to start utilizing FP16 GEMMs.
