Nvidia Pascal TITAN Xp, TITAN X, GeForce GTX 1080 Ti, GTX 1080, GTX 1070, GTX 1060, GTX 1050 & GT 1030

Where is this claim published? Did a cursory search but didn’t find anything on it.

I am not sure why some people are surprised about the GTX 1070 / GTX 1080 specifications. For the market targeted by these cards (high-end consumer graphics; i.e. predominantly gaming) both double precision and FP16 are largely irrelevant. However, some (fairly minimal) level of hardware support for both of these has to be implemented in order to provide clean architectural abstraction.

Obviously, leaving these features out reduces die size, increases yield, and reduces manufacturing cost. Insofar as TSMC’s supply of wafers on their cutting-edge process is limited, it can also mean higher unit numbers available to NVIDIA. It possibly also has a minor positive effect on power consumption and operating frequencies. While features that aren’t used are normally clocked off (and powered down), a smaller die means on average shorter wiring, thus less parasitic capacitance, thus reduced power loss and faster signal propagation. Narrower memory interfaces require fewer I/O pins, reducing package cost and possibly the power consumption of the memory interface.

I seem to recall stories going back several years about the European Union considering imposing power limits on various PC components, so longer term the power efficiency of their GPUs may be as important to NVIDIA’s bottom line as it is to the customer’s electricity bill. Clearly it will also help in the embedded market when these GPU designs trickle down to integrated parts with ARM CPUs.

We all started from an assumption… sm_53 on Tegra X1 was a Maxwell SM design with full rate fp16x2, and P100 was announced with fp16x2. We all just assumed fp16x2 was part of Pascal’s default architecture. But once GTX 1080 was launched it seemed missing, and we’ve all been assembling clues about what fp16x2 support exists in the GP104 hardware.

One clue is that NVidia itself never explicitly stated GP104 would have fp16x2, and in fact mfatica says in this thread there’s “no fast fp16 in GP104”. But Scott and Allan here have dug into the SASS output and show that fp16x2 SASS instructions are not emulated, they’re explicit (which doesn’t mean there are a full set of fp16x2 ALUs, but does support the idea that they do exist on the hardware). And finally, answering your question, Ryan Smith from Anandtech explicitly reports that the fp16x2 on GP104 is artificially limited.
Extra clues: multiple sources support the idea that the fp16x2 units are inside the fp32 cores, not external like the fp64 cores, also supporting the “artificially limited” idea.
Another very indirect clue: even after compensating for the L2 cache size increase, GP104 uses about 10% more transistors per core than GM204 or GM200. The extra transistor budget could easily be part of Pascal’s other features or even in helping with frequency optimization, but it is also supportive of the idea of extra logic per fp32 core.

The most likely answer is the simplest: GP104 has no fast fp16x2 onboard and we’re all just spinning our overexcited speculation wheels. On this forum we’re all CUDA geeks so we’re just hopeful because fp16x2 is an exciting new feature and we want to play with it.
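
For anyone who wants to check this themselves, a minimal half2 kernel is enough to see what the compiler emits; this is just a sketch (the kernel name and build commands are my own, assuming CUDA 8 with cuda_fp16.h and a Pascal compile target):

[code]
// Hypothetical probe: build with "nvcc -arch=sm_61 -cubin fp16x2_probe.cu"
// and inspect with "cuobjdump -sass fp16x2_probe.cubin" to see whether the
// half2 math shows up as explicit HFMA2-style SASS or as fp32 emulation.
#include <cuda_fp16.h>

__global__ void fp16x2_probe(const __half2 *a, const __half2 *b, __half2 *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Packed fused multiply-add: two fp16 operations per instruction.
        c[i] = __hfma2(a[i], b[i], c[i]);
    }
}
[/code]

Whether that instruction then runs at full rate or an artificially reduced rate is a separate question, which is exactly what the measurements discussed in this thread try to pin down.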

The previous Tegra (cc 5.3) had double-rate fp16 performance; the new one (cc 6.2) will probably be the same. I agree with the rest of what njuffa said. This is a gaming card, and so is the Titan. For computations there are Tesla and Xeon Phi. We can use gaming cards for computations only because NVIDIA engineers are too lazy to disable CUDA completely.

This has nothing to do with “laziness”. Obviously, supporting CUDA across its entire universe of GPUs, instead of just professional GPUs, increases NVIDIA’s cost of providing CUDA. However, the positive trade-off for NVIDIA is that this has allowed CUDA to achieve significant market impact due to the excellent affordability of a parallel programming platform, starting at around $50 for a CUDA-capable low-end consumer card plus software provided free of charge. This has made it the dominant accelerator technology in the world today. This in turn allows NVIDIA to recoup initially higher costs via their top-of-the-line hardware. So “CUDA everywhere” is an approach that works well for NVIDIA.

Compare this to a competing accelerator technology that requires an initial expenditure of several thousand dollars for hardware and software combined, which has seen low market adoption rate in good part due to the high cost of entry.

I’m holding out for fp11x11x10.

:)

Competing accelerator technology? You mean OpenCL, which is supported on basic video cards? I don’t recall those costing thousands of dollars or anything.

I think we all just want transparency and explicit statements, especially w.r.t. intent.
Supporting CUDA across the entire universe of GPUs instead of just ‘professional’ GPUs does not increase the cost of anything. You make a microarchitecture with it baked in and call it a day. It’s a new architectural paradigm that lends itself to multiple functions.

Btw, when my GPU costs as much as my CPU or more, it had better do something besides increase my FPS.

If you are selling the coming HSA age, then you have to start crafting GPUs to be compute accelerators and not just graphics accelerators. Under such thinking, there are pure business/profit reasons to support CUDA from the $50 card all the way up to the $15,000 card, and I think this is what Nvidia is doing. It attracts new customers and demand, opens up new markets, and it’s why their stock is up over 100% from last year.

As I recall, OpenCL is supported on a slew of non-professional GPUs. OpenGL too. It’s an open compute standard and thus is openly supported on all cards that are considered to have value. The same is now the expectation for more generalized compute on the GPU. You can’t go about declaring GPUs to be world-changing general-purpose accelerator chips with one hand and then decide to smite their progress with the other… That’s a good way to let the competition gain ground.

So, I consider CUDA (NVIDIA) / OpenCL a given feature of modern-day GPUs.
Quadro (‘professional’ GPU cards) sell for reasons beyond what we’re discussing here, as they always have.
FP64 is not an expectation in GP104.
FP16 is at this juncture.

The fact that there is no explicit mention of FP16, and that it is questionable what a 1070/1080 can do in CUDA performance relative to a Titan X/980 Ti beyond consuming less power, makes it a questionable ‘purchase’ in some people’s eyes. I hope you can understand that.

You play games in tech, and eventually someone else will eat your lunch. There are more than enough features to work towards in future roadmaps without artificially gimping the capability of ‘supposedly game-changing’ hardware releases.

ECC/CRC error checking/quality/higher uptime/warranty support/etc. have been the clear demarcation between consumer and professional in most of tech. Professionals will continue to buy professional hardware. In the meantime, it wouldn’t be smart to gimp features that might attract future professional customers to your product line.

Business users are not all idiots. Smart humans don’t buy one big expensive box when many little boxes are exponentially better.

Even at 20 TFLOPS FP16, the P100 is of no benefit to real large-scale embarrassingly parallel problems. We can get more for less $ elsewhere.

Hiding fp16 support from the GTX line just means we buy AMD this generation.

I was referring to Intel’s accelerator products, i.e. MIC. Doesn’t seem to be getting much traction.

I do not see anything being “hidden”, whatever that means. FP16 is a feature important to certain markets only, and as far as I can see NVIDIA has achieved significant market share in those market segments with appropriately targeted products. It appears to have served them well, judging by the way their stock has been going in recent years.

Every consumer should of course acquire the hardware and software that best suits their needs. That does not mean that the preferred feature list of an individual consumer automatically translates into a market addressable by hardware vendors. Various presentations at GTC this year made it clear that NVIDIA takes the markets requiring FP16 performance extremely seriously, and I would be very surprised if no new products will be introduced for that space going forward.

Kind of… Business users have many other priorities besides raw performance. MTBF, for one, is something they’d like guaranteed for some number of years. GeForce cards either do not have as long an MTBF, or it isn’t publicised. I’ve heard many talks at GTC alluding to the fact that GeForce cards have a much higher failure rate, but I haven’t seen concrete info on that. The P100 also has HBM, so for memory-bound kernels that could make a large difference.

Does anyone know the behavior of the dpXa instructions as far as truncation goes? I see no documentation about them. Do they act similarly to what allanmac was saying, where 255*255=255?

That’s the old way of thinking. Modern thinkers pool independent risks. This means we don’t care about MTBF, because we have 100-way redundancy built in. Buy 100 GeForces/Radeons and swap them in/out as needed. Save thousands.

The P100 has 720 GB/s. It can’t compete with the bandwidth of a few GeForces in parallel, let alone 10+ of them (one P100 may cost as much as a dozen GPUs).

There’s literally no reason to buy a P100 unless you are OK with wasting money, or you can’t parallelise your problem… i.e., you need the fast links between GPUs.

The P100 is for medium-scale, medium-parallelisable problems, not the high end of highly distributed HPC science.

They result in no truncation in the multiplication. For example, idp4a takes the two 32-bit registers, packed with 8-bit values a = (a.x, a.y, a.z, a.w) and b = (b.x, b.y, b.z, b.w), and computes the dot product, accumulating it into a 32-bit integer: c = a.x*b.x + a.y*b.y + a.z*b.z + a.w*b.w + c. The dot product itself cannot overflow; overflow is only possible when accumulating into c.
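
To make that concrete, here is a minimal sketch using the __dp4a intrinsic (assuming CUDA 8 and an sm_61 device; the kernel and host boilerplate are mine, purely for illustration), showing that the 255*255 lanes are not clamped:

[code]
// Build with: nvcc -arch=sm_61 dp4a_demo.cu
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dp4a_demo(int *out)
{
    unsigned int a = 0xFFFFFFFFu;  // four packed u8 lanes, each 255
    unsigned int b = 0xFFFFFFFFu;  // four packed u8 lanes, each 255
    unsigned int c = 0u;

    // c += a.x*b.x + a.y*b.y + a.z*b.z + a.w*b.w
    //    = 4 * 255 * 255 = 260100, with no per-lane truncation
    c = __dp4a(a, b, c);

    *out = (int)c;
}

int main()
{
    int *d_out = NULL, h_out = 0;
    cudaMalloc(&d_out, sizeof(int));
    dp4a_demo<<<1, 1>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("dp4a result: %d (expect 260100)\n", h_out);
    cudaFree(d_out);
    return 0;
}
[/code]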

That’s true in some cases, but in others the cost of maintenance outweighs the reduced cost of the cards. I would love to pay Amazon to maintain a cluster of these, but look at how much they currently charge for the weaker GRID cards. The cost of the card is usually just one factor.

As far as memory bandwidth goes, you’re assuming that there’s no latency requirement. If a single card simply can’t finish your smallest division of work as fast as it’s required, then higher bandwidth memory does make a difference.

Also, more cards obviously means higher power consumption. It does not scale linearly with flops, so just because you’re spreading the same work over two cards doesn’t mean each will use half the power of one. In many cases the power consumption will still be closer to TDP, so the card doesn’t have to continually clock down the cores while running. When you increase the power consumption of another 1000 cards by 50% on a 230 W card, you had better believe that has to be budgeted for.

Lastly, there are features Tesla has that are critical, namely GPUDirect and NVLink. When we’re talking about large clusters of cards, the data exchange speed becomes imperative, and if you’re stuck on PCIe it can be a huge bottleneck. There’s a reason they spent so much R&D money on this chip.

I agree that not all use cases benefit from teslas, but at the same time you’re underestimating the p100 features.

Thanks, I misunderstood the intent of the instruction. I thought it was four separate MULs and not a dot product.

"GP100 has lots of FP16 units (for deep learning training). GP10x does not.

Conversely, GP10x has lots of Int8 dot product units (for deep learning inference). GP100 does not.

The conspiracy theories about throttling are wrong. The chips are just different, with a different balance of functional units."

At GTC 2016, Jen-Hsun was talking about the GPU Inference Engine (GIE) and the Tesla M4.

I guess Nvidia will have a GP106-based Tesla P4(?) with much higher performance for deep learning inference than the GM206-based Tesla M4, given the changes in the Pascal GP10x architecture.

It’s going to be neat to see what unexpected uses people have for dp4a and dp2a.

It seems it should speed up a very low-precision version of sgemm. I’m not sure whether that’s enough precision for most applications, but at one of the GTC talks an NVIDIA engineer mentioned this instruction in the context of deep learning.
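
As a rough illustration of why it helps there: with the int8 operands pre-packed four to a 32-bit word, the inner product of a low-precision gemm collapses to one __dp4a per four multiply-accumulates. This is only a sketch of the inner loop (the names are mine, with no tiling or shared memory), assuming sm_61 and K a multiple of 4:

[code]
// Each 32-bit word of a_packed / b_packed holds four signed 8-bit values;
// k_div4 is K/4. Accumulation happens in 32-bit integer, as with dp4a itself.
__device__ int dot_int8(const int *a_packed, const int *b_packed, int k_div4)
{
    int acc = 0;
    for (int k = 0; k < k_div4; ++k) {
        acc = __dp4a(a_packed[k], b_packed[k], acc);  // 4 int8 MACs per instruction
    }
    return acc;
}
[/code]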

The AMBER team has published benchmark results for AMBER 16, including preliminary data for the GTX 1080 and the P100 in the DGX-1 appliance:

[url]http://ambermd.org/gpus/benchmarks.htm#Benchmarks[/url]

On GPUs, AMBER uses a combination of single-precision floating-point arithmetic and 64-bit fixed-point arithmetic.
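
For anyone unfamiliar with that scheme, the general idea (this is a sketch of the technique, not AMBER’s actual code; the scale factor and names are made up) is to accumulate single-precision contributions into a 64-bit fixed-point value, so that atomic accumulation is deterministic and doesn’t lose precision the way fp32 atomics would:

[code]
// Sketch of float -> 64-bit fixed-point accumulation on the device.
#include <cuda_runtime.h>

#define FIXED_SCALE ((double)(1ll << 40))  // 40 fractional bits (illustrative)

__device__ void accumulate_force(unsigned long long *acc, float contribution)
{
    // Convert to fixed point and add with an integer atomic; integer addition
    // is associative, so the result is reproducible regardless of thread order.
    long long fixed = (long long)llrint((double)contribution * FIXED_SCALE);
    atomicAdd(acc, (unsigned long long)fixed);  // wraps correctly for negatives
}

__device__ double read_force(const unsigned long long *acc)
{
    // Reinterpret the accumulated two's-complement value and scale back down.
    return (double)(long long)(*acc) / FIXED_SCALE;
}
[/code]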

Overall the GTX 1080 numbers look very good, roughly 2x the performance of the GTX 980. These simulation benchmarks are of particular interest.

Thanks for posting!