Is nvidia forcing SP compute customers into expensive cards? Why is SP Cuda so slow on gtx680? Somet

It seems to be a gtx680sx … everyone remember the 386sx? No math coprocessor. Remember what a failed strategy that was? Nvidia, quit thinking like Adobe and start thinking like the leader you are becoming. You think you are milking a niche market but you will hobble your primary strategy instead.

On paper, the 680 should run faster than the 580 for most single precision kernels, yet it doesn’t, even when compiled with sm_30. It is well known that nvidia kills double precision performance on their consumer cards in order to create a market for very expensive but niche HPC cards. Have they extended that to single precision as well? The specs just don’t support the fact that it runs 30%-50% slower on most sp cuda kernels.
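For reference, the "on paper" comparison can be worked out roughly as below. This is only a sketch: the core counts and clocks (512 cores at a 1544 MHz shader clock for the GTX 580, 1536 cores at a 1006 MHz base clock for the GTX 680) are assumed from commonly published reference specs and are not stated in this thread. It counts one fused multiply-add per core per clock as 2 FLOPs, and the function name is just illustrative.

```python
# Theoretical peak single-precision throughput, assuming one FMA
# (2 FLOPs) per CUDA core per clock. The spec values are assumed
# reference figures, not measurements from this thread.

def peak_sp_gflops(cuda_cores, clock_mhz, flops_per_clock=2):
    """Return theoretical peak SP throughput in GFLOP/s."""
    return cuda_cores * clock_mhz * flops_per_clock / 1000.0

gtx580 = peak_sp_gflops(512, 1544)    # Fermi GF110, shader clock (assumed)
gtx680 = peak_sp_gflops(1536, 1006)   # Kepler GK104, base clock (assumed)

print(f"GTX 580: {gtx580:.0f} GFLOP/s")
print(f"GTX 680: {gtx680:.0f} GFLOP/s")
print(f"paper ratio 680/580: {gtx680 / gtx580:.2f}x")
```

Under those assumptions the 680 has nearly twice the paper throughput of the 580, which is why a 30%-50% slowdown on real kernels is so surprising.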

As a cuda developer for lots of companies, I can say without hesitation that every commercial application I know will jump to AMD if nvidia discontinues the 580 line and forces them to buy teslas at 5x the cost of a 580 to put in their products. They rarely need double precision or all that memory, and they buy a lot of gtx cards. I know, I develop their systems. The benefits of cuda as a language do not justify the unbelievable expense of their HPC cards for most apps.

Will nvidia discontinue the 580 before a serious cuda-capable consumer card is released? nvidia, I hope you are listening. Very bad idea. You will sell more cards and make more money by keeping fast cuda performance in your cheap cards than by playing these stupid games. openCL and DirectCompute give us GPU options. Dumbing down DP was as far as you can stretch it and you should be backing that down, not the opposite. Don’t press your luck.

I am already preparing my clients for this possibility and preparing to port code because the 580 looks redundant. Cuda and nvidia are better, but not THAT much better and I fear that nvidia overvalues itself in this regard. OpenCL and AMD will get a very big boost if my fears are correct, and I will be first in line. I could use some serious reassurance here, nvidia.

Nvidia just released their gaming cards. It remains to be seen what the price of the compute oriented cards will be. Will they lower the price to remain competitive against GCN?

From what I’ve heard it’s becoming less and less profitable to build huge chips with each process node, which means that if they want to sell these 500 mm^2 compute-oriented cards they have to increase the price to make a decent profit.

Either way I’m hoping competition will resolve the differences with AMD over time :-)

Gtx680 compute performance was cut down way farther than anything needed to improve game performance. This was not about shifting die size from compute to tessellation or other game functions. Their hpc business is super profitable and they want more of it. They know that most compute apps do not require double precision. I predict they will discontinue the 580 without a compute-capable replacement. I know several companies that will be forced into extremely expensive solutions as a result. They seem to think their compute customers are insensitive to price, but the only folks affected are those doing commercial products. Most academics do need dp but most commercial apps do not. They think this is like auto cad or adobe where they can charge professionals 5x for a driver. They have misjudged. They are messing with commercial appliances that can and will switch to other hpc solutions. I’m nvidia’s biggest fan. If they can’t win me over on this, then they have a real problem on their hands. I will switch several companies to amd or intel so fast it will take their breath away. Terrible strategy for a company trying to propagate hpc. Sacrifice sound strategy for a short bump in profit. Nvidia has been prone to amateur mistakes in the past, alienating lots of people. They will never be dominant until they quit doing this.

It is very simple. Nvidia’s main revenue is from gaming cards. Their main focus is gaming and the new cards are very good at that. HPC is a different thing and I suspect they will come out with a new set of cards specifically designed for that. They will be expensive for a regular user, but affordable for use in clusters, where on average each card will benefit more than one user.

Have you ever benchmarked your code on a compute capability 2.1 device (GF104/114/etc)? I tried that some months back and was really surprised at how much slower the performance was, even after correcting for the lower memory bandwidth and number of CUDA cores. The smaller L2 cache and poor utilization of the extra 16 CUDA cores per multiprocessor really hurt performance. The GTX 680 sounds like an evolution of that design, so you might be seeing a continuation of that trend.

I look forward to benchmarking our applications on the GTX 680 (whenever our order finally shows up) so I can understand how architectural changes overall affect performance. Based on the documentation, I suspect three major sources of problems for people:

  1. The dramatic shift in multiprocessor resources relative to # of CUDA cores. In some cases, reoptimizing your block size might help mitigate the change, but in other cases, the reduction in # of registers, shared memory, or integer throughput per CUDA core might permanently hurt performance.

  2. The drop in L2 cache overall, and the drop in L1 cache per CUDA core.

  3. Limitations in the compile-time scheduling of instructions resulting in underutilized CUDA cores. Kepler really has put some pressure on the compiler team to deliver the full performance from the hardware, and I would not be surprised if the CUDA 4.2 beta is falling short in this department.
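The first two points above can be made concrete with some per-core arithmetic. All hardware figures here are assumptions taken from published architecture descriptions, not from this thread: a GF110 SM with 32 CUDA cores, 32768 32-bit registers and 64 KB of combined shared memory/L1, with 16 SMs and 768 KB of L2; a GK104 SMX with 192 CUDA cores, 65536 registers and the same 64 KB, with 8 SMXs and 512 KB of L2. The dictionary layout and function name are only illustrative.

```python
# Per-CUDA-core on-chip resources for Fermi GF110 vs Kepler GK104.
# All hardware figures are assumed from published architecture
# descriptions; they are not measurements from this thread.

GPUS = {
    "GF110 (GTX 580)": {"units": 16, "cores_per_unit": 32,
                        "regs_per_unit": 32768, "smem_l1_kb": 64, "l2_kb": 768},
    "GK104 (GTX 680)": {"units": 8, "cores_per_unit": 192,
                        "regs_per_unit": 65536, "smem_l1_kb": 64, "l2_kb": 512},
}

def per_core(gpu):
    """Return registers, shared/L1 KB and L2 KB available per CUDA core."""
    total_cores = gpu["units"] * gpu["cores_per_unit"]
    return {
        "registers": gpu["regs_per_unit"] / gpu["cores_per_unit"],
        "smem_l1_kb": gpu["smem_l1_kb"] / gpu["cores_per_unit"],
        "l2_kb": gpu["l2_kb"] / total_cores,
    }

for name, gpu in GPUS.items():
    r = per_core(gpu)
    print(f"{name}: {r['registers']:.0f} regs/core, "
          f"{r['smem_l1_kb']:.2f} KB smem+L1/core, {r['l2_kb']:.2f} KB L2/core")
```

If those figures are right, each Kepler core has roughly a third of the registers and a sixth of the shared memory/L1 that a Fermi core had, so kernels tuned to Fermi's ratios may lose occupancy or spill.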

Honestly, if NVIDIA had followed the release schedule it had taken with the last 3 major architecture generations, then the GTX 680 we have now would have been branded the GTX 660, and we wouldn’t be complaining because we would have all bought the giant 500 mm^2 flagship Kepler GPU in GeForce or Tesla form. (We might still have been complaining, but because Kepler would have been very late in this alternate universe, just like Fermi.) Instead, NVIDIA shipped the midrange GPU first, and called it a GTX 680 because it beats the GTX 580 at graphics tasks by a wide margin.

This all sucks for us, but I don’t think it indicates that NVIDIA is trying to force all compute developers onto Tesla yet. Weeks after Kepler has been released, the GTX 580 is still the fastest single precision device you can buy, beating all available Tesla cards. If NVIDIA intends to “force” everyone to buy Teslas for single precision compute, first they have to release something faster than the 580. :)

That said, I too am looking into AMD’s GCN architecture now that it looks like GCN and Kepler are converging. So far I’m finding myself in a twisty maze of confusing (and kind of ugly-looking) documentation, so NVIDIA has some time to catch up while I figure out if the grass is actually greener on the other side of the fence. :)

The performance reduction is dramatically worse than that seen between GF110/104. I have benchmarked extensively and tried to mitigate but it’s clear to me that nvidia’s new strategy is to ‘pay for compute’. This isn’t just smaller silicon and it’s clear that GK104 is unsuited to gpgpu even in a larger chip. It’s going to give them a small bump in revenue as a few apps switch to Fermi instead of 580, but is absolutely brain damaged. AMD is about to get a lot more business, I’m going to see to that unless nvidia does some reassuring VERY quickly that consumer cards will get fast compute.

Hmmm. NV built something wonderful for gamers. GTC is just around the corner. Hows’bout we give NV a chance to wow us too?

PS: anyone tried the new DX SDK version of N-body (nBodyCS.exe)? These guys report quite a speedup for the GTX 680: link N-body (near end of page, search on “N-body”). Assuming the calculations are equivalent, why does the DX code path turn the tables so dramatically?

I think you are inferring a strategy based on extremely circumstantial evidence. As it stands, Kepler currently has no serious effect on GPU-compute sales because it doesn’t compete in that space. (Well, no effect other than annoying people who buy a GTX 680 hoping it will be faster than the GTX 580 on compute.) The existence of the GTX 680 doesn’t somehow make the GTX 580 disappear or the Tesla C2090 more appealing. In fact, it could be a great thing if it drives the price of the GTX 580 down in the short term. Nor does the GTX 680 suddenly make the Radeon HD 7970 any more or less appealing than it was before the release. In February, your decision for big and cheap single precision performance was GTX 580 vs. HD 7970, and in April, it still is the same choice.

And I wouldn’t underestimate the die size issue. GK104 is only 294 mm^2, which suggests that NVIDIA has a lot of transistors to play with when making the 500 mm^2 “flagship” (GK100?) version of Kepler. I seriously doubt they are just going to stick us with 12 SMXs and call it good. I have no idea what they will spend those extra transistors on, but I will at least wait until the compute-version of Kepler is released to draw any conclusions. If, at that time, NVIDIA decides to not release a GeForce version of the Kepler compute-happy GPU, then we will know what the strategy is and can react accordingly.
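As a very rough illustration of the die-size point, here is a naive linear scaling estimate. It assumes GK104 has 8 SMXs (a published spec, not stated above) and that SMX count scales linearly with die area. That overstates things, since memory controllers, caches and I/O take area too. The function name is hypothetical.

```python
# Naive area scaling: assume SMX count grows linearly with die area.
# GK104's 8 SMXs is an assumed spec; real chips also spend area on
# memory controllers, caches and I/O, so treat this as a loose
# upper-bound style estimate, not a prediction.

def scaled_units(base_units, base_area_mm2, target_area_mm2):
    """Scale a unit count linearly by die area."""
    return base_units * target_area_mm2 / base_area_mm2

area_ratio = 500 / 294
smx_estimate = scaled_units(8, 294, 500)

print(f"area ratio: {area_ratio:.2f}x")
print(f"naive SMX estimate at 500 mm^2: {smx_estimate:.1f}")
```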

That said, I’m all for more reports from people successfully switching from CUDA to OpenCL with AMD GPUs. I want to know what the pitfalls are before I invest a significant amount of time to find that GCN’s practical throughput is no better than GK104’s.

If the 600 series takes over and the 500 is discontinued, which we should fully expect, then there is no compute-capable card except Tesla. A compute version of Kepler will be a high-margin follow-on to tesla, not a consumer gaming card. This is nearly a done deal; nvidia appears to be removing compute from its consumer line. The 580 is a serious sore thumb in their current lineup and will be discontinued soon.

Fully agree. In my particular case, gtx680 runs about as fast as a tesla 2050, or some 40% slower than gtx580 (all using SP). The 3.0 arch hardly seems to matter. I’m holding my breath for the bona fide cuda release, and maybe a driver update, but it looks pretty grim. The DP performance on gtx680 is horrible. I guess nvidia has decided to bifurcate consumer and hpc lines in earnest. It’s rather sad. Does anyone know how the newest AMD card stacks up to 680gtx in terms of SP performance and local memory size?

Yes, 7970 absolutely destroys the 680 in compute and, frankly, kills the 580 as well in nearly every test. Extremely dark days ahead for nvidia.

So how bad is GK104 at SP really? Has anyone seen any good benchmarks?
Personally I’d like to see some cublas SGEMM benchmarks for various matrix sizes, still haven’t found anything yet.
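For anyone posting SGEMM numbers, a minimal sketch of the usual conversion from a measured time to GFLOP/s may help keep results comparable. It uses the conventional 2*M*N*K operation count for C = A*B. The helper name and the 0.1 s timing are hypothetical placeholders, not a measurement from any card.

```python
# Convert a measured SGEMM time into GFLOP/s using the conventional
# 2*M*N*K operation count (one multiply and one add per inner-product
# term). The timing below is a hypothetical placeholder, not a result.

def sgemm_gflops(m, n, k, seconds):
    """Return achieved GFLOP/s for an (m x k) * (k x n) SGEMM."""
    flops = 2.0 * m * n * k
    return flops / seconds / 1e9

HYPOTHETICAL_SECONDS = 0.1  # placeholder; substitute your own timing
rate = sgemm_gflops(4096, 4096, 4096, HYPOTHETICAL_SECONDS)
print(f"4096^3 SGEMM in {HYPOTHETICAL_SECONDS} s = {rate:.1f} GFLOP/s")
```

Reporting the matrix sizes alongside the rate matters, since small matrices rarely come close to peak.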

Well, dark days for people (myself included) who like high performance hardware at gamer prices. I suspect NVIDIA will do just fine, regardless. Game sites seem to love the GTX 680, and big clusters were going to buy Tesla cards no matter what. Don’t overestimate the importance of our niche in the market. :)

link N-body (near end of page, search on “N-body”, scroll 1 screen down).

NV’s SDK has three versions of the N-body demo: CUDA, OpenCL and DirectCompute (DX). Assuming calculations are equivalent, can anyone guess why the DX code path turns the tables on the 7970 so dramatically? And what’s this talk of L0 caches? Threads in a warp sharing registers? :'(

I work with many companies building appliances and consumer products using gpgpu. nvidia has actively promoted this since the beginning and it seems they are now conceding this market, hoping some of it will migrate to HPC. It won’t. It will go to AMD.

If nvidia ever hopes to grow the hpc market out of laboratories and into commercial products, it had better rethink this strategy.

While I’m also really disappointed by the SP performances of the 680, I have to disagree with your sentence.

We have worked with nvidia since 2008, and from the beginning they have wanted us to promote Quadro/Tesla cards to our customers.

nvidia always tried to promote tesla over gtx to us too because it makes them more money, but they never indicated there was a problem writing cuda for gtx. They were/are fully aware of our business and our reliance on gtx for compute. In fact, they reaffirmed their commitment to cuda performance in their gaming series and actively promoted it in the press. It was their lead press for the 400/500 series. They encouraged cuda for games above all.

Tesla is slower and 5x the price of the 580.

Serious bait-and-switch, nvidia! That’s ok. AMD’s top gaming card has DP performance that smokes tesla and SP performance that makes nvidia laughable, without the sleazy games. We need a partner who is reliable and won’t screw us. I’ve been a dedicated nvidia guy until now. This is just sleazy.

With the SHOC benchmark suite (v1.1.2, not tuned for Kepler), the GTX 680 is considerably slower than the 580 in almost all tests; the exceptions were the sgemm (~10% faster) and spmv_csr_vector (20-50% faster) benchmarks. That IMO doesn’t sound too good, and I doubt that the final version of the compiler with Kepler support will improve this much.

Can I find good GF110/114 Cuda benchmarks somewhere?

I feel your pain, but the AMD world isn’t all flowers and unicorns either.

My biggest beef with AMD right now is the complete absence of any kind of (documented and supported) low-level programming. You write your code in C with a limited number of compiler intrinsics, and, if you’re not happy with the performance you get (or the compiler makes a boo-boo, which happens quite often), you are out of options. There is instruction set documentation, and, like in CUDA, there is an option to instruct the compiler to dump the low-level assembly code (so that you can find out exactly why your program runs 3x slower than you expected), but there’s no legal way to edit and recompile the dump.

Tweaking your C kernel code for hours and reinspecting assembly dumps in hopes that the compiler will finally do the right thing can be entertaining at first, but it eventually gets old.