Is NVIDIA forcing SP compute customers into expensive cards? Why is SP CUDA so slow on the GTX 680?

Fortunately, I’m very used to the same thing with CUDA. While the 4.1 compiler truly rocks, previous compiler generations were littered with really stupid bugs. We spent man-weeks finding many of them and proving they were true compiler bugs. Yuck.

AMD has a very unified CPU/GPU approach to the ecosystem, which will be great in the end, but they are certainly behind when it comes to libraries, API, unified address space, etc. We rarely use canned libraries; it’s always a custom kernel, so libs and such are less important to us.

Still, it’s the basics that matter most. Speed, cost, and good-enough dev tools. AMD, here we come… Forcing Tesla on us is WAY over the line.

I believe single precision n-body code will be very fast on the GTX 680 because it has a large number of SFUs, which can be used to calculate 1/sqrt(x).
I’d love to see the GK110 have double-precision SFUs.
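For context, here is a minimal sketch of the body-body interaction at the heart of a single-precision n-body kernel (illustrative names, not the SDK sample); the rsqrtf() intrinsic is what ends up on the SFUs:

```cpp
// Body-body interaction for a single-precision n-body kernel (sketch).
// bi/bj hold position in .x/.y/.z and mass in .w; names are made up.
__device__ float3 body_body_accel(float3 ai, float4 bi, float4 bj, float softening2)
{
    float3 r;
    r.x = bj.x - bi.x;
    r.y = bj.y - bi.y;
    r.z = bj.z - bi.z;

    float dist2 = r.x * r.x + r.y * r.y + r.z * r.z + softening2;

    // rsqrtf() maps to the SFU reciprocal-square-root instruction;
    // cube it to get 1/d^3 for the gravitational force.
    float inv_dist  = rsqrtf(dist2);
    float inv_dist3 = inv_dist * inv_dist * inv_dist;

    float s = bj.w * inv_dist3;   // scale by the mass of body j
    ai.x += r.x * s;
    ai.y += r.y * s;
    ai.z += r.z * s;
    return ai;
}
```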

You miss my point. My whole point is that CUDA does not have the same problem, because, in CUDA, if the compiler does not work the way you want, you hack the PTX file directly and you’re done. In AMD OpenCL, there’s nothing you can do, your hands are tied.
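For anyone unfamiliar with that workflow, a rough sketch of what "hack the PTX" looks like in practice, assuming the kernel was compiled separately with `nvcc -ptx` and hand-edited; the file and kernel names here are made up:

```cpp
// 1. nvcc -ptx kernel.cu -o kernel.ptx
// 2. hand-edit kernel.ptx
// 3. load the edited PTX at runtime through the driver API (sketch below).
#include <cuda.h>
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>

int main()
{
    cuInit(0);
    CUdevice  dev;  cuDeviceGet(&dev, 0);
    CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);

    // Read the hand-edited PTX into memory.
    std::ifstream f("kernel.ptx");
    std::stringstream ss;  ss << f.rdbuf();
    std::string ptx = ss.str();

    // JIT the PTX and fetch the kernel by name ("my_kernel" is hypothetical).
    CUmodule   mod;
    CUfunction kfn;
    cuModuleLoadData(&mod, ptx.c_str());
    cuModuleGetFunction(&kfn, mod, "my_kernel");

    printf("loaded hand-edited PTX\n");
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```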

I got your point. Dealing with immature compilers sucks. If I have to go down to compiled code, life is bad. I avoid doing that, successfully so far. I don’t really value that feature.

I’m also very curious to see how my code runs on the GTX 680. I’m holding out for the 4GB FTW edition from EVGA before I order though.

When nVidia replaced the GTX 260 with the GTX 460 it looked on paper like it would be an improvement but for some reason I only ever got half the expected performance. I couldn’t justify the cost or power consumption of the GTX 480. I looked very seriously at AMD and even got as far as porting my application to OpenCL (but not testing it). In the end nVidia redeemed themselves by releasing the GTX 470. When the 500 series came out it seemed prudent to skip the GTX 560 and go straight for the GTX 570 which has performed brilliantly.

On paper the GTX 680 looks very impressive. My application is normally limited by texture fill rate and memory bandwidth. I can’t wait to see if I can actually hit those limits on the GTX 680.

We look with horror at the increasing crippling of plain NVIDIA boards. We have used GPUs in our products for quite some years now, and we will not recommend that our clients buy Tesla boards. They are simply too “Enterprise” priced.
Amongst other libraries we use MAGMA http://icl.cs.utk.edu/magma/news/news.html?id=289 for linear algebra, and we were quite happy to see that AMD is also supported now.
NVIDIA could use some competition, which will hopefully lead to “unlocking” the full potential of a standard GTX board.

Why, for instance, is SLI reserved for gamers and “locked” for CUDA developers? We sure could use the extra data transfer rate between boards.

We’ve asked about this in the past, and the comment from one of the NVIDIA employees was that the SLI bridge is not a high speed data link between boards. We didn’t get any more info about what it actually does, but basically, PCI-Express is much faster as a direct GPU-to-GPU link.

The SDK sample was not supposed to be used as a benchmark. It spawns only 10,000 threads, which is not enough to keep a high-end GPU busy.

I pointed to the nbody demo (DX-SDK version) because of this reported result:

I’m still wondering why the GTX 680 scores so much higher than the HD 7970 (and the GTX 580). Special Function Units (SFUs) and a whole lotta 1/sqrt(x) may be the case, but I suspect something else.

Could it be that they purposely hobble CUDA/OpenCL kernels, but not shaders, for fear of hurting games?

It’s just a broken benchmark. It measures performance at low occupancy.

Just ran a few tests with CUFFT, and on 1D complex-to-complex the GTX 680 is 5-10% slower than the GTX 580. I tried sizes between 256 and 8192, powers of 2.
It’s in line with what I’ve seen on my own kernels.
No increase in memory bandwidth sucks, though. AMD has 30% more bandwidth now.
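For reference, a rough sketch of that kind of CUFFT timing run: batched 1D complex-to-complex transforms at a power-of-two size. The batch count and repetition count are my own choices, not necessarily the setup used above; link with -lcufft.

```cpp
#include <cufft.h>
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const int n     = 4096;     // transform size (the tests above used 256..8192)
    const int batch = 1024;     // many transforms per call so the GPU is actually busy
    const int reps  = 100;

    cufftComplex* d_data = nullptr;
    cudaMalloc((void**)&d_data, sizeof(cufftComplex) * n * batch);

    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, batch);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time `reps` batched forward transforms, in place.
    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i)
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%d x %d-point C2C: %.3f ms per batch\n", batch, n, ms / reps);

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}
```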

On the other hand, PCI Express 3.0 works very, very well :-)
I pretty much get the peak :-o which is twice as fast as gen 2.
Those limited by PCIe speed will see a nice improvement.

-Guillaume

I was wondering about this. What motherboard, CPU and RAM configuration do you have, and can you post the output from bandwidthTest?
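In case it helps compare numbers, here is a minimal pinned-memory host-to-device bandwidth check along the lines of what bandwidthTest measures; the buffer size and repetition count are arbitrary.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t bytes = 256 << 20;          // 256 MiB transfer
    const int    reps  = 10;

    float* h = nullptr;
    float* d = nullptr;
    cudaMallocHost((void**)&h, bytes);       // pinned host memory for full PCIe speed
    cudaMalloc((void**)&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time `reps` host-to-device copies from pinned memory.
    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i)
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gbps = (double)bytes * reps / (ms * 1e-3) / 1e9;
    printf("Host->Device: %.1f GB/s\n", gbps);

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```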

Yeah, the dedicated SLI link is not useful for anything relating to CUDA. The cudaPeer* functionality is basically using other SLI hardware on the chip for CUDA.
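For completeness, the cudaPeer* path over PCI Express looks roughly like this; device IDs and the transfer size are illustrative.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t bytes = 64 << 20;   // 64 MiB, arbitrary test size
    float *d0 = nullptr, *d1 = nullptr;

    cudaSetDevice(0);
    cudaMalloc((void**)&d0, bytes);

    cudaSetDevice(1);
    cudaMalloc((void**)&d1, bytes);

    // Ask whether device 1 can directly address device 0's memory.
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 1, 0);
    if (canAccess)
        cudaDeviceEnablePeerAccess(0, 0);   // enable access from device 1 to device 0

    // GPU-to-GPU copy: direct over PCIe when peer access is available,
    // otherwise staged through host memory.
    cudaMemcpyPeer(d1, 1, d0, 0, bytes);

    cudaFree(d1);
    cudaSetDevice(0);
    cudaFree(d0);
    printf("peer copy done, canAccess=%d\n", canAccess);
    return 0;
}
```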

The schedule from GTC 2012 has a talk “Inside Kepler” listed with the abstract:

“In this talk, individuals from the GPU architecture and CUDA software groups will dive into the features of the compute architecture for “Kepler” – NVIDIA’s new 7-billion transistor GPU. From the reorganized processing cores with new instructions and processing capabilities, to an improved memory system with faster atomic processing and low-overhead ECC, we will explore how the Kepler GPU achieves world leading performance and efficiency, and how it enables wholly new types of parallel problems to be solved.”

7 billion transistors is even more than I expected! Hopefully those of you going to GTC will let us know how the compute-version of Kepler is supposed to work.

Are there already guidelines on how to optimize kernels for the GTX 680?
Things I read so far:

In short, we need a Kepler tuning guide… or is it available already somewhere?

Edit: Found the integer op throughput in the Programming Guide, section 5.4.1: shifts, compares, and type conversions each have only ~1/4 of the effective throughput on the GTX 680 compared to the GTX 580 (half as much per SMX, and 8 SMXs vs. 16 SMs).
So it might be beneficial to replace bit shifts with multiplies?
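If it’s of any use, here is the kind of swap being suggested, sketched on a made-up indexing kernel; whether the multiply actually beats the shift on GK104 is exactly what would need measuring.

```cpp
// Made-up kernel: each thread sums one 16-float row. The only point of
// interest is the address math, where a shift is replaced by an
// equivalent multiply as suggested above.
__global__ void scale_rows(const float* __restrict__ in, float* out, int rows)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= rows) return;

    // int base = i << 4;      // shift version of the row offset
    int base = i * 16;          // multiply version of the same address math

    float sum = 0.0f;
    for (int k = 0; k < 16; ++k)
        sum += in[base + k];
    out[i] = sum;
}
```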

I believe this may be the latest version: http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf

That updated PDF is a relief. The Kepler throughput numbers now look great… except for int32 mul/shift!

Not sure why that PDF didn’t show up in the latest 4.2 release.

This is a great article about Kepler and NVIDIA’s likely strategy:

http://www.realworld…32212172023&p=1

Unfortunately, I have to agree with others in this thread that the time of getting top-notch compute performance out of gaming cards is probably over. The first sign of this was when NVIDIA artificially throttled DP performance on the Fermi gaming cards. That was the first bitter pill. Now we are just waiting for the other shoe to drop with Kepler.

Unfortunately for us, the compute crowd, there is a substantial difference between graphics workloads and compute workloads. Graphics needs little communication between threads while compute often needs a lot. By increasing performance for single threads and decreasing performance on communication between threads, NVIDIA has managed to increase graphics performance while cutting down compute performance, leaving an open space for expensive, compute-tuned cards in the future.
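As a concrete example of the inter-thread communication compute code depends on, here is a generic shared-memory block reduction; it is not tied to any benchmark in this thread, just the pattern graphics shaders rarely need.

```cpp
// Block-level sum reduction: threads cooperate through shared memory and
// barriers. Assumes blockDim.x is a power of two and that the kernel is
// launched with blockDim.x * sizeof(float) bytes of dynamic shared memory.
__global__ void block_sum(const float* in, float* block_sums, int n)
{
    extern __shared__ float s[];              // one float per thread in the block

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;
    s[tid]  = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory; every step is thread-to-thread traffic.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        block_sums[blockIdx.x] = s[0];
}
```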

One indication of things to come is the branding of the current chip. The card has the GTX x80 name traditionally reserved for the flagship graphics product, while the chip itself has the name traditionally reserved for a pared-down version. It seems pretty certain that NVIDIA does not intend to use the GTX brand for any bigger versions of the chip. And their non-gaming targeted products have always been much more expensive.

@RogerDahl: While I have to agree with you on most of your points, I’m still unsure how NVIDIA will address the compute market.
Do you really think they can afford to release a “limited edition” compute-specific chip (say a GK110) that would not be in any GeForce?
That would mean: small production (a few thousand units?) → high production price → small profits?