Fastest card for CUDA?

NVBrian · September 24, 2008, 3:23pm

I did some CUDA programming several months ago, but I am not up to date on the newest GPUs. What is currently the fastest card for CUDA? Are there any benchmarks anywhere?

Last time I looked, the 9800 GTX was probably the fastest. What about the new GTX 260 and GTX 280 cards?

kristleifur · September 24, 2008, 3:30pm

Definitely the new cards. GTX 260 has good bang for buck, GTX 280 somewhat better if the extra $ don’t matter. There has been some discussion on the forum - check out the latest threads.

MisterAnderson42 · September 24, 2008, 3:58pm

Fastest depends on your application of course. My memory bound applications run much faster on the original 8800 GTX than on the 9800 GTX :)

Of course, the GTX 280 is faster than all previous cards in every way. If you wait for the Tesla 1000 series, it will be slightly faster (same core as GTX 280, but higher clocks and more memory.)

E.D_Riedijk · September 24, 2008, 5:35pm

But the Teslas have lower memory bandwidth, so for people who do not need more than 1 GB, I suspect a GTX 280 is the fastest option.

MisterAnderson42 · September 24, 2008, 5:48pm

I just double checked this from the NVIDIA specs page:

GTX 280:

Processor Clock (MHz) 1296 MHz

Memory Bandwidth (GB/sec) 141.7

Tesla C1060:

Frequency of processor cores 1.296 GHz

Memory Bandwidth 102 GB/sec

So I guess I was wrong about the clocks here too…

It is the Tesla S1070 that has the boosted clock:

Frequency of processor cores 1.44 GHz

So yeah, roughly 28% less memory bandwidth on the Tesla 1000 series (the GTX 280 has about 39% more) = a bad thing for memory bound apps. Oh well. I guess there is a price for that 4 GiB of memory.
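For reference, the relative difference can be worked out from the two bandwidth figures quoted above. Which percentage you get depends on which card you take as the baseline. This is only a small arithmetic sketch; the variable names are made up for illustration.

```python
# Memory bandwidth figures quoted from the spec pages above, in GB/s.
gtx280_bw = 141.7
tesla_bw = 102.0

# Shortfall of the Tesla, measured against the GTX 280.
tesla_deficit = (gtx280_bw - tesla_bw) / gtx280_bw

# Advantage of the GTX 280, measured against the Tesla.
gtx280_advantage = (gtx280_bw - tesla_bw) / tesla_bw

print(f"Tesla has {tesla_deficit:.1%} less bandwidth than the GTX 280")
print(f"GTX 280 has {gtx280_advantage:.1%} more bandwidth than the Tesla")
```

For a purely memory bound kernel, the second ratio is a rough upper bound on how much faster the GTX 280 could run, assuming everything else is equal.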

mascarpone · September 25, 2008, 3:54am

It seems counterintuitive!

E.D_Riedijk · September 25, 2008, 5:06am

Not really. NVIDIA says all the time that they run the memory on Teslas below its rated spec.

alex_dubinsky · September 25, 2008, 6:48am

It’s to get fewer bit errors. If you’re paranoid about your production code (or actually notice the errors), you can downclock gaming cards for more stability.

kristleifur · September 25, 2008, 12:18pm

Now that is Rather Interesting. I suspect that I’ve seen a few bit errors, though it could just as well be my weird code. Have you tried this yourself, or do you have any links?

MisterAnderson42 · September 25, 2008, 2:32pm

I’ve only ever seen something that seems like bit errors in an overheating GPU. I’ve never had any other evidence for them on any consumer or Tesla card even for week+ long runs.

But I think the bit error issue is not the reason here. NVIDIA lowered the clock on Tesla C870 to reduce bit errors, too, but the memory bandwidth suffered only a few percent.

I’m no DRAM expert, but I do remember something from computer architecture class about DRAM taking longer to address and read when you’ve got a larger total memory area. That could account for the roughly 28% lower bandwidth, and is what I was implying when I mentioned there was “a price to pay for having that 4 GiB of memory”.

E.D_Riedijk · September 25, 2008, 4:19pm

Hmm, then they have not been telling the truth at NVISION. But I can imagine that is indeed the case.

nasacort · September 25, 2008, 4:53pm

I agree, especially because the new cards have much better memory coalescing capability. For example, I used to access a 3D array (of float4) the usual way on an 8800 GT:

variable accessed by thread (i,j,k) = array[Nx*(Ny*(k-1) + (j-1)) + (i-1)],

so that everything was coalesced according to the old rules. Now, I can access the array on a GTX280 using indirect addressing:

variable accessed by thread (i,j,k) = new_array[ pointer_array[Nx*(Ny*(k-1) + (j-1)) + (i-1)] - 1],

which breaks the old coalescing rule. Surprisingly, not only do I not get any speed penalty for doing so, but I actually get a small (0.5%) speedup! At the same time, my memory usage went down by 45% because I now only have to allocate the much smaller, condensed array ‘new_array’.
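The two access patterns can be sketched on the host side as follows. This is only an illustration of the addressing, not of GPU performance or coalescing. The grid size, the fill values, and the way `new_array` and `pointer_array` get built are all assumptions made up for the example; only the index formulas come from the post.

```python
def flat_index(i, j, k, nx, ny):
    """Flat offset for 1-based (i, j, k), with i varying fastest."""
    return nx * (ny * (k - 1) + (j - 1)) + (i - 1)

# Hypothetical tiny grid.
nx, ny, nz = 4, 3, 2

# Dense layout: one entry per cell. The value depends only on k here,
# so most entries are duplicates (an assumption chosen for the demo).
dense = [0.0] * (nx * ny * nz)
for k in range(1, nz + 1):
    for j in range(1, ny + 1):
        for i in range(1, nx + 1):
            dense[flat_index(i, j, k, nx, ny)] = 10.0 * k

# Condensed layout: store each distinct value once in new_array and keep
# a 1-based slot number per cell in pointer_array.
new_array = []
slot_of = {}
pointer_array = []
for value in dense:
    if value not in slot_of:
        new_array.append(value)
        slot_of[value] = len(new_array)  # 1-based, matching the "- 1" below
    pointer_array.append(slot_of[value])

def read_direct(i, j, k):
    return dense[flat_index(i, j, k, nx, ny)]

def read_indirect(i, j, k):
    return new_array[pointer_array[flat_index(i, j, k, nx, ny)] - 1]

print(len(dense), len(new_array))
print(read_direct(2, 3, 2), read_indirect(2, 3, 2))
```

Neighbouring threads still read consecutive entries of `pointer_array`; it is only the second lookup into `new_array` that can land anywhere.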

alex_dubinsky · September 25, 2008, 5:48pm

You only really “see” bit errors when you’re getting millions of them. That’s why they’re not such a big concern. But if you have the time, write a memory burn-in kernel to specifically look for them. It’d be very interesting.

You mean the memory bandwidth drops by much more than the decrease in frequency alone would explain? Yes, that has an explanation. When you start adding multiple chips to the same address lines, the signals take longer to stabilize (data lines are always point-to-point, so they’re fine). The solution is to wait. Normally you have idle time anyway as the DDR sends you data in bursts, but if they really pushed the limits, that time wouldn’t be enough and the waiting (what you see as 1T, 2T addressing in a PC BIOS) becomes longer than the slack. You really want to be careful with the waiting, because an error in the addressing step is equivalent to a hundred bit errors. It’s possible NVIDIA originally wanted there to be no stall time, but then had to change things when parts weren’t working.

I’d really love to see someone do independent testing to see if the Teslas are any more reliable than the GeForces, or if they might even be worse.

MisterAnderson42 · September 26, 2008, 1:37pm

Not when I run the same deterministic simulation every time and diff the binary output files, even between different hardware. Plus, any random bit error is likely to send a particle in my simulation flying outside the box or cause an infinite loop in the kernel, both of which will show up as a crashed application.
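A check along these lines could look like the sketch below: compare the binary outputs of two runs byte for byte, or compare digests when the files live on different machines. The file names and contents are stand-ins invented for the demo, not anything from an actual simulation.

```python
import filecmp
import hashlib
import os
import tempfile

def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large outputs fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def outputs_match(path_a, path_b):
    """Byte-for-byte comparison of two output files."""
    return filecmp.cmp(path_a, path_b, shallow=False)

# Demo with stand-in files: two identical "runs" and one with a single
# flipped bit, as a bit error in the output would produce.
with tempfile.TemporaryDirectory() as tmp:
    run_a = os.path.join(tmp, "run_a.bin")
    run_b = os.path.join(tmp, "run_b.bin")
    run_c = os.path.join(tmp, "run_c.bin")

    payload = bytes(range(256)) * 4
    corrupted = bytearray(payload)
    corrupted[100] ^= 0x01  # flip one bit

    for path, data in ((run_a, payload), (run_b, payload), (run_c, bytes(corrupted))):
        with open(path, "wb") as f:
            f.write(data)

    same_ab = outputs_match(run_a, run_b)
    same_ac = outputs_match(run_a, run_c)
    digests_equal = file_digest(run_a) == file_digest(run_b)

print(same_ab, same_ac, digests_equal)
```

This only catches errors that reach the output files, and it relies on the simulation being bit-for-bit deterministic in the first place.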

alex_dubinsky · September 26, 2008, 6:57pm

That’s pretty reassuring. But keep in mind that you may not be using 100% of your memory and, in the cases where you’re not diffing, a bit error is likely to just change a float’s mantissa slightly and won’t cause an out-of-box particle.
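To illustrate the point about the mantissa: a single-precision float has 23 mantissa bits out of 32, so a random single-bit flip most often lands there. The sketch below flips individual bits of a float32 on the host. The value 0.75 and the helper name `flip_bit` are just assumptions for the demo.

```python
import struct

def flip_bit(value, bit):
    """Return value as a float32 with one bit of its IEEE 754 encoding flipped."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))
    return flipped

x = 0.75  # e.g. a particle coordinate inside a unit box

low_mantissa = flip_bit(x, 0)    # least significant mantissa bit
high_mantissa = flip_bit(x, 22)  # most significant mantissa bit
exponent = flip_bit(x, 30)       # most significant exponent bit

print(low_mantissa - x)  # tiny change, easy to miss
print(high_mantissa)     # noticeable, but still inside the box
print(exponent)          # enormous value, would show up as a crash
```

Only flips in the exponent (or the sign) reliably throw a particle out of the box; flips in the low mantissa bits change the result without any visible failure.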

A burn-in program would be very interesting (and useful!), maybe I’ll work on one. Besides checking memory, it’d also stress-test ALUs. What are your ideas for hitting as many parts of the GPU as possible and causing the most heat?
