You can measure performance per $ or performance per Watt, and you compare for the application you need to do your job. So your title is BS and using caps is impolite.

Agreed. Compare using the magic of multiplication. :)

  • Equal power: Is a Titan better than 3.5 Ivy Bridge processors for your application?

  • Equal cost: Is a Titan better than 3 or 4 Ivy Bridge processors for your application?

Plus, in some clusters 1 hour of 1 cpu core is billed the same as 1 hour of gpu.
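
The two bullet comparisons above boil down to one multiplication each. Here is a minimal Python sketch, assuming you have measured throughput for your own application on one GPU and on one CPU. The function name `gpu_wins` and the throughput figures are hypothetical placeholders; only the CPU counts (3.5 for equal power, 3 or 4 for equal cost) come from this thread.

```python
def gpu_wins(gpu_rate, cpu_rate, n_cpus):
    """Return True if one GPU out-runs n_cpus CPUs on your application.

    gpu_rate and cpu_rate are throughputs you measured yourself, in the
    same unit (e.g. jobs per hour). Assumes perfect scaling across the
    CPUs, which flatters the CPU side.
    """
    return gpu_rate > n_cpus * cpu_rate


# Placeholder measurements, NOT real benchmark results.
gpu_rate = 10.0  # hypothetical: jobs per hour on one Titan
cpu_rate = 2.0   # hypothetical: jobs per hour on one Ivy Bridge CPU

print("equal power:", gpu_wins(gpu_rate, cpu_rate, 3.5))
print("equal cost: ", gpu_wins(gpu_rate, cpu_rate, 4))
```

Plug in your own measured rates; the answer can easily flip from one application to the next.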

Tell you what, you get together the equivalent amount of CPUs to equal the Titan in Wattage, and you and I can have a competition to see who has the faster code.

We can use three benchmarks for testing: sorting 100 million floats, multiplying large dense matrices (let's say 10,000 x 10,000 dense floats), and a graph algorithm like BFS or Floyd-Warshall.

We can have a Google hangout and see who can run the faster correct code. What do you say SkybuK?

I will note that the GTX Titan is rated at 250W and that it includes 6 GB of high-speed memory along with the processor:

I already have a code which relies heavily on FFT. I ran tests with matrices up to 800x800 against an MPI version, and 1 Titan 2070 was equivalent to 60-80 cores. Taking 12 cores per CPU (AMD CPUs), that means about 6-7 CPUs at 90 W per CPU, and that is before you consider the electricity spent by the InfiniBand as well.
I think both performance per Watt and performance per dollar are better on the Titan. I can do my job much better even with a limited number of GPUs at my disposal, and most of my work is now done with CUDA. I still have programs which run on the CPU; it just depends on the problem. I am still using my MPI CPU code for problems which do not fit in 6 GB of RAM.
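
The power arithmetic in this post can be written out as a rough Python sketch. The core counts, 12 cores per CPU and 90 W per CPU are taken from the post; the 250 W GPU figure is an assumption borrowed from the GTX Titan rating quoted earlier in the thread, and the helper name `cpu_side_power` is made up for illustration. Note that 60 cores rounds to 5 CPUs, slightly below the "6-7" quoted, and that InfiniBand power is not counted.

```python
import math


def cpu_side_power(cores, cores_per_cpu=12, watts_per_cpu=90):
    """Return (whole CPUs needed, total CPU watts) for a given core count."""
    cpus = math.ceil(cores / cores_per_cpu)
    return cpus, cpus * watts_per_cpu


GPU_WATTS = 250  # assumption: GTX Titan board rating quoted earlier in the thread

for cores in (60, 80):
    cpus, watts = cpu_side_power(cores)
    ratio = watts / GPU_WATTS
    print(f"{cores} cores -> {cpus} CPUs, {watts} W, {ratio:.1f}x the GPU rating")
```

This only compares rated power, not measured draw, so treat the ratio as a ballpark.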

Also, while I know most on here do not play PC games, keep in mind that the Titan is really awesome for that purpose.

So in other words you can do at least 1.2 teraflops Sgemm() with the Titan, bitcoin mine at 380-450 MHS, then play Skyrim or FarCry 3 on maximum settings and maximum resolution with a FPS>60.

SkyBuk, try playing FarCry3 on a CPU, and let me know how that goes.

The cited SGEMM performance number for GTX Titan looks much too low, but I don’t have one to do a quick test. Could you double check your data, please?


That was just a minimum guess for the Titan, as my current PC has a GTX 680 and Tesla K20c.

The GTX 680 Sgemm() is at about 1 teraflop and the K20c is about 1.3 teraflops for Sgemm().

Since I know that the Titan and the K20 have similar performance I low-balled the teraflop estimate by using my worst-case K20 numbers.

Those are low numbers because I have an old motherboard and am only getting half the bandwidth (PCI-e 2.0 x8), but that will be replaced soon.

The above numbers are from the CUDA 5.0 SDK cuBLAS MatrixMult sample.

If I use the CUDA-Z utility I get 2.2 teraflops for the K20c in single precision, and 1.24 teraflops for the K20c in double precision.

For the GTX 680, using the CUDA-Z utility, I get 1.98 teraflops for single precision and 200 Gflops for double precision. The 680 is used for games and video out, so I never need its double-precision capability.

The CUDA SDK sample apps are generally not designed for benchmarking purposes.

The host platform shouldn’t matter when measuring CUBLAS GEMM performance (my workstation here is five years old, obviously limited to PCI-e 2). I have a Tesla K20c. Using CUDA 5.5 and with dimensions of m=n=k=8192 I see the following SGEMM performance:

2650 GFLOPS transpose_a=N transpose_b=N (0.415 sec)
2670 GFLOPS transpose_a=N transpose_b=T (0.412 sec)
2180 GFLOPS transpose_a=T transpose_b=N (0.497 sec)
2210 GFLOPS transpose_a=T transpose_b=T (0.496 sec)

The above execution times are for the CUBLAS call followed by cudaThreadSynchronize(), as seen by the host code (i.e. all data is resident on the GPU and there are no copies). The GFLOPS numbers are based on the standard count of floating-point operations as 2*M*N*K for GEMM. Obviously the performance will vary considerably with the three dimensions, but for performance comparisons it is customary to state the performance for large, square matrices, as I have done here. I do not have CUDA 5.0 ready to try but would be surprised if the performance is much different from CUDA 5.5 for the cases above.
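
For anyone who wants to convert their own timings, here is a small Python sketch of the GFLOPS calculation described above, using the standard 2*M*N*K operation count. The function name `gemm_gflops` is hypothetical; the dimensions and the 0.415 sec timing are the ones from the N/N case above, and the result lands close to the 2650 GFLOPS quoted.

```python
def gemm_gflops(m, n, k, seconds):
    """GFLOPS for a GEMM of size m x n x k that took `seconds` to run.

    Uses the conventional count of 2*m*n*k floating-point operations
    (one multiply and one add per inner-product term).
    """
    flops = 2 * m * n * k
    return flops / seconds / 1e9


# N/N case from the post: m = n = k = 8192, 0.415 sec
print(f"{gemm_gflops(8192, 8192, 8192, 0.415):.0f} GFLOPS")
```

The timing has to cover only the GEMM call plus the synchronize, with the data already on the GPU, or the number will be dominated by transfers.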

Here is a benchmark brief stating SGEMM and DGEMM performance for the K20X. The numbers seem to jibe with the performance I am measuring on the K20c (which is not quite as powerful as the K20X):

Right now we are experiencing a MAJOR HEATWAVE in The Netherlands, probably the worst one I have ever experienced. I suspect hot air from the Middle East and hot air from North America via sea currents, two factors coming together to form a big heatwave.

It’s now clear that the Antec 1200 case, with 3 inlet and 3 outlet fans + 1 from the power supply, cannot cool an 85 watt AMD X2 3800+ (DUAL CORE) running at 2.0 GHz per core. It can run at 2.0 GHz but then it would fry the motherboard. The Winfast motherboard has temperature sensors and will shut down the entire computer if the temperature goes over 50 degrees Celsius. Outside and inside my apartment it’s now 28.5 degrees Celsius according to my shitty clock on my desktop. I really have to go buy a better temperature meter lol… then again… the weather forecast says more or less the same thing… real temperature might be 30 degrees or so. I might call my mom later too to verify, but she is not in the city, where it’s a few degrees hotter, but ok.

Back to the story… I had to underclock the CPU and I also decided to underclock the GPU just to be on the safe side. Only then could the system temperature remain far below 50 degrees Celsius.
Right now it balances around 40 degrees Celsius. The GPU still gets hot, up to 52 degrees or so… and this is a GT 520… passively cooled, probably the lowest-wattage DX 11 GPU… something like 30 watts or so? And even this thing runs hot?! Weird.

Anyway… I don’t believe for a second that the Titan will run at 300 watts continuously… and even if it did I do not believe that it can be cooled properly by any air-cooled case under these heatwave conditions.

So at some point one must call BS on the whole thing… NVIDIA might as well create a 10,000 WATT GPU and claim that it’s so much faster than the CPU… but in my mind… it plays no role if it cannot be cooled properly.

A slight point of critique too: the Haswell processor is out already… I guess the CUDA C programming guide was released before that processor came out… but it’s the latest from Intel as well and probably a better candidate to compare against the TITAN.

And I shall end this post with a funny note:

“A dead Titan is NO GOOD TITAN”

and another one:

“A fried Titan is NO GOOD TITAN”

and a last one:

“A fried PC makes a BAD TITAN” :)

(Great… it’s just started raining, with a little bit of a thunderstorm… looks like it’s gonna be a good long rain shower… temperature already down 1.5 degrees. Quite an odd experience… feeling my cheeks burn from heat… while cold air blows against them… the air is still hot… :) but getting cooler ;) wind’s picking up… good thing too… )

Whoops, still getting used to this new forum… pressed the wrong button, quote instead of edit, to correct a typo.

Obvious troll.
If not trolling:

I’ve had 8 gpus running the same program in the one box, continuously, for 48 hours.

You are an obvious troll if you had 8 titans running in one box, your house would be on fire lol.

I am running FFT code on a Tesla K20 with a 10000x10000 matrix, using almost all 5 GB of the GPU RAM. The power consumption shown by nvidia-smi is 137 W. I do not think there is any real (useful) CUDA program which would push the power consumption to the max (i.e. use the GPU 100%).

Please tell me more about your set-up. I want to build a cuda computing server for our lab.

Will send a pic tomorrow if you really want. It’s actually 7 titans and one K20X. Runs at about 40C

Yes, pics please!! (And machine specs too, if possible)

Hey sorry was away. I’m just about to swap the K20X out for a titan (titan actually performs better for my current code). Will upload a pic when it’s all nice and implemented :)

It’s not cheap; 8-GPU motherboards are hard to come by. Further, I needed to maximize P2P transfer rates, so it’s using the Romley architecture. This is the Tyan barebones (motherboard, case, fans, PSU) that it’s currently using:
PLENTY of cooling (when it turns on it’s about 100 dB, idles at about 60 dB!!!). It was a bit difficult to set up, and there were definitely a few teething problems (it needs 3 power sockets, ideally from 3 different power boards, and must use the proprietary onboard graphics; we were unable to get output from any of the Titans).

On top of that:
2* Intel Xeon E5-2620 2.00 GHz 15MB Cache 7.20GT/sec LGA 2011 Six Core Processor
8* 4GB 240-Pin DDR3 SDRAM ECC Registered DDR3 1600 MHz Server Memory
1* 2TB 7200RPM 64MB CACHE 3.5IN SATA Enterprise Class HDD


Very nice, but why not SSD drive(s)? At least for the operating system?
It is amazing what a difference they make.

Any chance for a pic?