Double precision performance

I’m looking for the best hardware setup for [mostly double-precision] number crunching. The GTX Titan and the HD 7970 are the candidates. I prefer the NVIDIA drivers and tools, and the Titan has 6 GB of memory, but it’s mighty expensive and the 7970 seems to have it covered in most benchmarks. However, dgemm performance is very important to my application, and from what I’ve gathered the 7970 manages only ~600 GFLOPS while the Titan does about twice that. I wonder whether that matches your experience. Maybe the peak-performance figures are misleading and one has to look at the whole curve. I also wonder how much of the difference comes from the libraries (cuBLAS/clBLAS).

I’m sorry if the above is a bit incoherent; any insight into double-precision workloads [dgemm-heavy or not] is appreciated.
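For context, the theoretical FP64 peaks behind those benchmark numbers can be worked out from the published core counts and clocks. A back-of-the-envelope sketch; the clock figures and the 1/3 and 1/4 FP64 rates are taken from public spec sheets, so treat them as assumptions:

```python
# Rough theoretical FP64 peak: cores x clock x 2 FLOPs per cycle (FMA),
# scaled by the card's FP64 throughput rate.

def fp64_peak_gflops(cores, clock_ghz, fp64_rate):
    """Peak FP64 GFLOPS = cores * clock * 2 (fused multiply-add) * rate."""
    return cores * clock_ghz * 2 * fp64_rate

# GTX Titan: 2688 CUDA cores @ ~0.837 GHz, FP64 at 1/3 rate (when enabled)
titan = fp64_peak_gflops(2688, 0.837, 1 / 3)   # ~1500 GFLOPS

# HD 7970: 2048 stream processors @ 0.925 GHz, FP64 at 1/4 rate
hd7970 = fp64_peak_gflops(2048, 0.925, 1 / 4)  # ~947 GFLOPS

print(round(titan), round(hd7970))  # 1500 947
```

If those peaks are right, a ~600 GFLOPS dgemm on the 7970 would be only about 63% of peak, which would point at the library rather than the silicon.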

I cannot speak to the Titan, but I get 1.165 TFLOPS for 64-bit double-precision dgemm on a Tesla K20c. My understanding is that the Titan at least matches that number.

At least half of my code uses 64-bit numbers, and my performance seems even better than the supposed peak numbers.
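For anyone wanting to reproduce a figure like that: a dgemm GFLOPS number is conventionally computed as 2·n³ floating-point operations over wall time. A minimal CPU-side sketch with NumPy (the same bookkeeping applies when timing `cublasDgemm` on the GPU, though there you would also need to synchronize before stopping the clock):

```python
import time
import numpy as np

def dgemm_gflops(n, repeats=3):
    """Time an n x n float64 matrix multiply and return GFLOPS.

    dgemm performs 2*n^3 floating-point operations
    (n^3 multiplies and n^3 adds)."""
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    a @ b  # warm-up so one-time setup is not counted
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        a @ b
        best = min(best, time.perf_counter() - t0)
    return 2 * n**3 / best / 1e9

print(f"{dgemm_gflops(1024):.1f} GFLOPS")
```

Taking the best of several runs filters out scheduler noise; on a GPU you would also exclude the host-to-device transfer if you only want the kernel rate.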

Are there linear algebra libraries for OpenCL which are as fast and reliable as cuBLAS and MAGMA?

Given the Titan’s performance, I actually think it is reasonably priced. I do not know what figures you are looking at, but according to PassMark the Titan handily beats the 7970:

http://www.videocardbenchmark.net/high_end_gpus.html

I don’t know. Maybe clBLAS is of lower quality. But if it’s clBLAS that’s slow, then performance can improve in the future; if it’s the 7970 itself, it won’t.

This is where dgemm performance looks good:
http://www.anandtech.com/show/6774/nvidias-geforce-gtx-titan-part-2-titans-performance-unveiled/3

A less favorable set of benchmarks:
http://www.tomshardware.com/reviews/geforce-gtx-titan-performance-review,3442-10.html

And this one is terrible, but it’s OpenCL vs. OpenCL, not OpenCL vs. CUDA:
http://compubench.com/compare.jsp?config_0=11905561&config_1=14470292

With full double-precision performance enabled, the Titan is 20% faster than the Tesla K20 for the double-precision cuFFT-based code I am running.