Geforce Titan comes with full DP capabilities?

Hi,

I expected the GeForce Titan to come neutered with only 1/24-rate DP capability, but it seems it’s a full-blown K20X:

http://www.tbreak.com/news/nvidia-officially-launches-geforce-gtx-titan-graphics-card/

I wonder if it will have the same compute capability? (CC 3.5)

Anyways, dear dear Nvidia, oh how I like these kinds of surprises! :D

Anandtech has it at 1/3-rate FP64, it would seem:
http://www.anandtech.com/show/6760/nvidias-geforce-gtx-titan-part-1

It’s 1/3 for the K20X as well.

Out of the 192 SPs, 128 can be used for 64-bit DP operations (the other 64 are for ILP found at runtime*).

Since they are 32-bit units, that yields 128/2 = 64 64-bit ops per clock cycle. Hence 64/192 => 1/3.

So it’s no different from the K20X. If you check its numbers you have 1.3 DP and 3.95 SP TFLOPS => 1.3/3.95 ~ 1/3.

In the case of the Titan I just read that it downclocks in DP mode, from ~837 MHz down to 725 MHz.

Hence it’s 4500 SP GFLOPS and 1310 DP GFLOPS.
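The arithmetic above is easy to check. A minimal sketch (peak rate = cores × 2 FLOPs per FMA × clock in GHz, using the core counts and clocks quoted in this thread):

```python
def peak_gflops(cores, clock_ghz, ops_per_cycle=2):
    """Peak throughput in GFLOPS, counting an FMA as 2 FLOPs per core per cycle."""
    return cores * ops_per_cycle * clock_ghz

# Titan / GK110 figures from the discussion above:
sp = peak_gflops(2688, 0.837)  # single precision at the ~837 MHz clock
dp = peak_gflops(896, 0.725)   # double precision at the 725 MHz DP-mode clock

print(round(sp), round(dp))  # ~4500 SP GFLOPS, ~1300 DP GFLOPS
```

Note the DP/SP ratio comes out slightly under 1/3 only because of the DP-mode downclock; at equal clocks it would be exactly 896/2688 = 1/3.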

Anyhow, sweet move Nvidia, very sweet!

*Check good ol’ GF104, which is a less complicated reference.

Right, I’m agreeing with you :)

If you read through the Anandtech article, you will see what has been disabled compared to K20X as far as compute is concerned.

I’m really glad to see that dynamic parallelism has been left enabled, but I’m not sure what it means that “HyperQ’s MPI functionality” has been removed. Does this mean that HyperQ is gone completely?

I would assume that the following has been removed on Titan:

Source of the text above : http://blogs.nvidia.com/2012/08/unleash-legacy-mpi-codes-with-keplers-hyper-q/

Sure sounds more like a driver issue than an actual difference in silicon…

That just means that they won’t support Proxy on the GeForce Titan. See https://devtalk.nvidia.com/default/topic/529136/cuda-programming-and-performance/hyperq-and-mpi/

As it is now, proxy is only supported on a few very special Cray machines.

The Tesla K20X specifications list 2688 FP32 cores and 896 FP64 cores:
http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units#Tesla

But why do the GeForce Titan specifications list only 2688 FP32 cores, with no word about FP64?
http://en.wikipedia.org/wiki/GeForce_700_Series#GeForce_700_.287xx.29_series
http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-titan/specifications

For example, the GeForce Titan has a processing power of 4.5 TFLOPS FP32 and 1.3 TFLOPS FP64. This ratio is about 1:3.

  1. Does this mean that the GeForce Titan (GK110), like the Tesla K20X, has 2688 FP32 cores and 896 FP64 cores? (896:2688 = 1:3) And that the FP64 cores cannot emulate FP32 cores, since otherwise the ratio would be 896:(2688+896) = 1:4?
  2. Or, since none of the specifications mention FP64 cores for the GeForce Titan, does this mean that there are only the 2688 FP32 cores, which emulate FP64 but run 3 times slower?
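For what it’s worth, the two hypothetical ratios from points 1 and 2 work out like this (just the arithmetic, under the assumption of the core counts quoted above):

```python
from fractions import Fraction

fp32_cores, fp64_cores = 2688, 896

# Hypothesis 1: dedicated FP64 units alongside the FP32 cores.
ratio_dedicated = Fraction(fp64_cores, fp32_cores)               # 1/3

# If the FP64 units could *also* run FP32 work, the SP side would
# gain those cores and the DP/SP ratio would drop:
ratio_if_shared = Fraction(fp64_cores, fp32_cores + fp64_cores)  # 1/4

print(ratio_dedicated, ratio_if_shared)  # 1/3 1/4
```

Since the measured ratio is ~1:3 and not 1:4, the numbers are consistent with dedicated FP64 units that do not contribute to FP32 throughput.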

P.S. On Kepler, can FP32 cores be used to emulate FP64 cores, and conversely, FP64 cores to emulate FP32 cores?

There is no way I’m aware of to “emulate” double precision operations with single precision floating point instructions at only a 1/3 performance penalty. Libraries like dsfun can do something like it (though not quite full double precision), but the slowdown is a factor of 7 or more.
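To illustrate why the penalty is so steep: the “double-single” trick dsfun uses represents each value as a hi/lo pair of singles, and each add costs on the order of ten single-precision operations. A toy sketch in NumPy (this is not dsfun’s actual code, just the standard error-free-transformation idea it is built on):

```python
import numpy as np

f32 = np.float32  # single precision, as on the GPU's SP units

def two_sum(a, b):
    # Knuth's error-free transformation: s + e == a + b exactly
    s = f32(a + b)
    v = f32(s - a)
    e = f32(f32(a - f32(s - v)) + f32(b - v))
    return s, e

def ds_from_double(x):
    # split a Python double into a (hi, lo) pair of singles
    hi = f32(x)
    lo = f32(x - float(hi))
    return hi, lo

def ds_add(x, y):
    # simplified double-single addition: ~10 SP ops per "double" add
    s, e = two_sum(x[0], y[0])
    e = f32(e + f32(x[1] + y[1]))
    return two_sum(s, e)

# Accumulate 1000 * 1e-8 onto 1.0: plain float32 loses the increments
# entirely, while the double-single pair keeps ~48 bits of significand.
acc = ds_from_double(1.0)
naive = f32(1.0)
for _ in range(1000):
    acc = ds_add(acc, ds_from_double(1e-8))
    naive = f32(naive + f32(1e-8))

print(float(naive))                      # still 1.0
print(float(acc[0]) + float(acc[1]))     # ~1.00001
```

Counting the SP operations per double-single add makes the ~7x-or-worse slowdown unsurprising, and the result still falls short of true 53-bit double precision.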

I’m almost certain the Titan has the full double precision units activated, like the K20X. That certainly is what it looks like from the Anandtech benchmarks:

http://www.anandtech.com/show/6774/nvidias-geforce-gtx-titan-part-2-titans-performance-unveiled/3

(The DGEMM benchmark makes this pretty clear.)

Ok, FP32 cores cannot emulate FP64, but can an FP64 core emulate an FP32 core?
If it can, that would mean the ratio is 896:(2688+896) = 1:4.
Or can only FP32 cores or only FP64 cores be active at any one moment (to avoid overheating)?

I don’t think the FP64 cores can execute FP32 instructions, but both kinds of cores can be active at the same time (remember, there are lots of instructions in flight at once on Kepler).

According to the reviews, if you enable the FP64 cores in the driver, the clock rate might be reduced to meet the thermal limits. It isn’t clear how often the downclock happens, though. CUDA devices often draw less power in general when running compute-only programs, as compared to graphics programs.