I’m really glad to see that dynamic parallelism has been left enabled, but I’m not sure what it means that “HyperQ’s MPI functionality” has been removed. Does this mean that HyperQ is gone completely?
For example, the GeForce Titan is rated at 4.5 TFlops in FP32 and 1.3 TFlops in FP64, a ratio of about 1:3.
Does this mean that the GeForce Titan (GK110), like the Tesla K20X, has 2688 FP32 cores and 896 FP64 cores (896:2688 = 1:3)? And that the FP64 cores cannot also execute FP32 work, since otherwise the ratio would be 896:(2688+896) = 1:4?
Or, since none of the specifications mention FP64 cores for the GeForce Titan, does this mean there are only the 2688 FP32 cores, which emulate FP64 at one third the speed?
P.S. On Kepler, can FP32 cores be used to emulate FP64 operations, and conversely, can FP64 cores be used to emulate FP32?
There is no way I’m aware of to “emulate” double-precision operations with single-precision floating-point instructions at only a 1/3 performance penalty. Libraries like dsfun can do something like it (though not quite full double precision), but performance drops to 1/7 of native or worse.
I’m almost certain the Titan has the full double precision units activated, like the K20X. That certainly is what it looks like from the Anandtech benchmarks:
OK, so FP32 cores cannot emulate FP64, but can an FP64 core execute FP32 instructions?
If it can, that would make the ratio 896:(2688+896) = 1:4.
Or can only the FP32 cores or only the FP64 cores be active at any one moment (to avoid overheating)?
I don’t think the FP64 cores can execute FP32 instructions, but both kinds of cores can be active at the same time (remember, Kepler has lots of instructions in flight at once).
According to the reviews, if you enable the full-rate FP64 cores in the driver, the clock rate might be reduced to stay within thermal limits. It isn’t clear how often that downclock actually happens, though. CUDA devices often draw less power when running compute-only programs than when running graphics programs.