I have some questions about the exact theoretical single/double-precision performance of the Titan Black: what single- and double-precision FLOPS are expected when the “double precision switch” is on and when it is off, respectively? It appears that the “switch” also affects single-precision performance (inversely). Is that only because of thermal throttling of the core clock, or should the switch not affect FP32 performance at all?
I see that for Titan Black:
- The FP64-to-FP32 ratios are 1:3 and 1:24 when the "switch" is On/Off respectively;
- There are 2880 single-precision CUDA cores. According to the GK110/GK210 White Paper, there is one double-precision unit for every three single-precision cores, so there are 960 of them;
- If we use the base core clock of 889 MHz and the boost clock of 980 MHz, the expected (FMA) FP32 and FP64 FLOPS would be
  ```
  FP32 = 889 * 2880 * 2 = 5 120 640 MFLOPS = 5.12064 TFLOPS (base)
         980 * 2880 * 2 = 5.6448 TFLOPS (boost)
  FP64 = 889 * 960 * 2  = 1 706 880 MFLOPS = 1.70688 TFLOPS (base)
         980 * 960 * 2  = 1.8816 TFLOPS (boost)
  ```
  which match the commonly reported numbers.
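Spelling the arithmetic out as a quick sketch (core counts and clocks as quoted above; peak FLOPS = clock × units × 2 ops per FMA):

```python
def theoretical_tflops(clock_mhz, units, ops_per_cycle=2):
    """Peak TFLOPS assuming one FMA (= 2 floating-point ops) per unit per cycle."""
    return clock_mhz * 1e6 * units * ops_per_cycle / 1e12

# FP32: 2880 CUDA cores; FP64: 960 DP units (the 1:3 ratio)
print(theoretical_tflops(889, 2880))  # ≈ 5.12 TFLOPS, base FP32
print(theoretical_tflops(980, 2880))  # ≈ 5.64 TFLOPS, boost FP32
print(theoretical_tflops(889, 960))   # ≈ 1.71 TFLOPS, base FP64
print(theoretical_tflops(980, 960))   # ≈ 1.88 TFLOPS, boost FP64
```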
However, the information above does not resolve all my confusion. Since the “switch” affects single-precision operations too, are those expected FLOPS values for DP mode or for non-DP mode?
- There was a hardware review article describing the Titan Black as behaving like a K40 when the "switch" is on, and only having gaming performance comparable to the 780 Ti when it is off (I cannot find the article now);
- I heard that the DP units cannot perform single-precision operations;
- NVIDIA Control Panel on Windows says turning on the "switch" will reduce the performance of non-CUDA programs, e.g. games. I understand this as "non-double-precision operations will be slower" rather than literally "non-CUDA" (which would include OpenCL);
- A self-written OpenCL script using single-precision floats is slightly faster when the "switch" is off (the single-precision CUDA version of the script was broken at the time, so it was not tested);
- Another self-written OpenCL script, whose kernel only does `dst[i] = src[i]` (a test to time cached read/write operations rather than peak bandwidth), is also slightly faster when the "switch" is off, with both single and double floats: the measured caching speed was 236.12 GB/s for both precisions with the "switch" on, versus 237.12 GB/s with it off (GB meaning gigabyte, not gibibyte);
- The CUDA version of the `dst[i] = src[i]` script shows no difference whether the "switch" is on or off (tested only with doubles): it was always 239.16 GB/s. Perhaps nvcc optimised the kernel into a plain memcpy?
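For reference, this is how I convert the kernel's timing into the GB/s numbers above (a sketch; the element count and elapsed time in the example are made up, only the 2x read-plus-write accounting matters):

```python
def copy_bandwidth_gbps(n_elements, bytes_per_element, elapsed_s):
    """Effective bandwidth of dst[i] = src[i]: each element is read once
    and written once, so twice the buffer size crosses the memory bus."""
    return 2 * n_elements * bytes_per_element / elapsed_s / 1e9

# Hypothetical example: 2**27 doubles (1 GiB buffer) copied in 9.0 ms
print(copy_bandwidth_gbps(2**27, 8, 9.0e-3))  # ≈ 238.6 GB/s
```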
I currently have three “guesses”:
- The key is the core clock. When the "switch" is off, the double-precision units are all turned off, and double-precision arithmetic runs on the single-precision cores at the 1:24 rate. Doing so generates less heat, so the cores can clock higher, and non-double-precision operations (including double-float assignment) end up slightly faster than with the "switch" on. If this is true, are there expected core clock values I can use to calculate the theoretical FLOPS for each switch position?
- The limitation is somewhere else. Turning off the "switch" eases pressure on other units such as the instruction caches, warp schedulers, or dispatch units. If this is true, again, are there hard limits that I can factor into the theoretical FLOPS calculation for DP/non-DP modes?
- "Scooby-Doo", double-precision cores can do single-precision arithmetic, but not as fast as single-precision cores. The "switch" actually turns the double-precision cores into single-prevision cores. In case this is true: how do I calculate their FLOPS?