Theoretical SP/DP GFLOPS of Titan Black when DP mode On/Off?

Hi everyone,

I have some questions about the exact theoretical single/double-precision performance of the Titan Black: what single- and double-precision FLOPS should be expected when the “double precision switch” is on, and when it is off? It appears that the “switch” affects single-precision performance as well (inversely). Is that only because of thermal throttling of the core clock, or should the switch not affect FP32 performance at all?

I see that for Titan Black:

  1. The FP64-to-FP32 ratios are 1:3 and 1:24 when the "switch" is On/Off respectively;
  2. There are 2880 single-precision CUDA cores, and according to the GK110/GK210 white paper the number of double-precision units is one third of the number of single-precision cores (64 vs. 192 per SMX), so there are 960 of them;
  3. If we use the base core clock of 889 MHz and the boost clock of 980 MHz, the expected (FMA) FP32 and FP64 FLOPS would be
     ```
     FP32 = 889 MHz * 2880 cores * 2 = 5,120,640 MFLOPS = 5.12064 TFLOPS (base)
     FP32 = 980 MHz * 2880 cores * 2 = 5.6448 TFLOPS (boost)
     FP64 = 889 MHz *  960 units * 2 = 1,706,880 MFLOPS = 1.70688 TFLOPS (base)
     FP64 = 980 MHz *  960 units * 2 = 1.8816 TFLOPS (boost)
     ```
     which are the numbers most commonly reported (a small program reproducing this calculation is sketched below).

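To double-check the arithmetic, here is a minimal CUDA host-side sketch that computes the same theoretical peaks from the properties the runtime reports. It assumes the GK110 layout from the white paper (192 SP cores and 64 DP units per SMX); I am not certain whether `clockRate` reflects the base clock, the boost clock, or the current DP-mode clock, so treat the absolute numbers with care.

```cpp
// Minimal sketch: theoretical FMA FLOPS from queried device properties.
// Assumes the GK110 layout (192 FP32 cores / 64 FP64 units per SMX); whether
// prop.clockRate reports base, boost, or the current DP-mode clock is unverified.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int sp_cores_per_sm = 192;          // GK110/GK210 white paper
    const int dp_units_per_sm = 64;           // 1:3 of the SP cores
    double clock_ghz = prop.clockRate * 1e-6; // clockRate is in kHz

    double fp32_tflops = clock_ghz * prop.multiProcessorCount * sp_cores_per_sm * 2 / 1e3;
    double fp64_tflops = clock_ghz * prop.multiProcessorCount * dp_units_per_sm * 2 / 1e3;

    printf("%s: %d SMX @ %.0f MHz\n", prop.name, prop.multiProcessorCount, clock_ghz * 1000);
    printf("theoretical FP32: %.3f TFLOPS (FMA)\n", fp32_tflops);
    printf("theoretical FP64: %.3f TFLOPS (FMA, 1:3 mode)\n", fp64_tflops);
    return 0;
}
```
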
However, the information above does not resolve all my confusion. Since the “switch” affects single-precision operations too, are those expected FLOPS values for DP mode or for non-DP mode?

  1. A hardware review article described the Titan Black as behaving like a K40 when the "switch" is on, and as having gaming performance comparable to the 780 Ti only when the "switch" is off (I cannot find the article now);
  2. I heard that the DP units cannot do single precision operations;
  3. The NVIDIA Control Panel on Windows says turning on the "switch" will reduce the performance of non-CUDA programs, e.g. games - I read this as "non-double-precision operations will be slower" rather than literally "non-CUDA", e.g. OpenCL;
  4. A self-written OpenCL script using single-precision floats is slightly faster when the "switch" is off (the single-precision CUDA version of the script was broken at the time, so it was not tested);
  5. Another self-written OpenCL script, whose kernel only does "dst[i] = src[i]" (intended to time cached read/write operations rather than raw bandwidth), is also slightly faster when the "switch" is off, with both single and double floats - with the "switch" on, the measured speed was 236.12 GB/s for both single and double, versus 237.12 GB/s with the "switch" off (GB meaning gigabyte, not gibibyte);
  6. The CUDA version of the "dst[i] = src[i]" script shows no difference whether the "switch" is on or off (tested only with double floats) - it was always 239.16 GB/s. Maybe nvcc nicely optimised it into a plain memcopy? (A sketch of this kind of copy test is included after this list.)

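For reference, this is roughly what the "dst[i] = src[i]" test looks like in CUDA. It is an illustrative reconstruction rather than the exact script I ran; the buffer size and repetition count are arbitrary, and error checking is omitted.

```cpp
// Minimal sketch of a "dst[i] = src[i]" copy-throughput test (illustrative only).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void copy_kernel(double *dst, const double *src, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];
}

int main() {
    const size_t n = 1 << 26;   // 64M doubles = 512 MiB per buffer (arbitrary)
    const int reps = 100;
    double *src, *dst;
    cudaMalloc(&src, n * sizeof(double));
    cudaMalloc(&dst, n * sizeof(double));

    dim3 block(256), grid((unsigned)((n + 255) / 256));
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    copy_kernel<<<grid, block>>>(dst, src, n);   // warm-up
    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        copy_kernel<<<grid, block>>>(dst, src, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    // One read plus one write per element; GB = 1e9 bytes.
    double gbps = 2.0 * n * sizeof(double) * reps / (ms * 1e-3) / 1e9;
    printf("effective copy rate: %.2f GB/s\n", gbps);
    return 0;
}
```
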
I currently have three “guesses”:

  1. The key is the core clock. When the "switch" is off, the double-precision units are all turned off, and the single-precision cores handle double-precision arithmetic, which operates at 1:24. However, this generates less heat, so the cores can run faster, and non-double-precision operations (including double-float assignment) will be slightly faster than when the "switch" is on. If this is true, are there expected core-clock values I can use to calculate the theoretical FLOPS with the "switch" on/off?
  2. The limitation is somewhere else. Turning off the "switch" relieves pressure on other units such as the instruction caches/warp schedulers/dispatch units. If this is true, again, are there hard limits I can factor into the theoretical FLOPS calculation for DP/non-DP modes?
  3. "Scooby-Doo", double-precision cores can do single-precision arithmetic, but not as fast as single-precision cores. The "switch" actually turns the double-precision cores into single-prevision cores. In case this is true: how do I calculate their FLOPS?

Cheers

Turning on the switch causes the machine to behave as if all DP units are active. (1:3)
Turning the switch off causes the machine to behave as if only a subset of DP units are active (1:24)

In addition, IIRC, turning the switch on reduces the maximum core clock. I don’t remember what the clocks are with the switch on/off; google is your friend, or careful use of nvidia-smi may yield the info.

I believe this explains all the observations. SP cores don’t do DP arithmetic or vice-versa.

See note 5 here, which reflects that clocks are higher in “SP mode” (i.e. 1:24 mode):

[url]https://en.wikipedia.org/wiki/GeForce_700_series[/url]

Thank you, Robert! One reason I tried to find a document for the Titan Black clock rates (but could not find one anywhere) is that my card is an OEM model whose clock rates differ from the reference model. But I guess what is actually measured matters more, so I will live with this.

I think the 889/980 numbers are correct for the reference Titan Black in non-DP mode:

[url]https://www.geforce.com/hardware/desktop-gpus/geforce-gtx-titan-black/specifications[/url]

I think they are lower in DP mode.

What they are for your card I don’t know. For a specific OEM model, best to consult with the manufacturer of the card.

Hi,

I know I probably should start a new thread, but since you are talking about the Titan Black specifically, I would like to ask a stupid question on this topic. I looked through many threads on the forum and searched the web for relevant information, but couldn’t find an answer.

I recently purchased a second-hand Titan Black and installed it in my Windows 10 machine, which already had Visual Studio 2019 Community and CUDA 10.1 Update 1 installed. I was hoping to use the DP performance for programming, but I couldn’t find any option to put the Titan Black in DP mode, as tttins had mentioned. Any hint on what I should do would be very helpful.

After some digging, I did figure out how to put it in TCC mode; I was fortunate to have integrated graphics, otherwise it would not have worked. For that I used “nvidia-smi -g 0 -fdm 1”, where I had to force TCC mode.

Another note is that the Titan Black only reports two GPU operation modes, “All On” and “Low_DP”, and I set it to “All On” with “nvidia-smi --gom=0”. Is the “All On” mode the same as the DP mode? According to CUDA-Z, my DP performance is still low.

Thanks!

SL

On linux it is done in nvidia-settings (basically the linux GPU control panel). On windows it is done through the nvidia display driver control panel.

There is a setting to change. Based on the article linked below, I think it should be in “Manage 3D settings”

I’m pretty sure when you change it, the control panel will prompt you to reboot. It is not done with nvidia-smi. I don’t remember exactly what setting it is. It might be the “All On” setting. If you changed the setting, but didn’t reboot, that could explain things. Even if the control panel doesn’t prompt you to reboot, I would reboot anyway, after making a change to the setting.

This shows a picture of the control panel setting the way it looked “back in the day” for the GTX Titan (not Titan Black, but I doubt the Titan Black control panel method would have looked any different):

[url]https://www.anandtech.com/show/6760/nvidias-geforce-gtx-titan-part-1/4[/url]

I don’t know how it would look today on a win10 driver - but I would check under “Manage 3D settings” first.

Changes made in “nvidia-settings” appear to take effect immediately, while “nvidia-smi --gom” stays pending. I am on Ubuntu 18.04 with CUDA 10.1. Keeping the GPU busy while monitoring the clock speed shows that toggling the “CUDA - Double precision” option in “nvidia-settings” switches the current clock speed between 980 and 1058 MHz in real time.
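
For anyone who wants to watch the clock programmatically instead of eyeballing nvidia-smi or nvidia-settings, here is a minimal NVML polling sketch along the lines of what I did (illustrative only; the device index, poll interval, and duration are arbitrary):

```cpp
// Minimal NVML sketch: poll the current SM clock once per second (illustrative).
// Build (roughly): nvcc clock_watch.cu -o clock_watch -lnvidia-ml
#include <cstdio>
#include <unistd.h>
#include <nvml.h>

int main() {
    nvmlDevice_t dev;
    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);    // GPU 0

    for (int i = 0; i < 30; ++i) {          // poll for ~30 seconds
        unsigned int sm_mhz = 0;
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm_mhz);
        printf("current SM clock: %u MHz\n", sm_mhz);
        sleep(1);
    }
    nvmlShutdown();
    return 0;
}
```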

Robert, thanks for the quick reply.

I reset the Titan Black to WDDM mode in order to access the NVIDIA Control Panel. After a reboot, I was able to take a screenshot of the NVIDIA Control Panel, and there is no “CUDA - Double precision” option like the one shown on the anandtech site.

For your information, the following is the output from running nvidia-smi.exe.

[s]Then the current win10 windows driver may have dropped support for that mode switch capability.

The mode setting should survive reboots/power cycles. Therefore if you wanted to set up a win7 system with an old driver that would have been “current” when Titan Black was released, you may have better luck at switching the mode. You can then move the card back to your win10 system. Alternatively something similar may be possible by switching from win10 → linux → win10.[/s]


Windows does not seem to require a reboot after the NVIDIA Control Panel change. I am trying to insert some snapshots, not sure whether it will work. My system is very new, set up last week: latest Windows 10, latest NVIDIA driver (but no CUDA on Windows). So the option is just there in the NVIDIA Control Panel, if the images actually show up.

[screenshots: CUDA-Z before and after applying the DP option]

The two CUDA-Z snapshots were captured before/after the DP option was selected and applied, without a reboot. You can see that they show 1:3 and 1:24 respectively.

Cheers

Annotation 2019-07-21 195114.png

Annotation 2019-07-21 200132.png

Great, seems like you have it all sorted. Thanks for documenting it for others. Just ignore all my comments in this thread.
[s]I assume you meant 1:3 and 1:24[/s]

What (windows) driver version are you using?

Yeah, just spotted that. It is 5 a.m. and I am so tired I feel drunk.

Version 431.36, GeForce Game Ready Driver. This DP option has been in the Windows driver for a while; I remember it being there about a year ago as well.

Thanks for the information, Robert & tttins,

I updated the default video driver bundled with CUDA 10.1 U1 (425.25) to the latest driver (431.36), but I still couldn’t find the double-precision option in the NVIDIA Control Panel. So I think my Titan Black may not be genuine. I have heard the rumor that people flash a GTX 780 Ti into a Titan Black, or maybe I still mis-configured something.

I contacted the vendor, and will send it back for an exchange. I will post what I find out eventually. But I don’t want to complain too much, because it is a very old product after all.

Good luck, Shenggang. I love the Titan Black for its DP and bandwidth so much that I want to buy a few more (mine is also second-hand), except that as a non-miner I cannot easily install multiple GPUs.

You could try a second-hand Titan Z as well; it is essentially a dual Titan, about 3000 RMB on Taobao, which is roughly twice the price and performance of a Titan Black. But your code will then need to be multi-GPU capable, of course. On eBay a second-hand Titan Z is a bit more expensive.

BTW, the 780 Ti’s stock cooler says “GTX 780Ti” and the Titan Black’s says “GTX Titan”, while third-party coolers cost a lot and differ between the 780 and the Titan because of the different PCBs. I think it is not easy to find a cheap cooler for a Titan Black, let alone a stock one (I know because mine came without one - it was liquid-cooled before). So the appearance of a Titan Black should be quite distinctive - if it has a cooler.

Thanks, tttins! I finally received the Titan Black, and now the option to turn double precision on is there in the NVIDIA Control Panel with the recently released driver (431.36), so I guess the option has always been there, but some of the old Titan Blacks floating around may be modified 780 Tis. I didn’t pay much attention to the look of the previous card, but I remember not seeing any indication of a 780 Ti.

I completely agree with you that the early Titan products are the most cost-effective for scientific computing and the like without going to the most expensive Tesla/Quadro products. When the Titan V was released I considered getting one, but most of the programs I work with are still not well-accelerated by GPUs, so the early Titan products are good enough for testing for now. Besides the Titan Z, used Quadro K6000/Tesla K40 cards are also available on Taobao, which may be better thanks to their larger memory.

It shouldn’t be difficult to have more than one GPU in a single system, as the X299/X399 boards all have sufficient lanes, though of course one would need a large power supply. That is the direction I intend to move in.

Also, do you worry about the lack of ECC on the Titan products for production runs?

How nice!

To be honest, I don’t really appreciate the importance of ECC, perhaps because I don’t fully understand it. I only understand that non-ECC memory can occasionally flip a ‘0’ or ‘1’, whatever the probability. Perhaps because of my application, I have never noticed such an incident on my Titan Black - and I probably couldn’t notice when it actually happens. My work involves iterative refinement: the refinement cycles are what actually cost time, while a verifying calculation is very fast for me. To be fair, though, my major calculations are done on a cluster (where ECC is used), while my desktop is for development only.

The used first-generation Titans are cheap now only because their gaming performance is merely on par with a GTX 1060, whose used price is similar. The next such candidate, arriving more than 4 years later, is the Radeon VII.

You may already know this: Tesla cards require ‘Above 4G decoding’, a function not available on all motherboards. I learned this only after buying a $150 used Intel Xeon Phi with nowhere to install it. Since you mentioned looking at X399: all MSI & Gigabyte X399 motherboards have this option, while the ASRock and ASUS X399 boards appear not to. This is almost like building a mining rig, power supply included.

Yes, I could have simply returned the Titan Black for a refund, but I figured it was a bargain, so I decided to try again to see if I could get a working one. Our company used to have a lot of C2050s, but few scientific applications could actually use them over the years, so for the last 2 years we have only purchased nodes with advanced CPUs. Things are slowly changing, though, especially as we are now considering machine-learning workloads.

I guess ECC is not that critical in most cases, and I know people do use GeForce products for machine learning. Of course, GPUs in high-performance servers are almost always Tesla products, and some of the scientific applications I have access to claim to support only the more advanced Tesla products. I once tried to put one of the C2050s in my desktop PC, but it failed to boot, and from your information using Tesla products in a PC is tricky. I have two motherboards supporting Xeon E3, but I am not sure whether they support Tesla or not.

Are you interested in the Radeon VII for OpenCL development? It is a great product for DP compute if your applications support OpenCL. I actually purchased a Radeon VII recently, but I haven’t done much with it yet. CUDA is still more popular for scientific applications.

Although I appreciate the openness and portability of OpenCL (I managed to run my program on an iGPU as well), CUDA is indeed better supported and more popular.

Well, I work with GPUs on only one project, and its program is self-written and simple. On the GPU, the program essentially only does

for { # Loop for some thousand times, parallel for some thousands of A's
    A=FFT(A)*B
    A=IFFT(A)*C
}
return A

which is convolution via FFT and element-wise multiplication, where the convolution kernel and the main matrix have identical sizes. The performance is also memory-bound. I am using existing FFT libraries, cuFFT and clFFT. Tested on a Tesla P100 (4.7 DP TFLOPS, 732 GB/s bandwidth) and a Titan Black (1.7 DP TFLOPS, 336 GB/s), my program is ~20% and ~15% faster respectively using OpenCL. The time saved comes from the reduced caching that clFFT allows for my application. (A rough cuFFT sketch of one iteration is below.)
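
For concreteness, one iteration of that loop in the CUDA version looks roughly like the sketch below. This is illustrative only: the names plan, A, B, C, and n are placeholders, error checking is omitted, and it assumes a single in-place size-n Z2Z transform (cuFFT's inverse is unnormalised, so the 1/n factor is folded into the second multiply).

```cpp
// Illustrative sketch of one FFT-convolution iteration with cuFFT (double complex).
// Build (roughly): nvcc -c conv_iter.cu   (link the full program with -lcufft)
#include <cuComplex.h>
#include <cufft.h>

__global__ void pointwise_mul(cufftDoubleComplex *a, const cufftDoubleComplex *b,
                              size_t n, double scale) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        cuDoubleComplex p = cuCmul(a[i], b[i]);
        a[i] = make_cuDoubleComplex(scale * cuCreal(p), scale * cuCimag(p));
    }
}

// A, B, C: device arrays of length n (B in the frequency domain, C in the
// spatial domain); plan: a size-n Z2Z cufftPlan1d/cufftPlanMany handle.
void iterate(cufftHandle plan, cufftDoubleComplex *A, const cufftDoubleComplex *B,
             const cufftDoubleComplex *C, size_t n) {
    dim3 block(256), grid((unsigned)((n + 255) / 256));

    cufftExecZ2Z(plan, A, A, CUFFT_FORWARD);           // A = FFT(A)
    pointwise_mul<<<grid, block>>>(A, B, n, 1.0);      // A = A .* B
    cufftExecZ2Z(plan, A, A, CUFFT_INVERSE);           // A = IFFT(A), unnormalised
    pointwise_mul<<<grid, block>>>(A, C, n, 1.0 / n);  // A = A .* C, with the 1/n scale
}
```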

So yeah… Radeon VII (3.5 DP TFLOPS, 1024 GB/s bandwidth with its HBM2 memory) will suit my needs best right now. Also for gaming propose on my personal desktop, because a video game (Nier: Automata) I purchased long time ago has well-known unsolvable problems with NVIDIA 780(Ti) and Titan (Black), so I have never played it. Radeon VII is sort of aiming at the same market as Titan, being a flagship and good for gaming, production and computation, while affordable (to me, as an individual).

About Tesla and servers, I remember that NVIDIA’s policies forbid the use of GeForce products on servers, since a year or two ago. This should not affects PC or workstations, like any so-called “servers” only shared by a few groups/people within a company or a research group in a university/lab, I think.

Machine learning (deep learning) is always advertised for new GeForce products, and AMD advertises half-precision performance too. 1:2 DP is only expected on NVIDIA Tesla and AMD FirePro/Instinct, while 2:1 HP is a must on new generations. In fact, the errors of my experimental measures are always too large to tell any difference between SP and DP calculations, but for scientific purpose and self-consistency…