Titan V slower than 1080ti tensorflow:18.08-py3 and 396.54 drivers

I am putting together another developer box and just got 2 Titan-V. I swapped them out in a machine with 2x 1080ti while waiting for the rest of the components to arrive.

I was surprised that performance was lower on my own work, so I went to reproduce it using NGC containers and standard examples. I’m using the latest nvidia-docker and a fresh pull of tensorflow:18.08-py3. The system is running Ubuntu 18.04 with 396.54 drivers.

In particular, the /workspace/nvidia-example/biglstm shows it easily.

2x Titan V, shows a wps of 10878 (averaged over several hundred iterations).

2x GTX1080ti, shows a wps of 12475 (averaged)

Which didn’t really make sense to me, but it is quite repeatable in my case.

I also noticed other anomalous performance, for example in the plain cifar10 tutorial in TensorFlow. The 1080ti’s scale consistently, going from 1 GPU to significantly faster with 2x GPU (about 1.37 per 100 iterations).

However, the Titan V is significantly faster using a single GPU (about 0.68 per 100 iterations), and significantly slower using 2x Titan V (about 1.27 per 100 iterations). The 2x Titan V is faster than 2x 1080ti, but it is slower than 1x Titan V. (weird)
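To put the numbers above side by side, here is a small arithmetic sketch. It assumes the cifar10 figures are seconds per 100 iterations (lower is better), which the post implies but does not state. The function and variable names are illustrative only.

```python
def relative_change(new, baseline):
    """Fractional change of `new` relative to `baseline` (negative means lower)."""
    return (new - baseline) / baseline

# biglstm throughput in words per second (higher is better), as posted.
wps_titan_v_x2 = 10878
wps_1080ti_x2 = 12475

# cifar10 figures per 100 iterations, as posted.
# Assumption: these are seconds, so lower is better.
t_titan_v_x1 = 0.68
t_titan_v_x2 = 1.27
t_1080ti_x2 = 1.37

wps_change = relative_change(wps_titan_v_x2, wps_1080ti_x2)
slowdown_2x_vs_1x = t_titan_v_x2 / t_titan_v_x1
ratio_vs_1080ti = t_titan_v_x2 / t_1080ti_x2

print(f"biglstm: 2x Titan V vs 2x 1080ti: {wps_change:+.1%}")
print(f"cifar10: 2x Titan V takes {slowdown_2x_vs_1x:.2f}x the time of 1x Titan V")
print(f"cifar10: 2x Titan V takes {ratio_vs_1080ti:.2f}x the time of 2x 1080ti")
```

With the posted numbers this works out to roughly 13% lower biglstm throughput on the Titan V pair, and nearly twice the time per 100 cifar10 iterations on two Titan Vs compared with one.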

I monitored all this using nvidia-smi dmon and I even touched the cards to see which were warming up just to be certain.
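For anyone who wants to log the same readings from a script rather than watch nvidia-smi dmon, a minimal sketch follows. It uses nvidia-smi's `--query-gpu` CSV output instead of dmon; the helper names and the sample input are illustrative and not from the original post.

```python
import subprocess

# Fields passed to nvidia-smi --query-gpu.
QUERY_FIELDS = ["index", "clocks.sm", "power.draw", "temperature.gpu"]

def parse_gpu_samples(csv_text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    samples = []
    for line in csv_text.strip().splitlines():
        index, sm_mhz, power_w, temp_c = [field.strip() for field in line.split(",")]
        samples.append({
            "index": int(index),
            "sm_mhz": int(sm_mhz),
            "power_w": float(power_w),
            "temp_c": int(temp_c),
        })
    return samples

def query_gpus():
    """Run nvidia-smi once and return one sample per GPU (needs an NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_samples(out)

# Illustrative input only, not measurements from this thread.
example = "0, 1335, 150.25, 60\n1, 1335, 148.00, 58\n"
for sample in parse_gpu_samples(example):
    print(sample)
```

Calling `query_gpus()` in a loop while the training job runs gives a clock, power, and temperature trace per GPU. Some cards report `[N/A]` for a field, which this simple parser does not handle.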

The machine is a little bit older, a Skylake i7-6700k on an Asus Z170A mobo. With two cards, the PCIe slots are running 8x to each GPU, but this hasn’t been an issue with the 1080ti.

The 1080ti are EVGA SC models with hybrid coolers (liquid cooled GPU with a fan on the card’s voltage regulator section). They do stay quite cool and EVGA ships them with mild overclocking.

The Titan V’s are right out of the box, no tweaks or anything.

I would have expected different results, with the TitanV’s besting or at least equaling the 1080ti cards.

I would have also expected the simple cifar10 to scale better with 2x GPU’s, being about the same, or perhaps a little better. Although with the small data and batch sizes, that could be explained by the CPU overhead of managing data between the GPU’s.

I have no idea how to dig into this further. I thought it might be an issue with the 396.54 drivers or CUDA 9.2.

Ok, I figured out some more of what’s going on.

Using the TF container under nvidia-docker on my machine, a very curious thing happens to the GPU clocks on the Titan-V. As soon as tensorflow creates the gpu references, the clocks on the GPU’s get pinned to no higher than 1335.

No matter what is set either on the host or the container in nvidia-smi, they are stuck no higher than 1335 as long as tensorflow is running.

Inside the container issuing the command:

nvidia-smi --applications-clocks=850,1612

will immediately increase the GPU clocks (as monitored by nvidia-smi dmon in another window). However, as soon as tensorflow begins using the GPU’s, they are throttled down to no higher than 1335. They never achieve thermal nor power limits in TensorFlow - not even close!!!

The GTX1080ti do not suffer this throttling effect, they will run at increased clocks until running into thermal or power limits.

100% repeatable.
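The pattern described above (clocks pinned at a ceiling while power and temperature stay well under their limits) can be checked mechanically. The sketch below is one way to do that over a list of samples. The function name, the 90% margin, and the limit values in the example are assumptions for illustration, not values reported by the driver.

```python
def looks_clock_capped(samples, cap_mhz, power_limit_w, temp_limit_c, margin=0.9):
    """Return True if the SM clock reaches cap_mhz but never exceeds it,
    while power and temperature stay below `margin` of their limits."""
    if not samples:
        return False
    clocks = [s["sm_mhz"] for s in samples]
    pinned_at_cap = max(clocks) == cap_mhz
    has_headroom = all(
        s["power_w"] < margin * power_limit_w and s["temp_c"] < margin * temp_limit_c
        for s in samples
    )
    return pinned_at_cap and has_headroom

# Illustrative samples and limits only, not measurements from this thread.
capped = [
    {"sm_mhz": 1335, "power_w": 150.0, "temp_c": 60},
    {"sm_mhz": 1335, "power_w": 155.0, "temp_c": 62},
]
boosting = [
    {"sm_mhz": 1335, "power_w": 150.0, "temp_c": 60},
    {"sm_mhz": 1600, "power_w": 200.0, "temp_c": 70},
]
print(looks_clock_capped(capped, cap_mhz=1335, power_limit_w=250, temp_limit_c=91))
print(looks_clock_capped(boosting, cap_mhz=1335, power_limit_w=250, temp_limit_c=91))
```

A card that is throttling for power or thermal reasons fails the headroom check, so it is not reported as capped by policy.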

Upon further searching, it appears that the driver imposes a maximum GPU clock of 1335 on the Titan-V for compute workloads only, as a matter of policy, because the Titan-V is considered a “consumer card”.

Hello,

We were able to reproduce and root cause the issue you were seeing. With CUDA 10 you should be able to see a performance increase in TitanV for similar workloads. The team at Nvidia understands the underlying concern that you’ve raised about locking GPU clocks while in Compute mode.

The Titan V was designed to deliver consistently accurate compute results and it will do so in many desktop environments. As a result, we set a conservative clock when running CUDA workloads. For users who want to push the TitanV past our spec, we’ll be enabling overclocking with a driver update that we’re aiming to post in November. In addition, we wanted you to note that the key advantage of TitanV over 1080Ti is the Tensor Cores.

Here are some examples, tools, and info that you can take a look at to fully take advantage of the Tensor Cores on TitanV.

Examples
· New mixed-precision model examples: https://developer.nvidia.com/deep-learning-examples
· GitHub: https://github.com/NVIDIA/DeepLearningExamples
· TensorFlow mixed-precision video: https://www.youtube.com/watch?v=i1fIBtdhjIg
Tools
· TensorFlow OpenSeq2Seq: https://nvidia.github.io/OpenSeq2Seq/html/mixed-precision.html & arXiv paper
Further information
· Mixed-precision blog: https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/
· Mixed-precision best practices: https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html
· Mixed-precision arXiv paper: https://arxiv.org/abs/1710.03740

Please let us know if you have any questions.

Thanks Tom. I look forward to testing the November update. I’m hopeful the Titan-V will run safely and reliably above the 1335 MHz that it is currently limited to.

Indeed, we acquired the cards to work on porting our code to take advantage of fp16 Tensor cores. It will take time to port.

The 1080ti’s we have been using are hybrid-cooled EVGA cards. Because they run very cool, they seem happy to run at their highest clock rates all the time without overheating. That matters for some code, especially fp32 code that can’t take advantage of the Tensor Cores.

We’re looking also to test the RTX 2080ti - as soon as we can get them.

Hi Tom,
It seems that this update was not included in the latest November release (410.78), as the Titan V is still locked at 1335 MHz for CUDA apps.
Any estimates regarding releasing this fix?

Hi Selim,

I am looking into this and will post back here when I have an answer.

Best,
Tom

Yes, seeing the same here.

Confirmed that RTX2080-TI Founder Editions run CUDA apps in full-GPU boost mode with current linux drivers.

I can confirm this.
I have the latest CUDA drivers. Performance is still locked to 1335 MHz for PyTorch etc.

Hi Tom,

Other than waiting for another driver version, is there any workaround that can be applied to get past the 1335 MHz limit? Thank you!

Hi deltack,

Unfortunately, none that I am aware of. I’m still waiting for an update from the engineers.

Please stay tuned.

Thanks for your patience,
Tom

Hi Tom,

Any news regarding the issue since last month?

@selim.sef

Our next driver release on 1/15/19 will have the update.

@deltack,

Here’s the nvsmi command that will uncap mclk on TitanV:
nvidia-smi.exe --cuda-clocks=OVERRIDE

To return to default:
nvidia-smi.exe --cuda-clocks=RESTORE_DEFAULTS

Best,
Tom

I have just been informed that version 417.35 has the update. Let us know if you have any issues.

https://www.nvidia.com/Download/driverResults.aspx/141167/en-us

Thanks Tom for the update! May I know if it’s going to be supported in Linux driver version 417.35 also?

Yes, the update made it into 415.25 for Linux.

https://www.nvidia.com/Download/driverResults.aspx/141448/en-us

Hi Tom, thanks for the info.
Checked with the 415.25 drivers in Linux and it works like a charm.
Now Titan V is able to drain more than 300W.

Hi Tom and selim.sef, I’ve had similar problems with my Titan V on Ubuntu 16.04. Even after updating to 415.25 driver, the Titan V is still locked at 1335 MHz. I have to apply “nvidia-smi --cuda-clocks=OVERRIDE” to make the GPU enter P2 state, but the clock speed just increased a little bit to ~1420 MHz, and the power draw is close to 250W. The GPU temperature is around 72C (should be thermal throttle).
I am just wondering how you managed to make the Titan V drain more than 300W, and what is the clock speed your Titan V is running after updating to 415.25?
By the way, I am still using CUDA 9.2. Not sure if it would make a difference if I upgrade to CUDA 10.0.
Thanks a lot.

Hello Tom,

Is there an equivalent command for “nvidia-smi --cuda-clocks=OVERRIDE” on older drivers such as 410.* on Linux? We have some jobs running on our Titan-Vs and can’t really upgrade the drivers at the moment, so any alternative would be nice.

Thanks!

Hi Synicix,

Sorry for the delay in responding. The answer is no, this command only supports newer drivers.
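For scripts that need to decide whether the override is available before calling it, here is a rough sketch based only on the versions mentioned in this thread (415.25 on Linux, 417.35 on Windows). The helper names are hypothetical, and treating every later version number as supported is an assumption, not something NVIDIA stated here.

```python
# Minimum driver versions reported in this thread to include the update.
# Assumption: any numerically later version also includes it.
MIN_DRIVER_FOR_OVERRIDE = {"linux": (415, 25), "windows": (417, 35)}

def parse_driver_version(text):
    """Turn a version string such as '415.25' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))

def supports_cuda_clocks_override(driver_version, platform):
    """Compare a driver version string against the thread's minimum for `platform`."""
    return parse_driver_version(driver_version) >= MIN_DRIVER_FOR_OVERRIDE[platform]

def override_command(enable):
    """Build (but do not run) the nvidia-smi command quoted earlier in the thread."""
    value = "OVERRIDE" if enable else "RESTORE_DEFAULTS"
    return ["nvidia-smi", f"--cuda-clocks={value}"]

for version in ("396.54", "410.78", "415.25"):
    print(version, supports_cuda_clocks_override(version, "linux"))
print(override_command(True))
```

The installed version can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`, and the resulting command list can be passed to `subprocess.run` on a machine where the check passes.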