TITAN V max clock speed locked to 1,335 MHz and underperforms TITAN Xp (Ubuntu 16.04, nvidia 390 & 396)

Hi, I have a question regarding the clock speed behavior of the TITAN V on Ubuntu 16.04.
We recently built an 8x TITAN V GPU server from Supermicro for running machine learning frameworks (TensorFlow, PyTorch, Keras) and installed Ubuntu 16.04 with nvidia-390 (390.64) from the apt repository. CUDA 9.0 is installed (.deb installation method) with all 3 performance patches included.

The TITAN V oddly underperforms the TITAN Xp (desktop devbox, same OS & driver setup). The reason turned out to be that the TITAN V clock speed is locked at 1,335 MHz and never goes any higher.
We expected the clock to max out at 1,912 MHz (official max clock) and gradually drop as the temperature rises, but that was not the case.

By contrast, our TITAN Xp devbox hits its official max of 1,911 MHz and stays around 1,800 MHz at ~80 °C, so the CUDA code we run (machine learning models) performs about 30% faster on the TITAN Xp system.
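As a rough sanity check on these numbers, theoretical peak FP32 throughput can be estimated as 2 FLOPs (one fused multiply-add) per CUDA core per clock cycle. The sketch below assumes the published core counts of 5,120 for the TITAN V and 3,840 for the TITAN Xp, which are not stated in this thread, and plugs in the clocks observed above. It is only an upper bound; real ML workloads also depend on memory bandwidth and kernel efficiency.

```python
def peak_fp32_tflops(cuda_cores: int, clock_mhz: float) -> float:
    """Theoretical peak FP32 TFLOPS, assuming one FMA (2 FLOPs) per core per cycle."""
    return 2 * cuda_cores * clock_mhz * 1e6 / 1e12


# Core counts are assumptions taken from public spec sheets, not from this thread.
CARDS = [
    ("TITAN V  @ 1335 MHz (observed cap)", 5120, 1335),
    ("TITAN V  @ 1912 MHz (expected max)", 5120, 1912),
    ("TITAN Xp @ 1800 MHz (observed)", 3840, 1800),
]

if __name__ == "__main__":
    for label, cores, mhz in CARDS:
        print(f"{label}: {peak_fp32_tflops(cores, mhz):.2f} TFLOPS")
```

Under these assumptions the capped TITAN V (about 13.7 TFLOPS) lands at roughly the same peak as the TITAN Xp at 1,800 MHz (about 13.8 TFLOPS), so the extra cores are cancelled out by the lower clock. That is at least consistent with the TITAN V showing no FP32 advantage here.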

Is this behavior of the TITAN V by design? Thermals are not the cause here, because the Supermicro chassis keeps all 8 GPUs under 70 °C, and the clock is still 1,335 MHz in the ~40 °C range. We also tested with gpu_burn (http://wili.cc/blog/gpu-burn.html) and saw the same issue.

Switching the driver from nvidia-390 (390.64) to nvidia-396 (396.24.02) did not change the behavior. Manually setting the application clocks with

sudo nvidia-smi -ac 850,1912

is accepted, but the actual clock is still capped at 1,335 MHz.
Power usage only reaches 60~70% of TDP (250 W), while the TITAN Xp frequently exceeds 250 W.
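To compare the requested application clock against what the GPU actually runs at, something like the following Python sketch could be used. It assumes the `clocks.applications.graphics` and `clocks.sm` query fields are available on the installed driver (check `nvidia-smi --help-query-gpu`), and the helper names and the 50 MHz tolerance are made up for illustration. The sample values only mimic the situation described above; they are not captured output.

```python
import subprocess

# Field names as listed by `nvidia-smi --help-query-gpu`; check them on your driver.
QUERY_FIELDS = "index,clocks.applications.graphics,clocks.sm"


def read_clocks() -> str:
    """Return nvidia-smi CSV output. Needs an NVIDIA driver; run it while the GPUs are under load."""
    cmd = ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"]
    return subprocess.check_output(cmd, text=True)


def parse_clocks(csv_text: str) -> list:
    """Parse 'index, requested MHz, actual MHz' lines into dicts (assumes numeric fields)."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, requested, actual = (int(field.strip()) for field in line.split(","))
        rows.append({"index": index, "requested_mhz": requested, "actual_mhz": actual})
    return rows


def gpus_below_requested(rows: list, tolerance_mhz: int = 50) -> list:
    """Indices of GPUs whose SM clock is more than tolerance_mhz below the application clock."""
    return [r["index"] for r in rows if r["actual_mhz"] < r["requested_mhz"] - tolerance_mhz]


# Illustrative sample shaped like the situation described above (not real captured output).
SAMPLE = "0, 1912, 1335\n1, 1912, 1335\n"
print(gpus_below_requested(parse_clocks(SAMPLE)))
```

On a real machine, `parse_clocks(read_clocks())` would replace the sample string. The reading only means something while a CUDA workload such as gpu_burn is running, since idle GPUs clock down anyway.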

The difference between the two systems is that the TITAN V server's display is connected to the VGA port of the Supermicro motherboard, while the TITAN Xp devbox's display is connected to the HDMI port of the GPU itself (installed in an ASUS X99-E WS motherboard).

Would this be a driver-related problem or the expected boost clock policy? Any guidance would be appreciated. I suspect a CUDA version issue is unlikely, because 2 independent tests (TensorFlow code using the system-installed CUDA 9.0, PyTorch code using an Anaconda environment with the bundled CUDA 9.1) gave the same results.

Attached are the nvidia bug report log of the TITAN V system and 2 pictures from the two systems showing the output of the command

nvidia-smi -q

I also found that this issue seems related to https://devtalk.nvidia.com/default/topic/1028063/?comment=5229233.

nvidia-bug-report.log.gz (375 KB)

Please put the Titans under load and rerun nvidia-bug-report.sh

Attached the new nvidia-bug-report captured while running gpu_burn (and also a screenshot).
nvidia-bug-report.log.gz (424 KB)

Ok, the Titan V stays in performance state P2.
Starting with Pascal, nvidia enforced a driver policy for consumer cards to reach only P2 on plain cuda workloads, and the maximum P0 only on graphics workloads.
With the Titans being “prosumer” cards, it’s a bit puzzling why this policy is now also enforced on the Volta but not the Pascal. So technically everything is all right, and only nvidia staff could give answers about the policy (NDA required, possibly).
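To tell a policy cap apart from ordinary power or thermal throttling, the performance state can be read together with the driver's throttle reasons. The Python sketch below is one possible way to do that, not an official diagnostic. It assumes the `clocks_throttle_reasons.*` query fields exist on the installed driver and report "Active" / "Not Active" (check `nvidia-smi --help-query-gpu`); the `classify` helper and its 90% threshold are arbitrary choices for illustration.

```python
import subprocess

# Field names as listed by `nvidia-smi --help-query-gpu`; availability varies by driver.
QUERY_FIELDS = (
    "index,pstate,clocks.sm,clocks.max.sm,"
    "clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown"
)


def read_status() -> str:
    """Return nvidia-smi CSV output. Needs an NVIDIA driver; run it while the GPUs are under load."""
    cmd = ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"]
    return subprocess.check_output(cmd, text=True)


def classify(line: str, near_max_fraction: float = 0.9) -> dict:
    """Classify one CSV line; assumes throttle fields read 'Active' or 'Not Active'."""
    index, pstate, sm, sm_max, power_cap, hw_slowdown = (f.strip() for f in line.split(","))
    sm_mhz, max_mhz = int(sm), int(sm_max)
    if "Active" in (power_cap, hw_slowdown):
        verdict = "throttled"  # a power or hardware slowdown reason is reported
    elif sm_mhz < near_max_fraction * max_mhz:
        verdict = "capped_without_throttle_reason"  # low clock, nothing reported
    else:
        verdict = "near_max_clock"
    return {"index": int(index), "pstate": pstate, "sm_mhz": sm_mhz, "verdict": verdict}


# Illustrative line shaped like the TITAN V case in this thread (not real captured output).
print(classify("0, P2, 1335, 1912, Not Active, Not Active"))
```

A GPU that sits well below its max clock under load with no throttle reason active would point toward a clock policy rather than a cooling or power problem.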

Yes, I’m also aware that the GeForce & TITAN cards (from Pascal onward) only allow up to the P2 state (our P100 & V100 SXM2 (NVLink) machines can reach P0). What puzzles me more is that the 3x TITAN Xp devbox (with an X99 motherboard and i7-6850K CPU) is capable of reaching the maximum boost clock speed on CUDA ops (~1,900 MHz, just like running graphics ops such as 3D games in Windows), resulting in real-world ML models actually running faster on the TITAN Xp than on the TITAN V. These are normal FP32 CUDA ops, though. For what it’s worth, we haven’t tested FP16 performance (using the TITAN V’s Tensor Cores) yet.

I’ve additionally attached the nvidia-bug-report from the TITAN Xp devbox, captured while running real-world TF & PyTorch Python code (GPU:0 runs at around 1,400 MHz due to a thermal limit. The others (GPU:1 & :2) run at near max clock speed due to a lighter workload. All are in the P2 perf state.)

If this is indeed the GPU Boost 3.0 clock policy for CUDA ops on Pascal vs. Volta, what would be the point of the TITAN V if its FP32 CUDA performance is this crippled?

Apparently a number of TITAN V gaming benchmarks (on Windows) show no problem running graphics workloads and achieve near max boost clock speed at lower temperatures. I might test FP32 CUDA performance on Windows myself, but I currently do not have spare parts for this.
nvidia-bug-report.log.gz (3.03 MB)

So I did some more testing, comparing the TITAN V setup with our V100 system. My conclusion is that this CUDA clock speed limit policy on the TITAN V is intentional.

The V100 max clock speed (in P0 state) is MEM: 877 MHz, Core: 1,530 MHz. If NVIDIA allowed the TITAN V CUDA clock speed to reach 1,912 MHz, it would cannibalize V100 sales, since the TITAN V would run much faster.

The question remains why, then, the clock limit policy of the TITAN Xp was removed. It may be related to something like https://www.techpowerup.com/235701/nvidia-unlocks-certain-professional-features-for-titan-xp-through-driver-update, released when the competitor rolled out its new card. So for now it’s best to stick with the TITAN Xp for FP32 CUDA performance.

I completely agree; the Xp reaching P0 clocks in the P2 state is an especially crazy clocking policy.

Just ran into this issue myself. Swapped a couple of GTX1080ti for Titan-V cards and was surprised to see them clocked so slowly.

The GTX1080ti will run at full clocks, only slowing when thermal or power limits are reached, as expected.

The Titan-V caps at 1335 and often will run only up to 1200. As soon as TensorFlow or a similar compute application connects to the card, the clocks throttle down in Linux.

I booted the machine over into Windows10 and most of the graphics benchmarks worked as expected, with boost clocks and all enabled.

As a result, there is essentially no benefit at all to the Titan-V vs. the GTX1080ti other than as a development platform for fp16 code that you intend to move to a production machine with Tesla V100 cards.

An official response from NVIDIA moderator for future reference:


Hoping to see NVIDIA unlock the clock speed for the TITAN V as it did for the Xp.

Another side note for anyone interested: we did see a 2x speedup when training standard architectures like ResNet using mixed precision training with apex (https://github.com/NVIDIA/apex) for PyTorch. But if a model contains techniques like dilated convolutions, mixed precision training turned out to be the same speed or somewhat slower. So it is a hit-or-miss experience as of now, and we are hoping for more optimized code paths in future updates.

Update: Starting from 415.25 (Linux), the TITAN V can reach P0 and a 1,800~1,900 MHz CUDA clock speed. https://devtalk.nvidia.com/default/topic/1042047/container-tensorflow/titan-v-slower-than-1080ti-tensorflow-18-08-py3-and-396-54-drivers/post/5305096/#5305096
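For anyone checking whether a machine already has a new enough driver, here is a small Python sketch. The 415.25 threshold comes from the update above and applies to Linux; the helper names are made up for illustration, and `installed_driver_version` assumes `nvidia-smi` is on the PATH.

```python
import subprocess


def installed_driver_version() -> str:
    """Return the driver version reported by nvidia-smi (needs an NVIDIA driver)."""
    cmd = ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
    return subprocess.check_output(cmd, text=True).splitlines()[0].strip()


def parse_version(version: str) -> tuple:
    """Turn a dotted version such as '396.24.02' into a tuple of ints: (396, 24, 2)."""
    return tuple(int(part) for part in version.strip().split("."))


def meets_minimum(driver_version: str, minimum: str = "415.25") -> bool:
    """True if driver_version is at least the minimum, compared component by component."""
    return parse_version(driver_version) >= parse_version(minimum)


# The two driver versions tried earlier in this thread, plus the threshold itself.
for version in ("390.64", "396.24.02", "415.25"):
    print(version, meets_minimum(version))
```

On a real system, `meets_minimum(installed_driver_version())` gives the answer for the installed driver. Comparing integer tuples avoids the trap of string comparison, which would mis-order versions with differing digit counts.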