Hi, I have a question regarding the clock speed behavior of the TITAN V on Ubuntu 16.04.
We’ve recently built an 8x TITAN V GPU server from Supermicro for running machine learning frameworks (TensorFlow, PyTorch, Keras), and installed Ubuntu 16.04 and nvidia-390 (390.64) from the apt repository. CUDA 9.0 is installed (.deb installation method) with all 3 performance patches included.
The TITAN V oddly underperforms the TITAN Xp (desktop devbox, same OS & driver setup). The reason appears to be that the TITAN V clock speed is locked at 1,335 MHz and never goes any higher.
We expected the clock to max out at 1,912 MHz (official max clock) and progressively go down as the temperature increases, but that was not the case.
In contrast, our TITAN Xp devbox hits its official max of 1,911 MHz and stays around 1,800 MHz at ~80 °C, so the CUDA code we run (machine learning models) performs about 30% faster on the TITAN Xp system.
Is this behavior of the TITAN V by design? Thermal throttling does not seem to be the cause, because the Supermicro chassis keeps all 8 GPUs under 70 °C, and the clock is still 1,335 MHz in the ~40 °C range. We’ve also tested with gpu_burn http://wili.cc/blog/gpu-burn.html and see the same issue.
Switching the driver from nvidia-390 (390.64) to nvidia-396 (396.24.02) did not change the behavior. Manually setting the application clocks with
sudo nvidia-smi -ac 850,1912
is accepted, but the actual clock is still capped at 1,335 MHz.
The power usage only reaches 60~70% of the TDP (250 W), while the TITAN Xp frequently hits over 250 W.
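In case it helps anyone reproduce this, here is a minimal sketch of how the numbers above could be collected for all GPUs in one go. It assumes Python 3 and the query fields that `nvidia-smi --help-query-gpu` lists on this driver; the function names (`parse_rows`, `capped_gpus`, `query_gpus`) and the 50 MHz tolerance are just illustrative choices, not anything official.

```python
import csv
import io
import subprocess

# Fields passed to `nvidia-smi --query-gpu`; run `nvidia-smi --help-query-gpu`
# to confirm they are available on the installed driver.
FIELDS = [
    "index",
    "name",
    "clocks.sm",                       # current SM clock (MHz)
    "clocks.max.sm",                   # maximum SM clock (MHz)
    "clocks.applications.graphics",    # application clock set via `-ac`
    "power.draw",                      # current power draw (W)
    "power.limit",                     # enforced power limit (W)
    "temperature.gpu",                 # core temperature (C)
    "clocks_throttle_reasons.active",  # bitmask of active throttle reasons
]


def parse_rows(csv_text):
    """Turn `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for values in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if values:  # skip blank lines
            rows.append(dict(zip(FIELDS, (v.strip() for v in values))))
    return rows


def capped_gpus(rows, tolerance_mhz=50):
    """Return (index, current, max) for GPUs running below their max SM clock.

    The tolerance is an arbitrary illustrative threshold.
    """
    capped = []
    for row in rows:
        try:
            current = int(row["clocks.sm"])
            maximum = int(row["clocks.max.sm"])
        except ValueError:
            continue  # field not reported as a number on this GPU/driver
        if maximum - current > tolerance_mhz:
            capped.append((row["index"], current, maximum))
    return capped


def query_gpus():
    """Run nvidia-smi once; needs the NVIDIA driver installed."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    return parse_rows(subprocess.check_output(cmd, universal_newlines=True))
```

Calling `capped_gpus(query_gpus())` while a load such as gpu_burn is running would list every GPU whose SM clock sits well below its reported maximum, and the `clocks_throttle_reasons.active` column in the rows shows whether the driver reports any throttle reason at that moment.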
One difference between the two systems is that the display of the TITAN V server is connected to the VGA port of the Supermicro motherboard, while the display of the TITAN Xp devbox is connected to the HDMI port of the GPU itself (installed in an ASUS X99-E WS motherboard).
Is this a driver-related problem or the expected boost clock policy? Any guidance would be appreciated. I suspect a CUDA version issue is less likely, because 2 independent tests (TensorFlow code using the system-installed CUDA 9.0, PyTorch code using an Anaconda environment with the bundled CUDA 9.1) gave the same results.
Attached are the nvidia bug report log of the TITAN V system and 2 pictures from the two systems with a command
I also found that this issue seems related to https://devtalk.nvidia.com/default/topic/1028063/?comment=5229233.
nvidia-bug-report.log.gz (375 KB)