Tesla V100 GPU way too slow

Ant125 · December 20, 2017, 9:14pm

Hi,

I have a server with Ubuntu 16.04. I have installed CUDA 9.1 and cuDNN 7. My driver version is 387.26, which I think should be compatible with the V100 GPU; nvidia-smi correctly recognizes the GPU.

I am using it with pytorch 0.3. The problem is that it is way too slow; one epoch of training resnet18 with batch size of 64 on cifar100 takes about 1 hour. I believe this is only a fraction of the performance the V100 is capable of.

nvidia-smi shows that the GPU is being utilized, so I don’t understand what the problem could be.

Do you have any ideas?
Thank you!

njuffa · December 20, 2017, 9:21pm

How long do you expect this task to take, and what is that expectation based on?

I am not familiar with pytorch. The version number 0.3 suggests that this may well be alpha-quality software. How good is the GPU support in pytorch? Has it been optimized for the Volta architecture? Have you asked for assistance from the software vendor (e.g. forum or mailing list)?

If you built pytorch yourself, did you create a release (rather than a debug) build? Have you carefully scrutinized all available configuration settings?

When you run with the CUDA profiler, what does it indicate about potential bottlenecks in the application?

Ant125 · December 20, 2017, 9:27pm

Hi,

pytorch is indeed in the beta phase but it has good GPU support. I did ask for assistance on pytorch forum, but I haven’t received an answer yet.

When I run the same code (again with pytorch 0.3) on a Titan Xp, it takes about 10 minutes. So I am seeing a 6x reduction in performance for the V100 GPU.

At first I built pytorch myself, then I removed it and installed the conda packages. In both cases the execution time was the same.
I am not sure about configuration settings, but if I use the exact same code on the Titan Xp, it goes 6x faster, so I am assuming there is something wrong with my V100.

I haven’t run the cuda profiler; is there some test / benchmark you suggest running?

njuffa · December 20, 2017, 10:01pm

You would want to run the application that you are interested in profiling.

If the test case runs in 10 minutes on a Titan Xp, but 60 minutes on V100 (six times longer), that might indicate that pytorch doesn’t know what to do with Volta-architecture parts yet and therefore either uses a generic GPU path, or possibly mostly CPU-based computation.
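Short of a full CUDA profiler run, one rough way to narrow this down is to split an epoch's wall time into "waiting for the next batch" and "running the training step". This is only a sketch, not a replacement for the profiler: the helper name time_epoch and its arguments are made up for illustration. GPU work is launched asynchronously, so the step time only means something if a synchronization call (for example torch.cuda.synchronize) is passed in as sync.

```python
import time


def time_epoch(batches, step, sync=None):
    """Return (seconds spent waiting for data, seconds spent in the step).

    batches: any iterable that yields batches (e.g. a data loader)
    step:    callable that runs one training step on a batch
    sync:    optional callable that blocks until queued GPU work is done
    """
    data_s = 0.0
    step_s = 0.0
    t0 = time.perf_counter()
    for batch in batches:
        t1 = time.perf_counter()
        data_s += t1 - t0          # time spent fetching this batch
        step(batch)
        if sync is not None:
            sync()                 # wait for the GPU before stopping the clock
        t0 = time.perf_counter()
        step_s += t0 - t1          # time spent computing on this batch
    return data_s, step_s


# Hypothetical usage, assuming a train_loader and a train_step function exist:
#   data_s, step_s = time_epoch(train_loader, train_step,
#                               sync=torch.cuda.synchronize)
```

If most of the time lands in the data column, the bottleneck is the input pipeline on the CPU side rather than the GPU.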

For now this looks like a software configuration issue, I don’t see any indications that there is anything wrong with the V100. Did you build the system with the V100 yourself, or is this a system obtained from a system integrator that partners with NVIDIA?

u39kun · December 21, 2017, 6:02am

@Ant125, are you running your code with “torch.backends.cudnn.benchmark = True”? It makes a huge difference in terms of performance. Setting it to True turns on the auto-tuner that picks the best algorithm to use for CUDA/cuDNN. It is not enabled by default since it is not always the best thing to do, for example when your network is pretty dynamic.
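For reference, a minimal sketch of where that flag goes. The flag is the one named above; the helper name enable_cudnn_autotuner and the guarded import are only illustrative, added so the snippet also runs on a machine without PyTorch installed.

```python
try:
    import torch
except ImportError:
    # Illustrative guard only: lets the sketch run where PyTorch is missing.
    torch = None


def enable_cudnn_autotuner():
    """Turn on the cuDNN auto-tuner. Returns True if the flag was set."""
    if torch is None:
        return False
    torch.backends.cudnn.benchmark = True
    return True


# Call once near the top of the training script, before the training loop.
enable_cudnn_autotuner()
```

The tuner benchmarks algorithms again whenever it sees a new input shape, so it tends to pay off when batch and image sizes stay fixed, as they do for CIFAR images.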

I’m using PyTorch 0.3.0, CUDA 9.0, cuDNN 7, and NVIDIA driver 387.34 with PyTorch’s pre-built conda package. I was surprised to find that my Titan V was slower than a 1080 Ti, so I asked on the PyTorch forum about what might be wrong. That is where I learned about this flag; see post #3 by yusaku in the PyTorch Forums thread “[SOLVED] Titan V on PyTorch 0.3.0, CUDA 9.0, CUDNN 7.0 is much slower than 1080 Ti”.

Just in case you are running into the same issue…

Ant125 · December 21, 2017, 3:33pm

Hi,

thank you! I have added the torch.backends.cudnn.benchmark=True line and it is 2x faster!

Still, I believe the V100 should be much faster, but at this point I guess it’s a pytorch configuration problem rather than a V100 problem. Thank you for your help!

njuffa · December 21, 2017, 4:06pm

Just to double check: The Tesla V100 is a GPU with passive cooling, meaning it must be installed in a server enclosure that provides adequate air flow across the GPU’s heat-exchanger fins. Otherwise the device will overheat, and first throttle (reduce its clock frequency) and ultimately turn off altogether to avoid permanent damage. When you monitor the V100 with nvidia-smi while running your application, what temperature does it report?

In contrast, the Titan Xp is an actively cooled GPU that includes its own fan.
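To answer the temperature question with numbers, here is a small sketch that reads temperature, SM clocks, and utilization from nvidia-smi while the training job runs. A throttling card would typically show a high temperature together with an SM clock well below its maximum. The sketch assumes nvidia-smi is on the PATH; the function names are made up for illustration, and the values are kept as strings because nvidia-smi can print non-numeric placeholders for fields a board does not report.

```python
import subprocess

# Fields passed to nvidia-smi's --query-gpu option.
QUERY_FIELDS = ["temperature.gpu", "clocks.sm", "clocks.max.sm", "utilization.gpu"]


def parse_gpu_status(csv_text):
    """Parse 'csv,noheader,nounits' output into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def query_gpu_status():
    """Run nvidia-smi once and return the parsed status of every GPU."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    output = subprocess.check_output(cmd).decode()
    return parse_gpu_status(output)


# Hypothetical usage while training is running in another process:
#   for gpu in query_gpu_status():
#       print(gpu["temperature.gpu"], "C,", gpu["clocks.sm"], "of",
#             gpu["clocks.max.sm"], "MHz")
```

Calling query_gpu_status() every few seconds during an epoch gives a simple log to compare against the Titan Xp run.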

u39kun · December 21, 2017, 5:58pm

You are welcome, @Ant125.
BTW, over on this thread https://discuss.pytorch.org/t/solved-titan-v-on-pytorch-0-3-0-cuda-9-0-cudnn-7-0-is-much-slower-than-1080-ti/11320/10 someone posted benchmark numbers for V100 on an Amazon p3 instance. The numbers reported for V100 are better than those for Titan V or 1080 Ti.

Robert_Crovella · December 21, 2017, 6:13pm

I would expect Tesla V100 to be generally faster than Titan V in many situations. It has ~33% higher memory bandwidth.
