Tesla V100 GPU way too slow

Hi,

I have a server running Ubuntu 16.04, with CUDA 9.1 and cuDNN 7 installed. My driver version is 387.26, which I think should be compatible with the V100 GPU; nvidia-smi correctly recognizes the GPU.

I am using it with pytorch 0.3. The problem is that it is way too slow: one epoch of training resnet18 with a batch size of 64 on cifar100 takes about 1 hour. I believe this is only a fraction of the performance the V100 is capable of.

nvidia-smi shows that the GPU is being utilized, so I don’t understand what the problem could be.

Do you have any ideas?
Thank you!

How long do you expect this task to take, and what is that expectation based on?

I am not familiar with pytorch. The version number 0.3 suggests that this may well be alpha-quality software. How good is the GPU support in pytorch? Has it been optimized for the Volta architecture? Have you asked for assistance from the software vendor (e.g. forum or mailing list)?

If you built pytorch yourself, did you create a release (rather than a debug) build? Have you carefully scrutinized all available configuration settings?

When you run with the CUDA profiler, what does it indicate about potential bottlenecks in the application?

Hi,

pytorch is indeed in the beta phase, but it has good GPU support. I did ask for assistance on the pytorch forum, but I haven’t received an answer yet.

When I run the same code (again with pytorch 0.3) on a Titan Xp, it takes about 10 minutes. So I am seeing a 6x reduction in performance for the V100 GPU.

At first I built pytorch myself, then I removed it and installed the conda packages. In both cases the execution time was the same.
I am not sure about configuration settings, but if I use the exact same code on the Titan Xp, it goes 6x faster, so I am assuming there is something wrong with my V100.

I haven’t run the cuda profiler; is there some test / benchmark you suggest running?

You would want to run the application that you are interested in profiling.

If the test case runs in 10 minutes on a Titan Xp, but 60 minutes on V100 (six times longer), that might indicate that pytorch doesn’t know what to do with Volta-architecture parts yet and therefore either uses a generic GPU path, or possibly mostly CPU-based computation.
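A quick way to test the CPU-fallback hypothesis is to ask PyTorch what it actually sees. The following is only a minimal sketch: the helper name `describe_gpu_setup` is made up for illustration, and the `try`/`except` around the import is there only so the snippet can be read and run on a machine without PyTorch.

```python
try:
    import torch
except ImportError:  # lets the sketch run where PyTorch is not installed
    torch = None


def describe_gpu_setup():
    """Collect basic facts about the CUDA setup that PyTorch reports.

    `describe_gpu_setup` is a hypothetical helper, not part of PyTorch.
    """
    if torch is None:
        return {"torch_installed": False}
    info = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
        "cuda_version": torch.version.cuda,            # CUDA version PyTorch was built with
        "cudnn_version": torch.backends.cudnn.version(),
    }
    if info["cuda_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in describe_gpu_setup().items():
        print("{}: {}".format(key, value))
```

If `cuda_available` is False, or the device name is not the V100, the training is probably not running where you think it is. Note that this does not confirm the model and the batches were actually moved to the GPU (for example with `.cuda()`); that would need a separate check in the training script.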

For now this looks like a software configuration issue; I don’t see any indication that there is anything wrong with the V100. Did you build the system with the V100 yourself, or is this a system obtained from a system integrator that partners with NVIDIA?

@Ant125, are you running your code with “torch.backends.cudnn.benchmark = True”? It can make a huge difference in performance. Setting it to True turns on the auto-tuner that picks the best cuDNN algorithm to use. It is not enabled by default because it is not always the best thing to do, for example when your network is fairly dynamic.
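For reference, here is roughly where the flag would go. This is a sketch under assumptions, not the one true way to do it: the wrapper function `enable_cudnn_autotuner` is a made-up name, and the guarded import is only there so the snippet also runs without PyTorch installed. In a real training script the single assignment near the top is all that is needed.

```python
try:
    import torch
except ImportError:  # lets the sketch run where PyTorch is not installed
    torch = None


def enable_cudnn_autotuner():
    """Turn on the cuDNN auto-tuner; return True if the flag was set.

    `enable_cudnn_autotuner` is a hypothetical wrapper for illustration.
    """
    if torch is None:
        return False
    # Set this once, before the first forward pass of the model.
    torch.backends.cudnn.benchmark = True
    return True


if __name__ == "__main__":
    print("cuDNN auto-tuner enabled:", enable_cudnn_autotuner())
```

The first few iterations may be slower while the auto-tuner tries out algorithms, so it makes sense to judge the effect on a whole epoch rather than on the first batches.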

I’m using PyTorch 0.3.0, CUDA 9.0, cuDNN 7, and NVIDIA driver 387.34 with PyTorch’s pre-built conda package. I was surprised to find that the Titan V was slower than the 1080 Ti, so I asked on the PyTorch forum what might be wrong, and that is how I learned about this flag: https://discuss.pytorch.org/t/solved-titan-v-on-pytorch-0-3-0-cuda-9-0-cudnn-7-0-is-much-slower-than-1080-ti/11320/3

Just in case you are running into the same issue…

Hi,

thank you! I have added the torch.backends.cudnn.benchmark=True line and it is 2x faster!

Still, I believe the V100 should be much faster, but at this point I guess it’s a pytorch configuration problem rather than a V100 problem. Thank you for your help!

Just to double check: The Tesla V100 is a GPU with passive cooling, meaning it must be installed in a server enclosure that provides adequate air flow across the GPU’s heat-exchanger fins. Otherwise the device will overheat, and first throttle (reduce its clock frequency) and ultimately turn off altogether to avoid permanent damage. When you monitor the V100 with nvidia-smi while running your application, what temperature does it report?

In contrast, the Titan Xp is an actively cooled GPU that includes its own fan.

You are welcome, @Ant125.
BTW, over on this thread https://discuss.pytorch.org/t/solved-titan-v-on-pytorch-0-3-0-cuda-9-0-cudnn-7-0-is-much-slower-than-1080-ti/11320/10 someone posted benchmark numbers for V100 on an Amazon p3 instance. The numbers reported for V100 are better than Titan V or 1080 Ti.

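I would expect the Tesla V100 to be generally faster than the Titan V in many situations. Its memory interface is ~33% wider (4096-bit vs. 3072-bit HBM2), which gives it correspondingly higher memory bandwidth (roughly 900 GB/s vs. 653 GB/s).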