Processes Freeze with new Titan Xp GPU

Hi everyone,

I’m working with an Ubuntu 14.04 server that has one Nvidia Titan X and, since today, two Nvidia Titan Xp GPUs. After I reinstalled the CUDA drivers, the old GPU still works like a charm, but the two new Xps seem to have a problem: every process that uses them freezes (Ctrl-C and Ctrl-D no longer work; I have to kill my SSH session to stop them). To rule out bugs in my own code, I tested the GPUs with the following super-simple piece of PyTorch code:

import torch
x = torch.zeros(10).cuda()

This code does nothing but create one tensor and move it to the GPU. When I run this with CUDA_VISIBLE_DEVICES=, I do not encounter any problems, but with CUDA_VISIBLE_DEVICES= my process freezes.
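Since the frozen process takes the SSH session with it, one option is to launch the test in a child process and give up after a timeout. The following is only a minimal sketch: `run_with_timeout` and the 30-second default are made-up names and values, not part of PyTorch or CUDA, and if the child is stuck inside an uninterruptible driver call the kill may have no effect.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s=30.0, env=None):
    """Run cmd in a child process and report whether it finished in time.

    Returns ("finished", returncode) or ("timeout", None).
    Hypothetical helper for illustration only.
    """
    proc = subprocess.Popen(cmd, env=env)
    try:
        return ("finished", proc.wait(timeout=timeout_s))
    except subprocess.TimeoutExpired:
        # Ask the kernel to kill the child. A process blocked in an
        # uninterruptible driver call may ignore this, so the follow-up
        # wait is bounded as well; the parent always regains control.
        proc.kill()
        try:
            proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            pass
        return ("timeout", None)
```

It could be called as `run_with_timeout([sys.executable, "-c", "import torch; torch.zeros(10).cuda()"], env=my_env)`, where `my_env` is a copy of `os.environ` with CUDA_VISIBLE_DEVICES set to the card under test. A "timeout" result then points at the device or driver rather than at the calling shell.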

Unfortunately, I’m not exactly an expert on problems like this, and I don’t really know how to approach the issue. Could anyone tell me how to figure out what exactly is going on here?

Any help is appreciated!

Best regards,
Patrick

You may want to enlist the help of someone local who has configured hardware before, given that you are operating with fairly expensive hardware.

(1) Did you buy the Titan Xps in original packaging from a reputable vendor?
(2) Did you install the Titan Xps into the correct type of slot (PCIe gen3 x16) in your server?
(3) Did you hook up the necessary PCIe power connectors to the Titan Xps?
(4) Does your installed CUDA driver support the Titan Xp, which is a newer GPU?

For debugging purposes, you may want to try operating with one GPU at a time first. Replace the one working Titan X with one of the Titan XPs. Does that work?

Hi njuffa,

Many thanks for your reply!!

[1] Yes.
[2]+[3] I believe yes, but I’ll double-check that.
[4] Right now, we have version 384.90, which is supposed to support Titan Xp. Here is a list of all relevant installed packages:

$ dpkg -l | grep -E 'ii.*(nvidia|cuda)'
ii  cuda                                                  8.0.61-1                                            amd64        CUDA meta-package
ii  cuda-8-0                                              8.0.61-1                                            amd64        CUDA 8.0 meta-package
ii  cuda-command-line-tools-8-0                           8.0.61-1                                            amd64        CUDA command-line tools
ii  cuda-core-8-0                                         8.0.61-1                                            amd64        CUDA core tools
ii  cuda-cublas-8-0                                       8.0.61.2-1                                          amd64        CUBLAS native runtime libraries
ii  cuda-cublas-dev-8-0                                   8.0.61.2-1                                          amd64        CUBLAS native dev links, headers
ii  cuda-cudart-8-0                                       8.0.61-1                                            amd64        CUDA Runtime native Libraries
ii  cuda-cudart-dev-8-0                                   8.0.61-1                                            amd64        CUDA Runtime native dev links, headers
ii  cuda-cufft-8-0                                        8.0.61-1                                            amd64        CUFFT native runtime libraries
ii  cuda-cufft-dev-8-0                                    8.0.61-1                                            amd64        CUFFT native dev links, headers
ii  cuda-curand-8-0                                       8.0.61-1                                            amd64        CURAND native runtime libraries
ii  cuda-curand-dev-8-0                                   8.0.61-1                                            amd64        CURAND native dev links, headers
ii  cuda-cusolver-8-0                                     8.0.61-1                                            amd64        CUDA solver native runtime libraries
ii  cuda-cusolver-dev-8-0                                 8.0.61-1                                            amd64        CUDA solver native dev links, headers
ii  cuda-cusparse-8-0                                     8.0.61-1                                            amd64        CUSPARSE native runtime libraries
ii  cuda-cusparse-dev-8-0                                 8.0.61-1                                            amd64        CUSPARSE native dev links, headers
ii  cuda-demo-suite-8-0                                   8.0.61-1                                            amd64        Demo suite for CUDA
ii  cuda-documentation-8-0                                8.0.61-1                                            amd64        CUDA documentation
ii  cuda-driver-dev-8-0                                   8.0.61-1                                            amd64        CUDA Driver native dev stub library
ii  cuda-drivers                                          384.66-1                                            amd64        CUDA Driver meta-package
ii  cuda-license-8-0                                      8.0.61-1                                            amd64        CUDA licenses
ii  cuda-misc-headers-8-0                                 8.0.61-1                                            amd64        CUDA miscellaneous headers
ii  cuda-npp-8-0                                          8.0.61-1                                            amd64        NPP native runtime libraries
ii  cuda-npp-dev-8-0                                      8.0.61-1                                            amd64        NPP native dev links, headers
ii  cuda-nvgraph-8-0                                      8.0.61-1                                            amd64        NVGRAPH native runtime libraries
ii  cuda-nvgraph-dev-8-0                                  8.0.61-1                                            amd64        NVGRAPH native dev links, headers
ii  cuda-nvml-dev-8-0                                     8.0.61-1                                            amd64        NVML native dev links, headers
ii  cuda-nvrtc-8-0                                        8.0.61-1                                            amd64        NVRTC native runtime libraries
ii  cuda-nvrtc-dev-8-0                                    8.0.61-1                                            amd64        NVRTC native dev links, headers
ii  cuda-repo-ubuntu1404                                  8.0.61-1                                            amd64        cuda repository configuration files
ii  cuda-runtime-8-0                                      8.0.61-1                                            amd64        CUDA Runtime 8.0 meta-package
ii  cuda-samples-8-0                                      8.0.61-1                                            amd64        CUDA example applications
ii  cuda-toolkit-8-0                                      8.0.61-1                                            amd64        CUDA Toolkit 8.0 meta-package
ii  cuda-visual-tools-8-0                                 8.0.61-1                                            amd64        CUDA visual tools
ii  libcuda1-384                                          384.90-0ubuntu0.14.04.1                             amd64        NVIDIA CUDA runtime library
ii  libcudnn7                                             7.0.3.11-1+cuda8.0                                  amd64        cuDNN runtime libraries
ii  libcudnn7-dev                                         7.0.3.11-1+cuda8.0                                  amd64        cuDNN development libraries and headers
ii  nvidia-384                                            384.90-0ubuntu0.14.04.1                             amd64        NVIDIA binary driver - version 384.90
ii  nvidia-384-dev                                        384.90-0ubuntu0.14.04.1                             amd64        NVIDIA binary Xorg driver development files
ii  nvidia-docker                                         1.0.1-1                                             amd64        NVIDIA Docker container tools
ii  nvidia-modprobe                                       384.66-0ubuntu1                                     amd64        Load the NVIDIA kernel driver and create device files
ii  nvidia-opencl-icd-384                                 384.90-0ubuntu0.14.04.1                             amd64        NVIDIA OpenCL ICD
ii  nvidia-prime                                          0.6.2.1                                             amd64        Tools to enable NVIDIA's Prime
ii  nvidia-settings                                       384.66-0ubuntu1                                     amd64        Tool for configuring the NVIDIA graphics driver
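The listing above contains two different driver version strings (384.66 and 384.90). Whether that matters for the freeze is not established here, but such mixes are easier to spot when grouped. A small sketch, assuming driver packages can be recognized by a three-digit major version at the start of the version field (a heuristic, not a dpkg rule); `driver_versions` is a made-up helper name:

```python
import re
from collections import defaultdict


def driver_versions(dpkg_lines):
    """Group installed packages by driver-style version such as 384.90.

    Assumption: a version field starting with three digits, a dot and
    two or three digits belongs to the NVIDIA driver series.
    """
    found = defaultdict(list)
    for line in dpkg_lines:
        fields = line.split()
        if len(fields) < 3 or fields[0] != "ii":
            continue  # skip headers and packages that are not installed
        name, version = fields[1], fields[2]
        match = re.match(r"(\d{3}\.\d{2,3})\b", version)
        if match:
            found[match.group(1)].append(name)
    return dict(found)


# A few lines copied from the listing above.
sample = [
    "ii  cuda-drivers     384.66-1                 amd64  CUDA Driver meta-package",
    "ii  nvidia-384       384.90-0ubuntu0.14.04.1  amd64  NVIDIA binary driver - version 384.90",
    "ii  nvidia-modprobe  384.66-0ubuntu1          amd64  Load the NVIDIA kernel driver and create device files",
    "ii  cuda-cudart-8-0  8.0.61-1                 amd64  CUDA Runtime native Libraries",
]
print(driver_versions(sample))
```

More than one key in the result means packages from different driver releases are installed side by side.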

I’ll try to remove all cards but one Titan Xp, and let you know whether anything changes.

Best,
Patrick

Hi njuffa,

Please excuse my late reply, but I have been away sick.

I can now confirm all four points you mentioned above. Furthermore, I removed all GPUs except one of the new Titan Xps, but unfortunately, nothing changed :-/

Do you happen to have an idea what I could try next?

Thanks a million!

Best,
Patrick

When you operated the machine with just a single Titan Xp, was that installed in the very same slot that was previously occupied by the (working) Titan X?

This would narrow down whether the issue is with the PCIe slot (misconfigured or defective) or the GPU (which seems unlikely given that it is factory fresh, although damage from improper handling, e.g. electrostatics, would be a possibility).

I can’t try the exact same slot, since the power cable for the Xp doesn’t fit, but I tried three different slots, and didn’t see any change. Also, I did think of electrostatics, and it seems unlikely that both of the brand new Xps (both of which I tried) are damaged.

I am puzzled as to what this means. Isn’t the Titan Xp (I haven’t actually seen one for real) using a standard PCIe power cable? And why does the cable fit in the other slots but not this one?

I have no further ideas as to what to try and would suggest finding local (non-remote) help from a person who can actually inspect the system and work with it hands-on.

The slot where the Titan X used to be provides two 6-pin power connectors, but the Titan Xp needs one with 6 and one with 8 pins.
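To make the connector question concrete, here is a rough power-budget sketch. It assumes the nominal PCIe limits (75 W from the x16 slot, 75 W per 6-pin connector, 150 W per 8-pin connector) and a 250 W board power for the Titan Xp; these figures are assumptions taken from published specifications, not measurements from this server, and `available_power` is a made-up helper.

```python
# Nominal limits, stated as assumptions rather than measured values.
SLOT_W = 75
CONNECTOR_W = {"6-pin": 75, "8-pin": 150}
TITAN_XP_BOARD_POWER_W = 250  # assumed board power of the card


def available_power(connectors):
    """Nominal power available from the slot plus the given connectors."""
    return SLOT_W + sum(CONNECTOR_W[c] for c in connectors)


for cabling in (["6-pin", "6-pin"], ["6-pin", "8-pin"]):
    budget = available_power(cabling)
    verdict = "enough" if budget >= TITAN_XP_BOARD_POWER_W else "short"
    print(cabling, budget, "W:", verdict)
```

Under these assumptions, two 6-pin cables fall below the card's board power, while 6-pin plus 8-pin leaves headroom.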

Thanks a million for help anyway!!! Does anyone else happen to have any ideas?

What is the output of:

nvidia-smi -a

dmesg | grep NVRM

?
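If it is easier to gather both outputs in one go, something like the following could work. This is only a sketch: `filter_lines` and `collect` are made-up helper names, `dmesg` may need elevated privileges on some systems, and a command that hangs inside the driver may not return even with the timeout.

```python
import subprocess


def filter_lines(text, needle="NVRM"):
    """Return the lines of text that contain needle (like grep)."""
    return [line for line in text.splitlines() if needle in line]


def collect(timeout_s=60):
    """Run the two commands asked for above and return their output."""
    report = {}
    for label, cmd in (("nvidia-smi -a", ["nvidia-smi", "-a"]),
                       ("dmesg", ["dmesg"])):
        try:
            done = subprocess.run(cmd, capture_output=True, text=True,
                                  timeout=timeout_s)
            report[label] = done.stdout
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Tool missing, not permitted, or not responding in time.
            report[label] = f"<failed: {exc}>"
    # Same effect as: dmesg | grep NVRM
    report["dmesg NVRM"] = "\n".join(filter_lines(report["dmesg"]))
    return report
```

Calling `collect()` on the server and pasting the "nvidia-smi -a" and "dmesg NVRM" entries would answer the question above.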