Ubuntu freezes randomly with Titan Xp Collector's edition

Not sure if this is the right place to ask, but I’ve been using the Titan Xp Collector’s Edition for a few days and have encountered multiple instances where the screen suddenly freezes and then shows no signal, at which point the LED on my Titan Xp starts to flash. Is this a problem with the driver? I’m using driver 390.12 with Linux kernel 4.13.

So yeah, any help would be appreciated… What does it mean when the LED on the GPU blinks? I’d say it blinks every second or so. I really don’t think it’s a software problem.

I can’t say I have ever observed something like this. Is this LED in the vicinity of the PCIe power cable connector(s) by any chance? If so, this could be an indication that there is an issue with the power supply to the card. Which might also explain the sudden loss of a signal from the card.

Double check whether PCIe power cable(s) are plugged in properly. Is this the only GPU in the system? What is the power rating (in watt) of the power supply unit in this machine?

I have a 750W PSU and I’m only using one GPU, so I think there is enough power…

The LED I’m talking about is part of the GPU (Galactic Empire edition), so it’s indicating something about the GPU.

The PSU came with a few 6+2 type connector cables. So I plugged one into the 8pin connector using a 6+2, and I simply used the 6pin part of another 6+2 connector for the 6pin part. (The two 6+2 connectors are completely separate, two independent cables connected to two different ports on the PSU)

The computer would crash under very light load… I’ve only been browsing the web mostly.

From your description, the wattage of the power supply should be sufficient, and the configuration of the cables sounds correct.

Crashing under very light load would be consistent with the GPU drawing power only through the PCIe slot (which can supply up to 75W), but not through the PCIe power cables. Are the connectors of the cables properly engaged? Typically they have a small tab that snaps into place once the connector is pushed all the way into the socket on the GPU. Pay special attention to the 6+2 connector: the two sections may have shifted relative to each other (although there should be a tiny protrusion on one part and a matching recess on the other to help ensure this doesn’t happen).
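To make the power reasoning above concrete, here is a minimal sketch that adds up the nominal limits for each cabling state. The 75 W slot figure is the one quoted above; the 75 W (6pin) and 150 W (8pin) values are the nominal ratings of the PCIe auxiliary connectors, and the 250 W board cap is the one that nvidia-smi reports later in this thread. The function name is made up for illustration.

```python
# Nominal power limits in watts. The slot value is quoted in the post above;
# the connector values are the nominal PCIe auxiliary connector ratings.
SLOT_W = 75
SIX_PIN_W = 75
EIGHT_PIN_W = 150

# Board power cap as reported by nvidia-smi later in this thread.
BOARD_CAP_W = 250


def available_power(six_pin_ok: bool, eight_pin_ok: bool) -> int:
    """Return the nominal power budget for a given cabling state."""
    total = SLOT_W
    if six_pin_ok:
        total += SIX_PIN_W
    if eight_pin_ok:
        total += EIGHT_PIN_W
    return total


if __name__ == "__main__":
    # Walk through all four cabling states and compare against the board cap.
    for six, eight in [(True, True), (False, True), (True, False), (False, False)]:
        budget = available_power(six, eight)
        verdict = "enough" if budget >= BOARD_CAP_W else "short"
        print(f"6pin={six!s:5} 8pin={eight!s:5} -> {budget:3d} W ({verdict})")
```

Only the fully cabled state covers the board cap on paper, which is why a half-seated connector is worth ruling out first.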

Another thing you would want to check is whether the GPU is inserted properly into the slot connector (inserted completely straight and pushed all the way in) and secured firmly at the bracket on the exhaust side of the GPU. Also, make sure it is installed in an x16 slot. Most boards have multiple PCIe slots, such as one or two x16 and maybe two more x4. If there are two x16 slots in your system, try the one you are not using now.

Did you install the GPU in the system yourself, or was that work performed by a third party? Is this a factory-fresh Titan Xp delivered in original packaging or a previously owned one?

Yeah, I ordered the Titan Xp from the NVIDIA website. I think the cables are connected correctly, because when I first set up the PC I actually only connected the 8pin connector, and when I tried to turn on the PC it gave me a message telling me to connect the other cable.

I put together everything myself.

So the CPU light on the motherboard also lights up when the computer freezes. According to the motherboard manual, that light is supposed to indicate that no CPU was detected or that the CPU failed.

Hm, that’s definitely weird. What’s the name and brand of the motherboard? Maybe some other forum participant has experience with that motherboard and can tell us more.

One hypothesis I considered is a defect in the GPU, which is why I inquired whether it was new when installed. But based on this motherboard error indicator, it does not seem likely that there was a problem with both the GPU and the CPU; one would rather suspect a problem with the motherboard.

Motherboards can get damaged in (at least) two ways when putting together a system:

(1) Mechanical stress (bending) when installing the CPU or the GPU can lead to hairline cracks in the signal traces.

(2) Electrostatic discharge can damage electronic components if proper precaution (e.g. grounding straps) isn’t taken.

But overall that kind of damage is unlikely. I built quite a few systems myself over about ten years and managed to damage a motherboard just once (killed the floppy controller). I am not sure what else to check. The usual thing to do is to double check anything that can be plugged in (power and I/O cabling, CPU, and add-on cards like the GPU). Are there any bent pins, dirty contacts, or incompletely or incorrectly inserted connectors?

Resolving such issues remotely is like asking a car mechanic to troubleshoot a car over the phone; doesn’t work that well. If you have a knowledgeable friend, ask him or her to look over your system; a third party can sometimes spot things that are out of place almost immediately that the builder of the system keeps on overlooking (the idea is akin to having someone other than the writer proofread a document).

So I actually installed Windows 10 and it seems to be working perfectly. So I think the hardware part is solid. I dual booted Ubuntu and the same problem appeared: random crashes with the GPU light flashing. Many forums suggest that Ubuntu crashes are caused by drivers. What exactly is the recommended driver for the Titan Xp Collector’s Edition on Ubuntu 16.04?
I also tried to update the kernel to 4.15 (from 4.13) and the same problem persisted.

That does not make sense to me whatsoever based on the information that had been provided. But if everything works under Windows we can obviously assume the hardware works.

Anyhow, updating the kernel seems like a bad idea. There is a kernel component to GPU drivers whose interface needs to match the kernel interface. As I recall, Linux kernel developers do not like to keep the kernel interfaces stable, so updating the kernel could mean the kernel interface gets out of sync with what is expected by the kernel portion of the driver. My suggestion would be to install the kernel version that is listed as supported for Ubuntu 16.04 in the Linux Installation Guide:

http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html

Make sure to download the NVIDIA driver package that matches the installed OS version, and follow NVIDIA’s installation instructions precisely. There are two driver installation methods, packaged or runfile. Pick one approach and stick with it, as switching between them tends to mess up the installation.
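As a rough way to check the kernel and driver pairing described above, the following sketch compares the running kernel with a version you pass in and reports the loaded NVIDIA kernel module version. It assumes the version you pass on the command line is the one you read from the installation guide (nothing is hard-coded here), and it assumes that /proc/driver/nvidia/version contains the text "Kernel Module" followed by the version number. All function names are made up for illustration.

```python
import platform
import re
import sys
from pathlib import Path


def major_minor(release: str) -> tuple:
    """Extract (major, minor) from a release string such as '4.13.0-32-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_matches(running: str, supported: str) -> bool:
    """True if the running kernel has the same major.minor as the supported one."""
    return major_minor(running) == major_minor(supported)


def loaded_driver_version(text: str):
    """Pull the driver version out of the /proc/driver/nvidia/version contents.

    Assumes the 'Kernel Module  <version>' layout; returns None otherwise.
    """
    match = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", text)
    return match.group(1) if match else None


if __name__ == "__main__":
    running = platform.release()
    print("running kernel:", running)

    # The supported version has to come from the installation guide.
    if len(sys.argv) > 1:
        print("matches guide :", kernel_matches(running, sys.argv[1]))
    else:
        print("matches guide : (pass the supported version as an argument)")

    proc_file = Path("/proc/driver/nvidia/version")
    if proc_file.exists():
        print("loaded driver :", loaded_driver_version(proc_file.read_text()))
    else:
        print("loaded driver : NVIDIA kernel module not loaded")
```

The comparison is deliberately limited to major.minor, since distribution kernels append their own patch and build suffixes.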

Hello everyone,
could HotDog2017 say whether he still has the issue? I just got a new computer with a Titan Xp and also a 750 watt power supply. And now my Ubuntu freezes randomly, both when I use it for computations AND when I try to play simple games like Dota 2 via Steam.

I do not have Win 10 to check whether the issue would be resolved by switching the OS.

I haven’t encountered random freezes for about a week now. However, I can’t tell you what solved the problem (or if it is solved at all). I ended up installing Windows 10 and dual booted Ubuntu using the “install ubuntu alongside windows 10” option, and that seemed to have solved the problem… Another thing that happened was that there was an Ubuntu update about a week ago, which I installed.

Do you have an AMD CPU? Because I think the crashes were because of the CPU.

As far as CUDA operation is concerned, please make sure to use only the OS and kernel versions listed in the Linux Installation Guide for whatever CUDA version you have installed.

It is certainly possible that new Linux kernels in particular break a CUDA installation, by mucking with kernel interfaces (a general and perpetual problem with Linux that exists by design).

Because of possible incompatibilities, NVIDIA supports only the OS, kernel, and toolchain versions listed in the installation guides. You are on your own for all other configurations.

I have freshly installed Ubuntu 16.04. After that directly installed nvidia 384.111 drivers, cuda 9.0, cuDNN 7.05 and tensorflow (and also steam and dota 2 ;) )

Regarding CPU: it is intel 8700k.

I have not used different kernels or anything like that.

If I run some image identification code, this is what my nvidia-smi output looks like. To me it looks normal.

The freezes come unexpectedly, sometimes during computations and sometimes during a game.

born@bornexmachina:~$ nvidia-smi
Sun Feb 11 21:59:40 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp COLLEC...  Off  | 00000000:01:00.0  On |                  N/A |
| 45%   74C    P2   206W / 250W |  11904MiB / 12180MiB |     62%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1080      G   /usr/lib/xorg/Xorg                           411MiB |
|    0      1511      G   compiz                                       144MiB |
|    0      2153      G   ...-token=1A94804E091A866E6687A407C642BA22   171MiB |
|    0      2924      C   /home/born/anaconda3/bin/python            11173MiB |
+-----------------------------------------------------------------------------+
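Since a single nvidia-smi snapshot says little about the moment of a freeze, one option is to log temperature, power draw, and utilization continuously and look at the last lines after a reboot. The following is a minimal sketch, assuming nvidia-smi is on the PATH and supports the `--query-gpu` fields named below (the full list is shown by `nvidia-smi --help-query-gpu`). The function names and the log file name are made up for illustration.

```python
import csv
import io
import os
import subprocess
import time

# Fields requested from `nvidia-smi --query-gpu`.
FIELDS = ["timestamp", "temperature.gpu", "power.draw", "utilization.gpu", "memory.used"]


def parse_sample(line: str) -> dict:
    """Turn one CSV line of nvidia-smi output into a field -> value mapping."""
    values = next(csv.reader(io.StringIO(line), skipinitialspace=True))
    return dict(zip(FIELDS, values))


def read_sample() -> dict:
    """Query GPU 0 once. Requires nvidia-smi to be installed."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits", "--id=0"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_sample(out.strip().splitlines()[0])


def log_samples(path: str, interval_s: float = 1.0) -> None:
    """Append one sample per interval until interrupted (Ctrl+C)."""
    with open(path, "a") as log:
        while True:
            log.write(repr(read_sample()) + "\n")
            log.flush()
            os.fsync(log.fileno())  # push to disk so the last line survives a hard freeze
            time.sleep(interval_s)
```

Calling `log_samples("gpu.log")` in a terminal before starting a computation or a game would leave a record of the last readings before the machine locked up, for example whether the temperature or power draw was climbing.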

And some error occurred, but not a freeze:

2018

Sorry for the double post, but the edit function doesn’t allow me to post it

2018

The only thing that is relevant is whether the kernel version installed on your machine matches the kernel version listed in the Linux Installation Guide for CUDA 9.0. It is my understanding that Ubuntu may update kernels automatically, but I don’t know that from first-hand experience because I stay away from Ubuntu as far as possible.
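One way to see whether a newer kernel was pulled in by an update is to list the kernel images present on the machine and mark the one currently running. This is a minimal sketch, assuming kernel images are stored as /boot/vmlinuz-<release>, which is how Ubuntu names them. The function names are made up for illustration.

```python
import os
import platform

PREFIX = "vmlinuz-"  # assumed naming scheme: /boot/vmlinuz-<release>


def releases_from_names(names):
    """Keep only kernel image names and strip the prefix, leaving the release."""
    return sorted(n[len(PREFIX):] for n in names if n.startswith(PREFIX))


def installed_kernels(boot_dir="/boot"):
    """Return the kernel releases that have an image in boot_dir."""
    if not os.path.isdir(boot_dir):
        return []
    return releases_from_names(os.listdir(boot_dir))


if __name__ == "__main__":
    running = platform.release()
    for release in installed_kernels():
        marker = "  <- running" if release == running else ""
        print(release + marker)
```

If more than one release shows up and the running one is not the version listed in the installation guide, that would be consistent with an automatic kernel update.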

Good for you for installing CUDA 9.0 instead of 9.1… 9.1 makes everything so much harder.

I was responding to aborn.priv, who stated that they installed CUDA 9.0.