GTX1050Ti apparently causing system reboot

Set up a new system a few months ago running Sparky GameOver (which is basically a skin on Debian testing) with a GTX1050Ti. 95% of the time everything is fine, but heavy 3D graphics load causes the system to spontaneously reboot. This happens reproducibly with at least three different applications. Heavy CUDA load does not cause a reboot.

With the help of the supplier, I’ve already ruled out non-GPU related hardware issues, such as overheating, faulty memory and loose connections. The reboot is so hard and so sudden that nothing out of the ordinary gets logged.
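Since nothing survives in the normal logs, one option is to record GPU telemetry yourself and force every sample to disk as it arrives. The following is only a rough sketch, assuming `nvidia-smi` is on the PATH and that the card reports these query fields (some GeForce cards show `[N/A]` for `power.draw`). The file name `gpu_trace.csv` and both function names are made up for illustration. One-second sampling can easily miss a short power spike, so an unremarkable last line does not rule out a power problem.

```python
import os
import subprocess

# Query fields for `nvidia-smi --query-gpu`; trim or extend as needed.
FIELDS = "timestamp,power.draw,temperature.gpu,clocks.sm,utilization.gpu"


def append_durably(path, line):
    """Append one line and push it to disk before returning (hypothetical helper)."""
    with open(path, "a") as f:
        f.write(line.rstrip("\n") + "\n")
        f.flush()               # empty Python's own buffer
        os.fsync(f.fileno())    # ask the kernel to write the data out now


def log_gpu(path="gpu_trace.csv", interval_s=1):
    """Stream nvidia-smi samples into `path`, syncing after every sample."""
    cmd = [
        "nvidia-smi",
        f"--query-gpu={FIELDS}",
        "--format=csv,noheader",
        "-l", str(interval_s),  # repeat the query every interval_s seconds
    ]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            append_durably(path, line)


# Usage: start log_gpu() in one terminal, then launch the stress test in another.
```

After the machine comes back up, the last few lines of the file show what the card was reporting just before the power cut.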

This has been happening since I first installed the system with the 440.36 driver. I’ve kept the drivers updated in the hope of a fix, and while some have kept the system running for longer before the reboot occurs, none have made it go away (440.59 was probably the best, 440.82 is a step back).

I’ve not seen any reports of such a severe problem anywhere, so my suspicions are now tending towards a hardware fault. But the fact that different versions of the driver have made a difference (positive or negative) means I still suspect it might be a driver bug. So any other reports one way or the other would help.

Spontaneous reboots are most often caused by an unstable power supply. Since the 1050 Ti is bus-powered only, this might also point to some mainboard problem. Please check for a BIOS update first, try reseating the card in its slot if not already done, then check whether replacing the PSU helps.

Reseating the board was the last thing I checked before posting here. I’ve just done a BIOS update, and that appears to have changed the behaviour from a spontaneous reboot to a spontaneous shutdown (only tested it once so far).
I have to admit that the only time I’ve had a Linux machine suddenly power down like this before was one with a dodgy power supply. But as this was a brand new machine and has always exhibited the problem, that seemed unlikely. (Not sure if I’ve got another suitable PSU lying around to check with.)

ETA: Two further data points. First, one of the applications which intermittently causes problems is a game I’m writing in Unity3d (it also happens with commercial Unity-developed games), but only when running the built game: I’ve never had anything more severe than an application lock-up when running the game within the Unity Editor. Secondly, running FurMark from GpuTest is the most reliable way of triggering a shutdown – it currently takes less than 10 s from a cold start, and I’ve never got it to run for more than a minute. However, I’ve just tried running it under Wine and got to 4 min with absolutely nothing alarming happening (~70fps).

As feared, no other PSU around I can swap out for, but I do have a mothballed machine I can try swapping the card to, which should help narrow down the cause. Will have to wait for the weekend though.

I think I can rule out the card itself and drivers now (although the drivers obviously have some impact) – I transferred the card to the old machine, brought it up to the same base operating system (Debian testing) and driver (440.82) and stress-tested it with FurMark for 3 min with nothing untoward happening (60fps, which is less than running under Wine on the original machine, but this one has a much older CPU).

OK, I’m back to the drivers being at least part of the problem. At the suggestion of the supplier, I (reluctantly, and tediously) installed Windows on the machine that’s crashing under Linux. And the same test (FurMark from GpuTest 0.7.0) ran for over four minutes at 70fps.

So, compared with the one configuration that spontaneously stops under heavy graphical load (the new machine running native Linux 3D):

  1. The same system can operate under high CPU and CUDA load with no problems.
  2. How readily the system stops has varied between different releases of the drivers.
  3. The same hardware running a different OS performs the same graphics tasks with no problems.
  4. While this all points to the drivers, the exact same card, with the same drivers and kernel, in a different machine, is also fine.
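To check the reasoning in the list above, the test runs reported in this thread can be written down as a small table and searched for a single factor that explains the failure. This is only an illustrative sketch: the factor labels and both function names are my own shorthand, and driver version is left out because it only changed how quickly the failure occurred.

```python
# Each run from the thread: the factors in play, and whether it survived.
RUNS = [
    ({"machine": "new", "os": "Linux", "load": "FurMark"}, False),
    ({"machine": "new", "os": "Linux", "load": "FurMark via Wine"}, True),
    ({"machine": "new", "os": "Linux", "load": "CUDA"}, True),
    ({"machine": "new", "os": "Windows", "load": "FurMark"}, True),
    ({"machine": "old", "os": "Linux", "load": "FurMark"}, True),
]


def single_factor_explanations(runs):
    """(factor, value) pairs seen in every failing run and in no passing run."""
    failing = [f for f, ok in runs if not ok]
    passing = [f for f, ok in runs if ok]
    found = []
    for key, value in failing[0].items():
        in_all_failures = all(f[key] == value for f in failing)
        in_any_pass = any(p[key] == value for p in passing)
        if in_all_failures and not in_any_pass:
            found.append((key, value))
    return found


def one_change_fixes(runs):
    """Passing runs that differ from the failing run in exactly one factor."""
    failing = next(f for f, ok in runs if not ok)
    fixes = []
    for factors, ok in runs:
        changed = [k for k in failing if factors[k] != failing[k]]
        if ok and len(changed) == 1:
            fixes.append((changed[0], factors[changed[0]]))
    return fixes


print(single_factor_explanations(RUNS))  # no single factor is to blame
print(one_change_fixes(RUNS))            # yet changing any one factor helps
```

With these five runs, no single factor accounts for the failure, while changing any one of machine, OS or load on its own makes the problem disappear. That suggests an interaction specific to this machine running native Linux 3D, not a fault in one component.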

There’s something really weird going on here.