Driver 440.31 locks up system with GeForce RTX 2070

On system:

Distributor ID: Ubuntu
Description: Pop!_OS 19.10
Release: 19.10
Codename: eoan

I experience an eventual full machine lockup when using the 440.31 driver to run CNNs. The 435.21 driver does NOT exhibit the same lockup. The time to failure is variable: GPU training can run anywhere from five minutes to an hour before the machine locks up.

I’m attaching an nvidia-bug-report file that I made on the 435 version, which is working. Unfortunately, it seems that if I upgrade to 440 on this distribution, I am stuck there unless I do a complete system recovery. I verified this when I ran into the bug after upgrading a new laptop to 440, and then did a system recovery back to 435, where it worked as expected.

This Jupyter file: consistently caused the issue on 440, and runs just fine on 435. Specifically, run up to and including this line:

learn.fit_one_cycle(10, slice(lr), pct_start=0.9)

Since the time to lock up is random, it might run just fine for 10 epochs. I was running:

while True:
    learn.fit_one_cycle(10, slice(lr), pct_start=0.9)

which never made it more than an hour in my testing.
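
For anyone trying to reproduce this, here is a minimal sketch of how the loop above could be wrapped so that a timestamped heartbeat is written to disk around every call. The idea is that after a hard lockup and reboot, the last line in the file shows roughly when the freeze happened. The names run_with_heartbeat and repro_heartbeat.log are hypothetical, not part of fastai or the notebook, and I have not verified this on an affected machine.

```python
import os
import time


def _write(log, message):
    # Flush and fsync each line so it is likely to survive a hard lockup.
    log.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {message}\n")
    log.flush()
    os.fsync(log.fileno())


def run_with_heartbeat(step, log_path, max_iterations=None):
    """Call step() repeatedly, logging a line before and after each call.

    step is any zero-argument callable (hypothetical wrapper around the
    training call). With max_iterations=None the loop runs until the
    process is interrupted or the machine locks up.
    Returns the number of completed iterations.
    """
    completed = 0
    with open(log_path, "a") as log:
        while max_iterations is None or completed < max_iterations:
            _write(log, f"start iteration {completed + 1}")
            step()
            completed += 1
            _write(log, f"done iteration {completed}")
    return completed
```

In the notebook this would be called as run_with_heartbeat(lambda: learn.fit_one_cycle(10, slice(lr), pct_start=0.9), "repro_heartbeat.log"), assuming learn and lr are already defined by the earlier cells.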

I saw this behavior on two separate but identically specced systems. The other has been sent back to the manufacturer. It’d be helpful to know whether this test runs fine on other RTX 2070s with the 440 driver. My concern is that if a somewhat unique combination of hardware is causing this issue on 440, it might not be fixed in future releases of the driver.
nvidia-bug-report.log.gz (332 KB)

A log file from a working system doesn’t really help. You’ll have to upgrade to the non-working driver, provoke the freeze and create a new nvidia-bug-report.log after reboot.

I understand that a log file from a working system is less than ideal, but I do not want to upgrade and be stuck in a broken state. I was hoping the reproduction instructions would suffice. I am working with the laptop manufacturer to possibly get the appropriate log files.

Please check this: