Problem with detection of A6000 on Lenovo SR650 with Ubuntu 20.04

Hi all,

On our Linux server we have an A6000 installed, which has been working for the last 6 months. We use that card to perform some very intensive computations within a TLC-oriented domain.

During one of our last test sessions, our process suddenly crashed with the following output:

what(): Out of memory. cudaHostAlloc() failed to allocate 1.66406 MiB with error 999 (cudaErrorUnknown)- Allocated already: 0 bytes in 0 arrays.

and the server printed the content of the attached dmesg.log. Then, from the command line, we got this:

$ nvidia-smi
Unable to determine the device handle for GPU 0000:86:00.0: Unknown Error

The driver was from the 510.x series at that time. Running

$ lspci | grep -i nvidia

gave this output:

86:00.0 VGA compatible controller: NVIDIA Corporation Device 2230 (rev a1)

I then started an apt upgrade/reboot cycle, followed by a reinstall of the driver. At that point, since the card was not being detected reliably, Ubuntu kept suggesting a 470.x series driver instead of the previous 510.x.

Now we have this in dmesg:

[ 1149.110772] NVRM: GPU 0000:86:00.0: RmInitAdapter failed! (0x23:0xffff:1195)
[ 1149.110802] NVRM: GPU 0000:86:00.0: rm_init_adapter failed, device minor number 0

endlessly repeating. Since the card abruptly stopped working, should I assume it is broken?
Please help me investigate this. I have attached my nvidia-bug-report.

Thanks a lot.
nvidia-bug-report.log.gz (293.6 KB)
dmesg.log (4.1 KB)

Looks broken.
Please check whether it works in another system; if not, replace it.
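
When the card is moved to the other system, one rough way to see whether the failure follows the card is to count the RmInitAdapter failures per PCI address in the kernel log there. This is only a minimal sketch, assuming a POSIX shell with grep, sed, sort, uniq and awk available. The function name count_nvrm_init_failures is made up for illustration, and the match string comes only from the dmesg lines quoted in this thread, not from any official list of NVRM messages.

```shell
#!/bin/sh
# Hypothetical helper (name invented for this sketch).
# Reads dmesg-style text on stdin and prints "<PCI address> <count>" for
# every GPU that logged an "RmInitAdapter failed" line.
# Assumption: the log lines look like the ones quoted in this thread.
count_nvrm_init_failures() {
    grep 'NVRM: GPU .*RmInitAdapter failed' |
        # Keep only the PCI address that follows "NVRM: GPU ".
        sed -E 's/.*NVRM: GPU ([0-9a-fA-F:.]+): .*/\1/' |
        # Count how many times each address appears.
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Example usage (reading the kernel log may require root):
#   dmesg | count_nvrm_init_failures
#   count_nvrm_init_failures < dmesg.log
```

If the same card logs these failures in a second machine with a freshly installed driver, that points at the card rather than the server or the OS install. An empty result only means this particular message was not logged, so nvidia-smi should still be checked as well.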

Ok, we are going to try that ASAP.

I checked internally and that card was installed around the beginning of January. Do you think that, even under an almost continuous heavy load, 3 months is a reasonable and acceptable life span?

Before that we had a pair of Quadro RTX 5000 cards that never had a glitch, but since they are going to be phased out by NVIDIA soon, we switched to the A6000 for support reasons.

Thanks a lot.

Since the A6000 is built for heavy workloads, I guess it was just bad luck.