Problem with detection of A6000 on Lenovo sr650 with Ubuntu 20.04

micantox · March 30, 2022, 2:39pm

Hi all,

On our Linux server we have an A6000 installed, which has been working for the last 6 months. We use that card to perform some very intensive computations within a TLC-oriented domain.

During one of our latest test sessions, our process suddenly crashed with the following output:

what(): Out of memory. cudaHostAlloc() failed to allocate 1.66406 MiB with error 999 (cudaErrorUnknown)- Allocated already: 0 bytes in 0 arrays.

The server also printed the content of the attached dmesg.log. Then, from the command line, we got this:

$ nvidia-smi
Unable to determine the device handle for GPU 0000:86:00.0: Unknown Error

The driver was from the 510.x series at that time. Running

$ lspci | grep -i nvidia

produced this output:

86:00.0 VGA compatible controller: NVIDIA Corporation Device 2230 (rev a1)

I then started an apt upgrade/reboot cycle, followed by a reinstall of the driver. At that point, since the card detection was troublesome, Ubuntu kept suggesting a 470.x series driver instead of the previous 510.x.

Now we have this in dmesg:

[ 1149.110772] NVRM: GPU 0000:86:00.0: RmInitAdapter failed! (0x23:0xffff:1195)
[ 1149.110802] NVRM: GPU 0000:86:00.0: rm_init_adapter failed, device minor number 0

repeating endlessly. Since the card abruptly stopped working, should I assume it is broken?
Please help me investigate this. My nvidia-bug-report is attached.

Thanks a lot.
nvidia-bug-report.log.gz (293.6 KB)
dmesg.log (4.1 KB)

generix · March 30, 2022, 6:29pm

Looks broken.
Please check whether it works in another system; if not, replace it.
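
When testing the card in another system, one way to confirm whether the same failure recurs is to scan the dmesg output for the NVRM lines quoted earlier in this thread. The following is only a rough sketch: the regular expression is an assumption derived from the two log lines shown above, the function name `count_init_failures` is made up for illustration, and the NVIDIA driver may format these messages differently in other versions.

```python
import re
from collections import Counter

# Assumed pattern: derived only from the two NVRM lines quoted in this thread,
# not from any driver documentation.
NVRM_FAIL = re.compile(
    r"NVRM: GPU (?P<addr>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9a-fA-F]): "
    r"RmInitAdapter failed! \((?P<code>[^)]*)\)"
)


def count_init_failures(dmesg_text):
    """Count RmInitAdapter failures per (PCI address, error code) pair."""
    failures = Counter()
    for line in dmesg_text.splitlines():
        match = NVRM_FAIL.search(line)
        if match:
            failures[(match.group("addr"), match.group("code"))] += 1
    return failures


# Sample input: the two lines posted above.
SAMPLE = """\
[ 1149.110772] NVRM: GPU 0000:86:00.0: RmInitAdapter failed! (0x23:0xffff:1195)
[ 1149.110802] NVRM: GPU 0000:86:00.0: rm_init_adapter failed, device minor number 0
"""

if __name__ == "__main__":
    for (addr, code), n in count_init_failures(SAMPLE).items():
        print(f"{addr}: RmInitAdapter failed ({code}) x{n}")
```

To use it on a live system, the saved output of dmesg (for example the attached dmesg.log) can be read into a string and passed to the function instead of SAMPLE. If the same address and error code keep appearing on a second machine, that would support the conclusion that the card itself is at fault.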

micantox · March 31, 2022, 7:15am

Ok, we are going to try that ASAP.

I checked internally and that card was installed around the beginning of January. Do you think that, even under an almost continuous heavy load, 3 months is a reasonable and acceptable life span?

We had a pair of Quadro RTX 5000 cards before, which never had a glitch, but since they are going to be phased out by NVIDIA soon, we switched to the A6000 for support reasons.

Thanks a lot.

generix · March 31, 2022, 12:51pm

Since the A6000 is built for heavy workloads, I guess it was just bad luck.
