On our Linux server we have an A6000 installed, which has been working for the last 6 months. We use that card to perform some very intensive computations within a TLC oriented domain.
During one of our last test sessions, our process suddenly crashed with the following output:
what(): Out of memory. cudaHostAlloc() failed to allocate 1.66406 MiB with error 999 (cudaErrorUnknown)- Allocated already: 0 bytes in 0 arrays.
and the server printed out the content of the attached dmesg.log. Then, from the command line, we got this:
Unable to determine the device handle for GPU 0000:86:00.0: Unknown Error
The driver was from the 510.x series at that time. Running
$ lspci | grep -i nvidia
gave this output:
86:00.0 VGA compatible controller: NVIDIA Corporation Device 2230 (rev a1)
I then went through an apt upgrade/reboot cycle, followed by a reinstall of the driver. At that point, since detection of the card was problematic, Ubuntu kept suggesting a 470.x series driver instead of the previous 510.x.
Now we have this in dmesg:
[ 1149.110772] NVRM: GPU 0000:86:00.0: RmInitAdapter failed! (0x23:0xffff:1195)
[ 1149.110802] NVRM: GPU 0000:86:00.0: rm_init_adapter failed, device minor number 0
endlessly repeating. Since the card abruptly stopped working, do I have to assume it is broken?
Please help me investigate this. My nvidia-bug-report is attached.
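
In case it is useful for triage, here is a rough sketch of how the repeating NVRM lines can be counted per GPU and per message type. It is only an illustration: the regular expression is my own guess at the line format, based on the two dmesg lines quoted above, and the function name summarize_nvrm is made up for this example.

```python
import re
from collections import Counter

# Assumed line format, inferred only from the two dmesg lines quoted in this post:
# "[ 1149.110772] NVRM: GPU 0000:86:00.0: RmInitAdapter failed! (0x23:0xffff:1195)"
NVRM_LINE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: GPU (?P<bdf>[0-9a-fA-F:.]+): (?P<msg>.*)$"
)


def summarize_nvrm(lines):
    """Count NVRM messages per (PCI address, message kind)."""
    counts = Counter()
    for line in lines:
        m = NVRM_LINE.match(line.strip())
        if not m:
            continue  # not an NVRM per-GPU line, ignore it
        # Drop any trailing "(code)" or ", detail" part so repeats group together.
        kind = re.split(r"[(,]", m.group("msg"))[0].strip()
        counts[(m.group("bdf"), kind)] += 1
    return counts


# The two lines from this post, used as sample input.
sample = [
    "[ 1149.110772] NVRM: GPU 0000:86:00.0: RmInitAdapter failed! (0x23:0xffff:1195)",
    "[ 1149.110802] NVRM: GPU 0000:86:00.0: rm_init_adapter failed, device minor number 0",
]

for (bdf, kind), n in summarize_nvrm(sample).items():
    print(f"{bdf}  {kind}: {n}")
```

To run it against a saved log instead of the sample, the open file object can be passed directly, for example summarize_nvrm(open("dmesg.log")), assuming the log was saved under that name.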