Unable to determine the device handle for GPU 0000:02:00.0: Unknown Error

$ nvidia-smi
Unable to determine the device handle for GPU 0000:02:00.0: Unknown Error

This happens suddenly.
I assume it is caused by the BOINC application.

Ref:
http://www.gpugrid.net/forum_thread.php?id=5317
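One way to test whether BOINC really triggers it is to suspend GPU work from the command line before the next crash window. A sketch, assuming the boinccmd tool from the boinc-client package is installed and the client is running locally:

$ boinccmd --set_gpu_mode never 0   # stop GPU tasks (duration 0 = until changed)
$ boinccmd --set_gpu_mode auto 0    # later: let BOINC use the GPU again

If the error never appears while GPU tasks are suspended, the BOINC workload (or the load it puts on the GPU) is at least the trigger.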

The nvidia-smi -l 60 watch loop (shown below) also stops with "Unknown Error":

Thu May 19 12:13:56 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
| N/A   80C    P0    N/A /  N/A |   2291MiB /  4096MiB |     97%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1412      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      3491      C   bin/python                       2285MiB |
+-----------------------------------------------------------------------------+
Unexpected NVML event
Error occurred while processing the event: Unknown Error
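To keep the last readings before such a crash, the watch loop can be logged to a file with timestamps. This is only a minimal sketch (the query fields are standard nvidia-smi options, the log file name is just an example, and power.draw is skipped because this GPU reports it as N/A):

$ nvidia-smi --query-gpu=timestamp,temperature.gpu,utilization.gpu,memory.used \
             --format=csv -l 60 | tee -a ~/gpu-monitor.log

When nvidia-smi aborts with "Unknown Error", the file still contains the temperature and utilization recorded just before the failure.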

Later I found a newer driver version (not available in the default MX Linux/Ubuntu repositories):

Sun May 29 13:59:25 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
| N/A   80C    P0    N/A /  N/A |     59MiB /  4096MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1418      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A    118967      C   ...-linux-gnu__opencl_nvidia       53MiB |
+-----------------------------------------------------------------------------+

With this driver the crashes seem less frequent,
but I do not know why or how that would fix anything.

This error (Unexpected NVML event) occurred again:

Tue May 31 15:13:45 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
| N/A   82C    P0    N/A /  N/A |     59MiB /  4096MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1485      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A    159882      C   ...-linux-gnu__opencl_nvidia       53MiB |
+-----------------------------------------------------------------------------+
Unexpected NVML event
Error occurred while processing the event: Unknown Error

and it occurred when the laptop switched to battery mode.

Some applications (especially BOINC) seem to crash the laptop.

It looks like the actual cause is the external 19.5V/65W power adapter:
it cannot supply enough current, so the laptop battery drains even
while plugged into the wall.
So I think the solution would be to buy a 90W power adapter.
Under GPU load the draw seems to be too high for the current 65W adapter;
the stock 65W adapter of this HP ENVY 17 laptop apparently does not cover
maximum-load situations.
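One way to test the adapter theory without extra hardware is to watch whether the battery keeps discharging while the charger is plugged in. A rough sketch (the directory names under /sys/class/power_supply differ between machines; AC* and BAT0 are assumptions for this laptop):

$ cat /sys/class/power_supply/BAT0/status    # "Discharging" even on AC means the adapter cannot keep up
$ cat /sys/class/power_supply/BAT0/capacity  # battery charge in percent
$ cat /sys/class/power_supply/AC*/online     # 1 = charger detected

If the status stays "Discharging" under GPU load even though the adapter reports online, the 65W supply really is too weak.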

Do you know whether the GeForce MX450 or MX550 has lower power consumption
than this GeForce MX250 model?
Anyway, it is probably impossible to replace the GPU chip in this laptop, isn't it?

When I use it with the 90W power adapter, the result is as follows:

Sat Jun 18 15:56:49 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
| N/A   93C    P0    N/A /  N/A |    180MiB /  4096MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     15623      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A     17273      C   bin/acemd3                        174MiB |
+-----------------------------------------------------------------------------+
Sat Jun 18 15:56:54 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ... ERR!  | 00000000:02:00.0 Off |                  N/A |
|ERR!  ERR! ERR!     N/A / ERR! | GPU is lost          |    ERR!         ERR! |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

So I think the crash is caused by overheating, or by a driver that is not optimized for low power.
The GPU is kicked off the bus:

Jun 18 11:02:31 mx kernel: [130120.250658] NVRM: GPU at PCI:0000:02:00: GPU-793994bc-1295-4395-dc48-7dd3d7b431e2
Jun 18 11:02:31 mx kernel: [130120.250663] NVRM: Xid (PCI:0000:02:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Jun 18 11:02:31 mx kernel: [130120.250665] NVRM: GPU 0000:02:00.0: GPU has fallen off the bus.
Jun 18 11:02:31 mx kernel: [130120.250687] NVRM: A GPU crash dump has been created. If possible, please run
Jun 18 11:03:58 mx kernel: [ 7.229407] [drm] [nvidia-drm] [GPU ID 0x00000200] Loading driver
Jun 18 11:10:14 mx kernel: [ 396.557669] NVRM: GPU at PCI:0000:02:00: GPU-793994bc-1295-4395-dc48-7dd3d7b431e2
Jun 18 11:10:14 mx kernel: [ 396.557674] NVRM: Xid (PCI:0000:02:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Jun 18 11:10:14 mx kernel: [ 396.557676] NVRM: GPU 0000:02:00.0: GPU has fallen off the bus.
Jun 18 11:10:14 mx kernel: [ 396.557697] NVRM: A GPU crash dump has been created. If possible, please run
Jun 18 20:11:43 mx kernel: [ 6.301529] [drm] [nvidia-drm] [GPU ID 0x00000200] Loading driver
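To see these Xid messages the moment they happen, and how they line up with plugging or unplugging the charger, a kernel-log follow can run in a second terminal. A simple sketch (either variant should work, depending on whether the systemd journal is in use):

$ dmesg -w | grep -E 'NVRM|Xid'

or:

$ journalctl -k -f | grep --line-buffered -E 'NVRM|Xid'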

Notebook dGPUs are managed by the system BIOS. On overheating or overcurrent, the whole notebook usually shuts down.
Getting an Xid 79 in high-load situations like yours on a notebook most often points to a defective GPU.
If it happens in idle situations, a BIOS flaw in power management might be the cause.

If the laptop was sold with this defective GPU, there is no warranty left, since the defect was not found earlier, so it looks like there are no options left to get it fixed.

When the compute process starts, the GPU crashes soon after:

Thu Jun 23 20:04:03 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
| N/A   60C    P8    N/A /  N/A |      4MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     11294      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+
Thu Jun 23 20:04:08 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
| N/A   62C    P0    N/A /  N/A |     35MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     11294      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A    208003      C   ...le/amicable_OpenCL_v_3_02       29MiB |
+-----------------------------------------------------------------------------+
GPU 00000000:02:00.0: Detected Critical Xid Error
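To check how close the 93 °C seen earlier already is to the hardware limit, nvidia-smi can print the GPU's slowdown and shutdown temperature thresholds (a sketch; on some notebook GPUs these fields are simply reported as N/A):

$ nvidia-smi -q -d TEMPERATURE

If the shutdown threshold is only a few degrees above the temperatures seen under load, that would support the overheating explanation rather than a power problem.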