Server hangs with driver 15.1 when a VM with 2 or more vGPUs is shut down

Hi,

We recently installed the 15.1 drivers for Linux KVM. Our chassis is a Supermicro and the GPU is an A16. The documentation explains how to run multiple vGPUs in a single VM on an A16 with Q-series and C-series vGPUs. We ran some tests attaching 2 to 10 vGPUs to a single VM, and it works fine. But when the VM is shut down, we sometimes see the following messages in dmesg:
dmesg_vgpu.txt (16.8 KB)

After this failure, if we try to start another VM, the server hangs and the only thing we can do is physically reset it.

When this occurs, the vGPU process on the server does not exit. I found a workaround that avoids resetting the server: killing the vGPU process and resetting the GPU with nvidia-smi -r. This brings its own problems: if I am running another VM with a vGPU, we have to stop it while it is working, so it has to start again from the beginning. This makes the server not production-ready, because the workaround forces us to shut down VMs in production, and we shouldn't have to stop production VMs.
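For reference, here is a rough sketch of that workaround as a shell helper. The function name `recover_vgpu`, its arguments, and the `DRY_RUN` switch are made up for illustration. The PID of the stuck vGPU process and the GPU index have to be looked up by hand, and I am assuming a plain `kill` is enough (a stronger signal may be needed). By default it only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch of the workaround: stop the vGPU process that did not exit,
# then reset the affected GPU. Hypothetical helper, not an NVIDIA tool.
# DRY_RUN defaults to 1, so the commands are only printed unless DRY_RUN=0.

recover_vgpu() {
    local pid="$1"        # PID of the stuck vGPU process (found manually)
    local gpu_index="$2"  # GPU index as listed by nvidia-smi

    if [ -z "$pid" ] || [ -z "$gpu_index" ]; then
        echo "usage: recover_vgpu <pid> <gpu-index>" >&2
        return 1
    fi

    # In dry-run mode, prefix each command with echo instead of running it.
    local run=""
    if [ "${DRY_RUN:-1}" != "0" ]; then
        run="echo"
    fi

    # Assumption: the default signal is sufficient to end the process.
    $run kill "$pid"
    # Reset only the selected GPU; nothing else may be using that GPU.
    $run nvidia-smi -r -i "$gpu_index"
}
```

Calling `recover_vgpu 1234 0` prints the two commands for a process with PID 1234 on GPU 0; `DRY_RUN=0 recover_vgpu 1234 0` actually runs them. The reset still requires that nothing else is using that GPU, which is exactly the problem described above.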

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.07    Driver Version: 525.85.07    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A16          On   | 00000000:CE:00.0 Off |                    0 |
|  0%   45C    P8    16W /  62W |      0MiB / 15356MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A16          On   | 00000000:CF:00.0 Off |                    0 |
|  0%   43C    P8    15W /  62W |      0MiB / 15356MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A16          On   | 00000000:D0:00.0 Off |                    0 |
|  0%   38C    P8    16W /  62W |      0MiB / 15356MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A16          On   | 00000000:D1:00.0 Off |                    0 |
|  0%   36C    P8    15W /  62W |      0MiB / 15356MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+