Weird Guest State (unable to run workloads without reinstalling drivers) - MIG KVM Passthrough

Hello everyone,

I was running some resiliency tests on an A30 using this implementation of MaxFlops, on a system configured with the vGPU host driver (vgpu-ubuntu-525_525.85.07), KVM, and 4 VMs, each with MIG pass-through. I maxed out the number of blocks in MaxFlops. Individually the workloads ran fine, but when I ran them concurrently the system more or less crashed, and each VM reported a different error (as printed by the gpuAssert/cudaGetErrorString check in the aforementioned MaxFlops implementation):

  • Unknown error
  • CUDA-capable device(s) is/are busy or unavailable
  • out of memory (this one is weird, since all partitions are equally sized and the same workload ran perfectly on its own)
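
For anyone who wants to reproduce the concurrent run, here is a minimal sketch of how the launch could be scripted from the host. It only illustrates the idea and is not the exact procedure used above: the helper names `run_one` and `run_concurrently`, the hostnames `vm1` to `vm4`, and the `./maxflops` path are all hypothetical placeholders.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_one(cmd):
    """Run one command to completion and return (exit code, stderr text)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stderr.strip()


def run_concurrently(commands):
    """Start all commands at roughly the same time, one thread per command."""
    # max(1, ...) avoids a ValueError when the command list is empty.
    with ThreadPoolExecutor(max_workers=max(1, len(commands))) as pool:
        # pool.map returns results in the same order as `commands`.
        return list(pool.map(run_one, commands))


# Hypothetical usage (hostnames and binary path are assumptions):
# commands = [["ssh", f"vm{i}", "./maxflops"] for i in range(1, 5)]
# for cmd, (code, err) in zip(commands, run_concurrently(commands)):
#     print(cmd[1], code, err)
```

Collecting the exit code and stderr per VM keeps each guest's error message separate, which makes it easier to see which VM reported which of the errors listed above.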

After that, I couldn’t run any CUDA workloads in the VMs. After restarting the VMs, the “CUDA-capable device(s) is/are busy or unavailable” error persisted, and I was still unable to run any CUDA workloads. Reinstalling the guest drivers and rebooting once more solved the issue.

Has this behavior been observed before, or is my system just wonky?