Multi-GPU Passthrough, extremely slow GPU initialization

Hello everybody,

We have a server (Supermicro AS -2124GQ-NART) with four Tesla A100 GPUs, and we integrated the server into our OpenNebula environment.
We pass the GPUs to the VM via PCI passthrough as described here (PCI Passthrough — OpenNebula 5.12.8 documentation).
Unfortunately, there is a scenario where the GPUs are extremely slow: when we map one, two, or three GPUs to one VM.
When the remaining GPU is already in use, our VM takes a very long time to load data onto the GPU.

We tested this behaviour in different configurations; here is one of them:

Two VMs with two GPUs each. We loaded zeros onto a GPU with PyTorch:

import torch

In the first VM this takes a reasonable amount of time (about 3 seconds):

time python
python 2.12s user 1.02s system 99% cpu 3.145 total

strace -c python

% time     seconds  usecs/call     calls    errors syscall
 62.97    0.112609          28      4092       516 ioctl
 17.84    0.031907           5      6099           brk
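
To see which step the time goes into, the test can be split into phases. This is only a sketch and not the exact script we ran: the helper names `time_phase` and `profile_gpu_init`, the tensor size, and the device index `cuda:0` are assumptions made for illustration.

```python
import time

def time_phase(label, fn):
    """Run fn(), print how long it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f} s")
    return result, elapsed

def profile_gpu_init(device="cuda:0", n=1024):
    """Time the import, the CUDA initialization and a small allocation separately."""
    # Imported lazily so that the import itself can be timed.
    torch, _ = time_phase("import torch", lambda: __import__("torch"))
    # Initializes PyTorch's CUDA state explicitly instead of on first use.
    time_phase("torch.cuda.init", torch.cuda.init)

    def alloc():
        # Assumed workload: a small tensor of zeros placed on the GPU.
        x = torch.zeros(n, device=device)
        torch.cuda.synchronize()  # wait until the GPU work has finished
        return x

    time_phase("torch.zeros on GPU", alloc)
```

Calling `profile_gpu_init()` inside each VM should show whether the delay sits in the CUDA initialization or in the first allocation.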

When we switch to the second VM, the initialization of the GPU is extremely slow (over 13 minutes, almost all of it spent in ioctl calls):

time python
python 2.33s user 543.02s system 68% cpu 13:13.97 total

strace -c python

% time     seconds  usecs/call     calls    errors syscall
 99.98  541.038915      132219      4092       516 ioctl
  0.01    0.042287           7      6104           brk
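
For a direct comparison of the two `strace -c` summaries, a small parser can be used. This is a minimal sketch that assumes the whitespace-separated row layout shown above; the function name `parse_strace_summary` is made up for illustration.

```python
def parse_strace_summary(text):
    """Map syscall name -> (seconds, usecs_per_call, calls) from `strace -c` rows."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue
        try:
            float(parts[0])  # data rows start with the "% time" value
        except ValueError:
            continue  # header or separator line
        # The errors column may be empty, so take the name from the end.
        rows[parts[-1]] = (float(parts[1]), int(parts[2]), int(parts[3]))
    return rows

# Rows copied from the two runs above.
fast = parse_strace_summary("""
62.97 0.112609 28 4092 516 ioctl
17.84 0.031907 5 6099 brk
""")
slow = parse_strace_summary("""
99.98 541.038915 132219 4092 516 ioctl
0.01 0.042287 7 6104 brk
""")

for name in ("ioctl", "brk"):
    ratio = slow[name][0] / fast[name][0]
    print(f"{name}: {fast[name][0]:.6f} s -> {slow[name][0]:.6f} s (x{ratio:.1f})")
```

The number of ioctl calls is identical in both runs (4092, of which 516 fail), so each call is slower by a factor of several thousand rather than there being more calls, while brk is nearly unchanged.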

Did we miss something? We use the same setup on a different server, which uses some older Tesla GPUs (V100 + K40), and it works just fine there.
When we pass all four GPUs into one VM, there seems to be no problem.

Can anyone help?