Jul 27 16:39:09 emano kernel: NVRM: Xid (PCI:0000:1a:00): 79, pid=1370, GPU has fallen off the bus.
One of the GPUs is shutting down. Since it's not always the same one, I suspect the cards themselves are not damaged and the cause is either overheating or insufficient power. Please monitor the temperatures, check the PSU, and try limiting the clocks with nvidia-smi -lgc.
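To check whether it really is a different GPU each time, it may help to tally the Xid events per PCI address from the kernel log. The following is only a rough sketch: the function names `parse_xid` and `count_xid_by_gpu` are made up for illustration, and the regular expression assumes the log lines look like the one quoted above.

```python
import re
from collections import Counter

# Assumed line format, taken from the log line quoted above:
#   ... kernel: NVRM: Xid (PCI:0000:1a:00): 79, pid=1370, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+)")


def parse_xid(line):
    """Return (pci_address, xid_code) for an NVRM Xid line, or None otherwise."""
    match = XID_RE.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def count_xid_by_gpu(lines):
    """Count Xid events per (pci_address, xid_code) pair."""
    counts = Counter()
    for line in lines:
        parsed = parse_xid(line)
        if parsed is not None:
            counts[parsed] += 1
    return counts
```

Feeding it the lines of the kernel log (for example the output of `journalctl -k` saved to a file) gives one counter entry per GPU and Xid code. If the events are spread across several PCI addresses, that supports the power or cooling theory rather than a single faulty card.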