We have two 2080 Ti GPUs on the same motherboard (without NVLink) for deep learning applications. The issue occurs when both GPUs are training: one GPU becomes unresponsive about 10 minutes into training. This consistently happens to one of the GPUs, even after we swap the two GPUs into each other’s PCIe slots.
A screenshot of the error and the log are here: https://drive.google.com/drive/folders/1PUpZarOy0Bs4W7EQUFdXYn6siTD-1FwZ?usp=sharing