We have two 2080 Ti GPUs on the same motherboard (without NVLink) for deep learning applications. The issue occurs only when both GPUs are training: one GPU becomes unresponsive about 10 minutes into training. This consistently happens to one of the GPUs, even after we swap the two GPUs into each other’s PCI-e slots.
A screenshot of the error and the log are here: shared - Google Drive