I just spent four days chasing this problem, so I thought I’d post this for anyone who runs across it in the future. I have an HP Z420 workstation with BIOS v3.06, running RHEL 6.4. I have a Quadro 600 as the display GPU and a K20c as the compute GPU. When I ran several GPU programs in a row, the machine would crash and display a hardware error:
928-Fatal PCIe error
PCIe error detected Slot 5
Slot 5 holds the Quadro 600 display card; the K20c used for computation is in slot 2 (closer to the CPU, better airflow). Both GPUs are PCIe 2.0 cards, and PCIe 3.0 slots are supposed to work with 2.0 cards. I found that by going into the BIOS and restricting PCIe slots 2 and 5 to 2.0 speeds (5 GT/s per lane) instead of the default 3.0 (8 GT/s per lane) supported by the Z420, the GPU would run reliably.
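If you want to confirm from Linux what link speed each slot actually negotiated after changing the BIOS setting, something like the following may help. It is only a rough sketch, not something from my original debugging: it assumes Python 3.7 or newer and that `lspci` (from pciutils) is installed, and the function names `parse_link_status` and `report` are made up for illustration. It reads the `LnkSta` line that `lspci -vv` prints for each PCIe device and maps the speed to a PCIe generation.

```python
import re
import subprocess

# Per-lane transfer rate (GT/s) as printed by lspci -> PCIe generation.
GEN_BY_SPEED = {"2.5": 1, "5": 2, "8": 3}

# Device header line, e.g. "02:00.0 VGA compatible controller: ..."
# (an optional 4-digit domain prefix such as "0000:" is also accepted).
HEADER = re.compile(r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+(.*)$")

# Negotiated link status line, e.g. "LnkSta: Speed 5GT/s, Width x16, ..."
LNKSTA = re.compile(r"LnkSta:\s+Speed\s+([\d.]+)GT/s.*?Width\s+x(\d+)")


def parse_link_status(lspci_text):
    """Return {bus address: (generation or None, speed string, lane width)}."""
    results = {}
    current = None
    for line in lspci_text.splitlines():
        header = HEADER.match(line)
        if header:
            current = header.group(1)
            continue
        link = LNKSTA.search(line)
        if link and current:
            speed = link.group(1)
            width = int(link.group(2))
            results[current] = (GEN_BY_SPEED.get(speed), speed, width)
    return results


def report():
    """Run lspci -vv and print the negotiated link for every PCIe device."""
    out = subprocess.run(
        ["lspci", "-vv"], capture_output=True, text=True, check=True
    ).stdout
    for address, (gen, speed, width) in sorted(parse_link_status(out).items()):
        print(f"{address}: {speed} GT/s x{width} (PCIe gen {gen})")
```

Call `report()` as root, since `lspci` usually hides the capability details (including `LnkSta`) from unprivileged users. With the slots restricted in the BIOS, I would expect the two GPUs to show 5 GT/s or lower. Some cards drop the link speed when idle, so it is worth checking while a GPU program is running.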