If you modify the “__global__ void matrixMul(float*,float*,float*,int,int)” function in matrixMul_kernel.cu, from the matrixMul project of the SDK samples, so that it runs in an infinite loop (for(;;)), then compile and run it, you will get an error like this:
If I understand correctly, this happens because of a 5-second run time limit on kernels in Windows. If I start multiple instances of this modified matrixMul, the error is printed for only a few of the started instances, and then the system freezes forever. As I understand it, the BMC (Baseboard Management Controller) of the mainboard detects an error in the processor and/or the PCIe link, stops the affected devices and halts the system (this is why it freezes forever). Afterwards, in the SEL (System Event Log) of this board, I can see errors like this:
The situation is exactly the same on Linux, where (AFAIK) there is no 5-second limit: the error message “the launch timed out and was terminated.” is not printed, but the system still freezes forever with exactly the same errors in the SEL.
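To make the reproduction concrete without patching the SDK sample, here is a minimal standalone sketch of the same idea. The kernel name spinForever and the buffer size are made up for illustration. It also assumes a newer toolkit than 1.1, because cudaDeviceGetAttribute and cudaDeviceSynchronize are not available there. It first asks the runtime whether a run time limit applies to kernels on device 0, then launches a kernel that never returns and prints whatever error the runtime reports. Be careful: on a system like mine this may freeze the machine.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the modified matrixMul kernel: it never returns.
// The volatile store keeps the compiler from discarding the loop.
__global__ void spinForever(volatile float* out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    for (;;) {
        out[idx] += 1.0f;
    }
}

int main()
{
    // Ask the runtime whether kernels on device 0 are subject to a run time limit.
    int timeoutEnabled = 0;
    cudaError_t err = cudaDeviceGetAttribute(&timeoutEnabled,
                                             cudaDevAttrKernelExecTimeout, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceGetAttribute: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Run time limit on kernels: %s\n", timeoutEnabled ? "yes" : "no");

    const int threads = 256;  // arbitrary size, chosen only for illustration
    float* d_out = NULL;
    err = cudaMalloc((void**)&d_out, threads * sizeof(float));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc: %s\n", cudaGetErrorString(err));
        return 1;
    }

    spinForever<<<1, threads>>>(d_out);

    // Catches launch configuration errors; the kernel itself runs asynchronously.
    err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch: %s\n", cudaGetErrorString(err));
        cudaFree(d_out);
        return 1;
    }

    // Blocks until the kernel ends. With a run time limit in effect, the driver
    // is expected to kill the kernel and an error is returned here; without one,
    // this call does not return.
    err = cudaDeviceSynchronize();
    printf("Kernel ended: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return (err == cudaSuccess) ? 0 : 1;
}
```

If the first line prints "no" on Linux, that would match my assumption that no limit applies there, for example when the card is not driving a display.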
I mention the 5-second timeout because I think it exists for a reason, and system behavior like this may be expected if you do things like this, but I do not know for sure.
What do you think? Am I using the software in an abnormal way? Or is this a hardware error? Or is some onboard software failing (BIOS/BMC/etc.)?
Software: NVIDIA driver 169.21, CUDA (SDK/Toolkit) 1.1 x86_64, OS Windows 2003 x64.
Hardware: Intel S5000XVN (latest firmware), 2x Xeon E5345, EVGA 8800 GTX, PSU 1000W.