8800 GTX + S5000XVN + >20 instances of matrixMul = system hang

6O6EP · January 31, 2008, 8:29pm

Hello,

If you modify the “__global__ void matrixMul(float*, float*, float*, int, int)” function in matrixMul_kernel.cu (from the matrixMul project in the SDK samples) so that it runs in an infinite loop (for(;;)), then compile and run it, you will get an error like “the launch timed out and was terminated.”

If I understand correctly, this happens because of a 5-second run time limit in Windows. If I start multiple instances of this modified matrixMul, the error is printed for only a few of them, and then the system freezes permanently. As I understand it, the BMC (Baseboard Management Controller) of the mainboard detects an error in the processor and/or the PCIe link, stops the affected devices, and halts the system (which is why it freezes permanently). Afterwards, in the SEL (System Event Log) of this board, I can see errors like this:

Processor /Processor 1 Stat (#0x90) The BMC on S5000PSL has reported a critical assertion event for Processor 1 Stat. The event has the following information: IERR, Socket designation string from SMBIOS table is not found. There is no recommended action defined for this event.

Processor /Processor 2 Stat (#0x91) The BMC on S5000PSL has reported a critical assertion event for Processor 2 Stat. The event has the following information: IERR, Socket designation string from SMBIOS table is not found. There is no recommended action defined for this event.

Critical Interrupt #0x08 The SMI Handler on S5000PSL has reported a critical assertion event for Critical Interrupt sensor 8. The event has the following information: uncorrectable bus error. There is no recommended action defined for this event.

OEM Reserved /SMI Timeout (#0x85) The BMC on S5000PSL has reported a critical assertion event for SMI Timeout. The event has the following information: it has been asserted. No action is required.

(I could not find the error message related to the PCIe link, as I had cleared the SEL; it looks like “PCIe/Link#4: uncorrectable bus error”.)

The situation is exactly the same on Linux, where (AFAIK) there is no 5-second limit and the error message “the launch timed out and was terminated.” is not printed, but the system still freezes permanently with exactly the same errors in the SEL.

I mention the 5-second timeout because I assume it exists for a reason, so perhaps system behavior like this is to be expected when you do things like this, but I do not know.
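For reference, here is a minimal sketch of how a program can ask whether the run time limit applies to a device and how it can recognize the timeout error. It assumes a newer toolkit than the CUDA 1.1 release discussed here (cudaDeviceGetAttribute and cudaDeviceSynchronize were added in later releases), and it assumes the GPU of interest is device 0. The noop kernel is a hypothetical placeholder, not part of the SDK sample.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty kernel: it only gives the error check something to wait on.
__global__ void noop() {}

int main()
{
    const int device = 0;  // assumption: the GPU of interest is device 0

    // A nonzero value means a run time limit is enforced on kernels for this device.
    int timeoutEnabled = 0;
    cudaError_t err = cudaDeviceGetAttribute(&timeoutEnabled,
                                             cudaDevAttrKernelExecTimeout,
                                             device);
    if (err != cudaSuccess) {
        fprintf(stderr, "attribute query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("run time limit on kernels: %s\n", timeoutEnabled ? "yes" : "no");

    noop<<<1, 1>>>();

    // Kernel launches are asynchronous, so a timeout is only reported by a
    // later synchronizing call such as this one.
    err = cudaDeviceSynchronize();
    if (err == cudaErrorLaunchTimeout) {
        fprintf(stderr, "kernel exceeded the run time limit: %s\n",
                cudaGetErrorString(err));
        return 2;
    }
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("kernel finished normally\n");
    return 0;
}
```

On a device that drives a display the attribute is typically nonzero, and a kernel that never returns should then end with cudaErrorLaunchTimeout rather than running forever.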

What do you think? Am I using the software in an abnormal way? Or is this a hardware error? Or is some onboard software (BIOS, BMC, etc.) failing?

Software: Nvidia drivers 169.21, CUDA (SDK/Toolkit) 1.1 x86_64, OS Windows 2003 x64.

Hardware: Intel S5000XVN (latest firmware), 2x Xeon E5345, EVGA 8800 GTX, PSU 1000W.

Thank you.

jordyvaneijk · February 1, 2008, 1:17pm

Maybe I don’t get it, but what is the question?

Did you modify the matrixMul kernel invocation in matrixMul.cu from this

matrixMul<<< grid, threads >>>(d_C, d_A, d_B, WA, WB);

to this

for (;;) { // Better to do while(true)?
    matrixMul<<< grid, threads >>>(d_C, d_A, d_B, WA, WB);
}

or what?

6O6EP · February 1, 2008, 3:54pm

jordyvaneijk, the question is: why does the system fail when I run this?

I wrote “if you modify the matrixMul() function in matrixMul_kernel.cu to run in an infinite loop”, which is not the same as “modify the runTest() function in matrixMul.cu to launch matrixMul() in an infinite loop”. With the modification you describe, I think nothing will happen. You must replace all the code in matrixMul() with “while(true);” and then try to start 10-20 instances of the app.

6O6EP1 · February 25, 2008, 12:13am

First of all, I am sorry for my English.

Last week I got a replacement mainboard and some video cards made by Gigabyte, Asus and Biostar (8500 GTs and 8600 GTs) for testing. The error remains with all of these cards. I tried swapping the CPUs, and running with a single CPU (each of the two I have in turn) and a single memory stick. I could replace the CPUs, but two defective CPUs at the same time seems very unlikely, so maybe the problem is in the NVIDIA driver or the mainboard BIOS?

Can anybody run this on similar hardware to see whether it fails?

/* Note: Make sure that all important work has been saved
   before you run this. */

/* Note for S5000XVN users: After 1-10 minutes of running this,
   the System LED on the mainboard (or, if an Intel chassis is used,
   on the front chassis panel) will turn solid amber. On the first
   system boot after you press Reset, you will see the message
   "NMI has been received - System Halted". Press Reset again, and
   on the second boot the CMOS "Error Manager" will show
   "8110 major 1" or 8110 with 8111. Then you can examine the SEL
   for error messages, and you will probably see messages saying
   "uncorrectable bus error". */

#include <windows.h>
#include <stdio.h>
#include <cutil.h>

DWORD WINAPI ThreadFunc(LPVOID lpParam);
__global__ void LoopForever();

#define MAX_THREADS 120

int main(int argc, char** argv)
{
    DWORD dwThreadId[MAX_THREADS];
    HANDLE hThread[MAX_THREADS];
    int i = 0;

    // Start MAX_THREADS host threads; each one launches the endless kernel.
    for (; i < MAX_THREADS; i++)
        hThread[i] = CreateThread(NULL, 0, ThreadFunc, NULL, 0, &dwThreadId[i]);

    // Note: WaitForMultipleObjects accepts at most MAXIMUM_WAIT_OBJECTS (64)
    // handles, so with MAX_THREADS above 64 this call fails and returns at once.
    WaitForMultipleObjects(MAX_THREADS, hThread, TRUE, INFINITE);

    for (i = 0; i < MAX_THREADS; i++)
        CloseHandle(hThread[i]);

    return 0;
}

DWORD WINAPI ThreadFunc(LPVOID lpParam)
{
    CUT_DEVICE_INIT();
    LoopForever<<< 256, 64 >>>();
    return 0; // missing return statement warning..
}

// Kernel that never returns: every thread spins forever.
__global__ void LoopForever()
{
    while (1);
}

CPU spec number: SLAC5.

Thank you.

6O6EP1 · February 25, 2008, 3:10am

I wrote MAX_THREADS=120, but the result is actually the same with 20.

6O6EP1 · February 29, 2008, 8:31pm

up

wumpus · March 1, 2008, 11:03am

I’ve had some “hard system hang” problems with kernels with eternal loops in them as well (in Linux), but was unable to reproduce them reliably. As the loops in question were due to a programming error, the problem went away when I added a better termination criterion :)
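A minimal sketch of what such a termination criterion can look like, assuming an iterative kernel that is supposed to stop on convergence: the loop also carries a hard iteration cap, so a bug in the convergence test cannot turn into an eternal loop. The kernel boundedSqrt, the cap MAX_ITERS and the flag hitCap are hypothetical names made up for this illustration, not code from this thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define MAX_ITERS 1000  // hypothetical hard cap on iterations per thread

// Newton iteration for sqrt(in[i]). The loop ends on convergence OR when the
// cap is reached, so it cannot spin forever even if the test never succeeds.
__global__ void boundedSqrt(const float* in, float* out, int* hitCap, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float a = in[i];
    float x = a > 1.0f ? a : 1.0f;  // starting guess
    int iter = 0;
    for (; iter < MAX_ITERS; ++iter) {
        float next = 0.5f * (x + a / x);
        bool converged = fabsf(next - x) <= 1e-6f * next;
        x = next;
        if (converged) break;
    }
    // Every writer stores the same value, so a plain store is enough here.
    if (iter == MAX_ITERS) *hitCap = 1;
    out[i] = x;
}

int main()
{
    const int n = 256;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = (float)(i + 1);

    float* d_in = NULL;
    float* d_out = NULL;
    int* d_hitCap = NULL;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMalloc(&d_hitCap, sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_hitCap, 0, sizeof(int));

    boundedSqrt<<<(n + 63) / 64, 64>>>(d_in, d_out, d_hitCap, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int h_hitCap = 0;
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_hitCap, d_hitCap, sizeof(int), cudaMemcpyDeviceToHost);
    printf("sqrt(%g) is about %g, iteration cap hit: %s\n",
           h_in[n - 1], h_out[n - 1], h_hitCap ? "yes" : "no");

    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_hitCap);
    return 0;
}
```

Reporting the cap through a flag lets the host tell a genuine result from one that was cut short, instead of finding out through a hung machine.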

6O6EP1 · March 9, 2008, 11:47am

I found that CUDA is not the only thing that can freeze the system, although only the CUDA code I wrote does it quickly (5-60 seconds, versus 10 minutes to 20 hours with normal 3D apps). It seems that running more than two heavy 3D applications simultaneously (for example VideocardStabilityTest and Quake4) can cause these errors. There are three types of error:
(1) The system freezes. After reboot there are messages like “SMI Timeout”, “uncorrectable bus error” and “Processor N Internal Error (IERR)” in the System Event Log (the onboard, OS-independent event log).
(2) Windows STOP error: nv4.sys (or nv4_dsp.sys, whatever the NVIDIA display driver is named) THREAD_STUCK_IN_DEVICE_DRIVER 0x000000ea (0xfffffadfe7529040, 0xfffffadfe75d4010, 0xfffffadfdb634ae0, 0x0000000000000029).
(3) The first and second errors simultaneously (yes, it is possible :)). In this case I do not see a BSOD. After reboot, Windows reports an unexpected shutdown and logs it in the Windows Event Log.

It looks like a bug, possibly in the driver.

To Moderators:
If possible, can you move this thread to “NVIDIA Forums > nZone > Hardware > ForceWare Drivers”? That would be a more appropriate place for it.

6O6EP1 · June 5, 2008, 12:40pm

In case anyone is interested in how this story ended:

BIOS (88;63;46).
The problem was fixed in the BIOS. It is probably the issue listed as «(WHQL) Common Scenario “Stress With IO” fails» in the fixes section of the BIOS release notes. There are no more errors like “uncorrectable bus error” or «SMI Timeout».

Driver.
The example I posted earlier in this thread now works normally (does not crash the system) only on 32-bit XP (driver version 175.16); it still fails on Windows 2003 x64 and Windows XP x64. On x64 Windows I tried all drivers from 169.21 to 175.16 and got nv4_mini.sys THREAD_STUCK_IN_DEVICE_DRIVER with each.

Sarnath · June 9, 2008, 6:08am

Thanks for posting! Glad to know it was fixed!

606EP, nice name. I recalled the thread on seeing your name… :-)