Computer restarts on parallel CUDA initialization on multiple GPUs

We have a problem using more than one GPU in parallel on Windows 10. Our program decomposes the initial numerical problem into a set of independent problems and then runs N numerical engines (EXEs), where N is the number of GPUs in the system. One GPU is assigned to each instance of the numerical engine, which calls a DLL that initializes the GPU and performs the calculations on it.

As both the interface and engine code are very complex, we created a very simple program that shows the same behavior.

The first piece of code is the EXE that gets the number of GPUs and launches “GPUstart.exe”. The code is given below.

#include <windows.h>
#include <stdio.h>
#include <iostream>
#include <stdlib.h>
#include <string>
#include <cuda_runtime_api.h>
#include <cublas_v2.h>

typedef long long tipI;

using namespace std;

int main(){
int nGPUs;
// query the device count once and check the return status
if (cudaGetDeviceCount(&nGPUs) != cudaSuccess){
return 3;
}
cout << endl << "Number of GPUs: " << nGPUs << endl << endl;

// 31.10.2017.	
STARTUPINFO* si = new STARTUPINFO[nGPUs];
PROCESS_INFORMATION* pi = new PROCESS_INFORMATION[nGPUs];

for (int GPUindex = 0; GPUindex < nGPUs; GPUindex++){
	// set the size of the structures
	ZeroMemory(&si[GPUindex], sizeof(si[GPUindex]));
	si[GPUindex].cb = sizeof(si[GPUindex]);
	ZeroMemory(&pi[GPUindex], sizeof(pi[GPUindex]));

	// start the program up
	string appName = "GPUstart.exe";
	// CreateProcess may modify the command line it receives, so build it in a writable buffer
	char argv[512];
	sprintf(argv, "\"%s\" %d %d", appName.c_str(), GPUindex, nGPUs);


	BOOL FLAG = CreateProcess(
		appName.c_str(),
		argv,
		NULL,
		NULL,
		FALSE,
		0,
		NULL,
		NULL,
		&si[GPUindex],
		&pi[GPUindex]
	);
	if (!FLAG){
		cout << "CreateProcess failed for GPU " << GPUindex << " (error " << GetLastError() << ")" << endl;
	}
}
// Wait for all child processes to finish, then close the handles.
for (int GPUindex = 0; GPUindex < nGPUs; GPUindex++){
	WaitForSingleObject(pi[GPUindex].hProcess, INFINITE);
	DWORD exitCode = 0;
	GetExitCodeProcess(pi[GPUindex].hProcess, &exitCode);
	cout << "Engine for GPU " << GPUindex << " finished with exit code " << exitCode << endl;
	CloseHandle(pi[GPUindex].hProcess);
	CloseHandle(pi[GPUindex].hThread);
}
delete[] si;
delete[] pi;

system("pause");

return 0;

}

The code for “GPUstart.exe” is given below. As can be seen, it just calls a function from “IGPU.dll”.

#include <windows.h>
#include <stdio.h>
#include <iostream>
#include <stdlib.h>
#include <cuda_runtime_api.h>
#include <cublas_v2.h>

typedef long long tipI;

using namespace std;

typedef void (__cdecl* picsgpu)(int* pGPUindex, int* pNprocs);

int main(int argc, char* argv[]){

HINSTANCE hDLL;
hDLL = LoadLibrary("IGPU.dll");

if (hDLL == NULL){
	cout << endl << "Can not load \"IGPU.dll\"" << endl;
	system("pause");
	return 1;
}

picsgpu func = (picsgpu)GetProcAddress(hDLL, "icsgpu");
if (func == NULL){
	cout << endl << "Can not load function \"icsgpu\" from \"IGPU.dll\"" << endl;
	system("pause");
	return 2;
}

if (argc < 3){
	cout << endl << "Usage: GPUstart.exe <GPUindex> <nGPUs>" << endl;
	return 3;
}
int index = atoi(argv[1]);
int N = atoi(argv[2]);

func(&index, &N);

FreeLibrary(hDLL);

return 0;

}

Finally, IGPU.dll just calls two CUDA functions, cudaSetDevice and cudaDeviceReset, and writes a trace file. The code is given below.

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <string>
#include <cuda_runtime_api.h>
#include <cublas_v2.h>

typedef long long tipI;

using namespace std;

extern "C"
__declspec(dllexport) void __cdecl icsgpu(int* pGPUindex, int* pNprocs){ // 28.03.2017.

char nmbr[256];
itoa(*pGPUindex, nmbr, 10);
string fileName = "_trace_";
fileName.append(nmbr);
fileName.append(".txt");
FILE* trace_FPR = fopen(fileName.c_str(), "w"); // truncate any previous trace file
fclose(trace_FPR);



	int GPUindex = *pGPUindex;
	int nGPUs = *pNprocs;


	trace_FPR = fopen(fileName.c_str(), "a");
	fprintf(trace_FPR, "\n icsgpu - begin: GPUindex = %d, nGPUs = %d \n", GPUindex, nGPUs);
	fclose(trace_FPR);

	cudaError_t ce = cudaSetDevice(GPUindex);
	trace_FPR = fopen(fileName.c_str(), "a");
	if (ce != cudaSuccess){
		fprintf(trace_FPR, "\n cudaSetDevice - error: %s \n", cudaGetErrorString(ce));
	} else {
		fprintf(trace_FPR, "\n cudaSetDevice - success \n");
	}
	fclose(trace_FPR);


	if (cudaDeviceReset() != cudaSuccess){
		trace_FPR = fopen(fileName.c_str(), "a");
		fprintf(trace_FPR, "\n cudaDeviceReset - error \n");
		fclose(trace_FPR);
	}

	trace_FPR = fopen(fileName.c_str(), "a");
	fprintf(trace_FPR, "\n icsgpu - end: GPUindex = %d \n", GPUindex);
	fclose(trace_FPR);


return;	

}

After running the code on a machine with 8 GTX-680 GPUs and Windows 10, the machine becomes unresponsive and restarts after a few minutes. All trace files (created by the different DLL instances, i.e. for different GPUs) look alike; each contains only one line:

for the 0th GPU: “icsgpu - begin: GPUindex = 0, nGPUs = 8”

for the 1st GPU: “icsgpu - begin: GPUindex = 1, nGPUs = 8”

So the program gets stuck in the function “cudaSetDevice”. Please advise.

Your description is unclear: “the machine becomes unresponsive and restarts after a few minutes”.

(1) If the machine becomes unresponsive, your application dies, and the display briefly goes black before the GUI returns after a few minutes, your application likely contains a kernel that triggers the operating system’s watchdog timer. The watchdog timer kills any application that blocks the GUI for an extended period of time, typically two seconds. Compute kernels block GUI updates, so look for kernels with execution times longer than two seconds.
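As an aside: the timeout of this GUI watchdog (TDR) is adjustable on Windows through documented registry values under GraphicsDrivers, unlike the DPC watchdog that comes up later in this thread. A sketch of a .reg fragment raising it from the default 2 seconds to 10 seconds (machine-wide, requires a reboot; use with care, and only for debugging):

```reg
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
; TdrDelay: seconds the GPU may remain unresponsive before timeout detection and recovery
"TdrDelay"=dword:0000000a
; TdrDdiDelay: seconds a thread may stay inside the display driver before a timeout
"TdrDdiDelay"=dword:0000000a
```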

(2) If the machine actually restarts (reboots), then you likely have an insufficiently sized PSU. Eight GPUs draw a lot of power when they run flat out. If the PSU cannot supply that power, the supply voltage drops, the PWRGOOD signal to the motherboard goes to zero, and a reboot is initiated. For reliable operation, your PSU should be sized such that the total rated power of all system components does not exceed 60% of the PSU’s power rating. With 8x GTX 680 (195W per NVIDIA specs), total system power is likely around 1800W, so you need 3000W at the PSU. An 80 PLUS Platinum rated PSU is recommended.

Yes, I was unclear about the machine reboot. It becomes unresponsive, starts rebooting after a few seconds, and the GUI returns after a few minutes.
There is no power issue on the machine. The GPUs are placed in a Cubix Xpander and the power supply is 3000 W. Also, if we remove 4 or even 6 GPUs from the box, we see the same problem. So the problem still occurs with only 2 GPUs.

Do you have physical access to the machine when this occurs? If your issue were a software problem leading to reboot, I would expect Windows to halt with a blue screen and wait for operator input before it reboots. For what it is worth, I haven’t seen anything like this happen based on GPU drivers since 2010 or thereabouts.

If you stand right next to it and you actually see the machine rebooting (powers down, then up; shows BIOS startup screen, Windows startup screen) the moment any CUDA-based application kicks in, my best guess is you have some sort of issue with the hardware triggered by heavy GPU use. Anything from connectors that are not fully plugged in, to cracked traces on some PCB, to defective semiconductor components.

I don’t know what a Cubix Xpander is. Presumably some sort of PCIe slot extender? If so, my advice would be not to use such hardware. PCIe signal quality suffers with each additional connector; for that reason, a regular slot alone already counts as one electrical load on the signal traces. As for the power supply, there are more things to worry about than just total wattage. When in doubt, ask a power supply expert (which I am not). Avoid Y-splitters and converters in PCIe power cables.

One check you might want to do is to plug a single GPU into a regular PCIe slot with no expander hardware connected. Does the system work in that configuration? If so, start adding components one by one until the system stops working. The last component added is at fault. If the problem only occurs when using the Cubix Xpander, but not when GPUs are plugged into the regular PCIe slots on the motherboard, consider contacting Cubix for assistance (BTW, looking at their website now, I only see Xpanders with up to 1500W PSUs; are you using two Xpanders?).

Thanks for your comments!

I have access to the machine. The error log created after I run my program says: “The computer has rebooted from a bugcheck. The bugcheck was 0x00000133”.
We are using this Xpander:
https://www.cubix.com/wp-content/uploads/2016/08/Xpander-Fiber-8-5URP.pdf

I am pretty sure (actually, completely sure) that the problem is not in the hardware. We run very demanding GPU programs on this machine without any problem.
We didn’t have any problems with Windows 7 (though under Windows 7 we used only 7 GPUs, the maximum for that OS), even with the code given above.
Finally, and most important: if I change the code given in my first post so that the first EXE does not launch other EXEs but directly creates 8 threads and calls “IGPU.dll” from those threads, everything works fine (both cudaSetDevice and cudaDeviceReset work without any problem).
The problem appears only in the case I described: the first EXE launches N other EXEs and each of them calls the DLL for CUDA initialization.
Can you compile the code I posted above on your machine and run it? If you have Windows 10 and more than one GPU, I am sure you will see the effect I described.

I don’t have Windows 10 and I typically don’t run random code posted on the internet. Is “bug check” the modern-day equivalent of the old Windows blue screen? If so, I would expect the bug check procedure to dump a detailed log file somewhere, maybe that can help you target the source of the error.

As I said, I haven’t seen any blue screens caused by NVIDIA drivers in many years. Which doesn’t mean such a thing could not happen. You can always file a bug with NVIDIA, but given that any bug handling there starts with reproducing the problem in-house, which in your case requires a special $10K external box, I wouldn’t expect this to be a quick process.

I tested the same code on different machines with Windows 10, and the described problem occurs only on the machine with the GPU Xpander.
On the other hand, when Windows 7 is installed on the machine with the GPU Xpander, I have no problem.
So, as far as we have tested, the problem occurs only on machines with Windows 10 and the GPU Xpander.

Does the use of the Xpander require driver software?

Since Windows 10 uses a different driver model than Windows 7 (WDDM 2.0 vs WDDM 1.x) it stands to reason that there will be differences in the NVIDIA driver stack between these two operating systems. If however the Xpander is a PCI extender that is completely transparent to software, it should not matter whether the GPUs are plugged in directly to the motherboard or connected via the Xpander.

This is obviously not the case, so it seems the Xpander isn’t transparent to existing driver software and possibly injects another software layer somewhere.

If the vendor of the GPU Xpander doesn’t have any advice for you (based on issues encountered by previous customers), you could always file a bug report with NVIDIA. I assume you are already running with the latest driver package available.

The given blue screen error code (0x00000133) is a DPC watchdog violation (DPC_WATCHDOG_VIOLATION).

I am not sure if the timeouts for this watchdog timer are user configurable.

Christian

I spent some time trying to find a way to increase watchdog timer period, and it seems to me that it is not user configurable.

I have never seen a DPC watchdog timer timeout before. A quick internet search seems to indicate that this is because they only occur on Windows 8.x and Windows 10, which I don’t use.

I see some indications that these timeouts were particularly common when Windows 10 first came out, so my only massively handwave-y advice would be to install the latest updates for Windows 10 and the latest NVIDIA drivers. I will not even try to speculate how such a timeout could be connected to the use of the Xpander.