Uncorrectable ECC error

I have set up the code below using dynamic parallelism trying to reproduce a problem I had on a larger project and I bumped into the issues below.

I have a four Kepler K20c, OS Windows 7, CUDA 5.0. If I set GPU #0, then I receive

GPUassert: uncorrectable ECC error encountered C:/Users/Matlab_User/Desktop/StackOverflow/StackOverflow/kernel.cu 57

where line 57 is the allocation of d_x. If I set GPU #1-#3, then I receive

GPUassert: unspecified launch failure C:/Users/Matlab_User/Desktop/StackOverflow/StackOverflow/kernel.cu 64

for the kernel launch.

I have the following questions:

  1. What could be the issue of the ECC problem? Hardware problem on GPU #0?
  2. What could be the issue of the unspecified launch failure?

Thanks a lot in advance.

#include <stdio.h>
#include <conio.h>

#define K 6
#define BLOCK_SIZE 256

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, char *file, int line, bool abort=true)
{
   if (code != cudaSuccess) 
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
	  if (abort) { getch(); exit(code); }
   }
}

int iDivUp(int a, int b) { return ((a % b) != 0) ? (a / b + 1) : (a / b); }

__global__ void child_kernel(double* P1, const double cc_diff1) 
{
	int m = threadIdx.x;

	P1[m] = K*K-(cc_diff1-(m-K))*(cc_diff1-(m-K));
}

__global__ void parent_kernel(const double* __restrict__ x, int M)
{
	int i = threadIdx.x + blockDim.x * blockIdx.x;
   
	if(i<M) {

		double cc_points1=2.*x[i];
		double r_cc_points1=rint(cc_points1);
		const double cc_diff1 = cc_points1-r_cc_points1;

		double* P1 = new double[13];

		dim3 dimBlock(2*K+1,2*K+1); dim3 dimGrid(1,1);
	
		child_kernel<<<dimGrid,dimBlock>>>(P1,cc_diff1);

		for(int m=0; m<2*K+1; m++) 
			printf("%f %f\n",P1[m],K*K-(cc_diff1-(m-K))*(cc_diff1-(m-K)));
	}
}

int main() {

	const int M = 190;
	
	gpuErrchk(cudaSetDevice(1));	
	
	double* x = (double*)malloc(M*sizeof(double));
	for (int i=0; i<M; i++)
		x[i] = i;

	double* d_x; gpuErrchk(cudaMalloc((void**)&d_x,M*sizeof(double)));

	gpuErrchk(cudaMemcpy(d_x,x,M*sizeof(double),cudaMemcpyHostToDevice));

	dim3 dimBlock(BLOCK_SIZE,1); dim3 dimGrid(iDivUp(M,BLOCK_SIZE));
	parent_kernel<<<dimGrid,dimBlock>>>(x,M);
	gpuErrchk(cudaPeekAtLastError());
	gpuErrchk(cudaDeviceSynchronize());
	
	return 0;
}

Two unexpected errors of this nature in the same system would lead me to suspect a mismatch in the software components (leading to a bogus error report), although the second error may be a follow-on to the first. My observation is that once an uncorrectable ECC error is detected, the driver blocks all further CUDA calls. But that is for a single GPU system, I don’t have experience with multi-GPU machines.

Does nvidia-smi actually show an uncorrectable ECC error? If not, I would take that as confirmation of the “corrupt driver stack causing bogus error” hypothesis. If this were my system, I would try updating to the very latest driver package, and cold booting the machine.

My SM_35 GK208 GT 740m laptop GPU doesn’t like that code either. I only have 1 NVIDIA GPU on this laptop though. I just changed line 51 to: gpuErrchk(cudaSetDevice(0));

GPUassert: unknown error C:/Users/User/Desktop/jfsebastian/test.cu 64

Edit: You probably meant d_x instead of x on line 62, that would do it. Thought it didn’t change the outcome because I had done a Debug build and I was running the Release… >.<

I’ve been bitten by putting the host variable instead of the device one in the kernel launch a few times… lol

@vacaloca: Thank you very much for pointing out that mistake to me. It was a late night assault to my original problem. Fixing the error removed the unspecified launch failure for the other cards, but the ECC error for GPU #0 remained.

@njuffa: Thank you for the suggestions. I have done several other tests today and this is what I found out:

  1. The ECC issue of yesterday was deterministic. Every time I launched the simple code I received the error.
  2. Today I restarted the machine and the error disappeared. But I remembered that yesterday I was doing some other operations before launching the simple code. So, I started stressing the machine with some other operations. For example, opening a Matlab window and running the GPU calculation of the SVD of large matrices by Accelereyes Jacket, running another code leading to out of memory etc. I noticed that after having stressed the machine, my code started to produce completely random results.
  3. My previous driver was 327.23. I then installed the newest possible one, namely 332.21, but after that I was not able to launch any CUDA API. Accelereyes Jacket also reported an ECC error. I then reinstalled the 327.23. I must say that in this case I have not performed a cold reboot.
  4. I started monitoring with the nvidia-smi, which did not report any ECC error. I then started to stress again the machine with the SVD and noticed that the GPU temperature raised to 55° and above. I then run the simple code after completion of the SVD (i.e., with the GPU temperature of about 55°) and my code again started to produce random results.
  5. With a "cold restart" of the machine, the GPU temperature is around 30°. However, even after much time having stressed the system, the GPU temperature reaches a plateau of around 50°. I have finished stressing the computer half an hour ago, and nvidia-smi still reports 52°. If I now run again the simple code, I receive again random results.
  6. I have the "exotic" feeling, but I have no evidence of this and I would not be able to explain why, that the randomness of the results in the above scenarios affects only codes with dynamic parallelism.

On (2): I used Jacket quite a while back, and I remember that the stability was questionable, sometimes going from version to version even on the same code – usually at any point I received an out of memory error or other context errors I had to reboot my machine to get correct results (or even being able to run the code again. Are the code errors limited to the GPU that had the ECC error, or are the wrong results happening on all of the GPUs?

On (3): I installed a different driver a few weeks ago and experienced similar issues – CUDA reported error (30). Uninstalling the driver and re-installing clean fixed the issue.

Not sure if you’re able to test in Linux, but there is a stress test code here that runs in Linux:
http://wili.cc/blog/gpu-burn.html Perhaps it could be signs of hardware issues after GPU gets warm/hot?

I think you might want to separate out the issues one by one. Vacaloca’s response seems to indicate that the dynamic parallelism code may not work for other reasons, so it may simply be a bug in the code that causes the unspecified launch failures. I would suggest to use other CUDA applications to test for now.

A temperature of 55 deg C does not strike me as unusual. The fact that nvidia-smi does not report any ECC errors would bolster my hypothesis that the software stack could be messed up, as does the fact that with the latest driver package CUDA apparently stops working altogether. Upgrading the driver underneath an older CUDA version should work, by design, and I am not aware of issues in that regard.

You might want to proceed by temporarily removing the suspect GPU #0 from the system until driver and/or CUDA runtime issues are resolved. My approach would be to re-install the entire software stack: First install CUDA 5.5, then install the latest WHQL (non-beta) drivers.

Once the software stack is stable and you can successfully run CUDA apps again, add the suspect GPU back to the system. Make sure this GPU is correctly seated in the PCIe slot, all power cables are connected correctly, and air flow is unobstructed. I would assume the system uses a sufficiently beefy power supply, given that it has four K20c?

[Later:] I just upgraded my local Windows 7 machine (with a Quadro, not a Tesla) to the 332.21 WHQL driver (release date 2014/1/8) and compute applications seem to be working just fine.

We have two servers with two K20’s in both and we have struggled with ECC errors for months. After server reboot or card reset they work ok for a while, but then suddenly every operation ends with ecc error.

All four cards showed these errors, we even replaced two with totally new cards and even they showed the same symptoms. So the problem for us was in the server hardware or nvidia driver. As it turned out, after consulting server manufacturer and nvidia engineers, it was the driver. Using 319 series driver works (319.49 if I recall correctly), but the latest 330 does not work. We are using redhat linux. Hope this helps.

-M

Let me first thank you all very much for the very useful replies.

I agree with njuffa: there are two separate problems, the algorithmic one and the uncorrectable ECC error along with random results. Fixing the former is a necessary step to dealing with the latter.

As noticed by vacaloca, there was a macroscopic mistake in my original code, which I fixed. This, in turn, fixed the unspecified launch failure on the other cards, but the ECC error remained. Now, I’m using this smaller version of my test code:

#include <stdio.h>
#include <conio.h>

#define K 6
#define BLOCK_SIZE 256

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, char *file, int line, bool abort=true)
{
   if (code != cudaSuccess) 
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
	  if (abort) { getch(); exit(code); }
   }
}

int iDivUp(int a, int b) { return ((a % b) != 0) ? (a / b + 1) : (a / b); }

__global__ void child_kernel(double* P1) 
{
	int m = threadIdx.x;

	P1[m] = (double)m;
}

__global__ void parent_kernel(double* __restrict__ x, int M)
{
	int i = threadIdx.x + blockDim.x * blockIdx.x;
   
	if(i<M) {

		double* P1 = new double[13];

		dim3 dimBlock(2*K+1,2*K+1); dim3 dimGrid(1,1);
	
		__syncthreads();
		child_kernel<<<dimGrid,dimBlock>>>(P1);
		cudaDeviceSynchronize();

		for(int m=0; m<2*K+1; m++) printf("%f %f\n",P1[m],(double)m);

	}
}

int main() {

	const int M = 19000;
	
	//gpuErrchk(cudaSetDevice(0));	
	
	double* x = (double*)malloc(M*sizeof(double));
	for (int i=0; i<M; i++)
		x[i] = (double)i;

	double* d_x; gpuErrchk(cudaMalloc((void**)&d_x,M*sizeof(double)));

	gpuErrchk(cudaMemcpy(d_x,x,M*sizeof(double),cudaMemcpyHostToDevice));

	dim3 dimBlock(BLOCK_SIZE,1); dim3 dimGrid(iDivUp(M,BLOCK_SIZE));
	parent_kernel<<<dimGrid,dimBlock>>>(d_x,M);
	gpuErrchk(cudaPeekAtLastError());
	gpuErrchk(cudaDeviceSynchronize());
	
	getch();
	
	return 0;
}

The purpose of the code is to calculate the same array both by a for loop in the parent kernel and in the child kernel. So, the printf in the parent kernel should return the same values for each printed row. I do not see any evident synchronization issue in the above code, but of course I may be wrong. Perhaps, someone may with to be so kind to point me any mistake.

Concerning the latter point, from time to time I still receive the uncorrectable ECC error message, but not at every launch. When I do not receive this error message, sometimes the code above returns correct results and sometimes those are completely random (but this could be due to a problem of mine, as mentioned above).

The experience witnessed by mpartio seems to indirectly confirm the original hypothesis by njuffa that there is a problem with the driver, at least concerning the ECC error. I think that the best I can do is to update my system to CUDA 5.5 and reinstall the driver again. In the meatime, I would like to be sure about the testing code.

Again, thank to everyone.

The reason for the random values provided by the dynamic parallelism code was due to exceeding the launch pending limit. This was answered by Robert Crovella at http://stackoverflow.com/questions/21109792/wrong-results-of-a-cuda-dynamic-parallelism-code. I will now try to investigate a bit more the reproducibility of the uncorrectable ECC error.