Resetting GPU without reboot?

Hi

I am in the process of porting a large code to CUDA. While I was working on one of the routines, I probably did something stupid (I couldn't figure out what it was, though). Running the program somehow left my GPUs in a bad state, so that every subsequent run (even without the bad part of the code) produced garbage results, even in totally trivial kernel calls like copying data from a double array to a single-precision float array. All this without any error message from CUDA.
After a reboot everything was fine again (as long as I keep the bad code out of the program).
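
To give an idea of how trivial these kernels are: even something like the following plain double-to-float copy produced garbage after the bad run (a minimal sketch with made-up names, not the actual routine from my code):

[code]
// Minimal sketch of one of the "trivial" kernels that still produced
// garbage once the GPU was in the bad state (names are placeholders).
__global__ void double_to_float(const double* in, float* out, int n)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n)
    out[i] = (float) in[i];   // plain conversion, nothing else
}
[/code]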

I have now several questions:
Are there any rules of thumb for what kind of problem can cause this behavior (as a hint for what I have to look for in my code, since I was not yet able to figure out what the error is)?

Is there a better way to reset the GPU than rebooting the whole computer? (I actually don't have root rights, but have to call the HPC admin of our university ;-) )

Is there a chance that these problems will become less frequent with the next GPU generation?

Best regards
ceearem

Maybe try logging in as a different user and running it again…

I have traced the problem down to one specific line of code.

Here is the full kernel. The braces following the if(false) statement contain a first version of the kernel, which runs fine. I now want to coalesce the atomicAdds (access conflicts are rare, since there are about 20 times more array elements than threads). The second part runs fine as well, except if I reintroduce the atomicAdd (which is commented out right now). Directly above it I also check whether I have an access outside the array dimensions, but that is not the case.

If anyone has an idea, please let me know.

Ceearem

[code]
__global__ void make_rho_kernel(int* flag, int read_threads_at_same_time)
{
  int i,l,m,n,nx,ny,nz,mx,my,mz,a,b;

  // clear 3d density array
  // loop over my charges, add their contribution to nearby grid points
  //   (nx,ny,nz) = global coords of grid pt to "lower left" of charge
  //   (dx,dy,dz) = distance to "lower left" grid pt
  //   (mx,my,mz) = global coords of moving stencil pt

  // int nzxy=blockIdx.x*gridDim.y+blockIdx.y;
  // i=pppm_grid_ids[nzyx*blockDim.x+threadIdx.x];

  int nelements=nupper-nlower+1;
  int* idx=(int*) sharedmem;
  int* sdensity_brick_int=&idx[blockDim.x];
  PPPM_FLOAT* srho_coeff=(PPPM_FLOAT*) &sdensity_brick_int[nelements*blockDim.x];
  if(threadIdx.x<order*(order/2-(1-order)/2+1))
    srho_coeff[threadIdx.x]=rho_coeff[threadIdx.x];
  __syncthreads();

  i=blockIdx.x*blockDim.x+threadIdx.x;

  if(false)
  {
    if(i < nlocal) {
      PPPM_FLOAT dx,dy,dz,x0,y0,z0;
      nx = part2grid[i];
      ny = part2grid[i+nmax];
      nz = part2grid[i+2*nmax];
      dx = nx+shiftone - (_x[i]-_boxlo.x)*delxinv;
      dy = ny+shiftone - (_x[i+nmax]-_boxlo.y)*delyinv;
      dz = nz+shiftone - (_x[i+2*nmax]-_boxlo.z)*delzinv;

      z0 = delxinv*delyinv*delzinv * _q[i];
      for (n = nlower; n <= nupper; n++)
      {
        mz = n+nz;
        y0 = z0*rho1d(n,dz,srho_coeff);
        for (m = nlower; m <= nupper; m++)
        {
          my = m+ny;
          x0 = y0*rho1d(m,dy,srho_coeff);
          for (l = nlower; l <= nupper; l++)
          {
            mx = l+nx;
            int mzyx=((mz-nzlo_out)*(nyhi_out-nylo_out+1)+my-nylo_out)*(nxhi_out-nxlo_out+1)+mx-nxlo_out;

            a=int(x0*rho1d(l,dx,srho_coeff)*density_intScale);
            b=(atomicAdd(&density_brick_int[mzyx],a)|a);
            if(((b)&(0x7c000000))&&(not((b)&(0x80000000))))
            {
              flag[1]++;
              if((b)&(0x60000000)) flag[0]++;
            }
            __syncthreads();
          }
        }
      }
    }
    return;
  }

  i=blockIdx.x*blockDim.x+threadIdx.x;
  {
    PPPM_FLOAT dx,dy,dz,x0,y0,z0;
    if(i < nlocal)
    {
      nx = part2grid[i];
      ny = part2grid[i+nmax];
      nz = part2grid[i+2*nmax];
      dx = nx+shiftone - (_x[i]-_boxlo.x)*delxinv;
      dy = ny+shiftone - (_x[i+nmax]-_boxlo.y)*delyinv;
      dz = nz+shiftone - (_x[i+2*nmax]-_boxlo.z)*delzinv;
      z0 = delxinv*delyinv*delzinv * _q[i];
    }
    else
    {
      nx=ny=nz=1; dx=dy=dz=0.1;
    }
    __syncthreads();

    for (n = nlower; n <= nupper; n++)
    {
      mz = n+nz;
      y0 = z0*rho1d(n,dz,srho_coeff);
      for (m = nlower; m <= nupper; m++)
      {
        //my = m+ny;
        x0 = y0*rho1d(m,dy,srho_coeff);
        if(i<nlocal)
        {
          idx[threadIdx.x]=((mz-nzlo_out)*(nyhi_out-nylo_out+1)+my-nylo_out)*(nxhi_out-nxlo_out+1)+nx+nlower-nxlo_out;
          for (l = nlower; l <= nupper; l++)
          {
            sdensity_brick_int[threadIdx.x*nelements+l-nlower]=int(x0*rho1d(l,dx,srho_coeff)*density_intScale);
          }
        }
        else idx[threadIdx.x]=-1;
        __syncthreads();

        //  if((idx[threadIdx.x]>=(nzhi_out-nzlo_out+1)*(nyhi_out-nylo_out+1)*(nxhi_out-nxlo_out+1))||
        //     (idx[threadIdx.x]<0)) flag[2]++;

        for(int ii=0;ii<blockDim.x;ii+=read_threads_at_same_time)
        {
          int kk=threadIdx.x/nelements;

          if((threadIdx.x<nelements*read_threads_at_same_time)&&(kk+ii<blockDim.x)&&(idx[ii+kk]>-1))
          {
            a=sdensity_brick_int[ii*nelements+threadIdx.x];
            if((idx[ii+kk]+threadIdx.x-kk*nelements>=(nzhi_out-nzlo_out+1)*(nyhi_out-nylo_out+1)*(nxhi_out-nxlo_out+1))||
               (idx[ii+kk]+threadIdx.x-kk*nelements<0)) flag[2]++;
            //b=(atomicAdd(&density_brick_int[idx[ii+kk]+threadIdx.x-kk*nelements],a)|a);
            b=a;
            if(((b)&(0x7c000000))&&(not((b)&(0x80000000))))
            {
              flag[1]++;
              if((b)&(0x60000000)) flag[0]++;
            }
          }
        }
        __syncthreads();   //*/
      }
    }
  }
}
[/code]

What's your device (model and compute capability level)?

Just a quick guess: if your mzyx variable ends up negative or too large, you'll be writing through shared-memory pointers you don't own via shared atomics.
It's possible the shared-memory pointer could even end up outside the 16 KB range of the physical shared memory.
This is obviously bad. I don't think it should kill your GPU, but you never know. I'd put a quick range check on that mzyx to see whether that's the case.
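
Something along these lines would do (just a sketch that reuses your kernel's own variables; nvals is a temporary name, and I'm counting failures in flag[2] the same way your commented-out check already does):

[code]
// Quick range check on mzyx before the atomicAdd (sketch only).
int nvals = (nzhi_out-nzlo_out+1)*(nyhi_out-nylo_out+1)*(nxhi_out-nxlo_out+1);
if (mzyx < 0 || mzyx >= nvals)
  flag[2]++;                                   // out-of-range index detected
else
  b = (atomicAdd(&density_brick_int[mzyx], a) | a);
[/code]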

Hi

My devices are GTX 280s, and I am using CUDA 2.3 with driver 190.18 on Red Hat Linux (CentOS 5.something, I think).

Regarding the range check of mzyx: the upper half of the code (before the return statement halfway through) works as intended, so mzyx is fine at that point. In the second part I try to achieve the same result, but by first caching the results of each thread's innermost loop in shared memory and then writing them out in a coalesced manner. Since I get the bad behaviour only if I do the atomicAdd in that part, I checked the index there with:

[code]
if((idx[ii+kk]+threadIdx.x-kk*nelements>=(nzhi_out-nzlo_out+1)*(nyhi_out-nylo_out+1)*(nxhi_out-nxlo_out+1))||
   (idx[ii+kk]+threadIdx.x-kk*nelements<0)) flag[2]++;
[/code]

where (nzhi_out-nzlo_out+1)*(nyhi_out-nylo_out+1)*(nxhi_out-nxlo_out+1) is the size of the array density_brick_int. flag[2] never returned anything other than zero, so that should be fine.
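
Just to spell out what I am trying to do in that second part, here is the staging idea in stripped-down form (a sketch with placeholder names such as staged_scatter_add, base_idx, vals and NEL, without the overflow-flag logic, and assuming blockDim.x is a multiple of NEL):

[code]
#define NEL 5   // stands in for nelements = nupper-nlower+1 (value assumed)

// Each thread parks its NEL partial results plus a base index in shared
// memory; the block then flushes them so that NEL consecutive threads add
// to NEL consecutive addresses. Launch with
// blockDim.x*(1+NEL)*sizeof(int) bytes of dynamic shared memory.
__global__ void staged_scatter_add(int* dest, const int* base_idx,
                                   const int* vals, int n)
{
  extern __shared__ int smem[];
  int* s_idx = smem;                    // blockDim.x base indices
  int* s_val = &s_idx[blockDim.x];      // blockDim.x * NEL staged values

  int i = blockIdx.x*blockDim.x + threadIdx.x;

  // 1) stage: write the base index (-1 = no work) and NEL values per thread
  s_idx[threadIdx.x] = (i < n) ? base_idx[i] : -1;
  for (int l = 0; l < NEL; l++)
    s_val[threadIdx.x*NEL + l] = (i < n) ? vals[i*NEL + l] : 0;
  __syncthreads();

  // 2) flush: groups of NEL consecutive threads each serve one staged
  //    entry, so their atomicAdds hit consecutive dest addresses
  int kk = threadIdx.x / NEL;           // which staged entry this lane serves
  int ll = threadIdx.x % NEL;           // which of its NEL values
  int groups = blockDim.x / NEL;        // staged entries flushed per pass
  for (int ii = 0; ii < blockDim.x; ii += groups) {
    int e = ii + kk;                    // staged entry handled in this pass
    if (e < blockDim.x && s_idx[e] > -1)
      atomicAdd(&dest[s_idx[e] + ll], s_val[e*NEL + ll]);
  }
}
[/code]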

Best regards

Ceearem

Hi, I found the problem: the line //my=m+ny; should be my=m+ny; (obviously, since otherwise my is not defined in the lower part).
There was also an additional error in my flag retrieval (I only downloaded two instead of three elements, omitting the important part about flag[2]). Anyway, I am still interested in whether there is a possibility to reset the GPUs after such an unfortunate incident other than rebooting the computer. The GPU is used in compute-exclusive mode, by the way. Reloading the driver module didn't help either.
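
For reference, the corrected flag download on the host side now looks roughly like this (a sketch; host_flag and dev_flag are placeholder names for my host buffer and the device pointer that is passed to the kernel as flag):

[code]
// Copy back all three error counters (sketch). Previously only
// 2*sizeof(int) was transferred, so flag[2] never reached the host.
int host_flag[3];
cudaMemcpy(host_flag, dev_flag, 3*sizeof(int), cudaMemcpyDeviceToHost);
[/code]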

Best regards
Ceearem

P.S. I got reboot rights for further development, but this is still not how one should do these things, since the computer is shared and only accessed remotely.

This was fixed in a later 190.xx driver. Maybe this one? http://www.nvidia.com/object/linux_display_amd64_190.32.html

Oh, very nice if it works. My admin will be most pleased to hear that. [He has quite some reservations about technology where one user of a server can potentially play havoc with other people's calculations - and I am really trying hard to convince him that CUDA is the coolest thing in the HPC world right now. I guess a fix for that problem will help a lot ;-) ]

I'll try with the unfixed code tomorrow and check whether it no longer leaves the GPU stuck [I need the admin to install the newer driver first].

best regards

Ceearem

Today I tested the new driver, and it indeed solved the problem: after running the broken code (which obviously produces garbage results), subsequent runs with the correct code now produce correct results. With the old driver, those subsequent correct runs failed as well.
Thanks for the hint about the new driver.

I am more interested in resetting the video card without rebooting the PC, because I see similar problems that bother me a lot.

When I tried to program with CUDA before, I mistakenly wrote data to some wrong location on the video card, and the rendering on my PC became messed up. Every time I have to reboot my PC to reset it; any solution without a reboot would be great.

My driver version: 190.38

Video card: Quadro FX 1700

Thanks

A reset can be done as follows:

In Windows, go to Device Manager -> Display adapters, right-click the card you want to reset and disable it, then re-enable it using keyboard shortcuts (CTRL+ …) while the screen is off - the card is reset…

I'm using this procedure - it is the fastest way to reset the GPU :-)