Resetting GPU without reboot?

ceearem · September 23, 2009, 9:45am

Hi

I am in the process of porting a big code to CUDA. While I was working on one of the routines, I probably did something stupid (I couldn’t figure out what it was though). Running the Program somehow left my GPUs in a bad state so that every subsequent run (even without the bad part of the code) produced garbage results even in totally trivial kernel calls like copying data from a double array to a single precision float array. All this without any error message of CUDA.
After a reboot everything was fine again (as long as I keep the bad code out of the program).

I have now several questions:
Are there any rules of thumb what kind of problem can cause that behavior (as hint what I have to look for in my code since I was not yet able to figure out what the error is)?
Is there a better way to reset the GPU than rebooting the whole computer (I actually don’t have root rights, but have to call the HPC admin of our University ;-) )?
Are there chances that these problems will get less with the next GPU generation?

Best regards
ceearem

Sarnath · September 23, 2009, 10:56am

May be, try logging as a different user and try…

ceearem · September 23, 2009, 11:22am

I have traced the problem down to one specific line of code.

Here is the full kernel. In the brackets following the if(false) statement is a a first version of the kernel which runs fine. I want to coalesce the atomicAdds now (access conflicts are rare since there are about 20times more array elements than threads). The second part runs fine as well,except if I reintroduce the atomicAdd (which is commented out right now). Directly above I also check if I have an access outside of the array dimension, but that is not the case.

If anyone got an idea please let me know.

Ceearem

__global__ void make_rho_kernel(int* flag,int read_threads_at_same_time)

{

 int i,l,m,n,nx,ny,nz,mx,my,mz,a,b;

// clear 3d density array

// loop over my charges, add their contribution to nearby grid points

  // (nx,ny,nz) = global coords of grid pt to "lower left" of charge

  // (dx,dy,dz) = distance to "lower left" grid pt

  // (mx,my,mz) = global coords of moving stencil pt

 // int nzxy=blockIdx.x*gridDim.y+blockIdx.y;

// i=pppm_grid_ids[nzyx*blockDim.x+threadIdx.x];

  int nelements=nupper-nlower+1;

  int* idx=(int*) sharedmem;

  int* sdensity_brick_int=&idx[blockDim.x];

  PPPM_FLOAT* srho_coeff=(PPPM_FLOAT*) &sdensity_brick_int[nelements*blockDim.x];

  if(threadIdx.x<order*(order/2-(1-order)/2+1))

  srho_coeff[threadIdx.x]=rho_coeff[threadIdx.x];

  __syncthreads();

	

 i=blockIdx.x*blockDim.x+threadIdx.x;

if(false)

 {

  if(i < nlocal) {

	PPPM_FLOAT dx,dy,dz,x0,y0,z0;

	nx = part2grid[i];

	ny = part2grid[i+nmax];

	nz = part2grid[i+2*nmax];

	dx = nx+shiftone - (_x[i]-_boxlo.x)*delxinv;

	dy = ny+shiftone - (_x[i+nmax]-_boxlo.y)*delyinv;

	dz = nz+shiftone - (_x[i+2*nmax]-_boxlo.z)*delzinv;

	

	z0 = delxinv*delyinv*delzinv * _q[i];

	for (n = nlower; n <= nupper; n++) 

	{

	  mz = n+nz;

	  y0 = z0*rho1d(n,dz,srho_coeff);

	  for (m = nlower; m <= nupper; m++) 

	  {

		my = m+ny;

		x0 = y0*rho1d(m,dy,srho_coeff);

		for (l = nlower; l <= nupper; l++) 

		{

			  mx = l+nx;

			int mzyx=((mz-nzlo_out)*(nyhi_out-nylo_out+1)+my-nylo_out)*(nxhi_out-nxlo_out+1)+mx-nxlo_out;

		   	

		   	a=int(x0*rho1d(l,dx,srho_coeff)*density_intScale);

		   	b=(atomicAdd(&density_brick_int[mzyx],a)|a);

		   	if(((b)&(0x7c000000))&&(not((b)&(0x80000000)))) 

		   	{

		   	  flag[1]++;

		   	  if((b)&(0x60000000)) flag[0]++;

		   	}

		   	__syncthreads();

		}

	  }

	}

  }

  return;

  }

i=blockIdx.x*blockDim.x+threadIdx.x;

  {

	PPPM_FLOAT dx,dy,dz,x0,y0,z0;

	if(i < nlocal) 

	{

	  nx = part2grid[i];

	  ny = part2grid[i+nmax];

	  nz = part2grid[i+2*nmax];

	  dx = nx+shiftone - (_x[i]-_boxlo.x)*delxinv;

	  dy = ny+shiftone - (_x[i+nmax]-_boxlo.y)*delyinv;

	  dz = nz+shiftone - (_x[i+2*nmax]-_boxlo.z)*delzinv;

	  z0 = delxinv*delyinv*delzinv * _q[i];

	}

	else

	{

	  nx=ny=nz=1; dx=dy=dz=0.1;

	}

	__syncthreads();

	for (n = nlower; n <= nupper; n++) 

	{

	  mz = n+nz;

	  y0 = z0*rho1d(n,dz,srho_coeff);

	  for (m = nlower; m <= nupper; m++) 

	  {

		//my = m+ny;

		x0 = y0*rho1d(m,dy,srho_coeff);

		if(i<nlocal)

		{

	   	  idx[threadIdx.x]=((mz-nzlo_out)*(nyhi_out-nylo_out+1)+my-nylo_out)*(nxhi_out-nxlo_out+1)+nx+nlower-nxlo_out;

		  for (l = nlower; l <= nupper; l++) 

		  {

		   	sdensity_brick_int[threadIdx.x*nelements+l-nlower]=int(x0*rho1d(l,dx,srho_coeff)*density_intScale);

		  }

		}

	   	else idx[threadIdx.x]=-1;

		__syncthreads();

			//  if((idx[threadIdx.x]>=(nzhi_out-nzlo_out+1)*(nyhi_out-nylo_out+1)*(nxhi_out-nxlo_out+1))||

			//	 (idx[threadIdx.x]<0)) flag[2]++;

		

		for(int ii=0;ii<blockDim.x;ii+=read_threads_at_same_time)

		{

			int kk=threadIdx.x/nelements;

		

			if((threadIdx.x<nelements*read_threads_at_same_time)&&(kk+ii<blockDim.x)&&(idx[ii+kk]>-1))

			{

			  a=sdensity_brick_int[ii*nelements+threadIdx.x];

			  if((idx[ii+kk]+threadIdx.x-kk*nelements>=(nzhi_out-nzlo_out+1)*(nyhi_out-nylo_out+1)*(nxhi_out-nxlo_out+1))||

				 (idx[ii+kk]+threadIdx.x-kk*nelements<0)) flag[2]++;

			  //b=(atomicAdd(&density_brick_int[idx[ii+kk]+threadIdx.x-kk*nelements],a)|a);

		   	  b=a;	   

		   	  if(((b)&(0x7c000000))&&(not((b)&(0x80000000)))) 

		   	  {

		   		flag[1]++;

		   		if((b)&(0x60000000)) flag[0]++;

		   	  }

			}

		}

		__syncthreads();	   //*/ 

	  }

	}

  }

}

parallelis · September 23, 2009, 7:01pm

What’s your device capability (Model, compute capability level) ???

SPWorley · September 23, 2009, 7:30pm

Just a quick guess. If your mzyx variable ends up being negative or too large, you’ll be writing to shared memory pointers you don’t own via shared atomics.
It’s possible the shared memory pointer could even be outside the 16K range of the physical shared memory.
This is obviously bad. I don’t think it should kill your GPU, but you never know. I’d put a quick range check on that mzyx to see if that’s the case.

ceearem · September 24, 2009, 4:20am

Hi

my devices are 280GTX and I am using Cuda 2.3 with drivers 190.18 on a Red Head Linux (CentOs 5.something I think).

Regarding the range check of mzyx, the upper half of the code (before the return statement half way through the code) works as intended, so the mzyx is ok at that point. In the second part I tried to achieve the same result but with caching the results of the innermost loop of each thread first in the shared memory and writing it out then in a coalesced manner. Since I get the bad behaviour only if I try to do the atomicAdd in that part I checked the index there with:

[code]

if((idx[ii+kk]+threadIdx.x-kknelements>=(nzhi_out-nzlo_out+1)(nyhi_out-nylo_out+1)*(nxhi_out-nxlo_out+1))||

             (idx[ii+kk]+threadIdx.x-kk*nelements<0)) flag[2]++;

[\code]

where (nzhi_out-nzlo_out+1)(nyhi_out-nylo_out+1)(nxhi_out-nxlo_out+1) is the size of array density_brick_int. Flag[2] never returned anything else than zero. So that should be fine.

Best regards

Ceearem

ceearem · September 24, 2009, 4:05pm

Hi I found the problem (the line //my=m+ny; should be my=m+ny; obviously since otherwise my will not be defined in the lower part).
There was also an additional error with my flag retrieving (only downloaded two instead of 3 elements, omitting the important part about flag[2]) … anyway I am still interested if there is another possibility to reset the GPUs after such an unfortunate incident besides rebooting the computer. The GPU is used in compute-exclusive mode btw. A reloading of the driver module didn’t help either.

Best regards
Ceearem

P.S. I got reboot rights for further development, but still this is not how one should do these things, since the computer is shared and only remotely accessed.

tmurray · September 24, 2009, 5:36pm

this was fixed in a later 190.xx driver. maybe this one? [url=“http://www.nvidia.com/object/linux_display_amd64_190.32.html”]http://www.nvidia.com/object/linux_display_amd64_190.32.html[/url]

ceearem · September 24, 2009, 11:03pm

Oh very nice if it works. My admin will be most pleased to hear that. [He has quiet some reservations against technology, where one user of a server can possibly play havoc with calculations of other people - and I really am trying hard to convince him that CUDA is the coolest thing in the HPC world right now. I guess a fix to that problem will help a lot ;-) ]

Ill try with the non fixed code and check if it does not make the GPU stuck anymore tomorrow [need the admin to install the newer driver].

best regards

Ceearem

ceearem · September 25, 2009, 11:09am

Today I tested the new driver, and it indeed solved the problem, i. e. after running the broken code (which obviously produces garbage results) subsequent runs with the correct code produce correct results - in contrast to the old driver there these second correct runs failed as well.
Thanks for your hint to the new driver.

kewumgh · October 5, 2009, 4:33pm

I am more interested in “reset the video without reboot pc”, because I see similar problems that bothers me a lot.

when I tried to program with cuda before, I mistakenly wrote data to some wrong location at video card, then the rendering of my pc became messed up. Every time I have to reboot my pc to reset it, any other solution without reboot would be great.

my driver version 190.38

video board:quadro FX 1700

tks

gorbi · June 11, 2011, 1:02pm

reset can be done by:

In windows go to device manager → display adapters, click on card which you want reset, right click, disable hardware and with keyboard shortcuts (CTRL+ …) without screen again enable card - card is reseted…

I’m using this guide - it is fastest way to reset GPU :-)

Topic		Replies	Views
Is there a way to reset a GPU?... ...that is, without rebooting Linux CUDA Programming and Performance	7	3380	October 20, 2010
how to reset device? CUDA Programming and Performance	5	5896	June 3, 2009
Is it possible to reset GPU w/o rebooting? CUDA Programming and Performance	2	1765	November 3, 2009
GPU in a bad state - only power cycle helps CUDA Programming and Performance	6	2401	March 24, 2011
Cuda reset utility CUDA Programming and Performance	0	5235	November 22, 2011
reset cuda chip CUDA Programming and Performance	0	13966	June 17, 2009
Reset GPU withour restart after CUDA crash CUDA Programming and Performance	4	3385	May 16, 2013
How to proceed / reset device from an error 700 or 716 in cuda driver API? CUDA Programming and Performance	10	2283	October 12, 2021
Rest crashed CUDA card CUDA Programming and Performance	5	6873	February 23, 2009
how to avoid computer reboot CUDA Programming and Performance	0	636	August 22, 2011

Resetting GPU without reboot?

Related topics