GPU Hangs

I have a moderately complex set of kernels working, so not quite a newb anymore, maybe a newb++.

I just added a relatively simple kernel to my project that is causing my system to hang when it writes to a global array. If I comment out that line there is no hang; uncomment it and it hangs. Here is the line:

objectLocXY[(curGroup*objectsPerGroup*2)+(curObjectInGroup*2)]=baseLocX[curGroup];

For debugging, I disabled that line of code (it is the only write in the kernel) and instead wrote the following to the same array, transferred it back to the host, and dumped it out:

objectLocXY[absoluteThreadNum]=(curGroup*objectsPerGroup*2)+(curObjectInGroup*2);
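In sketch form, the debug version of the kernel looked roughly like this (parameter names follow the lines above, but the thread numbering and the assumption of one object per thread are reconstructions, not my exact code):

```cuda
// Debug variant: each thread records the index it *would* have written to,
// instead of the data value, so the host can dump the array and inspect the
// index pattern for out-of-range values or duplicates.
__global__ void knlDebugDumpIndices(int* objectLocXY, int objectsPerGroup,
					int curGroup)
{
	int absoluteThreadNum = (blockIdx.x * blockDim.x) + threadIdx.x;
	int curObjectInGroup = absoluteThreadNum;	// assumption: 1 object per thread
	objectLocXY[absoluteThreadNum] =
		(curGroup * objectsPerGroup * 2) + (curObjectInGroup * 2);
}
```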

I have verified the following:

  1. Address of array is getting passed to device properly (as in: the address being passed to the kernel matches the address originally returned by cudaMalloc)

  2. Alloc is of proper size

  3. Index being updated in original code is within bounds of array

The only thing I haven’t exhaustively checked is that no 2 threads write to the same element, although walking through the logic with a calculator showed correct calculations.

So here is my question: if 1, 2, and 3 are all truly OK, is the next logical thing to check whether two threads are writing to the same element? Would that hang the system? If not, any thoughts on other things to check?

Hangs are almost always caused by out-of-bounds global writes or infinite loops. Have you tried running your code in EmuDebug mode?
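To answer the duplicate-write question directly: two threads writing the same global element non-atomically won't hang the machine; one of the writes simply wins (which one is undefined). If you want to verify it anyway, here is one way to check without a calculator (a sketch under assumptions: `hitCount` is a cudaMalloc'd array with one int per element of your target array, zeroed with cudaMemset before launch; the index math is copied from your post, the thread mapping is a guess):

```cuda
// Each thread bumps a counter for the element it would write to. Afterwards,
// copy hitCount back to the host and scan for values > 1 -- any such element
// has at least two writers. atomicAdd on global ints needs compute 1.1+.
__global__ void knlCountWriters(int* hitCount, int objectsPerGroup, int curGroup)
{
	int curObjectInGroup = (blockIdx.x * blockDim.x) + threadIdx.x;
	int idx = (curGroup * objectsPerGroup * 2) + (curObjectInGroup * 2);
	atomicAdd(&hitCount[idx], 1);
}
```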

Thanks, that helps, I’ll focus on those 2 things.

(I haven’t used EmuDebug yet; I probably should. I started out with compiled-in debug statements that return values to the host through debugging arrays, and now that it’s in place it’s easy to keep using it instead of other tools.)

Compile with -deviceemu and use valgrind to check for out of bounds accesses. (If you don’t have a Linux box, you really should because valgrind is awesome for CPU code anyway.)
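Something like this (`myapp` is a placeholder name for your source file):

```shell
# Build in device-emulation mode (kernels run on the CPU), then let
# valgrind watch for out-of-bounds accesses.
nvcc -deviceemu -g -o myapp myapp.cu
valgrind ./myapp
```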

Thanks, I’ll try it.

Found the problem, which I will share in case anyone runs into the same type of situation. The problem was not the array I was writing to; it was the array I was reading from earlier in the code. The address I passed in for the read array was bogus.

The thing that made this tricky to figure out was that it would only hang if the value that was read in previously was then used to write to the other global array.

int tempLocX;
int debugVal;

debugVal=-999999;
tempLocX=bogusReadArray[someIndex];

if (various conditions) {
	goodWriteArray[someOtherIndex]=tempLocX;
} else {
	goodWriteArray[someOtherIndex]=debugVal;
}

It only hung if the line “goodWriteArray[someOtherIndex]=tempLocX;” was actually executed at run time based on the parms passed in (in other words, the compiler wasn’t just optimizing out the write because it knew the line would never execute). This is what threw me: I figured a problem with the read array would have shown up regardless of whether I attempted to use the value that had been read in.

Anyone know why it would hang only when the read in value from the problem array was used to write to the other array?

Because the compiler optimizes away the read if it finds that the value is not used afterwards.
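For example (a sketch, not your code):

```cuda
// If v were never referenced, the compiler would be free to drop the load
// from in[] entirely -- and a bad pointer would then go unnoticed.
__global__ void knlReadDemo(int* in, int* out, int flag)
{
	int v = in[threadIdx.x];
	if (flag) {
		out[threadIdx.x] = v;	// referencing v here keeps the load alive,
	} else {			// even if this branch never runs at run time,
		out[threadIdx.x] = -1;	// because flag's value is unknown at compile time
	}
}
```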

I checked for that. I simplified the code listed above and forgot to include the reference to tempLocX in the second half of the conditional. The compiler can’t know how the conditional will evaluate, because it doesn’t know the values I was passing to the routine. Here is the actual code prior to resolution:

//==============================================================================
// Gen Particles - Set Burst Location
__global__ void knlParticleCalcGenBurst(int* objID, int* objLocX, int* objLocY
					,int* partStatus, int* partLocXY
					,int firstObj, int lastObj
					,int numObjects
					,int numBurstsPerObj
					,int numPartsPerBurst
					,int numObjGroups
					,int curGroupNum
					,int curBurstNum
					)
{
	// Vars
	int threadNum;
	int relObjIndex;
	int absObjIndex;
	int tempLocX;
	int tempLocY;
	int tempIndex;
	int objPerThread;

	int debugVal;

	threadNum=((blockIdx.x*BLK_THREADS)+threadIdx.x);

	objPerThread=(numObjects/TOT_THREADS)+1;

	relObjIndex=threadNum;
	absObjIndex=firstObj+relObjIndex;

	// ## Debug - Clear every element to verify writing to this array is not a problem
	if (threadNum==0) {
		for (int i=0;i<16000;i++) {
			partLocXY[i]=0;
		}
	}
	__syncthreads();

	if (relObjIndex>=0 && relObjIndex<numObjects) {
		tempLocX=BURST_INACTIVE;
		tempLocY=BURST_INACTIVE;

		debugVal=objLocX[numObjects];		// <------- added to test if this array was the problem

		// If this obj is part of group gening bursts right now then set burst location
		if ((relObjIndex % numObjGroups)==curGroupNum) {
			// Get coordinates for center of burst
			tempLocX=objLocX[absObjIndex];	// <------ bad address was passed for this array
			tempLocY=objLocY[absObjIndex];
		}

		tempIndex=(relObjIndex*numBurstsPerObj*2)+(curBurstNum*2);
		if (tempIndex>=0 && tempIndex<16000) {
			partLocXY[tempIndex]=tempLocX;	// <----- this causes failure
			//partLocXY[tempIndex]=-999999;
			//partLocXY[tempIndex]=debugVal;	// <----- this causes failure
		} else {
			partLocXY[threadNum]=tempIndex;
			partLocXY[threadNum*2]=tempLocX;	// <----- this doesn't cause failure (code is never executed due to conditional, but compiler can't know that)
		}

	} // if
}

The behavior I was getting was that either tempLocX or debugVal in the upper half of the conditional caused the hang, but no hang occurred if other values were used. For every test the else branch stayed constant (because at that point I suspected the other array was the problem, and I wanted to make sure the compiler did not optimize out the reference to tempLocX).
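For what it’s worth, checking the launch result on the host might have surfaced this sooner in some cases (a sketch; the kernel name and launch config are from my code above, the argument list is elided, and cudaThreadSynchronize() is the old-API name — later toolkits call it cudaDeviceSynchronize()):

```cuda
// Host side: sync after the launch and report any error instead of
// silently continuing. A truly wedged kernel still won't return, but
// aborted launches and some bad accesses show up as a readable error.
knlParticleCalcGenBurst<<<numBlocks, BLK_THREADS>>>(/* ... */);
cudaError_t err = cudaThreadSynchronize();
if (err != cudaSuccess) {
	printf("kernel failed: %s\n", cudaGetErrorString(err));
}
```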