If you were a program and you would only run sometimes... your problem would be??

Hey all,
This is all in the context of running code on Ubuntu 9.04 x86_64.

I’m getting incredibly frustrated with the code I’m working on. The program, after compiling it with “nvcc main.cu -o main --maxrregcount=32”, will usually just fail with an ‘unspecified launch failure’ after the first kernel call.

However, sometimes it will run successfully to completion. This is all with the same compiled binary, just re-entering the command ./main repeatedly. It works roughly 1 time out of every 10.

Now I know this is a very vague description, but short of sending you the entire code I’m not sure what information is useful.
All of my pointers are initialized to NULL when they are declared, and I have error checks after every cudaMalloc and kernel call.

Any ideas?

Thanks very much for the help, I’m lost right now, and getting pretty frustrated.
Adam

The ULF (unspecified launch failure) could be caused by a segfaulting kernel. Try compiling your program in emulation mode and running it through Valgrind to check whether you’re accessing bad memory and occasionally overwriting something crucial.

This is killing me.

I’m seeing variables actually change value when they are passed to functions, in both gdb (deviceemu) and cuda-gdb.

The argument types and the parameter types are exactly the same.

See the following for an example:

inside gdb:

1228			macro+=getmicro_t((*neut).energy,Emesh_d,loc)*mat_list_d[i+NUCLIDE_MAX*(*neut).cell].density;
(gdb) s
getmicro_t (E=2, Emesh_d=0x7f3aa3ef2100, loc=1) at main2.cu:1155
1155		if (E<=E0_d[loc])
(gdb) n
1161			unsigned long i=binarySearch(Emesh_d,Emesh_offsets_d[loc],Emesh_offsets_d[loc]-1, E);
(gdb) p Emesh_offsets_d[loc]
$5 = 31839
(gdb) p Emesh_offsets_d[loc+1]-1
$6 = 32066
(gdb) s
binarySearch (sortedArray=0x7f3aa3ef2100, first=0, last=31838, key=2)
	at main2.cu:1055
1055	   	   unsigned long mid = (first + last) / 2;  // compute mid point.

The code calling this function, and the function are below:

__constant__ unsigned long Emesh_offsets_d[NUCLIDE_MAX];

//the populating of Emesh_offsets_d:
unsigned long* offset_temp=NULL;
offset_temp=(unsigned long*)malloc(((*xs_data).listlength+1)*sizeof(unsigned long));
unsigned long sum=0;
for (int i=0; i<(*xs_data).listlength; i++)
{
	offset_temp[i]=sum;
	sum+=(*xs_data).cross_sec[i].mesh_length;
}
offset_temp[(*xs_data).listlength]=sum;
cudaMemcpyToSymbol(Emesh_offsets_d,offset_temp,((*xs_data).listlength+1)*sizeof(unsigned int),0,cudaMemcpyHostToDevice);
cudaThreadSynchronize();
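One thing that may be worth checking in the listing above: Emesh_offsets_d and offset_temp hold unsigned long, but the byte count passed to cudaMemcpyToSymbol is computed with sizeof(unsigned int). On x86_64 Linux those types are 8 and 4 bytes, so only about half of the table would reach the symbol. The host-only sketch below imitates this with memcpy. The fixed-width stand-in types, the helper names buildOffsets and countCopiedEntries, and the sample mesh lengths are assumptions made for illustration, not part of the posted code.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Stand-ins for the LP64 sizes of the types in the post (assumption):
// unsigned long -> 8 bytes, unsigned int -> 4 bytes.
using Offset = std::uint64_t;
using WrongElem = std::uint32_t;

// Build prefix-sum offsets the way the post does: one entry per mesh,
// plus a final entry holding the total length.
std::vector<Offset> buildOffsets(const std::vector<Offset>& meshLengths) {
    std::vector<Offset> offsets(meshLengths.size() + 1);
    Offset sum = 0;
    for (std::size_t i = 0; i < meshLengths.size(); ++i) {
        offsets[i] = sum;
        sum += meshLengths[i];
    }
    offsets[meshLengths.size()] = sum;
    return offsets;
}

// Copy `bytes` bytes into a zero-filled table (a stand-in for the device
// symbol) and count how many leading entries match the source.
std::size_t countCopiedEntries(const std::vector<Offset>& src, std::size_t bytes) {
    std::vector<Offset> dst(src.size(), 0);
    std::memcpy(dst.data(), src.data(), bytes);
    std::size_t ok = 0;
    while (ok < src.size() && dst[ok] == src[ok]) ++ok;
    return ok;
}
```

With four offsets, a byte count of 4 * sizeof(Offset) transfers all four entries, while 4 * sizeof(WrongElem) transfers only the first two. If this is indeed the problem, using sizeof(unsigned long) in the copy would be the matching fix. It may also be worth confirming that listlength+1 never exceeds NUCLIDE_MAX, since that is the declared size of Emesh_offsets_d.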

function call:

unsigned long i=binarySearch(Emesh_d,Emesh_offsets_d[loc],Emesh_offsets_d[loc]-1, E);

__device__ unsigned long binarySearch(float* sortedArray, unsigned long first, unsigned long last, float key)
{
	while (first <= last)
	{
		unsigned long mid = (first + last) / 2;  // compute mid point.
		if (key > sortedArray[mid])
			first = mid + 1;  // repeat search in top half.
		else if (key < sortedArray[mid])
			last = mid - 1; // repeat search in bottom half.
		else
			return mid;	 // found it. return position /////
	}
	return last+1;	// failed to find key
}
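A hazard worth noting in binarySearch as posted: first, last and mid are all unsigned. When key is smaller than every element and the search starts at index 0, the loop eventually reaches mid == 0, and last = mid - 1 wraps around to the largest unsigned long. The loop condition still holds, so the next sortedArray[mid] reads far out of bounds. Below is a minimal sketch of one way around this, using a half-open range [first, last) so that nothing is ever decremented. The name lowerBoundIndex and the half-open convention are my assumptions, not the original code. It always returns the first position whose element is not less than key, which matches the posted function's "failed to find key" result but may differ from it when duplicates are present. It is written as plain host C++ so it can be tried on the CPU; for device use it would need the __device__ qualifier.

```cpp
#include <cstddef>

// Returns the index of the first element in [first, last) that is not less
// than key, or last if every element in the range is smaller.
// sortedArray must be sorted in ascending order over that range.
std::size_t lowerBoundIndex(const float* sortedArray, std::size_t first,
                            std::size_t last, float key) {
    while (first < last) {
        // Midpoint computed without first + last, which could overflow.
        std::size_t mid = first + (last - first) / 2;
        if (sortedArray[mid] < key)
            first = mid + 1;  // the answer lies above mid
        else
            last = mid;       // mid may be the answer; no subtraction needed
    }
    return first;
}
```

With the offsets table from the post, the range for mesh loc would presumably be Emesh_offsets_d[loc] to Emesh_offsets_d[loc+1]. Note that the posted call passes Emesh_offsets_d[loc]-1 as last, whereas the gdb session inspects Emesh_offsets_d[loc+1]-1, which may be what was intended.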

EDIT: Also, when I’m stepping through the binarySearch func in gdb, get this, it doesn’t return to the function that called it, but skips ahead to one that calls it way later in the code! What is going on here? Voodoo???

Any help is amazingly appreciated, and if you’re in the DC area, I’ll buy you a beer.

This is symptomatic of stack corruption. A routine that corrupts the stack can make returning from a subroutine impossible, or turn the return into a branch to a random address.

Interesting.

Running through Valgrind helped. Helped in that now my code runs more things than it normally would. (I don’t fully understand what’s happening when I try to analyze a larger model, but I can attack that later.)

For now, I want to understand what just happened.

Forget about the corrupted stack issue for a moment: I went back to a version of the code from a few days ago, so that ‘feature’ wasn’t present in this build.

What I did was run Valgrind, which told me about memory leaks and uninitialized values.

Now, it certainly looked to me like the uninitialized values were initialized (I even checked them at runtime before and after the offending line and they were just fine), but I just added an initializing statement and boom, most of the problems went away.

For memory leaks, I have a specific question: if I free() the host data before the device data (which has already been copied with cudaMemcpy) is done being accessed, why does that lead to problems? Is this a device-emu-only thing?
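On the free() question, one possible explanation, offered as a guess: by the time cudaMemcpy returns, the host buffer has been read, so freeing a flat array afterwards should be harmless. A struct with pointer members is different, because copying the struct copies the addresses it holds, not the data behind them. In emulation mode the "device" code runs on the CPU and can follow those host addresses, but only until they are freed. The host-only sketch below contrasts a shallow and a deep copy. The CrossSection struct and both helper names are hypothetical and only loosely modelled on the cross_sec and mesh_length names in the post.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical layout: a struct that owns a heap-allocated mesh.
struct CrossSection {
    float* mesh;
    std::size_t mesh_length;
};

// Shallow copy: duplicates the struct bytes only, so the copy's mesh
// pointer aliases the original allocation.
CrossSection shallowCopy(const CrossSection& src) {
    CrossSection dst;
    std::memcpy(&dst, &src, sizeof(CrossSection));
    return dst;
}

// Deep copy: also duplicates the pointed-to data, so the copy stays
// valid after the original mesh is freed.
CrossSection deepCopy(const CrossSection& src) {
    CrossSection dst;
    dst.mesh_length = src.mesh_length;
    dst.mesh = static_cast<float*>(std::malloc(src.mesh_length * sizeof(float)));
    if (dst.mesh != nullptr)
        std::memcpy(dst.mesh, src.mesh, src.mesh_length * sizeof(float));
    return dst;
}
```

After shallowCopy, freeing the original mesh leaves the copy holding a dangling pointer, which would fit the symptom of problems appearing only when the host data is freed early. On a real device the equivalent of deepCopy would be a separate cudaMalloc and cudaMemcpy for each pointed-to array.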

I’m in a rush right now, I can give better data later today though. I’m curious what just happened.

Thanks,

Adam