Unspecified kernel launch failure (memory allocation)

So, my problem is that I have a kernel, called successively within a while loop, that runs fine for the first ~700 iterations but then fails with this error. Within the kernel I call a few device functions to do calculations. The failure is caused by a memory access violation, which in turn is due to a bad memory allocation when constructing an object with the following copy constructor:

__host__ __device__ Matrix::Matrix(const Matrix &MAT)
{
//	cout<<" >>> copy constructing >>>\n";

	num_row=MAT.num_row;
	num_col=MAT.num_col;
	num_elem=MAT.num_elem;
	pbody=new cudap[num_elem];
	if(pbody==0){
#if __CUDA_ARCH__
		printf("\n***Error: Matrix memory allocation Failed: Constructor***\n\n");
		asm("trap;");	//halt the kernel instead of falling through to a NULL dereference in the copy loop below
#else
		cerr<<"*** Error: Matrix memory allocation failed ***\n";
		exit(1);
#endif
	}

	//copying
	for(int i=0;i<num_elem;i++)
		*(pbody+i)=(*(MAT.pbody+i));
}

In some cases this operator will fail as well:

__host__ __device__ Matrix Matrix::operator*(const Matrix &B)
{
	//check for proper dimensions before allocating the result
	if (num_col!=B.num_row){
#if __CUDA_ARCH__
		printf("*** Error: incompatible dimensions 'Matrix::operator*()' ***\n");
		asm("trap;");
#else
		cerr<<"*** Error: incompatible dimensions 'Matrix::operator*()' *** ";
		exit(1);
#endif
	}

	//create resultant matrix
	Matrix RESULT(num_row,B.num_col);
	int r=0; int c=0;

	for(int i=0;i<RESULT.num_elem;i++){
		r=i/B.num_col;
		c=i%B.num_col;
		cudap sum=0;	//accumulate locally so RESULT need not be zero-initialized
		for (int k=0; k<num_col;k++){
			sum+= *(pbody+k+num_col*r)*(*(B.pbody+k*B.num_col+c));
		}
		*(RESULT.pbody+i)=sum;
	}
	return RESULT;
}

What would cause this behavior?

The new operator and in-kernel malloc work off of the device heap, which by default is limited to 8MB. If you hit that limit, the returned pointer will be NULL, which you are testing for (good!). If that is what you are seeing (a NULL pointer), you can expand the device heap with cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...), called from the host before any kernel has been launched.

Refer to the programming guide:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#dynamic-global-memory-allocation-and-operations
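For example, the device heap can be enlarged from host code before the first kernel launch (a sketch; the 128 MB figure is just an illustrative choice):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
	// Raise the device heap (used by in-kernel new/malloc) from the
	// default 8 MB to 128 MB. This must be done before any kernel
	// that allocates has been launched in this context.
	size_t requested = 128 * 1024 * 1024;
	cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, requested);
	if (err != cudaSuccess) {
		fprintf(stderr, "cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));
		return 1;
	}

	size_t actual = 0;
	cudaDeviceGetLimit(&actual, cudaLimitMallocHeapSize);
	printf("device heap size: %zu bytes\n", actual);
	return 0;
}
```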

Why wouldn’t an appropriate destructor deallocate the memory, making room for more dynamic allocation?

Well, you’ve shown the constructor. Did you write a destructor that does that?

If you read the manual section I linked, it states:

“The memory allocated by a given CUDA thread via malloc() remains allocated for the lifetime of the CUDA context, or until it is explicitly released by a call to free(). It can be used by any other CUDA threads even from subsequent kernel launches.”

The above comment applies to new/delete as well. Memory allocated via new remains allocated until it is explicitly released with a corresponding delete, or until the CUDA context is destroyed, which typically means application termination.

__host__ __device__ Matrix::~Matrix()
{
	delete [] pbody;
}

This is the destructor.

I suppose the question I have now is: if you have device code that declares an object, does each thread instantiate and destruct its own copy of that variable? From what I’ve read, I suppose there’s a memory leak in the code somewhere.

Where is the type cudap defined? Do you have a delete operator defined for it?

It is a typedef for float or double (cudap), depending on the build version.

The problem may have nothing to do with the code you’ve shown. (It just happens to be the most likely indicator of a leak problem, perhaps because it’s the first, most frequent, or largest dynamic allocator.) Since you’re calling the kernel repeatedly, I would start by making sure the only usage of new is within constructors, and that every class constructor has a properly defined destructor with a corresponding delete. Every instantiated object should go out of scope at kernel termination, assuming all your kernel calls are completing normally. If you do use new or malloc outside of that, you’ll have to manually verify that the relevant allocations are getting freed.
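A sketch of that failure mode (the kernels and sizes are hypothetical): a bare in-kernel new with no matching delete leaks device heap on every launch, because those allocations persist for the lifetime of the CUDA context:

```cuda
__global__ void leaky_kernel(int n)
{
	// Device-heap leak: allocated on every launch, never freed.
	// After enough launches, new starts returning NULL and any
	// unchecked dereference then faults the kernel.
	float *scratch = new float[n];
	if (scratch) scratch[0] = 0.0f;
	// missing: delete [] scratch;
}

__global__ void clean_kernel(int n)
{
	float *scratch = new float[n];
	if (scratch) scratch[0] = 0.0f;
	delete [] scratch;	// released before the kernel returns
}
```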

Yes, if my thread code looks like this:

Matrix a;

And that line of code is not inside some conditional construct, then every thread in the block will create its own local “a” Matrix object. Whenever that goes out of scope or is explicitly deleted, the destructor should be called. It will be separate for each instantiation of “a”, local to each thread.
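That per-thread lifetime can be sketched as follows (the kernel and the Matrix usage are illustrative, assuming the Matrix class from the posts above):

```cuda
__global__ void work_kernel()
{
	// Every thread executing this line constructs its own,
	// thread-local Matrix, whose constructor allocates from
	// the device heap.
	Matrix a(3, 3);

	// ... per-thread calculations using a ...

	// Each thread's destructor for `a` runs here, when it goes
	// out of scope at the end of the kernel body, returning the
	// allocation to the device heap.
}
```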

Thanks for all the help, txbob.