Out of registers or stack? No exceptions, just returning 0.0.

Hello,

I’ve written a small piece of code to run on my Tesla C2075, but I’m running into some problems.

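// Order-2 cardinal B-spline: a hat function on [0, 2].
// Float literals and fabsf keep the arithmetic in single precision.
__device__ float MTwo(float u)
{
	if (u >= 0.0f && u <= 2.0f)
	{
		return (1.0f - fabsf(u - 1.0f));
	}
	return 0.0f;
}

// Order 3, built from two order-2 evaluations.
__device__ float MThree(float u)
{
	return ( ((u * 0.5f) * MTwo(u)) + (((3.0f - u) * 0.5f) * MTwo(u - 1.0f)) );
}

// Order 4, built from two order-3 evaluations.
__device__ float MFour(float u)
{
	return ( ((u / 3.0f) * MThree(u)) + (((4.0f - u) / 3.0f) * MThree(u - 1.0f)) );
}

// Order 5; zero outside the support [0, 5].
__device__ float MFive(float u)
{
	if (u < 0.0f || u > 5.0f) return 0.0f;
	return ( ((u * 0.25f) * MFour(u)) + (((5.0f - u) * 0.25f) * MFour(u - 1.0f)) );
}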

These are being called by:

__global__ void QTestCalculate(double *QTestArray, const double Charge, const int *Ks, const int *Ns, const double *u, const int n_size)
{
	int index = blockIdx.x * blockDim.x + threadIdx.x;
	if (index < n_size)
	{
		int k3 = index % Ks[0];
		int k2 = ((index % (Ks[0] * Ks[1])) - k3) / Ks[0];
		int k1 = (index - k3 - (k2 * Ks[0])) / (Ks[0] * Ks[1]);
		float Q1 = 0.0, Q2 = 0.0, Q3 = 0.0;

		for (int n = Ns[0]; n <= 1; n++)
		{
			Q1 += MFive(float(u[0]) - float(k1) - float(n * Ks[0]));
		}

		for (int m = Ns[1]; m <= 1; m++)
		{
			Q2 += MFive(float(u[1]) - float(k2) - float(m * Ks[1]));
		}
		
		for (int o = Ns[2]; o <= 1; o++)
		{
			Q3 += MFive(float(u[2]) - float(k3) - float(o * Ks[2]));
		}

		QTestArray[index] = (Charge * Q1 * Q2 * Q3);
	}
	__syncthreads();
}

My issue is this:
I can (from a different kernel) pass a precalculated value of u into MFive (or even MSeven, which is not shown here), so I know that my M functions calculate the correct answer. I’ve also validated what goes into the MFive() call here, and I know it should return a valid value other than 0.0.

When compiling without the -arch=sm_20 flag, I can also get values for Q1, Q2 and Q3 and return each of them individually. However, I cannot multiply Q1 * Q2, Q2 * Q3 or Q1 * Q3, or all three together: the result is then 0.0.

When using the -arch=sm_20 flag, Q1, Q2 and Q3 are all zero.

When I use a spline order of 7 (the M functions are cardinal B-splines), I get valid return values
as long as MThree is reduced to: return u; If I change it to: return u / 1.0; I get 0.0.

So, I’m guessing that I’m running out of registers or stack or something …

Is there any way to improve on this? Avoid the problem? I’m kind of new to CUDA, so I would appreciate any input on how to better solve this.

Best regards,
Bjørnar Jensen.

Are you checking error codes in the host portion of your application? Try running your application under cuda-memcheck to see whether there are memory access errors or CUDA API errors. You can also debug it with cuda-gdb or Nsight Eclipse Edition on Linux/Mac, or Nsight Visual Studio Edition on Windows, stepping through the code and examining the values of the variables.
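
A minimal sketch of what host-side error checking could look like, assuming the CUDA runtime API. The names cudaOk, CUDA_CHECK, dummyKernel and launchChecked are illustrative and not part of the original code. A launch that asks for more resources than a block can have does not throw; it is typically reported only through cudaGetLastError(), and the output buffer is left untouched.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns true on success, prints the error otherwise.
inline bool cudaOk(cudaError_t err, const char *what, const char *file, int line)
{
    if (err == cudaSuccess) return true;
    fprintf(stderr, "%s:%d: %s failed: %s\n", file, line, what, cudaGetErrorString(err));
    return false;
}

#define CUDA_CHECK(call) cudaOk((call), #call, __FILE__, __LINE__)

// Placeholder kernel standing in for QTestCalculate.
__global__ void dummyKernel(double *out, int n)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n) out[index] = 1.0;
}

// Launch the kernel and check both launch-time and execution-time errors.
bool launchChecked(double *d_out, int n, int threadsPerBlock)
{
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    dummyKernel<<<blocks, threadsPerBlock>>>(d_out, n);

    // Launch-configuration errors (for example, too many resources
    // requested for the launch) are reported here.
    if (!CUDA_CHECK(cudaGetLastError())) return false;

    // Errors raised while the kernel runs are reported after synchronization.
    return CUDA_CHECK(cudaDeviceSynchronize());
}
```

Wrapping every cudaMalloc, cudaMemcpy and kernel launch this way makes a failed launch visible immediately instead of showing up later as an array of zeros.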

Hello,

I had only been checking explicitly for errors after communication.
However, I figured it out: the kernel is indeed limited by registers.

So, currently I’m avoiding the issue by reducing the block size. But I’ll have to figure out a smart way to reuse some variables to increase the load (currently only 50%) on the Tesla.
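
For anyone hitting the same limit, here is a rough sketch of how the register use of a kernel can be inspected at run time with cudaFuncGetAttributes, assuming the CUDA runtime API. The names placeholderKernel, threadsLimitedByRegisters and reportKernelResources are illustrative, and the estimate ignores the hardware's register allocation granularity, so treat it as an upper bound only. Compiling with nvcc -Xptxas -v prints similar per-kernel figures at build time.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for QTestCalculate.
__global__ void placeholderKernel(double *out, int n)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n) out[index] = 0.0;
}

// Rough upper bound on threads per block imposed by registers alone,
// rounded down to whole warps of 32 threads (an approximation).
int threadsLimitedByRegisters(int regsPerThread, int regsPerBlock)
{
    if (regsPerThread <= 0) return 0;
    int threads = regsPerBlock / regsPerThread;
    return (threads / 32) * 32;
}

// Print what the compiled kernel needs and what device 0 offers.
void reportKernelResources()
{
    cudaFuncAttributes attr;
    cudaDeviceProp prop;
    if (cudaFuncGetAttributes(&attr, placeholderKernel) != cudaSuccess) return;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return;

    printf("registers per thread:       %d\n", attr.numRegs);
    printf("local memory per thread:    %zu bytes\n", attr.localSizeBytes);
    printf("max threads per block:      %d\n", attr.maxThreadsPerBlock);
    printf("register-limited estimate:  %d threads\n",
           threadsLimitedByRegisters(attr.numRegs, prop.regsPerBlock));
}
```

With 63 registers per thread and 32768 registers available to a block, the estimate comes out at 512 threads, so a launch with 1024 threads per block would be rejected. The attr.maxThreadsPerBlock field is the value to trust when picking a block size.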

How do people get around this when rewriting (naturally) recursive algorithms in general?

Best regards,
Bjørnar.
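
One common approach is to replace the nested calls with a loop over a small local array. The sketch below assumes the M functions follow the recurrence shown in the first post (order k built from two order k-1 values); the names splineOrderTwo and cardinalBSpline are illustrative. Each order-2 value is computed once and then combined in place, so MTwo is evaluated ORDER - 1 times instead of once per leaf of the call tree. Whether this actually lowers register use for a given kernel would need to be confirmed from the compiler's resource report.

```cuda
#include <cmath>
#include <cuda_runtime.h>

// Order-2 cardinal B-spline (hat function on [0, 2]), as in MTwo.
__host__ __device__ inline float splineOrderTwo(float u)
{
    if (u >= 0.0f && u <= 2.0f) return 1.0f - fabsf(u - 1.0f);
    return 0.0f;
}

// Iterative evaluation of the cardinal B-spline of order ORDER (ORDER >= 2).
// v[j] holds the current-order spline evaluated at u - j.
template <int ORDER>
__host__ __device__ float cardinalBSpline(float u)
{
    if (u < 0.0f || u > float(ORDER)) return 0.0f;

    float v[ORDER - 1];
    for (int j = 0; j < ORDER - 1; ++j)
        v[j] = splineOrderTwo(u - float(j));

    // Raise the order one step at a time: after pass k, v[j] is the
    // order-k spline at u - j for j = 0 .. ORDER - k.
    for (int k = 3; k <= ORDER; ++k)
    {
        float inv = 1.0f / float(k - 1);
        for (int j = 0; j <= ORDER - k; ++j)
        {
            float x = u - float(j);
            // v[j + 1] still holds the previous order's value here.
            v[j] = (x * v[j] + (float(k) - x) * v[j + 1]) * inv;
        }
    }
    return v[0];
}
```

In the kernel, a call such as cardinalBSpline<5>(x) would stand in for MFive(x). Because ORDER is a compile-time constant, the compiler is free to unroll both loops.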