 # Out of registers or stack? No exceptions, just returning 0.0.

Hello,

I’ve written a small program to run on my Tesla C2075, but I’m running into some problems.

```
__device__ float MTwo(float u)
{
    if (u >= 0.0 && u <= 2.0)
    {
        return (1.0 - abs(u - 1.0));
    }
    return 0.0;
}

__device__ float MThree(float u)
{
    return ( ((u * 0.5) * MTwo(u)) + (((3.0 - u) * 0.5) * MTwo(u - 1.0)) );
}

__device__ float MFour(float u)
{
    return ( ((u / 3.0) * MThree(u)) + (((4.0 - u) / 3.0) * MThree(u - 1.0)) );
}

__device__ float MFive(float u)
{
    if (u < 0.0 || u > 5.0) return 0.0;
    return ( ((u * 0.25) * MFour(u)) + (((5.0 - u) * 0.25) * MFour(u - 1.0)) );
}
```

These are being called by:

```
__global__ void QTestCalculate(double *QTestArray, const double Charge, const int *Ks, const int *Ns, const double *u, const int n_size)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n_size)
    {
        int k3 = index % Ks;
        int k2 = ((index % (Ks * Ks)) - k3) / Ks;
        int k1 = (index - k3 - (k2 * Ks)) / (Ks * Ks);
        float Q1 = 0.0, Q2 = 0.0, Q3 = 0.0;

        for (int n = Ns; n <= 1; n++)
        {
            Q1 += MFive(float(u) - float(k1) - float(n * Ks));
        }

        for (int m = Ns; m <= 1; m++)
        {
            Q2 += MFive(float(u) - float(k2) - float(m * Ks));
        }

        for (int o = Ns; o <= 1; o++)
        {
            Q3 += MFive(u - float(k3) - float(o * Ks));
        }

        QTestArray[index] = (Charge * Q1 * Q2 * Q3);
    }
}
```

My issue is this:
From a different kernel, I can pass a precalculated value of u into MFive (and even MSeven, which is not shown here), so I know that my M functions calculate the correct answer. I’ve also validated the argument that goes into the MFive() call here, and I know it should return a valid value other than 0.0.

When compiling without the -arch=sm_20 flag, I can also get values for Q1, Q2 and Q3 and return each of them individually. However, as soon as I multiply Q1 * Q2, Q2 * Q3, Q1 * Q3, or all three together, the result is 0.0.

When using the -arch=sm_20 flag, Q1, Q2 and Q3 are all zero.

When I use a spline order of 7 (the M functions are cardinal B-splines), I get valid return values as long as MThree just does return u; but if I change that to return u / 1.0; I get 0.0.

So, I’m guessing that I’m running out of registers or stack or something …

Is there any way to improve on this? Avoid the problem? I’m kind of new to CUDA, so I would appreciate any input on how to better solve this.

Best regards,
Bjørnar Jensen.

Are you checking the error codes in the host portion of your application? Try running your application under cuda-memcheck to see whether there are memory access errors or CUDA API errors. You can also use cuda-gdb or Nsight Eclipse Edition on Linux/Mac, or Nsight Visual Studio Edition on Windows, to step through the application and examine the values of the variables.
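
To make the error-checking suggestion concrete, here is a minimal sketch of checking a kernel launch from the host. DummyKernel and LaunchChecked are hypothetical names used only for illustration; the runtime calls themselves (cudaGetLastError, cudaDeviceSynchronize, cudaGetErrorString) are part of the standard CUDA runtime API. A kernel that needs more registers than the chosen block size allows typically fails at launch, and the output buffer then keeps whatever it held before, which can look like the kernel "returned" zeros.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; any kernel can be checked the same way.
__global__ void DummyKernel(float *out)
{
    if (out) out[threadIdx.x] = 1.0f;
}

// Launches the kernel with the given block size and returns the first
// error seen, or cudaSuccess if the launch and execution both worked.
cudaError_t LaunchChecked(int threadsPerBlock)
{
    DummyKernel<<<1, threadsPerBlock>>>(NULL);

    // Launch-time failures (invalid configuration, too many resources
    // requested for launch, ...) are reported by cudaGetLastError().
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
    {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return err;
    }

    // Errors raised while the kernel runs surface at the next synchronization.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
    {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    }
    return err;
}
```

Calling LaunchChecked with a block size the device cannot support should return an error code instead of failing silently.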

Hello,

I was only explicitly checking for errors after communication.
However, I figured it out. The kernel is indeed limited by registers.

So, currently I’m avoiding the issue by reducing the block size. But I’ll have to figure out a smart way to reuse some variables to increase the load on the Tesla (currently only 50%).
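
For reference, one way to see how the register count limits the block size is to ask the runtime directly. This is only a sketch: ExampleKernel and ReportKernelResources are hypothetical names, and the 256 in __launch_bounds__ is an arbitrary illustrative value, not a recommendation for this kernel. cudaFuncGetAttributes and the cudaFuncAttributes fields used below are part of the CUDA runtime API. Compiling with nvcc -Xptxas -v prints similar per-kernel numbers at build time, and -maxrregcount can cap registers for a whole file.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for QTestCalculate.
// __launch_bounds__(256) promises the compiler that this kernel is never
// launched with more than 256 threads per block, so it can budget
// registers for that block size (possibly spilling to local memory).
__global__ void __launch_bounds__(256) ExampleKernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * i;
}

// Prints the kernel's per-thread resource usage and returns the largest
// block size it can be launched with, or -1 if the query fails.
int ReportKernelResources()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, ExampleKernel);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "query failed: %s\n", cudaGetErrorString(err));
        return -1;
    }
    printf("registers per thread : %d\n", attr.numRegs);
    printf("local memory (bytes) : %zu\n", attr.localSizeBytes);
    printf("max threads per block: %d\n", attr.maxThreadsPerBlock);
    return attr.maxThreadsPerBlock;
}
```

If the block size used at launch is larger than the reported maximum, the launch fails, which matches the behaviour of reducing the block size making the kernel work.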

How do people get around this when rewriting (naturally) recursive algorithms in general?

Best regards,
Bjørnar.
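
On the closing question, one common approach is to evaluate the recurrence bottom-up in a small fixed-size array instead of through nested calls. The sketch below uses the same recurrence as MTwo through MFive above, M_k(u) = (u * M_{k-1}(u) + (k - u) * M_{k-1}(u - 1)) / (k - 1), but computes MTwo only N - 1 times instead of once per leaf of the call tree. It also uses float literals and fabsf, since unsuffixed constants such as 0.5 are doubles and force the arithmetic into double precision. MSpline and the HD macro are hypothetical names, and whether this actually lowers register use for a given kernel is an assumption that would need to be checked with the compiler output. The code is plain C++ that also compiles as device code under nvcc.

```cpp
#include <math.h>

// Mark functions for both host and device when compiled by nvcc.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Order-2 cardinal B-spline (hat function on [0, 2]), as MTwo in the post.
HD inline float MTwo(float u)
{
    return (u >= 0.0f && u <= 2.0f) ? 1.0f - fabsf(u - 1.0f) : 0.0f;
}

// Order-N cardinal B-spline evaluated bottom-up.
// v[j] starts as M_2(u - j); after the pass for order k it holds M_k(u - j).
template <int N>
HD inline float MSpline(float u)
{
    float v[N - 1];
    for (int j = 0; j < N - 1; ++j)
    {
        v[j] = MTwo(u - (float)j);
    }
    for (int k = 3; k <= N; ++k)
    {
        const float inv = 1.0f / (float)(k - 1);
        // Ascending j is safe: v[j + 1] still holds the order k-1 value.
        for (int j = 0; j <= N - k; ++j)
        {
            const float x = u - (float)j;
            v[j] = inv * (x * v[j] + ((float)k - x) * v[j + 1]);
        }
    }
    return v[0];
}
```

With this, MSpline<5>(x) plays the role of MFive(x), and MSpline<7>(x) the role of MSeven(x). As a sanity check, MSpline<5>(2.5f) evaluates to 115/192 (about 0.599), the peak of the order-5 spline.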