Hello,

I’ve written a small piece of code to run on my Tesla C2075, but I’m running into some problems.

```
__device__ float MTwo(float u)
{
    if (u >= 0.0 && u <= 2.0)
    {
        return (1.0 - abs(u - 1.0));
    }
    return 0.0;
}

__device__ float MThree(float u)
{
    return ( ((u * 0.5) * MTwo(u)) + (((3.0 - u) * 0.5) * MTwo(u - 1.0)) );
}

__device__ float MFour(float u)
{
    return ( ((u / 3.0) * MThree(u)) + (((4.0 - u) / 3.0) * MThree(u - 1.0)) );
}

__device__ float MFive(float u)
{
    if (u < 0.0 || u > 5.0) return 0.0;
    return ( ((u * 0.25) * MFour(u)) + (((5.0 - u) * 0.25) * MFour(u - 1.0)) );
}
```

These are called from the following kernel:

```
__global__ void QTestCalculate(double *QTestArray, const double Charge, const int *Ks, const int *Ns, const double *u, const int n_size)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n_size)
    {
        int k3 = index % Ks[0];
        int k2 = ((index % (Ks[0] * Ks[1])) - k3) / Ks[0];
        int k1 = (index - k3 - (k2 * Ks[0])) / (Ks[0] * Ks[1]);
        float Q1 = 0.0, Q2 = 0.0, Q3 = 0.0;
        for (int n = Ns[0]; n <= 1; n++)
        {
            Q1 += MFive(float(u[0]) - float(k1) - float(n * Ks[0]));
        }
        for (int m = Ns[1]; m <= 1; m++)
        {
            Q2 += MFive(float(u[1]) - float(k2) - float(m * Ks[1]));
        }
        for (int o = Ns[2]; o <= 1; o++)
        {
            Q3 += MFive(u[2] - float(k3) - float(o * Ks[2]));
        }
        QTestArray[index] = (Charge * Q1 * Q2 * Q3);
    }
    __syncthreads();
}
```
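
To make the index arithmetic in the kernel easier to check, here is a small host-side sketch that mirrors it. The names GridIndex, decompose, recompose and roundTrips are hypothetical helpers for illustration only, and K0, K1, K2 stand in for Ks[0], Ks[1], Ks[2].

```cpp
// Hypothetical host-side mirror of the kernel's index arithmetic.
// K0 and K1 play the role of Ks[0] and Ks[1].
struct GridIndex { int k1, k2, k3; };

inline GridIndex decompose(int index, int K0, int K1)
{
    GridIndex g;
    g.k3 = index % K0;                              // fastest-varying index
    g.k2 = ((index % (K0 * K1)) - g.k3) / K0;       // middle index
    g.k1 = (index - g.k3 - g.k2 * K0) / (K0 * K1);  // slowest-varying index
    return g;
}

// Inverse mapping: rebuild the flat index from (k1, k2, k3).
inline int recompose(const GridIndex &g, int K0, int K1)
{
    return g.k3 + g.k2 * K0 + g.k1 * K0 * K1;
}

// Returns true if every flat index in [0, K0*K1*K2) survives a round trip
// and each component stays inside its expected range.
inline bool roundTrips(int K0, int K1, int K2)
{
    for (int index = 0; index < K0 * K1 * K2; ++index)
    {
        GridIndex g = decompose(index, K0, K1);
        bool inRange = g.k3 >= 0 && g.k3 < K0 &&
                       g.k2 >= 0 && g.k2 < K1 &&
                       g.k1 >= 0 && g.k1 < K2;
        if (!inRange || recompose(g, K0, K1) != index) return false;
    }
    return true;
}
```

With this layout k3 runs over Ks[0] values and k2 over Ks[1] values, while the kernel pairs k3 with Ks[2] and k1 with Ks[0] inside the MFive arguments. If the three entries of Ks differ, that pairing may be worth double-checking.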

My issue is this:

From a different kernel, I can pass a precalculated value of u into MFive (and even MSeven, which is not shown here), so I know that my M functions calculate the correct answer. I’ve also validated the arguments going into the MFive() calls here, and they should produce valid, nonzero return values.
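
For cross-checking values copied back from the device, a host-side reference could look like the sketch below. It follows the same recursion as the device functions but uses float literals and fabsf throughout. MTwoRef and MRef are hypothetical names, and treating any order below 2 like order 2 is an assumption made only to keep the sketch short.

```cpp
#include <math.h>

// Host-side reference: order-2 cardinal B-spline (hat function on [0, 2]).
inline float MTwoRef(float u)
{
    if (u >= 0.0f && u <= 2.0f)
        return 1.0f - fabsf(u - 1.0f);
    return 0.0f;
}

// Order-n cardinal B-spline via the same recursion as MThree..MFive:
//   M_n(u) = u/(n-1) * M_{n-1}(u) + (n-u)/(n-1) * M_{n-1}(u-1)
// Assumption: orders below 2 are treated like order 2.
inline float MRef(int order, float u)
{
    if (order <= 2) return MTwoRef(u);
    const float d = float(order - 1);
    return (u / d) * MRef(order - 1, u)
         + ((float(order) - u) / d) * MRef(order - 1, u - 1.0f);
}
```

Calling MRef(5, x) on the host with the same argument x that is passed to MFive() gives a number to compare against; outside [0, 5] it returns 0 because every order-2 term vanishes there.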

When compiling without the -arch=sm_20 flag, I also get values for Q1, Q2 and Q3 and can return each of them individually. However, as soon as I multiply them (Q1 * Q2, Q2 * Q3, Q1 * Q3, or all three), the result is 0.0.

When compiling with the -arch=sm_20 flag, Q1, Q2 and Q3 are all zero.

When I use a spline order of 7 (the M functions are cardinal B-splines), I can get valid return values as long as I simplify as far as MThree = return u; if I change that to MThree = return u / 1.0; I get 0.0.

So I’m guessing that I’m running out of registers, stack space or something similar.

Is there any way to improve on this or avoid the problem? I’m fairly new to CUDA, so I would appreciate any input on how to solve this better.
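
One direction that might be worth trying, assuming the nested calls really are the problem (which is not verified here), is to evaluate the spline with loops instead of a chain of function calls. The sketch below is a minimal, untuned version: HD, kMaxOrder and MIter are hypothetical names, and the upper bound of 8 on the order is an arbitrary assumption.

```cpp
#include <math.h>

// HD is a hypothetical helper macro so the same function builds
// as plain C++ on the host and as device code under nvcc.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

enum { kMaxOrder = 8 };  // assumed upper bound on the spline order

// Loop-based evaluation of the order-n cardinal B-spline at u.
// v[j] starts as M_2(u - j) and is raised one order per outer pass,
// so v[0] ends up holding M_order(u). No nested function calls are needed.
HD inline float MIter(int order, float u)
{
    if (order < 2 || order > kMaxOrder) return 0.0f;  // unsupported order (a choice made for this sketch)

    float v[kMaxOrder - 1];
    for (int j = 0; j <= order - 2; ++j)
    {
        const float x = u - float(j);
        v[j] = (x >= 0.0f && x <= 2.0f) ? 1.0f - fabsf(x - 1.0f) : 0.0f;
    }
    for (int m = 3; m <= order; ++m)
    {
        const float d = float(m - 1);
        for (int j = 0; j <= order - m; ++j)
        {
            const float x = u - float(j);
            // v[j + 1] still holds the order m-1 value at this point.
            v[j] = (x / d) * v[j] + ((float(m) - x) / d) * v[j + 1];
        }
    }
    return v[0];
}
```

MIter(5, x) would stand in for MFive(x). Whether this actually lowers register or stack pressure would have to be measured, since the small local array may itself end up in local memory.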

Best regards,

Bjørnar Jensen.