I hope someone can share some insight into the problem I am having. A small program takes an array of doubles (100000 elements), adds together a subset of this array, and assigns the result to an element of another array. This seems like a very basic task, but my GT630 exhibits strange behavior: the code works (though rather slowly) for small values of intervalLength, but as soon as intervalLength reaches around 300, the code fails.
What’s more interesting is that the problem seems to lie not in the addition but in assigning the result back to the output array. If the final assignment in the code below
output_dev[threadIdx.x] = totalSum;
is changed to
output_dev[threadIdx.x] = input_dev[0];
then the code runs extremely fast, at least 100 times faster, and works for any value of intervalLength, however large. Also, if a line
totalSum=1;
precedes the assignment, then the code also runs fast and without errors. Some experimentation also showed that if the sum is calculated as a series of statements rather than in a loop, the code also works fine.
I am using a GT630 4GB with 96 CUDA cores, launching 96 threads in one block.
The code:
extern "C" __global__ void TestCompute(double* input_dev, int input_devLen0, int* args_dev, int args_devLen0, double* output_dev, int output_devLen0)
{
    // NOTE: num is not declared in the snippet as posted; it is assumed here
    // to be the global thread index.
    int num = blockIdx.x * blockDim.x + threadIdx.x;
    int intervalLength = args_dev[0];
    double totalSum = input_dev[num];
    if (num < input_devLen0)
    {
        for (int k = 0; k <= input_devLen0; k++)
        {
            totalSum = 0.0;
            // Sum the first intervalLength elements, staying inside the input array.
            for (int i = 0; i < intervalLength; i++)
            {
                if (input_devLen0 > i)
                {
                    totalSum += input_dev[i];
                }
            }
            if (output_devLen0 > threadIdx.x)
            {
                output_dev[threadIdx.x] = totalSum; // input_dev[0];
            }
        }
    }
}
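For comparison, the value each thread is expected to write can be sketched on the CPU. This is only a rough reference under stated assumptions: the function name referenceSum and the use of std::vector are hypothetical and not part of the original program. It mirrors the inner loop only, since the outer loop over k recomputes the same sum on every pass.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference (not part of the original program): returns the
// sum of the first min(intervalLength, input.size()) elements, which is what
// the kernel's inner loop accumulates into totalSum.
double referenceSum(const std::vector<double>& input, int intervalLength)
{
    // Mirror the kernel's guard "if (input_devLen0 > i)" by clamping the count;
    // a non-positive intervalLength means the loop body never runs.
    const std::size_t requested =
        intervalLength > 0 ? static_cast<std::size_t>(intervalLength) : 0;
    const std::size_t count = std::min(input.size(), requested);

    double totalSum = 0.0;
    for (std::size_t i = 0; i < count; ++i)
    {
        totalSum += input[i];
    }
    return totalSum;
}
```

With an input of 100000 ones and intervalLength = 300, every element of the output array that a thread writes should equal this reference value.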
Many thanks!