Hello,

I need to calculate the mean value of each vector in a collection of vectors, certainly not the most difficult problem. However, the code below stops producing correct results once the total number of elements exceeds 16M (i.e. 64 MB of float data). I typically use around 12000 vectors, each 2048 elements long. I have tried longer but fewer vectors and vice versa; the incorrect output always appears once the total exceeds 16M elements.

Up to that limit a unit test on the CPU produces identical output; above it there is a small discrepancy. Printing either output (CPU and GPU), the floating point values look identical at the default precision, yet the comparison returns false. With vectors of length 2048, the difference between CPU and GPU output first appears at the 8204th vector, and from then on only in every second vector.
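
For reference, a simplified sketch of how the two outputs are compared (the function name is just for this post; the actual unit test does an element-wise comparison of the two arrays):

```
#include <cstdio>

// simplified sketch of the element-wise comparison used in the unit test
bool compareMeans(const float* cpuMeans, const float* gpuMeans, int numVectors)
{
    bool same = true;
    for (int n = 0; n < numVectors; n++)
    {
        if (cpuMeans[n] != gpuMeans[n])
        {
            // print with enough digits to expose the difference
            printf("vector %d: cpu=%.9g gpu=%.9g\n", n, cpuMeans[n], gpuMeans[n]);
            same = false;
        }
    }
    return same;
}
```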

I am puzzled; perhaps the code hits some compute capability limit. But unless I get the same results on GPU and CPU, or at least understand the difference, I cannot use the GPU version. I am using a K20m on Linux (Ubuntu 12.04).
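
To rule out resource limits, this is the sort of check I can run (sketch only, not part of the failing code):

```
#include <cstdio>
#include <cuda_runtime.h>

// print the device limits that could plausibly be involved
void printDeviceLimits(int device)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    printf("%s, compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    printf("max grid size (x): %d\n", prop.maxGridSize[0]);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("shared memory per block: %zu bytes\n", (size_t)prop.sharedMemPerBlock);
}
```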

I would be grateful for any comments, or perhaps a fix for something I have overlooked.

The kernel (certainly not optimal, but adequate, since it will be extended later):

```
#define SHM_LENGTH 256

__global__ void vectorCalculateMeans_Kernel(
    float* vectorValues, // [numVectors][vectorLength]
    float* vectorMeans,  // [numVectors]
    int vectorLength
    )
{
    extern __shared__ float shm[]; // [SHM_LENGTH]
    // point to the start of the vector handled by this block
    vectorValues += blockIdx.x * vectorLength;
    // this thread's index
    int threadIndex = threadIdx.x;
    // initialise accumulator
    shm[threadIndex] = 0.0f;
    // load and accumulate into shared memory, summing over all segments
    int numSegments = (vectorLength + SHM_LENGTH - 1) / SHM_LENGTH;
    for (int segment = 0, vectorIndex = threadIndex; segment < numSegments; ++segment)
    {
        if (vectorIndex < vectorLength)
        {
            // load values into shm in parallel, coalesced reads
            shm[threadIndex] += vectorValues[vectorIndex];
        }
        vectorIndex += SHM_LENGTH;
    }
    __syncthreads();
    // reduce shared memory with decreasing stride
    for (int stride = SHM_LENGTH / 2; stride > 0; stride >>= 1)
    {
        if (threadIndex < stride)
        {
            shm[threadIndex] += shm[threadIndex + stride];
        }
        __syncthreads();
    }
    if (threadIndex == 0) vectorMeans[blockIdx.x] = shm[0] / float(vectorLength);
}
```

The kernel call:

```
bool vectorCalculateMeans(
    float* vectorValues, // [numVectors][vectorLength]
    float* vectorMeans,  // [numVectors]
    int numVectors,
    int vectorLength,
    cudaStream_t stream
    )
{
    dim3 grid(numVectors);
    dim3 block(SHM_LENGTH);
    unsigned long shmMemSize = SHM_LENGTH * sizeof(float);
    vectorCalculateMeans_Kernel<<< grid, block, shmMemSize, stream >>>(vectorValues, vectorMeans, vectorLength);
    CHECK_CUDA_ERRORS();
    return true;
}
```

The CPU reference test:

```
bool cpuCalculateMeans(
    float* vectorValues, // [numVectors][vectorLength]
    float* vectorMeans,  // [numVectors]
    int numVectors,
    int vectorLength
    )
{
    // for every vector
    for (int n = 0; n < numVectors; n++)
    {
        float sum = 0.0f;
        for (int vectorIndex = 0; vectorIndex < vectorLength; vectorIndex++)
        {
            sum += vectorValues[vectorIndex];
        }
        vectorMeans[n] = sum / float(vectorLength);
        // advance to the next vector
        vectorValues += vectorLength;
    }
    return true;
}
```
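
For completeness, a simplified sketch of how both paths are driven in the test, calling the two functions above (error checking and the real input data omitted; `runTest` is just a name for this post):

```
#include <cuda_runtime.h>
#include <vector>

// simplified test driver, e.g. runTest(12000, 2048)
void runTest(int numVectors, int vectorLength)
{
    size_t numElements = (size_t)numVectors * vectorLength;
    std::vector<float> hostValues(numElements, 1.0f); // the real test uses measured data
    std::vector<float> cpuMeans(numVectors), gpuMeans(numVectors);

    float* devValues = 0;
    float* devMeans = 0;
    cudaMalloc(&devValues, numElements * sizeof(float));
    cudaMalloc(&devMeans, numVectors * sizeof(float));
    cudaMemcpy(devValues, hostValues.data(), numElements * sizeof(float), cudaMemcpyHostToDevice);

    // GPU path (default stream)
    vectorCalculateMeans(devValues, devMeans, numVectors, vectorLength, 0);
    cudaMemcpy(gpuMeans.data(), devMeans, numVectors * sizeof(float), cudaMemcpyDeviceToHost);

    // CPU reference path
    cpuCalculateMeans(hostValues.data(), cpuMeans.data(), numVectors, vectorLength);

    // compare cpuMeans against gpuMeans as described above

    cudaFree(devValues);
    cudaFree(devMeans);
}
```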