Reduction kernel issues with data larger than 16M floats

Hello,

I need to calculate the mean values for a collection of vectors, certainly not the most difficult problem. However, the code below stops producing correct results once the total number of elements exceeds 16M (i.e. 64 MB of float data). I usually use around 12000 vectors, each 2048 elements long. I have tried longer-but-fewer vectors and vice versa; the incorrect output always appears beyond 16M elements.

Up to that limit, a unit test on the CPU produces the same output; above it there is a small discrepancy. When printed at standard precision, the CPU and GPU results look identical, yet the comparison returns false. With vectors of length 2048, the difference between CPU and GPU output appears from the 8204th vector onward, but only for every second vector.

I am puzzled; perhaps the code hits some compute-capability limit. But unless I get the same results on GPU and CPU, or at least understand the difference, I cannot use the GPU version. I am using a K20m on Ubuntu 12.04 Linux.

I am looking forward to any comments, or perhaps a fix for something I have overlooked.

The kernel (certainly not optimal, but adequate, as it will be extended later):

#define SHM_LENGTH 256

__global__ void vectorCalculateMeans_Kernel(
    float*  vectorValues,// [numVectors][vectorLength]
    float*  vectorMeans, // [numVectors]
    int     vectorLength
    )
{
    extern __shared__ float shm[];//[SHM_LENGTH]

    // point to the start of the vector for the block in question
    vectorValues += blockIdx.x * vectorLength;

    //this thread index
    int threadIndex = threadIdx.x;

    //initialise accumulator
    shm[threadIndex] = 0.0f;

    //load and accumulate into SHM summing over all segments
    int numSegments = (vectorLength + SHM_LENGTH - 1) / SHM_LENGTH;
    for (int segment = 0, vectorIndex = threadIndex; segment < numSegments; ++segment)
    {
        if (vectorIndex < vectorLength)
        {
            //accumulate values into shm in parallel; the reads are coalesced
            shm[threadIndex] += vectorValues[vectorIndex];
        }
        vectorIndex += SHM_LENGTH;
    }

    __syncthreads();

    //sum all SHM with decreasing stride
    for (int stride = SHM_LENGTH/2; stride > 0; stride >>= 1)
    {
        if (threadIndex < stride)
        {
            shm[threadIndex] += shm[threadIndex + stride];
        }
        __syncthreads();
    }

    if (threadIndex == 0) vectorMeans[blockIdx.x] = shm[0] / float(vectorLength);
}

The kernel call:

bool vectorCalculateMeans(
    float*          vectorValues, //[numVectors][vectorLength]
    float*          vectorMeans,  //[numVectors]
    int             numVectors,
    int             vectorLength,
    cudaStream_t    stream
    )
{
    dim3 grid(numVectors);
    dim3 block(SHM_LENGTH);

    size_t shmMemSize = SHM_LENGTH * sizeof(float);

    vectorCalculateMeans_Kernel<<< grid, block, shmMemSize, stream >>>(vectorValues, vectorMeans, vectorLength);

    CHECK_CUDA_ERRORS();

    return true;
}

The CPU reference test:

bool cpuCalculateMeans(
    float* vectorValues, // [numVectors][vectorLength]
    float* vectorMeans,  // [numVectors]
    int    numVectors,
    int    vectorLength
    )
{
    //for every vector
    for(int n = 0; n < numVectors; n++)
    {
        float sum = 0.0f;
        for(int vectorIndex = 0; vectorIndex < vectorLength; vectorIndex++)
        {
            sum += vectorValues[vectorIndex];
        }

        vectorMeans[n] = sum / float(vectorLength);

        //next vector
        vectorValues += vectorLength;
    }
    return true;
}

Small differences in sum reductions and similar operations are not uncommon when comparing CPU and GPU results. Floating-point addition is not associative, so a different order of operations can produce different results, and your CPU and GPU implementations clearly do not use the same order of operations.

There are various papers that cover this, and numerous questions all over the web discuss similar issues. Here are a couple of references:

http://www.lsi.upc.edu/~robert/teaching/master/material/p5-goldberg.pdf
http://docs.nvidia.com/cuda/floating-point/index.html#abstract
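
As a quick illustration (plain host C++, nothing CUDA-specific; the values are chosen only to make the rounding visible), summing the same four values sequentially versus pairwise, the order a tree reduction effectively uses, gives different float results:

#include <cstdio>

int main()
{
    float a = 16777216.0f; // 2^24: above this, float32 spacing is 2
    float b = 1.0f, c = 1.0f, d = 1.0f;

    // left-to-right: each +1.0f falls below the rounding
    // threshold of the running sum and is lost
    float sequential = ((a + b) + c) + d;   // 16777216.0

    // pairwise (tree) order: c + d = 2.0f is large enough to survive
    float pairwise = (a + b) + (c + d);     // 16777218.0

    printf("sequential = %.1f\n", sequential);
    printf("pairwise   = %.1f\n", pairwise);
    return 0;
}

Both results are "wrong" relative to the exact answer (16777219), just differently so.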

I can anticipate some of your questions (why does it only happen after 8204 vectors? etc.), but I probably could not answer them without a complete example: code that I can copy, paste, compile, and run to observe the issue, without having to add or change anything.

Hello,

many thanks for the reply; I suspected something like that. The order of addition in floating-point operations does matter.

And, as you anticipated, I do wonder why it happens only after 8204 vectors. Does this suggest the code executes in a different order for different thread-block launches? Should I launch the kernel in stages?

Thanks for the links; I will read more about this and change my success criterion to the difference being within certain bounds. I need to find sensible values.
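
Something like this combined absolute/relative check is what I have in mind (just a sketch; the absTol/relTol values are placeholders I still need to tune):

#include <algorithm>
#include <cmath>

// Sketch of a tolerance-based comparison: absTol catches values near
// zero, relTol scales with the magnitude of the operands.
bool nearlyEqual(float cpu, float gpu,
                 float absTol = 1e-6f, float relTol = 1e-5f)
{
    float diff = std::fabs(cpu - gpu);
    if (diff <= absTol) return true;
    float larger = std::max(std::fabs(cpu), std::fabs(gpu));
    return diff <= larger * relTol;
}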

Kind regards,
peter

Are all vectors equivalent? It may be a property of the specific vector values.

Hello,

indeed, the values are not the same for different vectors; they increase with every vector:

float* vectorValues = h_cpuValues.data();
for(int n = 0; n < numVectors; n++)
{
    std::fill(vectorValues, vectorValues+vectorLength, float(n+1));
    vectorValues += vectorLength;
}

This suggests there may be a step in the floating-point representation around the value 8204 that makes the result more sensitive to the order in which the sums are built.

I have also tried filling the vectors with random values; the errors then seem to even out at times.

Kind regards,
peter

2048 * 8192 = 16,777,216 = 2^24, while float32 has a 24-bit mantissa, so integers are exact only up to 2^24. The first ~8192 vector sums can therefore be represented exactly by float32 and there is no error; beyond that, odd sums are rounded to the nearest even number, which is why only every second vector differs.
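
A quick host-side check makes the boundary visible (just a sketch):

#include <cstdio>

int main()
{
    // Up to 2^24 every integer has an exact float32 representation.
    printf("%.1f\n", (float)16777215); // 16777215.0 (2^24 - 1, exact)
    printf("%.1f\n", (float)16777216); // 16777216.0 (2^24, exact)

    // Above 2^24 the spacing between adjacent float32 values is 2,
    // so odd integers are rounded to a neighbouring even value.
    printf("%.1f\n", (float)16777217); // 16777216.0 (rounded)
    printf("%.1f\n", (float)16777218); // 16777218.0 (even, exact)
    return 0;
}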