Reduction kernel issues with data larger than 16M floats

Hello,

I need to calculate the mean values for a collection of vectors, certainly not the most difficult problem. However, the code below stops producing correct results once the total number of elements exceeds 16M (i.e. 64 MB of float data). I usually use around 12000 vectors, each 2048 elements long. I have tried longer-but-fewer vectors and vice versa; the incorrect output always appears beyond 16M elements.

Up to that limit, a unit test on the CPU produces the same output; above it there is a small discrepancy. When printed at standard precision, the CPU and GPU results look identical, yet the comparison returns false. With vectors of length 2048, the difference between CPU and GPU output appears from the 8204th vector onward, but only for every second vector.

I am puzzled; perhaps the code hits some compute-capability limit. But unless I get the same results on GPU and CPU, or at least understand the difference, I cannot use the GPU version. I am using a K20m on Ubuntu 12.04 Linux.

I am looking forward to any comments, or perhaps a fix for something I have overlooked.

The kernel (certainly not optimal, but adequate, as it will be extended later):

#define SHM_LENGTH 256

__global__ void vectorCalculateMeans_Kernel(
    float*  vectorValues,// [numVectors][vectorLength]
    float*  vectorMeans, // [numVectors]
    int     vectorLength
    )
{
    extern __shared__ float shm[];//[SHM_LENGTH]

    // point to the start of the vector for the block in question
    vectorValues += blockIdx.x * vectorLength;

    //this thread index
    int threadIndex = threadIdx.x;

    //initialise accumulator
    shm[threadIndex] = 0.0f;

    //load and accumulate into SHM summing over all segments
    int numSegments = (vectorLength + SHM_LENGTH - 1) / SHM_LENGTH;
    for (int segment = 0, vectorIndex = threadIndex; segment < numSegments; ++segment)
    {
        if (vectorIndex < vectorLength)
        {
            //accumulate values into shm in parallel; the reads are coalesced
            shm[threadIndex] += vectorValues[vectorIndex];
        }
        vectorIndex += SHM_LENGTH;
    }

    __syncthreads();

    //sum all SHM with decreasing stride
    for (int stride = SHM_LENGTH/2; stride > 0; stride >>= 1)
    {
        if (threadIndex < stride)
        {
            shm[threadIndex] += shm[threadIndex + stride];
        }
        __syncthreads();
    }

    if (threadIndex == 0) vectorMeans[blockIdx.x] = shm[0] / float(vectorLength);
}

The kernel call:

bool vectorCalculateMeans(
    float*          vectorValues, //[numVectors][vectorLength]
    float*          vectorMeans,  //[numVectors]
    int             numVectors,
    int             vectorLength,
    cudaStream_t    stream
    )
{
    dim3 grid(numVectors);
    dim3 block(SHM_LENGTH);

    size_t shmMemSize = SHM_LENGTH * sizeof(float);

    vectorCalculateMeans_Kernel<<< grid, block, shmMemSize, stream >>>(vectorValues, vectorMeans, vectorLength);

    CHECK_CUDA_ERRORS();

    return true;
}

The CPU reference test:

bool cpuCalculateMeans(
    float* vectorValues, // [numVectors][vectorLength]
    float* vectorMeans,  // [numVectors]
    int    numVectors,
    int    vectorLength
    )
{
    //for every vector
    for(int n = 0; n < numVectors; n++)
    {
        float sum = 0.0f;
        for(int vectorIndex = 0; vectorIndex < vectorLength; vectorIndex++)
        {
            sum += vectorValues[vectorIndex];
        }

        vectorMeans[n] = sum / float(vectorLength);

        //next vector
        vectorValues += vectorLength;
    }
    return true;
}

Small differences in sum reductions and similar operations are not uncommon when comparing CPU and GPU results. Floating-point addition is not associative, so a different order of operations can produce different results, and your CPU and GPU implementations clearly do not use the same order of operations.

There are various papers that cover this, and numerous questions all over the web discuss similar issues. Here are a couple of references:

http://www.lsi.upc.edu/~robert/teaching/master/material/p5-goldberg.pdf
http://docs.nvidia.com/cuda/floating-point/index.html#abstract
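
As a quick illustration (plain host C++, nothing CUDA-specific; the values are chosen only to make the rounding visible), summing the same four values sequentially versus pairwise, the order a tree reduction effectively uses, gives different float results:

#include <cstdio>

int main()
{
    float a = 16777216.0f; // 2^24: above this, float32 spacing is 2
    float b = 1.0f, c = 1.0f, d = 1.0f;

    // left-to-right: each +1.0f falls below the rounding
    // threshold of the running sum and is lost
    float sequential = ((a + b) + c) + d;   // 16777216.0

    // pairwise (tree) order: c + d = 2.0f is large enough to survive
    float pairwise = (a + b) + (c + d);     // 16777218.0

    printf("sequential = %.1f\n", sequential);
    printf("pairwise   = %.1f\n", pairwise);
    return 0;
}

Both results are "wrong" relative to the exact answer (16777219), just differently so.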

I can anticipate some of your questions (why does it only happen after 8204 vectors? etc.), but I probably could not answer them without a complete example: code that I can copy, paste, compile, and run to observe the issue, without having to add or change anything.

Hello,

many thanks for the reply; I suspected something like that. The order of addition in floating-point operations does matter.

And, as you anticipated, I do wonder why it happens only after 8204 vectors. Does this suggest the code executes in a different order for different thread-block launches? Should I launch the kernel in stages?

Thanks for the links; I will read more about this and change my success criterion to the difference being within certain bounds. I need to find sensible values.
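
Something like this combined absolute/relative check is what I have in mind (just a sketch; the absTol/relTol values are placeholders I still need to tune):

#include <algorithm>
#include <cmath>

// Sketch of a tolerance-based comparison: absTol catches values near
// zero, relTol scales with the magnitude of the operands.
bool nearlyEqual(float cpu, float gpu,
                 float absTol = 1e-6f, float relTol = 1e-5f)
{
    float diff = std::fabs(cpu - gpu);
    if (diff <= absTol) return true;
    float larger = std::max(std::fabs(cpu), std::fabs(gpu));
    return diff <= larger * relTol;
}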

Kind regards,
peter

Are all vectors equivalent? It may be a property of the specific vector values.

Hello,

indeed, the values are not the same for different vectors; they increase with every vector:

float* vectorValues = h_cpuValues.data();
for(int n = 0; n < numVectors; n++)
{
    std::fill(vectorValues, vectorValues+vectorLength, float(n+1));
    vectorValues += vectorLength;
}

This suggests there may be a step in the floating-point representation around the value 8204 that makes the result more sensitive to the order in which the sums are built.

I have also tried filling the vectors with random values; the errors then seem to even out at times.

Kind regards,
peter

2048 * 8192 = 16,777,216 = 2^24, while float32 has a 24-bit mantissa, so integers are exact only up to 2^24. The first ~8192 vector sums can therefore be represented exactly by float32 and there is no error; beyond that, odd sums are rounded to the nearest even number, which is why only every second vector differs.
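
A quick host-side check makes the boundary visible (just a sketch):

#include <cstdio>

int main()
{
    // Up to 2^24 every integer has an exact float32 representation.
    printf("%.1f\n", (float)16777215); // 16777215.0 (2^24 - 1, exact)
    printf("%.1f\n", (float)16777216); // 16777216.0 (2^24, exact)

    // Above 2^24 the spacing between adjacent float32 values is 2,
    // so odd integers are rounded to a neighbouring even value.
    printf("%.1f\n", (float)16777217); // 16777216.0 (rounded)
    printf("%.1f\n", (float)16777218); // 16777218.0 (even, exact)
    return 0;
}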