Basic reduction with CUDA

I am a beginner programmer in CUDA. As a beginner exercise, I thought I’d follow and write some reduction algorithms. What my program should do is exactly as follows:

It takes a large float array as input. The input array is simply an array starting at 0 and incrementing by 1.

It carries out a partial reduction on the array. The result should be 1/512 th of the initial array and each element of the result is the sum of 512 elements of the input array.

The main program does the same summation in the CPU and compares the results.

This is my code so far:

#include <stdio.h>
#include <math.h>

#define THREAD_COUNT 512

//-----------------this is basically the first reduction algorithm given in the reference----------
__global__ void gbest_power_sums1(float* result, float* powers){

        __shared__ float sdata[THREAD_COUNT];
        unsigned int tid = threadIdx.x;
        long i = blockIdx.x*blockDim.x + threadIdx.x;
        sdata[tid] = powers[i];

        unsigned int s;
        for(s=1; s<blockDim.x; s *= 2){
                if(tid % (2*s) == 0){
                        sdata[tid] += sdata[tid + s];

        if(tid == 0)
                result[blockIdx.x] = sdata[0];

int main(){

        unsigned long input_size = 1<<18;

        unsigned long output_size = input_size / THREAD_COUNT;

//------------prepare the input----------------------------------------------------------------
        float* input;
        float* input_dev;
        input = (float*) malloc(sizeof(float) * input_size);
        cudaMalloc(&input_dev, sizeof(float) * input_size);

        long i;
        for(i=0; i<input_size; i++){
                input[i] = i;
        cudaMemcpy(input_dev, input, sizeof(float) * input_size, cudaMemcpyHostToDevice);

float* result;
        float* result_dev;
        result = (float*) malloc(sizeof(float) * output_size);
        cudaMalloc(&result_dev, sizeof(float) * output_size);

//-------------------------carry out reduction----------------------------------------------------

        gbest_power_sums1<<<output_size, THREAD_COUNT>>>(result_dev, input_dev);
        cudaMemcpy(result, result_dev, sizeof(float) * output_size, cudaMemcpyDeviceToHost);

        cudaError_t error = cudaPeekAtLastError();
        if(error != cudaSuccess){
                printf("%s\n", cudaGetErrorString(error));

//--------------------calculate the result in CPU for verification-------------------------------

        int j;
        for(i=0; i<output_size; i++){
                float actual_result = 0.0f;
                for(j=0; j<THREAD_COUNT; j++){
                        actual_result += input[i*THREAD_COUNT + j];
                if(fabs(actual_result - result[i]) > 0.1){
                        printf("Error at i=%d, expected:%f, got:%f\n", i, actual_result, result[i]);

        printf("test passed!\n");
        return 0;


The code works but only up to the size of 2^17 of input array. When I try 2^18 results calculated by the GPU differs from that calculated by the CPU. The printed output I get from an array of size 2^18 is:

Error at i=65, expected:17170172.000000, got:17170176.000000

There is a difference of 4.0 here. I don’t understand why. Any help would be appreciated.

You’re using float datatype, which has approximately 5-6 digits of decimal significant figures. Beyond that results are just “noise”. Furthermore, since float addition is not associative, the order of actual floating point additions matters for a reproducible result. So if your CPU and GPU code don’t do the same operations in the same order (they don’t) then differences are possible, depending on the data.