Basic reduction with CUDA

pramoduh · March 22, 2018, 12:56am

I am a beginner programmer in CUDA. As a beginner exercise, I thought I’d follow http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf and write some reduction algorithms. What my program should do is exactly as follows:

It takes a large float array as input. The input array is simply an array starting at 0 and incrementing by 1.

It carries out a partial reduction on the array. The result should be 1/512 th of the initial array and each element of the result is the sum of 512 elements of the input array.

The main program does the same summation in the CPU and compares the results.

This is my code so far:

#include <stdio.h>
#include <math.h>

#define THREAD_COUNT 512

//-----------------this is basically the first reduction algorithm given in the reference----------
__global__ void gbest_power_sums1(float* result, float* powers){

        __shared__ float sdata[THREAD_COUNT];
        unsigned int tid = threadIdx.x;
        long i = blockIdx.x*blockDim.x + threadIdx.x;
        sdata[tid] = powers[i];
        __syncthreads();

        unsigned int s;
        for(s=1; s<blockDim.x; s *= 2){
                if(tid % (2*s) == 0){
                        sdata[tid] += sdata[tid + s];
                }
                __syncthreads();
        }

        if(tid == 0)
                result[blockIdx.x] = sdata[0];
}

int main(){

        unsigned long input_size = 1<<18;

        unsigned long output_size = input_size / THREAD_COUNT;

//------------prepare the input----------------------------------------------------------------
        float* input;
        float* input_dev;
        input = (float*) malloc(sizeof(float) * input_size);
        cudaMalloc(&input_dev, sizeof(float) * input_size);

        long i;
        for(i=0; i<input_size; i++){
                input[i] = i;
        }
        cudaMemcpy(input_dev, input, sizeof(float) * input_size, cudaMemcpyHostToDevice);
//-----------------------------------------------------------------------------------------------

float* result;
        float* result_dev;
        result = (float*) malloc(sizeof(float) * output_size);
        cudaMalloc(&result_dev, sizeof(float) * output_size);

//-------------------------carry out reduction----------------------------------------------------

        gbest_power_sums1<<<output_size, THREAD_COUNT>>>(result_dev, input_dev);
        cudaMemcpy(result, result_dev, sizeof(float) * output_size, cudaMemcpyDeviceToHost);

        cudaError_t error = cudaPeekAtLastError();
        if(error != cudaSuccess){
                printf("%s\n", cudaGetErrorString(error));
                exit(0);
        }

//--------------------calculate the result in CPU for verification-------------------------------

        int j;
        for(i=0; i<output_size; i++){
                float actual_result = 0.0f;
                for(j=0; j<THREAD_COUNT; j++){
                        actual_result += input[i*THREAD_COUNT + j];
                }
                if(fabs(actual_result - result[i]) > 0.1){
                        printf("Error at i=%d, expected:%f, got:%f\n", i, actual_result, result[i]);
                        exit(0);
                }
        }

        printf("test passed!\n");
        return 0;

}

The code works but only up to the size of 2^17 of input array. When I try 2^18 results calculated by the GPU differs from that calculated by the CPU. The printed output I get from an array of size 2^18 is:

Error at i=65, expected:17170172.000000, got:17170176.000000

There is a difference of 4.0 here. I don’t understand why. Any help would be appreciated.

Robert_Crovella · March 22, 2018, 1:13am

You’re using float datatype, which has approximately 5-6 digits of decimal significant figures. Beyond that results are just “noise”. Furthermore, since float addition is not associative, the order of actual floating point additions matters for a reproducible result. So if your CPU and GPU code don’t do the same operations in the same order (they don’t) then differences are possible, depending on the data.

[url]Floating Point and IEEE 754 :: CUDA Toolkit Documentation

Topic		Replies	Views
float reduction, cpu and cuda answers differ CUDA Programming and Performance	4	3331	April 1, 2008
Would like to share my speedy reduction code Very simple code! CUDA Programming and Performance	0	1092	July 29, 2010
CUDA slower than CPU? CUDA Programming and Performance	7	870	August 18, 2023
Reduction operation returns incorrect result CUDA Programming and Performance	1	414	November 18, 2018
GPU version silhouette calculation is equal or even slower then CPU version, anything wrong? CUDA Programming and Performance cuda	1	780	August 7, 2021
Parallel reduction problem CUDA Programming and Performance	1	5083	November 29, 2010
CUDA reduction precision issues CUDA Programming and Performance cuda	1	418	December 13, 2022
CUDA code giving wrong result CUDA Programming and Performance	0	403	May 4, 2020
Reduction & block dimension Using the easiest reduction example of the SDK CUDA Programming and Performance	6	2230	November 23, 2009
Thread block clusters and distributed shared memory not working as intended CUDA Programming and Performance	8	1551	November 8, 2023

Basic reduction with CUDA

Related topics