Reduction results differ every time

surekenlev32 · April 9, 2019, 10:46am

I wrote this code:

#include <iostream>
#include <fstream>

template <typename T1, typename T2>
__global__ void vec_rm(T1 *dev_in, T1 *dev_out, T2 *dev_size)
{
    extern __shared__ double arr[];
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < *dev_size)
    {
        int tid = threadIdx.x;
        arr[tid] = dev_in[i];
        __syncthreads();
        for (int s = 1; s < blockDim.x; s *= 2)
        {
            if (tid % (2*s) == 0)
            {
                arr[tid] += arr[tid+s];
            }
            __syncthreads();
        }

        printf("dev_in[%d] = %lf\n", i, dev_in[i]);
        if (tid == 0)
        {
            dev_out[blockIdx.x] = arr[0];
            printf("dev_block[%d] = %lf\n", blockIdx.x, arr[0]);
        }
    }
}

int main()
{
    double a[] = {1,2,3,90,28,45,-8};
    int size = 7, TC = 3, BL = size / TC,*dev_size;
    double sum = 0,*dev_a, *dev_result;
    double result[BL];
    cudaMalloc((void**)&dev_result, BL*sizeof(double));
    if (size%TC == 0)
    {
        cudaMalloc( (void**)&dev_a, size*sizeof(double));
        cudaMemcpy( dev_a, a, size*sizeof(double), cudaMemcpyHostToDevice);
        cudaMalloc((void**)&dev_size, sizeof(int));
        cudaMemcpy(dev_size, &size, sizeof(int), cudaMemcpyHostToDevice);
        vec_rm<<<BL,TC, TC*sizeof(double)>>>(dev_a, dev_result, dev_size);
    }
    else
    {
        int gpu_size = size/TC*TC;
        cudaMalloc( (void**)&dev_a, gpu_size*sizeof(double));
        cudaMemcpy( dev_a, a, gpu_size*sizeof(double), cudaMemcpyHostToDevice);
        cudaMalloc((void**)&dev_size, sizeof(int));
        cudaMemcpy(dev_size, &gpu_size, sizeof(int), cudaMemcpyHostToDevice);
        vec_rm<<<BL,TC, TC*sizeof(double)>>>(dev_a, dev_result, dev_size);
        for (int k = gpu_size; k < size; k++)
        {
            sum += a[k];
        }
        std::cout << "CPU sum = " << sum << std::endl;
    }
    cudaMemcpy(result, dev_result, BL*sizeof(double), cudaMemcpyDeviceToHost);
    for (int k = 0; k < BL; ++k)
    {
        sum += result[k];
    }
    std::cout << "CPU+GPU sum = " << sum << std::endl;
    return 0;
}

I get a different result on every launch. I have tried to find the error, but I don’t see a race condition or anything else wrong.

saulocpp · April 9, 2019, 12:21pm

There is a working code here:
https://devtalk.nvidia.com/default/topic/1038617/understanding-and-adjusting-mark-harriss-array-reduction/?offset=8#5277985

You can just copy/paste/compile/compare to yours and fix accordingly.
Or even better, use Thrust’s reduction.
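
For readers who want to try the Thrust route, here is a minimal sketch of what that could look like. The helper name `sum_with_thrust` is hypothetical and not from this thread; it assumes the input already sits in a host `std::vector<double>` and simply hands the summation to `thrust::reduce`.

```cuda
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>
#include <vector>

// Hypothetical helper (not from the thread): copies the host values to the
// device and sums them there with thrust::reduce.
double sum_with_thrust(const std::vector<double>& host_values)
{
    // Constructing a device_vector from a host iterator range performs
    // the host-to-device copy.
    thrust::device_vector<double> dev_values(host_values.begin(), host_values.end());

    // 0.0 is the initial value of the sum; thrust::plus<double> is the operator.
    return thrust::reduce(dev_values.begin(), dev_values.end(), 0.0,
                          thrust::plus<double>());
}
```

Called on the array from the question, `sum_with_thrust({1.0, 2.0, 3.0, 90.0, 28.0, 45.0, -8.0})` should return 161, with no need to split the work between CPU and GPU by hand.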

Robert_Crovella · April 9, 2019, 2:30pm

If you want to start to understand what is wrong with your code:

1. Run your code with cuda-memcheck.
2. Apply the method from the Stack Overflow question "cuda - Unspecified launch failure on Memcpy" to start to debug your code.
3. After fixing your code, when cuda-memcheck reports no errors, check your code with the cuda-memcheck racecheck tool. Refer to the cuda-memcheck documentation to learn how to use this sub-tool.
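
As an illustration of what "fixing your code" might involve, here is one possible reading of the posted kernel; the thread itself does not confirm it. With 3 threads per block, thread 2 passes the `tid % (2*s) == 0` test at `s = 1` and reads `arr[3]`, one element past the `TC*sizeof(double)` bytes of shared memory requested at launch. A read like that returns whatever happens to be there, which would be consistent with a sum that changes between runs. The sketch below adds a bounds guard on `tid + s` and moves `__syncthreads()` out of the `if (i < n)` branch. The names `block_sum` and `reduce_sum` are hypothetical, and the kernel is de-templated and takes the size by value to keep the example short.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each block sums up to blockDim.x elements into block_sums[blockIdx.x].
__global__ void block_sum(const double* in, double* block_sums, int n)
{
    extern __shared__ double partial[];
    int tid = threadIdx.x;
    int block_threads = blockDim.x;
    int i = block_threads * blockIdx.x + tid;

    // Out-of-range threads contribute 0 instead of skipping the body,
    // so every thread in the block reaches each __syncthreads().
    partial[tid] = (i < n) ? in[i] : 0.0;
    __syncthreads();

    for (int s = 1; s < block_threads; s *= 2)
    {
        // The second condition keeps tid + s inside the shared array when
        // the block size is not a power of two (e.g. 3 threads per block).
        if (tid % (2 * s) == 0 && tid + s < block_threads)
        {
            partial[tid] += partial[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0)
    {
        block_sums[blockIdx.x] = partial[0];
    }
}

// Hypothetical host wrapper: rounds the block count up so no tail elements
// are left for the CPU, then adds the per-block sums on the host.
// Error checking is omitted here for brevity.
double reduce_sum(const std::vector<double>& values, int threads_per_block)
{
    int n = static_cast<int>(values.size());
    if (n == 0) return 0.0;
    int blocks = (n + threads_per_block - 1) / threads_per_block;

    double* dev_in = nullptr;
    double* dev_block_sums = nullptr;
    cudaMalloc(&dev_in, n * sizeof(double));
    cudaMalloc(&dev_block_sums, blocks * sizeof(double));
    cudaMemcpy(dev_in, values.data(), n * sizeof(double), cudaMemcpyHostToDevice);

    block_sum<<<blocks, threads_per_block, threads_per_block * sizeof(double)>>>(
        dev_in, dev_block_sums, n);

    std::vector<double> block_sums(blocks);
    cudaMemcpy(block_sums.data(), dev_block_sums, blocks * sizeof(double),
               cudaMemcpyDeviceToHost);
    cudaFree(dev_in);
    cudaFree(dev_block_sums);

    double total = 0.0;
    for (double s : block_sums) total += s;
    return total;
}
```

With the 7-element array from the question and 3 threads per block, this launches 3 blocks (the last one padded with zeros), and the per-block sums 6, 163 and -8 add up to 161.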

In the future, my recommendation is that you do proper CUDA error checking (google that) and run your code with cuda-memcheck, before asking others for help. Even if you don’t understand the error output, it will be useful to others who may try to help you.
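
A common shape for "proper CUDA error checking" is a small macro wrapped around every runtime call, plus two checks after each kernel launch. The sketch below is one way to write it, not the only one; the names `CUDA_CHECK`, `noop_kernel` and `checked_launch_example` are hypothetical and chosen for illustration.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical macro name: prints the CUDA error string with file and line,
// then exits, whenever a runtime call does not return cudaSuccess.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",             \
                         cudaGetErrorString(err_), __FILE__, __LINE__);   \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Placeholder kernel, standing in for a real one such as vec_rm.
__global__ void noop_kernel() {}

// Hypothetical helper: returns true if the launch and the kernel both
// completed without error (otherwise CUDA_CHECK has already exited).
bool checked_launch_example()
{
    noop_kernel<<<1, 1>>>();
    CUDA_CHECK(cudaGetLastError());       // catches launch-configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());  // catches errors raised while the kernel ran
    return true;
}
```

The same macro can wrap the `cudaMalloc` and `cudaMemcpy` calls in the original `main`. Once the program is built, `cuda-memcheck ./a.out` and `cuda-memcheck --tool racecheck ./a.out` run the two checks mentioned above (the executable name `./a.out` is an assumption).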
