CUDA dot product atomics problem

timgr · February 26, 2012, 1:11am

I’m trying to implement the classic dot-product kernel for double-precision arrays, using an atomic operation to accumulate the final sum across the blocks. I used the atomicAdd for double precision given on page 116 of the programming guide. I’m probably doing something wrong. The partial sums across the threads in every block are computed correctly, but afterwards the atomic operation doesn’t seem to work properly: every time I run my kernel with the same data, I receive different results. I’d be grateful if somebody could spot the mistake or provide an alternative solution! Here is my kernel:

__global__ void cuda_dot_kernel(int *n, double *a, double *b, double *dot_res)
{
    __shared__ double cache[threadsPerBlock]; //thread shared memory
    int global_tid=threadIdx.x + blockIdx.x * blockDim.x;
    int i=0,cacheIndex=0;
    double temp = 0;

    cacheIndex = threadIdx.x;

    while (global_tid < (*n)) {
        temp += a[global_tid] * b[global_tid];
        global_tid += blockDim.x * gridDim.x;
    }

    cache[cacheIndex] = temp;
    __syncthreads();

    for (i=blockDim.x/2; i>0; i>>=1) {
        if (threadIdx.x < i) {
            cache[threadIdx.x] += cache[threadIdx.x + i];
        }
        __syncthreads();
    }

    __syncthreads();

    if (cacheIndex==0) {
        *dot_res=cuda_atomicAdd(dot_res,cache[0]);
    }
}

And here is my atomicAdd function:

__device__ double cuda_atomicAdd(double *address, double val)
{
    double assumed,old=*address;

    do {
        assumed=old;
        old= __longlong_as_double(atomicCAS((unsigned long long int*)address,
                                            __double_as_longlong(assumed),
                                            __double_as_longlong(val+assumed)));
    }while (assumed!=old);

    return old;
}

tera · February 26, 2012, 2:16am

Welcome to the wonderful world of floating point arithmetic! The rounding error of the sum of more than two floating point numbers depends on the order of the operations, as floating point addition is not associative. The order, however, is undefined by definition if you have any need for atomic operations at all.
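A minimal host-side sketch of the non-associativity described above. The helper name `sum_is_order_independent` is made up for illustration; the code uses only plain double arithmetic and compiles as ordinary host code under nvcc.

```cuda
#include <cstdio>

// Hypothetical helper: adds three doubles in two different groupings and
// reports whether both groupings give bit-for-bit the same result.
// volatile keeps the compiler from folding the two sums together.
bool sum_is_order_independent(double a, double b, double c)
{
    volatile double left  = (a + b) + c;  // add a and b first
    volatile double right = a + (b + c);  // add b and c first
    return left == right;
}

// Prints both groupings so the difference can be inspected.
void print_both_groupings(double a, double b, double c)
{
    printf("(a + b) + c = %.17g\n", (a + b) + c);
    printf("a + (b + c) = %.17g\n", a + (b + c));
}
```

With a = 1e16, b = -1e16 and c = 1.0 the two groupings differ: the 1.0 is lost when it is first added to -1e16, because adjacent doubles near 1e16 are 2 apart. Blocks reaching an atomic add in a different order on each run have the same effect on a larger scale.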

To get reproducible results, extend the reduction scheme all the way to the end result by either a) doing the sum over block results on the CPU, b) launching a second kernel with a single block to sum the block results, or c) having the last block to finish do the sum. Option c) is more complicated than the others, but it is described in appendix B.5 of the Programming Guide.
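Option a) could be sketched roughly as follows. This is not code from the thread: the names `dot_partial_kernel`, `dot_deterministic` and `kThreadsPerBlock`, the block size of 256 and the default of 32 blocks are all assumptions, and error checking is left out for brevity. Each block writes its partial sum to its own slot, and the host adds the slots in a fixed order.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreadsPerBlock = 256;  // assumed block size; must be a power of two

// Each block reduces its share of a[i] * b[i] and stores one partial sum.
__global__ void dot_partial_kernel(int n, const double *a, const double *b,
                                   double *block_sums)
{
    __shared__ double cache[kThreadsPerBlock];

    // Grid-stride loop: every thread accumulates its own slice of the products.
    double temp = 0.0;
    for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n;
         i += blockDim.x * gridDim.x) {
        temp += a[i] * b[i];
    }
    cache[threadIdx.x] = temp;
    __syncthreads();

    // Tree reduction in shared memory, same scheme as the kernel in the question.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            cache[threadIdx.x] += cache[threadIdx.x + s];
        }
        __syncthreads();
    }

    // One plain store per block: no atomics, so no ordering ambiguity.
    if (threadIdx.x == 0) {
        block_sums[blockIdx.x] = cache[0];
    }
}

// Hypothetical host wrapper; assumes a and b have the same length.
double dot_deterministic(const std::vector<double> &a,
                         const std::vector<double> &b, int blocks = 32)
{
    const int n = static_cast<int>(a.size());
    const size_t bytes = n * sizeof(double);
    double *d_a = nullptr, *d_b = nullptr, *d_sums = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_sums, blocks * sizeof(double));
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    dot_partial_kernel<<<blocks, kThreadsPerBlock>>>(n, d_a, d_b, d_sums);

    // This blocking copy also waits for the kernel to finish.
    std::vector<double> sums(blocks);
    cudaMemcpy(sums.data(), d_sums, blocks * sizeof(double),
               cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_sums);

    // Final sum on the CPU, always in block-index order.
    double total = 0.0;
    for (double s : sums) {
        total += s;
    }
    return total;
}
```

Because the order of every addition is fixed by thread and block indices, repeated runs with the same data and the same launch configuration should return the same value.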

timgr · February 26, 2012, 10:03am

Thanks a lot for your fast reply! So the obvious reason I get unreproducible results is the roundoff error? But by using double-precision variables, haven’t I minimized the effect of rounding on my results? I mean, what’s the difference between adding the gridDim.x numbers on the CPU and adding them with an atomic operation on the GPU? With double precision, aren’t the additions of 64-bit variables implemented with the same algorithm?

tera · February 26, 2012, 11:10am

Floating point addition gives the same results on the CPU and GPU provided the order of operations is the same (which it often isn’t because you want to arrange the algorithm on the GPU for maximal parallelism).

Well, there still is the possibility that something is wrong in your program. Do you clear *dot_res before invoking your kernel?
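Clearing the accumulator on the host might look like the sketch below. The helper name `alloc_zeroed_accumulator` is hypothetical; `cudaMalloc` and `cudaMemset` are the standard runtime calls.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: allocates a device-side double and clears it so that
// atomic adds start from zero. Returns nullptr if the allocation fails.
double *alloc_zeroed_accumulator()
{
    double *d_dot_res = nullptr;
    if (cudaMalloc(&d_dot_res, sizeof(double)) != cudaSuccess) {
        return nullptr;
    }
    // All-zero bytes are the IEEE 754 encoding of +0.0.
    cudaMemset(d_dot_res, 0, sizeof(double));
    return d_dot_res;
}
```

If the same buffer is reused for several kernel launches, the `cudaMemset` has to be repeated before each one; otherwise every launch adds on top of the previous result.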

timgr · February 26, 2012, 11:19am

Yeah! I actually found my mistake… When calling the cuda_atomicAdd function, I was assigning its return value back to *dot_res, which is wrong and unnecessary: the function returns the old value, and *dot_res is already updated internally! The right code would be:

Kernel Call:

if (cacheIndex==0) {
    cuda_atomicAdd(dot_res,cache[0]);
}

Atomic function:

__device__ void cuda_atomicAdd(double *address, double val)
{
    double assumed,old=*address;

    do {
        assumed=old;
        old= __longlong_as_double(atomicCAS((unsigned long long int*)address,
                                            __double_as_longlong(assumed),
                                            __double_as_longlong(val+assumed)));
    }while (assumed!=old);
}

With this minor change, everything works flawlessly! Thanks anyway.
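For comparison, here is a sketch of the compare-and-swap loop in the form the CUDA programming guide gives it. That form keeps `old` and `assumed` as 64-bit integers and compares them as integers, which avoids an endless loop if the stored value is NaN (NaN != NaN is always true for doubles). The names `atomic_add_double`, `add_ones_kernel` and `count_threads_atomically` are made up for this example. On devices of compute capability 6.0 and higher, a native `atomicAdd(double*, double)` is available and can be used instead.

```cuda
#include <cuda_runtime.h>

// Software double-precision atomic add built on 64-bit atomicCAS.
// Returns the value stored at address before the add.
__device__ double atomic_add_double(double *address, double val)
{
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;

    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
        // Integer comparison: retry only if another thread changed the bits.
    } while (assumed != old);

    return __longlong_as_double(old);
}

// Small check kernel: every thread adds 1.0 to the accumulator.
// The return value is deliberately ignored; the add already updated *acc.
__global__ void add_ones_kernel(double *acc)
{
    atomic_add_double(acc, 1.0);
}

// Hypothetical host wrapper (error checking omitted): returns the accumulated
// count, which should equal blocks * threads.
double count_threads_atomically(int blocks, int threads)
{
    double *d_acc = nullptr;
    cudaMalloc(&d_acc, sizeof(double));
    cudaMemset(d_acc, 0, sizeof(double));  // start from +0.0

    add_ones_kernel<<<blocks, threads>>>(d_acc);

    double result = 0.0;
    cudaMemcpy(&result, d_acc, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_acc);
    return result;
}
```

Small integer counts are exactly representable in double, so this check is independent of the order in which the threads arrive. A dot product of arbitrary values summed this way can still differ in its last bits from run to run, for the rounding reasons given earlier in the thread.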
