atomic add operation

lwan61c1t3 · July 22, 2014, 8:55pm

Hi, All,

I am trying to sum up previously calculated values in different threads within the same thread block, and then write the value to a single variable.

As shown in the following code, I used a self-defined double precision atomicAdd(), as introduced in ( Speed of double precision CUDA atomic operations on Kepler K20 - CUDA Programming and Performance - NVIDIA Developer Forums ). Double precision is necessary in my case.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <time.h>

// double precision atomic add function
__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;

    unsigned long long int old = *address_as_ull, assumed;

    do{ assumed = old;
		old = atomicCAS(address_as_ull, assumed,__double_as_longlong(val +__longlong_as_double(assumed)));
    } while (assumed != old);

    return __longlong_as_double(old);
}

// kernel function
__global__ void addKernel(double *dev_add)
{
    int i = threadIdx.x;

	__shared__ double add_result;

	add_result = 0.1245274925749275269245279673295762;

    double previously_calculated_value = double(i);

	__syncthreads();

    atomicAdd(&add_result, previously_calculated_value);  // double precision atomic add
    //atomicAdd(&add_result, float(previously_calculated_value));  // single precision atomic add
	//add_result += double(i);

	if(i==31)
	{
		//atomicAdd(&add_result, double(i));
		*dev_add = (double)add_result;
	}
}

int main()
{
	int size = 1;

	double *dev_add;
	double c;

	cudaMalloc((void**)&dev_add,  size * sizeof(double));
  
	clock_t t1, t2;
    t1 = clock();

    // Launch a kernel on the GPU 
    addKernel <<< 1000000, 32 >>> (dev_add);

    // copy calculation result back to host
    cudaMemcpy(&c, dev_add, size * sizeof(double), cudaMemcpyDeviceToHost);

	t2 = clock();
    printf("%.3f ms\n", (double)(t2 - t1) / CLOCKS_PER_SEC * 1000);
	printf("add result: %.20f", c);
   
    cudaFree(dev_add);

	getchar();
    
    return 1;
}

However, the usage of self-defined double precision atomicAdd() slowes the code down over 70 times. Even comparing with the single precision atomicAdd() provided by Nvidia, it is still 20 times slower.

So, I am wondering whether there is a better way to do that, rather than the method used in my code, to minimize the penalty of double precision atomicAdd() usage. I am using VS2010, CUDA6.0, and running code on K20C.

Many thanks in advance!

Robert_Crovella · July 22, 2014, 9:29pm

The usual suggestion for situations like this is to use a parallel reduction. You may want to study the parallel reduction sample code and pdf:

http://docs.nvidia.com/cuda/cuda-samples/index.html#cuda-parallel-reduction

Here’s a reworked version of your code with a parallel reduction in place of the atomic operations, I think you’ll find it runs faster:

#include <stdio.h>
#include <time.h>
#define nTPB 32
// kernel function
__global__ void addKernel(double *dev_add)
{
  int i = threadIdx.x;
  __shared__ double add_result;
  __shared__ double result[nTPB];
  add_result = 0.1245274925749275269245279673295762;
  double previously_calculated_value = double(i);
  result[i] = previously_calculated_value;
  __syncthreads();
  for (int j = blockDim.x/2; j > 0; j >>=1){
    if (i < j) result[i] += result[i+j];
    __syncthreads();
    }

  if(i==0)
  {
    *dev_add = (double)add_result+result[0];
  }
}
int main()
{
  int size = 1;
  double *dev_add;
  double c;
  cudaMalloc((void**)&dev_add, size * sizeof(double));
  clock_t t1, t2;
  t1 = clock();
  // Launch a kernel on the GPU
  addKernel <<< 1000000, nTPB >>> (dev_add);
  // copy calculation result back to host
  cudaMemcpy(&c, dev_add, size * sizeof(double), cudaMemcpyDeviceToHost);
  t2 = clock();
  printf("%.3f ms\n", (double)(t2 - t1) / CLOCKS_PER_SEC * 1000);
  printf("add result: %.20f\n", c);
  cudaFree(dev_add);
  return 0;
}

lwan61c1t3 · July 22, 2014, 9:56pm

txbob:

The usual suggestion for situations like this is to use a parallel reduction. You may want to study the parallel reduction sample code and pdf:

http://docs.nvidia.com/cuda/cuda-samples/index.html#cuda-parallel-reduction

Here’s a reworked version of your code with a parallel reduction in place of the atomic operations, I think you’ll find it runs faster:

#include <stdio.h>
#include <time.h>
#define nTPB 32
// kernel function
__global__ void addKernel(double *dev_add)
{
  int i = threadIdx.x;
  __shared__ double add_result;
  __shared__ double result[nTPB];
  add_result = 0.1245274925749275269245279673295762;
  double previously_calculated_value = double(i);
  result[i] = previously_calculated_value;
  __syncthreads();
  for (int j = blockDim.x/2; j > 0; j >>=1){
    if (i < j) result[i] += result[i+j];
    __syncthreads();
    }

  if(i==0)
  {
    *dev_add = (double)add_result+result[0];
  }
}
int main()
{
  int size = 1;
  double *dev_add;
  double c;
  cudaMalloc((void**)&dev_add, size * sizeof(double));
  clock_t t1, t2;
  t1 = clock();
  // Launch a kernel on the GPU
  addKernel <<< 1000000, nTPB >>> (dev_add);
  // copy calculation result back to host
  cudaMemcpy(&c, dev_add, size * sizeof(double), cudaMemcpyDeviceToHost);
  t2 = clock();
  printf("%.3f ms\n", (double)(t2 - t1) / CLOCKS_PER_SEC * 1000);
  printf("add result: %.20f\n", c);
  cudaFree(dev_add);
  return 0;
}

Thank you so much! It is definitely much faster! ^_^

Topic		Replies	Views
Complex addition as Atomic operation CUDA Programming and Performance	10	4985	August 29, 2010
atomicAdd(float,float) - atomicMul(float,float) ... CUDA Programming and Performance	13	56398	July 29, 2010
CUDA dot product atomics problem CUDA Programming and Performance	4	1846	February 26, 2012
+= slower than atomicAdd - is there an alternate method? CUDA Programming and Performance	1	3065	December 3, 2012
Using reduction instead of atomics? CUDA Programming and Performance	9	5632	March 9, 2015
Threads and Race Condition CUDA Programming and Performance	11	2969	April 30, 2012
Result of atomicAdd in kernel CUDA Programming and Performance	10	52	July 22, 2024
Worse atomic performance in shared than global memory CUDA Programming and Performance	7	8870	August 3, 2017
Atomic add running faster than naive add CUDA Programming and Performance	1	611	February 2, 2023
GPU/CPU precision comparison and Kernel instructions question CUDA Programming and Performance	5	675	April 4, 2017

atomic add operation

Related topics