can you give me sample code for atomicAdd()?

Hello,

Can you give me sample test code using the atomicAdd() function?

Please don't just tell me to look into the Histogram64 SDK sample.

histogram256? simpleAtomicIntrinsics?

What's so difficult about it? Just call "atomicAdd(gmem_pointer, 1)" in your code.

Compile with the "nvcc -arch=compute_13" option and run it on compute capability 1.3 hardware…

OR

If you have a 1.1 device, use "nvcc -arch=compute_11" and run it on 1.1 hardware.

Note that shared-memory atomics are available only on devices of compute capability 1.2 and above.
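
For reference, a minimal standalone counter along those lines might look like this (kernel and variable names are made up here, just to show the atomicAdd() call on a global-memory pointer):

[codebox]#include <stdio.h>

#include "cuda.h"

// Every thread adds 1 to a single counter in global memory.
__global__ void count(int *counter)
{
	atomicAdd(counter, 1);
}

int main(void)
{
	int h_counter = 0;
	int *d_counter = 0;

	cudaMalloc((void**)&d_counter, sizeof(int));
	cudaMemcpy(d_counter, &h_counter, sizeof(int), cudaMemcpyHostToDevice);

	count<<<16, 64>>>(d_counter);

	cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
	printf("counter = %d (expected %d)\n", h_counter, 16 * 64);

	cudaFree(d_counter);
	return 0;
}[/codebox]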

This is my kernel:

[codebox]__global__ void Sum( int *sum, int size, int *index )

{

int idx = blockIdx.x * blockDim.x + threadIdx.x;

int i = *index;

sum[i] = idx;

atomicAdd(index, 1);

}[/codebox]

and the main function is:

[codebox]/************************************************************/

/* CUDA_main */

/************************************************************/

int main(int argc, char* argv[])

{

/*if(!InitCUDA()) {

	return 0;

}*/

CUT_DEVICE_INIT(argc, argv);

int W = 256;

int H = 256;

int *hSum ,*dSum , size = 50 ;

int* d_index=0;

int  h_index=0;



hSum = (int*)malloc(sizeof(int)*W*H) ;

memset( hSum, 0, sizeof(int)*W*H);

CUDA_SAFE_CALL( cudaMalloc( (void**) &dSum, sizeof(int)*W*H ) );

CUDA_SAFE_CALL( cudaMalloc( (void**) &d_index, sizeof(int) ) );

CUDA_SAFE_CALL(cudaMemcpy(dSum, hSum , sizeof(int)*W*H, cudaMemcpyHostToDevice) );

CUDA_SAFE_CALL(cudaMemcpy(d_index, &h_index , sizeof(int),  cudaMemcpyHostToDevice) );

Sum<<<W,H>>>( dSum , size, d_index );

CUDA_SAFE_CALL(cudaMemcpy(hSum, dSum, sizeof(int)*W*H, cudaMemcpyDeviceToHost) );

FILE* fp = fopen( "C:\\ThreadIdxs.txt", "wt" );

if( fp )

{

	for( int i=0; i<W*H; ++i )

		fprintf( fp, "%d\n", hSum[i] );

	fclose(fp);

}

CUDA_SAFE_CALL(cudaMemcpy( &h_index , d_index, sizeof(int),  cudaMemcpyDeviceToHost) );

free(hSum);	

CUDA_SAFE_CALL(cudaFree(dSum));

CUDA_SAFE_CALL(cudaFree(d_index));

CUT_EXIT(argc, argv);

return 0;

}[/codebox]

After executing the kernel function, h_index is showing 65513, which is perfect.

But in the kernel function, I want to store each value of "int* index" into the "sum" array.

Try copying this code and executing it.

A file is then created at C:\ThreadIdxs.txt.

Open the file and look at the values.

The values in the file should be the value of "int* index" for each particular thread, but what is shown in the file is 0 and some scattered values.

Why is that?

int idx = blockIdx.x * blockDim.x + threadIdx.x;

int i = *index;

sum[i] = idx;

atomicAdd(index, 1);

Note that "index" is being read by all of the threads when "int i = *index" is executed. That statement is NOT executed serially.

Only the addition to "index" is done atomically. It is a read-write race. You need to implement a spinlock; inside that lock, one thread has to read *index, update sum[i], and then add 1 to index. You should release the lock after that.

Implementing a spinlock seems trivial, but it has some complications. It is being discussed right now in a separate thread.
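
For what it's worth, a rough sketch of such a lock built on atomicCAS()/atomicExch() could look like the following (the lock variable and structure here are made up for illustration; be warned that per-thread locking like this is fragile on this generation of SIMT hardware and can hang depending on how the warp scheduler orders the divergent branches, which is exactly the kind of complication being discussed in that other thread):

[codebox]// Sketch only: a global spinlock guarding the read / update / increment
// sequence. Acquire, critical section, and release all happen in the same
// branch so the winning thread can make progress.
__device__ int lock = 0;

__global__ void Sum(int *sum, int size, int *index)
{
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	bool done = false;
	while (!done)
	{
		// Only the thread that flips the lock from 0 to 1 enters.
		if (atomicCAS(&lock, 0, 1) == 0)
		{
			int i = *index;          // read the shared counter
			sum[i] = idx;            // update the output slot
			*index = i + 1;          // bump the counter
			__threadfence();         // publish the writes before unlocking
			atomicExch(&lock, 0);    // release the lock
			done = true;
		}
	}
}[/codebox]

The atomicAdd() return-value approach suggested below avoids the lock entirely and is the more natural fix for this particular kernel.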

Execution happens in groups of threads (warps). The first groups of threads scheduled will all read the identical initial value from index before any of them performs the atomic add and increments it. As more threads are launched, the results become less predictable, depending on how the GPU orders the queued reads and atomic operations on index.

This isn’t a serial execution model. If you find the results of this at all surprising, it is probably time to go back and read Chapter 3 of the user guide.

I suspect

int idx = blockIdx.x * blockDim.x + threadIdx.x;

int i = atomicAdd(index,1);

sum[i] = idx;

has a lot more hope of working as you imagine your kernel should, but it will be very slow.

And as I suspected, it does work:

#include <stdio.h>

#include "cuda.h"

__global__ void Sum( int *sum , int size, int* index)

{

	register int i = atomicAdd(index,1);

	int idx = blockIdx.x * blockDim.x + threadIdx.x;

	sum[i] = idx;

}

int main(int argc, char* argv[])

{

	int W = 256;

	int H = 256;

	int *hSum ,*dSum , size = 50;

	int* d_index=0;

	int h_index=0;

	hSum = (int*)malloc(sizeof(int)*W*H);

	memset( hSum, 0, sizeof(int)*W*H);

	cudaMalloc( (void**) &dSum, sizeof(int)*W*H );

	cudaMalloc( (void**) &d_index, sizeof(int) );

	cudaMemcpy(dSum, hSum , sizeof(int)*W*H, cudaMemcpyHostToDevice);

	cudaMemcpy(d_index, &h_index , sizeof(int), cudaMemcpyHostToDevice);

	Sum<<<W,H>>>( dSum , size, d_index );

	cudaMemcpy(hSum, dSum, sizeof(int)*W*H, cudaMemcpyDeviceToHost);

	cudaMemcpy(&h_index , d_index, sizeof(int), cudaMemcpyDeviceToHost);

	fprintf(stderr, "%d\n", h_index);

	for( int i=0; i<W*H; ++i )

		fprintf( stdout, " %d %d\n", i, hSum[i] );

	free(hSum);

	cudaFree(dSum);

	cudaFree(d_index);

	return 0;

}

And to prove it, here is a nice graph of its output (keep in mind that this was run on the 9500GT in my development box; a 1.1 capable device with more multiprocessors, or a 1.3 device, will look different).
[Attachment: plot2.png]

Cool. Simple one.