can you give me sample code for atomicAdd()?

Hello,

Can you give me sample test code using the atomicAdd() function?

Please don't just tell me to look into the Histogram64 SDK sample.

histogram256? simpleAtomicIntrinsics?

What's so difficult about it? Just call "atomicAdd(gmem_pointer, 1)" in your code.

Compile with the "nvcc -arch=compute_13" option and run it on compute capability 1.3 hardware…

OR

If you have a 1.1 device, use "nvcc -arch=compute_11" and run it on 1.1 hardware.

Note that shared-memory atomics are available only on devices of compute capability 1.2 and above.
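
For reference, a minimal standalone counter along those lines might look like this (kernel and variable names are made up here, just to show the atomicAdd() call on a global-memory pointer):

[codebox]#include <stdio.h>

#include "cuda.h"

// Every thread adds 1 to a single counter in global memory.
__global__ void count(int *counter)
{
	atomicAdd(counter, 1);
}

int main(void)
{
	int h_counter = 0;
	int *d_counter = 0;

	cudaMalloc((void**)&d_counter, sizeof(int));
	cudaMemcpy(d_counter, &h_counter, sizeof(int), cudaMemcpyHostToDevice);

	count<<<16, 64>>>(d_counter);

	cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
	printf("counter = %d (expected %d)\n", h_counter, 16 * 64);

	cudaFree(d_counter);
	return 0;
}[/codebox]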

This is my kernel:

[codebox]__global__ void Sum( int *sum, int size, int *index )

{

int idx = blockIdx.x * blockDim.x + threadIdx.x;

int i = *index;

sum[i] = idx;

atomicAdd(index, 1);

}[/codebox]

and the main function is:

[codebox]/************************************************************/

/* CUDA_main */

/************************************************************/

int main(int argc, char* argv[])

{

/*if(!InitCUDA()) {

	return 0;

}*/

CUT_DEVICE_INIT(argc, argv);

int W = 256;

int H = 256;

int *hSum ,*dSum , size = 50 ;

int* d_index=0;

int  h_index=0;



hSum = (int*)malloc(sizeof(int)*W*H) ;

memset( hSum, 0, sizeof(int)*W*H);

CUDA_SAFE_CALL( cudaMalloc( (void**) &dSum, sizeof(int)*W*H ) );

CUDA_SAFE_CALL( cudaMalloc( (void**) &d_index, sizeof(int) ) );

CUDA_SAFE_CALL(cudaMemcpy(dSum, hSum , sizeof(int)*W*H, cudaMemcpyHostToDevice) );

CUDA_SAFE_CALL(cudaMemcpy(d_index, &h_index , sizeof(int),  cudaMemcpyHostToDevice) );

Sum<<<W,H>>>( dSum , size, d_index );

CUDA_SAFE_CALL(cudaMemcpy(hSum, dSum, sizeof(int)*W*H, cudaMemcpyDeviceToHost) );

FILE* fp = fopen( "C:\\ThreadIdxs.txt", "wt" );

if( fp )

{

	for( int i=0; i<W*H; ++i )

		fprintf( fp, "%d\n", hSum[i] );

	fclose(fp);

}

CUDA_SAFE_CALL(cudaMemcpy( &h_index , d_index, sizeof(int),  cudaMemcpyDeviceToHost) );

free(hSum);	

CUDA_SAFE_CALL(cudaFree(dSum));

CUDA_SAFE_CALL(cudaFree(d_index));

CUT_EXIT(argc, argv);

return 0;

}[/codebox]

After executing the kernel function, h_index is showing 65513, which is perfect.

But in the kernel function, I want to store each value of "int* index" into the "sum" array.

Try copying this code and executing it.

A file is then created at C:\ThreadIdxs.txt.

Open the file and look at the values.

The values in the file should be the value of "int* index" for each particular thread, but what is shown in the file is 0 and some scattered values.

Why is that?

int idx = blockIdx.x * blockDim.x + threadIdx.x;

int i = *index;

sum[i] = idx;

atomicAdd(index, 1);

Note that "index" is being read by all of the threads when "int i = *index" is executed. That statement is NOT executed serially.

Only the addition to "index" is done atomically. It is a read-write race. You need to implement a spinlock; inside that lock, one thread has to read *index, update sum[i], and then add 1 to index. You should release the lock after that.

Implementing a spinlock seems trivial, but it has some complications. It is being discussed right now in a separate thread.
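
For what it's worth, a rough sketch of such a lock built on atomicCAS()/atomicExch() could look like the following (the lock variable and structure here are made up for illustration; be warned that per-thread locking like this is fragile on this generation of SIMT hardware and can hang depending on how the warp scheduler orders the divergent branches, which is exactly the kind of complication being discussed in that other thread):

[codebox]// Sketch only: a global spinlock guarding the read / update / increment
// sequence. Acquire, critical section, and release all happen in the same
// branch so the winning thread can make progress.
__device__ int lock = 0;

__global__ void Sum(int *sum, int size, int *index)
{
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	bool done = false;
	while (!done)
	{
		// Only the thread that flips the lock from 0 to 1 enters.
		if (atomicCAS(&lock, 0, 1) == 0)
		{
			int i = *index;          // read the shared counter
			sum[i] = idx;            // update the output slot
			*index = i + 1;          // bump the counter
			__threadfence();         // publish the writes before unlocking
			atomicExch(&lock, 0);    // release the lock
			done = true;
		}
	}
}[/codebox]

The atomicAdd() return-value approach suggested below avoids the lock entirely and is the more natural fix for this particular kernel.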

Execution happens in groups of threads (warps). The first groups of threads scheduled will all read the identical initial value from index before any of them performs the atomic add and increments it. As more threads are launched, the results become less predictable, depending on how the GPU orders the queued reads and atomic operations on index.

This isn’t a serial execution model. If you find the results of this at all surprising, it is probably time to go back and read Chapter 3 of the user guide.

I suspect

int idx = blockIdx.x * blockDim.x + threadIdx.x;

int i = atomicAdd(index,1);

sum[i] = idx;

has a lot more hope of working as you imagine your kernel should, but it will be very slow.

And as I suspected, it does work:

#include <stdio.h>

#include "cuda.h"

__global__ void Sum( int *sum , int size, int* index)

{

	register int i = atomicAdd(index,1);

	int idx = blockIdx.x * blockDim.x + threadIdx.x;

	sum[i] = idx;

}

int main(int argc, char* argv[])

{

	int W = 256;

	int H = 256;

	int *hSum ,*dSum , size = 50;

	int* d_index=0;

	int h_index=0;

	hSum = (int*)malloc(sizeof(int)*W*H);

	memset( hSum, 0, sizeof(int)*W*H);

	cudaMalloc( (void**) &dSum, sizeof(int)*W*H );

	cudaMalloc( (void**) &d_index, sizeof(int) );

	cudaMemcpy(dSum, hSum , sizeof(int)*W*H, cudaMemcpyHostToDevice);

	cudaMemcpy(d_index, &h_index , sizeof(int), cudaMemcpyHostToDevice);

	Sum<<<W,H>>>( dSum , size, d_index );

	cudaMemcpy(hSum, dSum, sizeof(int)*W*H, cudaMemcpyDeviceToHost);

	cudaMemcpy(&h_index , d_index, sizeof(int), cudaMemcpyDeviceToHost);

	fprintf(stderr, "%d\n", h_index);

	for( int i=0; i<W*H; ++i )

		fprintf( stdout, " %d %d\n", i, hSum[i] );

	free(hSum);

	cudaFree(dSum);

	cudaFree(d_index);

	return 0;

}

And to prove it, here is a nice graph of its output (keep in mind that this was run on the 9500GT in my development box; a 1.1 capable device with more multiprocessors, or a 1.3 device, will look different).
[Attachment: plot2.png]

Cool. Simple one.