CUDA BUG? atomicAdd

Hi,

First let me say that I’m a newbie at CUDA programming, so I might be missing something here. Anyway, the code follows (using CUDA 2.1, 64-bit version):

#define QUANT_NUMBERS_TO_SUM 100
#define BLOCK_SIZE 10

__global__ void sum_kernel(int * numbersToSum, int * output) {
extern __shared__ int data[];

int x = blockIdx.x * blockDim.x + threadIdx.x; // global index
    
data[threadIdx.x] = numbersToSum[x]; 
__syncthreads();

int nextInterval;
for (int interval = 1; interval < blockDim.x; interval = nextInterval) {
    nextInterval = 2 * interval;
    
    int positionSum = threadIdx.x + interval;       
    if (threadIdx.x % nextInterval == 0 && positionSum < blockDim.x) data[threadIdx.x] += data[positionSum];
    __syncthreads();
}

if (threadIdx.x == 0) atomicAdd(output, data[threadIdx.x]);    

}

__global__ void init(int * numbersToSum, int * output) {
int x = blockIdx.x * blockDim.x + threadIdx.x; // global index

if (x < QUANT_NUMBERS_TO_SUM) {
	numbersToSum[x] = x + 1;

	if (x == 0) output[0] = 0;
}

}

void Sum() {
int * dNumbers;
int * dSum;

cudaMalloc((void **) &dNumbers, QUANT_NUMBERS_TO_SUM * sizeof(int));
cudaMalloc((void **) &dSum, sizeof(int));

init<<<1, QUANT_NUMBERS_TO_SUM>>>(dNumbers, dSum);
cudaThreadSynchronize();

int nBlocks = (QUANT_NUMBERS_TO_SUM + BLOCK_SIZE - 1) / BLOCK_SIZE;
sum_kernel<<<nBlocks, BLOCK_SIZE>>>(dNumbers, dSum);
cudaThreadSynchronize();

int hSum[1];

cudaMemcpy(hSum, dSum, sizeof(int), cudaMemcpyDeviceToHost);

printf("Sum = %d", hSum[0]);

}

The sum kernel works fine for BLOCK_SIZE values up to 8. Larger values produce weird results (the correct result should be 5050).
The code looks correct to me, so can anyone point out my mistake, or is this a CUDA bug?

Thanks

I think I found the problem. When launching the kernel, I should have specified the size of the dynamic shared memory allocation.
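
Since data[] is declared extern __shared__, the block gets no dynamic shared memory unless its size is passed as the third launch configuration parameter, so the kernel ends up reading and writing past a zero-sized allocation. The launch should look something like this (one int per thread in the block):

int nBlocks = (QUANT_NUMBERS_TO_SUM + BLOCK_SIZE - 1) / BLOCK_SIZE;

// Third launch parameter = bytes of dynamic shared memory per block,
// matching the extern __shared__ int data[] declaration in the kernel.
sum_kernel<<<nBlocks, BLOCK_SIZE, BLOCK_SIZE * sizeof(int)>>>(dNumbers, dSum);
cudaThreadSynchronize();

With that change the reduction has a valid shared buffer to work in, regardless of BLOCK_SIZE.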