Atomic operation problem

Hello,
I have a T10P engineering sample and I’ve been trying to apply the new atomic functions to shared memory in a histogramming application. I found some discrepancies in my results. Upon further testing, I found that when I have several threads in different warps all accessing the same shared memory via the atomicAdd function only some of the adds are successful. If the threads are in the same warp there is no problem. Am I doing something wrong? Is this the expected behavior of the atomic functions? Or is this a bug in the current software or sample hardware. I’ll post a sample code which reproduces this result in a few minutes. Thank you for your time.

Ben

//This code reproduces the problem I have described

/All it does is add increment a variable in shared memory repeatedly. The incremement is done by the atomicAdd function. It is called from each thread in a block of NTHREADS. This is repeated NREPEATS tims, so that the total value at the end of the run should be NTHREADSNREPEATS if all atomic adds register*/

#include <stdio.h>
#include <math.h>
#include <cuda_runtime.h>

/* with NTHREADS=32 there is not problem. With 64, only half of the atomic adds register. With 128 slightly more than half register, and the result is not reproducible. */

#define NTHREADS 32
//#define NTHREADS 64
//#define NTHREADS 128

#define NREPEATS 128

global void device_routine( unsigned int* idev ) {

int j;
shared unsigned int ishared[1];

ishared[0]=0ULL;

for (j=0;j<NREPEATS;j++) {
atomicAdd(ishared, 1U);
}

idev[0]=ishared[0];
}

int main() {
unsigned int* idev;
unsigned int i;
cudaMalloc((void**)&idev,sizeof(unsigned int));

device_routine<<<1,NTHREADS>>>(idev);

cudaMemcpy(&i, idev, sizeof(unsigned int),cudaMemcpyDeviceToHost);

// if all is working, these should be the same number
printf(“done %i %i \n”, i, NTHREADS*NREPEATS);
}

// THANKS FOR YOUR TIME!

Hello again,
I’ve just added to my confusion by running the code in the previous post in emulation mode. It appears that in emulation atomicAdds from only 1 thread register, regardless of how many threads are in the block. Thank you in advance for your help.

Ben