atomicAdd() functions

Hello guys, I am currently using atomicAdd() to add values to an array. It works fine when the number of threads is small (around 800), but it gives wrong results when I launch about 2.4 million threads. Here is part of my code:

#define TILEWIDTH 32

for (i = 5; i >= 0; i--)
    xy = (xy << 1) + (p[threadIdx.y * 32 * 4 + threadIdx.x * 4 + i].t >= (uint)d_p[(blockIdx.z + 3)*(d_roi_height)*(d_roi_width) + (threadIdx.y+1+(blockIdx.y*TILEWIDTH))*(d_roi_width) + (threadIdx.x+1)+(blockIdx.x*TILEWIDTH)]);

printf("Block id x: %d\tBlock id y: %d\tBlock id z: %d\tthreadid.x: %d\tthreadid.y: %d\txy:%d\n", blockIdx.x, blockIdx.y, blockIdx.z, threadIdx.x, threadIdx.y, xy);
atomicAdd(&d_basic_xy[xy], 1.0f);

The printf is there to check the value of xy. According to the printf output, xy = 0 appears 100 times, but atomicAdd() only adds 8 to that element (it should be 100).

Does anyone know how to solve this? Is it because the number of threads is so large that the kernel gets kicked out by a timeout before it has finished all its work?
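One way to test the timeout idea is to check the CUDA error status after the launch. A kernel that fails or is terminated while running normally shows up as a non-success return code from cudaDeviceSynchronize(). The following is only a minimal sketch: countKernel and launchAndCheck are made-up names, the kernel is a stand-in for the real one, and the 75 x 32 grid of 32 x 32 blocks is just an assumption chosen to give roughly 2.4 million threads.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical stand-in for the real kernel: every thread adds 1.0f to bins[0].
__global__ void countKernel(float* bins) {
  atomicAdd(&bins[0], 1.0f);
}

// Launches the kernel and reports any error.
// Returns true only if both the launch and the execution succeeded.
bool launchAndCheck(float* bins, dim3 grid, dim3 block) {
  countKernel<<<grid, block>>>(bins);

  cudaError_t err = cudaGetLastError();  // errors from the launch itself
  if (err != cudaSuccess) {
    std::printf("launch failed: %s\n", cudaGetErrorString(err));
    return false;
  }

  err = cudaDeviceSynchronize();         // errors raised while the kernel ran
  if (err != cudaSuccess) {
    std::printf("kernel failed: %s\n", cudaGetErrorString(err));
    return false;
  }
  return true;
}

int main() {
  float* bins = nullptr;
  if (cudaMallocManaged(&bins, sizeof(float)) != cudaSuccess) {
    std::printf("allocation failed\n");
    return 1;
  }
  bins[0] = 0.0f;

  // Assumed configuration: 75 * 32 blocks of 32 * 32 threads = 2457600 threads.
  bool ok = launchAndCheck(bins, dim3(75, 32), dim3(32, 32));
  if (ok) {
    std::printf("count = %.0f\n", bins[0]);
  }

  cudaFree(bins);
  return ok ? 0 : 1;
}
```

If both checks pass and the count is still too low, a timeout is unlikely to be the cause, and the index computation or the launch configuration would be the next thing to look at.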

I can’t reproduce the problem.

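#include <cuda_runtime.h>
#include <iostream>

// Every thread adds 1 to the same counter.
__global__ void myKernel(unsigned int* sum) {
  atomicAdd(sum, 1U);
}

int main() {
  unsigned int* sum;
  cudaMallocManaged(&sum, sizeof(unsigned int));
  *sum = 0U;

  // Assumed launch configuration: 32*32*10 blocks of 32*32 threads,
  // which matches the reported total of 10485760 threads.
  myKernel<<<dim3(32, 32, 10), dim3(32, 32)>>>(sum);
  cudaDeviceSynchronize();  // wait for the kernel before reading *sum

  std::cout << *sum << std::endl;

  cudaFree(sum);
  return 0;
}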

It works and gives me the exact result: 10485760 (= 32*32*10 * 32*32), which equals the number of threads.
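The test above uses a single unsigned counter, while the original code adds 1.0f into a float array indexed by xy. A variant closer to that case might look like the sketch below. It rests on several assumptions: NUM_BINS is 64 because xy appears to be built from 6 bits, the bin is simply taken from the thread id (not the real computation of xy), and histogramKernel and runHistogram are hypothetical names. If atomicAdd() behaves correctly, the sum over all bins should equal the number of launched threads.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

#define NUM_BINS 64  // assumption: xy is built from 6 bits, so it lies in 0..63

// Each thread adds 1.0f to one bin; the bin is derived from the thread id
// (a placeholder for the real computation of xy).
__global__ void histogramKernel(float* bins) {
  unsigned int blockId = blockIdx.x + blockIdx.y * gridDim.x;
  unsigned int tid = blockId * (blockDim.x * blockDim.y)
                   + threadIdx.y * blockDim.x + threadIdx.x;
  unsigned int xy = tid % NUM_BINS;
  atomicAdd(&bins[xy], 1.0f);
}

// Runs the kernel and returns the sum over all bins, or -1.0 on any CUDA error.
double runHistogram(dim3 grid, dim3 block) {
  float* bins = nullptr;
  if (cudaMallocManaged(&bins, NUM_BINS * sizeof(float)) != cudaSuccess) {
    return -1.0;
  }
  for (int i = 0; i < NUM_BINS; ++i) {
    bins[i] = 0.0f;
  }

  histogramKernel<<<grid, block>>>(bins);
  bool ok = (cudaGetLastError() == cudaSuccess) &&
            (cudaDeviceSynchronize() == cudaSuccess);

  double total = 0.0;
  if (ok) {
    for (int i = 0; i < NUM_BINS; ++i) {
      total += bins[i];
    }
  }

  cudaFree(bins);
  return ok ? total : -1.0;
}

int main() {
  // Assumed configuration: 75 * 32 blocks of 32 * 32 threads = 2457600 threads.
  dim3 grid(75, 32);
  dim3 block(32, 32);
  double expected = 75.0 * 32.0 * 32.0 * 32.0;

  double total = runHistogram(grid, block);
  std::printf("sum of bins = %.0f, threads = %.0f\n", total, expected);
  return total == expected ? 0 : 1;
}
```

One caveat with float counters: a float holds integers exactly only up to 2^24 (16777216), so a bin that receives more increments than that would stop counting correctly. An unsigned int bin avoids this.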