Atomic counter as array index

bgeneto · June 20, 2014, 1:16pm

Hi! Unfortunately I’m struggling to understand why we can’t use an atomically incremented counter to index an array allocated in global memory. Does anyone here have a good explanation of why the following (minimal, no error checks) code fails to fill (store) the array properly although count is correctly set? __syncthreads() does not seem to help. What would be the proper CUDA way to accomplish such a task? TIA.
(note: compiled with -arch=sm_20)

__global__ void kernel(int *count, float *stor){
    int tx = threadIdx.x + blockIdx.x*blockDim.x;
    int ty = threadIdx.y + blockIdx.y*blockDim.y;
    if (tx%2) {
        stor[*count] = tx;
        atomicAdd(count, 1);
    }
    if (ty%2) {
        stor[*count] = ty;
        atomicAdd(count, 1);
    }
}

int main(){
    int *count;
    float *stor;
    dim3 numThreads(8,8);
    dim3 numBlocks(2,2);
    int size = numBlocks.x*numBlocks.y*numThreads.x*numThreads.y;
    size_t nbytes = size*sizeof(float);

    cudaMallocHost(&count, sizeof(int));
    cudaMallocHost(&stor, nbytes);
    memset(count, 0, sizeof(int));
    kernel<<<numBlocks,numThreads>>>(count, stor);
    cudaDeviceSynchronize();
    printf("count = %d\n", *count);
    for (int i=0; i<size; i++){
        printf("%f\t", stor[i]);
    }
    return 0;
}

Gert-Jan · June 20, 2014, 1:46pm

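On lines 5 and 9 of the kernel (the two stor[*count] = ... statements) the variable *count is read by many, many threads in parallel. That read is separate from the atomicAdd that follows it, so many threads see the same value before any increment lands and then overwrite the same element of “stor”. Hence only a few locations of the array are ever updated. The increments of “count” are done sequentially, as they are done by the atomicAdd function, which is why the final count is still correct.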

One way to correctly use atomic incremented counters as an index is like this:

int i = atomicAdd(count, 1);
stor[i] = some_value;

atomicAdd will return the value which was stored at “count” before “1” was added. Therefore every thread will get a different value for “i”, and the array “stor” will be completely filled.
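Applied to the kernel from the question, the pattern might look like the sketch below. The names kernel_fixed and run_fixed are made up for illustration and are not from the original post; the host helper uses plain cudaMalloc/cudaMemcpy instead of the pinned host memory in the question, and error checks are omitted as in the original.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same logic as the kernel in the question, but each write first reserves
// its own slot: atomicAdd returns the value *count held before the
// increment, so no two calls can hand out the same index.
__global__ void kernel_fixed(int *count, float *stor) {
    int tx = threadIdx.x + blockIdx.x * blockDim.x;
    int ty = threadIdx.y + blockIdx.y * blockDim.y;
    if (tx % 2) {
        int i = atomicAdd(count, 1);  // reserve slot i for this thread
        stor[i] = tx;
    }
    if (ty % 2) {
        int i = atomicAdd(count, 1);  // a second, separate reservation
        stor[i] = ty;
    }
}

// Hypothetical host helper (not in the original post): launches the kernel
// with the same 2x2 grid of 8x8 blocks, copies the results into stor_host
// (which must have room for 256 floats) and returns the final counter.
int run_fixed(float *stor_host) {
    dim3 numThreads(8, 8);
    dim3 numBlocks(2, 2);
    const int size = numBlocks.x * numBlocks.y * numThreads.x * numThreads.y;
    const size_t nbytes = size * sizeof(float);

    int *d_count = nullptr;
    float *d_stor = nullptr;
    cudaMalloc(&d_count, sizeof(int));
    cudaMalloc(&d_stor, nbytes);
    cudaMemset(d_count, 0, sizeof(int));
    cudaMemset(d_stor, 0, nbytes);

    kernel_fixed<<<numBlocks, numThreads>>>(d_count, d_stor);

    int count = 0;
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(stor_host, d_stor, nbytes, cudaMemcpyDeviceToHost);

    cudaFree(d_count);
    cudaFree(d_stor);
    return count;
}
```

With this launch configuration half of the 256 threads have an odd tx and half have an odd ty, so run_fixed should return 256 and every element of the array is written exactly once. The order in which the values land in “stor” still depends on thread scheduling and can differ from run to run; only the uniqueness of the indices is guaranteed.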

bgeneto · June 20, 2014, 3:01pm

Thank you so much, you really helped a lot!
