Atomic counter as array index

Hi! Unfortunately, I’m struggling to understand why we can’t use an atomically incremented counter to index an array allocated in global memory. Does anyone here have a good explanation of why the following (minimal, no error checks) code fails to fill the array properly, although count is correctly set? __syncthreads() does not seem to help. What would be the proper CUDA way to accomplish such a task? TIA.
(note: compiled with -arch=sm_20)

__global__ void kernel(int *count, float *stor){
    int tx = threadIdx.x + blockIdx.x*blockDim.x;
    int ty = threadIdx.y + blockIdx.y*blockDim.y;
    if (tx%2) {
        stor[*count] = tx;
        atomicAdd(count, 1);
    }
    if (ty%2) {
        stor[*count] = ty;
        atomicAdd(count, 1);
    }
}

int main(){
    int *count;
    float *stor;
    dim3 numThreads(8,8);
    dim3 numBlocks(2,2);
    int size = numBlocks.x*numBlocks.y*numThreads.x*numThreads.y;
    size_t nbytes = size*sizeof(float);

    cudaMallocHost(&count, sizeof(int));
    cudaMallocHost(&stor, nbytes);
    memset(count, 0, sizeof(int));
    kernel<<<numBlocks,numThreads>>>(count, stor);
    cudaDeviceSynchronize();
    printf("count = %d\n", *count);
    for (int i=0; i<size; i++){
        printf("%f\t", stor[i]);
    }
    return 0;
}

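In the two statements stor[*count] = tx; and stor[*count] = ty; the variable *count is read by many threads in parallel, and many of them get the same value because the increments of the other threads have not happened yet. Hence those threads all write to the same element of “stor”, and most of the array is never written. Only the increments of “count” are serialized, as they are done by the atomicAdd function; the read that is used as the index is a separate, non-atomic operation. That is why “count” ends up correct while the array does not.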

One way to correctly use atomic incremented counters as an index is like this:

int i = atomicAdd(count, 1);
stor[i] = some_value;

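atomicAdd returns the value that was stored at “count” before “1” was added. Therefore every thread gets a different value for “i”, and the array “stor” is completely filled. Note that the order in which the threads obtain their indices is not deterministic, so the order of the values in “stor” can change from run to run.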
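For reference, here is a minimal sketch of the kernel from the question with that fix applied. The host function runKernel is a hypothetical helper, not part of the original post, and it uses cudaMalloc/cudaMemcpy instead of cudaMallocHost, which is just one possible choice. As in the question there are no error checks, and there is no bounds check on the index, which is only safe here because this launch performs exactly as many writes as the buffer has elements.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread reserves its own slot with atomicAdd and then writes to it.
__global__ void kernel(int *count, float *stor) {
    int tx = threadIdx.x + blockIdx.x * blockDim.x;
    int ty = threadIdx.y + blockIdx.y * blockDim.y;
    if (tx % 2) {
        int i = atomicAdd(count, 1);  // old value of *count: a unique index
        stor[i] = tx;
    }
    if (ty % 2) {
        int i = atomicAdd(count, 1);  // second, independent reservation
        stor[i] = ty;
    }
}

// Hypothetical helper: launches the kernel on the grid from the question,
// copies the results into `out` and returns the final counter value.
int runKernel(std::vector<float> &out) {
    dim3 numThreads(8, 8);
    dim3 numBlocks(2, 2);
    int size = numBlocks.x * numBlocks.y * numThreads.x * numThreads.y;
    size_t nbytes = size * sizeof(float);

    int *d_count;
    float *d_stor;
    cudaMalloc(&d_count, sizeof(int));
    cudaMalloc(&d_stor, nbytes);
    cudaMemset(d_count, 0, sizeof(int));
    cudaMemset(d_stor, 0, nbytes);

    kernel<<<numBlocks, numThreads>>>(d_count, d_stor);
    cudaDeviceSynchronize();

    int count = 0;
    out.resize(size);
    cudaMemcpy(&count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(out.data(), d_stor, nbytes, cudaMemcpyDeviceToHost);

    cudaFree(d_count);
    cudaFree(d_stor);
    return count;
}
```

With this 16x16 grid of threads, 128 threads have an odd tx and 128 have an odd ty, so runKernel should return 256 and every element of the array should hold an odd value between 1 and 15, in no particular order. If the number of writes is not known in advance, it would be safer to compare “i” against the capacity of the array before storing.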

Thank you so much, you really helped a lot!