Atomic Functions

OK guys, first of all, thanks for reading my question.

I want to know if there is another way to solve the following problem without using atomic functions all the time…

Suppose I have a vector initialized to zeros and a lot of threads, and each thread, randomly and independently of the others, must add 1 to some index of the array. The problem is that if two threads access the same index, one of them must wait for the other to finish its update before doing its own. This is required for the result to stay consistent.

In CUDA C, I have been using atomic functions for every increment by every thread, not only when two or more threads hit the same memory address. I want to know whether it is possible to do this in CUDA without using atomic functions for all the increments, because these functions take too much time.
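For reference, a minimal sketch of the atomic version described above might look like the following. The names `increment_atomic` and `count_with_atomics` are hypothetical, the indices are passed in instead of being generated randomly on the device, and error checking is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread adds 1 to the element named by its own index. atomicAdd keeps
// the result consistent when several threads hit the same element.
__global__ void increment_atomic(const int* idx, int num_threads, int* a)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < num_threads) {
        atomicAdd(&a[idx[t]], 1);
    }
}

// Hypothetical host wrapper (not from the original post): copies the indices
// to the device, runs one thread per index, and returns the counts.
// Assumes every index is in the range [0, n).
std::vector<int> count_with_atomics(const std::vector<int>& idx, int n)
{
    std::vector<int> a(n, 0);
    if (idx.empty() || n == 0) return a;

    int num = static_cast<int>(idx.size());
    int* d_idx = nullptr;
    int* d_a = nullptr;
    cudaMalloc(&d_idx, num * sizeof(int));
    cudaMalloc(&d_a, n * sizeof(int));
    cudaMemcpy(d_idx, idx.data(), num * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_a, 0, n * sizeof(int));

    int threads = 256;
    int blocks = (num + threads - 1) / threads;
    increment_atomic<<<blocks, threads>>>(d_idx, num, d_a);

    // This copy waits for the kernel to finish before reading the result.
    cudaMemcpy(a.data(), d_a, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_idx);
    cudaFree(d_a);
    return a;
}
```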


I can give you one approach to solve this.

Suppose you have an array A of size 4.

Init A: [0,0,0,0]

Launch a kernel that generates a random index for each thread and stores it in an index array I. Suppose we have 10 threads.


Sort I: [0,0,1,1,1,3,3,3,3,3]

Perform run-length encoding on I, giving (count, index) pairs: [(2,0),(3,1),(5,3)]

Launch a last kernel that writes each count to its index in A, giving A: [2,3,0,5]


Run-length encoding and sorting can be found in the Thrust library.
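
The steps above could be sketched with Thrust roughly as follows. This is only one possible way to write it. The function name `count_with_sort_rle` is hypothetical, and the indices are assumed to be already generated and in the range [0, n). The run-length encoding is done with `thrust::reduce_by_key` over a constant stream of ones, and `thrust::scatter` stands in for the hand-written last kernel.

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/scatter.h>
#include <thrust/copy.h>
#include <thrust/iterator/constant_iterator.h>
#include <cstddef>
#include <vector>

// Hypothetical helper: counts how many times each index in [0, n) occurs,
// using sort + run-length encoding instead of atomic increments.
std::vector<int> count_with_sort_rle(const std::vector<int>& idx, int n)
{
    std::vector<int> a(n, 0);
    if (idx.empty() || n == 0) return a;

    // Copy I to the device and sort it so equal indices become adjacent.
    thrust::device_vector<int> d_idx(idx.begin(), idx.end());
    thrust::sort(d_idx.begin(), d_idx.end());

    // Run-length encode: summing a 1 for every element of each run of equal
    // keys yields one (count, index) pair per distinct index.
    thrust::device_vector<int> d_keys(d_idx.size());
    thrust::device_vector<int> d_counts(d_idx.size());
    auto ends = thrust::reduce_by_key(d_idx.begin(), d_idx.end(),
                                      thrust::constant_iterator<int>(1),
                                      d_keys.begin(), d_counts.begin());
    std::size_t num_runs = ends.first - d_keys.begin();

    // Fill A: write each count to the position named by its index.
    // The indices are distinct after encoding, so no two writes collide.
    thrust::device_vector<int> d_a(n, 0);
    thrust::scatter(d_counts.begin(), d_counts.begin() + num_runs,
                    d_keys.begin(), d_a.begin());

    thrust::copy(d_a.begin(), d_a.end(), a.begin());
    return a;
}
```

With the example indices above, `count_with_sort_rle({0,0,1,1,1,3,3,3,3,3}, 4)` returns `[2,3,0,5]`. Whether this is actually faster than atomics probably depends on the array size and on how often threads collide, since the sort has its own cost, so it is worth measuring both.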