Which is faster, atomicAdd() or thrust::inclusive_scan<T>()?

So, I was wondering, what if we had a boolean array and we wanted go get the number of true instances.

For the sake of clarification, we need both sets of data, the array and the total number.

I’m curious, is it faster to use an atomicAdd() operation in the kernel that’s assigning to the array or is it faster to split it up into two kernels, one doing the assigning and the other being thrust::inclusive_scan() and the usage of pfx_sum.back() to get the total number.

Edit : My guess is that the prefix sum is faster because threads aren’t tripping over each other, so to speak. 1 million threads all trying to add to the same location sounds like there’d be a lot of thread blocking.

While it is possible to write faster reduction code than thrust, you probably would be better off using thrust.

As far as atomics the better idea would be to sum and store via __shfl() in shared memory the number of true values loaded from global memory from threads in a given block.Then when all threads in that block are done with work do the atomic update with that block local sub-sum to the global full-sum value.
This way you reduce the number of atomic operations to the number of thread blocks, which is much better than having 1 million threads trying to update the same memory location. You would also want each thread in the block to examine more than one memory location in the array to compute their sub-sub-sum.

Wouldn’t this be pretty simple just to benchmark and compare?

For a large scale reduction, a true parallel reduction is usually faster than the atomic add method. One exception would be if your array is exceptionally sparse. To take it to an extreme, if you are counting the boolean true values in an array, and you only have one true value in the entire array, the atomic method will be quicker.

Regarding thrust, if you are just counting a fixed value in an array, thrust::count should be less expensive than either a prefix sum (thrust::inclusive_scan) or a reduction (thrust::reduce)

Here’s a worked example:

Oh shoot, I didn’t even know about thrust::count! That’s exactly what I need! Thank you.