For my selection algorithm, I want a thread block to copy the elements that fall within a certain range, like this:
__shared__ uint count;
// count is assumed to be zeroed (e.g. by thread 0, followed by __syncthreads()) before the loop
for (uint i = threadIdx.x; i < size; i += blockDim.x)
{
    AnyType v = in[i];
    if (pivot0 <= v && v < pivot1)
        out[atomicAdd(&count, 1)] = v;  // every accepted element contends for the same counter
}
I would like to get rid of the atomicAdd, because count is heavily contended (up to WARP_SIZE threads can hit it at once). I propose giving each thread its own local count and output pointer, computed as a prefix sum over the per-thread counts from a previous Count pass:
uint count = 0;                                       // per-thread counter, no atomics needed
AnyType *out_local = &out[prefix_sum[threadIdx.x]];   // each thread writes to its own slice of out
for (uint i = threadIdx.x; i < size; i += blockDim.x)
{
    AnyType v = in[i];
    if (pivot0 <= v && v < pivot1)
        out_local[count++] = v;
}
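For context, here is a rough sketch of the Count pass I have in mind that would produce prefix_sum. The names thread_count and BLOCK_SIZE, and the naive single-threaded scan, are just placeholders for illustration; a real implementation would use a parallel scan:

__shared__ uint thread_count[BLOCK_SIZE];   // BLOCK_SIZE == blockDim.x, assumed known at compile time
__shared__ uint prefix_sum[BLOCK_SIZE];

// Pass 1: each thread counts the elements it would accept.
uint c = 0;
for (uint i = threadIdx.x; i < size; i += blockDim.x)
{
    AnyType v = in[i];
    if (pivot0 <= v && v < pivot1)
        c++;
}
thread_count[threadIdx.x] = c;
__syncthreads();

// Exclusive prefix sum over the per-thread counts
// (done naively by thread 0 here just to show the idea).
if (threadIdx.x == 0)
{
    uint running = 0;
    for (uint t = 0; t < blockDim.x; ++t)
    {
        prefix_sum[t] = running;
        running += thread_count[t];
    }
}
__syncthreads();
// Pass 2 (shown above) then writes through &out[prefix_sum[threadIdx.x]].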
Although this eliminates the atomicAdd, it incurs the extra cost of a prefix sum over all threads in the block, rather than just a prefix sum over the per-block counts as before. A potential improvement would be to shrink the prefix sum from blockDim.x entries to WARP_SIZE by having every thread that maps to the same warp lane share one count (see the sketch below). It seems this variant would need atomicAdd again, since threads in different warps that share a lane can still race, but hopefully with less contention.
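To make that variant concrete, here is roughly what I mean; lane_count and lane_base are hypothetical names, and lane_base would come from a prefix sum over the WARP_SIZE per-lane counts of an earlier Count pass:

__shared__ uint lane_count[WARP_SIZE];   // one counter per warp lane, shared across all warps

uint lane = threadIdx.x % WARP_SIZE;
// lane_count is assumed to be zeroed, and lane_base[] precomputed,
// before this loop runs.
for (uint i = threadIdx.x; i < size; i += blockDim.x)
{
    AnyType v = in[i];
    if (pivot0 <= v && v < pivot1)
        // Only the blockDim.x / WARP_SIZE threads occupying the same lane
        // contend for this counter, instead of all threads in the block.
        out[lane_base[lane] + atomicAdd(&lane_count[lane], 1)] = v;
}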
Although I haven't tested it yet, I suspect the previous method, with its delicate tradeoffs, will achieve the best performance. Is there any better way? Is it possible to avoid atomicAdd by taking advantage of some guaranteed serial behavior between certain threads?