Simple histogram accumulation using __shfl

Thanks!

This may give you some ideas if the number of bins is small:

https://devblogs.nvidia.com/parallelforall/voting-and-shuffling-optimize-atomic-operations/
http://on-demand.gputechconf.com/gtc/2015/presentation/S5151-Elmar-Westphal.pdf
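
For what it's worth, here is a minimal sketch of one way to do this with only a few bins. It is my own illustration, not code taken from the links above, and it uses a warp vote (`__ballot_sync` plus `__popc`) rather than `__shfl`. Each warp votes once per bin and lane 0 issues a single `atomicAdd` for the whole warp. The names `NUM_BINS`, `warp_histogram` and `histogram_small_bins` are hypothetical. It assumes CUDA 9 or later (for the `_sync` intrinsics), a block size that is a multiple of 32, and input values that are already bin indices in `[0, NUM_BINS)`. Error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumption: the bin count is small (8 here), so looping over all bins per warp is cheap.
constexpr int NUM_BINS = 8;

// Each input element is assumed to already be a bin index in [0, NUM_BINS).
__global__ void warp_histogram(const int* bins, int n, unsigned int* hist)
{
    const int idx  = blockIdx.x * blockDim.x + threadIdx.x;
    const int lane = threadIdx.x & 31;

    // Threads past the end get -1: they match no bin but still take part in every vote.
    const int my_bin = (idx < n) ? bins[idx] : -1;

    for (int b = 0; b < NUM_BINS; ++b) {
        // Bitmask of the lanes in this warp whose element falls in bin b.
        const unsigned int votes = __ballot_sync(0xffffffffu, my_bin == b);

        // One atomic per warp per non-empty bin instead of one per thread.
        if (lane == 0 && votes != 0) {
            atomicAdd(&hist[b], static_cast<unsigned int>(__popc(votes)));
        }
    }
}

// Hypothetical host wrapper: copies the data over, runs the kernel, returns the counts.
std::vector<unsigned int> histogram_small_bins(const std::vector<int>& bins)
{
    std::vector<unsigned int> result(NUM_BINS, 0);
    const int n = static_cast<int>(bins.size());
    if (n == 0) return result;

    int* d_bins = nullptr;
    unsigned int* d_hist = nullptr;
    cudaMalloc(&d_bins, n * sizeof(int));
    cudaMalloc(&d_hist, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(d_bins, bins.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_hist, 0, NUM_BINS * sizeof(unsigned int));

    const int threads = 256;  // multiple of 32, so every warp is full
    const int blocks  = (n + threads - 1) / threads;
    warp_histogram<<<blocks, threads>>>(d_bins, n, d_hist);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(result.data(), d_hist, NUM_BINS * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_bins);
    cudaFree(d_hist);
    return result;
}
```

Calling `histogram_small_bins({0, 1, 1, 7, 7, 7})` should fill bins 0, 1 and 7 with 1, 2 and 3. The per-bin loop only pays off while the number of bins stays small, since every warp does `NUM_BINS` votes regardless of how many bins it actually touches.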