I posted moderngpu 2.0 this week:
It’s the best code I’ve ever written. I encourage all users to give it a try.
The new points of power are the load-balancing search transform and segreduce functions, transform_lbs and lbs_segreduce.
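For readers new to the idea, here is a minimal CPU sketch of what a load-balancing search computes. It is not moderngpu code: the function name `load_balancing_search` and its signature are made up for illustration, and the real library runs this on the GPU. Given the number of work-items in each segment, it returns, for every work-item, the index of the segment that owns it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference, not the moderngpu API.
// counts[s] is the number of work-items in segment s.
// Returns a vector mapping each work-item to its owning segment.
std::vector<int> load_balancing_search(const std::vector<int>& counts) {
  // Exclusive prefix scan: offsets[s] is the first work-item of segment s.
  std::vector<int> offsets(counts.size(), 0);
  int total = 0;
  for (std::size_t s = 0; s < counts.size(); ++s) {
    assert(counts[s] >= 0);
    offsets[s] = total;
    total += counts[s];
  }

  std::vector<int> segment_of(total);
  for (int i = 0; i < total; ++i) {
    // upper_bound finds the first offset greater than i; the owning
    // segment is the one just before it. Empty segments are skipped
    // because they share an offset with the next non-empty segment.
    auto it = std::upper_bound(offsets.begin(), offsets.end(), i);
    segment_of[i] = static_cast<int>(it - offsets.begin()) - 1;
  }
  return segment_of;
}
```

With counts of {2, 0, 3}, the five work-items map to segments 0, 0, 2, 2, 2. A work-item's rank within its segment is then its index minus that segment's offset, which is the kind of information a per-item transform or a segmented reduction needs.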
It also has an experimental dynamic work-creation mechanism, lbs_workcreate. This is a two-pass function that combines load-balancing search, stream compaction, and prefix scan to allow work-items to generate new segments of work. There is a work-efficient breadth-first search demo using this mechanism that is economically written and is magically load-balanced by the lbs_workcreate pattern.
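As a rough illustration of how compaction and scan fit together in a work-creation step, here is a sequential CPU sketch. The struct `CreatedWork` and the function `create_work` are hypothetical names, and this is an assumption about the general pattern, not the library's implementation: each existing work-item reports how many new work-items it wants, items that want none are dropped, and the remaining counts are scanned into segment offsets.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type, not part of moderngpu.
struct CreatedWork {
  std::vector<int> sources;  // work-items that requested at least one new item
  std::vector<int> offsets;  // exclusive scan of the surviving counts
  int total;                 // total number of new work-items
};

// new_counts[i] is how many new work-items work-item i wants to create.
CreatedWork create_work(const std::vector<int>& new_counts) {
  CreatedWork out{{}, {}, 0};
  for (std::size_t i = 0; i < new_counts.size(); ++i) {
    assert(new_counts[i] >= 0);
    if (new_counts[i] == 0) continue;        // stream compaction: drop empties
    out.sources.push_back(static_cast<int>(i));
    out.offsets.push_back(out.total);        // exclusive prefix scan
    out.total += new_counts[i];
  }
  return out;
}
```

For requests of {0, 3, 0, 2}, the surviving sources are work-items 1 and 3, with offsets 0 and 3 and five new work-items in total. Those offsets describe new segments that a second pass could process with a load-balancing search, for example expanding the outgoing edges of each frontier vertex in a breadth-first search.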
The source is only about 5500 lines, yet the library has more functional coverage than any other general-purpose CUDA library.