Filtering items and amortising at the block level when adding to multiple different lists

With Cuda9/CUB/Pascal+ I want to filter an array of values and depending on the filter predicate result add them to different chunked linked lists using atomicCAS.

ie:

if(Predicate(item) == 0) AddToList0();
if(Predicate(item) == 1) AddToList1();

I have a reasonable idea how to approach amortising the adding to the lists in a warp aggregated manner (https://devblogs.nvidia.com/voting-and-shuffling-optimize-atomic-operations/ was helpful). I noticed Volta onwards gets the handy match-all/any at the warp level so I assume solutions in this direction are probably best direction - although I must support Pascal so I can’t use that.

But for aggregating the adding of list chunks across whole blocks I am not completely sure the same sort of approach makes sense as I can’t find code examples of something doing similar?

I assume right now I should:

  1. Do warp level list aggregation into shared memory - involves counting number of different lists (keys of lists), summing to find the warp level list lengths, organising them in shared memory somehow.
  2. Do block level aggregation of the per warp lists similar to (1) somehow - will involve rounds of broadcasting each warps list key through shared memory to other warps in the block? Block level summing etc.
  3. Lead thread in lead warp reserves variable length list chunk item space with an atomicAdd from a rolling memory pool, needed threads in block do CUBStore to fill in list chunk, lead thread in lead warp does atomicCAS to amortised list add

Is this the right approach? Are there any examples I can look at of something doing the same using CUB to see the specific details of the process? It’s the block level aggregation part and shared memory organisation for the sublists and their stitching together etc that I am worried about mostly.

Actually it probably makes more sense in this case to forget the warp aggregation part and instead do a block level sort of {listIDKeyFromPredicate, insertValue} pairs? Then do a block level histogram to find the allocation counts? Then do a block level discontinuity to find lead threads in the block to do the atomicCAS linked lists inserts?